1 Introduction

In future digital societies, an increasing number of individuals are likely to merge their bodies with digital technologies. Today, examples of such hybrid bodies include exoskeletons, bionic prostheses, implanted sensors, and artificial organs. While some of these devices are designed to work autonomously, others, such as bionic prostheses, require user control. Bionic prostheses are mechanical devices that replace and sometimes enhance original body parts.

Fig. 1

The experience of wearing an arm prosthesis in augmented reality: our system removes the arm from the vision and replaces it with a virtual prosthesis model to study the perception in interaction scenarios. From left to right: (1) user wears a green glove and a mixed reality headset to interact with our system. (2) The user’s arm is virtually overdrawn to remove it from their vision (Diminished Reality). (3) A virtual prosthesis model replaces the user’s arm. (4) The user interacts with virtual objects

Merging humans with technology will affect the psychological processes underlying human social interaction, e.g., with respect to social perception and stereotyping. Studying such effects is important not only from a psychological but also from a technological point of view, as the findings of these investigations will help to develop technologies that adapt better to the expectations and needs of humans. Recent advancements in the area of bionics have already resulted in new high-tech bionic limbs that change how people with disabilities who wear such bionic prostheses are perceived and stereotyped [37].

The major difficulty for studies about bionic prostheses is the currently limited availability of participants and technology. This can be overcome by the use of virtual technologies. Using virtual reality to study bionic bodies not only eliminates the need for physical resources, but also allows working with innovative, non-existent designs. Furthermore, VR allows a large number of individuals to experiment with artificial body parts, thereby opening this important area of research and societal debate to many participants and enabling a broad societal discourse. These arguments motivate the presented work, which targets the development of an appropriate platform for experimental studies on the psychological effects of technological body augmentation. Importantly, a virtual prosthesis in VR and the corresponding sense of body ownership do not imply that users of this technology share the experience of actual persons with disabilities. However, our system allows researchers to study the interaction with a bionic prosthesis in a larger population.

To conduct relevant studies of the self- and other-perception of augmented humans, we require that users of our system perceive their environment—including their own bodies—as naturally as possible. This requirement already excludes the use of purely virtual environments where virtual characters are used as representatives of humans. Instead, we aim at an augmented reality where only certain body parts of the user are replaced by virtual parts. A major difficulty in realizing such an AR system is to remove moving body parts from the user’s vision in real time—a process which is referred to as diminished reality (DR).

In this work, we present a novel AR/DR system that visually replaces the user’s right arm with a virtual prosthesis and provides an intuitive control of the prosthesis in interactions. Our main contributions in realizing this system are as follows:

  • First, we provide a seamless visual integration of the prosthesis into the user’s view while they move freely in the laboratory. For this, the user is equipped with an HMD using a video see-through (VST) display and a colored glove to simplify the segmentation of the hand in the videos (see Fig. 1 (1)). In a first rendering pass, we fill the pixels detected in the segmentation step with background information obtained from a reconstructed 3D model of our laboratory (see Fig. 1 (2)). In the second pass, we render the prosthesis model and other virtual objects into the user’s view (see Fig. 1 (3) and (4)). Using depth information provided by the 3D reconstruction and a depth sensing system integrated in the HMD, occlusions can be correctly resolved.

  • Second, we provide a solution for re-targeting captured hand data to the prosthesis model. Note that the tracking device provides data to control a virtual hand but not the prosthesis, which has a different kinematic skeleton. Disparities between the measured hand pose and the visualized prosthesis pose prevent an intuitive user control of the prosthesis and need to be minimized. Therefore, our re-targeting method determines data describing a prosthesis pose that matches the provided hand pose as closely as possible. Owing to the particular design of the prosthesis, we treat the thumb and the other fingers differently in the fitting process.

To evaluate the suitability of our system, we conducted a first evaluation study on self-presence, immersion, and sense of body ownership experienced by users of the system. While the first two qualities are of general importance for studies of virtual technologies, the significance of the latter results from our particular application.

A major goal of prosthetic treatment is the perceptual integration of a prosthesis into an amputee’s body representation, which means that the users of a prosthesis develop the experience that the device is an actual part of their body. In the context of our AR/DR system, we consider embodiment as a prerequisite for obtaining significant effects for the planned studies.

Recent studies on body ownership are restricted to VR, because technical requirements for such studies are much lower in comparison to an AR setting (see Sect. 2). However, we expect that the handling of a virtual body part in a real environment as mediated by an AR system supports the emergence of body ownership. Therefore, an AR system appears to provide a more appropriate platform for studying body ownership phenomena.

Accordingly, we formulate the following hypothesis and research questions:

  1. H1:

    The AR/DR system conveys a sense of ownership for the virtual arm prosthesis.

  2. Q1:

    Does the AR/DR system induce a feeling of self-presence?

  3. Q2:

    Is self-presence related to a perceived sense of body ownership?

Since our results support the hypothesis and show that the system induces a feeling of self-presence, we conclude that the realized AR/DR system works as intended and is ready for use as an experimental platform for further psychological studies on the use of bionic prostheses. With regard to research question 2, the data revealed a strong positive correlation between ownership and sense of presence. Thus, the novelty of our contribution is twofold: in addition to the unique properties of the realized AR/DR system described above, we extend the study of body ownership to augmented reality.

The layout of this paper is as follows. In Sect. 2, achievements of related work are summarized and differences from our approach are pointed out. In Sect. 3, we provide the technical details of the realized AR/DR system. In particular, we report on the methods used to blend the prosthesis into the user’s vision and to control it intuitively. Section 4 contains a description of the conducted study, the results of which are summarized in Sect. 5. Section 6 contains our conclusions and an outlook on future work.

2 Related work

2.1 Embodiment in virtual reality

The term embodiment is controversially discussed and remains ambiguous [48]. Embodiment is occasionally associated with the sub-concepts of self-location, agency, and body ownership [28]; ownership/integrity, agency, and anatomical plausibility [9]; or sense of ownership and agency [63]. However, researchers agree that the sense of body ownership, i.e., the sense that a part of the body belongs to oneself and the perception of the body as the source of an experienced sensation, is the key feature of embodiment [9, 49, 51]. Research on body ownership dates back to the rubber hand illusion experiment of Botvinick and Cohen (1998), in which a sense of ownership over a rubber hand is induced by applying synchronous visual-tactile stimulation to a seen rubber hand and the unseen real hand [11]. To date, most investigations of body ownership employ the rubber hand illusion (RHI) paradigm [58] and apply adaptations of the RHI embodiment scale.

In recent years, research on body ownership has shifted toward the use of virtual environments and the effects of different aspects of a VR system on body ownership have been investigated. This includes the geometric representation of the virtual body part [5], the chosen perspective [46], the use of head-tracking [56], sensory-motor stimulation [29], and the integration of a brain computer interface [45].

It has been shown that in VR users can develop a sense of embodiment with respect to alterations of or deviations from their own body. For example, Piryankova et al. show that users experience embodiment even with an under- or overweight avatar [49]. Kilteni et al. let users experience having unnaturally long arms [29]. Some studies have investigated the effect of altering the hand in particular. Hoyet et al. let users experience having a hand with six fingers [25, 31], while Schwind et al. do the opposite and remove the pinky [53]. However, Argelaguet et al. show that body ownership is stronger for hand models whose degrees of freedom are similar to those of the human hand [5].

2.2 Body augmentation in augmented reality

Martin et al. present a digital reproduction of the famous rubber hand illusion where a horizontally oriented monitor replaces the table of the original experiment [34]. A Leap Motion controller tracks the user’s hand underneath the monitor, and the user can observe an object on the display in the position of the hidden hand. Although this installation shows typical AR features, one important aspect is missing for considering it a full AR application: Since the display shows the replacement object floating in an empty space, no visual integration of the object into the real scene occurs. Instead, the display appears as a window into a virtual space. For this installation, the authors did not study effects on body ownership.

Apart from the installation of Martin et al., contributions to body augmentation in AR are restricted to providing a view of the altered body from an outside perspective. For example, multiple applications for virtual dressing rooms have been built in recent years [1, 4, 26, 30, 32]. Similarly, Putri et al. demonstrate a system to virtually alter hair styles [50]. The real-time augmentation of human bodies with virtual information has been demonstrated by Hoang et al., who project muscles and bones onto a person’s body during movements to enhance the education of physiotherapists [17, 24].

Calmon et al. demonstrate an application that detects markers painted on human skin and replaces the markers with tattoos [13]. Here, image inpainting is used to remove the markers from the image. However, this application does not run in real time but is intended for the processing of a single image.

2.3 Diminished reality

If real objects are removed from the device-mediated view of a real scene, the observer experiences a diminished reality (DR). Mori et al. give an overview of different approaches to diminished reality [39]. In our context, diminished reality refers to the removal of the arm from the user’s vision.

The challenge of DR lies in restoring the occluded background from available information. Hardware-oriented approaches rely on the information provided by multiple cameras, assuming that there is always a camera that has an unobstructed view of the occluded objects [36, 40]. Besides the high hardware demands, the need for precise calibration and low-latency synchronization is a drawback of this method.

Algorithmic approaches have been extensively studied in recent years. In image inpainting, information on the boundary of a removed region is extrapolated into its interior. Early methods following this approach have high computational costs, but recent variants have been shown to run in real time [20, 21]. Stereoscopic vision can improve quality by using information that is only visible to one camera [22, 23, 62]. Furthermore, disparities can be used to estimate depth information in the inpainted region. That way, it is possible to combine depth estimation and inpainting into a simultaneous process [42]. Recently, this has been exploited to extract depth layers for 3D photography [54]. Additionally, multiple inpainting approaches based on generative neural networks have been proposed [15].

However, since the available information may not suffice to infer the occluded background, such algorithms perform best when only a small portion of the image in regions of similar texture is obstructed. In particular, these methods suffer from an insufficient reconstruction of important image features, such as straight edges.

If the background is static, pre-recorded imagery can be used to reconstruct the background. This idea was first applied to 2D images [14] and later extended to 3D by Mori et al. [41]. They reconstruct a virtual representation of the unobscured environment in an observation phase and use it to overdraw regions during the application phase. Our approach is similar in that we create a virtual reconstruction of the laboratory using photogrammetry. However, since we create the reconstruction offline, we are able to adjust the lighting conditions and to manually correct the geometry in regions of poor reconstruction quality.

In photogrammetry, a 3D model of a real object is created from a set of images showing the object from different sides and perspectives. Rong et al. provide a recent survey of different approaches [52]. The key point in photogrammetry is to establish features in different images that correspond to the same point in 3D space. These features make it possible to estimate the camera parameters for each image and then to compute the 3D coordinates of the corresponding points. A triangulation of the obtained point cloud yields the 3D reconstruction, which can then be post-processed (e.g., by reducing resolution or smoothing) [8].

Diminishing an image by overdrawing a foreground object with background information extracted from a 3D reconstruction has an additional advantage compared to inpainting approaches. Since the 3D model provides not only color values but also depth information, it becomes possible to correctly resolve occlusions when new content is rendered into a diminished image. We make use of this possibility when rendering the prosthesis and further virtual interaction objects into the images (see Sect. 3.4).

2.4 Segmentation

Detecting those pixels that belong to the object to be removed is a preparatory step before any DR method can be applied. However, contributions to DR—in particular those that are concerned with processing a single image—often do not pay much attention to this step but consider the segmentation to be given, e.g., as an oversized region of interest [39].

For the implementation of DR in an immersive application, segmentation becomes a crucial issue as segmenting a dynamic scene under real-time constraints is still an open research problem. If the object to be removed is tracked and of simple geometry, a possible approach is to render a virtual copy of the object and extract its silhouette [39]. In our case, this approach is not feasible because of the complex geometry of the hand.

Huge progress has been made in the area of semantic segmentation of video streams by employing convolutional neural networks. There are several suggestions for CNN-based segmentation of hands, such as MPSP-Net [19], Hand-CNN [43], and Refined U-Net [3, 57]. A comparison of the most recent methods presented in [57] shows that accuracies above 90% can be achieved on certain test data sets. However, a major issue that prevents the use of such methods in real-time applications dealing with different lighting conditions is their lack of robustness. Therefore, we simplified the segmentation problem by equipping users with a colored glove that can be extracted from the videos using chroma-keying (see Sect. 3.2).

3 The AR/DR system

3.1 System overview

The system consists of the following components:

  • A mixed reality head-mounted display (HMD) supporting video see-through (VST),

  • a depth sensing system that provides depth information for real objects in the view of the HMD,

  • a tracking system to capture the 3D pose of the arm, hand, and fingers,

  • a virtual model of a bionic prosthesis,

  • the application software.

Fig. 2

Overview of the steps of our pipeline

The software is organized as a pipeline of consecutive steps (see Fig. 2). For each monocular image captured by the VST cameras, we remove the user’s arm in the diminished reality step. We then estimate a pose for the prosthesis and its fingers by re-targeting the tracking data from an external hand tracking sensor to the prosthesis model and use it to render a photo-realistic pair of images in the rendering step. The result is blended with the camera input and displayed on the HMD.

3.2 Diminished reality

The removal of the user’s arm from an image proceeds in two steps. First, the pixels that belong to the user’s arm are detected. Second, the identified pixels are filled with background information.

Fig. 3

The result of segmentation using chroma-keying from the user’s perspective. From left to right: (1) image of a user wearing a green glove, (2) pixels identified by chroma-keying are highlighted in pink, (3) rendering of the prosthesis shows that it does not cover the whole area of segmented pixels

Fig. 4

Workflow of reconstruction using photogrammetry. From left to right: (1) Photographs are taken with the camera facing the opposite wall of the room, (2) Example images showing different perspectives onto one wall in our laboratory, (3) Illustration of overlap in the photographs, (4) Reconstruction of the corresponding area

As described in Sect. 2.3, purely software-based approaches for image segmentation do not yet fulfil the robustness requirements of our application. Therefore, we equip users with a colored glove (see Fig. 3 (1)), which allows us to use chroma-keying to find the pixels of the user’s arm. Chroma-keying is a well-established method to detect a certain color key within an image. It is commonly used to blend virtual and real-world scenery (e.g., using a green screen in movie productions) [60]. In software implementations, the color space of the image is typically converted into HSL. Then, a range of hue values can be filtered from the image. If a pixel lies within this hue range, it is included in the keyed region. On modern GPUs, chroma-keying can be implemented as a per-pixel operation on parallel pixel shaders, which makes it efficient to perform. For the system used in our study, we used an implementation that is included in the Varjo SDK [61].
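To make the keying step concrete, the following minimal sketch reimplements hue-based keying on the CPU with NumPy. It is purely illustrative: the hue and saturation thresholds are made-up values for a green glove and do not correspond to the parameters of the shader-based Varjo SDK implementation used in our system.

```python
import numpy as np

def chroma_key_mask(rgb, hue_min=90.0, hue_max=150.0, min_sat=0.3):
    """Boolean mask of pixels whose hue lies in [hue_min, hue_max] degrees.

    rgb: float array of shape (H, W, 3) with values in [0, 1].
    The thresholds are illustrative values for a green glove.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    maxc = rgb.max(axis=-1)
    minc = rgb.min(axis=-1)
    delta = maxc - minc
    hue = np.zeros_like(maxc)          # hue in degrees (standard RGB -> HSL/HSV hue)
    nz = delta > 1e-6
    rmax = nz & (maxc == r)
    gmax = nz & (maxc == g) & ~rmax
    bmax = nz & ~rmax & ~gmax
    hue[rmax] = (60.0 * (g - b)[rmax] / delta[rmax]) % 360.0
    hue[gmax] = 60.0 * (b - r)[gmax] / delta[gmax] + 120.0
    hue[bmax] = 60.0 * (r - g)[bmax] / delta[bmax] + 240.0
    sat = np.where(maxc > 1e-6, delta / np.maximum(maxc, 1e-6), 0.0)
    # Require a minimum saturation so that gray background pixels are not keyed.
    return (hue >= hue_min) & (hue <= hue_max) & (sat >= min_sat)
```

The same per-pixel test maps directly onto a fragment shader, which is how the keying is executed in practice.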

The result of the segmentation is a pixel mask within each image of the stereoscopic image pair provided by the HMD (see Fig. 3 (2)). In the following, each of these masks is filled with a rendering of the virtual prosthesis. In this way, virtual and real content is blended in the user’s vision. However, in most cases the prosthesis does not fill the pixel mask completely due to differences in the shapes of the arm and the prosthesis (e.g., fingers can differ in length, size, and orientation compared to the user’s hand). Therefore, we fill background information into those pixels of the mask that are not filled during the rendering of the prosthesis. This is an important step as it lays the groundwork for a seamless integration of the prosthesis into the user’s vision. With respect to the implementation, it is advantageous to first fill the segmented pixel mask completely with background information and then render the prosthesis on top of it (see Sect. 3.4), as this avoids having to detect those pixels that are not covered by the rendered prosthesis. The diminished reality step is completed when the background information is filled into the segmentation mask. How this background information is computed is explained in the following.

As our experiments take place in a static, controlled laboratory setting and involve only one dynamic real-world object (the user’s own body), we create the background information from a static 3D reconstruction of our laboratory. In contrast to algorithmic inpainting, which extrapolates pixel information from the boundary of a hole into its inside (see Sect. 2.3), the 3D reconstruction gives access to the background that is occluded by the user’s arm. The pixels of the mask are filled with background information by rendering the 3D model from a position and orientation that matches the user’s view of the laboratory mediated by the HMD. For this, the room model needs to be placed in the world coordinate system (WCS) \({\textbf{x}}\), \({\textbf{y}}\), \({\textbf{z}}\) of our application such that it aligns with the real world. We use the convention that the floor coincides with the \({\textbf{x}}{\textbf{z}}\)-plane and \({\textbf{y}}\) denotes the up-direction. To facilitate the fitting, we use markings taped on the floor of our laboratory (see Fig. 4), which form squares of equal size. The origin of the reconstruction model is placed on the \({\textbf{x}}{\textbf{z}}\)-plane. Then, the model is rotated around the \({\textbf{y}}\)-axis until the real lines and the rendered lines are parallel. Next, the model is scaled until the sizes of the real and rendered squares match. Finally, a translation is applied that moves the rendered markings on top of the real ones.
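The manual alignment amounts to composing a rotation about the up-axis, a uniform scaling, and a translation in the floor plane. The sketch below illustrates this composition with NumPy; the parameter values themselves are tuned interactively in our application and appear here simply as inputs.

```python
import numpy as np

def align_room_model(vertices, yaw_deg, scale, translation_xz):
    """Apply the manual fitting to the reconstructed room model.

    vertices:       (N, 3) array of model vertices, y being the up-direction
    yaw_deg:        rotation about the y-axis until the floor markings are parallel
    scale:          uniform scale until real and rendered squares match in size
    translation_xz: shift in the floor plane onto the real markings
    """
    a = np.radians(yaw_deg)
    rot_y = np.array([[ np.cos(a), 0.0, np.sin(a)],
                      [ 0.0,       1.0, 0.0      ],
                      [-np.sin(a), 0.0, np.cos(a)]])
    t = np.array([translation_xz[0], 0.0, translation_xz[1]])
    # Rotate about y, scale uniformly, then translate within the xz-plane.
    return (vertices @ rot_y.T) * scale + t
```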

Fig. 5

Example images showing the visual integration of the prosthesis into the user’s vision

To create the 3D reconstruction of the room, we followed a photogrammetric approach using the commercial software Agisoft Metashape [2]. The photographs were taken in multiple subsets with a camera mounted on a tripod. For each subset, the location of the tripod stayed fixed, but the camera angle was changed for each shot to cover the view into the room from different perspectives. The change in camera rotation was restricted so that two neighboring photographs overlapped by roughly 60% (see Fig. 4 (1-3)). For each position of the tripod, this procedure was carried out at two different heights (120 cm and 170 cm). To also cover the ceiling and floor, additional photographs were taken at the lowest and highest tripod heights, respectively. In total, we took 420 photographs of the laboratory this way. However, we discarded 42 of these for quality reasons. For example, reflective objects (such as those with a metallic surface) that are present in an image may make it difficult to establish correspondence points with other images.

The inherent inaccuracy of the depth estimation becomes particularly visible when planar surfaces (e.g., tabletops) are reconstructed. To avoid a poor visual appearance of such objects in the 3D model, we manually smoothed planar surfaces using the 3D modeling software Blender [10]. Surfaces that are flat and have a homogeneous texture present an even bigger problem for the reconstruction because of the difficulty of finding correspondence points between two images of such a surface. In the resulting model, those areas are typically not reconstructed, leaving holes in the model. This is the case for a diffuse, gray projection screen that is part of our laboratory equipment. This object was manually modeled and then textured using some of the source photographs.

For a seamless visual integration of the 3D room model into the user’s view, the colors in the textures of the 3D model must closely match the colors of the images produced by the cameras of the HMD. An obvious approach to achieve this is to use the cameras of the HMD also to create the images for the reconstruction. However, the HMDs are essential hardware components in our project, and we keep them up-to-date to provide users with the highest visual quality available. The unavoidable use of different cameras thus leads to the problem of minimizing color differences.

We chose a high-quality reflex camera (Nikon D810 DSLR) to create the source photographs for the reconstruction. To match the color temperature of the created images with that of the VST cameras, we proceeded as follows. First, we took a reference photograph with the Nikon DSLR camera and extracted the color temperature and tint using a gray card. We then adjusted the white balance of the reconstruction texture to the neutral white point [12]. In the application, we disable automatic white balance and set a constant white point for the video see-through camera system of the HMD that matches the values measured for the Nikon. In addition, we ensured homogeneous brightness in the source images by avoiding the camera’s automatic exposure mode. In this mode, the camera automatically brightens or darkens images if the focused area appears too dark or too bright. Overlapping regions may thus have different brightness levels in different images. To prevent this, we took all photographs using the same fixed exposure. Furthermore, to ensure constant lighting conditions, we covered all windows during the experiments and only used artificial light. The result of the visual integration of virtual and real content can be seen in Fig. 5.
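The principle behind the gray-card adjustment can be summarized in a few lines: the color channels are scaled so that the gray-card patch becomes neutral. The sketch below is a simplified stand-in for the actual workflow, which adjusts color temperature and tint during RAW development and fixes the white point of the VST cameras accordingly.

```python
import numpy as np

def gray_card_white_balance(image, gray_patch):
    """Scale the color channels so that the gray-card patch becomes neutral.

    image, gray_patch: float arrays in linear RGB with values in [0, 1];
    gray_patch contains the pixels of the gray card in the reference photograph.
    """
    patch_mean = gray_patch.reshape(-1, 3).mean(axis=0)   # average color of the gray card
    gains = patch_mean.mean() / patch_mean                # per-channel gains toward neutral gray
    return np.clip(image * gains, 0.0, 1.0)
```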

The described approach to realizing a diminished reality has consequences for our application that need to be considered. First, the use of chroma-keying requires the user to wear a colored glove while using the application. However, this requirement does not restrict the use of the system in any way. While adjusting the HMD to a particular user is a little tedious, our one-size-fits-all glove can be put on in no time. Wearing the glove was not perceived as an inconvenience, and no limitations of the interactions due to the glove could be observed.

However, using a static 3D model for background reconstruction may in some situations impede the seamless visual integration of the virtual content into the user’s view. When a user watches their arm while moving it in front of their own body, it is the body that represents the correct background, not the laboratory model. In our solution, the user will see the prosthesis in front of their body, but the prosthesis may appear surrounded by a few pixels with colors taken from images of the laboratory. To avoid such visual flaws, we designed all interactions in the user study so that the user acts with the arm stretched away from the body. In this way, the described visual artifacts do not occur during interactions.

3.3 Pose re-targeting

3.3.1 Prosthesis design

The design of the prosthesis model draws inspiration from real-world bionic arm prostheses. We aimed for a lightweight design with organic shapes, though the prosthesis should clearly be recognizable as a highly technical device in order to evoke a strong response from the participants. The prosthesis features a working screen on the back of the hand that can display additional information, such as the current date and time or a screen saver (see Fig. 1), imitating a smartphone-like device. We realized a set of textures, each of which gives the prosthesis the appearance of a different material, to find out the preferences of the users (see Fig. 8). However, during the experiment the prosthesis was rendered with the same wooden texture to keep the experimental conditions constant (see Sect. 4.2).

Fig. 6

Degrees of freedom (DoF) per joint in a human hand (right) and our prosthesis model (left). Note the different mechanics of the thumb in the two models. The hand bones model is taken from [38]

It is difficult to reproduce the complex anatomy of a human hand with mechanical components. Therefore, our design uses components similar to those found in a real prosthesis. The differences from the mechanics of a human hand are most notable in the thumb (see Fig. 6). We use the same model for all participants, i.e., the size of the model does not scale with the size of the user’s hand.

To position and orient the prosthesis model in the virtual space, we process sensor data taken from an external hand tracking device. Such systems have been developed to control a virtual human hand and thus provide positional and orientational data for the joints of the user’s hand. However, in contrast to the intended use of the tracking system, we use the provided data to control an artificial hand with a clearly different anatomy. This means that for each set of data describing a human hand pose, we have to determine data describing a prosthesis pose that matches the hand pose as closely as possible. This requirement of a close match between the two poses follows from the fact that the displayed prosthesis provides the visual feedback needed for an intuitive control of the prosthesis. In particular, for the successful execution of a grasping task it is mandatory that the visualized positions of the fingertips correspond to the measured fingertip positions provided by the sensors. To achieve this, we introduce a novel pose re-targeting strategy that aligns the fingers of the prosthesis model according to the provided hand data. For all fingers except the thumb, we use an inverse kinematics approach to move the fingertips close to the measured positions. This approach assumes similar kinematics for the fingers of the two models and thus cannot be applied to the thumb. For the thumb, we determine the degrees of freedom from particular angles derived from the targeted hand pose.

3.3.2 Finger poses

Fig. 7

The effect of inverse kinematics for finger pose re-targeting. Left column is without, right column with inverse kinematics applied. We visualize the prosthesis together with the gloved hand to show the improved positions of the fingertips. The thumb does not receive inverse kinematics

The prosthesis movement is animated using kinematic chains, which are represented as a hierarchy of joints and bones. Each joint controls one bone. However, a bone can have multiple child joints, e.g., in the hand where the wrist bone has five child joints, one for each finger. Each joint has a fixed offset relative to the parent joint that corresponds to the bone length. When animating, only the orientation at each joint relative to its parent is altered. The root node of the kinematic chain corresponds to the elbow joint. This node lies outside of the geometry of our prosthesis (see Fig. 8).
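The following sketch illustrates this data structure and the accumulation of transformations along the chain. It is a minimal stand-in for the Unity3D transform hierarchy used in our implementation, with rotations represented as 3x3 matrices.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Joint:
    """A node of the kinematic chain: a fixed offset (bone) relative to the
    parent and a local rotation, which is the only animated quantity."""
    offset: np.ndarray                                                # bone vector from the parent joint
    rotation: np.ndarray = field(default_factory=lambda: np.eye(3))   # local orientation
    children: list = field(default_factory=list)

def world_positions(joint, parent_pos=np.zeros(3), parent_rot=np.eye(3), out=None):
    """Accumulate the joint rotations along the chain to obtain world positions."""
    if out is None:
        out = []
    pos = parent_pos + parent_rot @ joint.offset   # offsets are rigid; only rotations animate
    rot = parent_rot @ joint.rotation
    out.append(pos)
    for child in joint.children:
        world_positions(child, pos, rot, out)
    return out
```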

As a starting point of the animation, the rest pose of the prosthesis (see Fig. 6, left) is associated with the world coordinate system (WCS) as follows. The elbow and wrist joints are located on the \({\textbf{z}}\)-axis of the WCS. The orientations of both joints align with the axes of the WCS. In the rest pose, all finger joints of our model lie on a plane. This plane coincides with the \({\textbf{x}}{\textbf{z}}\) plane of the WCS. The prosthesis is oriented palm down, i.e., the \({\textbf{y}}\)-axis corresponds to the normal on the back of the hand and the \({\textbf{x}}\)-axis points away from the thumb.

The kinematic chain for a single finger other than the thumb consists of three joints \({\mathbb {J}}_i\) positioned at \({\textbf{P}}_i, 0 \le i < 3\), plus the fingertip at \({\textbf{P}}_3\). Here, \({\textbf{P}}_0\) denotes the position of the root joint that connects the finger with the metacarpus (i.e., the part of the hand between the wrist joint and the fingers), and \({\textbf{P}}_3\) denotes the position of the fingertip.

For convenience, we introduce a local coordinate system \({\textbf{x}}_i\), \({\textbf{y}}_i\), \({\textbf{z}}_i\) at each joint \({\mathbb {J}}_i\). The \({\textbf{z}}\)-axis points toward the next child in the chain, i.e., \({\textbf{z}}_i = \frac{{{\textbf{P}}_{i + 1} - {\textbf{P}}_i}}{\left\Vert {\textbf{P}}_{i + 1} - {\textbf{P}}_i\right\Vert }\). \({\textbf{x}}_i\) is obtained by applying the concatenation of all joint orientations along the kinematic chain from the elbow to the joint \({\mathbb {J}}_i\) to the \({\textbf{x}}\)-axis of the WCS. Finally, \({\textbf{y}}_i\) is defined as the cross product \({\textbf{x}}_i \times {\textbf{z}}_i\) to obtain a right-handed coordinate system.
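In code, the construction of these local frames reads as follows (a sketch under the conventions just introduced, with the accumulated chain rotation given as a 3x3 matrix):

```python
import numpy as np

def joint_frame(p_i, p_next, chain_rotation):
    """Local frame (x_i, y_i, z_i) at a finger joint.

    p_i, p_next:    positions of the joint and of its child in the chain
    chain_rotation: concatenation of all joint orientations from the elbow
                    up to this joint, as a 3x3 rotation matrix
    """
    z_i = (p_next - p_i) / np.linalg.norm(p_next - p_i)   # toward the next joint in the chain
    x_i = chain_rotation @ np.array([1.0, 0.0, 0.0])      # rotated x-axis of the WCS
    y_i = np.cross(x_i, z_i)                              # y_i = x_i x z_i, as defined above
    return x_i, y_i, z_i
```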

For each of the four considered fingers, the joints \({\mathbb {J}}_1\) and \({\mathbb {J}}_2\) only have one degree of freedom that specifies the rotation around their \({\textbf{x}}\)-axis. We refer to this rotational parameter as the pitch of the joint. Only the root joint \({\mathbb {J}}_0\) has an additional degree of freedom, which we refer to as the yaw and which describes the spread of the finger. Since the yaw of \({\mathbb {J}}_1\) and \({\mathbb {J}}_2\) is zero, the local coordinate systems of all three joints share the same \({\textbf{x}}\)-axis.

The prosthesis is aligned with the user’s arm by first orienting the model according to the tracking data. More precisely, the position and orientation of the elbow joint are set as measured by the sensor. For all other joints, the measured orientation is directly applied to the corresponding joints. Due to anatomical differences between the user’s hand and the model, this results in a slight mismatch between the hand pose and the prosthesis pose. Since this discrepancy may have a negative effect on precise interactions (e.g., grasping an object) with the prosthesis, we developed a method to correct it. Since the positions of the fingertips are highly significant in many types of hand actions, our method changes the pose of the prosthesis such that the fingertips coincide with those of the tracked hand (see Fig. 7). This matching proceeds in two steps. Since all three joints share the same \({\textbf{x}}\)-axis, we adjust the pitch at each joint in one step. In the second step, we adjust the yaw at the root joint.

For the first step, consider a vector \({\textbf{d}}_{tip} = {\textbf{P}}_{tip} - {\textbf{P}}_0\) pointing from the position of the root joint of a finger to the position of the fingertip as measured by the hand tracking system. Together with the vector \({\textbf{y}}_0\), \({\textbf{d}}_{tip}\) spans a plane \({\mathcal {T}}\) with the normal \({\textbf{n}} = \frac{{\textbf{d}}_{tip}}{\left\Vert {\textbf{d}}_{tip}\right\Vert } \times {\textbf{y}}_{0}\). By projecting each joint position \({\textbf{P}}_i, 0 \le i \le 3\) into \({\mathcal {T}}\), we obtain a planar kinematic chain, which is used to adjust the pitch for each joint. This is done using an approach similar to a constrained FABRIK algorithm [6, 7]. The idea behind this approach is to adjust a kinematic chain repeatedly in forward and backward directions. Our application of the tracking data can be considered the initial forward pass, which is then corrected in a backward pass.

We first move the fingertip into the desired position \({\textbf{P}}_3 = {\textbf{P}}_{tip}\) without changing the positions of the other joints. This changes the bone length associated with the joint \({\mathbb {J}}_2\). Consequently, the position \({\textbf{P}}_2\) has to be changed as well. This is done according to the equation \({\textbf{P}}_i = {\textbf{P}}_{i + 1} - l_i {\textbf{z}}_i, i = 2\), where \(l_i\) corresponds to the length of the bone. This process then needs to be applied to the remaining joints in the chain. To summarize, the position of each joint is calculated as:

$$\begin{aligned} {\textbf{P}}_i = {\textbf{P}}_{i + 1} - l_{i} \left( {\textbf{z}}_i - \frac{{\textbf{z}}_i \cdot {\textbf{n}}}{\left\Vert {\textbf{n}}\right\Vert ^2} {\textbf{n}}\right) , 0 \le i < 3 \end{aligned}$$
(1)

In general, the root joint obtained this way will differ from the original root. Hence, in the FABRIK approach one would perform the same process multiple times, alternating between forward and backward passes until the chain converges. However, for the special case of the chain of finger joints, we found the accuracy achieved by a single backward pass to be sufficient for our application.

Finally, we compute the pitch at each joint from \(\theta _{i} = {{\,\textrm{acos}\,}}\left( {\textbf{z}}_{i} \cdot {\textbf{z}}_{i + 1} \right) \) and the yaw at the root joint using \(\phi _0 = {{\,\textrm{acos}\,}}\left( {\textbf{z}}_0 \cdot \frac{{\textbf{d}}_{tip}}{\left\Vert {\textbf{d}}_{tip}\right\Vert }\right) \). When computing \(\phi _0\) this way, it also contains the pitch. For an exact calculation, \({\textbf{d}}_{tip}\) needs to be projected into a plane perpendicular to \({\textbf{y}}_0\). In practice, we found the difference to be negligible.
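The complete correction for one finger can be summarized in the following sketch, which performs the single backward pass of Eq. (1) and extracts the pitch and yaw angles. It is an illustrative NumPy version, not the production code, and it omits degenerate cases (e.g., \({\textbf{d}}_{tip}\) parallel to \({\textbf{y}}_0\)).

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def retarget_finger(P, p_tip, y0):
    """One backward pass of the constrained FABRIK-style correction (Eq. 1).

    P:     (4, 3) array with the joint positions P0..P2 and the fingertip P3
           after directly applying the tracked orientations (the forward pass)
    p_tip: fingertip position measured by the hand tracking system
    y0:    y-axis of the local frame at the root joint
    """
    lengths = [np.linalg.norm(P[i + 1] - P[i]) for i in range(3)]   # bone lengths l_i
    d_tip = p_tip - P[0]
    n = np.cross(normalize(d_tip), y0)          # normal of the plane T spanned by d_tip and y0

    Q = P.copy()
    Q[3] = p_tip                                # move the fingertip to the measured position
    for i in reversed(range(3)):                # backward pass toward the root joint
        z_i = normalize(Q[i + 1] - Q[i])
        z_proj = z_i - (np.dot(z_i, n) / np.dot(n, n)) * n   # project the direction into T
        Q[i] = Q[i + 1] - lengths[i] * z_proj                # Eq. (1)

    # Pitch between consecutive bone directions and yaw at the root joint.
    z = [normalize(Q[i + 1] - Q[i]) for i in range(3)]
    pitch = [np.arccos(np.clip(np.dot(z[i], z[i + 1]), -1.0, 1.0)) for i in range(2)]
    yaw_0 = np.arccos(np.clip(np.dot(z[0], normalize(d_tip)), -1.0, 1.0))
    return Q, pitch, yaw_0
```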

3.3.3 Thumb pose

The thumb has more complex mechanics than the other fingers. It consists of two joints with two degrees of freedom, namely the carpometacarpal (CMC) and the metacarpophalangeal (MCP) joints, and one joint with one degree of freedom, the interphalangeal (IP) joint [27]. In the following, we will denote the positions of these joints in a kinematic chain as \({\textbf{P}}_{CMC}\), \({\textbf{P}}_{MCP}\), and \({\textbf{P}}_{IP}\) and the direction vectors of the associated bones as \({\textbf{d}}_{CMC}\) and \({\textbf{d}}_{MCP}\). Furthermore, \({\textbf{d}}_{IP}\) denotes the vector pointing from \({\textbf{P}}_{IP}\) to the tip of the thumb. Replicating the biomechanics of the thumb with a mechanical device is difficult, not only because of the additional degree of freedom, but also because of the wide motion range of the CMC. Therefore, in a real bionic prosthesis, the thumb is usually modeled in a simplified way as a chain of four 1-DoF joints, and our prosthesis model follows this design pattern (see Fig. 6 and Sect. 3.3.1). We denote the single rotational DoF of each joint as \(\theta _i, 0 \le i \le 3\).

The discrepancy between the two kinematic chains creates a problem for the control of the thumb based on data of the motion capturing. Since the motion capture system follows the design of a real thumb, it provides orientational data for three joints that cannot be directly associated with the DoFs of our model. Therefore, we follow an approach of associating the DoFs of our model with angles formed by the bones of the thumb with each other and with the metacarpus. For this, we consider the local coordinate system \({\textbf{x}}_w\), \({\textbf{y}}_w\), \({\textbf{z}}_w\) of the wrist with an orientation that is obtained by applying the concatenation of the orientations of the elbow and wrist joints to the WCS.

The first DoF \(\theta _0\) of our model describes a rotation of the thumb around the \({\textbf{z}}_w\)-axis, which points from the wrist toward the fingers. To compute \(\theta _0\), we project \({\textbf{d}}_{CMC}\) onto the \({\textbf{x}}_w{\textbf{y}}_w\) plane of the coordinate system and measure the signed angle to the direction \(-{\textbf{x}}_w\). Inverting \({\textbf{x}}_w\) is necessary because the prosthesis has been aligned with the WCS such that the \({\textbf{x}}\)-axis points away from the thumb. For the computation, we use the following formulas:

$$\begin{aligned} \begin{aligned}&{\textbf{d}}'_{CMC} = {\textbf{d}}_{CMC} - \frac{{\textbf{d}}_{CMC} \cdot {\textbf{z}}_{w}}{\left\Vert {\textbf{z}}_{w}\right\Vert ^2} {\textbf{z}}_{w} \\&\theta _{0} = {{\,\textrm{atan2}\,}}\left( \left( -{\textbf{x}}_{w} \times {\textbf{d}}'_{CMC}\right) \cdot {\textbf{z}}_{w}, -{\textbf{x}}_{w} \cdot {\textbf{d}}'_{CMC}\right) \end{aligned} \end{aligned}$$
(2)

The DoF \(\theta _1\) describes the spread of the thumb away from the hand. Thus, it can be computed as:

$$\begin{aligned} \begin{aligned} \theta _1 =&{{\,\textrm{acos}\,}}\left( {\textbf{z}}_w \cdot {\textbf{d}}_{CMC} \right) \end{aligned} \end{aligned}$$
(3)

The remaining DoFs relate to angles formed between the bones of the thumb and can be expressed as:

$$\begin{aligned} \begin{aligned} \theta _2 =&{{\,\textrm{acos}\,}}\left( {\textbf{d}}_{CMC} \cdot {\textbf{d}}_{MCP} \right) \\ \theta _3 =&{{\,\textrm{acos}\,}}\left( {\textbf{d}}_{MCP} \cdot {\textbf{d}}_{IP} \right) \end{aligned} \end{aligned}$$
(4)

Following this approach leads to a reconstruction of the thumb pose that we found to be accurate enough for our application. Thus, no inverse kinematics is applied to the thumb.
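For illustration, the mapping of Eqs. (2)-(4) onto the four joint angles can be written compactly as follows (a sketch assuming normalized bone direction vectors and the wrist axes \({\textbf{x}}_w\) and \({\textbf{z}}_w\) as defined above):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def thumb_dofs(d_cmc, d_mcp, d_ip, x_w, z_w):
    """Map the tracked thumb bone directions onto the four 1-DoF prosthesis joints.

    d_cmc, d_mcp, d_ip: normalized direction vectors of the thumb bones
    x_w, z_w:           axes of the local wrist coordinate system
    """
    # theta_0: rotation around z_w, measured as the signed angle to -x_w (Eq. 2)
    d_cmc_proj = normalize(d_cmc - np.dot(d_cmc, z_w) * z_w)   # projection into the x_w-y_w plane
    theta_0 = np.arctan2(np.dot(np.cross(-x_w, d_cmc_proj), z_w),
                         np.dot(-x_w, d_cmc_proj))
    # theta_1: spread of the thumb away from the hand (Eq. 3)
    theta_1 = np.arccos(np.clip(np.dot(z_w, d_cmc), -1.0, 1.0))
    # theta_2, theta_3: flexion between consecutive thumb bones (Eq. 4)
    theta_2 = np.arccos(np.clip(np.dot(d_cmc, d_mcp), -1.0, 1.0))
    theta_3 = np.arccos(np.clip(np.dot(d_mcp, d_ip), -1.0, 1.0))
    return theta_0, theta_1, theta_2, theta_3
```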

3.4 Rendering

For a high level of immersion, we aimed for a photo-realistic rendering of the prosthesis model. This has been achieved by implementing our application using the High Definition Render Pipeline of the Unity3D engine [59]. In this pipeline, a physically based light transport simulation is used to compute lighting [47], and physical parameters are used to specify particular light conditions. Since we already measured the color temperature and tint of the light sources in our laboratory (Sect. 3.2), we use the measured values to set the lighting conditions in the virtual scene and thereby replicate the real light conditions.

The system needs to handle occlusions of virtual objects by real-world objects. For example, we hide the socket of the prosthesis below the sleeve of a pullover or jacket (see third image in Fig. 1). To achieve this, we perform two (stereoscopic) rendering passes. For simplicity, we describe the whole process for monoscopic rendering.

In the first pass, which is part of the DR step, the virtual room model is rendered into those pixels detected in the segmentation step (see Sect. 3.2). Since at each time step the frame buffer is pre-filled with an image provided by a VST camera, the rendering partially overwrites the camera image with computed background information. Analogously, the depth buffer is pre-filled with depth values provided by the depth sensing system (see Sect. 3.1). During the rendering, an associated depth value is computed for each considered pixel that encodes the distance from the camera to the virtual model. Storing these depth values in the depth buffer overwrites measured values with computed ones. In this way, we prepare the frame buffer and the depth buffer for the occlusion handling in the second rendering pass.

The second pass renders the prosthesis model and other virtual objects. During this pass, we perform a depth test: For each rendered pixel, the distance to the camera is computed and compared to the depth value stored in the depth buffer. Only if the computed depth is smaller than the stored depth is the pixel drawn, overwriting the stored information for this pixel in the frame buffer. In this way, both types of occlusion can be correctly resolved. If a pixel of a virtual object is located between the camera and the real world, the computed depth is smaller than the sensed depth. In this case, the pixel is drawn and the virtual object occludes the real one. Conversely, if the computed depth is equal to or greater than the stored depth, the pixel is not drawn because the real-world object occludes the virtual one.
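The per-pixel logic of the second pass corresponds to a standard depth test. The following sketch expresses it as an image-space operation with NumPy; in the actual system, this test is performed by the GPU during rasterization.

```python
import numpy as np

def composite_second_pass(frame, depth, virtual_color, virtual_depth, covered):
    """Depth-tested compositing of the virtual content (illustrative sketch).

    frame, depth:                 color and depth buffers after the first (DR) pass
    virtual_color, virtual_depth: rendering of the prosthesis and interaction objects
    covered:                      boolean mask of pixels touched by the virtual rendering
    """
    # A virtual pixel is written only where it lies closer to the camera than
    # the stored depth, so real and virtual occlusions are resolved correctly.
    visible = covered & (virtual_depth < depth)
    out_color = np.where(visible[..., None], virtual_color, frame)
    out_depth = np.where(visible, virtual_depth, depth)
    return out_color, out_depth
```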

3.5 Hardware

Over the course of the project, we realized versions of our application that are based on different VR/AR hardware. The experimental study reported in Sect. 4 was conducted with our most advanced version that includes a Varjo XR-3 mixed reality headset [61], which provides several key features that are needed in our application:

  • An integrated Leap Motion controller, which allows us to track the user’s hand without the need for additional hardware, such as tracking markers or a data glove.

  • LiDAR-based depth sensing, which we use to hide the socket of the prosthesis model below a sleeve of a pullover or jacket of the user. Furthermore, it provides better results in many occlusion scenarios compared to vision-based depth estimation methods.

  • Real-time environment reflections, which stream the environment into a cube map, allowing for reflective materials.

  • Support for external positional tracking to acquire the world-space coordinates using two HTC Vive Lighthouse base stations.

The application runs at an average of 40 FPS with a resolution of \(1444\times 1236\) pixels per eye on a desktop computer with an Intel i9-9900K CPU, 64 GB of RAM, and an NVIDIA RTX 2080 Ti GPU.

4 Experimental study

Prior to data collection, we preregistered the study on OSFFootnote 1 and obtained approval from the university ethics committee. We report sample size considerations, data exclusions (if any), manipulations, and all measures below [55].

4.1 Sample

An a priori power calculation using G*Power 3.1.9.7 [16] showed that detecting a medium-sized effect with the one-sample tests implied by the hypothesis requires a sample of at least 27 individuals to achieve a statistical power of 0.80. We thus recruited 27 participants through a university mailing list of individuals who are willing to participate in empirical studies for payment or course credit. We advertised the study as a study on testing a new VR system and asked healthy individuals without mobility impairments and without photosensitivity issues to participate. Participants received 10 EUR or study participation credit for their participation. The latter is required in some subjects at the university, and four participants chose this option.
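The sample-size estimate can be reproduced, for example, with statsmodels instead of G*Power. The sketch below assumes a medium effect of d = 0.5 and a one-tailed test at \(\alpha \) = .05; these settings are our reading and are not reported explicitly above.

```python
# Reproduction of the a priori power calculation with statsmodels.
# Assumptions (our reading, not reported above): one-sample t test, d = 0.5,
# one-tailed alpha = .05, target power = .80.
from math import ceil
from statsmodels.stats.power import TTestPower

n = TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                             alternative='larger')
print(ceil(n))  # approximately 27
```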

Of the 27 participants, 21 identified as female and six as male. On average, participants were 27.37 years old, SD = 6.68. All participants were highly educated: 17 held a university degree such as a Bachelor’s degree (i.e., these were most likely master or PhD students), while the rest held a university entrance qualification diploma (i.e., these were most likely undergraduate students). All participants gave their written informed consent before participating in the study.

4.2 Procedure

Fig. 8

Our prosthesis comes with different materials. We asked the participants which material they prefer. The majority of the participants chose either black plastic or galvanized metal, indicating a preference for more simplistic designs

The study was conducted as a non-randomized quasi-experimental study in which all participants were assigned to the same condition, performed the same task within the virtual reality environment, and filled in the provided questionnaire afterward. The study took place at the VR laboratory of the second author. Upon arrival at the laboratory, participants put on the HMD with the help of the experimenter. The experiment was designed in two phases. The participants were not given any time constraints or goals. They could freely explore the environment in each phase and then proceed to the next phase. Each phase provided them with a different task:

  1. 1.

    The first phase was designed for the participants to get used to the prosthesis. They did not experience any alterations of the environment yet, except that their arm was replaced with the prosthesis, which they could freely control and investigate.

  2. 2.

    The second phase hosted a virtual table with a small number of interactable objects on it. The set of objects contained different rudimentary bricks that the participants could grab and stack; some marbles, which could be pushed around; and a button that could be used to switch a light attached to the table on or off. Users were free to interact with the objects without any specific goal.

During both phases, the prosthesis was rendered in the same way, using a texture of a bright wood material, and it was not possible for the participants to change the material. However, in preparation for further studies, we were interested in the users’ preferences with respect to the realized materials. Therefore, after completing both phases, participants were asked to change the appearance of their prosthesis using a selection of materials (see Fig. 8). We asked them to try out each material and to select their favorite one.

After the experiment, we asked the participants to complete the provided online questionnaire to collect demographic variables and to evaluate their experience of the system. Participants rated the perceived sense of body ownership (referred to as ownership in the following), self-presence, and immersion. They also evaluated the AR/DR system concerning ease of use and rated their overall user experience.

4.3 Measures

We assessed the study variables as follows:

Sense of Ownership We measured the degree to which participants felt that the virtual prosthesis was actually part of their body (i.e., the sense of body ownership conveyed by the virtual prosthesis) with the embodiment scale for the rubber hand illusion [51] which we adapted to the virtual prosthesis by replacing mentions of the “rubber hand” in the original items with “virtual prosthesis.” The scale consists of seven items (e.g., “During the simulation, it seemed like the virtual prosthesis belonged to me”) with the original answer scale ranging from -3 (fully disagree) to 3 (fully agree). The scale was reliable, \(\omega \) = 0.76.Footnote 2

Self-Presence We measured self-presence in the VR environment with the corresponding sub-scale of the Multimodal Presence Scale for virtual reality environments [33]. The scale consists of seven items (e.g., “I felt like my virtual embodiment was an extension of my real body within the virtual environment”) that were presented with a scale ranging from 1 (fully disagree) to 5 (fully agree). The scale was reliable, \(\omega \) = 0.68.

Immersion We measured immersion with the physical presence sub-scale of the Multimodal Presence Scale for virtual reality environments [33]. The scale consists of four items (e.g., “I had a sense of acting in the virtual environment, rather than operating something from outside.”) that were presented with a scale ranging from 1 (fully disagree) to 5 (fully agree). The scale was reliable, \(\omega \) = 0.77.

User Experience To determine the quality of the overall user experience, we measured it with eleven items from the Evaluation of Virtual Reality Games questionnaire [44] (e.g., “The VR glasses are 1 = very uncomfortable; 5 = very comfortable”), and six system-specific self-developed items (e.g., “The virtual prosthesis fitted my arm 1 = very poor; 5 = very excellent”). The scale was reliable, \(\omega \) = 0.84.

Affinity for Technology To assess whether sense of ownership or presence depend on user’s affinity for technology, we measured it with the Affinity for Technology Interaction (ATI) Scale [18]. It consists of nine items (e.g., “I enjoy spending time becoming acquainted with a new technical system”) that we presented with a response scale ranging from 1 = completely disagree to 6 = completely agree. The scale was reliable, \(\omega \) = 0.97.

5 Results

Means, standard deviations, and bivariate correlations are given in Table 1.

Table 1 Means, standard deviations, and bivariate correlations of measurement variables
Table 2 One-sample t tests investigating whether participants’ mean assessments of the system lie above the scale centerpoint

As visible from the table, ownership, self-presence, immersion, and user experience were all unrelated to age, gender, and technology affinity. To test whether study participants experienced ownership, self-presence, and immersion, we tested whether users’ average levels on these scales lie above the respective scale centerpoints (e.g., whether users experience ownership above 0 or self-presence above 3) using one-sample t tests (see Table 2).
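For transparency, the analysis for a single scale boils down to a one-sample t test against the scale centerpoint. The sketch below illustrates it with SciPy on placeholder values, not the study data.

```python
import numpy as np
from scipy import stats

# Hypothetical ownership ratings (scale from -3 to 3); the centerpoint is 0.
ownership = np.array([1.1, 0.4, 1.6, 0.9, 0.2, 1.3, 0.7])
t, p = stats.ttest_1samp(ownership, popmean=0.0, alternative='greater')
d = (ownership.mean() - 0.0) / ownership.std(ddof=1)   # Cohen's d for a one-sample test
print(f"t = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```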

The table shows that participants experienced a significant sense of body ownership with regard to the virtual prosthesis, albeit a small one (0.83 on a scale ranging from -3 to 3). The statistical significance of the difference between the scale centerpoint and the sample mean is also associated with a large statistical effect size (Cohen’s d > 1). The data thus supported hypothesis 1 (Sect. 1).

Regarding the experience of self-presence (research question 1), the results exhibit the same pattern: Participants’ average response was above the scale centerpoint, and the difference from the scale centerpoint was highly significant and associated with a large effect. The data thus showed that a feeling of self-presence is induced by the AR/DR system. Moreover, with regard to research question 2, the data revealed a strong positive correlation between ownership and sense of presence.

Participants also experienced a significant amount of immersion and reported a favorable user experience.

With regard to material preference, most participants chose a material with non-obtrusive texture, such as the galvanized metal or black plastic ones. We will take this information into account when preparing the stimulus material for further studies (see Sect. 6).

6 Conclusion and Outlook

We developed an augmented reality system that can replace the user’s arm with a virtual prosthesis in real time. We investigated users’ perceptions of ownership, self-presence, immersion, and user experience and correlated them with users’ affinity for technology. These perceptions were unrelated to users’ age, gender, and technological affinity. Furthermore, they were all favorable, in the sense that users’ average ratings were significantly above the scale centerpoint and all of these differences were associated with large effect sizes. Accordingly, the results support the hypothesis and show that the system works as intended.

Our system allows experiencing body augmentations from a first-person perspective. We carefully designed the interaction with the virtual prosthesis model so that it feels intuitive for the user. Both aspects help to achieve a strong reaction with regard to the sense of embodiment and body ownership in particular. Our system can help create applications for researching effects on perception and behavior in virtual environments, making it easier to achieve sufficient sample sizes in experiments involving persons with impairments. Moreover, it allows able-bodied persons to experience body augmentations from a novel perspective.

Our system currently only works for the right arm, but there are no technical reasons that prevent its application to the left (or both) sides. Some participants stated that they are left-handed; for them, our system was slightly harder to control. They described it as similar to trying to use scissors with the right hand. However, our sample size is too small to investigate these effects.

Participants frequently reported that the missing tactile feedback led to a slight dissociation in the experience when trying to interact with virtual objects. We think that our system would benefit from providing multi-sensory feedback in general, including tactile and acoustic feedback. However, the missing tactile feedback resembles the experience of wearing a real prosthesis, as these devices, at least today, cannot provide tactile feedback either.

We asked participants to select a favorite material for their prosthesis at the end of the experiment (see Sect. 5 and Fig. 8). In the future, we want to investigate the effect of materials on the self- and other-perception of users of bionic prostheses and compare results obtained with the AR/DR system to a conjoint analysis conducted with realistic photograph montages.

A major limitation of our system is the static room model used in the diminished reality step, which prevents us from investigating social interaction scenarios with other humans. Such scenarios would allow for more elaborate experiments that could also involve looking at another person with a bionic arm. This outside perspective is another important step for researching stereotypes and a topic of our future research.