Abstract
For virtual reality (VR) applications, real-time full-body pose estimation is becoming increasingly popular. Previous works have reconstructed full-body motion in real time from an HTC VIVE headset and five VIVE Tracker measurements by solving the inverse kinematics (IK) problem. However, an IK solver may yield unnatural poses and shaky motion. This paper introduces Deep Tracker Poser (DTP), a method for real-time full-body pose estimation in VR. This task is difficult because the mapping from sparse measurements to full-body pose is ambiguous. The data obtained from the VR sensors is calibrated, normalized and fed into a deep neural network (DNN). To learn from sufficient data, we synthesize a VR sensor dataset called AMASS-VR from AMASS, a collection of various motion capture datasets. Furthermore, loss of foot tracking is a common problem with the VIVE Tracker. To improve the accuracy and robustness of DTP to occlusion noise, we simulate the occlusion noise with Gaussian random noise, synthesize an occlusion dataset, AMASS-OCC, and fine-tune DTP on it. We evaluate DTP by comparing it with other popular methods in terms of accuracy and computational cost. The results indicate that DTP outperforms the other methods in terms of positional error (1.04 cm) and rotational error (4.22°). The quantitative and qualitative results show that DTP reconstructs accurate and natural full-body poses even under serious foot occlusion, which indicates the superiority of DTP in modelling the mapping from sparse joint data to full-body pose.
1 Introduction
Full-body pose estimation has become increasingly necessary for virtual reality (VR) applications to achieve higher immersion. With the development of motion capture (MoCap) technologies, this problem may seem to have been solved by existing MoCap systems. However, some studies have shown that representing full-body avatars in VR is still a great challenge (Caserman et al. 2020). Vision-based MoCap systems are the most popular and can be divided into marker-based and markerless systems. Currently, marker-based optical systems are the most effective for this purpose, and many studies have exploited such systems in the development of VR applications (Leoncini et al. 2017). However, commercial marker-based systems are expensive and require complex setup. Markerless systems reconstruct full-body motion using several red–green–blue (RGB) or RGB–depth (RGB-D) cameras (Xu et al. 2018; Habermann et al. 2019; Li et al. 2021). Compared with marker-based systems, markerless MoCap systems are more lightweight (Liu et al. 2018), but they have a significant drawback: they can track the user's full body robustly only when the user is standing in front of the camera and facing it (Greuter and Roberts 2014). In VR scenarios, however, users are typically allowed to face any direction, which makes markerless systems susceptible to self-occlusion and environmental occlusion. Therefore, markerless MoCap systems are not sufficiently accurate and robust for VR applications, where stable and accurate performance is essential. Our method is based on the HTC VIVE Tracker, which is essentially a marker-based approach and offers better accuracy and stability than markerless methods. Inertial measurement units (IMUs) could offer a compromise between cost and accuracy.
Unfortunately, they suffer from high latency when integrated into VR: Johnson et al. reported an end-to-end latency of approximately 300 ms when using the Perception Neuron motion capture system for real-time full-body motion reconstruction in VR (Johnson et al. 2016).
In recent years, with the development of VR devices, an increasing number of researchers have attempted to reconstruct full-body pose from the measurements of off-the-shelf VR devices (Jiang et al. 2016; Yang et al. 2021). In particular, MoCap systems based on the HTC VIVE headset and VIVE Tracker measurements have become popular. Caserman et al. have shown the feasibility of reconstructing full-body pose from sparse VIVE Tracker measurements by solving the inverse kinematics (IK) problem (Caserman et al. 2019b). Nevertheless, this is challenging because the IK problem is inherently underconstrained, and its ambiguity may result in unnatural poses and shaky motion.
The growing popularity of deep learning networks and the availability of large-scale MoCap data have inspired researchers to leverage human motion priors (Aristidou et al. 2018). Previous works have successfully estimated full-body pose from sparse IMU sensors based on deep learning methods (Huang et al. 2018). Yi et al. have shown that the positions of joints are easier to estimate than orientations (Yi et al. 2021). However, the measurements recorded by the VIVE Tracker device are different from those of IMU sensors. Although tracker measurements provide the accurate global positions, they suffer from occlusion problems.
In order to improve the accuracy and robustness to occlusion noise, we introduce Deep Tracker Poser (DTP): a deep learning method for full-body pose estimation in real time from the measurements recorded by an HTC VIVE headset and five HTC VIVE Trackers. DTP contains three parts: preprocessing, deep neural network and post processing. The data obtained from the VR sensor is calibrated, converted and normalized in the pre-processing stage to obtain the input of the deep neural network. The DNN first maps low-dimensional input data to a higher-dimensional space by the embedding layer, then utilizes Transformer encoder to learn human motion characteristics and finally obtains 6-dimensional (6D) representation of the full-body joint data through simple linear layer decoding. In the post-processing stage, the 6D representation is orthonormalized using Gram-Schmidt to obtain the rotation matrix, which is finally applied to the SMPL model (Loper et al. 2015) to obtain the estimated full-body pose.
To obtain sufficient data for generalization, we synthesize VR sensor dataset AMASS-VR from the AMASS dataset (Mahmood et al. 2019). The AMASS data is represented by the parameters of the Skinned Multi-Person Linear model with articulated hands (SMPL + H) (Loper et al. 2015; Romero et al. 2017). In order to improve the robustness to the feet occlusion, which is a common problem in VIVE tracker, we further synthesize an occluding dataset and fine-tune our DTP on that.
To evaluate the performance of DTP, we compare it with other methods, including Final-IK, PE-DLS (Zeng et al. 2022) and TransPose (Yi et al. 2021), in terms of accuracy and computational cost. The results indicate that DTP outperforms the other methods in terms of positional error (1.04 cm) and rotational error (4.22°). Although DTP has a higher computational cost, its cost of 2.49 ms is low enough for real-time performance in VR. Furthermore, the qualitative and quantitative evaluation shows that DTP estimates the full-body pose well even under serious foot occlusion. In conclusion, DTP generates more accurate and natural full-body poses and is more robust to occlusion noise. These findings show that DTP is effective in modeling the mapping from sparse tracker measurements to full-body pose and in solving the occlusion problem of VIVE Tracker measurements. DTP thus contributes to constructing a high-accuracy and occlusion-robust MoCap system for VR applications based on an HTC VIVE headset and only five trackers.
In conclusion, the main contributions of our work are as follows:
1. We propose DTP, a novel and effective real-time pose estimation method. The input of this method is the measurements of an HTC VIVE headset and five HTC VIVE Trackers, and the output is the full-body pose. We use Transformer encoder model as the core of deep neural network, so as to capture the motion prior knowledge and achieve accurate mapping from sparse sensor data to the full-body pose.
2. We propose a method of synthesizing VR sensor dataset from the publicly available dataset AMASS to achieve sufficient generalization.
3. We propose to simulate the feet occlusion by Gaussian random noise and synthesize the occlusion data set AMASS-OCC. To further improve the robustness of DTP to the feet occlusion noise, we fine-tune it on AMASS-OCC.
2 Related work
Although MoCap has a long history, representing full-body avatars in VR is still a considerable challenge. Caserman et al. have presented a survey of diverse full-body MoCap techniques (Caserman et al. 2020). In this section, we review three main types of MoCap systems and the development of deep-learning-based full-body pose estimation.
2.1 IK-based methods
With the rapid development of VR, recent studies have attempted to reconstruct full-body motion using off-the-shelf VR devices (Parger et al. 2018). Jiang et al. tracked the head and hands with the HTC VIVE headset and controllers and then estimated the upper-body pose by solving the IK problem (Jiang et al. 2016). Although they inferred the lower-body motion through animation blending, this approach is not always accurate. Caserman et al. reconstructed full-body motion from an HTC VIVE headset and trackers by solving the IK problem (Caserman et al. 2019b). They also analyzed the performance of the popular numerical damped least squares (DLS) IK method for full-body reconstruction in VR (Caserman et al. 2019a). However, the main objective of an IK solver is to reach the target, and because of the inherent ambiguity of the IK problem, this may result in unnatural poses even if the solution converges. More information on IK solvers can be found in the survey by Aristidou et al. (2018).
2.2 Pose estimation
In order to ensure the naturalness of the full-body pose, previous works have demonstrated the feasibility of estimating full-body pose from sparse data (Chai and Hodgins 2005; Slyper and Hodgins 2008; Liu et al. 2011; Kim et al. 2012; Tong et al. 2020) by using data-driven methods. Chai and Hodgins first performed full-body animation using two cameras and six markers (Chai and Hodgins 2005). They proposed modelling the latent space by means of principal component analysis (PCA) and performing a fast search of motion examples using a nearest neighbors search algorithm. Subsequently, Krüger et al. proposed a fast method for similarity searching (Krüger et al. 2010), and Tautges et al. generated full-body animations using only four accelerometers and built a lazy neighborhood graph online for faster searching (Tautges et al. 2011). However, these methods do not scale well with the continuous growth of MoCap databases, and they have poor real-time performance and high spatial complexity because of the online search process.
In contrast to data-driven methods, which search for the closest pose for full-body reconstruction, deep-learning-based methods directly map sparse signals to full-body pose. Holden et al. proposed a convolutional autoencoder to model motion manifolds (Holden et al. 2016). Huang et al. adopted a biRNN to directly map only six IMU measurements to full-body joint orientations (Huang et al. 2018). Following this work, Yi et al. proposed a multistage network architecture to reconstruct full-body pose from six IMU measurements, and they estimated the global translation (Yi et al. 2021).
Recently, many works have successfully used deep learning methods for human pose estimation (Butt et al. 2021; Kim et al. 2021; Madadi et al. 2021; Zheng et al. 2021). Butt et al. introduce effective methods for selecting training data and sensor positions in sparse inertial sensor-based human posture reconstruction. However, their algorithm uses a heuristic with a greedy strategy to obtain approximate solutions to the established optimization problem, which may result in solutions that are not globally optimal. Madadi et al. use a bidirectional recurrent autoencoder-based model to estimate 3D human pose from only six magnetic-inertial measurement units. Their approach incorporates a 3D angle representation that eliminates yaw angle dependency. We use HTC VIVE devices to measure the pose and do not suffer from this problem. Kim et al. replaced the embedding layer of the attention model with an RNN in order to address the issue of discontinuity. In contrast, our approach completely discards the embedding layer and chooses the 6D rotation representation over other common representations as the output for better continuity, as demonstrated by Zhou et al. (2019). Zheng et al. introduce the first deep unsupervised approach for human body reconstruction, with an attention model for estimating body joints from landmarks. However, they did not take the issue of occlusion into account.
In recent years, CNNs have achieved significant results, not only in image-related tasks (Mehta et al. 2018) but also in temporal tasks (Holden et al. 2016; Weytjens and De Weerdt 2020). However, the literature suggests that RNNs perform better in pose estimation tasks: Yang et al. have demonstrated that RNNs outperform CNNs in pose estimation (Yang et al. 2021). Temporal models excel at leveraging the temporal information of human motion and are well suited to processing action sequences. RNNs (Huang et al. 2018; Yang et al. 2021; Yi et al. 2021) have fewer parameters and simple structures and are easy to train, but their capability to handle long-range dependencies is limited. Transformers (Kim et al. 2021) offer stronger modeling capability and parallel processing, and they excel at capturing global relationships and long-range dependencies. Therefore, this paper adopts the Transformer model as the backbone network.
Yi et al. and Jiang et al. estimate full-body posture based on IMUs (Jiang et al. 2022b; Yi et al. 2022). However, IMUs cannot provide global position information and suffer from drift issues. In contrast, the HTC VIVE Tracker we utilized offers accurate and stable global position and rotation data. In the state-of-the-art work, Du et al. (Du et al. 2023) utilized an MLP-based diffusion model to generate realistic and smooth human motions based on sparse tracking signals. However, it should be noted that they have yet to address the occlusion issues specific to VR devices and their approach relies solely on the pose information of the head and hands. Additionally, (Jiang et al. 2022a; Winkler et al. 2022) similarly only utilize pose information from the head and hands, leading to insufficient accuracy in lower body movements. The approach (Kim et al. 2021) proposed by Kim et al. is most similar to our method. In comparison to their approach, we made adjustments to the network architecture by replacing the embedding layer with a simple two-layer fully connected layer. Additionally, the decoder now utilizes a two-layer fully connected layer instead of a Transformer decoder. This modification results in a more lightweight network with fewer parameters, making it easier to train. Furthermore, due to the lack of consideration for occlusion issues in previous methods (Kim et al. 2021; Yang et al. 2021; Zeng et al. 2022), the robustness of pose estimation under occlusion is significantly limited. Therefore, we introduced a synthetic occlusion dataset, AMASS-OCC, to optimize the deep learning model, further enhancing the robustness of pose estimation in occluded scenarios.
In this paper, we introduce a novel method based on Transformer encoder to estimate full-body pose from the measurements of six VR sensors. Foot occlusion between a VIVE Tracker and the base station is a common problem because of the user’s own body or environment. Yang et al. proposed a deep-neural-network-based method for predicting the lower-body pose based on only the tracking information of the upper-body joints; in this way, they were able to avoid failure in cases of occlusion (Yang et al. 2021). However, due to the lack of tracking information on the lower body, the additional ambiguity results in growing inaccuracy and unnaturalness. To improve the robustness against occlusion noise, we propose a novel occlusion data synthesis method to generate an occlusion dataset on which to fine-tune our model.
3 Methodology
Our objective is to estimate full-body pose using an HTC VIVE headset and five VIVE Trackers. The overall flowchart of our method DTP is illustrated in Fig. 1. We first introduce three key stages of DTP: preprocessing, DNN and postprocessing. Then the method of fine-tuning DTP is shown. Finally, we give the details of our implementation.
3.1 Model representation
The SMPL model (Loper et al. 2015) is used to represent the virtual avatar, without hand gestures, in this study. The pose is a configuration of 22 joints, comprising one root joint and 21 local joints. The articulated structure of the virtual avatar, illustrated in Fig. 2, has 69 degrees of freedom (DOFs): each of the 21 local joints has 3 DOFs, and the root joint has 6 DOFs. Rotations are represented as 3 × 3 rotation matrices in this paper.
3.2 Preprocessing
We used the position and rotation measurements of the HTC VIVE headset and five VIVE trackers as the total input to the system. For the sake of description, the HTC VIVE headset and the HTC VIVE Tracker are collectively referred to as VR sensors.
These sensors are strapped to the pelvis, head, left and right hands, and left and right ankles. First, we obtain the corresponding positions and orientations of the body parts from the six sensor measurements using the “T-pose” calibration (Zeng et al. 2022). The measurements of the sensors are denoted by \({p}_{leaf}^{g}\) and \({R}_{leaf}^{g}\), except for the measurements from the sensor strapped to the pelvis, which are denoted by \({p}_{root}\) and \({R}_{root}\). Then, we normalize the leaf positions and rotations by transforming these measurements from the world frame to the root reference frame. The normalized positions \({p}_{leaf}\) and rotations \({R}_{leaf}\) of the leaf joints are computed as:
\[{p}_{leaf}={R}_{root}^{-1}\left({p}_{leaf}^{g}-{p}_{root}\right),\quad {R}_{leaf}={R}_{root}^{-1}{R}_{leaf}^{g}\]
Finally, we concatenate these measurements to obtain the input \(x\left(t\right)=\left[{p}_{root},{R}_{root},\{{p}_{leaf}\},\{{R}_{leaf}\}\right]\in {\mathbb{R}}^{72}\), where \({p}_{root}\in {\mathbb{R}}^{3}\) is the position of the pelvis, \({R}_{root}\in {\mathbb{R}}^{9}\) is the flattened rotation matrix of the pelvis and \(leaf=1,...,5\). For brevity, we omit \(t\) for the time-varying variables unless it is needed.
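The normalization and concatenation described above can be sketched as follows. This is a minimal illustration, assuming the usual rigid change of frame (subtract the root position, then rotate by the inverse root rotation); the function name `build_input` and the array layouts are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def build_input(p_root, R_root, p_leaf_g, R_leaf_g):
    """Express leaf measurements in the root (pelvis) frame and build x.

    p_root:   (3,)      pelvis position in the world frame
    R_root:   (3, 3)    pelvis rotation in the world frame
    p_leaf_g: (5, 3)    leaf positions in the world frame
    R_leaf_g: (5, 3, 3) leaf rotations in the world frame
    """
    R_inv = R_root.T  # the inverse of a rotation matrix is its transpose
    # p_leaf = R_root^{-1} (p_leaf^g - p_root), applied to each row
    p_leaf = (p_leaf_g - p_root) @ R_inv.T
    # R_leaf = R_root^{-1} R_leaf^g, broadcast over the five leaves
    R_leaf = R_inv @ R_leaf_g
    # 3 + 9 + 5*3 + 5*9 = 72 values
    return np.concatenate(
        [p_root, R_root.ravel(), p_leaf.ravel(), R_leaf.ravel()]
    )
```

With six sensors, each contributing a 3D position and a flattened 3 × 3 rotation, the resulting vector has 6 × 12 = 72 entries, matching the input dimension stated above.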
3.3 DNN
The main objective of our task is to estimate full-body joint orientations from sparse tracker measurements. This task is challenging because the problem is inherently ambiguous: many different full-body joint orientations may be consistent with the same sparse tracker measurements. In this paper, we propose a novel method that leverages human motion prior knowledge for pose estimation. Yi et al. and Huang et al. have shown excellent results with RNNs in full-body pose estimation from six inertial sensors (Huang et al. 2018; Yi et al. 2021). However, Transformer models outperform recurrent neural network (RNN) models in natural language processing tasks: they use an attention mechanism to capture long-distance dependencies in a sequence, whereas RNN models capture only limited context information. Additionally, Transformer models can process input sequences in parallel, while RNN models require sequential processing, which makes Transformer models more computationally efficient. Therefore, DTP learns the motion model using the Transformer model.
We separate the full-body pose estimation network into three stages: embedding, encoder and decoder. Specifically, we first map the input \(x\in {\mathbb{R}}^{72}\) to a 256-dimensional embedding vector, which represents the input in a higher-dimensional space in which the encoder can model the relationships between sensor measurements more easily. Then we use a Transformer encoder to process the embedding vector. The encoder uses a self-attention mechanism, which lets each frame's encoding draw on the other frames in the sequence. Finally, we use a single linear layer to decode the 256-dimensional encoding vector into the full-body pose in the 6-dimensional rotation representation, \({R}_{full}^{\left(6\right)}\in {\mathbb{R}}^{126}\).
We feed the input \(x\left(t\right)\) into the deep neural network, whose output consists of the root-relative full-body joint rotations in 6D representation \({R}_{full}^{\left(6\right)}\left(t\right)=\left[{R}_{j}^{\left(6\right)}\left(t\right)\right]\in {\mathbb{R}}^{6(J-1)}\), where \(j=1,2,\cdots ,J-1\) and \(J\) is the number of joints in the human skeleton representation (\(J=22\) in this paper). The Transformer architecture allows the model to leverage temporal information. For real-time prediction, we use a sliding window that contains a few past and future frames. Following (Huang et al. 2018), the sliding window comprises a total of 26 frames: 20 past frames, the current frame, and 5 future frames. Adjacent sliding windows overlap by 25 frames and differ in only one frame, so the step size is 1 frame.
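The sliding-window scheme can be illustrated with a short index generator. This is a sketch only: the helper name `sliding_windows` is hypothetical, and skipping frames that lack a full context at the sequence boundaries is an assumption, since the boundary handling is not specified here.

```python
def sliding_windows(num_frames, past=20, future=5):
    """Yield (t, start, end) so that frames[start:end] is the window for frame t.

    Each window holds `past` earlier frames, the current frame t and
    `future` later frames, i.e. past + 1 + future = 26 frames by default.
    Frames without a full context are skipped (an assumption of this sketch).
    """
    for t in range(past, num_frames - future):
        yield t, t - past, t + future + 1
```

For a 30-frame sequence this yields windows for frames 20 to 24, each 26 frames long and shifted by one frame relative to its predecessor. Waiting for 5 future frames at 60 fps adds roughly 83 ms of look-ahead delay to the prediction.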
The encoder in our model is the encoder part of the Transformer (Vaswani et al. 2017). It consists of three encoder layers. In each layer, multi-head attention is followed by a residual connection that adds the attention output to the layer input, and the sum is normalized using layer normalization. Finally, a feed-forward block composed of two linear layers is applied, where the first layer transforms the vector into an intermediate representation and the second layer transforms that representation into the layer output.
The detailed structure of the multi-head attention is illustrated in Fig. 3. We first replicate the latent space vector three times and divide it into eight parts. Each part is then fed into a linear layer with 32 neurons before undergoing self-attention using scaled dot-product attention, as depicted in Fig. 3b. Finally, we concatenate the outputs of the 8 attention modules into a 256-dimensional vector and use a linear layer to output the results. As in (Vaswani et al. 2017), two linear layers with a ReLU activation function process the results of the attention layer, which allows the network to capture complex relationships in the latent space.
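A minimal NumPy sketch of multi-head scaled dot-product self-attention with 8 heads of 32 dimensions is given below for illustration. It follows the standard formulation of Vaswani et al. (project with full-width matrices, then split into heads) and omits biases, dropout, residual connections and layer normalization; the weight matrices `Wq`, `Wk`, `Wv` and `Wo` are hypothetical placeholders rather than trained parameters of DTP.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv, Wo, num_heads=8):
    """x: (T, d_model); Wq, Wk, Wv, Wo: (d_model, d_model)."""
    T, d_model = x.shape
    d_head = d_model // num_heads  # 256 / 8 = 32

    def split(W):
        # Project, then split the feature axis into heads: (heads, T, d_head)
        return (x @ W).reshape(T, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(Wq), split(Wk), split(Wv)
    # Scaled dot-product attention, computed independently per head
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, T)
    attn = softmax(scores, axis=-1)
    heads = attn @ v  # (heads, T, d_head)
    # Concatenate the heads back into d_model features and mix them
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ Wo
```

Because every frame attends to every other frame in the window through a few matrix products, the whole 26-frame window is processed in parallel rather than step by step as in an RNN.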
Finally, a single linear layer remaps the vector from the latent space to the output dimensions. The output is expressed in the 6-dimensional rotation representation as \({R}_{full}^{\left(6\right)}\left(t\right)=\left[{R}_{j}^{\left(6\right)}\left(t\right)\right]\in {\mathbb{R}}^{6(J-1)}\). Zhou et al. showed that the 6-dimensional representation outperforms other rotation representations in terms of continuity (Zhou et al. 2019).
3.4 Postprocessing
The continuous 6D representation of the relative rotation is then transformed into rotation matrices \({R_{full}} \in SO\left( 3 \right) \subset {{\mathbb{R}}^{3 \times 3}}\) via Gram-Schmidt orthogonalization. Then, we transform the rotation matrix \({R}_{full}\left(t\right)\) into the world frame. The final global full-body pose is computed as follows:
\[{R}_{full}^{g}\left(t\right)={R}_{root}\left(t\right){R}_{full}\left(t\right)\]
Combining the root position \({p}_{root}\), the root rotation matrix \({R}_{root}\) and the global full-body rotation matrix \({R}_{full}^{g}\), we obtain the output \(\text{y}=\left[{p}_{root},{R}_{root},{R}_{full}^{g}\right]\in {\mathbb{R}}^{3+9J}\). Thus far, we have obtained all of the parameters necessary to generate an SMPL model (Loper et al. 2015) using only the first 22 joints for visualization, excluding the motion of the hands. And finally, we use forward kinematics to recursively calculate the pose of each joint in the kinematic tree in global frame.
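The post-processing step can be sketched as follows. The Gram-Schmidt construction follows the 6D representation of Zhou et al. (2019); storing the first three values as the first column and the last three as the second column is an assumption of this sketch, as is mapping root-relative rotations to the world frame by left-multiplying with the root rotation. The function names are hypothetical.

```python
import numpy as np

def sixd_to_matrix(r6):
    """Convert 6D rotation representations (..., 6) to matrices (..., 3, 3)."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    # First basis vector: normalize a1
    b1 = a1 / np.linalg.norm(a1, axis=-1, keepdims=True)
    # Second basis vector: remove the b1 component from a2, then normalize
    a2 = a2 - np.sum(b1 * a2, axis=-1, keepdims=True) * b1
    b2 = a2 / np.linalg.norm(a2, axis=-1, keepdims=True)
    # Third basis vector: the cross product completes a right-handed frame
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)  # b1, b2, b3 as columns

def to_world(R_root, R_full):
    """Map root-relative rotations (J-1, 3, 3) to the world frame."""
    return R_root @ R_full
```

Because the third column is computed rather than predicted, any network output with two linearly independent 3D halves maps to a valid rotation matrix, which is what makes the representation continuous.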
3.5 Loss function
The loss function of the network is defined in terms of the L2 norm and consists of a local rotational loss \({{\mathbb{L}}_{rot}}\), an inverse kinematics loss \({{\mathbb{L}}_{ik}}\) and a foot velocity loss \({\mathbb{L}}_{ft}\):
\[\mathbb{L}={\lambda }_{rot}{\mathbb{L}}_{rot}+{\lambda }_{ik}{\mathbb{L}}_{ik}+{\lambda }_{ft}{\mathbb{L}}_{ft}\]
\({{\mathbb{L}}_{rot}}\) is the L2 norm between the predicted local joint rotation in 6D representation and the ground truth:
\({{\mathbb{L}}_{ik}}\) is L2 norm between the predicted positions of the end effector and the ground truth:
The foot sliding artifact significantly degrades the quality of the output motion. To address this issue, we introduce a foot velocity loss \({\mathbb{L}}_{ft}\) that guides the network to follow the ground truth trajectory of the foot joints:
where \({p}_{ft}\) is the predicted position of the foot and \({p}_{ft}^{gt}\) is the ground truth position of the foot.
We recommend assigning higher weights to rotation-related loss functions, moderate weights to end effector position loss functions, and appropriate weights to foot velocity loss functions. Weight assignment should be adjusted based on specific application scenarios and task requirements, optimizing model performance through experimentation and analysis. In this paper, we set the weights \({\lambda _{rot}}\)= 1, \({\lambda _{ik}}\)=0.5 and \({\lambda }_{ft}=0.01\).
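To make the weighting concrete, the following sketch combines the three terms with the weights given above. It assumes each term is a mean squared L2 distance and that foot velocity is the frame-to-frame position difference; these forms, the tensor shapes and the function names are assumptions for illustration, not the authors' exact definitions.

```python
import numpy as np

def mean_sq_l2(a, b):
    """Mean over all leading axes of the squared L2 distance on the last axis."""
    return float(np.mean(np.sum((a - b) ** 2, axis=-1)))

def total_loss(r6_pred, r6_gt, p_ee_pred, p_ee_gt, p_ft_pred, p_ft_gt,
               lam_rot=1.0, lam_ik=0.5, lam_ft=0.01):
    """r6_*: (T, J-1, 6); p_ee_*: (T, E, 3); p_ft_*: (T, 2, 3)."""
    loss_rot = mean_sq_l2(r6_pred, r6_gt)      # local rotations in 6D
    loss_ik = mean_sq_l2(p_ee_pred, p_ee_gt)   # end-effector positions
    # Foot velocity approximated by finite differences along time
    v_pred = np.diff(p_ft_pred, axis=0)
    v_gt = np.diff(p_ft_gt, axis=0)
    loss_ft = mean_sq_l2(v_pred, v_gt)
    return lam_rot * loss_rot + lam_ik * loss_ik + lam_ft * loss_ft
```

Note that a constant offset in the predicted foot position leaves the velocity term at zero; that term penalizes only sliding, while positional accuracy is handled by the end-effector term.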
3.6 Fine-tuning
Occlusion is a common issue in the field of computer vision, and the VIVE Tracker suffers from it as well. It occurs when the VIVE Trackers on the feet are partially or completely blocked from the infrared (IR) signal of the base stations by the user's body or by an object such as a chair, which makes it difficult to track them accurately. To address occlusion, researchers have proposed various methods, such as using multiple trackers, employing robust feature descriptors, and utilizing temporal information (He et al. 2020, 2022).
In our experiments, we observed that when a sensor is completely occluded, it outputs a fixed value with the position at the origin and an identity rotation, and when it is slightly occluded, it outputs a random position and rotation. To account for these occlusion scenarios, we propose using random Gaussian noise to simulate occluded data, where the mean of the positional noise is the origin and the mean of the rotational noise is the identity rotation.
To generate a synthetic occluded dataset, we first randomly select a certain percentage of frames as occlusion frames. In these occlusion frames, we randomly occlude one or both feet. The positions and rotations of the occluded data are represented by two random noise vectors \({v}_{occp}\) and \({v}_{occR}\), each modeled as Gaussian with mean \(\mu\) and a fixed diagonal covariance matrix \(\Sigma\). The vectors are sampled from the Gaussian distributions:
\[{v}_{occp}\sim \mathcal{N}\left({\mu }_{p},{\Sigma }_{p}\right),\quad {v}_{occR}\sim \mathcal{N}\left({\mu }_{R},{\Sigma }_{R}\right)\]
where \({\Sigma }_{p}=\mathrm{diag}(0.04)\) and \({\Sigma }_{R}=\mathrm{diag}(0.01)\). The rotation vector can be converted into a rotation matrix using Rodrigues’ formula (Murray et al. 2017). For the position vector under occlusion, we restrict it to a spherical range with a radius of \(L\), which is the leg length:
The constraint serves a dual purpose. On one hand, it prevents the position data from becoming excessively large, which could adversely impact the training of the neural network. On the other hand, in practical scenarios, the maximum displacement of the foot relative to the root joint typically does not exceed the combined lengths of the thigh and shin. This constraint ensures that the generated synthetic data maintains a realistic and physiologically plausible representation of human motion.
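The occlusion synthesis for a single foot can be sketched as below, using variances of 0.04 for position and 0.01 for the rotation vector. Zero means (origin and identity rotation), projecting out-of-range samples back onto the sphere of radius `leg_length`, and the function names are assumptions of this sketch; the exact clamping rule is not reproduced here.

```python
import numpy as np

def rodrigues(rotvec):
    """Rotation vector (axis * angle) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-8:
        return np.eye(3)  # zero rotation vector maps to the identity
    kx, ky, kz = rotvec / theta
    K = np.array([[0.0, -kz, ky],
                  [kz, 0.0, -kx],
                  [-ky, kx, 0.0]])  # skew-symmetric cross-product matrix
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def sample_occluded_foot(rng, leg_length, var_p=0.04, var_R=0.01):
    """Draw a noisy (position, rotation) pair that imitates an occluded tracker."""
    # Positional noise around the origin with covariance diag(var_p)
    v_p = rng.normal(0.0, np.sqrt(var_p), size=3)
    norm = np.linalg.norm(v_p)
    if norm > leg_length:
        v_p = v_p * (leg_length / norm)  # keep the sample inside the sphere
    # Rotational noise around the identity with covariance diag(var_R)
    v_R = rng.normal(0.0, np.sqrt(var_R), size=3)
    return v_p, rodrigues(v_R)
```

A typical use would draw `rng = np.random.default_rng(seed)` once and overwrite the foot measurements of each selected occlusion frame with a fresh sample.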
3.7 Implementation details
We synthesize training data from the AMASS (Mahmood et al. 2019). The raw motion sequences are used to determine the parameters of an SMPL + H model, and we generate a tracker dataset by placing virtual trackers on the corresponding body parts in this model and only use the first 22 joints without considering hand gesture.
Because the motion databases within AMASS use different frame rates, we resample the original motion sequences to a consistent frame rate. For a real-time application, and especially for VR, the delay between frames should not exceed 20 ms (Raaen 2015). Our system uses the Unity3D engine and the SteamVR platform. To ensure that users do not experience dizziness or discomfort, the system maintains a consistent frame rate of 60 fps, which corresponds to a frame delay of approximately 16.67 ms. We assume that the system load, including rendering and simulation tasks, can meet this frame rate requirement. Therefore, we resample the dataset to 60 fps. For motion sequences that are not recorded at 60 fps, we use interpolation to obtain the values at the new sample times. In the interpolation step, we apply linear interpolation to the root joint position and spherical linear interpolation to the joint rotations after converting the joint angles to quaternions. Subsequently, we convert the interpolated quaternions back to joint angles.
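The resampling step can be sketched with two small helpers: one that finds, for each 60 fps output frame, the neighbouring source frames and the blend factor, and one that performs spherical linear interpolation between unit quaternions. The helper names, the (w, x, y, z) quaternion layout and the near-parallel fallback threshold are assumptions of this sketch.

```python
import numpy as np

def resample_indices(num_frames, src_fps, dst_fps=60.0):
    """For each output frame return (i0, i1, u): source neighbours and blend factor."""
    n_out = int(np.floor((num_frames - 1) * dst_fps / src_fps + 1e-9)) + 1
    t = np.arange(n_out) * (src_fps / dst_fps)  # output times in source frames
    i0 = np.minimum(np.floor(t).astype(int), num_frames - 1)
    i1 = np.minimum(i0 + 1, num_frames - 1)
    return i0, i1, t - i0

def lerp(p0, p1, u):
    """Linear interpolation, used for the root position."""
    return (1.0 - u) * p0 + u * p1

def slerp(q0, q1, u):
    """Spherical linear interpolation between two quaternions."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:  # nearly identical: normalized lerp avoids dividing by ~0
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)
```

For example, a 30 fps sequence is upsampled by inserting one interpolated frame halfway between each pair of source frames, while a 120 fps sequence is reduced to every second frame.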
During training, we feed the network with a fixed window size of 300 frames. We discard motion sequences that are shorter than 300 frames because they mostly represent static poses or excessively fast motion. We use the official dataset splits provided by AMASS (Mahmood et al. 2019) for training, validation and testing. Specifically, the training set includes the data from ‘CMU’, ‘MPI_Limits’, ‘Eyes_Japan_Dataset’, ‘KIT’, ‘BML’, ‘EKUT’ and ‘TCD_handMocap’; the validation set comprises data from ‘HumanEva’, ‘MPI_HDM05’, ‘SFU’, and ‘MPI_mosh’; and the testing set consists of data from ‘TotalCapture’ (Trumble et al. 2017). As a result, approximately 80% of the data is allocated for training, 10% for validation and 10% for testing.
We implement our network in PyTorch 1.9.1 with CUDA 11.1. The computer used in this study has an Intel(R) Core(TM) i5-8400 CPU and an NVIDIA GeForce RTX 3090 GPU. For live demo testing, we use an HTC VIVE headset and five VIVE Trackers, and the software used to obtain the tracker data is based on Unity3D. We use the Adam optimizer provided by PyTorch with a learning rate of \(lr={10}^{-4}\), which decays by a factor of 0.5 every \(1 \times {10^4}\) iterations. We train the network with early stopping, which means that we stop training once the validation loss has not decreased for a certain number of epochs (20 epochs in this paper). We first train DTP on the raw AMASS dataset, which takes about ten hours. Then we fine-tune it on the occlusion dataset, which takes another six hours.
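The training schedule can be illustrated in framework-independent Python. The sketch below reproduces a step decay (halving every 10,000 iterations from a base rate of 1e-4) and a patience-based stopping rule with 20 epochs; the names `learning_rate` and `EarlyStopping` are hypothetical. In PyTorch, a step decay of this kind is typically configured with `torch.optim.lr_scheduler.StepLR`, although the authors' exact setup is not shown here.

```python
def learning_rate(iteration, base_lr=1e-4, gamma=0.5, step=10_000):
    """Step decay: multiply the rate by `gamma` once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `EarlyStopping.step` would be called once per epoch with the validation loss, and the loop would break on the first `True`.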
4 Ablation study
In this section, we report the quantitative and qualitative evaluation of our method through testing on the dataset and the simulated occluded dataset.
We define the reconstruction accuracy in terms of the following metrics: positional error, defined as the mean and standard deviation of the Euclidean distance error in centimeters between all estimated joints and the ground truth, with the position and rotation of the root joint aligned; rotational error, defined as the mean global rotational error in degrees between all estimated joints, including the root and end effectors, and the ground truth; mesh error, defined as the mean Euclidean distance between the predicted and true vertex positions of the SMPL model; and jitter error, defined as the mean jerk of all joints. Jerk is the third derivative of the joint position with respect to time and reflects the smoothness and naturalness of the motion (Flash and Hogan 1985). In addition, in online testing, the computational cost is defined as the average computation time per frame of the pose estimation, excluding other costs such as rendering. In offline testing, the computational cost is defined as the average time for one forward propagation of the neural network.
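The positional, rotational and jitter metrics can be sketched in NumPy as follows. The sketch assumes joint positions in meters sampled at 60 fps, measures the rotational error as the geodesic angle between rotation matrices, and approximates jerk by third-order finite differences; the function names are hypothetical and root alignment is assumed to have been applied beforehand.

```python
import numpy as np

def positional_error_cm(p_pred, p_gt):
    """Mean and standard deviation of per-joint Euclidean error in cm.

    p_pred, p_gt: (T, J, 3) joint positions in meters.
    """
    d = np.linalg.norm(p_pred - p_gt, axis=-1) * 100.0
    return float(d.mean()), float(d.std())

def rotational_error_deg(R_pred, R_gt):
    """Mean geodesic angle in degrees between rotation matrices (..., 3, 3)."""
    R_rel = np.swapaxes(R_pred, -1, -2) @ R_gt
    cos = (np.trace(R_rel, axis1=-2, axis2=-1) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def jitter(p, fps=60.0):
    """Mean jerk magnitude: third finite difference of position times fps^3."""
    jerk = np.diff(p, n=3, axis=0) * fps ** 3
    return float(np.linalg.norm(jerk, axis=-1).mean())
```

A motion with constant acceleration has zero jerk under this definition, so the jitter metric responds only to changes in acceleration, such as frame-to-frame shaking.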
All deep learning models discussed in Sect. 4.1 and 4.2 were trained on the same AMASS training dataset, while Sect. 4.3 primarily focuses on the performance of models fine-tuned on the occluded dataset, AMASS-OCC.
4.1 Comparisons of DTP with DTP-RNN
RNNs and Transformer models have been applied successfully to many pose estimation tasks, but the RNN model (Huang et al. 2018; Tong et al. 2020; Yang et al. 2021; Yi et al. 2021) is more commonly used for pose estimation based on sparse joint data. To determine whether the Transformer model contributes to the accuracy of pose estimation in DTP, we built the deep learning model DTP-RNN (Fig. 4) by replacing the Transformer encoder with a two-layer bidirectional RNN based on LSTM. We then compare DTP with DTP-RNN in terms of the metrics and the average computational time in online and offline testing on the dataset.
The online and offline comparisons on the dataset are reported in terms of the mean and standard deviation of the positional error, rotational error, mesh error and jitter error, together with the average computational cost. In offline testing, we feed the whole motion sequence into the network at once. In online testing, a short temporal window of 26 frames is fed into the network in a sliding manner.
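The online setting can be illustrated with a small sliding-window buffer. This is a sketch only: the name `sliding_windows` is hypothetical, and we assume that no output is produced until the first 26 frames have arrived, since the text does not state how the start of a sequence is handled.

```python
from collections import deque


def sliding_windows(frames, window=26):
    """Yield the most recent `window` frames each time a new frame arrives."""
    buffer = deque(maxlen=window)  # the oldest frame is dropped automatically
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == window:
            # In an online system, this window would be passed to the network.
            yield list(buffer)
```

Each new frame shifts the window by one, so the network input has a fixed length regardless of how long the session lasts.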
As shown in Table 1, DTP always has lower errors than DTP-RNN in both the offline and online settings. Furthermore, the errors of both DTP and DTP-RNN increase from the offline setting to the online setting. The computational cost of DTP (3.02 ms) is far lower than that of DTP-RNN (12.74 ms) in offline testing. However, the online computational cost of DTP (2.36 ms) is slightly higher than that of DTP-RNN (1.04 ms). In addition, compared to the offline testing, both models showed a decrease in computational cost in online testing. The computational cost of DTP decreased by approximately 21.85%, while that of DTP-RNN decreased by over 10 ms, approximately 91.83%. In summary, the average computational time of both DTP and DTP-RNN decreased from offline testing to online testing, but the decrease is greater for DTP-RNN than for DTP.
4.2 Comparisons of DTP with other methods
To further evaluate the performance of DTP, we compare it with existing methods in terms of accuracy and average computational cost by testing on the dataset in the online setting. We select Final-IK, PE-DLS and TransPose as the methods to compare against. Final-IK is the most representative commercial algorithm and is widely used to reconstruct full-body pose in VR applications. PE-DLS (Zeng et al. 2022), from our previous work, is a hybrid analytical and numerical method that reconstructs full-body motion from six VR measurements. TransPose (Yi et al. 2021) is considered to be the state-of-the-art algorithm for pose estimation from six IMUs.
The official versions of Final-IK and PE-DLS are used for testing, and we use the default parameter values of Final-IK. For consistency, we revised the original TransPose to accept VR sensor measurements and refer to the revised version as TransPose-T. It should be noted that Final-IK and PE-DLS run in Unity3D and are implemented in C#, whereas TransPose-T and DTP are implemented in Python. We first collect the estimated results from each method with the same inputs and outputs. Then we visualize them on the same SMPL model and obtain the results.
DTP always produces the minimum mean and standard deviation of the positional error, rotational error, mesh error and jitter error, while those of Final-IK are always the maximum. In addition, we find that the rotational error of TransPose-T (7.36 °) is smaller than that of PE-DLS (13.49 °), but its positional error (2.71 cm) and mesh error (3.82 cm) are larger than the positional error (1.85 cm) and mesh error (1.76 cm) of PE-DLS. Furthermore, the two learning-based methods (TransPose-T and DTP) have smaller rotational and jitter errors than the two IK-based methods (Final-IK and PE-DLS). The mean positional error of DTP (1.04 cm) is 61.62% less than that of TransPose-T (2.71 cm), the mean rotational error of DTP (4.22 °) is 42.66% less than that of TransPose-T (7.36 °) and the mean mesh error of DTP (1.54 cm) is 59.69% less than that of TransPose-T (3.82 cm). In short, the full-body pose estimated by DTP is more accurate than that of the other methods, and more than 40% better than that of TransPose-T in terms of mean rotational error and mean mesh error. However, the mean jitter error of DTP (\(2.31 \times 10^{2}\) m/s\(^3\)) is only 20.62% less than that of TransPose-T (\(2.91 \times 10^{2}\) m/s\(^3\)). Although Final-IK is the fastest, it always has the largest error in all the accuracy metrics.
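The relative improvements quoted above follow directly from the reported means. The short check below reproduces them; the helper name `relative_reduction` is ours, and the input values are the rounded means quoted in the text.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# (TransPose-T mean, DTP mean) as quoted in the text.
reported = {
    "positional error (cm)": (2.71, 1.04),
    "rotational error (deg)": (7.36, 4.22),
    "mesh error (cm)": (3.82, 1.54),
    "jitter error (m/s^3)": (2.91e2, 2.31e2),
}

for name, (transpose_t, dtp) in reported.items():
    print(f"{name}: {relative_reduction(transpose_t, dtp):.2f}% lower")
```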
Additionally, we find that PE-DLS exhibits the largest standard deviation in computational cost among the compared methods. In terms of mean cost, DTP is the highest at 2.49 ms, compared with Final-IK (0.01 ms), PE-DLS (1.30 ms) and TransPose-T (1.01 ms). Nevertheless, the computational cost of DTP is low enough for real-time VR applications. Final-IK has the lowest computational cost because it relies on simple vector operations, whereas the large standard deviation of PE-DLS is due to its iterative nature. DTP incurs the highest computational cost because it has more model parameters than TransPose-T. However, being based on the Transformer model, DTP benefits from high parallelism, which reduces its cost through parallel computation. Additionally, the computational cost of DTP is not significantly affected by sequence length, indicating strong scalability. Despite its higher computational cost, the parallelism and scalability of DTP keep its cost below the required threshold, which makes it suitable for real-time VR systems.
Furthermore, we observed that our reconstruction errors are slightly higher than those reported in the original study. These discrepancies are within one order of magnitude and are likely attributable to differences in the datasets. Notably, our computation time is generally shorter than that reported in the original work, which is likely due to differences in computing hardware. As shown in Table 2, the accuracy of DTP is always better than that of the other methods, and the qualitative experiments support the same conclusion. We attribute this to the accurate capture of time-domain features by the Transformer model, whose multi-head attention mechanism with 8 heads can learn rich feature representations in their respective subspaces.
Comparing DTP and TransPose-T, we find that although TransPose-T can also generate natural poses, its accuracy is lower than that of DTP, which easily leads to problems such as penetration (Fig. 5). The main reason is that TransPose-T uses an RNN model to learn human motion from large-scale databases. Because of the inherently sequential nature of the RNN, information from distant frames is easily discarded. Since motion sequences are strongly continuous, the RNN model can still capture the close dependence between adjacent frames. However, due to the sparsity of joint data in our task, it is difficult to estimate the full-body pose accurately by relying only on this short-range dependence. Compared with TransPose-T, our DTP model can capture the dependency between frames that are far apart, which compensates for the lack of joint information. Furthermore, the total number of parameters of the DTP model is far greater than that of TransPose-T, so it has stronger modeling ability. These two points are the main reasons why DTP is superior to TransPose-T.
Comparing the two methods based on deep learning with the two methods based on IK, we find that the rotational error and jitter error of the deep learning methods are smaller than those of the IK methods. This is because the loss function of the two deep learning methods is based on the rotation of the joints, and both TransPose-T and DTP output the 6D representation of rotation instead of other common representations such as quaternions, Euler angles and rotation matrices. The 6D representation has been shown to have better continuity in deep learning training (Zhou et al. 2019). However, we find that the positional error and mesh error of TransPose-T are slightly greater than those of PE-DLS. This is because Pose-S1 and Pose-S2 are trained separately and then combined. Although Pose-S1 estimates the full-body joint positions, the final output comes from Pose-S2, so errors may accumulate when the result passes through Pose-S2, which increases the positional error and mesh error. The positional error of PE-DLS is smaller than that of TransPose-T, but its rotational error is larger, indicating that the error of the twist angle of the joint around the polar axis is larger. This is because PE-DLS is a hybrid IK method combining analytical and numerical methods, and its main objective is to track the position of the end effector joints. The joint positions are strongly affected by the swing angle, while the twist angle does not contribute to the joint position. Therefore, the twist angle can only be optimized by heuristic rules and joint constraints, resulting in a large error. To improve the accuracy of joint position, we added the inverse kinematics loss \({\mathbb{L}}_{ik}\), which optimizes the estimation of joint position and joint rotation at the same time and leads to a lower positional error and mesh error.
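For readers unfamiliar with the 6D representation, the mapping from a 6D vector back to a rotation matrix is a Gram-Schmidt orthonormalization, as described by Zhou et al. (2019). The sketch below is illustrative and is not taken from the DTP or TransPose code: the function names are hypothetical, and we assume the two 3D vectors correspond to the first two columns of the rotation matrix (some implementations use rows instead).

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def rotation_from_6d(rep):
    """Map a 6D vector (a1, a2) to a 3x3 rotation matrix via Gram-Schmidt."""
    a1, a2 = rep[:3], rep[3:]
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalize the remainder.
    proj = _dot(b1, a2)
    b2 = _normalize([x - proj * y for x, y in zip(a2, b1)])
    # The third axis is fixed by the right-hand rule.
    b3 = _cross(b1, b2)
    # b1, b2 and b3 are the columns of the matrix (assumed column convention).
    return [[b1[i], b2[i], b3[i]] for i in range(3)]
```

Because any pair of non-parallel vectors maps to a valid rotation, the network output needs no extra normalization constraint, which is the continuity property referred to above.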
Overall, quantitative experimental results show that data-driven methods based on deep learning can generate more accurate and smoother motions than IK methods. In addition, the accuracy of DTP is better than other competing methods. Although its model complexity leads to the highest calculation cost, it is enough to meet the real-time requirements of the VR systems.
To highlight our findings and visually compare the results of the different methods, some motion clips are shown in Fig. 5. DTP always produced the pose most similar to the ground truth. Although TransPose-T also produced natural poses, penetration can be observed in the HH and FS clips, and visual errors can also be observed in the RW and KK clips. PE-DLS can track the root and end effectors well, but it fails to reconstruct the elbows (in the HH) and knees (in the FS) in some cases. Similar to PE-DLS, Final-IK tracks the end effectors well, but it can produce strange and unnatural poses (in the KF). There are visible positional errors of the elbows and knees, and the poses of the shoulders are not natural and look like a ‘shrug’. Furthermore, we find that both data-driven methods produce natural poses (the fourth and fifth columns in Fig. 5), with DTP reconstructing full-body poses more accurately. TransPose-T does not estimate the limbs well, and obvious errors can be seen in the left arm in the RW and the right leg in the KK. Additionally, we observe more jitter in Final-IK and PE-DLS than in TransPose-T and DTP. In short, DTP always generates natural motion that is similar to the ground truth.
Through visual observation (Fig. 5), we find that DTP always generates the poses most similar to the ground truth while the other methods do not, which indicates that DTP outperforms the other methods in accuracy. Compared with Final-IK and PE-DLS, which use analytical and numerical methods to solve the IK problem, DTP directly uses large-scale motion capture databases to learn the pose estimation model. In theory, human movements contain redundant information, and it is feasible to reconstruct the motion from a sparse input (Troje 2002; Safonova et al. 2004; Tong et al. 2020). However, it is not easy to estimate human pose because the problem is under-constrained. DTP addresses this problem by using a large Transformer encoder to enhance its modeling capability. With its 5.38 M parameters, the encoder can capture more complex patterns and relationships in the data, leading to better performance on pose estimation. Human motion is complex and influenced by many factors, including biomechanics (Khatib et al. 2004), physiology (Soechting and Flanders 1989a, b) and psychology (Jung and Choe 1996). IK methods are limited in that they can only use high-level information such as joint constraints to optimize the results. While joint constraints are an important factor in determining human motion, many other factors affect the way people move, such as comfort, individual differences and environmental factors, which can be difficult to model mathematically. Therefore, DTP uses deep learning to learn these constraints.
We find that IK methods generate more jitter than deep learning methods, which is also shown in Table 2. IK methods can sometimes produce discontinuous solutions, which may result in more jitter because they often optimize joint angles to achieve desired end effector positions without taking into account the smoothness or naturalness of the resulting motion. They also appear susceptible to shaking as a result of the joint limits. In contrast, deep learning methods can learn to generate smooth and natural motion by capturing the underlying patterns and structure of human motion data. Therefore, our model can generate natural and smooth motions without enforcing any joint limits, as the joint limits have already been implicitly modelled through training on the motion priors. These patterns and structures can be captured by analyzing large amounts of motion capture data and identifying common movements, poses and transitions between motions. For example, human motion data may exhibit certain symmetries, such as the fact that the left and right sides of the body often move in similar ways. It may also exhibit certain temporal patterns, such as the fact that certain motions tend to follow others in a predictable sequence. By analyzing these patterns and structures, deep learning models can learn to generate accurate motion sequences that are consistent with the underlying structure of human motion, resulting in more realistic and visually appealing animations. In conclusion, the discontinuity of IK solutions leads to greater jitter in the generated motion, while deep learning solutions have better continuity and generate smoother motion.
Furthermore, although TransPose-T is a data-driven method like DTP, we observe visual errors in the left arm and right leg (in the RW and KK in Fig. 5) that are not observed in DTP. We attribute this advantage to the inverse kinematics loss function of DTP. TransPose-T focuses on rotation prediction without considering the positions of the joints, whereas DTP devotes some attention to the positions of the end effectors, which are important for interaction in VR applications.
In the supplementary videos, we observed that the arm poses of PE-DLS are not very accurate and are prone to distortion. This is attributed to the inherent limitations of IK algorithms in ensuring the accuracy of intermediate joints, consistent with the findings of Zeng et al. (2022). Additionally, TransPose-T exhibits noticeable feet jitter, indicating an apparent “foot skating” issue. The absence of the “foot skating” issue in our method is attributed to the incorporation of the inverse kinematics loss \({\mathbb{L}}_{ik}\) and the feet velocity loss \({\mathbb{L}}_{ft}\), which greatly mitigate such problems.
In general, DTP can always estimate full-body poses well by exploiting motion priors and the temporal context of the motion. Furthermore, DTP has better continuity than the other methods and generates smoother motion.
4.3 Comparisons of DTP with and without fine-tuning
Feet occlusion is a common issue in real-time full-body motion capture because the feet trackers are susceptible to external environmental factors, such as a chair. We first perform online testing on the occluded dataset AMASS-OCC, and we found that the error of DTP on occluded data increased significantly. To further improve the robustness of DTP to occlusion noise, we simulate the occlusion with random noise and generate AMASS-OCC. DTP is then fine-tuned on AMASS-OCC. Finally, we compare DTP with and without fine-tuning on the test set of AMASS-OCC in terms of the positional error, rotational error, mesh error and jitter error. We refer to the fine-tuned model as DTP1.
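The idea of simulating occlusion with Gaussian random noise can be sketched as follows. This is not the procedure used to build AMASS-OCC: the function name, the tracker keys, the noise scale `sigma`, the per-frame occlusion probability and the choice to perturb positions only are all assumptions made for illustration.

```python
import random


def simulate_foot_occlusion(frames, foot_keys=("left_foot", "right_foot"),
                            sigma=0.05, occlusion_prob=0.3, seed=None):
    """Return a copy of `frames` with Gaussian noise on randomly occluded feet.

    frames: list of dicts mapping a tracker name to an (x, y, z) position in metres.
    """
    rng = random.Random(seed)
    noisy = []
    for frame in frames:
        out = dict(frame)  # shallow copy so the clean data stays untouched
        for key in foot_keys:
            # Assumed: each foot is occluded independently in each frame.
            if rng.random() < occlusion_prob:
                out[key] = tuple(c + rng.gauss(0.0, sigma) for c in frame[key])
        noisy.append(out)
    return noisy
```

Real occlusions usually persist over several consecutive frames, so a more faithful simulation would corrupt contiguous segments rather than independent frames.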
Compared to the online testing results on the AMASS-VR dataset (Table 1), we found that the original DTP and DTP-RNN models had increased position error, rotation error, and mesh error, and severe jitter error in the online testing on the occluded dataset AMASS-OCC (Table 3). Compared to the original models (DTP-RNN and DTP), the models fine-tuned on the training set of the AMASS-OCC (DTP-RNN1 and DTP1) showed a decrease in position error, rotation error, and mesh error in online testing on the AMASS-OCC, and also achieved lower jitter error. Although DTP showed the largest errors on the AMASS-OCC dataset, the performance of DTP1 after fine-tuning is significantly improved and achieves optimal performance.
Comparing the results of DTP and DTP1, we found that the positional error of DTP1 (1.55 cm) was 69.30% less than that of DTP (5.05 cm), the rotational error of DTP1 (7.60 °) was 43.45% less than that of DTP (13.44 °) and the mesh error of DTP1 (2.40 cm) was 60.00% less than that of DTP (6.00 cm). Comparing the results of DTP-RNN and DTP-RNN1, we found that the positional error of DTP-RNN1 (3.19 cm) was 17.36% less than that of DTP-RNN (3.86 cm) and the rotational error of DTP-RNN1 (9.16 °) was 10.37% less than that of DTP-RNN (10.22 °), while the mesh errors of DTP-RNN1 and DTP-RNN were 4.86 cm and 4.32 cm, respectively.
To visually compare the performance of DTP with and without fine-tuning, we show some clips of the test results under different feet occlusion settings (Fig. 6). Although the original DTP was unable to estimate the full-body pose accurately from occluded data, DTP1 can estimate the full-body pose accurately in all three occlusion scenarios. Fig. 6 also shows that both DTP-RNN and DTP-RNN1 can estimate relatively accurate poses when only the left or right leg is occluded. In addition, we found that foot occlusion not only leads to errors in the estimation of the legs, but also causes deviations in the upper body pose. DTP-RNN and DTP-RNN1 can also generate natural poses, but their errors with respect to the ground truth are significantly larger than those of DTP1. We also observed penetration in the results of DTP-RNN1 in DF.
As shown in Table 3, the mean and standard deviation of the positional error, rotational error, mesh error and jitter error of the original DTP and DTP-RNN increase on the AMASS-OCC test set. This indicates that local occlusion can significantly degrade pose estimation performance and produce unnatural poses. To improve the stability of DTP under feet occlusion, we fine-tune DTP and DTP-RNN on the training set of AMASS-OCC. The accuracy of the optimized model DTP1 is more than 40% better than that of the original DTP model. This indicates that fine-tuning the DTP model on local occlusion data helps the model adapt to occluded data, improves its ability to recognize feet occlusion, and thus yields better performance. We find that the original DTP model cannot estimate the pose accurately when both feet are occluded, resulting in awkward poses. This indicates that the DTP model before fine-tuning has poor stability against local occlusion noise and can hardly estimate the posture of the lower body when both feet inputs are noisy. This limitation prevents the DTP model from effectively capturing the spatial and temporal characteristics of occluded data and therefore from estimating the pose accurately. Therefore, we further improve and optimize it through fine-tuning. The experimental results (Fig. 6) show that the optimized DTP1 maintains stable full-body pose estimation when both feet are occluded, which means that the fine-tuned DTP1 can recognize noisy data and capture the correct temporal and spatial dependence, and can thus estimate an accurate full-body pose.
In conclusion, fine-tuning can significantly improve the stability of the DTP model on occluded data, and the performance of DTP after fine-tuning is higher than that of DTP-RNN. However, without fine-tuning, DTP performs slightly worse than DTP-RNN on occluded data, which indicates that more training data and longer training time are needed to achieve higher performance.
5 Results
5.1 Live demos
To evaluate the performance on real tracker data instead of a synthetic dataset, we implemented a real-time MoCap system, which obtains root and leaf joint data from the HTC VIVE headset and five VIVE Trackers and estimates the full-body pose with DTP. We present some clips in Fig. 7, and the results can also be seen in the supplementary video.
DTP estimates the full-body pose from the six tracked devices and achieves real-time performance at 90 fps. Furthermore, it can complete a more challenging task that involves interacting naturally with the virtual environment. The corresponding smooth and natural animations can also be seen in the supplementary video.
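As a quick sanity check on the real-time claim, a 90 fps target leaves roughly 11.1 ms per frame, which can be compared with the reported online inference times. The helper names below are ours, and the budget ignores rendering and tracker I/O, which the reported computational cost also excludes.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def fits_real_time(cost_ms, fps=90.0):
    """True if a per-frame cost fits within the frame budget."""
    return cost_ms <= frame_budget_ms(fps)


budget = frame_budget_ms(90.0)   # about 11.11 ms
headroom = budget - 2.49         # time left after DTP's reported 2.49 ms
print(f"budget: {budget:.2f} ms, headroom after DTP: {headroom:.2f} ms")
```

Under these assumptions, the reported 2.49 ms per frame uses less than a quarter of the budget, leaving the remainder for rendering and data acquisition.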
5.2 Superiority of transformer
In order to prove the superiority of Transformer in DTP algorithm, we evaluate DTP and DTP-RNN in terms of the accuracy and computational cost by testing on the dataset.
As shown in Table 1, the positional error, rotational error, mesh error and jitter error of the DTP model are smaller than those of the DTP-RNN model in both offline and online testing, which means that the Transformer model has stronger modeling ability than the RNN model and can learn more accurate human movement patterns from large-scale motion datasets. This is because the multi-head attention mechanism projects the input data into multiple subspaces and captures richer information through them, which helps the model learn different behavior patterns that are finally combined into the output features. It also leads to more parameters, which makes the Transformer more difficult to train. In the offline test, the model has access to the entire motion sequence, while in the online test only a fixed 26-frame window is available. Therefore, the error increases in the online test, but the increase is smaller than one unit of measurement.
In online testing, both the DTP model and the DTP-RNN model have a low time cost. However, in the offline test, the time cost of the DTP-RNN model increases sharply by an order of magnitude, while the computational cost of the DTP model remains basically stable, varying by less than 1 ms. This is because the RNN model must be computed step by step along the time axis and cannot be parallelized. Thus the computational cost of DTP-RNN in offline testing with a long sequence increases significantly. In contrast, the high degree of parallelism of the Transformer model makes it insensitive to the length of the input, and the cost of forward propagation differs little between input sequences of different lengths. Although the online computational cost of DTP is higher than that of DTP-RNN, it is low enough to meet the real-time requirements of VR applications.
5.3 System evaluations
To evaluate the practicability of the DTP algorithm, we implement a real-time motion capture system based on an HTC VIVE headset and five VIVE Trackers. To evaluate the system performance intuitively, we provide motion clips from real-time motion capture experiments in VR.
As shown in Fig. 7, DTP can not only generate natural and smooth movements such as walking, jumping and squatting, but also complete accurate human-object interaction in the interactive tasks, which means that DTP has strong generalization ability. We believe this accuracy is sufficient for most VR applications. Although some visible errors can be observed, users wearing the headset cannot see their real body, which makes a small error tolerable. In VR applications, the fluidity and naturalness of motion directly affect the user experience. Therefore, smooth motion capture is one of the key issues to be considered in VR application development. In addition, the interactive task experiments show that DTP can generate accurate end effector positions, which means that DTP can help to build a low-cost motion capture system for interactive VR environments, such as rehabilitation training, simulation training and serious games.
5.4 Limitations and future work
The test results on AMASS-OCC show that the stability of the initial DTP model under feet occlusion is poor, which indicates that the DTP model has some limitations when dealing with occluded data. This does not mean that the model cannot be applied to such problems; rather, it provides guidance for subsequent optimization and fine-tuning. We have significantly improved the performance by fine-tuning on the AMASS-OCC training data, but the occlusion problem cannot be completely solved by DTP. With the emergence of large language models like ChatGPT, we believe that architectures based on large Transformer models have significant potential.
Optimizing computational cost involves various strategies. In the future, we will explore simplified layer architectures or alternative layer types that maintain performance while reducing computational requirements (Touvron et al. 2021; Tang et al. 2024). In addition to exploring algorithmic optimizations, we will implement parallel processing techniques to distribute computations across multiple devices or cores at the hardware level, thereby enhancing the speed of inference (Pope et al. 2022).
In the future, we plan to explore fine-tuning based on large models (Du et al. 2022) to improve stability under occlusion. However, when exploring large models, we need to carefully consider factors such as the cost of model training and ways to reduce the model size, for example through knowledge distillation. Furthermore, we have only made a brief exploration of the occlusion problem, and there are many further research directions, such as the construction of a real local occlusion dataset. In addition, we can try other data augmentation techniques, such as random erasing, to improve the ability of the model to handle occluded data. Multi-sensor fusion technology (Malleson et al. 2020) can be used to integrate image sensors, which is also a popular research direction. As shown in Fig. 6, foot occlusion not only leads to errors in the estimation of the legs, but also causes deviations in the upper body pose. This finding aligns with the research of Yang et al. (2021), who estimated lower body movements from upper body actions. This implies that training upper and lower body pose estimation separately to mitigate occlusion issues, especially foot occlusion, is a promising avenue for research.
6 Conclusion
In this paper, we propose DTP, a deep-learning based method that estimates full-body pose from the measurements recorded by an HTC VIVE headset and five VIVE Trackers. The core idea of DTP is to learn human motion patterns and structures in multiple subspaces using the multi-head attention mechanism of the Transformer encoder. To learn from a sufficiently large database, we synthesized the VR dataset AMASS-VR from the large-scale motion dataset AMASS. By comparing and analyzing the performance of DTP and DTP-RNN, we have shown that in the full-body pose estimation task based on sparse VR sensor data, the DTP model has stronger modeling ability than the DTP-RNN model and can generate more accurate and natural poses. In addition, by comparing with other methods, we find that DTP outperforms the existing Final-IK, PE-DLS and TransPose. Although the computational cost of DTP is higher than that of the other methods, its high parallelism allows it to be computed in parallel, so that its computational cost (2.49 ms) meets the real-time requirements of VR systems.
To evaluate the robustness of the DTP algorithm to the feet occlusion problem in VR systems, we propose an algorithm that simulates local occlusion data using random noise and synthesize the feet occlusion dataset AMASS-OCC from AMASS. Testing on the occluded dataset shows that the original DTP model has poor robustness to occlusion. To further improve the robustness to occlusion noise, we use AMASS-OCC to fine-tune the original DTP model. The results show that the accuracy of the fine-tuned DTP model on AMASS-OCC improved by more than 40%.
To evaluate the practical value of DTP, we construct a real-time full-body motion capture system based on the HTC VIVE headset and five VIVE Trackers. The system uses DTP to estimate full-body pose and runs in real time at 90 fps. The results show that DTP can accurately estimate the full-body pose and generate smooth and natural motion with accurate positions of the end-effector joints, which enables natural interactions in VR. Therefore, the greatest contribution of the DTP algorithm is to provide a highly accurate real-time pose estimation method for VR environments.
Although DTP has achieved the expected experimental results, it still has some limitations. For example, although we have improved stability under occlusion by fine-tuning, visual errors can still be observed in some cases. This does not mean that DTP cannot be applied to such problems, but rather that it needs to be optimized and improved for specific problems. Future work can improve stability under occlusion and train upper and lower body pose estimation separately.
Data availability
All data generated or analysed during this study are included in this published article.
Code availability
The source code is not yet fully organized, but we will share it soon.
Notes
Final IK: https://assetstore.unity.com/packages/tools/animation/final-ik-14290, last visited on March 1st, 2023.
References
Aristidou A, Lasenby J, Chrysanthou Y, Shamir A (2018) Inverse kinematics techniques in computer graphics: a survey. Comput Graph Forum 37:35–58. https://doi.org/10.1111/cgf.13310
Butt HT, Taetz B, Musahl M et al (2021) Magnetometer robust deep human pose regression with uncertainty prediction using sparse body worn magnetic inertial measurement units. IEEE Access 9:36657–36673. https://doi.org/10.1109/ACCESS.2021.3062545
Caserman P, Achenbach P, Gobel S (2019a) Analysis of inverse kinematics solutions for full-Body reconstruction in virtual reality. In: 2019 IEEE 7th International Conference on Serious Games and Applications for Health, SeGAH 2019:1–8
Caserman P, Garcia-Agundez A, Konrad R et al (2019b) Real-time body tracking in virtual reality using a vive tracker. Virtual Real 23:155–168. https://doi.org/10.1007/s10055-018-0374-z
Caserman P, Garcia-Agundez A, Gobel S (2020) A survey of full-body motion reconstruction in immersive virtual reality applications. IEEE Trans Vis Comput Graph 26:3089–3108. https://doi.org/10.1109/TVCG.2019.2912607
Chai J, Hodgins JK (2005) Performance animation from low-dimensional control signals. ACM Trans Graph 24:686–696. https://doi.org/10.1145/1073204.1073248
Du Z, Qian Y, Liu X et al (2022) GLM: General Language Model pretraining with autoregressive blank infilling. Proc Annu Meet Assoc Comput Linguist 1:320–335. https://doi.org/10.18653/v1/2022.acl-long.26
Du Y, Kips R, Pumarola A et al (2023) Avatars grow legs: Generating smooth human motion from sparse Tracking inputs with Diffusion Model. Proc IEEE Comput Soc Conf Comput Vis Pattern Recognit 2023-June 481–490. https://doi.org/10.1109/CVPR52729.2023.00054
Flash T, Hogan N (1985) The coordination of arm movements: an experimentally confirmed mathematical model. J Neurosci 5:1688–1703. https://doi.org/10.1523/jneurosci.05-07-01688.1985
Greuter S, Roberts DJ (2014) SpaceWalk: Movement and interaction in virtual space with commodity hardware. In: ACM International Conference Proceeding Series. pp 1–7
Habermann M, Xu W, Zollhöfer M et al (2019) LiveCap: real-time human performance capture from monocular video. ACM Trans Graph 38. https://doi.org/10.1145/3311970
He K, Gkioxari G, Dollár P, Girshick R (2020) Mask R-CNN. IEEE Trans Pattern Anal Mach Intell 42:386–397. https://doi.org/10.1109/TPAMI.2018.2844175
He K, Chen X, Xie S, et al (2022) Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp 15979–15988
Holden D, Saito J, Komura T (2016) A deep learning framework for character motion synthesis and editing. ACM Trans Graph 35:1–11. https://doi.org/10.1145/2897824.2925975
Huang Y, Kaufmann M, Aksan E et al (2018) Deep inertial poser: learning to reconstruct human pose from sparse inertial measurements in real time. ACM Trans Graph 37:1–15. https://doi.org/10.1145/3272127.3275108
Jiang F, Yang X, Feng L (2016) Real-time full-body motion reconstruction and recognition for off-the-shelf VR devices. In: Proceedings - VRCAI 2016: 15th ACM SIGGRAPH Conference on Virtual-Reality Continuum and Its Applications in Industry. pp 309–318
Jiang J, Streli P, Qiu H et al (2022a) AvatarPoser: Articulated Full-Body Pose Tracking from Sparse Motion Sensing. In: European Conference on Computer Vision. pp 443–460
Jiang Y, Ye Y, Gopinath D et al (2022b) Transformer Inertial Poser: real-time Human Motion Reconstruction from sparse IMUs with simultaneous terrain generation. Association for Computing Machinery
Johnson M, Humer I, Zimmerman B et al (2016) Low-cost latency compensation in motion tracking for smartphone-based head mounted display. In: Proceedings of the Workshop on Advanced Visual Interfaces AVI. pp 316–317
Jung ES, Choe J (1996) Human reach posture prediction based on psychophysical discomfort. Int J Ind Ergon 18:173–179. https://doi.org/10.1016/0169-8141(95)00080-1
Khatib O, Sentis L, Park J, Warren J (2004) Whole-body dynamic behavior and control of Human-Like Robots. Int J Humanoid Robot 01:29–43. https://doi.org/10.1142/s0219843604000058
Kim J, Seol Y, Lee J (2012) Realtime performance animation using sparse 3D motion sensors. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics) 7660 LNCS:31–42. https://doi.org/10.1007/978-3-642-34710-8_4
Kim SU, Jang H, Im H, Kim J (2021) Human motion reconstruction using deep transformer networks. Pattern Recognit Lett 150:162–169. https://doi.org/10.1016/j.patrec.2021.06.018
Krüger B, Tautges J, Weber A, Zinke A (2010) Fast local and global similarity searches in large motion capture databases. In: Computer Animation 2010 - ACM SIGGRAPH / Eurographics Symposium Proceedings, SCA 2010:1–10
Leoncini P, Sikorski B, Baraniello V et al (2017) Multiple NUI device approach to full body tracking for collaborative virtual environments. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). pp 131–147
Li W, Liu H, Ding R et al (2021) Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation. 1–13
Liu H, Wei X, Chai J et al (2011) Realtime human motion control with a small number of inertial sensors. Proc Symp Interact 3D Graph 133–140. https://doi.org/10.1145/1944745.1944768
Liu X, Feng X, Pan S et al (2018) Skeleton tracking based on Kinect camera and the application in virtual reality system. ACM Int Conf Proceeding Ser 21–25. https://doi.org/10.1145/3198910.3198915
Loper M, Mahmood N, Romero J et al (2015) SMPL: a skinned multi-person linear model. ACM Trans Graph 34:1–16. https://doi.org/10.1145/2816795.2818013
Madadi M, Bertiche H, Escalera S (2021) Deep unsupervised 3D human body reconstruction from a sparse set of landmarks. Int J Comput Vis 129:2499–2512. https://doi.org/10.1007/s11263-021-01488-2
Mahmood N, Ghorbani N, Troje NF et al (2019) AMASS: Archive of motion capture as surface shapes. In: Proceedings of the IEEE International Conference on Computer Vision. pp 5441–5450
Malleson C, Collomosse J, Hilton A (2020) Real-time multi-person motion capture from multi-view video and IMUs. Int J Comput Vis 128:1594–1611. https://doi.org/10.1007/s11263-019-01270-5
Mehta D, Rhodin H, Casas D et al (2018) Monocular 3D human pose estimation in the wild using improved CNN supervision. Proc - 2017 Int Conf 3D Vision, 3DV 2017:506–516. https://doi.org/10.1109/3DV.2017.00064
Murray RM, Li Z, Sastry SS (2017) A Mathematical introduction to robotic manipulation. CRC
Parger M, Schmalstieg D, Mueller JH, Steinberger M (2018) Human upper-body inverse kinematics for increased embodiment in consumer-grade virtual reality. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, VRST. pp 1–10
Pope R, Douglas S, Chowdhery A et al (2022) Efficiently scaling transformer inference. abs/2211.0
Raaen K (2015) Measuring latency in virtual reality systems. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). pp 457–462
Romero J, Tzionas D, Black MJ (2017) Embodied hands: modeling and capturing hands and bodies together. ACM Trans Graph 36:1–17
Safonova A, Hodgins JK, Pollard NS (2004) Synthesizing physically realistic human motion in low-dimensional, behavior-specific spaces. ACM SIGGRAPH 2004 Pap SIGGRAPH 2004:514–521. https://doi.org/10.1145/1186562.1015754
Slyper R, Hodgins JK (2008) Action capture with accelerometers. Comput Animat 2008 - ACM SIGGRAPH / Eurographics Symp SCA 2008 - Proc 193–199
Soechting JF, Flanders M (1989a) Sensorimotor representations for pointing to targets in three-dimensional space. J Neurophysiol 62:582–594. https://doi.org/10.1152/jn.1989.62.2.582
Soechting JF, Flanders M (1989b) Errors in pointing are due to approximations in sensorimotor transformations. J Neurophysiol 62:595–608. https://doi.org/10.1152/jn.1989.62.2.595
Tang Y, Wang Y, Guo J et al (2024) A Survey on Transformer Compression. 1–20
Tautges J, Zinke A, Krüger B et al (2011) Motion reconstruction using sparse accelerometer data. ACM Trans Graph 30:1–12. https://doi.org/10.1145/1966394.1966397
Tong L, Liu R, Peng L (2020) LSTM-based lower limbs motion reconstruction using low-dimensional input of inertial motion capture system. IEEE Sens J 20:3667–3677. https://doi.org/10.1109/JSEN.2019.2959639
Touvron H, Cord M, Douze M et al (2021) Training data-efficient image transformers & distillation through attention. Proc Mach Learn Res 139:10347–10357
Troje NF (2002) Decomposing biological motion: a framework for analysis and synthesis of human gait patterns. J Vis 2:371–387. https://doi.org/10.1167/2.5.2
Trumble M, Gilbert A, Malleson C et al (2017) Total Capture: 3D Human Pose Estimation Fusing Video and Inertial Sensors. In: 2017 British Machine Vision Conference (BMVC)
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Advances in Neural Information Processing Systems. pp 5999–6009
Weytjens H, De Weerdt J (2020) Process outcome prediction: CNN vs. LSTM (with attention). Lect Notes Bus Inf Process 397:321–333. https://doi.org/10.1007/978-3-030-66498-5_24
Winkler A, Won J, Ye Y (2022) QuestSim: Human motion tracking from sparse sensors with simulated avatars. In: SIGGRAPH Asia 2022 Conference Papers. pp 1–8
Xu W, Chatterjee A, Zollhöfer M et al (2018) MonoPerfCap: human performance capture from monocular video. ACM Trans Graph 37. https://doi.org/10.1145/3181973
Yang D, Kim D, Lee SH (2021) LoBSTr: real-time lower-body pose prediction from sparse upper-body tracking signals. Comput Graph Forum 40:265–275. https://doi.org/10.1111/cgf.142631
Yi X, Zhou Y, Xu F (2021) TransPose: real-time 3D human translation and pose estimation with six inertial sensors. ACM Trans Graph 40:1–13. https://doi.org/10.1145/3450626.3459786
Yi X, Zhou Y, Habermann M et al (2022) Physical Inertial Poser (PIP): Physics-aware Real-time Human Motion Tracking from Sparse Inertial Sensors. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp 13157–13168
Zeng Q, Zheng G, Liu Q (2022) PE-DLS: a novel method for performing real-time full-body motion reconstruction in VR based on vive trackers. Virtual Real 26:1391–1407. https://doi.org/10.1007/s10055-022-00635-5
Zheng Z, Ma H, Yan W et al (2021) Training data selection and optimal sensor placement for deep-learning-based sparse inertial sensor human posture reconstruction. Entropy 23:1–18. https://doi.org/10.3390/e23050588
Zhou Y, Barnes C, Lu J et al (2019) On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, pp 5738–5746
Acknowledgements
We gratefully acknowledge the support of the Major Special Science and Technology Project of Hainan Province (ZDKJ202006).
Funding
This work was supported in part by the Major Special Science and Technology Project of Hainan Province under Grant ZDKJ202006.
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Qiang Zeng. The first draft of the manuscript was written by Qiang Zeng, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zeng, Q., Zheng, G. & Liu, Q. DTP: learning to estimate full-body pose in real-time from sparse VR sensor measurements. Virtual Reality 28, 116 (2024). https://doi.org/10.1007/s10055-024-01011-1
DOI: https://doi.org/10.1007/s10055-024-01011-1