1 Introduction

Full-body pose estimation has become increasingly important for virtual reality (VR) applications seeking higher immersion. With the development of motion capture (MoCap) technologies, this problem may appear to be solved by existing MoCap systems. However, studies have shown that representing full-body avatars in VR remains a great challenge (Caserman et al. 2020). Vision-based MoCap systems are the most popular and can be divided into marker-based and markerless approaches. Currently, marker-based optical systems are the most effective for this purpose, and many studies have exploited such systems in the development of VR applications (Leoncini et al. 2017). However, commercial marker-based systems are expensive and require complex setup. Markerless systems reconstruct full-body motion using several red–green–blue (RGB) or RGB–depth (RGB-D) cameras (Xu et al. 2018; Habermann et al. 2019; Li et al. 2021). Compared with marker-based systems, markerless MoCap systems are more lightweight (Liu et al. 2018), but they have a significant drawback: they can track the full body robustly only when the user is standing in front of the camera. Stable operation of markerless systems therefore usually requires the user to face the camera (Greuter and Roberts 2014), whereas in VR scenarios users typically face any direction. This flexibility in orientation makes markerless systems susceptible to self-occlusion or environmental occlusion, so they are not sufficiently accurate and robust for VR applications, where stable and accurate performance is essential. Our method is based on the HTC VIVE Tracker, which is essentially a marker-based approach and offers better accuracy and stability than markerless methods. Inertial measurement units (IMUs) could offer a compromise between cost and accuracy. Unfortunately, they suffer from high latency when integrated into VR: Johnson et al. (2016) reported an end-to-end latency of approximately 300 ms when using the Perception Neuron motion capture system for real-time full-body motion reconstruction in VR.

In recent years, with the development of VR devices, an increasing number of researchers have attempted to reconstruct full-body pose from the measurements of off-the-shelf VR devices (Jiang et al. 2016; Yang et al. 2021). In particular, MoCap systems based on the HTC VIVE headset and VIVE Tracker measurements have become popular. Caserman et al. have shown the feasibility of reconstructing full-body pose from sparse VIVE Tracker measurements by solving the inverse kinematics (IK) problem (Caserman et al. 2019b). Nevertheless, this is challenging because the IK problem is inherently underconstrained, and its ambiguity may result in unnatural poses and shaky motion.

The growing popularity of deep learning networks and the availability of large-scale MoCap data have inspired researchers to leverage human motion priors (Aristidou et al. 2018). Previous works have successfully estimated full-body pose from sparse IMU sensors using deep learning methods (Huang et al. 2018). Yi et al. have shown that joint positions are easier to estimate than orientations (Yi et al. 2021). However, the measurements recorded by the VIVE Tracker differ from those of IMU sensors: although tracker measurements provide accurate global positions, they suffer from occlusion problems.

To improve the accuracy and the robustness to occlusion noise, we introduce Deep Tracker Poser (DTP): a deep learning method for real-time full-body pose estimation from the measurements of an HTC VIVE headset and five HTC VIVE Trackers. DTP contains three parts: preprocessing, a deep neural network (DNN) and postprocessing. The data obtained from the VR sensors are calibrated, converted and normalized in the preprocessing stage to obtain the input of the DNN. The DNN first maps the low-dimensional input to a higher-dimensional space through an embedding layer, then uses a Transformer encoder to learn human motion characteristics, and finally decodes a 6-dimensional (6D) representation of the full-body joint rotations through a simple linear layer. In the postprocessing stage, the 6D representation is orthonormalized using Gram-Schmidt to obtain rotation matrices, which are finally applied to the SMPL model (Loper et al. 2015) to obtain the estimated full-body pose.

To obtain sufficient data for generalization, we synthesize the VR sensor dataset AMASS-VR from the AMASS dataset (Mahmood et al. 2019). The AMASS data are represented by the parameters of the Skinned Multi-Person Linear model with articulated hands (SMPL + H) (Loper et al. 2015; Romero et al. 2017). To improve the robustness to feet occlusion, which is a common problem with the VIVE Tracker, we further synthesize an occluded dataset and fine-tune DTP on it.

To evaluate the performance of DTP, we compare it with other methods, including Final-IK, PE-DLS (Zeng et al. 2022) and TransPose (Yi et al. 2021), in terms of accuracy and computational cost. The results indicate that DTP outperforms the other methods in terms of positional error (1.04 cm) and rotational error (4.22°). Although DTP has a higher computational cost, 2.49 ms per frame is sufficient for real-time performance in VR. Furthermore, qualitative and quantitative evaluations show that DTP estimates the full-body pose well even under severe feet occlusion. In conclusion, DTP generates more accurate and natural full-body poses and is more robust to occlusion noise. These findings show that DTP is effective in modeling the mapping from sparse tracker measurements to the full-body pose and in mitigating the occlusion problem of VIVE Tracker measurements. DTP thus contributes to constructing a high-accuracy and occlusion-robust MoCap system for VR applications based on an HTC VIVE headset and only five trackers.

In summary, the main contributions of our work are as follows:

  1. We propose DTP, a novel and effective real-time pose estimation method. The input of this method is the measurements of an HTC VIVE headset and five HTC VIVE Trackers, and the output is the full-body pose. We use a Transformer encoder as the core of the deep neural network to capture motion prior knowledge and achieve an accurate mapping from sparse sensor data to the full-body pose.

  2. We propose a method for synthesizing a VR sensor dataset from the publicly available AMASS dataset to achieve sufficient generalization.

  3. We propose to simulate feet occlusion with Gaussian random noise and synthesize the occlusion dataset AMASS-OCC. To further improve the robustness of DTP to feet occlusion noise, we fine-tune it on AMASS-OCC.

2 Related work

Although MoCap has a long history, representing full-body avatars in VR is still a considerable challenge. Caserman et al. have presented a survey of diverse full-body MoCap techniques (Caserman et al. 2020). In this section, we review three main types of MoCap systems and the development of deep-learning-based full-body pose estimation.

2.1 IK-based methods

With the rapid development of VR, recent studies have attempted to reconstruct full-body motion using off-the-shelf VR devices (Parger et al. 2018). Jiang et al. tracked the head and hands with the HTC VIVE headset and controllers and then estimated the upper-body pose by solving the IK problem (Jiang et al. 2016). Although they recognized the lower body based on animation blending, this approach is not always accurate. Caserman et al. reconstructed full-body motion from an HTC VIVE headset and trackers by solving the IK problem (Caserman et al. 2019b), and they analyzed the performance of the popular numerical damped least squares (DLS) IK method for full-body reconstruction in VR (Caserman et al. 2019a). However, the main objective of an IK solver is to reach the target, which may result in unnatural poses even if the solution converges, because of the inherent ambiguity of the IK problem. More information on IK solvers can be found in the survey by Aristidou et al. (2018).

2.2 Pose estimation

In order to ensure the naturalness of the full-body pose, previous works have demonstrated the feasibility of estimating full-body pose from sparse data (Chai and Hodgins 2005; Slyper and Hodgins 2008; Liu et al. 2011; Kim et al. 2012; Tong et al. 2020) by using data-driven methods. Chai and Hodgins first performed full-body animation using two cameras and six markers (Chai and Hodgins 2005). They proposed modelling the latent space by means of principal component analysis (PCA) and performing a fast search of motion examples using a nearest neighbors search algorithm. Subsequently, Krüger et al. proposed a fast method for similarity searching (Krüger et al. 2010), and Tautges et al. generated full-body animations using only four accelerometers and built a lazy neighborhood graph online for faster searching (Tautges et al. 2011). However, these methods do not scale well with the continuous growth of MoCap databases, and they have poor real-time performance and high spatial complexity because of the online search process.

In contrast to data-driven methods, which search for the closest pose for full-body reconstruction, deep-learning-based methods directly map sparse signals to full-body pose. Holden et al. proposed a convolutional autoencoder to model motion manifolds (Holden et al. 2016). Huang et al. adopted a biRNN to directly map only six IMU measurements to full-body joint orientations (Huang et al. 2018). Following this work, Yi et al. proposed a multistage network architecture to reconstruct full-body pose from six IMU measurements, and they estimated the global translation (Yi et al. 2021).

Recently, many works have successfully applied deep learning to human pose estimation (Butt et al. 2021; Kim et al. 2021; Madadi et al. 2021; Zheng et al. 2021). Butt et al. introduced effective methods for training-data and sensor-position selection in sparse inertial-sensor-based human posture reconstruction. However, their algorithm relies on a heuristic greedy strategy to obtain approximate solutions to the established optimization problem, which may yield solutions that are not globally optimal. Madadi et al. used a bidirectional recurrent autoencoder-based model to estimate 3D human pose from only six magnetic-inertial measurement units; their approach incorporates a 3D angle representation that eliminates yaw-angle dependency. We use HTC VIVE devices to measure the pose and do not suffer from this problem. Kim et al. replaced the embedding layer of an attention model with an RNN in order to address the issue of discontinuity. In contrast, our approach completely discards the embedding layer and chooses the 6D rotation representation over other common representations as the output for better continuity, as demonstrated in (Zhou et al. 2019). Zheng et al. introduced the first deep unsupervised approach for human body reconstruction, with an attention model for estimating body joints from landmarks. However, they did not take the issue of occlusion into account.

In recent years, CNNs have shown significant achievements, not only in image-related tasks (Mehta et al. 2018) but also in temporal tasks (Holden et al. 2016; Weytjens and De Weerdt 2020). However, the literature suggests that RNNs perform better in pose estimation tasks; Yang et al. (2021) demonstrated that RNNs outperform CNNs in pose estimation. Temporal models excel at leveraging the temporal information of human motion and are applicable to processing action sequences. RNNs (Huang et al. 2018; Yang et al. 2021; Yi et al. 2021) have fewer parameters and simple structures and are easy to train, but their capability to handle long-range dependencies is limited. Transformers (Kim et al. 2021) exhibit stronger modeling capability, support parallel processing, and excel at capturing global relationships, demonstrating superior performance in handling long-range dependencies. Therefore, this paper adopts the Transformer model as the backbone network.

Yi et al. and Jiang et al. estimated full-body posture based on IMUs (Jiang et al. 2022b; Yi et al. 2022). However, IMUs cannot provide global position information and suffer from drift. In contrast, the HTC VIVE Tracker we use offers accurate and stable global position and rotation data. In state-of-the-art work, Du et al. (2023) used an MLP-based diffusion model to generate realistic and smooth human motions from sparse tracking signals. However, they have not yet addressed the occlusion issues specific to VR devices, and their approach relies solely on the pose information of the head and hands. Similarly, (Jiang et al. 2022a; Winkler et al. 2022) only utilize pose information from the head and hands, leading to insufficient accuracy in lower-body movements. The approach proposed by Kim et al. (2021) is the most similar to ours. In comparison to their approach, we adjust the network architecture by replacing the embedding layer with a simple two-layer fully connected layer, and the decoder uses a two-layer fully connected layer instead of a Transformer decoder. This results in a more lightweight network with fewer parameters that is easier to train. Furthermore, because previous methods (Kim et al. 2021; Yang et al. 2021; Zeng et al. 2022) do not consider occlusion, the robustness of their pose estimation under occlusion is significantly limited. Therefore, we introduce a synthetic occlusion dataset, AMASS-OCC, to optimize the deep learning model, further enhancing the robustness of pose estimation in occluded scenarios.

In this paper, we introduce a novel method based on a Transformer encoder to estimate the full-body pose from the measurements of six VR sensors. Occlusion of a foot-worn VIVE Tracker from the base station by the user's own body or by the environment is a common problem. Yang et al. proposed a deep-neural-network-based method for predicting the lower-body pose from only the tracking information of the upper-body joints; in this way, they were able to avoid failure in cases of occlusion (Yang et al. 2021). However, due to the lack of tracking information on the lower body, the additional ambiguity results in growing inaccuracy and unnaturalness. To improve the robustness against occlusion noise, we propose a novel occlusion data synthesis method to generate an occlusion dataset on which to fine-tune our model.

3 Methodology

Our objective is to estimate the full-body pose using an HTC VIVE headset and five VIVE Trackers. The overall flowchart of our method, DTP, is illustrated in Fig. 1. We first introduce the three key stages of DTP: preprocessing, the DNN and postprocessing. We then describe the method for fine-tuning DTP. Finally, we give the details of our implementation.

Fig. 1 Flowchart of our method

3.1 Model representation

The SMPL model (Loper et al. 2015) is used to represent the virtual avatar in this study, without incorporating hand gestures. The pose is depicted as a configuration of 22 joints, comprising one root joint and 21 local joints. The articulated structure of the virtual avatar, illustrated in Fig. 2, consists of 69 degrees of freedom (DOFs), with each joint possessing 3 DOFs except for the root joint, which has 6 DOFs. The rotation of each joint is defined using a 3 × 3 rotation matrix in this paper.

Fig. 2 The SMPL model. There are 22 joints with 69 DOFs. The solid circles are the tracked joints (root and end effectors) and the hollow circles are the joints estimated by our model

3.2 Preprocessing

We use the position and rotation measurements of the HTC VIVE headset and five VIVE Trackers as the input to the system. For brevity, the HTC VIVE headset and the HTC VIVE Trackers are collectively referred to as VR sensors.

These sensors are strapped to the pelvis, head, left and right hands, and left and right ankles. First, we obtain the corresponding positions and orientations of the body parts from the six sensor measurements using the "T-pose" calibration (Zeng et al. 2022). The measurements of the six VR sensors are denoted by \({p}_{leaf}^{g}\) and \({R}_{leaf}^{g}\), except for those of the sensor strapped to the pelvis, which are denoted by \({p}_{root}\) and \({R}_{root}\). Then, we normalize the leaf positions and rotations by transforming these measurements from the world frame to the root reference frame. The normalized positions \({p}_{leaf}\) and rotations \({R}_{leaf}\) of the leaf joints are computed as:

$${p_{leaf}}=R_{{root}}^{{ - 1}}\left( {p_{{leaf}}^{g} - {p_{root}}} \right)$$
(1)
$${R_{leaf}}=R_{{root}}^{{ - 1}}R_{{leaf}}^{g}$$
(2)

Finally, we concatenate these measurements to obtain the input \(x(t)=[{p}_{root},{R}_{root},\{{p}_{leaf}\},\{{R}_{leaf}\}]\in {\mathbb{R}}^{72}\), where \(p\in {\mathbb{R}}^{3}\) is the position of the pelvis, \(R\in {\mathbb{R}}^{9}\) is the flattened rotation matrix of the pelvis, and \(leaf=1,\dots,5\) indexes the leaf sensors. For brevity, we omit \(t\) for time-varying variables unless it is needed.
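The following NumPy sketch illustrates this normalization and the assembly of the 72-D input vector under the above definitions; the function name and the array layout of the five leaf sensors are our own assumptions, not part of the released implementation.

```python
import numpy as np

def build_input(p_root, R_root, p_leaf_g, R_leaf_g):
    """Assemble the 72-D input x(t) from calibrated sensor measurements.

    p_root:   (3,)      pelvis position in the world frame
    R_root:   (3, 3)    pelvis rotation in the world frame
    p_leaf_g: (5, 3)    leaf-sensor positions in the world frame
    R_leaf_g: (5, 3, 3) leaf-sensor rotations in the world frame
    """
    R_root_inv = R_root.T                                    # inverse of a rotation matrix
    p_leaf = (R_root_inv @ (p_leaf_g - p_root).T).T           # Eq. (1)
    R_leaf = np.einsum('ij,njk->nik', R_root_inv, R_leaf_g)   # Eq. (2)

    # [p_root (3) | R_root (9) | p_leaf (5x3) | R_leaf (5x9)] -> 72 values
    x = np.concatenate([p_root, R_root.ravel(),
                        p_leaf.ravel(), R_leaf.ravel()])
    assert x.shape == (72,)
    return x
```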

3.3 DNN

The main objective of our task is to estimate the full-body joint orientations from sparse tracker measurements. This task is challenging because the problem is inherently ambiguous: many different full-body joint configurations can produce the same sparse tracker measurements. In this paper, we propose a novel method that leverages human motion prior knowledge for pose estimation. Yi et al. and Huang et al. have shown excellent RNN results for full-body pose estimation from six inertial sensors (Huang et al. 2018; Yi et al. 2021). However, Transformer models outperform recurrent neural network (RNN) models in natural language processing tasks: they use an attention mechanism to capture long-distance dependencies in a sequence, whereas RNN models can only capture limited context information. Additionally, Transformer models can process input sequences in parallel, while RNN models require sequential processing, making Transformer models more computationally efficient. Therefore, DTP learns the motion model using a Transformer.

We separate the pose estimation network into three stages: embedding, encoding and decoding. Specifically, we first map \({x}^{72}\) to a 256-dimensional embedding vector, a representation of the input in a higher-dimensional space that provides more detailed information about the input and leads to better performance. Then we process the embedding vector with a Transformer encoder, whose self-attention mechanism encodes the input in a more meaningful way. Finally, we use a single linear layer to efficiently decode the 256-dimensional encoding into the full-body pose in the 6D representation \({y}^{126}\).

Fig. 3 Network architecture of our DTP. We map the sparse input \({x}^{72}\) to the full-body pose in three stages: embedding, encoding and decoding. First, we embed the input vector \({x}^{72}\) into a 256-dimensional vector. Then, we use a 3-layer Transformer encoder to encode the vector. Finally, we use a single linear layer to efficiently decode the vector into the full-body pose in the 6D representation \({y}^{126}\). a The DNN architecture of pose estimation. b The multi-head attention module. c The scaled dot-product attention

We feed the input \(x(t)\) into the deep neural network, whose output consists of the root-relative full-body joint rotations \(p_{full}(t)=[p_{j}(t)]\in {\mathbb{R}}^{3(J-1)}\), where \(j=1,2,\dots,J-1\) and \(J\) is the number of joints in the human skeleton representation (\(J=22\) in this paper). The Transformer architecture allows us to leverage temporal information. For real-time prediction, we use a sliding window that contains a few past and future frames. Following (Huang et al. 2018), the sliding window comprises a total of 26 frames: 20 past frames, the current frame and 5 future frames. Adjacent windows overlap by 25 frames and differ by only one frame, so the step size is 1 frame.
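A minimal sketch of this sliding-window scheme is shown below; padding the beginning and end of a sequence by repeating the boundary frames is our own assumption for illustration.

```python
import torch

PAST, FUTURE = 20, 5
WINDOW = PAST + 1 + FUTURE        # 26 frames per window

def sliding_windows(frames):
    """Yield one 26-frame window per time step (step size of 1 frame).

    frames: (T, 72) tensor of preprocessed inputs.
    """
    padded = torch.cat([frames[:1].repeat(PAST, 1),       # repeat the first frame 20 times
                        frames,
                        frames[-1:].repeat(FUTURE, 1)],    # repeat the last frame 5 times
                       dim=0)
    for t in range(frames.shape[0]):
        # 20 past frames, the current frame t, and 5 future frames
        yield padded[t : t + WINDOW]                       # shape (26, 72)
```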

The encoder in our model is the encoder part of the Transformer (Vaswani et al. 2017). It consists of three encoder layers, each composed of multi-head attention followed by a residual connection from the input to the output. The sum is then normalized using layer normalization. Finally, a fully connected block composed of two linear layers is employed, where the first layer converts the vector into a more meaningful representation and the second layer transforms that representation into the final output.

The detailed structure of the multi-head attention is illustrated in Fig. 3. We first replicate the latent vector three times and divide each copy into eight parts. Each part is then fed into a linear layer with 32 neurons before undergoing self-attention using scaled dot-product attention, as depicted in Fig. 3. Finally, we concatenate the outputs of the 8 attention heads into a 256-dimensional vector and use a linear layer to produce the result. As in (Vaswani et al. 2017), a two-layer feed-forward block with a ReLU activation processes the attention output, which allows us to capture complex relationships in the latent space.

Finally, a single linear layer remaps the vector from the latent space to the output dimensions. The output is represented in 6D as \({R}_{full}^{(6)}(t)=[{R}_{j}^{(6)}(t)]\in {\mathbb{R}}^{6(J-1)}\). Zhou et al. showed that the 6-dimensional rotation representation outperforms other rotation representations in continuity (Zhou et al. 2019).
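The PyTorch sketch below shows one possible realization of this architecture. The 72-D input, 256-D latent size, 3 encoder layers, 8 attention heads and 126-D output follow the description above; the feed-forward width, dropout and the exact form of the embedding are assumptions for illustration, not the released code.

```python
import torch.nn as nn

class DTPNet(nn.Module):
    """Embedding -> Transformer encoder -> linear decoder (a sketch under stated assumptions)."""

    def __init__(self, in_dim=72, d_model=256, n_layers=3, n_heads=8, out_dim=126):
        super().__init__()
        # two-layer fully connected embedding lifting the 72-D input to 256-D
        self.embed = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=512, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # single linear layer decoding to the 6D rotations of the 21 non-root joints
        self.decode = nn.Linear(d_model, out_dim)

    def forward(self, x):
        # x: (batch, window, 72) sliding-window input
        h = self.embed(x)       # (batch, window, 256)
        h = self.encoder(h)     # (batch, window, 256)
        return self.decode(h)   # (batch, window, 126)
```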

3.4 Postprocessing

The continuous 6D representation of the relative rotation is then transformed into rotation matrices \({R_{full}} \in SO\left( 3 \right) \subset {{\mathbb{R}}^{3 \times 3}}\) via Gram-Schmidt orthogonalization. Then, we transform the rotation matrix \({R}_{full}\left(t\right)\) into the world frame. The final global full-body pose is computed as follows:

$${R}_{full}^{g}={R}_{root}{R}_{full}$$
(3)

Combining the root position \({p}_{root}\), the root rotation matrix \({R}_{root}\) and the global full-body rotation matrices \({R}_{full}^{g}\), we obtain the output \(y=[{p}_{root},{R}_{root},{R}_{full}^{g}]\in {\mathbb{R}}^{3+9J}\). We have thus obtained all the parameters necessary to drive an SMPL model (Loper et al. 2015), using only the first 22 joints for visualization and excluding the motion of the hands. Finally, we use forward kinematics to recursively compute the pose of each joint of the kinematic tree in the global frame.
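A sketch of this postprocessing step is given below, assuming batched PyTorch tensors; the helper names are ours.

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(r6):
    """Gram-Schmidt orthonormalization of the 6D rotation representation
    (Zhou et al. 2019).  r6: (..., 6) -> rotation matrices (..., 3, 3)."""
    a1, a2 = r6[..., :3], r6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)           # completes a right-handed frame
    return torch.stack([b1, b2, b3], dim=-1)   # columns b1, b2, b3

def to_global(R_root, R_full):
    """Eq. (3): rotate the root-relative joint rotations into the world frame.
    R_root: (3, 3); R_full: (21, 3, 3)."""
    return torch.einsum('ij,njk->nik', R_root, R_full)
```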

3.5 Loss function

The loss function of the network is defined in terms of the L2 norm and consists of a local rotational loss \({\mathbb{L}}_{rot}\), an inverse kinematics loss \({\mathbb{L}}_{ik}\) and a foot velocity loss \({\mathbb{L}}_{ft}\):

$${\mathbb{L}}_{total}={\lambda }_{rot}{\mathbb{L}}_{rot}+{\lambda }_{ik}{\mathbb{L}}_{ik}+{\lambda }_{ft}{\mathbb{L}}_{ft}$$
(4)

\({{\mathbb{L}}_{rot}}\) is the L2 norm between the predicted local joint rotation in 6D representation and the ground truth:

$${\mathbb{L}}_{rot}={\Vert{y}^{\left(6\right)}\left(t\right)-{y}^{gt\left(6\right)}\left(t\right)\Vert}_{2}$$
(5)

\({{\mathbb{L}}_{ik}}\) is the L2 norm between the predicted end-effector positions and the ground truth:

$${\mathbb{L}}_{ik}={\Vert{p}_{end}\left(t\right)-{p}_{end}^{gt}\left(t\right)\Vert}_{2}$$
(6)

The foot sliding artifact significantly degrades the quality of the output motion. To address this issue, we introduce a foot velocity loss \({\mathbb{L}}_{ft}\) that guides the network to follow the ground-truth trajectory of the foot joints:

$${\mathbb{L}}_{ft}={\Vert{[p}_{ft}\left(t\right)-{p}_{ft}\left(t-1\right)]-[{p}_{ft}^{gt}\left(t\right)-{p}_{ft}^{gt}\left(t-1\right)]\Vert}_{2}$$
(7)

where \({p}_{ft}\) is the predicted position of the foot and \({p}_{ft}^{gt}\) is the ground truth position of the foot.

We recommend assigning a higher weight to the rotation loss, a moderate weight to the end-effector position loss, and a small weight to the foot velocity loss. The weights should be adjusted for the specific application scenario and task requirements, optimizing model performance through experimentation and analysis. In this paper, we set \({\lambda }_{rot}=1\), \({\lambda }_{ik}=0.5\) and \({\lambda }_{ft}=0.01\).
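A compact sketch of the combined loss in Eqs. (4)-(7) is given below; averaging over the batch and the exact tensor shapes are our assumptions.

```python
import torch

def dtp_loss(y6, y6_gt, p_end, p_end_gt, p_ft, p_ft_gt,
             w_rot=1.0, w_ik=0.5, w_ft=0.01):
    """Weighted sum of the rotational, inverse kinematics and foot velocity losses.

    y6, y6_gt:       predicted / ground-truth 6D joint rotations
    p_end, p_end_gt: predicted / ground-truth end-effector positions
    p_ft, p_ft_gt:   predicted / ground-truth foot positions over time, (T, ..., 3)
    """
    l_rot = torch.norm(y6 - y6_gt, dim=-1).mean()        # Eq. (5)
    l_ik = torch.norm(p_end - p_end_gt, dim=-1).mean()   # Eq. (6)
    # Eq. (7): match the frame-to-frame foot displacements (velocities)
    v, v_gt = p_ft[1:] - p_ft[:-1], p_ft_gt[1:] - p_ft_gt[:-1]
    l_ft = torch.norm(v - v_gt, dim=-1).mean()
    return w_rot * l_rot + w_ik * l_ik + w_ft * l_ft     # Eq. (4)
```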

3.6 Fine-tuning

Self-occlusion of the VIVE Tracker is a common issue in computer vision settings. Occlusion occurs when the VIVE Trackers on the feet are partially or completely blocked from the view of the infrared (IR) base stations by the body of the user or by objects such as a chair, making it difficult to track the VIVE Tracker accurately. To address this issue, researchers have proposed various methods for dealing with occlusion, such as using multiple trackers, employing robust feature descriptors, and utilizing temporal information (He et al. 2020, 2022).

In our experiments, we observed that when the sensor is completely occluded, it outputs a fixed value (a position at the origin and an identity rotation), and when it is only slightly occluded, it outputs random positions and rotations. To account for these occlusion scenarios, we propose using random Gaussian noise to simulate occluded data, with the mean of the positional noise at the origin and the mean of the rotational noise equal to the identity rotation.

To generate a synthetic occluded dataset, we first randomly select a certain percentage of frames as occlusion frames. In these frames, we randomly occlude one or both feet. The positions and rotations of the occluded data are represented as two random noise vectors \({v}_{occp}\) and \({v}_{occR}\), modeled as Gaussian with mean \(\mu\) and a fixed diagonal covariance matrix \(\Sigma\). The vectors are sampled from the Gaussian distributions:

$${v}_{occp}\sim N({\mu }_{p},{\sum }_{p})$$
(8)
$${v}_{occR}\sim N({\mu }_{R},{\sum }_{R})$$
(9)

where \({\sum _p}=diag(0.04)\) and \({\sum _R}=diag(0.01)\). The rotation vector is converted into a rotation matrix using Rodrigues' formula (Murray et al. 2017). The occluded position vector is restricted to a spherical range of radius L around the root, where L is the leg length:

$${p_{occ}}=\hbox{min} (L,\left| {{v_{occp}} - {p_{root}}} \right|)\frac{{{v_{occp}} - {p_{root}}}}{{\left| {{v_{occp}} - {p_{root}}} \right|}}$$
(10)

The constraint serves a dual purpose. On one hand, it prevents the position data from becoming excessively large, which could adversely impact the training of the neural network. On the other hand, in practical scenarios, the maximum displacement of the foot relative to the root joint typically does not exceed the combined lengths of the thigh and shin. This constraint ensures that the generated synthetic data maintains a realistic and physiologically plausible representation of human motion.
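The sketch below shows how a single occluded foot measurement can be synthesized following Eqs. (8)-(10); SciPy's rotation utilities stand in for Rodrigues' formula, and the function and parameter names are our own assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def synthesize_occlusion(p_root, leg_length, var_p=0.04, var_r=0.01, rng=None):
    """Return a noised position and rotation matrix simulating an occluded foot tracker."""
    if rng is None:
        rng = np.random.default_rng()
    # Eq. (8): positional noise with mean at the origin, variance 0.04 per axis
    v_p = rng.normal(0.0, np.sqrt(var_p), size=3)
    # Eq. (9): rotation-vector noise with mean at the identity, variance 0.01 per axis
    v_r = rng.normal(0.0, np.sqrt(var_r), size=3)
    R_occ = Rotation.from_rotvec(v_r).as_matrix()          # Rodrigues' formula
    # Eq. (10): clamp the noised position to a sphere of radius L around the root
    d = v_p - p_root
    dist = np.linalg.norm(d)
    p_occ = min(leg_length, dist) * d / (dist + 1e-8)
    return p_occ, R_occ
```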

3.7 Implementation details

We synthesize training data from AMASS (Mahmood et al. 2019). The raw motion sequences are used to determine the parameters of an SMPL + H model, and we generate a tracker dataset by placing virtual trackers on the corresponding body parts of this model, using only the first 22 joints and ignoring hand gestures.

Because the motion databases within AMASS use diverse frame rates, we resample the original motion sequences to a consistent frame rate. For a real-time application, and especially for VR, the delay between frames should not exceed 20 ms (Raaen 2015). Our system utilizes the Unity3D engine and the SteamVR platform. To ensure that users experience no dizziness or discomfort, the system maintains a consistent frame rate of 60 fps, i.e. a frame delay of approximately 16.67 ms, and we assume that the system load, including rendering and simulation, can meet this frame rate. We therefore resample the dataset to 60 fps. For motion sequences that are not at 60 fps, we use interpolation to obtain the closest values: linear interpolation for the root joint position and spherical linear interpolation for the joint angles after converting them to quaternions; the interpolated quaternions are then converted back to joint angles.
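A sketch of this resampling step is shown below, assuming each sequence comes with per-frame timestamps and axis-angle joint rotations; the function and argument names are ours.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def resample_to_60fps(times, root_pos, joint_rotvecs, target_fps=60):
    """Linear interpolation for the root position, slerp for the joint rotations.

    times: (T,) timestamps in seconds; root_pos: (T, 3);
    joint_rotvecs: (T, J, 3) axis-angle joint rotations.
    """
    new_times = np.arange(times[0], times[-1], 1.0 / target_fps)
    # linear interpolation of the root position, one axis at a time
    new_root = np.stack([np.interp(new_times, times, root_pos[:, k])
                         for k in range(3)], axis=1)
    # spherical linear interpolation of each joint after converting to quaternions
    T, J, _ = joint_rotvecs.shape
    new_rotvecs = np.empty((len(new_times), J, 3))
    for j in range(J):
        slerp = Slerp(times, Rotation.from_rotvec(joint_rotvecs[:, j]))
        new_rotvecs[:, j] = slerp(new_times).as_rotvec()
    return new_times, new_root, new_rotvecs
```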

We feed the network with a fixed window size of 300 frames and discard motion sequences shorter than 300 frames, because they mostly represent static poses or excessively fast motion. We utilize the official dataset splits provided by AMASS (Mahmood et al. 2019) for training, validation and testing. Specifically, the training set includes data from 'CMU', 'MPI_Limits', 'Eyes_Japan_Dataset', 'KIT', 'BML', 'EKUT' and 'TCD_handMocap'; the validation set comprises data from 'HumanEva', 'MPI_HDM05', 'SFU' and 'MPI_mosh'; and the testing set consists of data from 'TotalCapture' (Trumble et al. 2017). As a result, approximately 80% of the data is allocated for training, 10% for validation and 10% for testing.

We implement our network with PyTorch 1.9.1 and CUDA 11.1. The computer used in this study has an Intel Core i5-8400 CPU and an NVIDIA GeForce RTX 3090 GPU. For the live demo, we use an HTC VIVE headset and five VIVE Trackers, and the software used to obtain the tracker data is based on Unity3D. We use the official Adam optimizer of PyTorch with a learning rate of \(lr={10}^{-4}\), which decays by a factor of 0.5 every \(1\times{10}^{4}\) iterations. We train the network with early stopping: training stops once the validation loss has not decreased for a certain number of epochs (20 in this paper). We first train the network on the raw AMASS dataset, which takes about ten hours, and then fine-tune it on the occlusion dataset, which takes another six hours.
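The training schedule can be sketched as follows; DTPNet, train_one_epoch and evaluate are hypothetical placeholders, and the maximum number of epochs is an assumption for illustration.

```python
import torch

model = DTPNet()                                   # hypothetical network class
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# halve the learning rate every 10,000 iterations (scheduler.step() called once per iteration)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.5)

best_val, patience, bad_epochs = float('inf'), 20, 0
for epoch in range(500):                           # upper bound chosen arbitrarily
    train_one_epoch(model, optimizer, scheduler)   # hypothetical helper
    val_loss = evaluate(model)                     # hypothetical helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), 'dtp_best.pt')
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # early stopping after 20 stagnant epochs
            break
```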

4 Ablation study

In this section, we report the quantitative and qualitative evaluation of our method through testing on the dataset and the simulated occluded dataset.

We define reconstruction accuracy in terms of the following metrics: positional error, the mean and standard deviation of the Euclidean distance (in centimeters) between all estimated joints and the ground truth with the position and rotation of the root joint aligned; rotational error, the mean global rotational error (in degrees) over all estimated joints including the root and end effectors; mesh error, the mean Euclidean distance between the predicted and true vertex positions of the SMPL model; and jitter error, the mean jerk of all joints. Jerk is the third derivative of joint position with respect to time and reflects the smoothness and naturalness of the motion (Flash and Hogan 1985). In addition, the computational cost is defined as the average computation time per frame of pose estimation in online testing, without considering other costs such as rendering; in offline testing, it is defined as the average time of one forward propagation of the neural network.
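For reference, the jitter metric can be computed as a finite-difference approximation of the jerk; the sketch below assumes 60 fps joint trajectories in metres, and the function name is ours.

```python
import numpy as np

def jitter_error(joint_pos, fps=60):
    """Mean jerk over all joints and frames, in m/s^3.

    joint_pos: (T, J, 3) joint positions in metres sampled at `fps`.
    """
    dt = 1.0 / fps
    jerk = np.diff(joint_pos, n=3, axis=0) / dt**3   # third finite difference of position
    return np.linalg.norm(jerk, axis=-1).mean()
```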

All deep learning models discussed in Sect. 4.1 and 4.2 were trained on the same AMASS training dataset, while Sect. 4.3 primarily focuses on the performance of models fine-tuned on the occluded dataset, AMASS-OCC.

4.1 Comparisons of DTP with DTP-RNN

RNNs and Transformer models have both been successfully applied to many pose estimation tasks, but RNN models (Huang et al. 2018; Tong et al. 2020; Yang et al. 2021; Yi et al. 2021) are more commonly used for pose estimation from sparse joint data. To determine whether the Transformer model contributes to the accuracy of pose estimation in DTP, we build the deep learning model DTP-RNN (Fig. 4) by replacing the Transformer encoder with a two-layer bidirectional LSTM. We then compare DTP with DTP-RNN in terms of the accuracy metrics and the average computational time through online and offline testing on the dataset.

Fig. 4 DTP-RNN model composed of two layers of bidirectional LSTMs

The online and offline comparisons on the dataset are reported in terms of the mean and standard deviation of the positional error, rotational error, mesh error, jitter error and average computational cost. In offline testing, we feed the whole motion sequence into the network at once; in online testing, a short temporal window of 26 frames is fed into the network in a sliding manner.

As shown in Table 1, DTP always has lower errors than DTP-RNN in both the offline and online settings. Furthermore, the errors of both DTP and DTP-RNN increase from the offline setting to the online setting. The computational cost of DTP (3.02 ms) is far lower than that of DTP-RNN (12.74 ms) in offline testing, whereas the online computational cost of DTP (2.36 ms) is slightly higher than that of DTP-RNN (1.04 ms). In addition, compared with offline testing, both models show a decrease in computational cost in online testing: the cost of DTP decreases by approximately 21.85%, while that of DTP-RNN decreases by over 10 ms, approximately 91.83%. In summary, the average computational time of both models decreases from offline to online testing, but the decrease of DTP-RNN is much greater than that of DTP.

Table 1 Comparisons of DTP with the DTP-RNN model in offline and online testing. We compare DTP and DTP-RNN through offline and online testing on the dataset and report the mean and standard deviation of the positional error, rotational error, mesh error, jitter error and computational time

4.2 Comparisons of DTP with other methods

To further evaluate the performance of DTP, we compare it with other existing methods in terms of accuracy and average computational cost by testing on the dataset in the online setting. We select Final-IK, PE-DLS and TransPose as the methods to compare against. Final-IK is the most representative commercial algorithm and is widely used to reconstruct full-body pose in VR applications. PE-DLS (Zeng et al. 2022), from our previous work, is a hybrid analytical-numerical method focusing on reconstructing full-body motion from six VR measurements. TransPose (Yi et al. 2021) is considered the state-of-the-art algorithm for pose estimation from six IMUs.

The official versions of Final-IK and PE-DLS are used for testing, with the default parameter values of Final-IK. As for TransPose, we revised the original TransPose to accept VR sensor measurements for consistency; we refer to this variant as TransPose-T. It should be noted that Final-IK and PE-DLS run in Unity3D and are implemented in C#, whereas TransPose-T and DTP are implemented in Python. We first collect the estimated results of each method with the same inputs and outputs, and then visualize them on the same SMPL model to obtain the results.

Table 2 Comparisons with other methods. We test different methods, including Final-IK, PE-DLS, TransPose-T and our DTP, on full-body estimation on the TotalCapture dataset and report the mean and standard deviation of the positional error, rotational error, mesh error, jitter error and computational time

DTP always produces the minimum mean and standard deviation of the positional, rotational, mesh and jitter errors, while those of Final-IK are always the maximum. In addition, the rotational error of TransPose-T (7.36°) is smaller than that of PE-DLS (13.49°), but its positional error (2.71 cm) and mesh error (3.82 cm) are larger than the positional error (1.85 cm) and mesh error (1.76 cm) of PE-DLS. Furthermore, the two learning-based methods (TransPose-T and DTP) have smaller rotational and jitter errors than the two IK-based methods (Final-IK and PE-DLS). The mean positional error of DTP (1.04 cm) is 61.62% lower than that of TransPose-T (2.71 cm), the mean rotational error of DTP (4.22°) is 42.66% lower than that of TransPose-T (7.36°) and the mean mesh error of DTP (1.54 cm) is 59.69% lower than that of TransPose-T (3.82 cm). In short, the accuracy of the full-body pose estimated by DTP outperforms the other methods and is more than 40% better than that of TransPose-T in terms of mean rotational and mesh errors. However, the mean jitter error (\(2.31\times {10}^{2}\) m/s\(^{3}\)) is only 20.62% lower than that of TransPose-T (\(2.91\times {10}^{2}\) m/s\(^{3}\)). Although Final-IK is the fastest, it always has the largest error in all the accuracy metrics.

Regarding computational cost, Final-IK is the fastest (0.01 ms), followed by TransPose-T (1.01 ms) and PE-DLS (1.30 ms), with DTP showing the highest cost at 2.49 ms; PE-DLS exhibits the largest deviation in computational cost. Fortunately, DTP's computational cost is still sufficient for real-time VR applications. Final-IK has the lowest cost because it relies on simple vector operations, whereas PE-DLS shows a large standard deviation due to its iterative nature. DTP incurs the highest computational cost because it has more model parameters than TransPose-T. However, being based on the Transformer model, DTP benefits from high parallelism, which effectively reduces the cost through parallel computation. Additionally, DTP's computational cost is not significantly affected by sequence length, indicating strong scalability. Despite its higher computational cost, DTP's parallelism and scalability render it suitable for real-time VR systems, keeping the computational cost below the required threshold.

Furthermore, our reconstruction errors are slightly higher than those reported in the original studies. Fortunately, these discrepancies are within one order of magnitude and are likely attributable to differences in the datasets. Notably, our computation time is generally lower than that reported in the original works, likely due to differences in computing hardware. As shown in Table 2, the accuracy of the DTP algorithm is always better than that of the other methods, and both quantitative and qualitative experiments show that DTP is more accurate than the other algorithms. We attribute this to the accurate capture of temporal features by the Transformer model, whose 8-head attention mechanism learns rich feature representations in the respective subspaces.

Comparing DTP and TransPose-T, we find that although TransPose-T can also generate natural poses, its accuracy is lower than that of DTP, which easily leads to problems such as penetration (Fig. 5). The main reason is that TransPose-T uses an RNN to learn human motion models from large-scale databases; due to its inherently sequential nature, information from distant frames is easily discarded. Because motion sequences are highly continuous, an RNN can capture the close dependencies between adjacent frames; however, due to the sparsity of the joint data in our task, it is difficult to estimate the full-body pose accurately from these close relations alone. Compared with TransPose-T, our DTP model can capture dependencies between frames that are far apart, thereby compensating for the lack of joint information. Furthermore, the total number of parameters of the DTP model is far greater than that of TransPose-T, giving it stronger modeling ability. These two points are the main reasons why DTP is superior to TransPose-T.

Comparing the two deep-learning-based methods with the two IK-based methods, we find that the rotational and jitter errors of the deep learning methods are smaller. This is because the loss functions of the two deep learning methods are based on joint rotations, and both TransPose-T and our DTP choose the 6D rotation representation as the output instead of other common representations (quaternion, Euler angles or rotation matrix), which has been proven to provide better continuity in deep learning training (Zhou et al. 2019). However, the positional and mesh errors of TransPose-T are slightly greater than those of PE-DLS. This is because Pose-S1 and Pose-S2 are trained separately and then combined: although Pose-S1 estimates the full-body joint positions, the final output comes from Pose-S2, so errors may be introduced when the information passes through Pose-S2, increasing the positional and mesh errors. The positional error of PE-DLS is smaller than that of TransPose-T, but its rotational error is larger, indicating a larger twisting-angle error of the joints around the polar axis. This is because PE-DLS is a hybrid analytical-numerical IK method whose main objective is to track the positions of the end-effector joints; the joint positions are strongly affected by the swing angles, while the twisting angles do not contribute to the joint positions. Therefore, the twisting angle can only be optimized through heuristic rules and joint constraints, resulting in a large error. To improve the accuracy of the joint positions, we add the inverse kinematics loss \({\mathbb{L}}_{ik}\), so as to optimize the joint positions and joint rotations simultaneously, leading to lower positional and mesh errors.

Overall, the quantitative experimental results show that data-driven methods based on deep learning can generate more accurate and smoother motions than IK methods. In addition, the accuracy of DTP is better than that of the other competing methods. Although its model complexity leads to the highest computational cost, it is sufficient to meet the real-time requirements of VR systems.

Fig. 5 Examples of motion clips. We perform full-body pose estimation with different methods on the TotalCapture dataset. The first column is the ground truth (GT), and the other four columns are the motion clips estimated by Final-IK, PE-DLS, TransPose-T and DTP. The motions from top to bottom are random walking (RW), kick (KK), hold head (HH), freestyle (FS) and KangFu (KF)

To highlight our findings and visually compare the estimated results of the different methods, some motion clips are shown in Fig. 5. DTP always produces the poses most similar to the ground truth. Although TransPose-T also produces natural poses, penetration can be observed in HH and FS, and visual errors can also be observed in RW and KK. PE-DLS can track the root and end effectors well, but it fails to reconstruct the elbows (in HH) and knees (in FS) in some cases. Similar to PE-DLS, Final-IK tracks the end effectors well, but it can produce strange and unnatural poses (in KF): there are visible positional errors at the elbows and knees, and the shoulder poses are unnatural and look like a 'shrug'. Furthermore, both data-driven methods produce natural poses (the fourth and fifth columns in Fig. 5), and DTP reconstructs the full-body poses more accurately; TransPose-T does not estimate the limbs well, with obvious errors of the left arm in RW and the right leg in KK. Additionally, we observe more jitter in Final-IK and PE-DLS than in TransPose-T and DTP. In short, DTP always generates natural motion similar to the ground truth.

Through visual observation (Fig. 5), we find that DTP always generates the poses most similar to the ground truth while the other methods do not, which indicates that DTP outperforms them in accuracy. Compared with Final-IK and PE-DLS, which use analytical and numerical methods to solve the IK problem, our DTP directly utilizes large-scale motion capture databases to learn a pose estimation model. In theory, human movements contain redundant information, and it is feasible to reconstruct the motion from sparse input (Troje 2002; Safonova et al. 2004; Tong et al. 2020). However, it is not easy to estimate the human pose because the problem is under-constrained. DTP addresses this by using a large Transformer encoder to enhance modeling capability, capturing more complex patterns and relationships in the data with its 5.38 M parameters and leading to better pose estimation performance. Human motion is complex and influenced by many factors, including biomechanics (Khatib et al. 2004), physiology (Soechting and Flanders 1989a, b) and psychology (Jung and Choe 1996). IK methods are limited in that they can only use high-level information such as joint constraints to optimize the results. While joint constraints are an important factor in determining human motion, many other factors affect the way people move, such as comfort, individual differences and environmental factors, which can be difficult to model mathematically. Therefore, DTP uses deep learning to learn these constraints.

We also find that the IK methods generate more jitter than the deep learning methods, as shown in Table 2. IK methods can produce discontinuous solutions, which may result in more jitter because they are often based on optimizing joint angles to reach desired end-effector positions without taking into account the smoothness or naturalness of the resulting motion; they also appear susceptible to shaking as a result of the joint limits. In contrast, deep learning methods can learn to generate smooth and natural motion by capturing the underlying patterns and structure of human motion data. Therefore, our model can generate natural and smooth motions without enforcing any joint limits, as the joint limits have already been implicitly modelled through training on the motion priors. These patterns and structures can be captured by analyzing large amounts of motion capture data and identifying common movements, poses and transitions between motions. For example, human motion data may exhibit certain symmetries, such as the fact that the left and right sides of the body often move in similar ways, and certain temporal patterns, such as the fact that certain motions tend to follow others in a predictable sequence. By analyzing these patterns and structures, deep learning models can learn to generate accurate motion sequences that are consistent with the underlying structure of human motion, resulting in more realistic and visually appealing animations. In conclusion, the discontinuity of IK solutions leads to greater jitter in the generated motion, whereas deep learning solutions have better continuity and generate smoother motion.

Furthermore, although TransPose-T is also a data-driven method like our DTP, we observe visual errors in the left arm and right leg (in RW and KK in Fig. 5) that are not observed with DTP. We attribute this superiority to the inverse kinematics loss of DTP: TransPose-T focuses on rotation prediction without considering the positions of the joints, whereas DTP diverts some attention to the positions of the end effectors, which are important for interaction in VR applications.

In the supplementary videos, we observe that the arm poses of PE-DLS are not very accurate and are prone to distortion. This is attributed to the inherent limitations of IK algorithms in ensuring precise intermediate-joint accuracy, consistent with the findings in (Zeng et al. 2022). Additionally, TransPose-T exhibits noticeable foot jitter, indicating an apparent "foot skating" issue. The absence of this issue in our method is attributed to the incorporation of the inverse kinematics loss \({\mathbb{L}}_{ik}\) and the foot velocity loss \({\mathbb{L}}_{ft}\), which greatly mitigate such problems.

In general, our DTP can always estimate full-body poses well by exploiting motion priors and the temporal context of the motion. Furthermore, DTP has better continuity than the other methods and generates smoother motion.

4.3 Comparisons of DTP with and without fine-tuning

Feet occlusion is a common issue in real-time full-body motion capture because of external environmental factors such as chairs. We first perform online testing on the occluded dataset AMASS-OCC and find that the error of DTP on occluded data increases significantly. To further improve the robustness of DTP towards occlusion noise, we simulate occlusion with random noise, generate AMASS-OCC and fine-tune DTP on it. Finally, we compare DTP with and without fine-tuning on the testing set of AMASS-OCC in terms of the positional error, rotational error, mesh error and jitter error. We refer to the fine-tuned model as DTP1.

Table 3 Comparisons of the models with and without fine-tuning. We compare DTP with DTP1 by testing on the occluded dataset and report the mean and standard deviation of the positional error, rotational error, mesh error and jitter error

Compared with the online testing results on the AMASS-VR dataset (Table 1), the original DTP and DTP-RNN models show increased positional, rotational and mesh errors and severe jitter error in online testing on the occluded dataset AMASS-OCC (Table 3). Compared with the original models (DTP-RNN and DTP), the models fine-tuned on the training set of AMASS-OCC (DTP-RNN1 and DTP1) show decreased positional, rotational and mesh errors in online testing on AMASS-OCC, and also achieve lower jitter error. Although DTP shows the largest errors on the AMASS-OCC dataset, the performance of DTP1 after fine-tuning is significantly improved and reaches the best performance.

Fig. 6 Examples of different feet occlusion cases. We compare different models on the testing data of AMASS-OCC under different feet occlusion cases. From top to bottom: RF (right foot occlusion), LF (left foot occlusion) and DF (double feet occlusion)

Comparing the results of DTP and DTP1, we find that the positional error of DTP1 (1.55 cm) is 69.30% lower than that of DTP (5.05 cm), the rotational error of DTP1 (7.60°) is 43.45% lower than that of DTP (13.44°) and the mesh error of DTP1 (2.4 cm) is 60.00% lower than that of DTP (6.00 cm). Comparing the results of DTP-RNN and DTP-RNN1, the positional error of DTP-RNN1 (3.19 cm) is lower than that of DTP-RNN (3.86 cm) and the rotational error of DTP-RNN1 (9.16°) is lower than that of DTP-RNN (10.22°), while the mesh errors are 4.86 cm for DTP-RNN1 and 4.32 cm for DTP-RNN.

To visually compare the performance of DTP with and without fine-tuning, we show some clips of the testing results under different feet occlusion settings (Fig. 6). As shown in Fig. 6, although the original DTP is unable to accurately estimate the full-body pose from occluded data, DTP1, optimized by fine-tuning, can accurately estimate the full-body pose in all three occlusion scenarios. Both DTP-RNN and DTP-RNN1 can estimate relatively accurate poses when only the left or right foot is occluded. In addition, foot occlusion not only leads to errors in the estimated legs, but also causes deviations in the upper-body pose. DTP-RNN and DTP-RNN1 can also generate natural poses, but their errors relative to the ground truth are significantly larger than those of DTP1, and we observe penetration in the results of DTP-RNN1 in DF.

As shown in Table 3, the mean and standard deviation of the positional, rotational, mesh and jitter errors of the original DTP and DTP-RNN increase on the AMASS-OCC test set. This indicates that local occlusion can lead to a significant decrease in pose estimation performance and generate unnatural poses. To improve the stability of DTP under feet occlusion, we fine-tune DTP and DTP-RNN on the training set of AMASS-OCC. The accuracy of the optimized model DTP1 is more than 40% better than that of the original DTP model. This indicates that fine-tuning the DTP model with local occlusion data helps the model better adapt to occluded data and improves its ability to recognize feet occlusion, thereby achieving better performance. We find that the original DTP model cannot accurately estimate the pose when both feet are occluded, resulting in awkward poses. This indicates that the DTP model before fine-tuning has poor stability against local occlusion noise and can hardly estimate the posture of the lower body when the input data of both feet are noisy; this limitation prevents it from effectively capturing the spatial and temporal characteristics of the occluded data and thus from estimating the pose accurately. Therefore, we further improve and optimize it through fine-tuning. The experimental results (Fig. 6) show that the optimized DTP1 maintains stable full-body pose estimation even when both feet are occluded, which means that the fine-tuned DTP1 can recognize noisy data and capture the correct temporal and spatial dependencies, thereby estimating the full-body pose accurately.

In conclusion, fine-tuning can significantly improve the stability of the DTP model on occluded data, and the performance of the fine-tuned DTP is higher than that of DTP-RNN. However, DTP performs slightly worse than DTP-RNN without fine-tuning, which indicates that more training data and a longer training time are needed to achieve higher performance.

5 Results

5.1 Live demos

To evaluate performance on real tracker data instead of a synthetic dataset, we implemented a real-time MoCap system that obtains root and leaf joint data from the HTC VIVE headset and five VIVE Trackers and estimates the full-body pose with DTP. We also present some clips in Fig. 7, and the results can be seen in the supplementary video.

Fig. 7 Examples of real-time motion capture with our system

We use DTP to estimate the full-body pose from the six VR sensors and achieve real-time performance at 90 fps. Furthermore, we can complete a more challenging task that involves interacting naturally with the virtual environment. The corresponding smooth and natural animations can also be seen in the supplementary video.

5.2 Superiority of transformer

To demonstrate the superiority of the Transformer in the DTP algorithm, we evaluate DTP and DTP-RNN in terms of accuracy and computational cost by testing on the dataset.

As shown in Table 1, the positional, rotational, mesh and jitter errors of the DTP model are smaller than those of the DTP-RNN model in both offline and online testing, which means that the Transformer model has stronger modeling ability than the RNN model and can learn more accurate human movement patterns from large-scale motion datasets. This is because the multi-head attention mechanism transforms the input data into multiple subspaces and captures richer information through them, which helps learn different behavior patterns and finally combines them into valuable features. It also leads to more parameters, which makes the Transformer more difficult to train. In the offline test, the model can access the entire motion sequence, whereas in the online test only a fixed 26-frame window is available. Therefore, the errors increase in the online test, but the increase is less than one unit of measurement.

In online testing, both the DTP and DTP-RNN models have a low time cost. However, in offline testing, the time cost of the DTP-RNN model increases sharply by an order of magnitude, while the computational cost of the DTP model remains basically stable, varying by less than 1 ms. This is because the RNN model must be computed step by step along the time axis and cannot be parallelized, so the cost of DTP-RNN in offline testing with a long sequence increases significantly. In contrast, the high degree of parallelism of the Transformer model makes it insensitive to the temporal length of the input data, and the cost of forward propagation differs little for input sequences of different lengths. Although the online computational cost of DTP is higher than that of DTP-RNN, it is sufficient to meet the real-time requirements of VR applications.

5.3 System evaluations

To evaluate the practicability of the DTP algorithm, we implement a real-time motion capture system based on an HTC VIVE headset and five VIVE Trackers. To evaluate the system performance intuitively, we provide motion clips from real-time motion capture experiments in VR.

As shown in Fig. 7, DTP not only generates natural and smooth movements such as walking, jumping and squatting, but also completes accurate human-object interaction in the interactive tasks, which indicates that DTP has strong generalization ability. We believe this error is small enough to meet the accuracy requirements of most VR applications. Although some visible errors can be observed, users cannot see their own real bodies in VR, so small errors are acceptable. In VR applications, the fluidity and naturalness of motion directly affect the user experience, so smooth motion capture is one of the key issues in VR application development. In addition, the interactive task experiments show that DTP generates accurate end-effector positions, which means that DTP can help construct low-cost motion capture systems for interactive VR environments such as rehabilitation training, simulation training and serious games.

5.4 Limitations and future work

The tests on AMASS-OCC show that the stability of the initial DTP model under foot occlusion is poor, which indicates that the DTP model has some limitations when dealing with occluded data; however, this does not mean that the model cannot be applied to such problems. Instead, it provides guidance for subsequent optimization and fine-tuning. We have significantly improved the performance by fine-tuning on the AMASS-OCC training data, but DTP does not completely solve the occlusion problem. With the emergence of large language models such as ChatGPT, we believe that architectures based on large Transformer models have significant potential.

Optimizing computational cost involves various strategies. In the future, we will explore simplified layer architectures or alternative layer types that maintain performance while reducing computational requirements (Touvron et al. 2021; Tang et al. 2024). In addition to exploring algorithmic optimizations, we will implement parallel processing techniques to distribute computations across multiple devices or cores at the hardware level, thereby enhancing the speed of inference (Pope et al. 2022).

In the future, we plan to explore fine-tuning based on large models (Du et al. 2022) to improve stability under occlusion. When exploring large models, however, we need to carefully manage the cost of model training and address issues such as reducing the model size through techniques like knowledge distillation. Furthermore, we have only made a brief exploration of the occlusion problem, and there are many further research directions, such as constructing a real local-occlusion dataset. We could also try other data augmentation techniques, such as random erasing, to improve the model's ability to handle occluded data; a minimal sketch of such an augmentation is given below. Multi-sensor fusion (Malleson et al. 2020), which integrates image sensors, is another popular research direction. As shown in Fig. 6, foot occlusion not only leads to errors in the leg estimates but also causes deviations in the upper-body pose. This observation aligns with the work of Yang et al. (2021), who estimated lower-body movements from upper-body actions. It implies that training upper-body and lower-body pose estimation separately to mitigate occlusion issues, especially foot occlusion, is a promising avenue for research.
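The random-erasing idea mentioned above could look roughly like the following; the (T, input_dim) layout and the foot-feature indices are assumptions for illustration, not the paper's actual pre-processing.

```python
import torch

def random_erase_feet(x, feet_dims, p=0.3, max_span=30):
    """Randomly zero out the foot-tracker features over a short span of frames,
    mimicking a temporary occlusion. x: (T, input_dim) tensor; feet_dims:
    indices of the foot-tracker features (dataset-specific assumption)."""
    x = x.clone()
    if torch.rand(()).item() < p:
        T = x.shape[0]
        span = int(torch.randint(1, max_span + 1, ()).item())
        start = int(torch.randint(0, max(1, T - span + 1), ()).item())
        x[start:start + span, feet_dims] = 0.0
    return x
```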

6 Conclusion

In this paper, we propose DTP, a deep-learning-based method that estimates the full-body pose from the measurements recorded by an HTC VIVE headset and five VIVE Trackers. The core idea of DTP is to learn human motion patterns and structures in multiple subspaces using the multi-head attention mechanism of the Transformer encoder. In order to learn from a sufficiently large database, we synthesized the VR dataset AMASS-VR from the large-scale motion dataset AMASS. By comparing the performance of DTP and DTP-RNN, we have shown that, in full-body pose estimation from sparse VR sensor data, the DTP model has stronger modeling ability than the DTP-RNN model and generates more accurate and natural poses. In addition, by comparison with other methods, we find that DTP outperforms the existing Final-IK, PE-DLS and TransPose. Although the computational cost of DTP is higher than that of the other methods, its high degree of parallelism allows a parallel implementation, so its computational cost (2.49 ms) meets the real-time requirements of VR systems.

To evaluate the robustness of the DTP algorithm to the foot-occlusion problem in VR systems, we propose an algorithm that simulates local occlusion using random noise and synthesize the foot-occlusion dataset AMASS-OCC from AMASS. Testing on the occluded dataset shows that the original DTP model has poor robustness to occlusion. To further improve the robustness to occlusion noise, we use AMASS-OCC to fine-tune the original DTP model. The results show that the accuracy of the fine-tuned DTP model on AMASS-OCC improves by more than 40%.

To evaluate the practical value of DTP, we construct a real-time full-body motion capture system based on the HTC VIVE headset and five VIVE Trackers. The system uses DTP to estimate the full-body pose and runs in real time at 90 fps. The results show that DTP accurately estimates the full-body pose and generates smooth and natural motion with accurate end-effector positions, which enables natural interactions in VR. The main contribution of DTP is therefore to provide a highly accurate real-time pose estimation method for VR environments.

Although DTP achieves the expected experimental results, it still has some limitations. For example, although we improve the stability under occlusion by fine-tuning, visible errors can still be observed in some cases. This does not mean that DTP cannot be applied to such problems, but rather that it needs to be optimized for specific problems. Future work can further improve stability under occlusion and train upper-body and lower-body pose estimation separately.