Fig. 1

The top row shows the input data in red, the bottom row shows the output of our model in green, and the ground truth is shown in white

Introduction

Human motion and psychology are interconnected, as movements reflect and express emotions, contribute to cognitive development, serve as nonverbal communication, and are used in therapeutic applications, highlighting the close relationship between the mind and the body.

Human motion can be represented as a sequence of 3D joints connected by line segments, see Fig. 1, or as a sequence of angles between the segments, which we describe in more detail in Section “Human Motion Parameterisation”.

Creating a realistic human motion animation as a production-quality skinned mesh animation is difficult. The skinned multi-person linear (SMPL) models [2,3,4,5] express the pose and shape of human bodies in a sparse manner. This is accomplished by representing the human as a skinned mesh, with blend shapes representing the shape of the human and the underlying skeleton of the skinned mesh representing the pose. Having chosen one representation of human motion, new samples can be generated using neural networks based on different kinds of input [6,7,8,9,10,11].

3D meshes of the human body are usually built around a skeleton for the purpose of animating the human motion. These skeletons are then either animated by hand or by using motion capture (MoCap) to capture real-life human motion as digital animations. Animating skeletons by hand is a time-consuming process and requires a skilled animator. Likewise, MoCap requires specialised equipment and often also requires an animator to clean up the recorded data. Animating humanoids thus consumes a lot of time and money for content creators.

Recently, several solutions allowing for MoCap from a single video camera have been published [12,13,14,15]. These are not widely used, which is likely because the quality is much lower than that of MoCap and handcrafted animations. They would require a significant cleanup pass by an animator to be of use, even for projects with relatively low animation quality requirements.

In this paper, we propose using recent advances in the prediction of human motion through neural networks to improve the quality of human motion and thereby bridge the gap between cheap and high-quality recording methods, see Fig. 1. The main novelty of this paper, however, is the proposal to use two inexpensive, low-quality sources of 3D human pose and to smooth them to achieve high-quality human motion sequences. In other words, the model is trained to clean up 3D human motion from two low-quality input recordings.

To conclude, the contributions of this work are the following.

  • We modified the short-term version of QuaterNet [16] by

    • using a long short-term memory (LSTM) network instead of a gated recurrent unit (GRU) network, and

    • redefining the loss function as the L1 distance between the predicted and ground truth quaternions.

  • We are the first to propose a model which receives two noisy 3D human motion sequences and performs de-noising, enabling a cost-efficient imaging solution, e.g. using webcams.

  • We show that the same architecture can be used for two separate tasks: (1) prediction and (2) smoothing of human motion.

This paper is organised as follows. First, the related work is described in Section “Related Work”. Then prior work on predicting human motion is replicated and extended to the task of motion smoothing in Section “Methods”. The results of the proposed models on both prediction and smoothing of human motion can then be seen in Section “Experiments”, followed by a section dedicated to limitations in “Limitations”. Finally, we conclude this paper with a thorough discussion and mention the proposed future work in Section “Conclusion”.

Related Work

Human Motion Prediction

Forecasting human motion is an important problem in computer vision. Beyond the creation of digital animations of human motion, it is central to applications like human–robot interaction, autonomous driving, and human tracking. The problem is challenging due to the high variability and complex nature of human motion. Traditional state-space methods, such as hidden Markov models [17] and Gaussian processes [18], have been shown to be suitable for predicting simple human motion; see Table 1 for an overview of the methods used for human motion prediction.

In the literature, see e.g. [16, 19], the prediction of sequences of 3D joint positions is commonly divided into short- and long-term prediction. Specifically, short-term refers to predictions limited to under 500 ms, while long-term tasks focus on motions which lie more than 0.5 s in the future and are hence often referred to as generation. Recently, deep neural networks, especially recurrent neural networks (RNNs), have made significant advances in predicting human motion over longer horizons [16, 20,21,22,23].

QuaterNet [16], as proposed by Pavllo et al., consists of a two-layer RNN predicting future human motion from past motion, using a forward kinematics (FK) loss. In that work, the rotations are represented by quaternions, as opposed to previous works, where Euler angles or exponential maps are frequently employed. This choice was motivated by the fact that Euler angles and axis-angle representations suffer from several problems: non-uniqueness, discontinuity in the representation space, and singularities, all of which are avoided by quaternions. QuaterNet also introduces a normalisation loss, as normalised quaternions are required to represent valid rotations. The FK loss is calculated by performing FK, i.e. computing the joint positions from the joint rotations using the pre-defined skeleton, and then taking the positional loss of the joints. The FK loss counteracts the positional error introduced on the outer limbs by rotational errors on the inner limbs, since the positional error of an outer limb accumulates the rotational errors of all parent limbs in the kinematic chain.
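To make this concrete, below is a minimal sketch of such an FK loss in PyTorch. It assumes unit quaternions in (w, x, y, z) order, a skeleton given by parent indices (each parent preceding its children) with fixed bone offsets, and the root held at the origin; all function and variable names are illustrative, not QuaterNet's actual code.

```python
# A minimal FK-loss sketch under the assumptions stated above;
# not QuaterNet's actual implementation.
import torch

def quat_mul(a, b):
    """Hamilton product of quaternions a, b of shape (..., 4), (w, x, y, z) order."""
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack((aw*bw - ax*bx - ay*by - az*bz,
                        aw*bx + ax*bw + ay*bz - az*by,
                        aw*by - ax*bz + ay*bw + az*bx,
                        aw*bz + ax*by - ay*bx + az*bw), dim=-1)

def quat_rotate(q, v):
    """Rotate vectors v (..., 3) by unit quaternions q (..., 4)."""
    t = 2.0 * torch.cross(q[..., 1:], v, dim=-1)
    return v + q[..., :1] * t + torch.cross(q[..., 1:], t, dim=-1)

def forward_kinematics(rot, offsets, parents):
    """rot: (T, J, 4) local joint rotations; offsets: (J, 3) bone offsets in the
    parent frame; parents[j] is the parent index of joint j, -1 for the root.
    Returns global joint positions of shape (T, J, 3)."""
    T, J, _ = rot.shape
    glob, pos = [None] * J, [None] * J
    for j in range(J):
        p = parents[j]
        if p < 0:
            glob[j] = rot[:, j]                   # root orientation
            pos[j] = torch.zeros(T, 3)            # root held at the origin
        else:
            glob[j] = quat_mul(glob[p], rot[:, j])
            pos[j] = pos[p] + quat_rotate(glob[p], offsets[j].expand(T, 3))
    return torch.stack(pos, dim=1)

def fk_loss(pred_rot, gt_rot, offsets, parents):
    """Positional loss between joint positions obtained by running FK."""
    return (forward_kinematics(pred_rot, offsets, parents)
            - forward_kinematics(gt_rot, offsets, parents)).abs().mean()
```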

Another branch of human motion prediction networks is based on graph representations of the human body and related graph computations. Li et al. [24] proposed a multi-scale graph representation of the human body and an encoder–decoder framework for motion prediction. An alternative, end-to-end multi-scale residual graph convolution network was proposed in [25]. Mao et al. [26] added motion attention to a graph convolutional network to capture the similarity between the current motion context and historical motion sub-sequences. To encode temporal information, trajectory space was applied in [27] instead of the traditionally used pose space.

Deep generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), have also been used for human motion prediction, with the special aim of facilitating prediction over long horizons [28]. As an example, a conditional variational autoencoder (CVAE) was used in [29] to generate a diverse set of samples of human postures from a pretrained deep generative model. Spatio-temporal motion inpainting was performed with a GAN-based prediction model in [30], and pedestrian trajectories were learnt with GANs in [31]. A further example of long-term human activity and location prediction was proposed in [32].

Human Motion Inpainting

Harvey et al. [33] showed that state-of-the-art motion prediction models cannot be easily converted into a robust transition generator, and proposed a model for human motion inpainting, i.e. a method that can fill in gaps of missing motion in a given motion sequence. It takes past motion and a target frame as input and then generates the frames in between using an RNN. To help the model maintain temporal coherency, a time-to-arrival embedding was added to the input frames.

To create realistic-looking and temporally coherent motion, an adversarial loss based on the least squares generative adversarial network (LSGAN) [34] was introduced.

Additionally, [33] uses a foot contact loss indicating whether a foot is touching the ground, thereby stabilising the feet as a post-processing step, which helps to combat a phenomenon commonly known as foot sliding. A foot contact loss can also be found in other recent work involving human motion, such as MotioNet [14].

Human Motion Smoothing

In the previously described work, the application focused on prediction or inpainting, which mostly relies on reliable estimates of human motion data as a starting point. However, raw motion data is often corrupted, i.e. the markers attached to the joints may be occluded, or lack precision, and hence yield noisy and jittery estimates or even miss data entirely. To overcome these issues, research has been conducted to smooth and denoise human motion data; see Table 2.

Regarding human motion smoothing methods, [35] used traditional filtering methods and [36] proposed Kalman filtering. While these are older works, we found that traditional methods are still used today, e.g. in [37] Bezier curves are used, and [38] achieves a significant noise reduction with a B-spline-based least squares approach on data from a Vicon motion capture system. Different network-based approaches have been employed to tackle the problem of corrupted data. In [39], an attention-based bidirectional recurrent neural network was proposed to denoise hand motion data. Similarly, in [40], an attention mechanism was embedded in a bidirectional LSTM (BLSTM), yielding a deep bidirectional attention network (BAN). However, current approaches for smoothing do not consider multiple input sources, which is a shortcoming we address in this paper. We propose using two corrupted input sequences of 3D human motion to retrieve a smoothed version of the recorded motion. Our motivation for this approach is that several solutions are currently available to retrieve estimates of 3D human motion sequences, e.g. [41], all of which come with some errors. We aim to take advantage of several of those corrupted 3D estimates to retrieve one high-quality sequence of human motion.

Table 1 Overview of related literature for human motion prediction
Table 2 Overview of related literature for human motion smoothing and de-noising

Methods

In this section, we introduce two models designed to perform two separate tasks. First, we describe a prediction model, which receives past frames of human motion to predict the next frame one step into the future.

Second, we adapt the prediction model so that it is tailored to the task of denoising human motion data, also known as human motion smoothing. Both models are built on an RNN architecture.

Human Motion Parameterisation

The human skeleton applied in this work is parameterised as follows. The joint locations are represented by 3D joint positions, where joints are connected to other joints by line segments. If the joint positions are estimated for each frame separately, the length of the segments may change between frames, which is a common problem [44]. However, assuming constant lengths of the line segments, the configuration of the human skeleton can be fully defined by the relative orientations of the line segments, where each orientation is described by a quaternion, i.e. a 3D rotation.

The use of quaternions avoids the gimbal lock problem present with Euler angles.

Prediction Model

The proposed prediction model is designed to use past frames of a human motion sequence to predict the deterministic motion in the next frame. The proposed model is based on the short-term version of QuaterNet [16], which we modified in two ways: first, motivated by results from Harvey et al. [33], we use a long short-term memory (LSTM) network instead of a gated recurrent unit (GRU) network. Secondly, we adapted the rotational loss. Instead of using the distance between Euler angles calculated from quaternions, we redefine it as the L1 distance between the predicted and ground truth quaternions as

$$\begin{aligned} L_{\textrm{prediction}} = \frac{1}{T} \sum _{t=1}^{T} \sum _{j=1}^{J} \left\| \widehat{{\textbf{q}}}_{j,t} - {\textbf{q}}_{j,t}\right\| _1, \end{aligned}$$
(1)

where T is the sequence length, J is the number of considered joint rotations of the skeleton, \(\widehat{{\textbf{q}}}_{j,t}\) is the joint rotation j of the predicted sequence at time step t, and \({{\textbf{q}}}_{j,t}\) the corresponding ground truth, both represented as quaternions. Hence, we combine the rotational error and the quaternion normalisation error by dropping the explicit term for normalisation used in [16].
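For illustration, Eq. 1 amounts to a few lines of PyTorch; the (T, J, 4) tensor layout for the predicted and ground truth quaternions is an assumption of this sketch.

```python
# A minimal sketch of the loss in Eq. 1; the tensor layout is assumed.
import torch

def prediction_loss(pred_q: torch.Tensor, gt_q: torch.Tensor) -> torch.Tensor:
    """pred_q, gt_q: (T, J, 4) quaternions. Computes the L1 distance summed
    over frames, joints, and quaternion components, averaged over T (Eq. 1).
    Deviations from unit norm are penalised implicitly, so no explicit
    normalisation term is needed."""
    return (pred_q - gt_q).abs().sum() / pred_q.shape[0]
```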

The prediction model is designed as an encoder–decoder LSTM, with a two-layer LSTM encoder with a hidden state of size 1000, followed by a feedforward neural network as the decoder. The decoder converts the hidden state to the target output of size 4J, see Fig. 2. The model receives the past 50 frames to predict the next frame. To train the model, this process is repeated ten times to generate the next ten frames from the previous model outputs.
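A minimal PyTorch sketch of this architecture could look as follows; the flattened 4J input layout, the layer names, and the single linear decoder layer are illustrative assumptions.

```python
# A sketch of the encoder-decoder LSTM under the stated assumptions;
# layer names and the decoder depth are illustrative.
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    def __init__(self, num_joints: int, hidden_size: int = 1000):
        super().__init__()
        self.encoder = nn.LSTM(input_size=4 * num_joints,
                               hidden_size=hidden_size,
                               num_layers=2, batch_first=True)
        self.decoder = nn.Linear(hidden_size, 4 * num_joints)

    def forward(self, past: torch.Tensor, horizon: int = 10) -> torch.Tensor:
        """past: (batch, 50, 4*J) past frames; returns (batch, horizon, 4*J)."""
        _, state = self.encoder(past)       # encode the 50 past frames
        frame = past[:, -1:]                # seed with the last observed frame
        outputs = []
        for _ in range(horizon):            # autoregressive roll-out
            out, state = self.encoder(frame, state)
            frame = self.decoder(out)       # next-frame quaternions, size 4*J
            outputs.append(frame)
        return torch.cat(outputs, dim=1)
```

At inference, the predicted quaternions would typically be re-normalised to unit length before being interpreted as rotations.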

Fig. 2

The architecture of the prediction model

Smoothing Model

The process of capturing 3D human motion is not exact and yields noisy estimates. To overcome this issue, we aim to provide one noise-free estimate from two noisy recordings of the same motion. For this purpose, to remove the stochastic part of the motion, a smoothing model is designed based on the previously introduced prediction model, i.e. an encoder–decoder LSTM, where the encoder part is defined as a two-layer LSTM with hidden state size of 1000, and the decoder as a feedforward neural network. This model uses the same loss function as the prediction model, as defined in Eq. 1.

In contrast to the prediction model, the smoothing model receives two concatenated frames per time step as input, thereby reflecting the real-world setting that two noisy input streams are provided, e.g. from different camera angles.

From each pair of noisy frames, the network estimates one corresponding noise-free frame. See Section “Data Generation for the Smoothing Model” for details on how the training data were generated for this model.
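The change relative to the prediction model is confined to the input layer, as the following sketch illustrates; the names are ours.

```python
# A minimal sketch of the smoothing model's input handling: two noisy views
# of the same motion are concatenated per frame, doubling the encoder's
# input size to 8*J, while the output remains one clean frame of size 4*J.
import torch

def make_smoothing_input(noisy_a: torch.Tensor, noisy_b: torch.Tensor) -> torch.Tensor:
    """noisy_a, noisy_b: (batch, T, 4*J) noisy recordings of one motion.
    Returns (batch, T, 8*J) frames for the encoder-decoder LSTM."""
    return torch.cat((noisy_a, noisy_b), dim=-1)
```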

Experiments

Datasets

In this work, we use two 3D human motion capture datasets: the CMU [45, 46] and the Human3.6M dataset [47, 48].

The CMU MoCap Dataset [45] consists of 2605 human motions of 106 subjects recorded in 3D, totalling 552 min of motion at varying frame rates and 3.5M frames. We use the skeleton model which is fully parameterised by 22 body segment orientations.

The CMU Mocap Dataset is one of the 15 optical marker-based MoCap datasets which have been represented in a common framework and parameterisation in the Archive of Motion Capture as Surface Shapes (AMASS) database [46]. Specifically, an SMPL-H [3] variant of the Skinned Multi-person Linear Model (SMPL) [2] was used.

The Human3.6M dataset [47, 48] contains over 3.6 million different human poses, recorded in 2D and 3D from seven subjects performing 210 different motions in 15 subcategories. This yields a total of 176 min recorded at 50 frames per second and \(0.5\textrm{M}\) frames. While the database also contains high-resolution 3D meshes, we focus on the sparse skeleton which consists of a total of 32 joints, and for which 3D joint positions and joint angles are provided.

Implementation Details

We trained the models as follows. We used Adam as the optimiser with a learning rate of 0.001, and gradient norms were clipped to 0.1. The training data were batched with a batch size of 64. The batches are drawn from the dataset by taking all possible combinations of 60 consecutive frames for each motion and are shuffled for each epoch. Additionally, we used teacher forcing to improve the prediction and decrease the training time. We trained the models until there was very little or no improvement, which took several days, close to a week.
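The stated settings translate into a training loop along these lines; the data loader and the place where teacher forcing is applied are assumptions of the sketch.

```python
# A minimal training-loop sketch: Adam with lr 0.001, gradient norms
# clipped to 0.1, batches of size 64 provided by `loader`. Teacher
# forcing is assumed to be handled inside the model's roll-out.
import torch

def train(model, loader, loss_fn, num_epochs: int) -> None:
    optimiser = torch.optim.Adam(model.parameters(), lr=0.001)
    for _ in range(num_epochs):
        for past, future in loader:          # 50 past + 10 future frames each
            optimiser.zero_grad()
            pred = model(past, horizon=future.shape[1])
            loss = loss_fn(pred, future)     # e.g. the loss from Eq. 1
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
            optimiser.step()
```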

Prediction Model

The prediction model proposed in Section “Prediction Model” was trained on the Human3.6M dataset; see Section “Datasets”. We split the dataset as in [16, 20, 33] by using all the motions from subject 5 as test data and the rest as training data. Additionally, since the frame rate differs between the motions, we resampled them to 25 fps by either discarding frames or interpolating new frames, i.e. by down- or upsampling the sequence, respectively. The prediction is performed by taking 60 frames from a motion and then splitting them into 50 past and 10 future frames.
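As an illustration of the resampling step, the following sketch converts a sequence to 25 fps. It assumes quaternion frames of shape (T, J, 4) whose neighbouring rotations lie in the same hemisphere, so that component-wise linear interpolation followed by renormalisation approximates spherical interpolation; the paper does not specify the exact interpolation used.

```python
# A resampling sketch under the stated assumptions; not the exact
# procedure used in the paper.
import numpy as np

def resample_to_25fps(frames: np.ndarray, src_fps: float) -> np.ndarray:
    """frames: (T, J, 4) quaternions recorded at src_fps."""
    T = frames.shape[0]
    new_T = int(round(T * 25.0 / src_fps))
    idx = np.linspace(0.0, T - 1, new_T)           # fractional source indices
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = (idx - lo)[:, None, None]
    out = (1.0 - w) * frames[lo] + w * frames[hi]  # lerp between neighbours
    return out / np.linalg.norm(out, axis=-1, keepdims=True)  # renormalise
```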

The results were evaluated using the mean absolute error between the Euler angles as

$$\begin{aligned} L_{\textrm{mae}} = \frac{1}{T} \sum _{t,j} \left\| \left( {\Phi }\left( \widehat{{\textbf{q}}}_{t,j}\right) - {\Phi }\big ({\textbf{q}}_{t,j}\big ) + \pi \right) \ \textbf{mod}\ 2\pi - \pi \right\| _1, \end{aligned}$$
(2)

where T is the sequence length, J is the number of joint rotations, \(\widehat{{\textbf{q}}}_{t,j}\) represents the predicted quaternion of the joint j in frame t, with corresponding ground truth \({{\textbf{q}}}_{t,j}\), and \({\Phi }\) is a function converting quaternions into Euler angles. In Table 3, we compare our results with other state-of-the-art methods, which shows that our proposed prediction model is on par with them.
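A sketch of this metric is given below; the use of scipy for the quaternion-to-Euler conversion and the (w, x, y, z) input ordering are assumptions, as the paper does not specify them.

```python
# A minimal sketch of the evaluation metric in Eq. 2.
import numpy as np
from scipy.spatial.transform import Rotation

def to_euler(q: np.ndarray) -> np.ndarray:
    """q: (T, J, 4) quaternions, (w, x, y, z) order -> (T, J, 3) Euler angles."""
    flat = q.reshape(-1, 4)[:, [1, 2, 3, 0]]       # scipy expects (x, y, z, w)
    return Rotation.from_quat(flat).as_euler("xyz").reshape(q.shape[:-1] + (3,))

def mae_euler(pred_q: np.ndarray, gt_q: np.ndarray) -> float:
    """Mean absolute error with angle differences wrapped into [-pi, pi)."""
    diff = to_euler(pred_q) - to_euler(gt_q)
    wrapped = np.mod(diff + np.pi, 2.0 * np.pi) - np.pi
    return float(np.abs(wrapped).sum() / pred_q.shape[0])
```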

Table 3 Comparison of the mean absolute error between Euler angles, as defined in Eq. 2, between our proposed prediction model and other state-of-the-art methods on the Human3.6M dataset

Data Generation for the Smoothing Model

To our knowledge, there is no dataset which offers noisy human motion capture data along with a non-noisy ground truth; therefore, we created our own data based on the CMU dataset (see Section “Datasets”) to train the smoothing model described in Section “Smoothing Model”. We employ the provided data as ground truth frames and create noisy input frames from them to train and evaluate our model. To this end, 60 frames from one motion are selected and then split into 50 past and 10 future frames. Thereafter, noise is added to all frames, i.e. input features, as

$$\begin{aligned} \tilde{{\textbf{q}}}_{m,t,j,a} = {\textbf{q}}_{m,t,j,a} + N, \end{aligned}$$
(3)

where m is the human motion, t is the frame, j is the joint, a is the axis, \({\textbf{q}}_{m,t,j,a}\) is the ground truth quaternion, \(\tilde{{\textbf{q}}}\) is the noisy quaternion, and N represents the noise.

When modelling the noise N, we take into account three different kinds of noise: systematic bias, imprecision, and lost tracking. The noise N is defined as a composition of those as

$$\begin{aligned} N = B + I + L, \end{aligned}$$
(4)

where B is the bias, I is imprecision noise, and L represents the noise from lost tracks.

It has been observed that joint rotations captured through webcam pose detection systems often have a constant bias, depending on the subject captured, which is represented as

$$\begin{aligned} B \sim {\mathcal {N}}\left( 0, \theta _{B}^2\right) , \end{aligned}$$
(5)
$$\begin{aligned} \theta _B \sim {\mathcal {N}}\left( \mu _B, \sigma _B^2\right) , \end{aligned}$$
(6)

where \({\mathcal {N}}\) denotes the normal distribution. The imprecision noise represents small differences from the ground truth that occur in the joint rotations captured through webcam pose detection models as

$$\begin{aligned} I \sim {\mathcal {N}}\left( 0, \theta _{I}^2\right) , \end{aligned}$$
(7)
$$\begin{aligned} \theta _{I} \sim {\mathcal {N}}\left( \mu _I, \sigma _I^2\right) . \end{aligned}$$
(8)

The lost tracking noise L represents that sometimes a joint is not recognised, giving completely arbitrary values for that joint rotation, defined as

$$\begin{aligned} L = L_1 L_2, \end{aligned}$$
(9)

where

$$\begin{aligned} L_1 \sim {\mathcal {B}}\left( p_L\right) \end{aligned}$$
(10)

models the probability that the model lost track of one frame, where \({\mathcal {B}}\) denotes the Bernoulli distribution, and

$$\begin{aligned} L_2 \sim {\mathcal {N}}(0, \theta _{L}^2), \end{aligned}$$
(11)
$$\begin{aligned} \theta _{L} \sim {\mathcal {N}}(\mu _L, \sigma _L^2) \end{aligned}$$
(12)

models the amount of noise which is applied if the frame suffers from lost tracking. In total, the noise model is thus defined by seven parameters: \(\mu _B\), \(\sigma _B\), \(\mu _I\), \(\sigma _I\), \(\mu _L\), \(\sigma _L\), and \(p_L\).
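The composite noise model can be sketched as follows. The per-motion sampling of each \(\theta\) follows the description in the next subsection, while the broadcasting of the bias over frames and the abs() guard on sampled standard deviations are our assumptions.

```python
# A sketch of the composite noise N = B + I + L (Eqs. 3-12). The bias B is
# constant across the frames of a motion, and each theta is drawn once per
# motion; abs() guards against negative draws for standard deviations,
# a practical choice not specified in the paper.
import numpy as np

def add_noise(q, mu_B, sigma_B, mu_I, sigma_I, mu_L, sigma_L, p_L, rng=None):
    """q: (T, J, 4) ground truth quaternions of one motion.
    Returns the noisy copy q_tilde = q + B + I + L (Eqs. 3 and 4)."""
    rng = rng or np.random.default_rng()
    T, J, A = q.shape
    theta_B = rng.normal(mu_B, sigma_B)                       # Eq. 6
    theta_I = rng.normal(mu_I, sigma_I)                       # Eq. 8
    theta_L = rng.normal(mu_L, sigma_L)                       # Eq. 12
    B = rng.normal(0.0, abs(theta_B), size=(1, J, A))         # bias, Eq. 5
    I = rng.normal(0.0, abs(theta_I), size=(T, J, A))         # imprecision, Eq. 7
    lost = rng.binomial(1, p_L, size=(T, 1, 1))               # lost frames, Eq. 10
    L = lost * rng.normal(0.0, abs(theta_L), size=(T, J, A))  # Eqs. 9 and 11
    return q + B + I + L
```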

Fig. 3

Visualisation of 15 selected frames of subject 6, trial 5 from the CMU dataset. The first row shows the ground truth poses, while the second and third row show the noisy frames generated from the GT, which are concatenated and fed to the model. The fourth row shows the resulting output from the smoothing model. Please especially note the results of the frames 47 and 61, which demonstrate that the model is robust against lost frames. A video of the results is available at https://i.imgur.com/gS3Pin8.mp4

Smoothing Model

The smoothing model, described in Section “Smoothing Model”, was trained and evaluated on the CMU dataset, see Section “Datasets”, with added noise according to Section “Data Generation for the Smoothing Model”. To load and manipulate the motions from the CMU dataset, we use Fairmotion [49]. Since one training step requires at least 60 frames, motions with fewer than 60 frames have been discarded. The data is split such that \(90\%\) of the motions are used for training, \(5\%\) for validation, and \(5\%\) for testing. For training of the smoothing model, the size of each input source is limited for batching purposes, but all frames are used for evaluation.

The training data were generated according to Section “Data Generation for the Smoothing Model”, with \(\mu _B = \mu _I = 0.005\), \(\sigma _B = \sigma _I = 0.002\), \(\mu _L = 1\), \(\sigma _L = 0.01\), and \(p_L = 0.01\). To ensure that each motion has a unique distribution of noise, each parameter \(\theta\) is only sampled once per motion, thereby ensuring that the model learns the general composite noise instead of one specific distribution. The same procedure and noise function are used to generate the test data.

Fig. 4

Comparison of the L1 distance between quaternions of the ground truth data (GT) and noisy input A (shown in blue), the GT and the noisy input B (shown in red), and the GT and our estimates (shown in green). Inputs A and B refer to the two concatenated views, which the smoothing model receives as input

Fig. 5

Illustration of the L1 distance between estimated quaternions to the ground truth (GT) values per motion of the test data generated from the CMU dataset. For convenience, the motion categories are ordered along the x-axis such that the average distance of the two noisy input sequences A and B to the GT increases

Figure 3 shows the GT, the generated noisy input, and the estimates from our model, and demonstrates that the model is able to recover the information lost in dropped frames. For a quantitative evaluation, we computed the L1 distance between the quaternions of the ground truth and the noisy inputs, and between the ground truth and the estimates, accordingly. The results are shown in Fig. 4 and illustrated per motion in Fig. 5. In both figures, it can be seen that the distances are lower for the smoothing model. To summarise, we found that our proposed smoothing model is able to reduce the noise in 3D human motion sequences to a large extent, thereby confirming that an LSTM-based model is suitable for this task.

Limitations

During our experiments, we found that training initially did not converge, which we overcame by hand-tuning the training parameters and trying out different activation functions. While the smoothing model successfully yields smoothed, i.e. denoised, 3D human motion sequences, we found that if we provide one non-corrupted sequence while the second input sequence is pure noise, the outcome will be a jittery human motion sequence.

Conclusion

In this work, we have proposed a novel approach to estimate the human motion by merging and enhancing data from two low-quality sources. As a building block, our work also proposed an LSTM-based prediction model for human motion which was demonstrated to be competitive with previous approaches in the field. The key advantage of our approach lies in its ability to enable low-cost imaging of human motion without the need for expensive hardware traditionally associated with motion capture.

To the best of our knowledge, no dataset currently exists for the cleaning, i.e. smoothing, of skinned human motion as produced by pose detection from webcam videos.

For training the smoothing network, the lack of a dedicated dataset is not problematic, since simulated training data can be used. However, evaluating the network presents challenges due to the susceptibility of neural networks to shortcut learning [50], where the network may learn unintended shortcuts instead of the desired generalised solution. For instance, a neural network trained to classify objects might incorrectly take the background into account, leading to mislabelling.

One potential approach to mitigate shortcut learning involves evaluating the network using data from a separate dataset that was not used for training purposes. Our evaluation data is related to the training data in two ways. Firstly, the evaluation data stems from the same dataset, making it i.i.d. with respect to the motions it contains. Secondly, the noise used to generate the input motions in the evaluation is not the actual noise encountered in a webcam-based pose estimation pipeline, but rather the same noise estimation used during network training. Thus, the validation data represents the best possible effort considering the limited availability of data for this specific task. However, in the event that datasets of skinned human motion smoothing become accessible, it would be desirable to re-evaluate the model on these out-of-distribution datasets.

Consequently, the lack of annotated data for evaluation implies that the performance of the model on real-world data is uncertain. Overcoming this limitation and implementing various potential improvements is an interesting topic for future work.