Predicting human poses via recurrent attention network

Human motion prediction is a challenging task due to the diversity and randomness of future poses. Because of the inherent topology of pose data, most recent work has used graph convolution networks (GCNs) to accomplish this task. However, GCN-based methods have a major shortcoming: their modeling of temporal information is insufficient. In this paper, we propose a simple approach that combines recurrent neural networks (RNNs) and an attention mechanism for motion prediction, considering both the spatial relations between different joints and the temporal correlation. The query, key and value of the attention mechanism allow us to select the information needed for subsequent prediction. To address the difficulty of training RNNs, we utilize uncertainty and strong short-term constraints to optimize the training process. We evaluate our method on standard benchmark datasets for human motion prediction, i.e., the Human3.6M dataset and the CMU MoCap dataset. The experimental results show that our approach outperforms previous approaches.


Introduction
Human motion prediction (HMP) is a basic task for many downstream applications (see Fig. 1). For instance, autonomous driving, human-robot interaction, security monitoring and other applications need to predict human motion in order to obtain useful information in advance. Many early works used nonlinear Markov models [1], Gaussian process dynamical models [2], and restricted Boltzmann machines [3] to accomplish the task of human motion prediction. With the continuous development of deep learning, an increasing number of researchers have adopted deep learning models to estimate human poses, which has greatly improved the accuracy of prediction results.
Some works adopt convolutional neural networks (CNNs) to address the HMP problem [4][5][6][7]. They transform the pose sequence into an image and then use 2D convolutional layers to extract features. However, human joints do not have the adjacency relationships of image pixels, so concepts such as image edges cannot be transferred directly. Moreover, poses are not regular grid data, and it is difficult to extract key features using 2D convolutions. Due to the natural topology of pose sequences, many works utilize graph convolution networks (GCNs) to tackle the HMP problem [8][9][10][11][12][13][14][15][16][17][18]. They treat a human pose as a graph and view each joint as a node, with an edge connecting two adjacent joints. They then use GCNs to learn the spatial relations between different joints, which improves prediction accuracy. However, the GCN-based method has a major shortcoming: the modeling of temporal information is insufficient. Since pose data themselves form a sequence, the association between different frames represents movement. Employing spatial-temporal GCNs to model the temporal information seems a preferred way to address this problem. However, owing to the extremely high computation cost, such a model can only consider a small time span, which makes it difficult to extract temporal information well. Because of the sequential nature of pose data, some works also utilize recurrent neural networks (RNNs) to solve the HMP problem [19][20][21][22][23][24][25][26][27][28][29][30][31][32]. RNN-based approaches can extract temporal features efficiently and capture motion information at different moments. However, there are some inherent problems with RNN structures in HMP.
We find that traditional RNN-based HMP models often share some common problems: (1) They lack explicit extraction of the spatial features of joint data, which makes it difficult to obtain the relations between different joints. (2) The training process of RNN-based models is unstable. Error accumulation and uncertainty constrain the training process and can mislead the optimization.
For the first problem, leveraging a GCN to extract spatial connections between joints is a good choice. However, the GCN lacks temporal information transmission and movement information extraction. Attention mechanisms can be viewed as graph convolution operations on a fully connected graph. Since the joints themselves naturally form a sequence, attention can also model the spatial relations between different joints. Moreover, we can imitate the word embedding operation by applying a temporal embedding to a short clip of the pose sequence and modeling the motion within the clip with the hidden dimension. However, if the clip is too long, the temporal features cannot be modeled well because of the many possible motion processes. Therefore, the hidden dimension can only be used to model small changes in motion over a short period of time. Nevertheless, how to convey past information to the future remains a problem. It is also unclear what information is needed for long-term prediction, because the features for current and future predictions are tightly integrated. We therefore propose the KV-LSTM (key and value long short-term memory) model to handle this problem. Since we use the attention mechanism and the hidden dimension to model a clip's temporal features, the input feature is projected into a query, a key and a value. The query represents the information that the model needs for the current prediction, while the key and the value are the information that can be offered for prediction. We can use the key and value to convey useful information, and the query to obtain the information needed for prediction. To this end, we adopt the attention mechanism to divide the information and choose the LSTM model to deliver it.
The second problem is the unstable training process. Uncertainty is an important factor. For samples with nearly identical condition inputs, the model cannot output different results, so it will try to output the average positions of their ground truths. In addition, some motions are too complex to fit or have no inherent regularity. If the model does its best to fit these samples, it may affect the prediction of most other motions. As a result, we treat the predictions of different samples as different independent tasks and introduce uncertainty to alleviate this problem. Error accumulation is the second factor. Initially, the model uses the ground truth as input to predict the short-term outcome. However, because the short-term prediction deviates from the ground truth, when it is used as the input for the next stage of prediction, the model needs to predict the next-stage result from a biased input. The model thus needs to fit different prediction processes, which not only affects the long-term forecasts but also restricts the correct short-term prediction process. If the short-term prediction continues to fluctuate, it will also affect the subsequent long-term prediction process. Finally, the model cannot converge reasonably or converges incorrectly. What we want is for the model to make accurate short-term predictions and then use the previous outputs to predict subsequent results. To address this problem, we add a higher loss weight to the short-term prediction. This design not only accelerates the convergence of our model but also improves its performance.
In summary, the main contributions of this paper are three-fold: (1) A well-designed attention mechanism is used to model movements over small periods of time and the relations between joints. We incorporate the attention mechanism into our LSTM model, where the attention mechanism is used to selectively focus on relevant information and the LSTM model is utilized to transfer learned information from prior prediction steps to subsequent ones. We call this structure KV-LSTM.
(2) We utilize the uncertainty prediction module and strong short-term constraints to make the model's training process more stable and the final convergence results better.
(3) The experiments demonstrate that our method outperforms previous approaches on two public datasets.
Related work

Several works adopt CNNs to address the HMP problem. However, CNNs cannot directly model the relations between different human joints, and it may be difficult to extract useful features from the joints.
Since human joints have a topological structure, most recent work has used GCNs to model human joints and extract their correlation information. Aksan et al. [8] proposed a method similar to the GCN, which leveraged small networks to exchange and fuse features among joints. The works of [12,14,15] used the GCN to encode or decode features, associating the information of different joints. Some works [10,11,17,18] are fully based on the GCN to model human joints. Mao et al. [18] viewed a pose as a fully connected graph and adopted GCN design principles to extract hidden information between any pair of joints. Dang et al. [11] devised a multi-scale network to model human poses at different abstraction levels, giving the model a multi-level understanding of human movements. Sofianos et al. [33] proposed a method to extract spatio-temporal features using GCNs. Many GCN-based works used the discrete cosine transform (DCT) to extract temporal information.
With the development of the Transformer model [34], many works have attempted to introduce this structure into the task of human motion prediction [35,36]. The self-attention module of the Transformer can compute relations between different joints. It also has the ability to model the temporal information of motion sequences.
Since human motion data have temporal relations, many works have utilized the RNN structure to estimate human poses. For instance, ERD [21] combined LSTM [37] with an encoder and decoder to form a network for temporal modeling. Jain et al. [27] proposed Structural-RNN to extract the spatio-temporal features of human motions. Martinez et al. [29] used a sequence-to-sequence architecture, similar to those in natural language processing, to model human motion. We also utilize the LSTM model as the basic block of our proposed network and combine it with the attention mechanism, which extracts the connections between different joints more effectively and assists the transmission of temporal information. However, RNNs are difficult to train and usually suffer from error accumulation.

Methodology
Human motion prediction is the task of predicting future motion sequences given the currently observed motion sequences. Let S_{1:T_h} = [X_1, ..., X_{T_h}] ∈ R^{J×D×T_h} denote an observed motion sequence of length T_h, where X_i is the pose at time i, and let S_{T_h:T_h+T_p} be the motion sequence of length T_p that needs to be predicted. Note that J is the number of joints and D is the dimension of the coordinates. We do not directly predict all future motions at once. Instead, we divide the motion sequence into several clips, each of length T_s, and predict T_s frames at a time until reaching the terminal frame. For this purpose, we design an RNN-based prediction framework as illustrated in Fig. 2(a).
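The clip-wise rollout described above can be sketched as follows. This is a minimal NumPy illustration (function and parameter names are hypothetical, not the authors' implementation): a stand-in model maps the most recent T_s frames to the next T_s frames, and its outputs are fed back as inputs until T_p future frames have been produced.

```python
import numpy as np

def predict_iteratively(model, observed, t_s, t_p):
    """Clip-wise rollout sketch (names are hypothetical).

    observed: array of shape (J, D, T_h), the conditioning sequence.
    model maps a clip of T_s frames to the following T_s frames; its
    predictions are fed back as input until T_p frames are produced.
    """
    seq = observed
    predicted = []
    total = 0
    while total < t_p:
        clip = seq[..., -t_s:]                          # most recent T_s frames
        next_clip = model(clip)                         # predict following T_s frames
        predicted.append(next_clip)
        total += next_clip.shape[-1]
        seq = np.concatenate([seq, next_clip], axis=-1)  # feed predictions back
    return np.concatenate(predicted, axis=-1)[..., :t_p]
```

With T_s = 5 and T_p = 25 (the settings used later in the paper), five prediction stages are concatenated to form the final output.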

Motion attention mechanism
Attention [38] is an important mechanism in the field of natural language processing. This structure is often used to effectively extract global features or to model sequential associations.
Since human motion data are serialized data, we can use attention to obtain the relations between different frames. This mechanism can extract the temporal information of joints, which is a critical clue for predicting future human motions. However, a sequence is naturally formed between different joints as well. As a result, attention can also be used to extract relations between joints. In our proposed model, we utilize an attention mechanism on the dimension of joints, which is denoted as the motion attention mechanism (MAM). In this module, we not only extract the relationships between different joints but also leverage the hidden dimension to capture movement information over a short time period. This is because we recognize that motion within a segment contains richer information, which enhances the effectiveness of the transmitted features.
We divide the motion sequence into several clips, each of length T_s frames. The motion data can be represented as a tensor of dimension J × D × T_s. We then exchange the last two dimensions and merge them into one, forming a tensor of dimension J × (T_s × D). After passing through a motion encoder, the hidden feature is sent to the MAM to extract spatial information between different joints. Since we use the attention mechanism to learn the relations between joints, we employ self-attention. The attention function maps the input into a query and a set of key-value pairs. The output is computed as a weighted sum of the values, where the weight of each value is determined by a score between the query and the corresponding key. We compute the matrix of outputs as

Attention(Q, K, V) = softmax(QK^T / √d_k) V,
MultiHead(X) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(X W_i^Q, X W_i^K, X W_i^V),

where the projections W_i^Q ∈ R^{d_model×d_k}, W_i^K ∈ R^{d_model×d_k}, W_i^V ∈ R^{d_model×d_v} and W^O ∈ R^{hd_v×d_model} are parameter matrices. We linearly project the input X h times with different learnable parameters, obtaining queries, keys and values of hidden dimensions d_k, d_k and d_v. After calculating the multi-head attention, we obtain h output values, which are concatenated and projected to the final values. In this work, we set h = 8.

Figure 2 (a) Overview of our proposed framework. The prediction module is an iterative process. The input clips in the first two stages are derived from the observed sequences, and subsequent input clips are the results predicted by the model itself. Finally, the multi-stage prediction results are concatenated to form the final prediction. The uncertainty prediction module (UPM) takes in the observed sequences and estimates their uncertainty, which is used to compute the loss. L represents the loss function, W represents the model parameters, and σ_i, i ∈ {1, 2, ..., n}, represents the sample standard deviation estimated by the model. (b) Details of the prediction module. Q, K_t and V_t represent the query, key and value, respectively. The subscript t is the iteration of the LSTM cell. P^K_t and P^V_t denote the outputs of the K-LSTM cell and the V-LSTM cell, respectively. N indicates the number of blocks.
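The multi-head self-attention over the joint dimension can be illustrated with a small NumPy sketch. This is a generic scaled dot-product attention in the style of [34,38], not the authors' code; the weight matrices here are stand-ins for the learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, h):
    """Scaled dot-product self-attention over the joint dimension.

    x: (J, d_model) clip feature, one row per joint.
    w_q, w_k, w_v, w_o: (d_model, d_model) projection matrices; the
    query/key/value projections are split into h heads.
    """
    j, d_model = x.shape
    d_k = d_model // h
    q = (x @ w_q).reshape(j, h, d_k).transpose(1, 0, 2)   # (h, J, d_k)
    k = (x @ w_k).reshape(j, h, d_k).transpose(1, 0, 2)
    v = (x @ w_v).reshape(j, h, d_k).transpose(1, 0, 2)
    scores = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))  # (h, J, J)
    heads = scores @ v                                         # (h, J, d_k)
    out = heads.transpose(1, 0, 2).reshape(j, d_model)         # concat heads
    return out @ w_o                                           # final projection
```

The (h, J, J) score tensor is what encodes the pairwise relations between joints, while the hidden dimension of each row carries the within-clip motion.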

KV-LSTM
Since our MAM models the relations between joints, temporal information is only modeled within small clips by the hidden dimension. However, temporal information is critical for prediction tasks. Knowledge of the past can act as a prior constraint, allowing more precise predictions of possible future movements. As a result, the model needs the ability to transmit past information, so that further prediction processes can obtain more information and better long-term prediction results can be obtained.
The LSTM structure is usually used to extract temporal correlations. In addition, the LSTM network can also be viewed as a memory bank to store information. In this work, we use this structure to pass information from past clips to subsequent clips. However, it is often unclear what information is needed for long-term prediction, because the features for current prediction and future prediction are tightly integrated. Thanks to the motion attention mechanism, we can divide the features into queries, keys and values. The query represents the information that the model needs for the current prediction, while the key and the value are the information that can be offered for the current prediction. We can then employ the key and value to pass information, and the query to obtain the information needed for prediction. As a result, we propose the KV-LSTM model to transmit features from past to future, as shown in Fig. 2(b).
The query is passed to the MAM with its original value. The key requires not only the current information but also past information, since the query is targeted at the current prediction. We utilize the LSTM model to control the information flow sent to the MAM for the current prediction, as well as to select features required for subsequent predictions. Let K_t be the computed key; the transfer process of the key can be expressed as

f_t = σ(W_f [P^K_{t-1}, K_t] + b_f),
q_t = σ(W_q [P^K_{t-1}, K_t] + b_q),
o_t = σ(W_o [P^K_{t-1}, K_t] + b_o),
C̃_t = tanh(W_C [P^K_{t-1}, K_t] + b_C),
C_t = f_t ⊙ C_{t-1} + q_t ⊙ C̃_t,
P^K_t = o_t ⊙ tanh(C_t),

where σ indicates the sigmoid function and tanh represents the hyperbolic tangent function. By adopting an LSTM structure, our model can capture past information from the cell and hidden states, while adding current features to the cell for future prediction. Then, P^K_t is sent to the MAM and acts as the true key. C_t and P^K_t are delivered to the next LSTM cell. Likewise, we use an LSTM structure to pass the information of the value. The subsequent prediction can thus take the information of previous clips into account, promoting the effect of long-term prediction.
We claim that the query does not need this LSTM structure to convey information. Each clip must make its own prediction, and the query only represents what clue the current prediction requires. Thus, the query of the current clip is useless, and may even be harmful noise, for subsequent predictions. We therefore send it to the MAM directly.
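A single step of the K-LSTM described by the equations above can be sketched as follows. This is a NumPy illustration with a hypothetical parameter layout (each weight matrix maps the concatenation [P^K_{t-1}, K_t] to the hidden size); the V-LSTM is identical with the value in place of the key.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def k_lstm_cell(k_t, pk_prev, c_prev, params):
    """One K-LSTM step (sketch). k_t: current key; pk_prev: previous
    output P^K_{t-1}; c_prev: previous cell state C_{t-1}.
    params: dict with matrices W_f, W_q, W_o, W_c and biases b_f, b_q,
    b_o, b_c (layout assumed for illustration).
    """
    z = np.concatenate([pk_prev, k_t])
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    q = sigmoid(params["W_q"] @ z + params["b_q"])        # input gate
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate cell
    c_t = f * c_prev + q * c_tilde                        # new cell state
    pk_t = o * np.tanh(c_t)                               # output sent to the MAM
    return pk_t, c_t
```

pk_t plays the role of the "true key" fed to the MAM, while (pk_t, c_t) are handed to the next cell.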

Dealing with instability
The human motion prediction task usually uses the mean square error (MSE) as the loss function to train the model. However, we find that there may be some problems in optimizing an RNN-based model with this naive approach: the training process can be very erratic. We introduce two methods to handle this instability in the following paragraphs.
Uncertainty prediction module. There are many types of motion, and the range, speed, and posture of each movement vary greatly. Our model is designed to fit different motions, which can essentially be regarded as a combination of multiple fitting tasks. However, some movements are periodic, while others are more random. In addition, small sections of some actions are similar and difficult to distinguish when fed into the model as conditional data. For samples with nearly identical condition inputs, the model cannot produce different outputs, so it will try to fit the average positions of their ground-truth values. We call this kind of data unpredictable data; the other training data can be called predictable data. If the model tries its best to fit the unpredictable data, it will learn not only useless knowledge for the unpredictable data but also harmful knowledge for the predictable data. Therefore, the best approach is to make the model focus more on predicting the predictable data and give up some of the unpredictable data. Nevertheless, it is challenging to define what type of data is unpredictable.
To solve this problem, we introduce uncertainty. In Bayesian modeling, there are two main types of uncertainty that one can model [39]: epistemic uncertainty and aleatoric uncertainty. Aleatoric uncertainty can be divided into two subcategories. Data-dependent or heteroscedastic uncertainty depends on the input data, while task-dependent or homoscedastic uncertainty depends on the task. In this work, we treat different samples as different tasks, because they represent various kinds of motions. A multi-sample loss can be derived by maximizing the Gaussian likelihood with homoscedastic uncertainty.
Let f_θ(x) be our prediction model, where θ denotes its learnable parameters and x is the input data. For the human motion prediction task, we define our likelihood as a Gaussian with the mean given by the model output:

p(y | f_θ(x)) = N(f_θ(x), σ²),

with an observation noise scalar σ. For different samples x_1, x_2, ..., x_n, the ground-truth values are denoted y_1, y_2, ..., y_n. We assume that the model's predictions for different samples are independent of each other, so the likelihood can be rewritten as

p(y_1, ..., y_n | f_θ(x_1), ..., f_θ(x_n)) = ∏_{i=1}^{n} N(y_i; f_θ(x_i), σ_i²).

Maximizing the log-likelihood of the model leads to the minimization objective

L(W, σ_1, σ_2, ..., σ_n) = ∑_{i=1}^{n} ( (1 / (2σ_i²)) ||y_i − f_W(x_i)||² + log σ_i ),

where W denotes the model parameters and constant terms are omitted. The σ_i are outputs of the model; we use a separate module to estimate the variance of different samples, named the uncertainty prediction module (UPM).
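The minimization objective above can be sketched as a per-sample uncertainty-weighted loss. This is an illustrative NumPy version, not the authors' code; predicting log σ_i instead of σ_i is a common numerical-stability trick that we assume here, not something stated in the text.

```python
import numpy as np

def uncertainty_loss(preds, targets, log_sigmas):
    """Per-sample uncertainty-weighted loss (constants dropped).

    preds, targets: (n, ...) predicted and ground-truth poses.
    log_sigmas: (n,) per-sample log standard deviations from the UPM.
    For each sample i: 0.5 * exp(-2 log_sigma_i) * ||y_i - f(x_i)||^2
    + log_sigma_i, averaged over samples.
    """
    n = preds.shape[0]
    sq_err = ((preds - targets).reshape(n, -1) ** 2).sum(axis=1)
    return float(np.mean(0.5 * np.exp(-2.0 * log_sigmas) * sq_err + log_sigmas))
```

A sample the model cannot fit can be "given up" by raising its σ_i, which downweights its squared error at the cost of the log σ_i penalty.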
Strong short-term constraint. MSE loss is often used as the training loss of regression tasks. However, this can lead to the problem that the prediction results at all times are treated as equally important. This is often harmful because long-term predictions depend on short-term results, and we hope that the model can predict good long-term results as well as good short-term results. Due to the error accumulation problem of RNN-based models, it is difficult for a model to make perfect short-term and long-term predictions at the same time. When the model predicts long-term motions, it needs to consider not only the input conditions but also the input uncertainty caused by the accumulation of errors. This, in turn, affects the short-term prediction process: the short-term forecast results become unstable or poor, and the long-term prediction results fluctuate with them. Finally, the training process has difficulty converging. Ideally, a model should be able to make accurate long-term predictions if its short-term predictions are accurate. However, when using the same weight in the MSE loss for all frames, the model tends to prioritize the optimization of long-term prediction over short-term prediction, as the difference between the prediction and the ground truth is larger for long-term prediction. This makes it difficult to improve short-term prediction. To address this issue, we increase the loss weight of short-term prediction to make the model pay more attention to it. Our experimental results demonstrate that this approach not only accelerates the convergence of the model but also improves its performance.
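A time-weighted MSE of this kind can be sketched as follows, using the weights reported in the implementation details (8 for the first 5 predicted frames, 2 for the next 5, and 1 otherwise). This is an illustrative NumPy version, not the authors' code.

```python
import numpy as np

def short_term_weighted_mse(pred, target):
    """Time-weighted MSE: early frames are penalized more heavily.

    pred, target: (J, D, T_p) predicted and ground-truth sequences.
    Weights follow the paper's settings: 8 for frames 1-5, 2 for
    frames 6-10, 1 for the remaining frames.
    """
    t_p = pred.shape[-1]
    w = np.ones(t_p)
    w[:5] = 8.0        # strongest constraint on the immediate future
    w[5:10] = 2.0
    per_frame = ((pred - target) ** 2).mean(axis=(0, 1))  # MSE per frame
    return float((w * per_frame).sum() / w.sum())
```

Because the weights are normalized, the loss stays on the same scale as a plain MSE while shifting the gradient toward the short-term frames.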

Datasets
• Human3.6M [40] is one of the largest motion capture datasets, consisting of 3.6 million human poses and corresponding images. It has 15 types of actions performed by 7 actors (S1, S5, S6, S7, S8, S9 and S11). The actors are represented by a skeleton of 32 joints. Following the data pre-processing of [11], we select only 22 joints. The global rotations and translations of the poses are removed, and the frame rate is downsampled from 50 fps to 25 fps. Of the 7 actors, S5 and S11 are used for testing and validation, respectively, and the remaining actors are used for training.
• CMU-MoCap has 8 different action categories.The global rotations and translations of the poses are also removed.Each pose has 38 joints, and only 25 joints are used following [11,18].
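The preprocessing steps shared by both datasets can be sketched as follows. This is an illustrative NumPy version: the joint indices are placeholders, the root is assumed to be joint 0, and the removal of global rotation is omitted.

```python
import numpy as np

def preprocess(poses, keep_joints, downsample=2):
    """Dataset preprocessing sketch: downsample the frame rate
    (50 fps -> 25 fps means keeping every second frame), remove the
    global translation by centering on the root joint, and keep only
    the selected joints.

    poses: (T, J, 3) sequence of 3D joint positions, joint 0 as root.
    """
    poses = poses[::downsample]          # 50 fps -> 25 fps
    poses = poses - poses[:, :1, :]      # remove global translation
    return poses[:, keep_joints, :]      # keep selected joints (e.g. 22 of 32)
```

For Human3.6M, keep_joints would select 22 of the 32 skeleton joints; for CMU-MoCap, 25 of 38.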

Comparison setting
• Evaluation metrics. We train and test our model on coordinate representations. The mean per joint position error (MPJPE) is used as our evaluation metric for 3D coordinate errors. We show the results measured by 3D coordinates in this paper.
• Test scope. We note that the works of [15,18,29] randomly chose 8 samples per action for testing, Mao et al. [17] randomly selected 256 samples per action, and Dang et al. [11] chose all the samples for testing. We follow Dang et al. [11] and measure the whole test dataset in this paper.
• Implementation details. Following [11], we set the input length to 10 frames and the output length to 25 frames for both Human3.6M and CMU-MoCap. We design a 2-layer MAM to make predictions, and the layer number of the KV-LSTM is set to 1. Our MAM has 8 heads, and the hidden dimension of our model is 128. For computational reasons, our UPM is implemented simply as a multi-layer perceptron (MLP). For the loss weights of the prediction results at different times, we add higher weights to the predictions of the first 5 frames and the first 10 frames, set to 8 and 2, respectively. We employ AdamW as the solver. The initial learning rate is 0.0001 and is multiplied by 0.5 after 30 epochs. The model is trained for 100 epochs with a batch size of 16. The devices used are an NVIDIA RTX 3090 GPU and an Intel Xeon Gold 6226R CPU.
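The MPJPE metric used above can be computed directly from its definition: the Euclidean distance between each predicted and ground-truth joint, averaged over joints and frames. A standard NumPy implementation:

```python
import numpy as np

def mpjpe(pred, target):
    """Mean per joint position error.

    pred, target: (..., J, 3) 3D joint coordinates; the norm is taken
    over the coordinate axis and averaged over everything else.
    """
    return float(np.linalg.norm(pred - target, axis=-1).mean())
```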

Comparisons with previous approaches
We compare our model with Res.Sup. [29], DMGNN [15], LTD [18], and MSR [11] on two datasets, Human3.6M and CMU-MoCap. Res.Sup. is also an RNN-based approach. DMGNN utilizes a GCN to extract features and an RNN to decode the information. LTD entirely uses the GCN to model spatio-temporal features and performs the prediction in the frequency domain. MSR is a recent method that executes LTD in a multi-scale architecture. All of these methods are previous state-of-the-art techniques with publicly released code. We choose their pre-trained models or re-train their models with the default hyperparameters for a fair comparison.
• Human3.6M. Table 1 shows the quantitative comparisons of short-term prediction (less than 400 ms) on the Human3.6M dataset between our model and other works. The comparisons of long-term prediction (more than 400 ms but less than 1000 ms) on the Human3.6M dataset are summarized in Table 2. In most cases, our model performs better than the compared approaches. An example of predicted poses of our method and MSR is shown in Fig. 3. As depicted in the figure, our predicted poses are more accurate than those of MSR.
• CMU-MoCap. Table 3 displays the quantitative comparisons of the prediction results on the CMU-MoCap dataset between our model and other works. On this dataset, our method also outperforms the compared approaches.

Ablation analysis
The ablation experiments are conducted on the Human3.6M dataset to analyze the validity of our proposed method. • Architecture. We propose several designs that contribute to the effectiveness of our method, which can be summarized as follows: (1) the KV-LSTM module, and (2) the uncertainty prediction module. Table 4 summarizes the ablation experiments on different variants of the complete model. Our full model contains 2 layers of MAM, each with its own KV-LSTM to transfer past information to the future, and the UPM is implemented using a simple MLP network. The average prediction error is 68.78. Next, we describe our experience and the lessons we learned with our model. We found that continuing to increase the number of MAM layers does not provide significant performance gains but increases computational complexity. We also attempted a frame-by-frame prediction approach; however, this method was difficult to converge, and the resulting predictions were poor in quality.

Conclusion
We propose a human motion prediction model that combines LSTM and the attention mechanism. Different from recent methods that use the GCN as the main structure, our model uses self-attention to model the connections between joints as well as to separate the information, and uses LSTM to transmit features, thus improving the prediction performance. In addition, we focus on handling the instability of training, utilizing uncertainty and strong constraints on short-term prediction to address this problem. Finally, experimental results show that the proposed method improves the prediction accuracy on two major human motion prediction datasets, indicating the effectiveness of our proposed model.

Figure 1 Human motion prediction

Figure 3 Visualization of predicted poses of our method and MSR on a sample of Human3.6M

(1) To show the effectiveness of the KV-LSTM, we removed this module, and the prediction error became 73.43, a large performance drop. (2) Then, we used the KV-LSTM but removed the UPM. The prediction error was 69.83, which implies the necessity of the uncertainty prediction module. (3) In the third experiment, we removed the strong short-term constraint from the full model. This causes the prediction error to increase from 68.78 to 69.55. (4) We also modeled spatial associations with different numbers of MAM layers, and the experimental results show that a 2-layer design is the best.
The weights and bias terms of the forget gate, input gate, output gate and the new candidate cell are denoted by W_f, W_q, W_o, W_C and b_f, b_q, b_o, b_C, respectively. f_t represents the forget gate operation, q_t denotes the input gate operation, o_t indicates the output gate operation, C̃_t represents the new candidate cell, C_t denotes the cell state, and P^K_t represents the LSTM output.

Table 1 Comparisons of short-term prediction on Human3.6M. The results at 80 ms, 160 ms, 320 ms, and 400 ms in the future are shown. The best results are highlighted in bold, and the second best are underlined

Table 2 Comparisons of long-term prediction on Human3.6M. The results at 560 ms and 1000 ms in the future are shown. The best results are highlighted in bold, and the second best are underlined

Table 3 CMU-MoCap: comparisons of average prediction errors