
Multi-level Motion Attention for Human Motion Prediction


Human motion prediction aims to forecast future human poses given a history of past motion. Whether based on recurrent or feed-forward neural networks, existing learning-based methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities. Here, we introduce an attention-based feed-forward network that explicitly leverages this observation. In particular, instead of modeling frame-wise attention via pose similarity, we propose to extract motion attention that captures the similarity between the current motion context and historical motion sub-sequences. In this context, we study the use of different types of attention, computed at the joint, body-part, and full-pose levels. Aggregating the relevant past motions and processing the result with a graph convolutional network allows us to effectively exploit motion patterns from the long-term history to predict future poses. Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodic and non-periodic actions. Thanks to our attention model, it yields state-of-the-art results on all three datasets. Our code is available at
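The core idea in the abstract, attending over historical motion sub-sequences rather than individual poses, can be illustrated with a toy sketch. This is not the paper's model (which uses learned embeddings, DCT-encoded values, and a GCN predictor); the function name, window lengths, and raw dot-product similarity below are illustrative assumptions only.

```python
import numpy as np

def motion_attention(history, query_len=10, sub_len=20):
    """Toy sketch of attention over historical motion sub-sequences.

    history: (T, D) array of T pose frames with D pose dimensions.
    The query is the most recent `query_len` frames; each key is the
    first `query_len` frames of a historical sub-sequence of length
    `sub_len`; each value is the full sub-sequence.
    """
    T, D = history.shape
    query = history[-query_len:].reshape(-1)  # flatten the current motion context
    scores, values = [], []
    for start in range(T - sub_len + 1):
        key = history[start:start + query_len].reshape(-1)
        scores.append(query @ key)            # similarity of context vs. past motion
        values.append(history[start:start + sub_len].reshape(-1))
    weights = np.exp(scores - np.max(scores))
    weights /= weights.sum()                  # softmax over sub-sequences
    # Weighted aggregation of past motions; the paper would feed such an
    # aggregate to a graph convolutional predictor to produce future poses.
    return (weights[:, None] * np.array(values)).sum(axis=0).reshape(sub_len, D)
```

The paper additionally computes such attention at the joint and body-part levels; that corresponds to running the same scheme on subsets of the D pose dimensions.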


Fig. 1: Human motion prediction.
Fig. 2: Overview of our motion attention pipeline.
Fig. 3: Predictor.
Fig. 4: Different fusion models.


  1. Described at

  2. Available at





This research was supported in part by an Australian Research Council DECRA Fellowship (DE180100628) and ARC Discovery Grant (DP200102274). The authors would like to thank NVIDIA for the donated GPU (Titan V).

Author information



Corresponding author

Correspondence to Wei Mao.


Communicated by Javier Romero.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 288 KB)

Supplementary material 2 (mp4 2645 KB)

Supplementary material 3 (pdf 1562 KB)

Supplementary material 4 (pdf 1220 KB)


About this article


Cite this article

Mao, W., Liu, M., Salzmann, M. et al. Multi-level Motion Attention for Human Motion Prediction. Int J Comput Vis 129, 2513–2535 (2021).



Keywords

  • Human motion prediction
  • Motion attention
  • Deep learning