Long-Term Human Motion Prediction with Scene Context

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12346)


Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment: imagine how hard it is to navigate a new room with the lights off. Existing works on predicting human motion do not take the scene context into account and thus struggle in long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a synthetic dataset with clean annotations. On both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods. Project page: (Please refer to our arXiv preprint for a longer version of the paper with more visualizations.)
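The data flow of the three stages described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the paper's models: every function name, parameter, and constant here is hypothetical, the scene image is ignored, and each learned stage is replaced by a trivial stand-in (Gaussian goal sampling, straight-line path interpolation with a constant placeholder depth, and rigid translation of the last observed pose). Coordinates are in arbitrary units.

```python
import random
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
Pose2D = List[Point2D]  # one (x, y) per joint; joint 0 is treated as the root
Pose3D = List[Point3D]  # one (x, y, z) per joint


def sample_goals(scene_image, pose_history: Sequence[Pose2D],
                 num_goals: int, rng: random.Random) -> List[Point2D]:
    """Stage 1 stand-in: draw goals as Gaussian noise around the last root.

    A learned model would condition on scene_image; this placeholder ignores it.
    """
    root_x, root_y = pose_history[-1][0]
    return [(rng.gauss(root_x, 50.0), rng.gauss(root_y, 50.0))
            for _ in range(num_goals)]


def plan_path(pose_history: Sequence[Pose2D], goal: Point2D,
              num_steps: int, depth: float = 3.0) -> List[Point3D]:
    """Stage 2 stand-in: straight line from the current root to the goal.

    The depth is a constant placeholder, not an estimate from the scene.
    """
    start_x, start_y = pose_history[-1][0]
    path = []
    for step in range(1, num_steps + 1):
        t = step / num_steps
        path.append((start_x + t * (goal[0] - start_x),
                     start_y + t * (goal[1] - start_y),
                     depth))
    return path


def predict_poses(pose_history: Sequence[Pose2D],
                  path: Sequence[Point3D]) -> List[Pose3D]:
    """Stage 3 stand-in: translate the last observed pose along the path."""
    last_pose = pose_history[-1]
    root_x, root_y = last_pose[0]
    return [[(px + jx - root_x, py + jy - root_y, pz) for jx, jy in last_pose]
            for px, py, pz in path]


def predict_futures(scene_image, pose_history: Sequence[Pose2D],
                    num_goals: int = 3, num_steps: int = 5,
                    seed: int = 0) -> List[List[Pose3D]]:
    """Chain the stages: one predicted pose sequence per sampled goal."""
    rng = random.Random(seed)
    futures = []
    for goal in sample_goals(scene_image, pose_history, num_goals, rng):
        path = plan_path(pose_history, goal, num_steps)
        futures.append(predict_poses(pose_history, path))
    return futures
```

Calling `predict_futures(None, history)` with a history of 2D poses returns one pose sequence per sampled goal, which mirrors the multi-hypothesis structure of the framework: several goals, one path per goal, and one pose sequence per path.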



We thank Carsten Stoll and Christoph Lassner for their helpful feedback. We are also very grateful for the discussions within the BAIR community.

Supplementary material

Supplementary material 1 (pdf 2787 KB)

Supplementary material 2 (mp4 80700 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. UC Berkeley, Berkeley, USA
  2. Nanjing University, Nanjing, China
