GIMO: Gaze-Informed Human Motion Prediction in Context

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13673)

Abstract

Predicting human motion is critical for assistive robots and AR/VR applications, where the interaction with humans needs to be safe and comfortable. Meanwhile, an accurate prediction depends on understanding both the scene context and human intentions. Even though many works study scene-aware human motion prediction, the latter is largely underexplored due to the lack of ego-centric views that disclose human intent and the limited diversity in motion and scenes. To reduce the gap, we propose a large-scale human motion dataset that delivers high-quality body pose sequences, scene scans, as well as ego-centric views with the eye gaze that serves as a surrogate for inferring human intent. By employing inertial sensors for motion capture, our data collection is not tied to specific scenes, which further boosts the motion dynamics observed from our subjects. We perform an extensive study of the benefits of leveraging the eye gaze for ego-centric human motion prediction with various state-of-the-art architectures. Moreover, to realize the full potential of the gaze, we propose a novel network architecture that enables bidirectional communication between the gaze and motion branches. Our network achieves the top performance in human motion prediction on the proposed dataset, thanks to the intent information from eye gaze and the denoised gaze feature modulated by the motion. Code and data can be found at https://github.com/y-zheng18/GIMO.
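
The bidirectional communication between the gaze and motion branches can be pictured as a pair of cross-attention steps. The following is a minimal sketch under stated assumptions, not the authors' architecture: the function names (`cross_attend`, `bidirectional_exchange`), the identity projections in place of learned weights, the residual update, and the requirement that both branches share one feature dimension are all hypothetical choices made for illustration.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one row per time step, one column per feature


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention with identity projections (assumption):
    each query row becomes a weighted average of the context rows."""
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in context]
        weights = softmax(scores)
        attended.append([
            sum(w * c[d] for w, c in zip(weights, context))
            for d in range(len(context[0]))
        ])
    return attended


def bidirectional_exchange(gaze: Matrix, motion: Matrix) -> Tuple[Matrix, Matrix]:
    """Hypothetical two-way exchange between a gaze and a motion branch."""
    # Gaze queries motion: gaze features are modulated by the motion context.
    gaze_update = cross_attend(gaze, motion)
    # Motion queries gaze: motion features receive intent cues from the gaze.
    motion_update = cross_attend(motion, gaze)
    # Residual connections keep each branch's original signal.
    new_gaze = [[g + u for g, u in zip(g_row, u_row)]
                for g_row, u_row in zip(gaze, gaze_update)]
    new_motion = [[m + u for m, u in zip(m_row, u_row)]
                  for m_row, u_row in zip(motion, motion_update)]
    return new_gaze, new_motion


# Toy features: 3 gaze time steps and 2 motion time steps, 2 features each.
gaze_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
motion_features = [[0.5, 0.5], [1.0, 0.0]]
updated_gaze, updated_motion = bidirectional_exchange(gaze_features, motion_features)
```

Both updates are computed from the original features, so neither direction depends on the other's output. A learned model would typically add projection matrices, multiple heads, and normalization on top of this skeleton.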

Notes

  1. https://noitom.com/perception-neuron-series

  2. https://www.microsoft.com/en-us/hololens

  3. https://apps.apple.com/us/app/3d-scanner-app/id1419913995

Acknowledgments

The authors are supported by a grant from the Stanford HAI Institute, a Vannevar Bush Faculty Fellowship, a gift from the Amazon Research Awards program, and NSFC grants No. 62125107 and No. 62171255. Toyota Research Institute also provided funds to support this work.

Author information

Corresponding author

Correspondence to Yanchao Yang.

Electronic supplementary material

Supplementary material 1 (PDF, 9321 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zheng, Y. et al. (2022). GIMO: Gaze-Informed Human Motion Prediction in Context. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13673. Springer, Cham. https://doi.org/10.1007/978-3-031-19778-9_39

  • DOI: https://doi.org/10.1007/978-3-031-19778-9_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19777-2

  • Online ISBN: 978-3-031-19778-9

  • eBook Packages: Computer Science, Computer Science (R0)
