
PoseGPT: Quantization-Based 3D Human Motion Generation and Forecasting

  • Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13666)

Abstract

We address the problem of action-conditioned generation of human motion sequences. Existing work falls into two categories: forecast models conditioned on observed past motions, or generative models conditioned only on action labels and duration. In contrast, we generate motion conditioned on observations of arbitrary length, including none. To solve this generalized problem, we propose PoseGPT, an auto-regressive transformer-based approach that internally compresses human motion into quantized latent sequences. An auto-encoder first maps human motion to latent index sequences in a discrete space, and vice versa. Inspired by the Generative Pretrained Transformer (GPT), we propose to train a GPT-like model for next-index prediction in that space; this allows PoseGPT to output distributions over possible futures, with or without conditioning on past motion. The discrete and compressed nature of the latent space allows the GPT-like model to focus on long-range signal, as it removes low-level redundancy in the input. Predicting discrete indices also alleviates the common pitfall of predicting averaged poses, a typical failure case when regressing continuous values, as the average of discrete targets is not itself a target. Our experiments show that the proposed approach achieves state-of-the-art results on HumanAct12, a standard but small-scale dataset, as well as on BABEL, a recent large-scale MoCap dataset, and on GRAB, a human-object interaction dataset.
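To make the two stages in the abstract concrete, below is a minimal PyTorch sketch of the general technique: a VQ-VAE-style quantization bottleneck that turns continuous motion latents into discrete codebook indices, and a GPT-style causal transformer that predicts a distribution over the next index. Everything here (the names `MotionQuantizer` and `NextIndexGPT`, the codebook size, dimensions, and layer counts) is an illustrative assumption, not the paper's actual architecture.

```python
# Hypothetical sketch of quantization + next-index prediction.
# Sizes and module names are illustrative, not the paper's.
import torch
import torch.nn as nn


class MotionQuantizer(nn.Module):
    """VQ-VAE-style bottleneck: each latent frame is replaced by its
    nearest codebook entry; the entry's index is the discrete token."""

    def __init__(self, codebook_size: int = 512, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):  # z: (batch, time, dim)
        # Pairwise distances between latent frames and codebook entries.
        flat = z.reshape(-1, z.size(-1))                 # (B*T, dim)
        d = torch.cdist(flat, self.codebook.weight)      # (B*T, K)
        idx = d.argmin(dim=-1).view(z.shape[:-1])        # (B, T) indices
        z_q = self.codebook(idx)                         # quantized latents
        # Straight-through estimator: copy gradients to the encoder.
        z_q = z + (z_q - z).detach()
        return z_q, idx


class NextIndexGPT(nn.Module):
    """Causal transformer over index sequences, GPT-style: given past
    indices, it outputs logits over the next codebook index, so futures
    can be sampled autoregressively."""

    def __init__(self, codebook_size: int = 512, dim: int = 256,
                 n_layers: int = 4, max_len: int = 1024):
        super().__init__()
        self.tok = nn.Embedding(codebook_size, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))  # learned positions
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, idx: torch.Tensor):  # idx: (batch, time)
        t = idx.size(1)
        x = self.tok(idx) + self.pos[:, :t]
        # Causal mask: position i may only attend to positions <= i.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.blocks(x, mask=mask)
        return self.head(h)  # (batch, time, codebook_size) logits
```

Given indices for an observed past (possibly empty), futures can then be sampled one index at a time from the logits, and the auto-encoder's decoder maps the sampled index sequence back to continuous motion.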

T. Lucas and F. Baradel—Equal contribution.


Notes

  1. From left to right and top to bottom: ‘turning’, ‘touching face’, ‘walking’, ‘sitting’.



Author information

Corresponding author

Correspondence to Thomas Lucas.


Electronic supplementary material

Below are the links to the electronic supplementary material.

Supplementary material 1 (pdf 47 KB)

Supplementary material 2 (mp4 3924 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Lucas, T., Baradel, F., Weinzaepfel, P., Rogez, G. (2022). PoseGPT: Quantization-Based 3D Human Motion Generation and Forecasting. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_24


  • DOI: https://doi.org/10.1007/978-3-031-20068-7_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20067-0

  • Online ISBN: 978-3-031-20068-7

  • eBook Packages: Computer Science, Computer Science (R0)
