Skip to main content

Adversarial Generative Grammars for Human Activity Prediction

  • Conference paper
  • First Online:
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12347))

Included in the following conference series:

Abstract

In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, being able to predict much more accurately and further in the future, than prior work. Code will be open sourced.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    For a branching factor of k rules per non-terminal with a sequence of length L, there are in \(k^L\) terminals and non-terminals (for \(k=2\), \(L=10\) we have \(\sim \)1000 and for \(k=3\) \(\sim \)60,000.

  2. 2.

    Note that most of the previous works used the MultiTHUMOS dataset and the Charades dataset for per-frame activity categorization; our works showcases a long-term activity forecasting capability, instead.

References

  1. Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S.: Stochastic variational video prediction. arXiv preprint arXiv:1710.11252 (2017)

  2. Brock, A., Donahue, J., Simonyan, K.: Large scale gan training for high fidelity natural image synthesis. ICLR (2019)

    Google Scholar 

  3. Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP (2014)

    Google Scholar 

  4. Chomsky, N.: Three models for the description of language. IRE Transactions on information theory 2(3), 113–124 (1956)

    Article  Google Scholar 

  5. Denton, E., Fergus, R.: Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687 (2018)

  6. Emily L Denton, Soumith Chintala, R.F.: Deep generative image models using a laplacian pyramid of adversarial networks. Advances in Neural Information Processing Systems (NeurIPS) (2015)

    Google Scholar 

  7. Farha, Y.A., Richard, A., Gall, J.: When will you do what? - anticipating temporal occurrences of activities. In: CVPR (2018)

    Google Scholar 

  8. Fedus, W., Goodfellow, I., Dai, A.: Maskgan: Better text generation via filling in the \(\_\). ICLR (2018)

    Google Scholar 

  9. Finn, C., Goodfellow, I., Levine, S.: Unsupervised learning for physical interaction through video prediction. In: Advances in Neural Information Processing Systems (NeurIPS). pp. 64–72 (2016)

    Google Scholar 

  10. Fraccaro, M., Sønderby, S.K., Paquet, U., Winther, O.: Sequential neural models with stochastic layers. In: Advances in neural information processing systems. pp. 2199–2207 (2016)

    Google Scholar 

  11. Fragkiadaki, K., Levine, S., Felsen, P., Malik, J.: Recurrent network models for human dynamics. In: ICCV (2015)

    Google Scholar 

  12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. Advances in Neural Information Processing Systems (NeurIPS) (2014)

    Google Scholar 

  13. Han, F., Zhu, S.C.: Bottom-up/top-down image parsing with attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(1), 59–73 (2008)

    Google Scholar 

  14. Hu, Z., Yang, Z., Liang, X., Salakhutdinov, R., Xing, E.P.: Toward controlled generation of text. ICML (2017)

    Google Scholar 

  15. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence (2014)

    Google Scholar 

  16. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. CVPR (2017)

    Google Scholar 

  17. Jain, A., Zamir, A.R., Savarese, S., Saxena, A.: Structural-rnn: Deep learning on spatio-temporal graphs. In: CVPR (2016)

    Google Scholar 

  18. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. In: ICLR (2017)

    Google Scholar 

  19. Ke, Q., Fritz, M., Schiele, B.: Time-conditioned action anticipation in one shot. In: CVPR (2019)

    Google Scholar 

  20. Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523 (2018)

  21. Maddison, C.J., Mnih, A., Teh, Y.W.: The concrete distribution: A continuous relaxation of discrete random variables. In: ICLR (2017)

    Google Scholar 

  22. Martinez, J., Black, M., Romero, J.: On human motion prediction using recurrent neural networks. In: CVPR (2017)

    Google Scholar 

  23. Moore, D., Essa, I.: Recognizing multitasked activities from video using stochastic context-free grammar. In: Proceedings of AAAI Conference on Artificial Intelligence (AAAI). pp. 770–776 (2002)

    Google Scholar 

  24. Pirsiavash, H., Ramanan, D.: Parsing videos of actions with segmental grammars. In: CVPR. pp. 612–619 (2014)

    Google Scholar 

  25. Qi, S., Huang, S., Wei, P., Zhu, S.C.: Predicting human activities using stochastic grammar. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1164–1172 (2017)

    Google Scholar 

  26. Qi, S., Jia, B., Zhu, S.C.: Generalized earley parser: Bridging symbolic grammars and sequence data for future prediction. arXiv preprint arXiv:1806.03497 (2018)

  27. Ryoo, M.S., Aggarwal, J.K.: Recognition of composite human activities through context-free grammar based representation. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). vol. 2, pp. 1709–1718. IEEE (2006)

    Google Scholar 

  28. Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: Crowdsourcing data collection for activity understanding. Proceedings of European Conference on Computer Vision (ECCV) (2016)

    Google Scholar 

  29. Socher, R., Lin, C.C., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language with recursive neural networks. In: Proceedings of the 28th international conference on machine learning (ICML-11). pp. 129–136 (2011)

    Google Scholar 

  30. Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing. pp. 729–738. ACM (2013)

    Google Scholar 

  31. Tang, Y., Ma, L., Liu, W., Zheng, W.S.: Long-term human motion prediction by modeling motion context and enhancing motion dynamic. In: IJCAI (2018)

    Google Scholar 

  32. Vo, N.N., Bobick, A.F.: From stochastic grammar to bayes network: Probabilistic parsing of complex activity. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2641–2648 (2014)

    Google Scholar 

  33. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: CVPR (2018)

    Google Scholar 

  34. Yang, F., Yang, Z., Cohen, W.W.: Differentiable learning of logical rules for knowledge base reasoning. Advances in Neural Information Processing Systems (NeurIPS) (2017)

    Google Scholar 

  35. Yeung, S., Russakovsky, O., Jin, N., Andriluka, M., Mori, G., Fei-Fei, L.: Every moment counts: Dense detailed labeling of actions in complex videos. International Journal of Computer Vision (IJCV) pp. 1–15 (2015)

    Google Scholar 

  36. Yogatama, D., Miao, Y., Melis, G., Ling, W., Kuncoro, A., Dyer, C., Blunsom, P.: Memory architectures in recurrent neural network language models. ICLR (2018)

    Google Scholar 

  37. Yu, L., Zhang, W., J. Wang, Yu, Y.: Seqgan: sequence generative adversarial nets with policy gradient. Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (2017)

    Google Scholar 

  38. Zhao, Y., Zhu, S.C.: Image parsing with stochastic scene grammar. In: Advances in Neural Information Processing Systems. pp. 73–81 (2011)

    Google Scholar 

  39. Zhu, S.C., Mumford, D.: A stochastic grammar of images. Foundations and Trends\(\textregistered \) in Computer Graphics and Vision 2 (2007)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to A. J. Piergiovanni .

Editor information

Editors and Affiliations

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 242 KB)

Rights and permissions

Reprints and permissions

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Piergiovanni, A.J., Angelova, A., Toshev, A., Ryoo, M.S. (2020). Adversarial Generative Grammars for Human Activity Prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12347. Springer, Cham. https://doi.org/10.1007/978-3-030-58536-5_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-58536-5_30

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58535-8

  • Online ISBN: 978-3-030-58536-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics