Adversarial Generative Grammars for Human Activity Prediction

Piergiovanni, A. J.; Angelova, Anelia; Toshev, Alexander; Ryoo, Michael S.

doi:10.1007/978-3-030-58536-5_30

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12347)

Included in the following conference series:

European Conference on Computer Vision

Abstract

In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing the capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Selecting among multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms state-of-the-art approaches, predicting much more accurately and further into the future than prior work. Code will be open sourced.
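
To make the idea of stochastic production rules concrete, the following is a minimal sketch of how sampling different rules yields different futures. It is not the authors' model: the grammar here is a hand-written toy, whereas the paper learns the rules and the latent non-terminals adversarially. The names `RULES`, `sample_future`, the non-terminal labels, and the activity labels are all hypothetical.

```python
import random

# Hypothetical toy grammar (an assumption for illustration, not learned):
# each non-terminal maps to (probability, emitted activity, next non-terminal).
RULES = {
    "N0": [(0.6, "walk", "N1"), (0.4, "sit", "N2")],
    "N1": [(0.5, "open_door", "N0"), (0.5, "walk", "N1")],
    "N2": [(0.7, "read", "N2"), (0.3, "stand_up", "N0")],
}


def sample_future(start, steps, rng):
    """Sample one activity sequence by repeatedly picking a production rule."""
    state = start
    activities = []
    for _ in range(steps):
        rules = RULES[state]
        weights = [prob for prob, _, _ in rules]
        # Draw one rule according to its probability, emit its terminal,
        # and move to the non-terminal on its right-hand side.
        _, activity, state = rng.choices(rules, weights=weights, k=1)[0]
        activities.append(activity)
    return activities


if __name__ == "__main__":
    rng = random.Random(0)
    # Each draw follows a different chain of rules, so the predicted
    # futures can differ even though the starting symbol is the same.
    for _ in range(3):
        print(sample_future("N0", 5, rng))
```

Every call starts from the same non-terminal, so any variation between the printed sequences comes only from which rules were selected.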

Notes

1. For a branching factor of \(k\) rules per non-terminal and a sequence of length \(L\), there are on the order of \(k^L\) terminals and non-terminals (for \(k=2\), \(L=10\) this is \(\sim\)1000, and for \(k=3\), \(\sim\)60,000).
2. Note that most previous works used the MultiTHUMOS and Charades datasets for per-frame activity categorization; our work instead showcases a long-term activity forecasting capability.
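
The approximate counts quoted in note 1 can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function name `expansion_count` is an assumption and does not come from the paper.

```python
def expansion_count(k: int, length: int) -> int:
    """Return k**length, the growth in symbols when every non-terminal
    branches into k rules at each of `length` steps."""
    if k < 1 or length < 0:
        raise ValueError("k must be >= 1 and length must be >= 0")
    return k ** length


if __name__ == "__main__":
    for k in (2, 3):
        print(f"k={k}, L=10: {expansion_count(k, 10)}")
```

For \(L=10\) this gives 1024 for \(k=2\) and 59049 for \(k=3\), consistent with the rounded figures in the note.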

Author information

Authors and Affiliations

Robotics at Google, Mountain View, USA
A. J. Piergiovanni, Anelia Angelova, Alexander Toshev & Michael S. Ryoo
Stony Brook University, New York, USA
Michael S. Ryoo

Corresponding author

Correspondence to A. J. Piergiovanni.

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 242 KB)

Cite this paper

Piergiovanni, A.J., Angelova, A., Toshev, A., Ryoo, M.S. (2020). Adversarial Generative Grammars for Human Activity Prediction. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol 12347. Springer, Cham. https://doi.org/10.1007/978-3-030-58536-5_30

DOI: https://doi.org/10.1007/978-3-030-58536-5_30
Published: 03 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58535-8
Online ISBN: 978-3-030-58536-5
eBook Packages: Computer Science, Computer Science (R0)
