Adversarial Generative Grammars for Human Activity Prediction

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12347)


In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies and can forecast multiple, distinct future activities. Our adversarial grammar is designed to learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Selecting among multiple production rules during inference yields different predicted outcomes, which lets the model represent many plausible futures efficiently. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms state-of-the-art approaches, predicting both more accurately and further into the future than prior work. Code will be open sourced.
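To make the role of stochastic production rules concrete, the following is a minimal toy sketch, not the model proposed in the paper. It assumes a small hand-written grammar in which every rule emits one activity label and moves to a next non-terminal. The rule table `TOY_RULES`, its probabilities, the activity names, and the helper functions are all hypothetical. In the paper the rules and non-terminal representations are learned adversarially rather than fixed by hand.

```python
import random

# Hypothetical toy grammar: each non-terminal maps to a list of
# (probability, emitted activity, next non-terminal) production rules.
# These values are made up for illustration; they are not learned.
TOY_RULES = {
    "N0": [(0.6, "walk", "N1"), (0.4, "stand", "N2")],
    "N1": [(0.7, "open_door", "N2"), (0.3, "walk", "N1")],
    "N2": [(0.5, "sit", "N0"), (0.5, "drink", "N1")],
}


def sample_future(start, steps, rng):
    """Roll the grammar forward, sampling one production rule per step."""
    state, activities = start, []
    for _ in range(steps):
        rules = TOY_RULES[state]
        weights = [p for p, _, _ in rules]
        # Pick one rule according to its probability, emit its activity,
        # and continue from the non-terminal that the rule produces.
        _, activity, state = rng.choices(rules, weights=weights, k=1)[0]
        activities.append(activity)
    return activities


def sample_futures(start, steps, num_samples, seed=0):
    """Draw several futures from the same starting non-terminal."""
    rng = random.Random(seed)
    return [sample_future(start, steps, rng) for _ in range(num_samples)]


if __name__ == "__main__":
    for future in sample_futures("N0", steps=4, num_samples=3):
        print(" -> ".join(future))
```

Each call to `sample_future` starts from the same non-terminal but may select different rules, so repeated sampling produces different activity sequences. This is the sense in which rule selection at inference time gives multiple plausible futures.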




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Robotics at Google, Mountain View, USA
  2. Stony Brook University, New York, USA
