Object Level Visual Reasoning in Videos

  • Fabien Baradel
  • Natalia Neverova
  • Christian Wolf
  • Julien Mille
  • Greg Mori
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11217)


Abstract

Human activity recognition is typically addressed by detecting key concepts such as global and local motion, features related to the object classes present in the scene, and features related to the global context. The next open challenges in activity recognition require a level of understanding that goes beyond this and call for models capable of fine distinctions and a detailed comprehension of the interactions between actors and objects in a scene. We propose a model that learns to reason about semantically meaningful spatio-temporal interactions in videos. The key to our approach is to perform this reasoning at the object level, through the integration of state-of-the-art object detection networks. This allows the model to learn detailed spatial interactions that exist at a semantic, object-interaction-relevant level. We evaluate our method on three standard datasets (Twenty-BN Something-Something, VLOG, and EPIC Kitchens) and achieve state-of-the-art results on all of them. Finally, we show visualizations of the interactions learned by the model, which illustrate the object classes and interactions corresponding to different activity classes.
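The object-level reasoning described above can be pictured, very roughly, as aggregating pairwise relations between detected object features, in the spirit of relation networks. The sketch below is an illustrative assumption, not the authors' architecture: the function `object_relation_reasoning`, the single weight matrix `W`, and the mean aggregation over frames and object pairs are all hypothetical simplifications of what a trained model would learn.

```python
import numpy as np

def relu(x):
    """Elementwise rectified linear unit."""
    return np.maximum(x, 0.0)

def object_relation_reasoning(obj_feats, rng=None):
    """Toy object-level relational reasoning over a video.

    obj_feats: array of shape (T, N, D) holding D-dimensional features
    for N detected objects in each of T frames (e.g. features pooled
    from an object detector's regions).

    Returns a single D-dimensional video descriptor obtained by
    averaging a learned-style pairwise relation over all ordered
    object pairs in every frame.
    """
    T, N, D = obj_feats.shape
    rng = np.random.default_rng(0) if rng is None else rng
    # Hypothetical relation weights; in a real model these are trained.
    W = rng.standard_normal((2 * D, D)) * 0.1

    rel_sum = np.zeros(D)
    for t in range(T):
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                # Concatenate the two object features and map them
                # through a one-layer relation function.
                pair = np.concatenate([obj_feats[t, i], obj_feats[t, j]])
                rel_sum += relu(pair @ W)
    return rel_sum / (T * N * (N - 1))
```

In practice the per-object features would come from a detection network such as Mask R-CNN, the relation function would be a deeper learned module, and temporal structure would be modeled explicitly rather than averaged away; this sketch only shows the shape of reasoning over object pairs rather than whole frames.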


Keywords: Video understanding · Human-object interaction



This work was funded by grant Deepvision (ANR-15-CE23-0029, STPGP-479356-15), a joint French/Canadian call by ANR & NSERC.

Supplementary material

Supplementary material 1: 474201_1_En_7_MOESM1_ESM.pdf (3.6 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Fabien Baradel (1)
  • Natalia Neverova (2)
  • Christian Wolf (1, 3)
  • Julien Mille (4)
  • Greg Mori (5)
  1. Université Lyon, INSA Lyon, CNRS, LIRIS, Villeurbanne, France
  2. Facebook AI Research, Paris, France
  3. INRIA, CITI Laboratory, Villeurbanne, France
  4. Laboratoire d'Informatique de l'Univ. de Tours, INSA Centre Val de Loire, Blois, France
  5. Simon Fraser University, Vancouver, Canada
