Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos

  • Bingbin LiuEmail author
  • Serena Yeung
  • Edward Chou
  • De-An Huang
  • Li Fei-Fei
  • Juan Carlos Niebles
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


A major challenge in computer vision is scaling activity understanding to the long tail of complex activities without requiring collecting large quantities of data for new actions. The task of video retrieval using natural language descriptions seeks to address this through rich, unconstrained supervision about complex activities. However, while this formulation offers hope of leveraging underlying compositional structure in activity descriptions, existing approaches typically do not explicitly model compositional reasoning. In this work, we introduce an approach for explicitly and dynamically reasoning about compositional natural language descriptions of activity in videos. We take a modular neural network approach that, given a natural language query, extracts the semantic structure to assemble a compositional neural network layout and corresponding network modules. We show that this approach is able to achieve state-of-the-art results on the DiDeMo video retrieval dataset.


Video retrieval Action recognition Modular networks 



We would like to thank members in the Stanford Vision Lab as well as Lu Jiang, Mei Han and Jia Li from Google, for helpful discussions and support. We would also like to thank the anonymous reviewers, whose suggestions and comments have helped to improve the paper.


  1. 1.
    Abu-El-Haija, S., et al.: YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)
  2. 2.
    Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705 (2016)
  3. 3.
    Andreas, J., Rohrbach, M., Darrell, T., Klein, D.: Neural module networks. In: CVPR (2016)Google Scholar
  4. 4.
    Antol, S., et al.: VQA: visual question answering. In: ICCV (2015)Google Scholar
  5. 5.
    Thomee, B., et al.: YFCC100M: the new data in multimedia research. Commun. ACM 59(2), 64–73 (2016)CrossRefGoogle Scholar
  6. 6.
    Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)Google Scholar
  7. 7.
    Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, HLT 2011, vol. 1, pp. 190–200. Association for Computational Linguistics, Stroudsburg (2011).
  8. 8.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  9. 9.
    Donahue, J., et al.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)Google Scholar
  10. 10.
    Feng, X., Perona, P.: Human action recognition by sequence of movelet codewords. In: Proceedings of First International Symposium on 3D Data Processing Visualization and Transmission, pp. 717–721 (2002).
  11. 11.
    Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: a deep visual-semantic embedding model. In: NIPS (2013)Google Scholar
  12. 12.
    Gaidon, A., Harchaoui, Z., Schmid, C.: Temporal localization of actions with actoms. IEEE TPAMI 35(11), 2782–2795 (2013)CrossRefGoogle Scholar
  13. 13.
    Gu, C., et al.: AVA: a video dataset of spatio-temporally localized atomic visual actions. CoRR abs/1705.08421 (2017). arXiv:1705.08421
  14. 14.
    Guadarrama, S., et al.: YouTube2Text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: ICCV (2013)Google Scholar
  15. 15.
    Han, W., et al.: Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465 (2016)
  16. 16.
    Hendricks, L.A., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.C.: Localizing moments in video with natural language. In: ICCV (2017)Google Scholar
  17. 17.
    Hu, R., Andreas, J., Rohrbach, M., Darrell, T., Saenko, K.: Learning to reason: end-to-end module networks for visual question answering. In: ICCV (2017)Google Scholar
  18. 18.
    Hu, R., Rohrbach, M., Andreas, J., Darrell, T., Saenko, K.: Modeling relationships in referential expressions with compositional modular networks. In: CVPR (2017)Google Scholar
  19. 19.
    İkizler, N., Forsyth, D.A.: Searching for complex human activities with no visual examples. IJCV 80, 337–357 (2008)CrossRefGoogle Scholar
  20. 20.
    Jain, M., van Gemert, J.C., Mensink, T., Snoek, C.G.: Objects2action: classifying and localizing actions without any video example. In: ICCV (2015)Google Scholar
  21. 21.
    Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human action recognition. TPAMI 35(1), 221–231 (2013)CrossRefGoogle Scholar
  22. 22.
    Jiang, Y.G., et al.: THUMOS challenge: action recognition with a large number of classes (2014).
  23. 23.
    Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C.L., Girshick, R.: CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In: CVPR, pp. 1988–1997. IEEE (2017)Google Scholar
  24. 24.
    Johnson, J., et al.: Inferring and executing programs for visual reasoning. In: ICCV (2017)Google Scholar
  25. 25.
    Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR (2014)Google Scholar
  26. 26.
    Kay, W., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  27. 27.
    Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1, pp. 423–430. Association for Computational Linguistics (2003)Google Scholar
  28. 28.
    Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)Google Scholar
  29. 29.
    Li, L.J., Su, H., Fei-Fei, L., Xing, E.P.: Object bank: a high-level image representation for scene classification & semantic feature sparsification. In: NIPS (2010)Google Scholar
  30. 30.
    Li, L.-J., Su, H., Lim, Y., Fei-Fei, L.: Objects as attributes for scene classification. In: Kutulakos, K.N. (ed.) ECCV 2010. LNCS, vol. 6553, pp. 57–69. Springer, Heidelberg (2012). Scholar
  31. 31.
    Lillo, I., Niebles, J.C., Soto, A.: Sparse composition of body poses and atomic actions for human activity recognition in RGB-D videos. Image Vis. Comput. 59, 63–75 (2017)CrossRefGoogle Scholar
  32. 32.
    Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: NIPS (2016)Google Scholar
  33. 33.
    Monfort, M., et al.: Moments in time dataset: one million videos for event understanding. arXiv preprint arXiv:1801.03150 (2018)
  34. 34.
    Niebles, J.C., Chen, C.-W., Fei-Fei, L.: Modeling temporal structure of decomposable motion segments for activity classification. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6312, pp. 392–405. Springer, Heidelberg (2010). Scholar
  35. 35.
    Peng, X., Schmid, C.: Multi-region two-stream R-CNN for action detection. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 744–759. Springer, Cham (2016). Scholar
  36. 36.
    Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: EMNLP (2014)Google Scholar
  37. 37.
    Poppe, R.: A survey on vision-based human action recognition. Image Vis. Comput. 28(6), 976–990 (2010)CrossRefGoogle Scholar
  38. 38.
    Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B., Pinkal, M.: Grounding action descriptions in videos. Trans. Assoc. Comput. Linguist. (2013)Google Scholar
  39. 39.
    Rohrbach, A., Torabi, A., Rohrbach, M., Tandon, N., Pal, C., Larochelle, H., Courville, A., Schiele, B.: Movie description. IJCV 123(1), 94–120 (2017)CrossRefGoogle Scholar
  40. 40.
    Rohrbach, M., et al.: Recognizing fine-grained and composite activities using hand-centric features and script data. IJCV 119, 346–373 (2016)MathSciNetCrossRefGoogle Scholar
  41. 41.
    Sadanand, S., Corso, J.J.: Action bank: a high-level representation of activity in video. In: CVPR (2012)Google Scholar
  42. 42.
    Schiele, B.: A database for fine grained activity detection of cooking activities. In: CVPR (2012)Google Scholar
  43. 43.
    Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM International Conference on Multimedia, MM 2007 (2007)Google Scholar
  44. 44.
    Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 510–526. Springer, Cham (2016). Scholar
  45. 45.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)Google Scholar
  46. 46.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR (2015)Google Scholar
  47. 47.
    Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: ICCV (2003)Google Scholar
  48. 48.
    Socher, R., Karpathy, A., Le, Q.V., Manning, C.D., Ng, A.Y.: Grounded compositional semantics for finding and describing images with sentences. Trans. Assoc. Comput. Linguist. (2014)Google Scholar
  49. 49.
    Socher, R., Lin, C.C., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language with recursive neural networks. In: ICML (2011)Google Scholar
  50. 50.
    Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013)Google Scholar
  51. 51.
    Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  52. 52.
    Torabi, A., Tandon, N., Sigal, L.: Learning language-visual embedding for movie understanding with natural-language. arXiv preprint arXiv:1609.08124 (2016)
  53. 53.
    Tran, A., Cheong, L.F.: Two-stream flow-guided convolutional attention networks for action recognition. arXiv preprint arXiv:1708.09268 (2017)
  54. 54.
    Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: C3D: generic features for video analysis. CoRR abs/1412.0767 (2014). arXiv:1412.0767
  55. 55.
    Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In: CVPR (2011)Google Scholar
  56. 56.
    Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). Scholar
  57. 57.
    Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115, 224–241 (2011)CrossRefGoogle Scholar
  58. 58.
    Xiao, F., Sigal, L., Lee, Y.J.: Weakly-supervised visual grounding of phrases with linguistic structures. In: CVPR (2017)Google Scholar
  59. 59.
    Xu, J., Mei, T., Yao, T., Rui, Y.: MSR-VTT: a large video description dataset for bridging video and language. In: CVPR (2016)Google Scholar
  60. 60.
    Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for image question answering. In: CVPR (2016)Google Scholar
  61. 61.
    Yao, L., et al.: Describing videos by exploiting temporal structure. In: ICCV (2015)Google Scholar
  62. 62.
    Yeung, S., Fathi, A., Fei-Fei, L.: VideoSET: video summary evaluation through text. arXiv preprint arXiv:1406.5824 (2014)
  63. 63.
    Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W.: Video paragraph captioning using hierarchical recurrent neural networks. In: CVPR (2016)Google Scholar
  64. 64.
    Zacks, J.M., Tversky, B.: Event structure in perception and conception. Psychol. Bull. 127, 3–21 (2001)CrossRefGoogle Scholar
  65. 65.
    Zellers, R., Choi, Y.: Zero-shot activity recognition with verb attribute induction. arXiv preprint arXiv:1707.09468 (2017)
  66. 66.
    Zhao, H., Yan, Z., Wang, H., Torresani, L., Torralba, A.: SLAC: a sparsely labeled dataset for action classification and localization. arXiv preprint arXiv:1712.09374 (2017)
  67. 67.
    Zhou, L., Xu, C., Corso, J.J.: ProcNets: learning to segment procedures in untrimmed and unconstrained videos. CoRR abs/1703.09788 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Stanford UniversityStanfordUSA
  2. 2.Google Cloud AIMountain ViewUSA

Personalised recommendations