Attention-Based Two-Phase Model for Video Action Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10425)


This paper considers the task of action detection in long untrimmed videos. Existing methods tend to process every frame or fragment of the whole video to make detection decisions, which is not only time-consuming but also computationally burdensome. Instead, we present an attention-based model that performs action detection by watching only a few fragments; its cost is independent of the video length, so it can be applied to real-world videos. Our motivation is inspired by the observation that humans usually focus their attention sequentially on different frames of a video to quickly narrow down the extent where an action occurs. Our model is a two-phase architecture: in the first phase, a temporal proposal network predicts temporal proposals for multi-category actions. The temporal proposal network observes a fixed number of locations in a video to predict action bounds and learns a location-transfer policy. In the second phase, a well-trained classifier extracts visual information from the proposals, classifies the action, and decides whether to adopt each proposal. We evaluate our model on the ActivityNet dataset and show that it significantly outperforms the baseline.
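The two-phase scheme described in the abstract can be illustrated with a minimal toy sketch. This is not the paper's implementation: the function names are hypothetical, the observation budget is a fixed number of glimpses (so the cost does not grow with video length), and the location-transfer policy, which the paper learns (e.g. with policy-gradient training), is replaced here by random jumps for brevity.

```python
import random

def detect_actions(video, score_fn, num_glimpses=5, half_width=2, seed=0):
    """Toy sketch of a two-phase attention detector.

    Phase 1: glimpse a fixed number of fragments (independent of the
    video length), emitting a (start, end) proposal at each glimpse and
    then transferring attention to a new location.  Phase 2: a trained
    classifier scores each proposal; only confident ones are adopted.
    The random jump below is an illustrative stand-in for a learned
    location-transfer policy.
    """
    rng = random.Random(seed)
    n = len(video)
    loc = n // 2                      # start attention at the video middle
    proposals = []
    for _ in range(num_glimpses):
        # Predict action bounds around the currently attended fragment.
        start = max(0, loc - half_width)
        end = min(n - 1, loc + half_width)
        proposals.append((start, end))
        loc = rng.randrange(n)        # location-transfer step (stand-in)
    # Phase 2: adopt proposals whose classifier score clears a threshold.
    return [(p, score_fn(p)) for p in proposals if score_fn(p) >= 0.5]
```

A usage sketch: `score_fn` stands in for the second-phase classifier, e.g. scoring a proposal 1.0 if it covers a ground-truth action frame and 0.0 otherwise; the detector then returns only the adopted proposals with their scores.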



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University, Shenzhen, China
