Recurrent Tubelet Proposal and Recognition Networks for Action Detection

  • Dong Li
  • Zhaofan Qiu
  • Qi Dai
  • Ting Yao
  • Tao Mei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11210)

Abstract

Detecting actions in videos is a challenging task, as video is an information-intensive medium with complex variations. Existing approaches predominantly generate action proposals for each individual frame or fixed-length clip independently, overlooking the temporal context across them. Such temporal contextual relations are vital for action detection, as an action is by nature a sequence of movements. This motivates us to leverage the localized action proposals of previous frames when determining action regions in the current one. Specifically, we present a novel deep architecture called Recurrent Tubelet Proposal and Recognition (RTPR) networks to incorporate temporal context for action detection. The proposed RTPR consists of two correlated networks, i.e., Recurrent Tubelet Proposal (RTP) networks and Recurrent Tubelet Recognition (RTR) networks. The RTP initializes action proposals in the start frame through a Region Proposal Network and then estimates the movements of the proposals in the next frame in a recurrent manner. The action proposals of different frames are linked to form the tubelet proposals. The RTR capitalizes on a multi-channel architecture, where in each channel a tubelet proposal is fed into a CNN plus LSTM to recurrently recognize the action in the tubelet. We conduct extensive experiments on four benchmark datasets and demonstrate superior results over state-of-the-art methods. More remarkably, we obtain mAP of 98.6%, 81.3%, 77.9% and 22.3% with gains of 2.9%, 4.3%, 0.7% and 3.9% over the best competitors on UCF-Sports, J-HMDB, UCF-101 and AVA, respectively.
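To make the two recurrent stages concrete, below is a minimal PyTorch sketch of the idea as described in the abstract. It is not the authors' implementation: the module names, feature dimensions, the GRU/LSTM cell choices, and the additive box-offset parameterization are illustrative assumptions. Only the outline follows the abstract, i.e., RTP recurrently regresses the frame-to-frame movement of RPN-initialized proposals, and RTR classifies a linked tubelet with a CNN-plus-LSTM channel.

    # Minimal sketch (not the authors' code) of the RTPR two-network idea.
    # All names, dimensions, and the offset parameterization are assumptions
    # made for illustration.
    import torch
    import torch.nn as nn

    class RecurrentTubeletProposal(nn.Module):
        """RTP sketch: given features pooled inside the previous frame's
        proposal boxes, recurrently regress each box's movement to the
        current frame."""
        def __init__(self, feat_dim=512, hidden_dim=256):
            super().__init__()
            # Box coordinates are appended to the pooled feature.
            self.rnn = nn.GRUCell(feat_dim + 4, hidden_dim)
            self.movement = nn.Linear(hidden_dim, 4)  # (dx, dy, dw, dh) offsets

        def step(self, pooled_feat, prev_boxes, hidden):
            # pooled_feat: (N, feat_dim) features pooled inside prev_boxes
            # prev_boxes:  (N, 4) proposals localized in the previous frame
            x = torch.cat([pooled_feat, prev_boxes], dim=1)
            hidden = self.rnn(x, hidden)
            # Shifted boxes become the proposals for the current frame.
            return prev_boxes + self.movement(hidden), hidden

    class RecurrentTubeletRecognition(nn.Module):
        """RTR sketch: one channel that classifies a tubelet by feeding
        per-frame CNN features of the linked proposals into an LSTM."""
        def __init__(self, feat_dim=512, hidden_dim=256, num_classes=24):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, num_classes)

        def forward(self, tubelet_feats):
            # tubelet_feats: (N, T, feat_dim) features along a tubelet proposal
            out, _ = self.lstm(tubelet_feats)
            return self.classifier(out[:, -1])  # action scores at the last step

    # Usage sketch with dummy tensors:
    # rtp = RecurrentTubeletProposal()
    # boxes, h = rtp.step(torch.zeros(8, 512), torch.zeros(8, 4), torch.zeros(8, 256))
    # rtr = RecurrentTubeletRecognition()
    # scores = rtr(torch.zeros(8, 16, 512))

In the full model, the pooled features would come from a CNN backbone over each frame, the shifted boxes of consecutive frames are linked into tubelet proposals, and the scores of several recognition channels (e.g., over different input modalities) are combined; those steps are abbreviated here to the two recurrent cores.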

Keywords

Action detection · Action recognition

Supplementary material

Supplementary material 1: 474211_1_En_19_MOESM1_ESM.pdf (PDF, 513 KB)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Dong Li (1)
  • Zhaofan Qiu (1)
  • Qi Dai (2), corresponding author
  • Ting Yao (3)
  • Tao Mei (3)
  1. University of Science and Technology of China, Hefei, China
  2. Microsoft Research, Beijing, China
  3. JD AI Research, Beijing, China