DAPs: Deep Action Proposals for Action Understanding

  • Victor Escorcia
  • Fabian Caba Heilbron
  • Juan Carlos Niebles
  • Bernard Ghanem
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9907)

Abstract

Object proposals have contributed significantly to recent advances in object understanding in images. Inspired by the success of this approach, we introduce Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. We show how to take advantage of the vast capacity of deep learning models and memory cells to retrieve, from untrimmed videos, temporal segments that are likely to contain actions. A comprehensive evaluation indicates that our approach outperforms previous work on a large-scale action benchmark, runs at 134 FPS, making it practical for large-scale scenarios, and exhibits an appealing ability to generalize, i.e., to retrieve good-quality temporal proposals for actions unseen during training.
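DAPs emits, in a single pass over a video, a set of temporal segments with confidence scores; like other proposal methods, overlapping candidates are typically pruned before the ranked list is evaluated. The sketch below is a minimal NumPy illustration of that pruning step, greedy temporal non-maximum suppression over (start, end, score) proposals. It is not the authors' code; the function name, toy data, and the 0.5 IoU threshold are assumptions for the example.

```python
import numpy as np

def temporal_nms(segments, iou_threshold=0.5):
    """Greedy non-maximum suppression over 1-D temporal proposals.

    `segments` is an (N, 3) array of (start, end, score) rows; higher
    score is better. Returns the kept rows, in descending score order.
    """
    segments = np.asarray(segments, dtype=float)
    order = segments[:, 2].argsort()[::-1]  # indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        s1, e1 = segments[i, 0], segments[i, 1]
        rest = order[1:]
        # temporal intersection-over-union against the remaining segments
        inter = np.maximum(0.0,
                           np.minimum(e1, segments[rest, 1]) -
                           np.maximum(s1, segments[rest, 0]))
        union = (e1 - s1) + (segments[rest, 1] - segments[rest, 0]) - inter
        iou = inter / np.maximum(union, 1e-8)
        order = rest[iou < iou_threshold]  # drop heavily overlapping ones
    return segments[keep]

# Toy proposals: (start_frame, end_frame, confidence).
props = np.array([[0.0, 100.0, 0.9],
                  [10.0, 110.0, 0.8],    # overlaps the first (IoU ~0.82)
                  [200.0, 320.0, 0.7]])
kept = temporal_nms(props, iou_threshold=0.5)
# kept contains the first and third proposals; the second is suppressed.
```

A low IoU threshold yields a more diverse but shorter proposal list; recall-vs-number-of-proposals curves, as used in the paper's evaluation, make that trade-off explicit.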

Keywords

Action proposals, Action detection, Long short-term memory

Supplementary material

419975_1_En_47_MOESM1_ESM.mp4 (47.1 MB)
Supplementary material 1 (mp4 48180 KB)
419975_1_En_47_MOESM2_ESM.pdf (175 kb)
Supplementary material 2 (pdf 175 KB)


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Victor Escorcia (1)
  • Fabian Caba Heilbron (1)
  • Juan Carlos Niebles (2, 3)
  • Bernard Ghanem (1)

  1. King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  2. Stanford University, Stanford, USA
  3. Universidad del Norte, Barranquilla, Colombia
