
Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network

  • Conference paper
  • In: Neural Information Processing (ICONIP 2018)
  • Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11304)

Abstract

Weakly supervised temporal action detection in untrimmed videos is an important yet challenging task, in which only video-level class labels are available during training for temporally locating actions in the videos. In this paper, we propose a novel architecture for this task. Specifically, we put forward an effective shot-based sampling method that generates a compact yet representative feature sequence for action detection, instead of uniform sampling, which retains many irrelevant frames. Furthermore, to distinguish the action instances present in a video, we design a multi-stage Temporal Pooling Network (TPN) that predicts video-level categories and localizes class-specific action instances, respectively. Experiments conducted on the THUMOS14 dataset confirm that our method outperforms other state-of-the-art weakly supervised approaches.

This research has been supported by the NSFC Program (61673269, 61273285).
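To make the two ideas named in the abstract concrete, the following is a minimal, illustrative sketch of shot-based sampling over a frame-feature sequence and of a temporal-pooling head supervised only by video-level labels. All names, the feature-difference shot-boundary criterion, the per-shot mean pooling, and the top-k temporal pooling are assumptions made for illustration; they are not the authors' implementation, whose exact shot detection and multi-stage TPN design are described in the full paper.

    import numpy as np
    import torch
    import torch.nn as nn

    def shot_based_sampling(frame_feats: np.ndarray, thresh: float = 0.5) -> np.ndarray:
        """Split a frame-feature sequence into shots at large feature jumps,
        then keep one mean-pooled feature per shot (illustrative criterion)."""
        diffs = np.linalg.norm(np.diff(frame_feats, axis=0), axis=1)    # frame-to-frame change
        cuts = [0] + [i + 1 for i, d in enumerate(diffs) if d > thresh] + [len(frame_feats)]
        shots = [frame_feats[s:e] for s, e in zip(cuts[:-1], cuts[1:]) if e > s]
        return np.stack([shot.mean(axis=0) for shot in shots])          # (num_shots, feat_dim)

    class TemporalPoolingHead(nn.Module):
        """Toy weakly supervised head: per-shot class activations are pooled
        over time into a video-level score, so only video-level labels are needed."""
        def __init__(self, feat_dim: int, num_classes: int, k: int = 3):
            super().__init__()
            self.fc = nn.Linear(feat_dim, num_classes)  # per-shot class scores
            self.k = k                                  # top-k temporal pooling

        def forward(self, shot_feats: torch.Tensor):
            cas = self.fc(shot_feats)                               # (num_shots, num_classes)
            k = min(self.k, cas.shape[0])
            video_logits = cas.topk(k, dim=0).values.mean(dim=0)   # (num_classes,)
            return video_logits, cas   # threshold cas at test time to localize instances

    # Example: 200 frames of 1024-D features, 20 action classes.
    feats = shot_based_sampling(np.random.rand(200, 1024).astype("float32"))
    video_logits, cas = TemporalPoolingHead(1024, 20)(torch.from_numpy(feats))

Training such a head would minimize a standard multi-label classification loss on the pooled video-level scores against the video-level labels, which is the only supervision the task assumes.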



Author information

Corresponding author

Correspondence to Xu Zhao.



Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Su, H., Zhao, X., Lin, T., Fei, H. (2018). Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network. In: Cheng, L., Leung, A., Ozawa, S. (eds.) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol. 11304. Springer, Cham. https://doi.org/10.1007/978-3-030-04212-7_37


  • DOI: https://doi.org/10.1007/978-3-030-04212-7_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04211-0

  • Online ISBN: 978-3-030-04212-7

  • eBook Packages: Computer Science, Computer Science (R0)
