Abstract
Weakly supervised temporal action detection in untrimmed videos is an important yet challenging task, where only video-level class labels are available during training for temporally locating actions in the videos. In this paper, we propose a novel architecture for this task. Specifically, we put forward an effective shot-based sampling method that generates a simplified yet representative feature sequence for action detection, in contrast to uniform sampling, which retains a large number of irrelevant frames. Furthermore, to distinguish the action instances present in a video, we design a multi-stage Temporal Pooling Network (TPN) that predicts video categories and localizes class-specific action instances, respectively. Experiments conducted on the THUMOS14 dataset confirm that our method outperforms other state-of-the-art weakly supervised approaches.
This research has been supported by NSFC Program (61673269, 61273285).
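The abstract only names the shot-based sampling step, so the following minimal Python sketch illustrates one plausible form of such sampling: segmenting a frame-level feature sequence into shots at large feature discontinuities and keeping a single pooled feature per shot, rather than sampling frames uniformly. The cosine-distance boundary test, the threshold value, and the mean pooling are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def shot_based_sampling(frame_feats, diff_thresh=0.5, max_shots=400):
    """Split a frame-level feature sequence into shots at large feature
    discontinuities, then keep one representative (mean) feature per shot.

    frame_feats: (T, D) array of per-frame features (e.g. two-stream CNN
    features); the feature extractor itself is outside this sketch.
    """
    # Cosine distance between consecutive frame features.
    normed = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    dists = 1.0 - np.sum(normed[1:] * normed[:-1], axis=1)

    # Frames where the distance jumps above the threshold start a new shot.
    boundaries = np.where(dists > diff_thresh)[0] + 1
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(frame_feats)]))

    # One representative feature per shot: the mean over the shot's frames.
    shot_feats = np.stack([frame_feats[s:e].mean(axis=0) for s, e in zip(starts, ends)])
    return shot_feats[:max_shots], list(zip(starts, ends))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "video": three segments with distinct feature statistics.
    feats = np.concatenate([rng.normal(m, 0.1, size=(50, 64)) for m in (0.0, 1.0, -1.0)])
    shot_feats, shots = shot_based_sampling(feats)
    print(len(shots), "shots detected; shot-level feature shape:", shot_feats.shape)
```

Under these assumptions, the shot-level sequence fed to the detection network is far shorter than a uniformly sampled one while still covering every shot of the video, which matches the motivation stated in the abstract.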
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Su, H., Zhao, X., Lin, T., Fei, H. (2018). Weakly Supervised Temporal Action Detection with Shot-Based Temporal Pooling Network. In: Cheng, L., Leung, A., Ozawa, S. (eds.) Neural Information Processing. ICONIP 2018. Lecture Notes in Computer Science, vol. 11304. Springer, Cham. https://doi.org/10.1007/978-3-030-04212-7_37
DOI: https://doi.org/10.1007/978-3-030-04212-7_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04211-0
Online ISBN: 978-3-030-04212-7