
Spatial–temporal correlations learning and action-background jointed attention for weakly-supervised temporal action localization

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Weakly supervised temporal action localization (W-TAL) aims to detect and classify all action instances in an untrimmed video using only video-level labels. Because frame-level annotations are unavailable, learning the correlations between action snippets and separating action from background are the two key issues for accurate action localization. To mine the intrinsic spatial and temporal correlations embodied in the occurrences of actions in a video, and to identify action and background within the snippets, a novel method based on spatial–temporal correlations learning and action-background jointed attention for W-TAL is proposed. In this method, a graph convolution network and a 1-D temporal convolution network are constructed to learn the spatial and temporal features of the video, respectively; these features are then fused into a rich spatial–temporal correlative feature map, providing a more complete feature representation for action localization. Next, unlike other methods, an action-background jointed attention mechanism is presented that explicitly models the background as well as the action in a three-branch classification network. This classification network distinguishes action from background more reliably, achieving a better separation of the two and thereby promoting more accurate action localization. Experiments conducted on Thumos14 and ActivityNet1.3 show that our method outperforms state-of-the-art methods, especially at high t-IoU thresholds, which further validates its effectiveness.
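The pipeline sketched in the abstract (graph convolution for spatial correlations, 1-D temporal convolution, feature fusion, and a per-snippet action-background attention) can be illustrated with a minimal NumPy toy example. This is a sketch of the general idea only, not the authors' implementation: the similarity-based adjacency, the kernel size, the feature dimensions, and the sigmoid attention weight `lam` are all assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, C = 8, 16, 4  # snippets, feature dim, action classes

X = rng.standard_normal((T, D))  # per-snippet features

# Graph convolution over a snippet-similarity graph (spatial branch).
# Adjacency from pairwise feature similarity, row-normalized.
A = np.exp(X @ X.T / np.sqrt(D))
A_hat = A / A.sum(axis=1, keepdims=True)
W_g = rng.standard_normal((D, D)) * 0.1
F_spatial = np.maximum(A_hat @ X @ W_g, 0)  # ReLU

# 1-D temporal convolution along the snippet axis (temporal branch),
# kernel size 3 with "same" zero padding.
W_t = rng.standard_normal((3, D, D)) * 0.1
X_pad = np.pad(X, ((1, 1), (0, 0)))
F_temporal = np.maximum(sum(X_pad[k:k + T] @ W_t[k] for k in range(3)), 0)

# Fuse the two branches into a spatial-temporal feature map.
F = np.concatenate([F_spatial, F_temporal], axis=1)  # shape (T, 2D)

# Action-background jointed attention: a per-snippet weight lam in (0, 1)
# scales the action (foreground) score; (1 - lam) scales the background score.
w_att = rng.standard_normal((2 * D,)) * 0.1
lam = 1.0 / (1.0 + np.exp(-(F @ w_att)))  # sigmoid attention

W_cls = rng.standard_normal((2 * D, C + 1)) * 0.1  # C actions + background
logits = F @ W_cls                                  # shape (T, C+1)
fg = (lam[:, None] * logits).mean(axis=0)           # attention-weighted action score
bg = ((1 - lam)[:, None] * logits).mean(axis=0)     # complementary background score
```

In a real system the branch weights would be trained end-to-end from the video-level labels, and the foreground/background scores would feed separate branches of the classification loss; the toy example only shows how the fused feature map and the complementary attention weights fit together.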



Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No.61672268).

Corresponding author

Correspondence to Yongzhao Zhan.

Additional information

Communicated by B-K Bao.



About this article


Cite this article

Xia, H., Zhan, Y. & Cheng, K. Spatial–temporal correlations learning and action-background jointed attention for weakly-supervised temporal action localization. Multimedia Systems 28, 1529–1541 (2022). https://doi.org/10.1007/s00530-022-00912-y
