Abstract
Existing temporal action detection (TAD) methods rely on large amounts of training data with segment-level annotations. Collecting and annotating such a training set is thus highly expensive and unscalable. Semi-supervised TAD (SS-TAD) alleviates this problem by leveraging unlabeled videos freely available at scale. However, SS-TAD is also a much more challenging problem than supervised TAD, and consequently remains much under-studied. Prior SS-TAD methods directly combine an existing proposal-based TAD method with an SSL method. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to proposal error propagation. To overcome this limitation, in this work we propose a novel Semi-supervised Temporal action detection model based on PrOposal-free Temporal mask (SPOT), with a parallel localization (mask generation) and classification architecture. Such a novel design effectively eliminates the dependence between localization and classification by cutting off the route for error propagation between them. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our SPOT outperforms state-of-the-art alternatives, often by a large margin. The PyTorch implementation of SPOT is available at https://github.com/sauradip/SPOT
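To make the parallel design concrete, below is a minimal PyTorch sketch of the idea: a shared snippet encoder feeds a classification head and a foreground-mask (localization) head side by side, so neither branch consumes the other's output. All module names, layer choices, and dimensions here are illustrative assumptions, not the architecture from the official repository.

```python
# A minimal sketch (not the official SPOT code) of a parallel design: a shared
# snippet encoder feeds a classification head and a foreground-mask head side
# by side, so neither branch is conditioned on the other's output.
import torch
import torch.nn as nn


class ParallelMaskClsHead(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=256, num_classes=200):
        super().__init__()
        # Shared temporal encoder over pre-extracted snippet features.
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, embed_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Classification branch: per-snippet class logits (+1 background slot).
        self.cls_head = nn.Conv1d(embed_dim, num_classes + 1, kernel_size=1)
        # Localization branch: per-snippet foreground probability in [0, 1].
        self.mask_head = nn.Sequential(
            nn.Conv1d(embed_dim, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, snippet_feats):
        # snippet_feats: (batch, feat_dim, num_snippets)
        shared = self.encoder(snippet_feats)
        cls_logits = self.cls_head(shared)  # (batch, num_classes + 1, T)
        fg_mask = self.mask_head(shared)    # (batch, 1, T)
        return cls_logits, fg_mask


if __name__ == "__main__":
    model = ParallelMaskClsHead()
    x = torch.randn(2, 2048, 100)  # two videos, 100 snippets each
    cls_logits, fg_mask = model(x)
    print(cls_logits.shape, fg_mask.shape)
```

Because the two heads share only the encoder features, an error in the mask branch cannot corrupt the inputs of the classification branch (and vice versa), which is the property the abstract contrasts with sequential proposal-then-classify pipelines.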
Notes
1. Note that instead of contributing a novel generic SSL algorithm, we propose a new TAD architecture designed specifically to facilitate the use of existing SSL methods (e.g., pseudo-labeling) by minimizing localization error propagation. A generic pseudo-labeling routine of this kind is sketched after this note.
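For concreteness, a confidence-thresholded pseudo-labeling loss in the spirit of Lee (2013) and FixMatch can be sketched as below; it converts the model's confident predictions on unlabeled snippets into hard training targets. The function name, threshold value, and tensor shapes are assumptions for illustration, not the paper's exact recipe.

```python
# A generic confidence-thresholded pseudo-labeling loss (in the spirit of
# Lee 2013 / FixMatch), sketched only to illustrate the SSL routine the note
# refers to; the function name, threshold, and shapes are assumptions.
import torch
import torch.nn.functional as F


def pseudo_label_loss(unlabeled_logits, threshold=0.9):
    # unlabeled_logits: (batch, num_classes, T) per-snippet class logits.
    # In practice pseudo labels would typically come from a weakly augmented
    # view or a teacher model; here the same logits are reused for simplicity.
    with torch.no_grad():
        probs = unlabeled_logits.softmax(dim=1)
        conf, hard_labels = probs.max(dim=1)  # both (batch, T)
        keep = conf >= threshold              # drop low-confidence snippets
    if not keep.any():
        return unlabeled_logits.new_zeros(())
    # Cross-entropy against the model's own confident predictions.
    return F.cross_entropy(
        unlabeled_logits.permute(0, 2, 1)[keep],  # (num_kept, num_classes)
        hard_labels[keep],                        # (num_kept,)
    )
```

In a sequential proposal-based pipeline, these pseudo labels inherit any proposal errors; a parallel design lets the mask and class pseudo labels be filtered independently.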