SODA: Weakly Supervised Temporal Action Localization Based on Astute Background Response and Self-Distillation Learning

Abstract

Weakly supervised temporal action localization is a practical yet challenging task. Although great efforts have been made in recent years, existing methods still have limited capacity to deal with the challenges of over-localization, joint-localization, and under-localization. Based on our investigation, the first two challenges arise from an insufficient ability to suppress background response, while the third challenge is due to a failure to discover all action frames. To better address these challenges, we first propose the astute background response strategy. By enforcing the classification target of the background category to be zero, this strategy creates a conductive effect between video-level classification and frame-level classification, guiding the action categories to suppress their responses at background frames and thereby helping to address the over-localization and joint-localization challenges. To alleviate the under-localization challenge, we introduce the self-distillation learning strategy. It simultaneously learns one master network and multiple auxiliary networks, where the auxiliary networks enhance the master network to discover complete action frames. Experimental results on three benchmarks demonstrate the favorable performance of the proposed method against previous counterparts and its efficacy in tackling the three existing challenges.
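To make the two strategies concrete, the following is a minimal, dependency-free sketch of the ideas paraphrased from the abstract, not the authors' actual implementation: a video-level classification loss in which the background category (appended as the last class) has its target forced to zero, so minimizing the loss suppresses background responses at the frames that feed the pooling; and a self-distillation term that averages a KL divergence between the master network's predictions and those of the auxiliary networks. All function names, the top-k pooling choice, and the KL direction are illustrative assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def video_scores(frame_scores, k):
    # Aggregate frame-level class scores into video-level scores by
    # averaging each class's top-k frame responses (MIL-style pooling,
    # an illustrative choice).
    n_classes = len(frame_scores[0])
    agg = []
    for c in range(n_classes):
        col = sorted((f[c] for f in frame_scores), reverse=True)
        agg.append(sum(col[:k]) / k)
    return agg

def soda_classification_loss(frame_scores, action_labels, k=2):
    # action_labels: multi-hot vector over action classes (no background).
    # The background class is appended as the LAST class with a
    # video-level target of zero, so minimizing the cross-entropy
    # pushes the background response down at the frames feeding the
    # top-k pool (the "astute background response" idea, paraphrased).
    target = list(action_labels) + [0.0]      # background target = 0
    z = sum(target) or 1.0
    target = [t / z for t in target]          # normalized label distribution
    probs = softmax(video_scores(frame_scores, k))
    return -sum(t * math.log(p + 1e-12) for t, p in zip(target, probs))

def self_distillation_loss(master_probs, auxiliary_probs_list):
    # Average KL(master || auxiliary) over the auxiliary heads; the
    # auxiliary networks' predictions regularize the master network
    # toward discovering complete action frames. The KL direction is
    # an assumption here.
    def kl(p, q):
        return sum(pi * math.log((pi + 1e-12) / (qi + 1e-12))
                   for pi, qi in zip(p, q))
    return (sum(kl(master_probs, a) for a in auxiliary_probs_list)
            / len(auxiliary_probs_list))
```

With classes ordered as [action_0, action_1, background], a video whose frames score high on its labeled action and low on background incurs a smaller loss than one whose frames respond strongly to background, which is exactly the pressure the strategy is meant to exert.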




Author information


Correspondence to Junwei Han or Dingwen Zhang.


This work was supported by the National Natural Science Foundation of China under Grants 61876140 and U1801265, the Key-Area Research and Development Program of Guangdong Province (2019B010110001), and the Research Funds for Interdisciplinary Subject, NWPU.

Communicated by Dong Xu.


About this article


Cite this article

Zhao, T., Han, J., Yang, L. et al. SODA: Weakly Supervised Temporal Action Localization Based on Astute Background Response and Self-Distillation Learning. Int J Comput Vis 129, 2474–2498 (2021). https://doi.org/10.1007/s11263-021-01473-9


Keywords

  • Temporal action localization
  • Background response
  • Self-distillation learning