Bottom-Up Temporal Action Localization with Mutual Regularization

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12353)

Abstract

Recently, temporal action localization (TAL), i.e., finding specific action segments in untrimmed videos, has attracted increasing attention from the computer vision community. State-of-the-art solutions for TAL involve evaluating the frame-level probabilities of three action-indicating phases, i.e., starting, continuing, and ending, and then post-processing these predictions for the final localization. This paper delves deep into this mechanism and argues that existing methods, by modeling these phases as individual classification tasks, ignore the potential temporal constraints between them. This can lead to incorrect and/or inconsistent predictions when some frames of the video input lack sufficient discriminative information. To alleviate this problem, we introduce two regularization terms that mutually regularize the learning procedure: the Intra-phase Consistency (IntraC) regularization enforces consistency among the predictions within each phase, and the Inter-phase Consistency (InterC) regularization keeps the predictions of different phases consistent with one another. Jointly optimizing these two terms makes the entire framework aware of these potential constraints during end-to-end optimization. Experiments are performed on two popular TAL datasets, THUMOS14 and ActivityNet1.3. Our approach clearly outperforms the baseline both quantitatively and qualitatively. The proposed regularization also generalizes to other TAL methods (e.g., TSA-Net and PGCN). Code: https://github.com/PeisenZhao/Bottom-Up-TAL-with-MR.
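
To make the mechanism concrete, the sketch below shows one way such consistency terms could be attached to the usual frame-level losses in PyTorch. This is an illustrative interpretation, not the paper's exact formulation (the linked repository contains the real losses): the function names, the variance-based IntraC, the difference-matching InterC, and the 0.1 weights are all assumptions made here for exposition.

```python
import torch

def intra_consistency(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """IntraC (sketch): within each contiguous run of frames that share the
    same ground-truth label, penalize the variance of the predicted
    probabilities so that predictions agree inside that region."""
    # indices where the ground-truth label changes between adjacent frames
    change = torch.nonzero(labels[1:] != labels[:-1]).flatten() + 1
    starts = torch.cat([change.new_zeros(1), change])
    ends = torch.cat([change, change.new_tensor([labels.numel()])])
    loss, regions = probs.new_zeros(()), 0
    for s, e in zip(starts.tolist(), ends.tolist()):
        if e - s > 1:  # variance needs at least two frames
            loss = loss + probs[s:e].var(unbiased=False)
            regions += 1
    return loss / max(regions, 1)

def inter_consistency(p_start: torch.Tensor, p_action: torch.Tensor,
                      p_end: torch.Tensor) -> torch.Tensor:
    """InterC (sketch): the frame-to-frame change of the continuing
    (actionness) curve should agree with start/end evidence: a sharp
    rise suggests a start nearby, a sharp drop suggests an end."""
    delta = p_action[1:] - p_action[:-1]
    return torch.mean((delta - (p_start[1:] - p_end[1:])) ** 2)

# Toy usage: three per-frame probability curves over T frames.
T = 128
p_start, p_action, p_end = (torch.rand(T, requires_grad=True) for _ in range(3))
action_labels = (torch.arange(T) // 32) % 2  # toy actionness ground truth
cls_loss = torch.zeros(())  # stand-in for the usual frame-level BCE losses
loss = (cls_loss
        + 0.1 * intra_consistency(p_action, action_labels)
        + 0.1 * inter_consistency(p_start, p_action, p_end))
loss.backward()
```

The intuition carried over from the abstract: IntraC ties together predictions that should agree within one phase, while InterC couples the three curves so that, for example, actionness cannot rise sharply without start evidence nearby.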

Keywords

Action localization · Action proposals · Mutual regularization

Notes

This work is supported by the National Key Research and Development Program of China (No. 2019YFB1804304), SHEITC (No. 2018-RGZN-02046), 111 plan (No. BP0719010), STCSM (No. 18DZ2270700), and the State Key Laboratory of UHD Video and Audio Production and Presentation.

Supplementary material

Supplementary material 1: 504445_1_En_32_MOESM1_ESM.zip (2.3 MB)

References

  1. Abu-El-Haija, S., et al.: YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)
  2. Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-NMS: improving object detection with one line of code. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 5561–5569 (2017)
  3. Buch, S., Escorcia, V., Shen, C., Ghanem, B., Carlos Niebles, J.: SST: single-stream temporal action proposals. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2911–2920 (2017)
  4. Caba Heilbron, F., Carlos Niebles, J., Ghanem, B.: Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1914–1923 (2016)
  5. Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–970 (2015)
  6. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6299–6308 (2017)
  7. Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the Faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1130–1139 (2018)
  8. Dai, X., Singh, B., Zhang, G., Davis, L.S., Qiu Chen, Y.: Temporal context network for activity localization in videos. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5793–5802 (2017)
  9. Escorcia, V., Caba Heilbron, F., Niebles, J.C., Ghanem, B.: DAPs: deep action proposals for action understanding. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 768–784. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_47
  10. Gan, C., Sun, C., Duan, L., Gong, B.: Webly-supervised video recognition by mutually voting for relevant web images and web video frames. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 849–866. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_52
  11. Gan, C., Yao, T., Yang, K., Yang, Y., Mei, T.: You lead, we exceed: labor-free video concept learning by jointly exploiting web videos and images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
  12. Gao, J., Chen, K., Nevatia, R.: CTAP: complementary temporal action proposal generation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11206, pp. 70–85. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01216-8_5
  13. Gao, J., Yang, Z., Chen, K., Sun, C., Nevatia, R.: TURN TAP: temporal unit regression network for temporal action proposals. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 3628–3636 (2017)
  14. Girshick, R.: Fast R-CNN. In: Proceedings of the International Conference on Computer Vision (ICCV), pp. 1440–1448 (2015)
  15. Gong, G., Zheng, L., Bai, K., Mu, Y.: Scale matters: temporal scale aggregation network for precise action localization in untrimmed videos. In: International Conference on Multimedia and Expo (ICME), pp. 1–6 (2020)
  16. Jiang, Y.G., et al.: THUMOS challenge: action recognition with a large number of classes (2014)
  17. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1725–1732 (2014)
  18. Kay, W., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  19. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: International Conference on Learning Representations (ICLR), pp. 1–14 (2017)
  20. Law, H., Deng, J.: CornerNet: detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750 (2018)
  21. Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: Proceedings of the International Conference on Computer Vision (ICCV) (2019)
  22. Lin, T., Liu, X., Li, X., Ding, E., Wen, S.: BMN: boundary-matching network for temporal action proposal generation. In: Proceedings of the International Conference on Computer Vision (ICCV) (2019)
  23. Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: BSN: boundary sensitive network for temporal action proposal generation. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11208, pp. 3–19. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01225-0_1
  24. Liu, D., Jiang, T., Wang, Y.: Completeness modeling and context separation for weakly supervised temporal action localization. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1298–1307 (2019)
  25. Liu, Y., Ma, L., Zhang, Y., Liu, W., Chang, S.F.: Multi-granularity generator for temporal action proposal. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3604–3613 (2019)
  26. Monfort, M., et al.: Moments in Time dataset: one million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) 42, 502–508 (2019)
  27. Zhao, P., Xie, L., Zhang, Y., Tian, Q.: Universal-to-specific framework for complex action recognition. arXiv preprint arXiv:2007.06149 (2020)
  28. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3D residual networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5534–5542 (2017)
  29. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788 (2016)
  30. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NeurIPS), pp. 91–99 (2015)
  31. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5734–5743 (2017)
  32. Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1049–1058 (2016)
  33. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 4489–4497 (2015)
  34. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6450–6459 (2018)
  35. Wang, L., Li, W., Li, W., Van Gool, L.: Appearance-and-relation networks for video classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1430–1439 (2018)
  36. Wang, L., Xiong, Y., Lin, D., Van Gool, L.: UntrimmedNets for weakly supervised action recognition and detection. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4325–4334 (2017)
  37. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 305–321 (2018)
  38. Xiong, Y., Zhao, Y., Wang, L., Lin, D., Tang, X.: A pursuit of temporal accuracy in general activity detection. arXiv preprint arXiv:1703.02716 (2017)
  39. Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5783–5792 (2017)
  40. Yuan, Z., Stroud, J.C., Lu, T., Deng, J.: Temporal action localization by structured maximal sums. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3684–3692 (2017)
  41. Zeng, R., et al.: Graph convolutional networks for temporal action localization. In: Proceedings of the International Conference on Computer Vision (ICCV) (2019)
  42. Zhao, Y., et al.: CUHK & ETHZ & SIAT submission to ActivityNet Challenge 2017. arXiv preprint arXiv:1710.08011 (2017)
  43. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2914–2923 (2017)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Cooperative Medianet Innovation Center, Shanghai Jiao Tong University, Shanghai, China
  2. Huawei Inc., Shenzhen, China
