Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12351)

Abstract

Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. Without frame-level annotations, however, it is challenging for W-TAL methods to identify false positive action proposals and to generate proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these two challenges. The proposed TSCN features an iterative refinement training method, in which a frame-level pseudo ground truth is iteratively updated and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss that encourages the predicted attention to act like a binary selection and promotes precise localization of action instance boundaries. Experiments on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods and even achieves results comparable to some recent fully-supervised methods.
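The two ideas in the abstract, fusing the two streams into a frame-level pseudo ground truth and normalizing attention toward a binary selection, can be illustrated with a small sketch. This is an illustrative reconstruction, not the paper's exact formulation: the fusion weighting, the binarization threshold, and the top-k/bottom-k ratio are all assumptions made here for demonstration.

```python
import numpy as np

def pseudo_ground_truth(att_rgb, att_flow, threshold=0.5):
    """Two-stream consensus (sketch): average the RGB and optical-flow
    attention sequences, then binarize to obtain a frame-level pseudo
    ground truth that can supervise the next training iteration."""
    fused = (np.asarray(att_rgb, dtype=np.float64) +
             np.asarray(att_flow, dtype=np.float64)) / 2.0
    return (fused >= threshold).astype(np.float32)

def attention_norm_loss(attention, ratio=4):
    """Attention normalization (sketch): maximize the gap between the
    mean of the top-k and bottom-k attention values, which pushes the
    attention sequence toward a binary (0/1) selection of frames."""
    att = np.sort(np.asarray(attention, dtype=np.float64))
    k = max(len(att) // ratio, 1)
    top_mean = att[-k:].mean()       # driven toward 1
    bottom_mean = att[:k].mean()     # driven toward 0
    return bottom_mean - top_mean    # minimized at -1 for binary attention
```

For example, `pseudo_ground_truth([0.9, 0.8, 0.1], [0.7, 0.6, 0.2])` marks the first two frames as action, and `attention_norm_loss` reaches its minimum of `-1` only when the top-k values are exactly 1 and the bottom-k exactly 0.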

Keywords

Temporal action localization · Weakly-supervised learning

Notes

Acknowledgement

This work was supported in part by the National Key R&D Program of China under Grant 2018AAA0101400, NSFC Grants 61629301, 61773312, and 61976171, China Postdoctoral Science Foundation Grant 2019M653642, the Young Elite Scientists Sponsorship Program by CAST under Grant 2018QNRC001, and the Natural Science Foundation of Shaanxi under Grant 2020JQ-069.

Supplementary material

Supplementary material 1: 504443_1_En_3_MOESM1_ESM.pdf (422 KB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Xi’an Jiaotong University, Xi’an, China
  2. University of Illinois at Chicago, Chicago, USA
  3. HERE Technologies, Chicago, USA
  4. State University of New York at Buffalo, Buffalo, USA
  5. Wormpex AI Research, Bellevue, USA