
Adversarial Background-Aware Loss for Weakly-Supervised Temporal Activity Localization

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12359)

Abstract

Temporally localizing activities within untrimmed videos has been studied extensively in recent years. Despite recent advances, existing methods for weakly-supervised temporal activity localization struggle to recognize when an activity is not occurring. To address this issue, we propose a novel method named A2CL-PT. Two triplets in the feature space are considered in our approach: one triplet is used to learn discriminative features for each activity class, and the other is used to distinguish features where no activity occurs (i.e., background features) from activity-related features of each video. To further improve performance, we build our network with two parallel branches that operate in an adversarial way: the first branch localizes the most salient activities of a video, and the second finds supplementary activities in the non-localized parts of the video. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate that our proposed method is effective. Specifically, the average mAP over IoU thresholds from 0.1 to 0.9 on THUMOS14 is significantly improved from 27.9% to 30.0%.
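To make the pair-of-triplets idea concrete, below is a minimal PyTorch sketch of such a loss. It is not the authors' implementation: for brevity it uses plain Euclidean distances rather than the angular formulation of A2CL-PT, and the function name, tensor shapes, and margins (pair_of_triplets_loss, act_feat, bg_feat, margin1, margin2) are illustrative assumptions.

    # Minimal sketch (not the authors' code) of a center loss built from a pair of
    # triplets, assuming per-video activity and background features have already
    # been pooled from the attention weights of a branch.
    import torch
    import torch.nn.functional as F

    def pair_of_triplets_loss(act_feat, bg_feat, centers, labels,
                              margin1=1.0, margin2=1.0):
        """act_feat: (B, D) pooled activity features, one per video
           bg_feat:  (B, D) pooled background features, one per video
           centers:  (C, D) learnable class centers
           labels:   (B,)   video-level class indices
           All names and margin values are illustrative assumptions."""
        # Distance from each activity feature to every class center.
        dists = torch.cdist(act_feat, centers)                      # (B, C)
        pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)       # own-center distance
        # Closest competing center serves as the negative of the first triplet.
        neg = dists.scatter(1, labels.unsqueeze(1), float('inf')).min(dim=1).values

        # Triplet 1: pull activity features toward their class center and
        # push them away from the nearest other center (discriminative features).
        loss_class = F.relu(pos - neg + margin1).mean()

        # Triplet 2: the background feature of the same video should sit
        # farther from that class center than the activity feature does.
        bg_dist = (bg_feat - centers[labels]).pow(2).sum(dim=1).sqrt()
        loss_bg = F.relu(pos - bg_dist + margin2).mean()

        return loss_class + loss_bg

In this sketch, the first hinge term plays the role of a triplet-center loss over class centers, while the second encourages each video's background feature to lie farther from the video's class center than its activity feature, which is the intuition behind separating background from activity-related features.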

Keywords

A2CL-PT · Temporal activity localization · Adversarial learning · Weakly-supervised learning · Center loss with a pair of triplets

Notes

Acknowledgement

We thank Stephan Lemmer, Victoria Florence, Nathan Louis, and Christina Jung for their valuable feedback and comments. This research was, in part, supported by NIST grant 60NANB17D191.

Supplementary material

Supplementary material 1: 504468_1_En_17_MOESM1_ESM.zip (28.1 MB)

References

  1. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)
  2. Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the Faster R-CNN architecture for temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139 (2018)
  3. Caba Heilbron, F., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: a large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970 (2015)
  4. He, X., Zhou, Y., Zhou, Z., Bai, S., Bai, X.: Triplet-center loss for multi-view 3D object retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1945–1954 (2018)
  5. Jiang, Y.G., et al.: THUMOS challenge: action recognition with a large number of classes (2014). http://crcv.ucf.edu/THUMOS14/
  6. Kay, W., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  7. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  8. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision, pp. 2556–2563. IEEE (2011)
  9. Lee, P., Uh, Y., Byun, H.: Background suppression network for weakly-supervised temporal action localization. In: AAAI (2020)
  10. Li, Z., Xu, C., Leng, B.: Angular triplet-center loss for multi-view 3D shape retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 8682–8689 (2019)
  11. Lin, T., Zhao, X., Su, H., Wang, C., Yang, M.: BSN: boundary sensitive network for temporal action proposal generation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19 (2018)
  12. Liu, D., Jiang, T., Wang, Y.: Completeness modeling and context separation for weakly supervised temporal action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1298–1307 (2019)
  13. Liu, Z., et al.: Weakly supervised temporal action localization through contrast based evaluation networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3899–3908 (2019)
  14. Long, F., Yao, T., Qiu, Z., Tian, X., Luo, J., Mei, T.: Gaussian temporal awareness networks for action localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 344–353 (2019)
  15. Monfort, M., et al.: Moments in time dataset: one million videos for event understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42, 1–8 (2019). https://doi.org/10.1109/TPAMI.2019.2901464
  16. Narayan, S., Cholakkal, H., Khan, F.S., Shao, L.: 3C-Net: category count and center loss for weakly-supervised action localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8679–8687 (2019)
  17. Nguyen, P., Liu, T., Prasad, G., Han, B.: Weakly supervised action localization by sparse temporal pooling network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6752–6761 (2018)
  18. Nguyen, P.X., Ramanan, D., Fowlkes, C.C.: Weakly-supervised action localization with background modeling. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5502–5511 (2019)
  19. Paul, S., Roy, S., Roy-Chowdhury, A.K.: W-TALC: weakly-supervised temporal activity localization and classification. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 563–579 (2018)
  20. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: CDC: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5734–5743 (2017)
  21. Shou, Z., Gao, H., Zhang, L., Miyazawa, K., Chang, S.F.: AutoLoc: weakly-supervised temporal action localization in untrimmed videos. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 154–171 (2018)
  22. Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage CNNs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1049–1058 (2016)
  23. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  24. Wang, L., Xiong, Y., Lin, D., Van Gool, L.: UntrimmedNets for weakly supervised action recognition and detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4325–4334 (2017)
  25. Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 499–515. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_31
  26. Xu, H., Das, A., Saenko, K.: R-C3D: region convolutional 3D network for temporal activity detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5783–5792 (2017)
  27. Xu, Y., et al.: Segregated temporal assembly recurrent networks for weakly supervised multiple action detection. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 9070–9078 (2019)
  28. Yuan, Y., Lyu, Y., Shen, X., Tsang, I.W., Yeung, D.Y.: Marginalized average attentional network for weakly-supervised learning. In: International Conference on Learning Representations (ICLR) (2019)
  29. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L1 optical flow. In: Hamprecht, F.A., Schnörr, C., Jähne, B. (eds.) DAGM 2007. LNCS, vol. 4713, pp. 214–223. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74936-3_22
  30. Zhang, X., Wei, Y., Feng, J., Yang, Y., Huang, T.S.: Adversarial complementary learning for weakly supervised object localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1325–1334 (2018)
  31. Zhao, H., Torralba, A., Torresani, L., Yan, Z.: HACS: human action clips and segments dataset for recognition and temporal localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 8668–8678 (2019)
  32. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2914–2923 (2017)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Michigan, Ann Arbor, USA
