AutoLoc: Weakly-Supervised Temporal Action Localization in Untrimmed Videos

  • Zheng ShouEmail author
  • Hang Gao
  • Lei Zhang
  • Kazuyuki Miyazawa
  • Shih-Fu Chang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Temporal Action Localization (TAL) in untrimmed video is important for many applications. But it is very expensive to annotate the segment-level ground truth (action class and temporal boundary). This raises the interest of addressing TAL with weak supervision, namely only video-level annotations are available during training). However, the state-of-the-art weakly-supervised TAL methods only focus on generating good Class Activation Sequence (CAS) over time but conduct simple thresholding on CAS to localize actions. In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance. We propose a novel Outer-Inner-Contrastive (OIC) loss to automatically discover the needed segment-level supervision for training such a boundary predictor. Our method achieves dramatically improved performance: under the IoU threshold 0.5, our method improves mAP on THUMOS’14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. It is also very encouraging to see that our weakly-supervised method achieves comparable results with some fully-supervised methods.


Temporal action localization Weak supervision Outer-Inner-contrastive Class activation sequence 



We appreciate the support from Mitsubishi Electric for this project.


  1. 1.
    Activitynet challenge 2016 (2016).
  2. 2.
    Aggarwal, J.K., Ryoo, M.S.: Human activity analysis: a review. In: ACM Computing Surveys (2011)CrossRefGoogle Scholar
  3. 3.
    Asadi-Aghbolaghi, M., Clapés, A., Bellantonio, M., Escalante, H.J., Ponce-López, V., Baró, X., Guyon, I., Kasaei, S., Escalera, S.: A survey on deep learning based approaches for action and gesture recognition in image sequences. In: FG (2017)Google Scholar
  4. 4.
    Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: Whats the point: Semantic segmentation with point supervision. In: ECCV (2016)Google Scholar
  5. 5.
    Bilen, H., Vedaldi, A.: Weakly supervised deep detection networks. In: CVPR (2016)Google Scholar
  6. 6.
    Buch, S., Escorcia, V., Ghanem, B., Fei-Fei, L., Niebles, J.C.: End-to-end, single-stream temporal action detection in untrimmed videos. In: BMVC (2017)Google Scholar
  7. 7.
    Buch, S., Escorcia, V., Shen, C., Ghanem, B., Niebles, J.C.: Sst: single-stream temporal action proposals. In: CVPR (2017)Google Scholar
  8. 8.
    Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: CVPR (2017)Google Scholar
  9. 9.
    Chen, Y., Jin, X., Feng, J., Yan, S.: Training group orthogonal neural networks with privileged information. In: IJCAI (2017)Google Scholar
  10. 10.
    Chen, Y., Jin, X., Kang, B., Feng, J., Yan, S.: Sharing residual units through collective tensor factorization in deep neural networks. In: IJCAI (2018)Google Scholar
  11. 11.
    Chen, Y., Kalantidis, Y., Li, J., Yan, S., Feng, J.: Multi-fiber networks for video recognition. In: ECCV (2018)Google Scholar
  12. 12.
    Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. In: NIPS (2017)Google Scholar
  13. 13.
    Cheng, G., Wan, Y., Saudagar, A.N., Namuduri, K., Buckles, B.P.: Advances in human action recognition: a survey (2015). arXiv:1501.05964
  14. 14.
    Dai, X., Singh, B., Zhang, G., Davis, L.S., Chen, Y.Q.: Temporal context network for activity localization in videos. In: ICCV (2017)Google Scholar
  15. 15.
    Dave, A., Russakovsky, O., Ramanan, D.: Predictive-corrective networks for action detection. In: CVPR (2017)Google Scholar
  16. 16.
    Dietterich, T.G., Lathrop, R.H., Lozano-Pérez, T.: Solving the multiple instance problem with axis-parallel rectangles (1997)CrossRefGoogle Scholar
  17. 17.
    Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: CVPR (2015)Google Scholar
  18. 18.
    Durand, T., Mordan, T., Thome, N., Cord, M.: Wildcat: weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In: CVPR (2017)Google Scholar
  19. 19.
    Escorcia, V., Heilbron, F.C., Niebles, J.C., Ghanem, B.: Daps: deep action proposals for action understanding. In: ECCV (2016)Google Scholar
  20. 20.
    Gao, J., Chen, K., Nevatia, R.: Ctap: complementary temporal action proposal generation. In: ECCV (2018)CrossRefGoogle Scholar
  21. 21.
    Gao, J., Yang, Z., Nevatia, R.: Cascaded boundary regression for temporal action detection. In: BMVC (2017)Google Scholar
  22. 22.
    Gao, J., Yang, Z., Sun, C., Chen, K., Nevatia, R.: Turn tap: temporal unit regression network for temporal action proposals. In: ICCV (2017)Google Scholar
  23. 23.
    Girshick, R.: Fast r-cnn. In: ICCV (2015)Google Scholar
  24. 24.
    Gorban, A., Idrees, H., Jiang, Y.G., Zamir, A.R., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: action recognition with a large number of classes (2015).
  25. 25.
    Gudi, A., van Rosmalen, N., Loog, M., van Gemert, J.: Object-extent pooling for weakly supervised single-shot localization. In: BMVC (2017)Google Scholar
  26. 26.
    He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: TPMAI (2015)Google Scholar
  27. 27.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  28. 28.
    Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: a large-scale video benchmark for human activity understanding. In: CVPR (2015)Google Scholar
  29. 29.
    Heilbron, F.C., Barrios, W., Escorcia, V., Ghanem, B.: Scc: semantic context cascade for efficient action detection. In: CVPR (2017)Google Scholar
  30. 30.
    Heilbron, F.C., Niebles, J.C., Ghanem, B.: Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: CVPR (2016)Google Scholar
  31. 31.
    Hong, S., Yeo, D., Kwak, S., Lee, H., Han, B.: Weakly supervised semantic segmentation using web-crawled videos. In: CVPR (2017)Google Scholar
  32. 32.
    Huang, D.A., Fei-Fei, L., Niebles, J.C.: Connectionist temporal modeling for weakly supervised action labeling. In: ECCV (2016)Google Scholar
  33. 33.
    Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR (2017)Google Scholar
  34. 34.
    Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML (2015)Google Scholar
  35. 35.
    Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. In: TPMAI (2013)Google Scholar
  36. 36.
    Jia, Y., et al.: Caffe: convolutional architecture for fast feature embedding. In: ACM MM (2014)Google Scholar
  37. 37.
    Jiang, Y.G., Liu, J., Zamir, A.R., Toderici, G., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: action recognition with a large number of classes (2014).
  38. 38.
    Jie, Z., Wei, Y., Jin, X., Feng, J., Liu, W.: Deep self-taught learning for weakly supervised object localization. In: CVPR (2017)Google Scholar
  39. 39.
    Kang, S.M., Wildes, R.P.: Review of action recognition and detection methods (2016). arXiv preprint arXiv:1610.06906
  40. 40.
    Kantorov, V., Oquab, M., Cho, M., Laptev, I.: Contextlocnet: context-aware deep network models for weakly supervised localization. In: ECCV (2016)Google Scholar
  41. 41.
    Karaman, S., Seidenari, L., Bimbo, A.D.: Fast saliency based pooling of fisher encoded dense trajectories. In: ECCV THUMOS Workshop (2014)Google Scholar
  42. 42.
    Khoreva, A., Benenson, R., Hosang, J., Hein, M., Schiele, B.: Simple does it: weakly supervised instance and semantic segmentation. In: CVPR (2017)Google Scholar
  43. 43.
    Kim, D., Yoo, D., Kweon, I.S., et al.: Two-phase learning for weakly supervised object localization. In: ICCV (2017)Google Scholar
  44. 44.
    Lin, T., Zhao, X., Shou, Z.: Single shot temporal action detection. In: ACM MM (2017)Google Scholar
  45. 45.
    Lindeberg, T.: Feature detection with automatic scale selection. In: IJCV (1998)Google Scholar
  46. 46.
    Liu, W., et al.: Ssd: single shot multibox detector. In: ECCV (2016)Google Scholar
  47. 47.
    Mettes, P., van Gemert, J.C., Snoek, C.G.: Spot on: action localization from pointly-supervised proposals. In: ECCV (2016)Google Scholar
  48. 48.
    Oneata, D., Verbeek, J., Schmid, C.: The lear submission at thumos 2014. In: ECCV THUMOS Workshop (2014)Google Scholar
  49. 49.
    Papandreou, G., Chen, L.C., Murphy, K.P., Yuille, A.L.: Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In: CVPR (2015)Google Scholar
  50. 50.
    Poppe, R.: A survey on vision-based human action recognition. In: Image and Vision Computing (2010)CrossRefGoogle Scholar
  51. 51.
    Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)Google Scholar
  52. 52.
    Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: CVPR (2017)Google Scholar
  53. 53.
    Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: towards real-time object detection with region proposal networks. In: NIPS (2015)Google Scholar
  54. 54.
    Richard, A., Gall, J.: Temporal action detection using a statistical language model. In: CVPR (2016)Google Scholar
  55. 55.
    Richard, A., Kuehne, H., Gall, J.: Weakly supervised action learning with rnn based fine-to-coarse modeling. In: CVPR (2017)Google Scholar
  56. 56.
    Shen, Z., Li, J., Su, Z., Li, M., Chen, Y., Jiang, Y.G., Xue, X.: Weakly supervised dense video captioning. In: CVPR (2017)Google Scholar
  57. 57.
    Shi, M., Caesar, H., Ferrari, V.: Weakly supervised object localization using things and stuff transfer. In: ICCV (2017)Google Scholar
  58. 58.
    Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: Cdc: convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR (2017)Google Scholar
  59. 59.
    Shou, Z., Gao, H., Zhang, L., Miyazawa, K., Chang, S.F.: Autoloc: weakly-supervised temporal action localization (2018). arXiv preprint arXiv:1807.08333
  60. 60.
    Shou, Z., Wang, D., Chang, S.F.: Temporal action localization in untrimmed videos via multi-stage cnns. In: CVPR (2016)Google Scholar
  61. 61.
    Sigurdsson, G.A., Divvala, S., Farhadi, A., Gupta, A.: Asynchronous temporal fields for action recognition. In: CVPR (2017)Google Scholar
  62. 62.
    Sigurdsson, G.A., Russakovsky, O., Farhadi, A., Laptev, I., Gupta, A.: Much ado about time: exhaustive annotation of temporal data. In: HCOMP (2016)Google Scholar
  63. 63.
    Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: ECCV (2016)Google Scholar
  64. 64.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS (2014)Google Scholar
  65. 65.
    Singh, K.K., Lee, Y.J.: Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In: ICCV (2017)Google Scholar
  66. 66.
    Sun, C., Paluri, M., Collobert, R., Nevatia, R., Bourdev, L.: Pronet: Learning to propose object-specific boxes for cascaded neural networks. In: CVPR (2016)Google Scholar
  67. 67.
    Sun, C., Shetty, S., Sukthankar, R., Nevatia, R.: Temporal localization of fine-grained actions in videos by domain transfer from web images. In: ACM MM (2015)Google Scholar
  68. 68.
    Tang, P., Wang, X., Bai, X., Liu, W.: Multiple instance detection network with online instance classifier refinement. In: CVPR (2017)Google Scholar
  69. 69.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV (2015)Google Scholar
  70. 70.
    Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning (2017). arXiv preprint arXiv:1708.05038
  71. 71.
    Wang, L., Qiao, Y., Tang, X.: Action recognition and detection by combining motion and appearance features. In: ECCV THUMOS Workshop (2014)Google Scholar
  72. 72.
    Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: towards good practices for deep action recognition. In: ECCV (2016)Google Scholar
  73. 73.
    Wang, L., Xiong, Y., Lin, D., Van Gool, L.: Untrimmednets for weakly supervised action recognition and detection. In: CVPR (2017)Google Scholar
  74. 74.
    Weinland, D., Ronfard, R., Boyer, E.: A survey of vision-based methods for action representation, segmentation and recognition. In: Computer Vision and Image Understanding (2011)CrossRefGoogle Scholar
  75. 75.
    Xu, H., Das, A., Saenko, K.: R-c3d: Region convolutional 3d network for temporal activity detection. In: ICCV (2017)Google Scholar
  76. 76.
    Yeung, S., Russakovsky, O., Mori, G., Fei-Fei, L.: End-to-end learning of action detection from frame glimpses in videos. In: CVPR (2016)Google Scholar
  77. 77.
    Yuan, J., Ni, B., Yang, X., Kassim, A.: Temporal action localization with pyramid of score distribution features. In: CVPR (2016)Google Scholar
  78. 78.
    Yuan, Z., Stroud, J.C., Lu, T., Deng, J.: Temporal action localization by structured maximal sums. In: CVPR (2017)Google Scholar
  79. 79.
    Zhang, H., Kyaw, Z., Yu, J., Chang, S.F.: Ppr-fcn: weakly supervised visual relation detection via parallel pairwise r-fcn. In: ICCV (2017)Google Scholar
  80. 80.
    Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. In: ECCV (2016)Google Scholar
  81. 81.
    Zhao, H., Yan, Z., Wang, H., Torresani, L., Torralba, A.: Slac: A sparsely labeled dataset for action classification and localization (2017). arXiv preprint arXiv:1712.09374
  82. 82.
    Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV (2017)Google Scholar
  83. 83.
    Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)Google Scholar
  84. 84.
    Zhu, Y., Zhou, Y., Ye, Q., Qiu, Q., Jiao, J.: Soft proposal networks for weakly supervised object localization. In: ICCV (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Columbia UniversityNew YorkUSA
  2. 2.Microsoft ResearchRedmondUSA
  3. 3.Mitsubishi ElectricTokyoJapan

Personalised recommendations