Learning to Localize Actions from Moments

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12348)


Given knowledge of action moments (i.e., trimmed video clips that each contain an action instance), humans can routinely localize an action temporally in an untrimmed video. Nevertheless, most practical methods still require all training videos to be labeled with temporal annotations (action category and temporal boundary) and develop the models in a fully-supervised manner, which demands expensive labeling effort and does not extend to new categories. In this paper, we introduce a new transfer-learning design that learns action localization for a large set of action categories using only action moments from the categories of interest and temporal annotations of untrimmed videos from a small set of action classes. Specifically, we present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework. Technically, a weight transfer function is devised to build the transformation between classification of action moments or foreground video segments and action localization in synthetic contextual moments or untrimmed videos. The context of each moment is learnt through an adversarial mechanism that differentiates the generated features from those of background in untrimmed videos. Extensive experiments are conducted on transfer both across the splits of ActivityNet v1.3 and from THUMOS14 to ActivityNet v1.3. AherNet demonstrates superiority even when compared with most fully-supervised action localization methods. More remarkably, we train AherNet to localize actions from 600 categories by leveraging action moments in Kinetics-600 and temporal annotations from 200 classes in ActivityNet v1.3.
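
To make the idea of a weight transfer function more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the transfer function is a small two-layer perceptron that maps each category's classification weights to localization weights, which then score per-anchor features. The abstract does not specify the architecture, so the function names (`transfer_weights`, `localization_scores`), the layer sizes, the leaky-ReLU activation, and the random inputs are all hypothetical.

```python
import numpy as np


def leaky_relu(x, slope=0.1):
    """Element-wise leaky ReLU (activation choice is an assumption)."""
    return np.where(x > 0, x, slope * x)


def init_transfer_params(dim_cls, dim_loc, hidden, rng):
    """Randomly initialise a hypothetical two-layer transfer function."""
    return {
        "W1": rng.normal(0.0, 0.01, size=(dim_cls, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.01, size=(hidden, dim_loc)),
        "b2": np.zeros(dim_loc),
    }


def transfer_weights(w_cls, params):
    """Map classification weights (C, dim_cls) to localization weights (C, dim_loc)."""
    h = leaky_relu(w_cls @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]


def localization_scores(features, w_loc):
    """Score each temporal anchor against every category.

    features: (T, dim_loc) per-anchor features; w_loc: (C, dim_loc).
    Returns a (T, C) array of unnormalised scores.
    """
    return features @ w_loc.T


rng = np.random.default_rng(0)
# Illustrative sizes only; 600 mirrors the Kinetics-600 category count.
num_classes, dim_cls, dim_loc, hidden, num_anchors = 600, 64, 32, 48, 10

w_cls = rng.normal(size=(num_classes, dim_cls))      # stand-in for learnt classifier weights
params = init_transfer_params(dim_cls, dim_loc, hidden, rng)
w_loc = transfer_weights(w_cls, params)              # predicted localization weights
features = rng.normal(size=(num_anchors, dim_loc))   # stand-in for anchor features
scores = localization_scores(features, w_loc)
```

Under these assumptions, the parameters of the transfer function could be fitted on the classes that have temporal annotations and then reused to predict localization weights for categories that only have action moments.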



This work is partially supported by Beijing Academy of Artificial Intelligence (BAAI) and the National Key R&D Program of China under contract No. 2017YFB1002203.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Science and Technology of China, Hefei, China
  2. JD AI Research, Beijing, China
  3. University of Rochester, Rochester, USA
