Localizing the Common Action Among a Few Videos

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12352)


This paper strives to localize the temporal extent of an action in a long untrimmed video. Where existing work leverages many examples with their start, their end, and/or the class of the action during training, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
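As a rough illustration of the idea behind component (iii) only, the minimal Python sketch below weighs a few support videos by their similarity to a query segment. It is not the paper's implementation: it assumes each video has already been summarized as a plain feature vector, and the function names (`cosine`, `softmax`, `weigh_supports`), the choice of cosine similarity, and the softmax weighting are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weigh_supports(query_feat, support_feats):
    """Hypothetical helper: weigh each support vector by its similarity to the query.

    Returns the per-support weights and the weighted sum of the support vectors.
    """
    scores = [cosine(query_feat, s) for s in support_feats]
    weights = softmax(scores)
    dim = len(query_feat)
    fused = [
        sum(w * s[i] for w, s in zip(weights, support_feats))
        for i in range(dim)
    ]
    return weights, fused


# Toy example: the first support vector matches the query, the second does not.
weights, fused = weigh_supports([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

In this toy example the first support vector receives the larger weight because it points in the same direction as the query. A learned module would replace the fixed similarity with trainable parameters.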



Keywords

Common action localization, Few-shot learning



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Peking University, Beijing, China
  2. University of Amsterdam, Amsterdam, The Netherlands