CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)


Most current pipelines for spatio-temporal action localization link frame-wise or clip-wise detection results to generate action proposals, so only local information is exploited and efficiency is hindered by dense per-frame localization. In this paper, we propose the Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatio-temporal action localization. CFAD introduces a new paradigm that first estimates coarse spatio-temporal action tubes from video streams and then refines the tubes’ locations based on key timestamps. This concept is implemented by two key components of our framework, the Coarse Module and the Refine Module. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate initial tube estimates, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Compared with other methods, the proposed CFAD achieves competitive results on the UCF101-24, UCFSports and JHMDB-21 action detection benchmarks, with inference speed that is 3.3× faster than the nearest competitor.
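
The coarse-to-fine idea can be illustrated with a toy sketch. The abstract does not say how tubes are parameterized or how refinement is computed, so everything below is an assumption made for illustration: each box coordinate is modeled as a polynomial in normalized time, and refinement adds box offsets at key timestamps and linearly interpolates them for the frames in between. The names `sample_tube`, `refine_tube`, and `key_offsets` are hypothetical and do not come from the paper.

```python
from bisect import bisect_right


def sample_tube(coeffs, timestamps):
    """Evaluate a parameterized tube at the given timestamps.

    ASSUMPTION (not from the paper): each of the 4 box coordinates
    (x1, y1, x2, y2) is a polynomial c0 + c1*t + ... in normalized time t.
    """
    def poly(c, t):
        acc = 0.0
        for a in reversed(c):  # Horner evaluation
            acc = acc * t + a
        return acc

    return [[poly(c, t) for c in coeffs] for t in timestamps]


def refine_tube(boxes, timestamps, key_offsets):
    """Shift boxes by offsets given at key frame indices.

    ASSUMPTION (not from the paper): offsets for frames between two key
    frames are linearly interpolated; frames outside the key range reuse
    the nearest key offset. key_offsets maps frame index -> 4 offsets.
    """
    keys = sorted(key_offsets)
    refined = []
    for i, box in enumerate(boxes):
        if i <= keys[0]:
            off = key_offsets[keys[0]]
        elif i >= keys[-1]:
            off = key_offsets[keys[-1]]
        else:
            j = bisect_right(keys, i)
            lo, hi = keys[j - 1], keys[j]
            w = (timestamps[i] - timestamps[lo]) / (timestamps[hi] - timestamps[lo])
            off = [(1 - w) * a + w * b
                   for a, b in zip(key_offsets[lo], key_offsets[hi])]
        refined.append([b + o for b, o in zip(box, off)])
    return refined


# Toy usage: a box drifting right over 5 frames, corrected at the last key frame.
timestamps = [0.0, 0.25, 0.5, 0.75, 1.0]
coeffs = [[10.0, 20.0], [5.0, 0.0], [50.0, 20.0], [80.0, 0.0]]
coarse = sample_tube(coeffs, timestamps)
refined = refine_tube(coarse, timestamps,
                      {0: [0.0, 0.0, 0.0, 0.0], 4: [4.0, 0.0, 4.0, 0.0]})
```

In this sketch the coarse stage describes the whole tube with a handful of coefficients instead of one detection per frame, and the refinement stage touches only the key frames. That mirrors the efficiency argument of the abstract without claiming to reproduce the actual Coarse and Refine Modules.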


Keywords: Spatiotemporal action detection · Coarse-to-fine paradigm · Parameterized modeling



The paper is supported in part by the following grants: China Major Project for New Generation of AI Grant (No. 2018AAA0100400), National Natural Science Foundation of China (No. 61971277). The work is also supported by funding from Clobotics under the Joint Research Program of Smart Retail.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China
  2. Institute for Advanced Communication and Data Science, Shanghai University, Shanghai, China
  3. Faculty of Computing and Informatics, Multimedia University, Cyberjaya, Malaysia
  4. Adobe Research, San Francisco, USA
  5. Clobotics, Shanghai, China
