Learning Where to Focus for Efficient Video Object Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12361)


Transferring existing image-based detectors to the video domain is non-trivial, since frame quality is often degraded by partial occlusion, rare poses, and motion blur. Previous approaches propagate and aggregate features across video frames using optical-flow warping. However, directly applying image-level optical flow to high-level features might not establish accurate spatial correspondences. Therefore, a novel module called Learnable Spatio-Temporal Sampling (LSTS) is proposed to accurately learn semantic-level correspondences among the features of adjacent frames. The sampled locations are first randomly initialized, then iteratively updated under detection supervision to progressively find better spatial correspondences. In addition, a Sparsely Recursive Feature Updating (SRFU) module and a Dense Feature Aggregation (DFA) module are introduced to model temporal relations and enhance per-frame features, respectively. Without bells and whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset with less computational complexity and real-time speed. Code will be made available at LSTS.
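The sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the softmax-over-dot-product weighting, and the zero padding at the borders are all assumptions made for illustration, and the gradient update of the offsets under detection supervision is not shown. The sketch only demonstrates how features of a previous frame could be read at a small set of fractional offsets and blended for each location of the current frame.

```python
import math
import random


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate an H x W x C nested-list feature map at (y, x).

    Reads outside the map contribute zero (an assumed padding choice).
    """
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    y0, x0 = math.floor(y), math.floor(x)
    out = [0.0] * c
    for yy, xx in ((y0, x0), (y0, x0 + 1), (y0 + 1, x0), (y0 + 1, x0 + 1)):
        if 0 <= yy < h and 0 <= xx < w:
            weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
            for k in range(c):
                out[k] += weight * feat[yy][xx][k]
    return out


def aggregate(prev_feat, cur_feat, offsets):
    """Blend prev_feat samples taken at the given offsets around each location.

    Weights are a softmax over dot-product similarity with the current-frame
    feature (an assumption; the abstract does not specify the weighting).
    """
    h, w, c = len(cur_feat), len(cur_feat[0]), len(cur_feat[0][0])
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            query = cur_feat[i][j]
            samples = [bilinear_sample(prev_feat, i + dy, j + dx)
                       for dy, dx in offsets]
            sims = [sum(a * b for a, b in zip(query, s)) for s in samples]
            top = max(sims)
            exps = [math.exp(s - top) for s in sims]  # numerically stable softmax
            total = sum(exps)
            out[i][j] = [sum(e / total * s[k] for e, s in zip(exps, samples))
                         for k in range(c)]
    return out


# Hypothetical setup: four offsets, randomly initialised. In training, these
# would be the parameters updated by gradients from the detection loss.
random.seed(0)
offsets = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(4)]
prev_frame = [[[1.0], [3.0]], [[5.0], [7.0]]]
cur_frame = [[[1.0], [2.0]], [[4.0], [8.0]]]
aligned = aggregate(prev_frame, cur_frame, offsets)
```

Because bilinear interpolation is differentiable with respect to the fractional coordinates, offsets of this kind can in principle be learned end to end, which is what distinguishes this style of sampling from a fixed, precomputed flow field.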


Flow-warping, Learnable Spatio-Temporal Sampling, Spatial correspondences, Temporal relations



This research was supported by the Major Project for New Generation of AI under Grant No. 2018AAA0100400, the National Natural Science Foundation of China under Grants 91646207, 61976208 and 61620106003. We also would like to thank Lin Song for the discussions and suggestions.




Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
  3. The Chinese University of Hong Kong, Hong Kong, People’s Republic of China
  4. Horizon Robotics, Beijing, China
