Object Detection in Video with Spatiotemporal Sampling Networks

  • Gedas Bertasius
  • Lorenzo Torresani
  • Jianbo Shi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11216)

Abstract

We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from the adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state of the art on the ImageNet VID dataset. Compared to prior video object detection methods, it uses a simpler design and does not require optical flow data for training.
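
To make the sampling mechanism concrete, the sketch below is an illustrative reading of the idea described in the abstract, not the authors' released code: it assumes torchvision's DeformConv2d, uses arbitrary layer sizes and names, and replaces the paper's learned per-location aggregation with a plain mean. Offsets are predicted from the concatenated reference- and support-frame features, so the deformable convolution learns where in the support frame to sample for each reference-frame location.

```python
# Minimal sketch of spatiotemporal sampling with deformable convolution.
# Assumptions: torchvision's DeformConv2d, illustrative channel sizes,
# and mean aggregation in place of the paper's learned per-pixel weights.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class SpatiotemporalSampler(nn.Module):
    """Samples support-frame features, conditioned on the reference frame."""

    def __init__(self, channels=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Offsets are predicted from the concatenated reference/support
        # features, so sampling locations can adapt to inter-frame motion.
        self.offset_pred = nn.Conv2d(2 * channels,
                                     2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels,
                                   kernel_size, padding=pad)

    def forward(self, ref_feat, sup_feat):
        # offsets: (N, 2*K*K, H, W), one (dy, dx) pair per kernel tap,
        # steering where the deformable conv reads from the support frame.
        offsets = self.offset_pred(torch.cat([ref_feat, sup_feat], dim=1))
        return self.deform(sup_feat, offsets)


# Toy usage: sample from four adjacent frames and average the results.
# (The paper instead learns per-location aggregation weights.)
sampler = SpatiotemporalSampler()
ref = torch.randn(1, 256, 38, 50)                  # reference-frame features
supports = [torch.randn(1, 256, 38, 50) for _ in range(4)]
agg = torch.stack([sampler(ref, s) for s in supports]).mean(dim=0)
# `agg` would then feed a detection head (e.g. R-FCN) for the reference frame.
```

Because the offsets are produced by an ordinary convolution inside the network, the whole pipeline is differentiable end to end, which is how the sampling locations can be optimized directly against detection performance without optical-flow supervision.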

Acknowledgements

This work was funded in part by NSF award CNS-120552. We gratefully acknowledge NVIDIA and Facebook for the donation of GPUs used for portions of this work.

Supplementary material

Supplementary material 1: 474200_1_En_21_MOESM1_ESM.mp4 (mp4, 1 KB)


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Gedas Bertasius (1)
  • Lorenzo Torresani (2)
  • Jianbo Shi (1)
  1. University of Pennsylvania, Philadelphia, USA
  2. Dartmouth College, Hanover, USA
