Abstract
Detecting small-scale pedestrians in aerial images is a challenging task that can be difficult even for humans. A single-image method cannot achieve robust performance because small instances provide poor visual cues, whereas multiple frames can supply information that a single frame lacks for such difficult cases. We therefore design a novel video-based pedestrian detection method with a two-stream network pipeline that fully utilizes the temporal and contextual information of a video. An aggregated feature map is proposed to absorb spatial and temporal information with the help of spatial and temporal sub-networks. To better capture motion information, a more refined flow network (SPyNet) is adopted instead of a simple FlowNet. In the spatial-stream sub-network, we modify the backbone structure to increase the feature map resolution while keeping a relatively large receptive field, making it suitable for small-scale detection. Experimental results on drone video datasets demonstrate that our approach improves detection accuracy for small-scale instances and reduces false positive detections. By exploiting temporal information and aggregating feature maps, our two-stream method improves detection performance by 8.48% in mean Average Precision (mAP) over the basic single-stream R-FCN method, and it outperforms the state-of-the-art method by 3.09% on the Okutama Human-action dataset.
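The aggregation step can be made concrete with a short sketch. The PyTorch-style code below is a minimal illustration, not the published implementation: the helper names warp_features and aggregate are hypothetical, the flow fields are assumed to come from an external estimator such as SPyNet and to be already rescaled to feature-map resolution, and the cosine-similarity weighting is one plausible choice for the adaptive fusion.

```python
# Minimal sketch of flow-guided feature aggregation (illustrative, not the authors' code).
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a (N, C, H, W) feature map to the current frame using a
    (N, 2, H, W) flow field given in pixels (channel 0 = x, channel 1 = y)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2, H, W) pixel grid
    grid = base.unsqueeze(0) + flow                                   # displaced sampling positions
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0                       # normalize x to [-1, 1]
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0                       # normalize y to [-1, 1]
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def aggregate(cur_feat, neighbor_feats, flows):
    """Fuse the current-frame features with warped neighbor features,
    weighted per pixel by cosine similarity to the current frame."""
    warped = [cur_feat] + [warp_features(f, fl) for f, fl in zip(neighbor_feats, flows)]
    sims = torch.stack([F.cosine_similarity(w, cur_feat, dim=1).unsqueeze(1)
                        for w in warped], dim=0)                      # (K, N, 1, H, W)
    weights = torch.softmax(sims, dim=0)                              # normalize over frames
    return (weights * torch.stack(warped, dim=0)).sum(dim=0)          # aggregated (N, C, H, W) map

# Illustrative usage: two neighbor frames; shapes are arbitrary here.
cur = torch.randn(1, 256, 90, 160)
neighbors = [torch.randn(1, 256, 90, 160) for _ in range(2)]
flows = [torch.randn(1, 2, 90, 160) for _ in range(2)]                # e.g., SPyNet output
agg = aggregate(cur, neighbors, flows)                                # fed to the detection head
```

In this sketch the adaptive weights are computed directly in feature space; a trained network may instead learn an embedding for the weighting, as in flow-guided feature aggregation (Zhu et al. 2017).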
References
Barekatain, M., Marti, M., Shih, H. -F., Murray, S., Nakayama, K., Matsuo, Y., et al. (2017). Okutama-Action: An aerial view video dataset for concurrent human action detection. In 30th IEEE conference on computer vision and pattern recognition workshops (Vol. 2017, pp. 2153–2160). IEEE Computer Society. https://doi.org/10.1109/CVPRW.2017.267.
Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. In European conference on computer vision (pp. 354–370). Springer. https://doi.org/10.1007/978-3-319-46493-0_22.
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (pp. 379–387). http://papers.nips.cc/paper/6465-r-fcn-object-detection-via-region-based-fully-convolutional-networks.pdf.
Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545. https://doi.org/10.1109/TPAMI.2014.2300479.
Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761. https://doi.org/10.1109/TPAMI.2011.155.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 2758–2766). https://doi.org/10.1109/ICCV.2015.316.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338. https://doi.org/10.1007/s11263-009-0275-4.
Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1933–1941). https://doi.org/10.1109/CVPR.2016.213.
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3354–3361). IEEE. https://doi.org/10.1109/CVPR.2012.6248074.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90.
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., et al. (2017). T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10), 2896–2907. https://doi.org/10.1109/TCSVT.2017.2736553.
Li, J., Liang, X., Shen, S., Xu, T., Feng, J., & Yan, S. (2017). Scale-aware fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia, 20(4), 985–996. https://doi.org/10.1109/TMM.2017.2759508.
Lin, C., Lu, J., Wang, G., & Zhou, J. (2018). Graininess-aware deep feature learning for pedestrian detection. In Proceedings of the European conference on computer vision (ECCV) (pp. 732–747). https://doi.org/10.1007/978-3-030-01240-3_45.
Liu, W., Liao, S., Ren, W., Hu, W., & Yu, Y. (2019). High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5187–5196). https://arxiv.org/abs/1904.02948.
Ning, G., Zhang, Z., Huang, C., Ren, X., Wang, H., Cai, C., et al. (2017). Spatially supervised recurrent convolutional neural networks for visual object tracking. In 2017 IEEE international symposium on circuits and systems (ISCAS) (pp. 1–4). IEEE. https://doi.org/10.1109/ISCAS.2017.8050867.
Ozge Unel, F., Ozkalayci, B. O., & Cigla, C. (2019). The power of tiling for small object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. http://openaccess.thecvf.com/content_CVPRW_2019/html/UAVision/Unel_The_Power_of_Tiling_for_Small_Object_Detection_CVPRW_2019_paper.html.
Ranjan, A., & Black, M. J. (2017). Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4161–4170). https://doi.org/10.1109/CVPR.2017.291.
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint, http://arxiv.org/abs/1804.02767.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99). https://doi.org/10.1109/TPAMI.2016.2577031.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y.
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 761–769). https://doi.org/10.1109/CVPR.2016.89.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (pp. 568–576). https://arxiv.org/abs/1406.2199.
Song, T., Sun, L., Xie, D., Sun, H., & Pu, S. (2018). Small-scale pedestrian detection based on somatic topology localization and temporal feature aggregation. arXiv preprint, https://arxiv.org/abs/1807.01438.
Varol, G., Laptev, I., & Schmid, C. (2017). Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1510–1517. https://doi.org/10.1109/TPAMI.2017.2712608.
Wang, S., Cheng, J., Liu, H., & Tang, M. (2018). PCN: Part and context information for pedestrian detection with CNNs. arXiv preprint, https://arxiv.org/abs/1804.04483.
Wang, S., Zhou, Y., Yan, J., & Deng, Z. (2018). Fully motion-aware network for video object detection. In Proceedings of the European conference on computer vision (ECCV) (pp. 542–557). https://doi.org/10.1007/978-3-030-01261-8_33.
Wojek, C., Walk, S., & Schiele, B. (2009). Multi-cue onboard pedestrian detection. In 2009 IEEE conference on computer vision and pattern recognition (pp. 794–801). IEEE. https://doi.org/10.1109/CVPR.2009.5206638.
Xie, H., Chen, Y., & Shin, H. (2019). Context-aware pedestrian detection especially for small-sized instances with Deconvolution Integrated Faster RCNN (DIF R-CNN). Applied Intelligence, 49(3), 1200–1211. https://doi.org/10.1007/s10489-018-1326-8.
Zhang, X., Cheng, L., Li, B., & Hu, H.-M. (2018). Too far to see? Not really!—Pedestrian detection with scale-aware localization policy. IEEE Transactions on Image Processing, 27(8), 3703–3715. https://doi.org/10.1109/TIP.2018.2818018.
Zhu, P., Wen, L., Bian, X., Ling, H., & Hu, Q. J. (2018). Vision meets drones: A challenge. arXiv preprint, https://arxiv.org/abs/1804.07437.
Zhu, X., Wang, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE international conference on computer vision (pp. 408–417). https://doi.org/10.1109/ICCV.2017.52.
Acknowledgements
This material is based on work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (10080619).
Cite this article
Xie, H., Shin, H. Two-stream small-scale pedestrian detection network with feature aggregation for drone-view videos. Multidim Syst Sign Process 32, 897–913 (2021). https://doi.org/10.1007/s11045-021-00764-1