Abstract
Detecting small-scale pedestrians in aerial images is a challenging task that can be difficult even for humans. A single-image method cannot achieve robust performance because small instances provide poor visual cues, whereas multiple frames can supply information that a single frame lacks for such difficult cases. We therefore design a novel video-based pedestrian detection method with a two-stream network pipeline that fully utilizes the temporal and contextual information of a video. An aggregated feature map is proposed to absorb spatial and temporal information with the help of spatial and temporal sub-networks. To better capture motion information, a more refined flow network (SPyNet) is adopted instead of a simple FlowNet. In the spatial-stream sub-network, we modify the backbone structure to increase the feature map resolution while keeping a relatively large receptive field, making it suitable for small-scale detection. Experimental results on drone video datasets demonstrate that our approach improves detection accuracy for small-scale instances and reduces false positive detections. By exploiting temporal information and aggregating feature maps, our two-stream method improves detection performance by 8.48% in mean Average Precision (mAP) over the basic single-stream R-FCN method, and it outperforms the state-of-the-art method by 3.09% on the Okutama Human-action dataset.
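The aggregation step can be made concrete with a short sketch. The PyTorch-style code below is a minimal illustration, not the published implementation: the helper names warp_features and aggregate are hypothetical, the flow fields are assumed to come from an external estimator such as SPyNet and to be already rescaled to feature-map resolution, and the cosine-similarity weighting is one plausible choice for the adaptive fusion.

```python
# Minimal sketch of flow-guided feature aggregation (illustrative, not the authors' code).
import torch
import torch.nn.functional as F

def warp_features(feat, flow):
    """Warp a (N, C, H, W) feature map to the current frame using a
    (N, 2, H, W) flow field given in pixels (channel 0 = x, channel 1 = y)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2, H, W) pixel grid
    grid = base.unsqueeze(0) + flow                                   # displaced sampling positions
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0                       # normalize x to [-1, 1]
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0                       # normalize y to [-1, 1]
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def aggregate(cur_feat, neighbor_feats, flows):
    """Fuse the current-frame features with warped neighbor features,
    weighted per pixel by cosine similarity to the current frame."""
    warped = [cur_feat] + [warp_features(f, fl) for f, fl in zip(neighbor_feats, flows)]
    sims = torch.stack([F.cosine_similarity(w, cur_feat, dim=1).unsqueeze(1)
                        for w in warped], dim=0)                      # (K, N, 1, H, W)
    weights = torch.softmax(sims, dim=0)                              # normalize over frames
    return (weights * torch.stack(warped, dim=0)).sum(dim=0)          # aggregated (N, C, H, W) map

# Illustrative usage: two neighbor frames; shapes are arbitrary here.
cur = torch.randn(1, 256, 90, 160)
neighbors = [torch.randn(1, 256, 90, 160) for _ in range(2)]
flows = [torch.randn(1, 2, 90, 160) for _ in range(2)]                # e.g., SPyNet output
agg = aggregate(cur, neighbors, flows)                                # fed to the detection head
```

In this sketch the adaptive weights are computed directly in feature space; a trained network may instead learn an embedding for the weighting, as in flow-guided feature aggregation (Zhu et al. 2017).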
References
Barekatain, M., Marti, M., Shih, H. -F., Murray, S., Nakayama, K., Matsuo, Y., et al. (2017). Okutama-Action: An aerial view video dataset for concurrent human action detection. In 30th IEEE conference on computer vision and pattern recognition workshops (Vol. 2017, pp. 2153–2160). IEEE Computer Society. https://doi.org/10.1109/CVPRW.2017.267.
Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. In European conference on computer vision (pp. 354–370). Springer. https://doi.org/10.1007/978-3-319-46493-0_22.
Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (pp. 379–387). http://papers.nips.cc/paper/6465-r-fcn-object-detection-via-region-based-fully-convolutional-networks.pdf.
Dollár, P., Appel, R., Belongie, S., & Perona, P. (2014). Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8), 1532–1545. https://doi.org/10.1109/TPAMI.2014.2300479.
Dollár, P., Wojek, C., Schiele, B., & Perona, P. (2011). Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4), 743–761. https://doi.org/10.1109/TPAMI.2011.155.
Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision (pp. 2758–2766). https://doi.org/10.1109/ICCV.2015.316.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338. https://doi.org/10.1007/s11263-009-0275-4.
Feichtenhofer, C., Pinz, A., & Zisserman, A. (2016). Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1933–1941). https://doi.org/10.1109/CVPR.2016.213.
Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition (pp. 3354–3361). IEEE. https://doi.org/10.1109/CVPR.2012.6248074.
Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). https://doi.org/10.1109/CVPR.2016.90.
Kang, K., Li, H., Yan, J., Zeng, X., Yang, B., Xiao, T., et al. (2017). T-CNN: Tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology, 28(10), 2896–2907. https://doi.org/10.1109/TCSVT.2017.2736553.
Li, J., Liang, X., Shen, S., Xu, T., Feng, J., & Yan, S. (2017). Scale-aware fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia, 20(4), 985–996. https://doi.org/10.1109/TMM.2017.2759508.
Lin, C., Lu, J., Wang, G., & Zhou, J. (2018). Graininess-aware deep feature learning for pedestrian detection. In Proceedings of the European conference on computer vision (ECCV) (pp. 732–747). https://doi.org/10.1007/978-3-030-01240-3_45.
Liu, W., Liao, S., Ren, W., Hu, W., & Yu, Y. (2019). High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5187–5196). https://arxiv.org/abs/1904.02948.
Ning, G., Zhang, Z., Huang, C., Ren, X., Wang, H., Cai, C., et al. (2017). Spatially supervised recurrent convolutional neural networks for visual object tracking. In 2017 IEEE international symposium on circuits and systems (ISCAS) (pp. 1–4). IEEE. https://doi.org/10.1109/ISCAS.2017.8050867.
Ozge Unel, F., Ozkalayci, B. O., & Cigla, C. (2019). The power of tiling for small object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. http://openaccess.thecvf.com/content_CVPRW_2019/html/UAVision/Unel_The_Power_of_Tiling_for_Small_Object_Detection_CVPRW_2019_paper.html.
Ranjan, A., & Black, M. J. (2017). Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4161–4170). https://doi.org/10.1109/CVPR.2017.291.
Redmon, J., & Farhadi, A. (2018). YOLOv3: An incremental improvement. arXiv preprint, http://arxiv.org/abs/1804.02767.
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91–99). https://doi.org/10.1109/TPAMI.2016.2577031.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y.
Shrivastava, A., Gupta, A., & Girshick, R. (2016). Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 761–769). https://doi.org/10.1109/CVPR.2016.89.
Simonyan, K., & Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems (pp. 568–576). https://arxiv.org/abs/1406.2199.
Song, T., Sun, L., Xie, D., Sun, H., & Pu, S. (2018). Small-scale pedestrian detection based on somatic topology localization and temporal feature aggregation. arXiv preprint, https://arxiv.org/abs/1807.01438.
Varol, G., Laptev, I., & Schmid, C. (2017). Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6), 1510–1517. https://doi.org/10.1109/TPAMI.2017.2712608.
Wang, S., Cheng, J., Liu, H., & Tang, M. (2018). PCN: Part and context information for pedestrian detection with CNNs. arXiv preprint, https://arxiv.org/abs/1804.04483.
Wang, S., Zhou, Y., Yan, J., & Deng, Z. (2018). Fully motion-aware network for video object detection. In Proceedings of the European conference on computer vision (ECCV) (pp. 542–557). https://doi.org/10.1007/978-3-030-01261-8_33.
Wojek, C., Walk, S., & Schiele, B. (2009). Multi-cue onboard pedestrian detection. In 2009 IEEE conference on computer vision and pattern recognition (pp. 794–801). IEEE. https://doi.org/10.1109/CVPR.2009.5206638.
Xie, H., Chen, Y., & Shin, H. (2019). Context-aware pedestrian detection especially for small-sized instances with Deconvolution Integrated Faster RCNN (DIF R-CNN). Applied Intelligence, 49(3), 1200–1211. https://doi.org/10.1007/s10489-018-1326-8.
Zhang, X., Cheng, L., Li, B., & Hu, H.-M. (2018). Too far to see? Not really!—Pedestrian detection with scale-aware localization policy. IEEE Transactions on Image Processing, 27(8), 3703–3715. https://doi.org/10.1109/TIP.2018.2818018.
Zhu, P., Wen, L., Bian, X., Ling, H., & Hu, Q. J. (2018). Vision meets drones: A challenge. arXiv preprint, https://arxiv.org/abs/1804.07437.
Zhu, X., Wang, Y., Dai, J., Yuan, L., & Wei, Y. (2017). Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE international conference on computer vision (pp. 408–417). https://doi.org/10.1109/ICCV.2017.52.
Acknowledgements
This material is based on work supported by the Ministry of Trade, Industry & Energy (MOTIE, Korea) under the Industrial Technology Innovation Program (10080619).
Cite this article
Xie, H., Shin, H. Two-stream small-scale pedestrian detection network with feature aggregation for drone-view videos. Multidim Syst Sign Process 32, 897–913 (2021). https://doi.org/10.1007/s11045-021-00764-1