
Multiview Detection with Feature Perspective Transformation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12352)

Abstract

Incorporating multiple camera views for detection alleviates the impact of occlusions in crowded scenes. In a multiview detection system, we need to answer two important questions. First, how should we aggregate cues from multiple views? Second, how should we aggregate information from spatially neighboring locations? To address these questions, we introduce a novel multiview detector, MVDet. For multiview aggregation, existing methods represent each ground location with multiview anchor box features, which potentially limits performance because pre-defined anchor boxes can be inaccurate. In contrast, via feature map perspective transformation, MVDet employs anchor-free representations whose feature vectors are sampled directly from the corresponding pixels in multiple views. For spatial aggregation, unlike previous methods that require designs and operations outside of neural networks, MVDet takes a fully convolutional approach, applying large convolutional kernels to the multiview-aggregated feature map. The proposed model is end-to-end learnable and achieves 88.2% MODA on the Wildtrack dataset, outperforming the state of the art by 14.1%. We also provide a detailed analysis of MVDet on a newly introduced synthetic dataset, MultiviewX, which allows us to control the level of occlusion. Code and the MultiviewX dataset are available at https://github.com/hou-yz/MVDet.
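
To make the two aggregation steps concrete, here is a minimal PyTorch sketch: per-view feature maps are perspective-transformed onto a shared ground plane (each ground location simply samples the feature vector at its projected pixel, so no anchor boxes are involved), then a fully convolutional stack aggregates spatial neighborhoods. This is an illustration under stated assumptions, not the authors' implementation (see the repository above): the names `warp_to_ground` and `GroundPlaneAggregator`, the ground-plane grid size, and the dilated-convolution stack standing in for large-kernel convolutions are all assumptions, and the 3x3 homographies mapping ground-plane coordinates to image pixels are assumed to be precomputed from camera calibration.

```python
# Minimal sketch of MVDet-style aggregation; names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


def warp_to_ground(feat, H, out_size):
    """Perspective-transform one view's feature map onto the ground plane.

    feat: (C, h, w) feature map from one camera.
    H: (3, 3) homography mapping ground-plane coords to image pixel coords
       (assumed precomputed from calibration).
    out_size: (Hg, Wg) ground-plane grid size.
    Returns a (C, Hg, Wg) map: each ground location holds the feature vector
    sampled at its projected pixel (anchor-free representation).
    """
    Hg, Wg = out_size
    ys, xs = torch.meshgrid(torch.arange(Hg, dtype=torch.float32),
                            torch.arange(Wg, dtype=torch.float32),
                            indexing='ij')
    pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (Hg, Wg, 3)
    img = pts @ H.T                                            # project to image
    img = img[..., :2] / img[..., 2:].clamp(min=1e-6)          # dehomogenize
    h, w = feat.shape[-2:]
    # normalize pixel coords to [-1, 1] as grid_sample expects (x, y) order
    gx = img[..., 0] / (w - 1) * 2 - 1
    gy = img[..., 1] / (h - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1)[None]                 # (1, Hg, Wg, 2)
    # locations outside the camera's view sample zeros (default padding)
    return F.grid_sample(feat[None], grid, align_corners=True)[0]


class GroundPlaneAggregator(nn.Module):
    """Concatenate the warped per-view features, then aggregate spatial
    neighbors fully convolutionally; dilated 3x3 convolutions stand in here
    for a large effective receptive field."""

    def __init__(self, num_views, feat_dim, out_size=(120, 360)):
        super().__init__()
        self.out_size = out_size
        self.spatial = nn.Sequential(
            nn.Conv2d(num_views * feat_dim, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(128, 1, 3, padding=4, dilation=4),  # occupancy logits
        )

    def forward(self, view_feats, proj_mats):
        # view_feats: list of (C, h, w); proj_mats: list of (3, 3)
        warped = [warp_to_ground(f, H, self.out_size)
                  for f, H in zip(view_feats, proj_mats)]
        ground = torch.cat(warped, dim=0)[None]       # (1, N*C, Hg, Wg)
        return self.spatial(ground)                   # ground-plane score map
```

Because sampling and convolution are both differentiable, the whole pipeline remains end-to-end learnable, which is the property the abstract emphasizes.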

Keywords

Multiview detection · Anchor-free · Perspective transformation · Fully convolutional · Synthetic data

Notes

Acknowledgement

Dr. Liang Zheng is the recipient of an Australian Research Council Discovery Early Career Award (DE200101283) funded by the Australian Government. The authors thank all anonymous reviewers and ACs for their constructive comments.

Supplementary material

Supplementary material 1: 504444_1_En_1_MOESM1_ESM.zip (17.5 MB)


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

1. Australian Centre for Robotic Vision, Australian National University, Canberra, Australia
