Align-Yolact: a one-stage semantic segmentation network for real-time object detection

Abstract

Object detection is a classic problem in computer vision, and a main bottleneck lies in the fusion of multi-scale features. In this paper, we systematically study neural network architecture design choices for real-time object detection and propose Align-Yolact to improve instance segmentation accuracy. First, we propose a weighted bounding box, which improves bounding-box localization accuracy. Second, we add a bi-directional feature pyramid network to the feature fusion stage, which improves mask quality and small-target accuracy. Owing to these optimizations and better backbones, we achieve state-of-the-art (SOTA) results in both detection efficiency and accuracy.
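The abstract does not specify how the bi-directional feature pyramid network combines features. As a rough illustration only, the sketch below assumes the "fast normalized fusion" rule used by BiFPN in EfficientDet, where each input feature is scaled by a non-negative learnable weight and the result is divided by the sum of the weights plus a small epsilon. The function name, the flat-list feature representation, and the epsilon value are assumptions made for this example, not details taken from Align-Yolact.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fuse equally sized feature vectors with normalized, non-negative weights.

    Hypothetical helper: each feature map is flattened to a list of floats
    so the example stays dependency-free.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all features must have the same length")

    # ReLU on the weights keeps every contribution non-negative.
    w = [max(0.0, x) for x in weights]
    # Normalizing by the weight sum bounds each coefficient to [0, 1);
    # eps avoids division by zero when all weights are clipped to 0.
    total = sum(w) + eps

    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(length)
    ]


# Example: fuse a (resized) coarse-level feature with a fine-level feature.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

With equal weights the result is close to the element-wise mean of the inputs. A negative weight is clipped to zero, so that input is ignored rather than subtracted.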

Acknowledgements

The authors would like to thank all the participants who took part in the experiments. This work was supported in part by the National Science Foundation of China (Grant No. 61841701), the Fujian Vocational College Intelligent Equipment Application Technology Collaborative Innovation Center Construction Project (Grant No. 2016-7), and the Science and Technology Project of the Transportation Department of Fujian Province (Grant No. 201934).

Author information

Corresponding author

Correspondence to Shaodan Lin.

About this article

Cite this article

Lin, S., Zhu, K., Feng, C. et al. Align-Yolact: a one-stage semantic segmentation network for real-time object detection. J Ambient Intell Human Comput (2021). https://doi.org/10.1007/s12652-021-03340-4

Keywords

  • Detector architecture
  • Backbone
  • BI-FPN
  • Local coefficient
  • NMS
  • GIOU