Lightweight Transformers make strong encoders for underwater object detection

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Underwater object detection is widely used in ocean exploration tasks, where precise center localization helps users find objects of interest quickly and accurately. In recent years, underwater detectors based on convolutional neural networks (CNNs) have achieved great success. However, owing to the locality of convolution, CNN-based detectors usually struggle to explicitly model long-range dependencies. Transformers, in contrast, can capture global context, but their heavy memory and computation requirements severely reduce a detector's inference speed. In this paper, we propose the CSPTCenterNet underwater detector, which uses a proposed lightweight Transformer to extract global context, improving detection performance while maintaining real-time inference. In the upsampling stage, we fuse the encoded feature maps with high-resolution feature maps from the backbone network to restore the spatial details that Transformers lack. Finally, we train the network with GIoU loss and a multi-sample strategy to strengthen the detector's regression accuracy. Extensive experiments on an underwater dataset and the PASCAL VOC dataset demonstrate the effectiveness of the proposed method: it achieves the best detection performance while running 2 to 10 times faster than other state-of-the-art methods.
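To make the design concrete, here is a minimal PyTorch sketch of the two architectural ideas described above: a lightweight Transformer encoder applied to the coarsest backbone feature map to capture global context, and fusion of the encoded features with a higher-resolution backbone feature map during upsampling. This is not the authors' released code; the module name LightweightTransformerNeck and all sizes (channel widths, number of heads and layers) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightTransformerNeck(nn.Module):
    """Sketch: encode a coarse CNN feature map with a small Transformer,
    then upsample and fuse with a higher-resolution backbone feature map."""

    def __init__(self, in_channels=512, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, d_model, kernel_size=1)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=2 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # 1x1 conv to align the high-resolution skip features before fusion.
        self.skip_proj = nn.Conv2d(in_channels // 2, d_model, kernel_size=1)

    def forward(self, low_res, high_res):
        # low_res:  (B, C, H, W)      coarsest backbone feature map
        # high_res: (B, C/2, 2H, 2W)  earlier, higher-resolution feature map
        x = self.proj(low_res)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C): one token per cell
        tokens = self.encoder(tokens)           # global self-attention
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = F.interpolate(x, scale_factor=2.0, mode="nearest")
        return x + self.skip_proj(high_res)     # restore spatial detail


# Example shapes: C5 output of a ResNet-style backbone fused with C4.
neck = LightweightTransformerNeck()
c5 = torch.randn(1, 512, 16, 16)
c4 = torch.randn(1, 256, 32, 32)
out = neck(c5, c4)  # -> (1, 256, 32, 32)
```

The GIoU loss used for box regression follows Rezatofighi et al. [21] and is standard; a direct implementation for (x1, y1, x2, y2) boxes looks like this:

```python
def giou_loss(pred, target, eps=1e-7):
    """Generalized IoU loss for boxes of shape (N, 4) in (x1, y1, x2, y2)."""
    # Intersection rectangle.
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    # Union area.
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / union.clamp(min=eps)
    # Smallest enclosing box.
    cx1 = torch.min(pred[:, 0], target[:, 0])
    cy1 = torch.min(pred[:, 1], target[:, 1])
    cx2 = torch.max(pred[:, 2], target[:, 2])
    cy2 = torch.max(pred[:, 3], target[:, 3])
    c_area = ((cx2 - cx1) * (cy2 - cy1)).clamp(min=eps)
    giou = iou - (c_area - union) / c_area
    return (1.0 - giou).mean()
```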


Availability of data and materials

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

References

  1. Bello, I., Zoph, B., Vaswani, A., et al.: Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3286–3295 (2019)

  2. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: Yolov4: Optimal speed and accuracy of object detection (2020). arXiv preprint arXiv:2004.10934

  3. Carion, N., Massa, F., Synnaeve, G., et al.: End-to-end object detection with transformers. In: European conference on computer vision, Springer, pp 213–229 (2020)

  4. Chen, Q., Wang, Y., Yang, T., et al.: You only look one-level feature. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 13039–13048 (2021)

  5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2020)

  6. Everingham, M., Van Gool, L., Williams, C.K.I., et al.: The pascal visual object classes (VOC) challenge. Int. J. Comput. Vision 88(2), 303–338 (2010). https://doi.org/10.1007/s11263-009-0275-4

  7. Fan, Z., Xia, W., Liu, X., et al.: Detection and segmentation of underwater objects from forward-looking sonar based on a modified Mask RCNN. SIViP 15(6), 1135–1143 (2021). https://doi.org/10.1007/s11760-020-01841-x

  8. Fu, C.Y., Liu, W., Ranga, A., et al.: DSSD: Deconvolutional single shot detector (2017). arXiv preprint arXiv:1701.06659

  9. Girshick, R., Donahue, J., Darrell, T., et al.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587 (2014)

  10. He, K., Zhang, X., Ren, S., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778 (2016)

  11. He, K., Gkioxari, G., Dollár, P., et al.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision, pp 2961–2969 (2017)

  12. Huang, H., Zhou, H., Yang, X., et al.: Faster R-CNN for marine organisms detection and recognition using data augmentation. Neurocomputing 337, 372–384 (2019). https://doi.org/10.1016/j.neucom.2019.01.084

  13. Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV), pp 734–750 (2018)

  14. Lin, T.Y., Goyal, P., Girshick, R., et al.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp 2980–2988 (2017)

  15. Liu, W., Anguelov, D., Erhan, D., et al.: SSD: Single Shot MultiBox Detector. In: Leibe B, Matas J, Sebe N, et al (eds) Computer Vision - ECCV 2016. Springer International Publishing, Cham, Lecture Notes in Computer Science, pp 21–37 (2016), https://doi.org/10.1007/978-3-319-46448-0_2

  16. Liu, Z., Lin, Y., Cao, Y., et al.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 10012–10022 (2021)

  17. Pan, T.S., Huang, H.C., Lee, J.C., et al.: Multi-scale ResNet for real-time underwater object detection. SIViP 15(5), 941–949 (2021). https://doi.org/10.1007/s11760-020-01818-w

  18. Redmon, J., Farhadi, A.: Yolov3: An incremental improvement (2018). arXiv preprint arXiv:1804.02767

  19. Redmon, J., Divvala, S., Girshick, R., et al.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 779–788 (2016)

  20. Ren, S., He, K., Girshick, R., et al.: Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017). https://doi.org/10.1109/TPAMI.2016.2577031

  21. Rezatofighi, H., Tsoi, N., Gwak, J., et al.: Generalized intersection over union: A metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 658–666 (2019)

  22. Srinivas, A., Lin, T.Y., Parmar, N., et al.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16519–16529 (2021)

  23. Tian, Z., Shen, C., Chen, H., et al.: FCOS: Fully convolutional one-stage object detection. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9627–9636 (2019)

  24. Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. Advances in neural information processing systems 30 (2017)

  25. Zhang, S., Chi, C., Yao, Y., et al.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9759–9768 (2020)

  26. Zhang, X., Wan, F., Liu, C., et al.: Freeanchor: Learning to match anchors for visual object detection. Advances in neural information processing systems 32 (2019)

  27. Zhou, X., Wang, D., Krähenbühl, P.: Objects as points (2019). arXiv preprint arXiv:1904.07850

  28. Zhu, X., Su, W., Lu, L., et al.: Deformable detr: Deformable transformers for end-to-end object detection. In: International Conference on Learning Representations (2020)

Acknowledgements

This project was supported by Guangzhou Key Laboratory of Intelligent Agriculture (201902010081).

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the study conception and design. Material preparation, data collection, and analysis were performed by Hailong Liu and Jinrong Cui. The first draft of the manuscript was written by Hailong Liu, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Weifeng Zhang.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cui, J., Liu, H., Zhong, H. et al. Lightweight Transformers make strong encoders for underwater object detection. SIViP 17, 1889–1896 (2023). https://doi.org/10.1007/s11760-022-02400-2
