Hybrid Shunted Transformer embedding UNet for remote sensing image semantic segmentation

  • Original Article
Neural Computing and Applications

Abstract

With the development of deep learning, semantic segmentation of Remote Sensing Images (RSI) has advanced significantly. However, because objects are sparsely distributed and classes are highly similar, the task remains extremely challenging. In this paper, we propose HST-UNet, a novel hybrid semantic segmentation framework for RSI that uses a Shunted Transformer as the encoder and a Multi-Scale Convolutional Attention Network (MSCAN) as the decoder. This design addresses shortcomings of existing models by extracting and recovering both the global and local features of RSI. To better fuse information from the encoder and the decoder and to alleviate ambiguity, we design a Learnable Weighted Fusion (LWF) module that connects encoder features to the decoder features. Extensive experiments demonstrate that HST-UNet outperforms state-of-the-art methods, achieving MIoU/F1 scores of 71.44%/83.00% on the ISPRS Vaihingen dataset and 77.36%/87.09% on the ISPRS Potsdam dataset. The code will be available at https://github.com/HC-Zhou/HST-UNet.
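The two reported metrics are related: for any class, F1 = 2·IoU / (1 + IoU), so per-class F1 is never below per-class IoU, and the mean F1 is never below MIoU. The following is a minimal sketch of how both are commonly computed from a confusion matrix. It is not the authors' evaluation code; the function names and the toy labels are hypothetical, and details such as which classes are averaged are assumptions.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_iou_f1(cm: List[List[int]]) -> Tuple[List[float], List[float]]:
    """Return per-class IoU and F1 lists derived from a confusion matrix."""
    n = len(cm)
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # class c pixels missed
        fp = sum(cm[r][c] for r in range(n)) - tp  # other pixels labeled c
        iou_den = tp + fp + fn
        f1_den = 2 * tp + fp + fn
        # Assumption: a class absent from both labels and predictions scores 0.
        ious.append(tp / iou_den if iou_den else 0.0)
        f1s.append(2 * tp / f1_den if f1_den else 0.0)
    return ious, f1s


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Toy flattened label maps with three hypothetical classes.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
ious, f1s = per_class_iou_f1(cm)
print(f"MIoU = {mean(ious):.4f}, mean F1 = {mean(f1s):.4f}")
```

In this sketch the averages are taken over all classes with equal weight; benchmark protocols may exclude some classes, which this example does not model.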

Data availability

The data that support the findings of this study are available at http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html with corresponding permission.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (Grant Nos. 62006049, 62172113, and 62072123), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515010939), the Project of Education Department of Guangdong Province (Grant Nos. 2022KTSCX068 and 2020ZDZX3059), the Ministry of Education Humanities and Social Science Project (Grant No. 18JDGC012), the Guangdong Science and Technology Project (Grant Nos. KTP20210197 and 2017A040403068), and the Guangdong Science and Technology Innovation Strategy Special Fund Project (Climbing Plan) (No. pdjh2022b0302).

Author information

Contributions

Huacong Zhou performed conceptualization, software, and writing—original draft; Xiangling Xiao performed validation, formal analysis, and writing—review and editing; Huihui Li performed methodology and funding acquisition; Xiaoyong Liu performed supervision and funding acquisition; and Peng Liang performed resources, validation, investigation, and funding acquisition.

Corresponding authors

Correspondence to Huihui Li or Xiaoyong Liu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

About this article

Cite this article

Zhou, H., Xiao, X., Li, H. et al. Hybrid Shunted Transformer embedding UNet for remote sensing image semantic segmentation. Neural Comput & Applic (2024). https://doi.org/10.1007/s00521-024-09888-4

  • DOI: https://doi.org/10.1007/s00521-024-09888-4
