Abstract
Semantic segmentation is the task of assigning a class label to every pixel in an image. Most recent semantic segmentation methods adopt the fully convolutional network (FCN) paradigm, which uses an encoder-decoder architecture: the encoder extracts features, and the decoder takes the encoded features as input and decodes the final segmentation prediction. However, the convolutional kernels used for feature extraction are small, so the model can only exploit local information to understand the input image, which limits its initial receptive field. In addition, semantic segmentation requires not only semantic information but also detail and contextual information. To address these problems, we introduce the atrous spatial pyramid pooling (ASPP) structure into TransUnet, a model based on Transformers and U-Net; we call the resulting model AS-TransUnet. The spatial pyramid module enlarges the receptive field and captures multi-scale information. We also add an attention module to the decoder to help the model learn relevant features. To verify the performance and efficiency of the model, we conducted experiments on two common datasets and compared AS-TransUnet with state-of-the-art models. Experimental results show the superiority of the proposed model.
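The core idea behind ASPP is to run parallel atrous (dilated) convolutions at different rates: a dilation rate r spreads a k-by-k kernel over an effective window of k + (k - 1)(r - 1) pixels, enlarging the receptive field with no extra parameters. The following is a minimal NumPy sketch of that mechanism, not the authors' implementation; the `aspp` fusion here simply averages the parallel branches, a simplification of the learned 1x1-convolution fusion used in real ASPP modules.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Valid-mode 2D convolution with a dilated (atrous) kernel.

    The effective window along each axis is k + (k - 1) * (dilation - 1),
    so larger dilation rates see a wider context at no extra parameter
    cost -- the property ASPP exploits with its parallel branches.
    """
    kh, kw = kernel.shape
    eff_h = kh + (kh - 1) * (dilation - 1)
    eff_w = kw + (kw - 1) * (dilation - 1)
    H, W = x.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input with stride = dilation inside the window.
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def aspp(x, kernel, rates=(1, 2, 3)):
    """ASPP-style fusion: run parallel dilated convolutions and average
    the overlapping valid regions (a stand-in for the 1x1-conv fusion)."""
    maps = [dilated_conv2d(x, kernel, r) for r in rates]
    # Crop every branch to the smallest output size, then fuse.
    h = min(m.shape[0] for m in maps)
    w = min(m.shape[1] for m in maps)
    return np.mean([m[:h, :w] for m in maps], axis=0)
```

With a 3x3 kernel, rates 1, 2, and 3 give effective windows of 3, 5, and 7 pixels, so the fused output mixes three receptive-field scales at every position, which is the multi-scale behavior the abstract refers to.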
Acknowledgment
The authors would like to acknowledge the support from the National Natural Science Foundation of China (52075530).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, J., Gao, D., Wang, X., Gao, H., Ju, Z. (2023). AS-TransUnet: Combining ASPP and Transformer for Semantic Segmentation. In: Yang, H., et al. Intelligent Robotics and Applications. ICIRA 2023. Lecture Notes in Computer Science(), vol 14268. Springer, Singapore. https://doi.org/10.1007/978-981-99-6486-4_13
DOI: https://doi.org/10.1007/978-981-99-6486-4_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-6485-7
Online ISBN: 978-981-99-6486-4
eBook Packages: Computer Science (R0)