Abstract
Semantic segmentation is the task of assigning a class label to every pixel in an image. Most recent semantic segmentation methods adopt the fully convolutional network (FCN) paradigm, which uses an encoder-decoder architecture: the encoder extracts features, and the decoder takes the encoded features as input and decodes the final segmentation prediction. However, the convolutional kernels used for feature extraction are small, so the model can only exploit local information to understand the input image, which limits its initial receptive field. In addition, semantic segmentation requires not only semantic information but also detail and contextual information. To address these problems, we introduce the atrous spatial pyramid pooling (ASPP) structure into TransUnet, a model based on Transformers and U-Net; we call the resulting model AS-TransUnet. The spatial pyramid module enlarges the receptive field and captures multi-scale information. We also add an attention module to the decoder to help the model learn relevant features. To verify the performance and efficiency of the model, we conducted experiments on two common datasets and compared AS-TransUnet with state-of-the-art models. Experimental results show the superiority of the proposed model.
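The core idea behind ASPP is to run parallel atrous (dilated) convolutions at different rates: a dilation rate r spreads a k-by-k kernel over an effective window of k + (k - 1)(r - 1) pixels, enlarging the receptive field with no extra parameters. The following is a minimal NumPy sketch of that mechanism, not the authors' implementation; the `aspp` fusion here simply averages the parallel branches, a simplification of the learned 1x1-convolution fusion used in real ASPP modules.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation=1):
    """Valid-mode 2D convolution with a dilated (atrous) kernel.

    The effective window along each axis is k + (k - 1) * (dilation - 1),
    so larger dilation rates see a wider context at no extra parameter
    cost -- the property ASPP exploits with its parallel branches.
    """
    kh, kw = kernel.shape
    eff_h = kh + (kh - 1) * (dilation - 1)
    eff_w = kw + (kw - 1) * (dilation - 1)
    H, W = x.shape
    out = np.zeros((H - eff_h + 1, W - eff_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sample the input with stride = dilation inside the window.
            patch = x[i:i + eff_h:dilation, j:j + eff_w:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def aspp(x, kernel, rates=(1, 2, 3)):
    """ASPP-style fusion: run parallel dilated convolutions and average
    the overlapping valid regions (a stand-in for the 1x1-conv fusion)."""
    maps = [dilated_conv2d(x, kernel, r) for r in rates]
    # Crop every branch to the smallest output size, then fuse.
    h = min(m.shape[0] for m in maps)
    w = min(m.shape[1] for m in maps)
    return np.mean([m[:h, :w] for m in maps], axis=0)
```

With a 3x3 kernel, rates 1, 2, and 3 give effective windows of 3, 5, and 7 pixels, so the fused output mixes three receptive-field scales at every position, which is the multi-scale behavior the abstract refers to.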
Acknowledgment
The authors would like to acknowledge the support from the National Natural Science Foundation of China (52075530).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, J., Gao, D., Wang, X., Gao, H., Ju, Z. (2023). AS-TransUnet: Combining ASPP and Transformer for Semantic Segmentation. In: Yang, H., et al. Intelligent Robotics and Applications. ICIRA 2023. Lecture Notes in Computer Science(), vol 14268. Springer, Singapore. https://doi.org/10.1007/978-981-99-6486-4_13
DOI: https://doi.org/10.1007/978-981-99-6486-4_13
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-6485-7
Online ISBN: 978-981-99-6486-4
eBook Packages: Computer Science (R0)