Abstract
Road segmentation is a fundamental task for dynamic map in unmanned aerial vehicle (UAV) path navigation. In unplanned, unknown and even damaged areas, there are usually unpaved roads with blurred edges, deformations and occlusions. These challenges of unpaved road segmentation pose significant challenges to the construction of dynamic maps. Our major contributions have: (1) Inspired by dilated convolution, we propose dilated cross window self-attention (DCWin-Attention), which is composed of a dilated cross window mechanism and a pixel regional module. Our goal is to model the long-range horizontal and vertical road dependencies for unpaved roads with deformation and blurred edges. (2) A shifted cross window mechanism is introduced through coupling with DCWin-Attention to reduce the influence of occluded roads in UAV imagery. In detail, the GVT backbone is constructed by using the DCWin-Attention block for multilevel deep features with global dependency. (3) The unpaved road is segmented with the confidence map generated by fusing the deep features of different levels in a unified perceptual parsing network. We verify our method on the self-established BJUT-URD dataset and public DeepGlobe dataset, which achieves 67.72 and 52.67% of the highest IoU at proper inference efficiencies of 2.7, 2.8 FPS, respectively, demonstrating its effectiveness and superiority in unpaved road segmentation. Our code is available at https://github.com/BJUT-AIVBD/GVT-URS.
Similar content being viewed by others
Data availability
Data are openly available in a public repository that issues datasets. The data that support the findings of this study are available in DeepGlobe openly at https://www.kaggle.com/datasets/balraj98/deepglobe-road-extraction-dataset and self-built BJUT-URD at https://github.com/BJUT-AIVBD/GVT-URS/tree/main/unpaved-road-dataset.
References
Liu, F., Liu, Y., Nie, Z., Gao, Y.: Precise single-frequency positioning using low-cost receiver with the aid of lane-level map matching for land vehicle navigation. J. Navig. 74(1), 24–37 (2021). https://doi.org/10.1017/S0373463320000375
Zhang, J., Xiu, Y.: Image stitching based on human visual system and SIFT algorithm. Vis. Comput. 40(1), 427–439 (2024). https://doi.org/10.1007/s00371-023-02791-4
Shi, W., Zhu, C.: The line segment match method for extracting road network from high-resolution satellite images. IEEE Trans. Geosci. Remote Sens. 40(2), 511–514 (2002). https://doi.org/10.1109/36.992826
Huang, X., Lu, Q., Zhang, L.: A multi-index learning approach for classification of high-resolution remotely sensed images over urban areas. J. Photogramm. Remote Sens. 90(1), 36–48 (2014). https://doi.org/10.1016/j.isprsjprs.2014.01.008
Zhou, H., Kong, H., Wei, L., Creighton, D., Nahavandi, S.: On detecting road regions in a single UAV image. IEEE Trans. Intell. Transp. Syst. 18(7), 1713–1722 (2017). https://doi.org/10.1109/TITS.2016.2622280
Yang, X., Li, X., Ye, Y., Lau, R.Y.K., Zhang, X., Huang, X.: Road detection and centerline extraction via deep recurrent convolutional neural network U-Net. IEEE Trans. Geosci. Remote Sens. 57(9), 7209–7220 (2019). https://doi.org/10.1109/TGRS.2019.2912301
Li, J., Chen, J., Sheng, B., Li, P., Yang, P., Feng, D.D., Qi, J.: Automatic detection and classification system of domestic waste via multimodel cascaded convolutional neural network. IEEE Trans. Ind. Informat. 18(1), 163–173 (2022). https://doi.org/10.1109/TII.2021.3085669
Xie, Z., Zhang, W., Sheng, B., Li, P., Chen, C.L.P.: BaGFN: Broad attentive graph fusion network for high-order feature interactions. IEEE Trans. Neural Networks Learn. Syst. 34(8), 4499–4513 (2023). https://doi.org/10.1109/TNNLS.2021.3116209
Sheng, B., Li, P., Ali, R., Chen, C.L.P.: Improving video temporal consistency via broad learning system. IEEE Trans. Cybern. 52(7), 6662–6675 (2022). https://doi.org/10.1109/TCYB.2021.3079311
Jiang, N., Sheng, B., Li, P., Lee, T.-Y.: PhotoHelper: Portrait photographing guidance via deep feature retrieval and fusion. IEEE Trans. Multim. 25(1), 2226–2238 (2023). https://doi.org/10.1109/TMM.2022.3144890
Shamsolmoali, P., Zareapoor, M., Zhou, H., Wang, R., Yang, J.: Road segmentation for remote sensing images using adversarial spatial pyramid networks. IEEE Trans. Geosci. Remote Sens. 59(6), 4673–4688 (2021). https://doi.org/10.1109/TGRS.2020.3016086
Soni, P.K., Rajpal, N., Mehta, R.: Semiautomatic road extraction framework based on shape features and LS-SVM from high-resolution images. J. Indian Soc. Remote Sens. 48(1), 513–524 (2020). https://doi.org/10.1007/s12524-019-01077-4
Soni, P.K., Rajpal, N., Mehta, R.: Road network extraction using multi-layered filtering and tensor voting from aerial images. Egypt. J. Remote Sens. Space Sci. 24(2), 211–219 (2021). https://doi.org/10.1016/j.ejrs.2021.01.004
Gong, S., Zhou, H., Xue, F., Fang, C., Li, Y., Zhou, Y.: FastRoadSeg: Fast monocular road segmentation network. IEEE Trans. Intel. Trans. Syst. 23(11), 21505–21514 (2022). https://doi.org/10.1109/TITS.2022.3192473
Zhang, H., Song, Y., Chen, Y., Zhong, H., Liu, L., Wang, Y., Akilan, T., Wu, Q.M.J.: MRSDI-CNN: Multi-model rail surface defect inspection system based on convolutional neural networks. IEEE Trans. Intel. Trans. Syst. 23(8), 11162–11177 (2022). https://doi.org/10.1109/TITS.2021.3101053
Liang, X., Zhang, J., Zhuo, L., Li, Y., Tian, Q.: Small object detection in unmanned aerial vehicle images using feature fusion and scaling-based single shot detector with spatial context analysis. IEEE Trans. Circuits Syst. Video Technol. 30(6), 1758–1770 (2020). https://doi.org/10.1109/TCSVT.2019.2905881
Wang, Y., Park, J.: Transitional asymmetric non-local neural networks for real-world dirt road segmentation. In: Int. Conf. Pattern Recognit., pp. 6949–6956 (2021). https://doi.org/10.1109/ICPR48806.2021.9412882
Li, X., Zhao, Z., Wang, Q.: ABSSNet: Attention-based spatial segmentation network for traffic scene understanding. IEEE Trans. Cyber. 52(9), 9352–9362 (2021). https://doi.org/10.1109/TCYB.2021.3050558
Abdollahi, A., Pradhan, B., Alamri, A.: RoadVecNet: A new approach for simultaneous road network segmentation and vectorization from aerial and google earth imagery in a complex urban set-up. GISci. Remote Sens. 58(7), 1151–1174 (2021). https://doi.org/10.1080/15481603.2021.1972713
Elhassan, M.A., Huang, C., Yang, C., Munea, T.L.: DSANet: Dilated spatial attention for real-time semantic segmentation in urban street scenes. Expert Syst. Appl. 183(1), 1150–1190 (2021). https://doi.org/10.1016/j.eswa.2021.115090
Liu, S., Zhang, H., Shao, L., Yang, J.: Built-in depth-semantic coupled encoding for scene parsing, vehicle detection, and road segmentation. IEEE Trans. Intel. Trans. Syst. 22(9), 5520–5534 (2021). https://doi.org/10.1109/TITS.2020.2987819
Chen, W., Zhou, G., Liu, Z., Li, X., Zheng, X., Wang, L.: NIGAN: A framework for mountain road extraction integrating remote sensing road-scene neighborhood probability enhancements and improved conditional generative adversarial network. IEEE Trans. Geos. Remote Sens. 60(1), 1–15 (2022). https://doi.org/10.1109/TGRS.2022.3188908
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with Transformers. In: Eur. Conf. Comput. Vis., pp. 213–229 (2020). https://doi.org/10.1007/978-3-030-58452-8_13
Chen, Z., Qiu, G., Li, P., Zhu, L., Yang, X., Sheng, B.: MNGNAS: Distilling adaptive combination of multiple searched networks for one-shot neural architecture search. IEEE Trans. Pattern Anal. Mach. Intell. 45(11), 13489–13508 (2023). https://doi.org/10.1109/TPAMI.2023.3293885
Chen, T., Jiang, D., Li, R.: Swin Transformers make strong contextual encoders for VHR image road extraction. In: IEEE Int. Geosci. Remote Sens. Symp., pp. 3019–3022 (2022). https://doi.org/10.1109/IGARSS46834.2022.9883628
Ding, L., Lin, D., Lin, S., Zhang, J., Cui, X., Wang, Y., Tang, H., Bruzzone, L.: Looking outside the window: wide-context transformer for the semantic segmentation of high-resolution remote sensing images. IEEE Trans. Geos. Remote Sens. 60(1), 1–13 (2022). https://doi.org/10.1109/TGRS.2022.3168697
Li, A., Jiao, J., Li, N., Qi, W., Xu, W., Pang, M.: Conmw Transformer: A general vision Transformer backbone with merged-window attention. In: Int. Conf. on Image Process., pp. 1551–1555 (2022). https://doi.org/10.1109/ICIP46576.2022.9897179
Gao, L., Liu, H., Yang, M., Chen, L., Wan, Y., Xiao, Z., Qian, Y.: STransFuse: Fusing swin Transformer and convolutional neural network for remote sensing image semantic segmentation. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 14(1), 10990–11003 (2021). https://doi.org/10.1109/JSTARS.2021.3119654
He, X., Zhou, Y., Zhao, J., Zhang, D., Yao, R., Xue, Y.: Swin Transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geos. Remote Sens. 60(1), 1–15 (2022). https://doi.org/10.1109/TGRS.2022.3144165
Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P. H.: Rethinking semantic segmentation from a sequence-to-sequence perspective with Transformers. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6881–6890 (2021). https://doi.org/10.1109/CVPR46437.2021.00681
Lin, X., Sun, S., Huang, W., Sheng, B., Li, P., Feng, D.D.: EAPT: Efficient attention pyramid Transformer for image processing. IEEE Trans. Multim. 25(1), 50–61 (2023). https://doi.org/10.1109/TMM.2021.3120873
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical vision Transformer using shifted windows. In: IEEE Int. Conf. Comput. Vis., pp. 10012–10022 (2021). https://doi.org/10.1109/ICCV48922.2021.00986
Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Eur. Conf. Comput. Vis., pp. 418–434 (2018). https://doi.org/10.1007/978-3-030-01228-1_26
Gao, X., Sun, X., Yan, M., Sun, H., Fu, K., Zhang, Y., Ge, Z.: Road extraction from remote sensing images by multiple feature pyramid network. In: IEEE Int. Geosci. Remote Sens. Symp., pp. 6907–6910 (2018). https://doi.org/10.1109/IGARSS.2018.8519093
Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6230–6239 (2017). https://doi.org/10.1109/CVPR.2017.660
Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: CSwin Transformer: A general vision Transformer backbone with cross-shaped windows. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 12114–12124 (2022). https://doi.org/10.1109/CVPR52688.2022.01181
Demir, I., Koperski, K., Lindenbaum, D., Pang, G., Huang, J., Basu, S., Hughes, F., Tuia, D., Raskar, R.: DeepGlobe 2018: A challenge to parse the earth through satellite images. In: IEEE Conf. Comput. Vis. Pattern Recognit. Workshop, pp. 172–181 (2018). https://doi.org/10.1109/CVPRW.2018.00031
Yuan, Y., Chen, X., Wang, J.: Object-contextual representations for semantic segmentation. In: Eur. Conf. Comput. Vis., pp. 173–190 (2020). https://doi.org/10.1007/978-3-030-58539-6_11
Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-Decoder with atrous separable convolution for semantic image segmentation. In: Eur. Conf. Comput. Vis., pp. 801–818 (2018). https://doi.org/10.1007/978-3-030-01234-2_49
Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: CCNet: Criss-cross attention for semantic segmentation. In: IEEE Int. Conf. Comput. Vis., pp. 603–612 (2019). https://doi.org/10.1109/ICCV.2019.00069
Pan, H., Hong, Y., Sun, W., Jia, Y.: Deep dual-resolution networks for real-time and accurate semantic segmentation of traffic scenes. IEEE Trans. Intell. Transp. Syst. 24(3), 3448–3460 (2023). https://doi.org/10.1109/TITS.2022.3228042
Xu, J., Xiong, Z., Bhattacharyya, S. P.: PIDNet: A real-time semantic segmentation network inspired by PID controllers. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 19529–19539 (2023). https://doi.org/10.1109/CVPR52729.2023.01871
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision Transformer: A versatile backbone for dense prediction without convolutions. In: Eur. Conf. Comput. Vis., pp. 548–558 (2021). https://doi.org/10.1109/ICCV48922.2021.00061
Yu, Q., Xia, Y., Bai, Y., Lu, Y., Yuille, A. L., Shen, W.: Glance-and-gaze vision Transformer. In: Adv. Neural Inf. Process. Syst., pp. 12990–13003 (2021). https://doi.org/10.48550/arXiv.2106.02277
Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention Transformer. In: IEEE Conf. Comput. Vis. Pattern Recognit., pp. 6185–6194 (2023). https://doi.org/10.1109/CVPR52729.2023.00599
Funding
This work is supported by National Natural Science Foundation of China (No. 62371015), Beijing Natural Science Foundation (No. L211017).
Author information
Authors and Affiliations
Contributions
Wensheng Li was involved in investigation, methodology, data curation, writing—original draft, visualization and validation. Jing Zhang contributed to methodology, conceptualization, supervision and writing—review and editing. Jiafeng Li was involved in resources, funding acquisition and methodology. Li Zhuo contributed to project administration and supervision.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that there are no conflicts of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, W., Zhang, J., Li, J. et al. Unpaved road segmentation of UAV imagery via a global vision transformer with dilated cross window self-attention for dynamic map. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03416-0
Accepted:
Published:
DOI: https://doi.org/10.1007/s00371-024-03416-0