
Unpaved road segmentation of UAV imagery via a global vision transformer with dilated cross window self-attention for dynamic map

  • Research
  • Published in The Visual Computer

Abstract

Road segmentation is a fundamental task in building dynamic maps for unmanned aerial vehicle (UAV) path navigation. In unplanned, unknown, and even damaged areas, roads are usually unpaved, with blurred edges, deformations, and occlusions, which makes unpaved road segmentation a significant challenge for dynamic map construction. Our major contributions are as follows: (1) Inspired by dilated convolution, we propose dilated cross window self-attention (DCWin-Attention), composed of a dilated cross window mechanism and a pixel regional module, to model the long-range horizontal and vertical dependencies of unpaved roads with deformations and blurred edges. (2) A shifted cross window mechanism is coupled with DCWin-Attention to reduce the influence of occluded roads in UAV imagery; stacking DCWin-Attention blocks yields a global vision transformer (GVT) backbone that extracts multilevel deep features with global dependency. (3) The unpaved road is segmented from a confidence map generated by fusing the deep features of different levels in a unified perceptual parsing network. We verify our method on the self-established BJUT-URD dataset and the public DeepGlobe dataset, where it achieves the highest IoU of 67.72% and 52.67%, respectively, at practical inference speeds of 2.7 and 2.8 FPS, demonstrating its effectiveness and superiority in unpaved road segmentation. Our code is available at https://github.com/BJUT-AIVBD/GVT-URS.
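
For readers who want a concrete picture of the attention mechanism, the snippet below is a minimal PyTorch sketch of dilated cross window self-attention, not the authors' released implementation (see the repository linked above for the official code). It reflects one plausible reading of the description: half of the channels attend along rows whose pixels are sampled at a dilation rate (long-range horizontal context), the other half along dilated columns (vertical context). The class name DCWinAttention, the channel split, and the grouping scheme are illustrative assumptions.

    import torch
    import torch.nn as nn


    class DCWinAttention(nn.Module):
        """Illustrative sketch: half of the channels attend along dilated
        rows, the other half along dilated columns; the two branches are
        concatenated and linearly projected."""

        def __init__(self, dim: int, num_heads: int = 4, dilation: int = 2):
            super().__init__()
            assert dim % 2 == 0 and (dim // 2) % num_heads == 0
            self.num_heads, self.dilation = num_heads, dilation
            self.qkv = nn.Linear(dim, 3 * dim, bias=False)
            self.proj = nn.Linear(dim, dim)

        def _axis_attn(self, q, k, v, B, H, W):
            # Attention along the width axis; every `dilation`-th pixel of
            # a row is gathered into one attention window.
            r, h = self.dilation, self.num_heads
            c = q.shape[-1]

            def group(t):  # (B, H, W, c) -> (B*H*r, heads, W//r, c//heads)
                t = t.view(B, H, W // r, r, c).permute(0, 1, 3, 2, 4)
                return t.reshape(B * H * r, W // r, h, c // h).transpose(1, 2)

            q, k, v = group(q), group(k), group(v)
            attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
            out = (attn.softmax(dim=-1) @ v).transpose(1, 2)  # undo head split
            out = out.reshape(B, H, r, W // r, c).permute(0, 1, 3, 2, 4)
            return out.reshape(B, H, W, c)

        def forward(self, x):
            # x: (B, H, W, C); H and W must be divisible by `dilation`.
            B, H, W, C = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            half = C // 2
            # Horizontal branch on the first half of the channels.
            out_h = self._axis_attn(q[..., :half], k[..., :half],
                                    v[..., :half], B, H, W)
            # Vertical branch: transpose H/W, reuse the row logic, undo it.
            tq, tk, tv = (t[..., half:].transpose(1, 2) for t in (q, k, v))
            out_v = self._axis_attn(tq, tk, tv, B, W, H).transpose(1, 2)
            return self.proj(torch.cat([out_h, out_v], dim=-1))


    if __name__ == "__main__":
        x = torch.randn(1, 16, 16, 64)          # toy (B, H, W, C) feature map
        print(DCWinAttention(dim=64)(x).shape)  # torch.Size([1, 16, 16, 64])

With dilation 1 each window is a full row or column; larger rates keep the full-image span of every window while reducing the attention cost, which is one way to realize the long-range horizontal and vertical road dependencies described above.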

Data availability

The data that support the findings of this study are openly available in public repositories: the DeepGlobe dataset at https://www.kaggle.com/datasets/balraj98/deepglobe-road-extraction-dataset and the self-built BJUT-URD dataset at https://github.com/BJUT-AIVBD/GVT-URS/tree/main/unpaved-road-dataset.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62371015) and the Beijing Natural Science Foundation (No. L211017).

Author information

Authors and Affiliations

Authors

Contributions

Wensheng Li was involved in investigation, methodology, data curation, writing—original draft, visualization and validation. Jing Zhang contributed to methodology, conceptualization, supervision and writing—review and editing. Jiafeng Li was involved in resources, funding acquisition and methodology. Li Zhuo contributed to project administration and supervision.

Corresponding author

Correspondence to Jing Zhang.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, W., Zhang, J., Li, J. et al. Unpaved road segmentation of UAV imagery via a global vision transformer with dilated cross window self-attention for dynamic map. Vis Comput (2024). https://doi.org/10.1007/s00371-024-03416-0

Keywords

Navigation