
Graph attention network-optimized dynamic monocular visual odometry


Abstract

Monocular Visual Odometry (VO) is often formulated as a sequential dynamics problem that relies on the scene rigidity assumption. A key challenge is rejecting moving objects and estimating the camera pose in dynamic environments. Existing methods either weight visual cues across the whole image equally or eliminate fixed semantic categories using heuristics or attention mechanisms; however, they fail to handle unknown dynamic objects that are not labeled in the network's training sets. To address these issues, this paper proposes a novel framework, graph attention network (GAT)-optimized dynamic monocular visual odometry (GDM-VO), which removes dynamic objects explicitly using semantic segmentation and multi-view geometry. First, we employ a multi-task learning network to perform semantic segmentation and depth estimation. We then reject a priori known and unknown moving objects through semantic information and multi-view geometry, respectively. Furthermore, to the best of our knowledge, we are the first to leverage a GAT to capture long-range temporal dependencies from consecutive image sequences adaptively, whereas existing sequential modeling approaches must select information manually. Extensive experiments on the KITTI and TUM datasets demonstrate the superior performance of GDM-VO over existing state-of-the-art classical and learning-based monocular VO methods.
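The abstract's central claim, adaptive selection of temporal information, can be made concrete with a short sketch. Below is a minimal single-head GAT layer over a sliding window of per-frame features, in the spirit of the temporal module described above. The dimensions, names, fully connected temporal graph, and plain-PyTorch formulation are all illustrative assumptions, not the authors' released GDM-VO implementation.

```python
# Minimal single-head graph attention layer over a window of per-frame
# features (a sketch under assumed dimensions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalGATLayer(nn.Module):
    """Single-head graph attention layer (after Velickovic et al., 2017)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared node projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention scorer

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim) node features, one node per frame in the window;
        # here every frame attends to every other frame (fully connected graph).
        z = self.W(h)                                    # (N, out_dim)
        n = z.size(0)
        zi = z.unsqueeze(1).expand(n, n, -1)             # query frame, repeated
        zj = z.unsqueeze(0).expand(n, n, -1)             # candidate neighbors
        e = F.leaky_relu(self.a(torch.cat([zi, zj], dim=-1)).squeeze(-1))
        alpha = torch.softmax(e, dim=-1)                 # learned per-frame weights
        return F.elu(alpha @ z)                          # attention-weighted sum


# Usage: refine 8 consecutive frame embeddings; a pose head could then
# regress 6-DoF relative camera poses from the refined features.
frames = torch.randn(8, 512)
refined = TemporalGATLayer(512, 256)(frames)
print(refined.shape)  # torch.Size([8, 256])
```

Because the attention weights are computed from the data itself, the network decides per frame which past observations matter, in contrast to recurrent memories whose retention is governed by manually designed gating or selection rules.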




Acknowledgements

This work was supported in part by the National Key R&D Program of China under Grant 2018YFE0205503, and in part by the Funds for International Cooperation and Exchange of the NSFC under Grant 61720106007.

Author information

Corresponding author

Correspondence to Qiao Xiuquan.

Ethics declarations

Conflicts of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, H., Qiao, X. Graph attention network-optimized dynamic monocular visual odometry. Appl Intell 53, 23067–23082 (2023). https://doi.org/10.1007/s10489-023-04687-1

