RDDRL: a recurrent deduction deep reinforcement learning model for multimodal vision-robot navigation

Abstract

Existing deep reinforcement learning-based mobile robot navigation relies largely on single-modal visual perception to perform local-scale navigation, while multimodal visual fusion-based global navigation is still under technical exploration. Visual navigation requires agents to drive safely in structured, changing, and even unpredictable environments; otherwise, inappropriate operations may result in mission failure and even irreversible damage to life and property. To address these issues, we propose a recurrent deduction deep reinforcement learning model (RDDRL) for multimodal vision-robot navigation. We incorporate a recurrent reasoning mechanism (RRM) into the reinforcement learning model, which allows the agent to store memory, predict the future, and aid policy learning. Specifically, the RRM first stores current observations and states by learning a parameterized environment model and then predicts future transitions. It then performs a self-assessment of the predicted behavior and perceives the consequences of the current policy, producing a more reliable decision-making process. Furthermore, to obtain global-scale behavioral decision-making, information from scene recognition, semantic segmentation, and pose estimation is fused and used as partial observations of the RDDRL. Extensive simulated experiments on CARLA scenarios, together with tests in real-world scenarios, show that RDDRL outperforms state-of-the-art RL methods in terms of driving stability and safety. The results show that, by training the agent, the collision rate in the global decision-making of the unmanned vehicle decreases from 0.2% during training to 0.0% during testing.

Code Availability

Not applicable.

References

  1. Zhu K, Zhang T (2021) Deep reinforcement learning based mobile robot navigation: A review. Tsinghua Sci Technol 674-691

  2. Möller R, Furnari A, Battiato S, Härmä A, Farinella GM (2021) A survey on human-aware robot navigation. Robot Auton Syst 1-31

  3. Iberraken D, Adouane L, Denis D (2018) Multi-level bayesian decision-making for safe and flexible autonomous navigation in highway environment. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) pp. 3984-3990

  4. Guo J, Chen Y, Hao Y, Yin Z, Yu Y, Li S (2022) Towards Comprehensive Testing on the Robustness of Cooperative Multi-agent Reinforcement Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 115-122

  5. Tu S, Waqas M, Rehman SU, Mir T, Abbas G, Abbas ZH, Ahmad I (2021) Reinforcement learning assisted impersonation attack detection in device-to-device communications. IEEE Trans Veh Technol 70(2):1474–1479

  6. Halim Z, Sulaiman M, Waqas M, Aydın D (2022) Deep neural network-based identification of driving risk utilizing driver dependent vehicle driving features: A scheme for critical infrastructure protection. J Ambient Intell Humaniz Comput 1-19

  7. Zhang W, Li X, Ma H, Luo Z, Li X (2021) Universal domain adaptation in fault diagnostics with hybrid weighted deep adversarial learning. IEEE Trans Industr Inf 17(12):7957–7967

  8. Li X, Xu Y, Li N, Yang B, Lei Y (2022) Remaining useful life prediction with partial sensor malfunctions using deep adversarial networks. IEEE/CAA Journal of Automatica Sinica 1-14

  9. Zhang HT, Hu BB, Xu Z, Cai Z, Liu B, Wang X, Zhao J (2021) Visual navigation and landing control of an unmanned aerial vehicle on a moving autonomous surface vehicle through adaptive learning. IEEE Trans Neural Netw Learn Syst 5345-5355

  10. Lin B, Zhu Y, Long Y, Liang X, Ye Q, Lin L (2021) Retreat for advancing: Dynamic reinforced instruction attacker for robust visual navigation. IEEE Trans Pattern Anal Mach Intell 1-15

  11. Mousavi HK, Motee N (2020) Estimation with fast feature selection in robot visual navigation. IEEE Robot Autom Lett 3572-3579

  12. Gronauer S, Diepold K (2022) Multi-agent deep reinforcement learning: A survey. Artif Intell Rev 895-943

  13. Lin E, Chen Q, Qi X (2020) Deep reinforcement learning for imbalanced classification. Appl Intell 2488-2502

  14. Qin Y, Wang H, Yi S, Li X, Zhai L (2020) Virtual machine placement based on multi-objective reinforcement learning. Appl Intell 2370-2383

  15. Wen S, Wen Z, Zhang D, Zhang H, Wang T (2021) A multi-robot path-planning algorithm for autonomous navigation using meta-reinforcement learning based on transfer learning. Appl Soft Comput 110:1–15

  16. Lu Y, Chen Y, Zhao D, Li D (2021) MGRL: Graph neural network based inference in a Markov network with reinforcement learning for visual navigation. Neurocomputing 140-150

  17. Zeng F, Wang C, Ge SS (2020) Tutor-guided interior navigation with deep reinforcement learning. IEEE Trans Cogn Develop Syst 1-11

  18. Tolani V, Bansal S, Faust A, Tomlin C (2021) Visual navigation among humans with optimal control as a supervisor. IEEE Robot Autom Lett 2288-2295

  19. Fang B, Mei G, Yuan X, Wang L, Wang Z, Wang J (2021) Visual SLAM for robot navigation in healthcare facility. Pattern Recogn 1-12

  20. Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020) Neural topological slam for visual navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 12875-12884

  21. Ok K, Liu K, Frey K, How JP, Roy N (2019) Robust object-based slam for high-speed autonomous navigation. In International Conference on Robotics and Automation (ICRA) pp 669-675

  22. Dor M, Skinner KA, Driver T, Tsiotras P (2021) Visual SLAM for asteroid relative navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2066-2075

  23. Karkus P, Cai S, Hsu D (2021) Differentiable slam-net: Learning particle slam for visual navigation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Nashville, TN, USA, pp 2815–2825

  24. Nguyen A, Nguyen N, Tran K, Tjiputra E, Tran QD (2020) Autonomous navigation in complex environments with deep multimodal fusion network. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) pp. 5824-5830

  25. Li J, Qin H, Wang J, Li J (2021) OpenStreetMap-based autonomous navigation for the four wheel-legged robot via 3D-Lidar and CCD camera. IEEE Trans Ind Electron 2708-2717

  26. Unlu HU, Patel N, Krishnamurthy P, Khorrami F (2019) Sliding-window temporal attention based deep learning system for robust sensor modality fusion for ugv navigation. IEEE Robot Autom Lett 4216-4223

  27. Lin Y, Gao F, Qin T, Gao W, Liu T, Wu W, Shen S (2018) Autonomous aerial navigation using monocular visual-inertial fusion. J Field Rob 23-51

  28. Eckenhoff K, Geneva P, Huang G (2021) Mimc-vins: A versatile and resilient multi-imu multi-camera visual-inertial navigation system. IEEE Trans Robot 1360-1380

  29. Seok H, Lim J (2020) ROVINS: Robust omnidirectional visual inertial navigation system. IEEE Robot Autom Lett 6225-6232

  30. Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In IEEE international conference on robotics and automation (ICRA) pp. 3357-3364

  31. Fang Q, Xu X, Wang X, Zeng Y (2020) Target-driven visual navigation in indoor scenes using reinforcement learning and imitation learning. CAAI Transactions on Intelligence Technology pp. 167-176

  32. Jin YL, Ji ZY, Zeng D, Zhang XP (2022) VWP: An Efficient DRL-based Autonomous Driving Model. IEEE Trans Multimedia 1-13

  33. Huang Z, Wu J, Lv C (2022) Efficient deep reinforcement learning with imitative expert priors for autonomous driving. IEEE Trans Neural Netw Learn Syst 1-13

  34. Wu Y, Liao S, Liu X, Li Z, Lu R (2021) Deep reinforcement learning on autonomous driving policy with auxiliary critic network. IEEE Trans Neural Netw Learn Syst 1-11

  35. Liu X, Liu Y, Chen Y, Hanzo L (2020) Enhancing the fuel-economy of V2I-assisted autonomous driving: A reinforcement learning approach. IEEE Trans Veh Technol 8329-8342

  36. Kastner L, Cox J, Buiyan T, Lambrecht J (2022) All-in-one: A DRL-based control switch combining state-of-the-art navigation planners. In International Conference on Robotics and Automation (ICRA) pp. 2861-2867

  37. Morad SD, Mecca R, Poudel RP, Liwicki S, Cipolla R (2021) Embodied visual navigation with automatic curriculum learning in real environments. IEEE Robot Autom Lett 683-690

  38. Seymour Z, Thopalli K, Mithun N, Chiu HP, Samarasekera S, Kumar R (2021) Maast: Map attention with semantic transformers for efficient visual navigation. In IEEE International Conference on Robotics and Automation (ICRA) pp. 13223-13230

  39. Huang C, Zhang R, Ouyang M, Wei P, Lin J, Su J, Lin L (2021) Deductive reinforcement learning for visual autonomous urban driving navigation. IEEE Trans Neural Netw Learn Syst 5379-5391

  40. Sun Y, Yuan B, Zhang Y, Zheng W, Xia Q, Tang B, Zhou X (2021) Research on Action Strategies and Simulations of DRL and MCTS-based Intelligent Round Game. Int J Control Autom Syst 2984-2998

  41. Mo K, Tang W, Li J, Yuan X (2022) Attacking deep reinforcement learning with decoupled adversarial policy. IEEE Trans Dependable Secure Comput 1-12

  42. Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning pp. 1928-1937

  43. Huang X, Deng H, Zhang W, Song R, Li Y (2021) Towards multi-modal perception-based navigation: A deep reinforcement learning method. IEEE Robot Autom Lett 6(3):4986–4993

  44. Li Z, Zhou A, Wang M, Shen Y (2019) Deep fusion of multi-layers salient CNN features and similarity network for robust visual place recognition. In IEEE International Conference on Robotics and Biomimetics (ROBIO) pp. 22-29

  45. Qin T, Li P, Shen S (2018) Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans Rob 34(4):1004–1020

  46. Kendall A, Grimes M, Cipolla R (2015) Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision pp. 2938-2946

  47. Wang S, Clark R, Wen H, Trigoni N (2017) Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In IEEE international conference on robotics and automation (ICRA) pp. 2043-2050

  48. Li Z, Zhou A, Pu J, Yu J (2021) Multi-modal neural feature fusion for automatic driving through perception-aware path planning. IEEE Access 9:142782–142794

Author information

Contributions

Zhenyu Li and Aiguo Zhou conceived and designed the approach. Zhenyu Li contributed to the construction of the simulation and real-world experimental environments and wrote the first draft of the manuscript. Aiguo Zhou revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Zhenyu Li.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Place recognition model

In the appendices, we describe in detail how the environment perception model is established; it comprises the place recognition model, the scene segmentation model, and the pose estimation model. To improve the overall operational efficiency of the proposed RDDRL model, we follow a lightweight design principle for the environment model, so the place recognition, scene segmentation, and pose estimation models are all lightweight. Fig. 17 depicts the place recognition model, which consists of five convolutional modules with a total of 13 convolutional layers.

Fig. 17 Place recognition model

Convolutional modules 1 to 5 use 64, 128, 256, 512, and 512 filters, respectively. To extract salient features, all pooling layers use max pooling with a kernel size of \(2 \times 2\) and a stride of 2. The detailed structure is described in Table 7.

Table 7 The structure of place recognition model
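
The five-module, 13-layer layout summarized above and in Table 7 can be realized, for example, with a VGG-style stack of 3×3 convolutions. The PyTorch sketch below is illustrative only: the per-module layer split (2, 2, 3, 3, 3), the 3×3 kernel size, and the ReLU activation are assumptions, not the exact configuration of Table 7.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, n_layers):
    """One convolutional module: n_layers 3x3 convs followed by 2x2 max pooling (stride 2)."""
    layers = []
    for i in range(n_layers):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

class PlaceRecognitionBackbone(nn.Module):
    """Five convolutional modules (13 conv layers in total) with
    64, 128, 256, 512, and 512 filters, as described in Table 7."""
    def __init__(self):
        super().__init__()
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]
        self.features = nn.Sequential(*[conv_block(i, o, n) for i, o, n in cfg])

    def forward(self, x):
        # Returns the activation maps X_i that are aggregated in Eq. (A1)
        return self.features(x)
```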

To speed up scene recognition, feature aggregation is used to pool the salient features within a region into a salient-region descriptor, which is expressed as follows:

$$\begin{aligned} \left\{ \begin{array}{l} f_{\Omega }=\left[ f_{\Omega , 1}, f_{\Omega , 2}, \cdots , f_{\Omega , K}\right] ^{T} \\ f_{\Omega , i}=\sum _{p \in \Omega } X_{i}(p) \end{array}\right. \end{aligned}$$
(A1)

where \(X_{i}\) is the two-dimensional tensor of activated responses of the \(i^{th}\) convolutional feature map, and \(f_{\Omega , i}\) is its aggregation over the valid spatial positions \(p \in \Omega\), representing a salient region in the image. These salient regions are then weighted to obtain the global scene representation:

$$\begin{aligned} L(\textbf{I})=\sum _{i=0}^{N} f_{\Omega , i} \end{aligned}$$
(A2)

For a query scene \(\textbf{I}^q\) and a reference scene \(\textbf{I}^r\), the global scene representations obtained after convolutional encoding are \(L^q(\textbf{I})\) and \(L^r(\textbf{I})\), respectively. The cosine distance between the two vectors measures place similarity, as expressed in Eq. 1.
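
As a concrete illustration of Eqs. (A1) and (A2), the short sketch below (PyTorch, hypothetical helper names) sums each activation map over its spatial positions to build the descriptor and then scores place similarity with the cosine distance. Using the stacked vector \(f_{\Omega}\) directly as the compared descriptor is a simplifying assumption.

```python
import torch
import torch.nn.functional as F

def aggregate_descriptor(feature_maps: torch.Tensor) -> torch.Tensor:
    """Eq. (A1): sum each activation map X_i over its spatial positions p
    to obtain f_{Omega,i}; the stacked result is the K-dimensional descriptor.
    feature_maps: tensor of shape (K, H, W)."""
    return feature_maps.sum(dim=(1, 2))          # shape (K,)

def place_similarity(query_maps: torch.Tensor, reference_maps: torch.Tensor) -> float:
    """Cosine similarity between the representations of I^q and I^r."""
    q = aggregate_descriptor(query_maps)
    r = aggregate_descriptor(reference_maps)
    return F.cosine_similarity(q.unsqueeze(0), r.unsqueeze(0)).item()
```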

We carried out place recognition experiments to verify the effectiveness of the proposed model by comparing it with similar methods on four challenging datasets. The experimental results are shown in Fig. 18 and are described in detail in our published work [44].

Fig. 18 Experimental results for place recognition on four challenging datasets

Appendix B. Scene segmentation model

We design a lightweight scene segmentation model that adopts a full-bottleneck structure, as shown in Fig. 19. To speed up inference, atrous convolutions are embedded in the bottleneck structure. The whole architecture is shown in Table 8.
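
The sketch below shows one possible form of such a bottleneck block with an embedded atrous (dilated) convolution, written in PyTorch. The channel-reduction ratio, dilation rate, and residual connection are illustrative assumptions rather than the exact block of Table 8.

```python
import torch.nn as nn

class AtrousBottleneck(nn.Module):
    """Bottleneck block: 1x1 reduce -> 3x3 atrous conv -> 1x1 expand,
    with a residual connection. Uses RReLU, as mentioned in the text."""
    def __init__(self, channels, dilation=2, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1, bias=False),
            nn.BatchNorm2d(mid), nn.RReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=dilation,
                      dilation=dilation, bias=False),        # atrous convolution
            nn.BatchNorm2d(mid), nn.RReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.RReLU(inplace=True)

    def forward(self, x):
        # Residual connection keeps the block lightweight and easy to train
        return self.act(x + self.block(x))
```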

We use \(I\) to denote the matrix representation of the image; after feature encoding (the encoder), we obtain:

$$\begin{aligned} T_m=s(I*F_{m}^{w}+b_m) \end{aligned}$$
(B3)

where \(s\) is the RReLU activation function, \(b_m\) is the bias vector of the \(m^{th}\) feature map, and \(F_{m}^{w}\) denotes the convolutional filters. The decoder likewise outputs \(m\) intermediate features from the down-sampled \(T_m\). The convolutions from block 3 process the encoded features and output:

$$\begin{aligned} G_{m}=s\left( T_{m} * F_{m}^{w^{\prime }}+b_{m}^{\prime }\right) \end{aligned}$$
(B4)

where \(F_{m}^{w^{\prime }}\) denotes the convolutional filters used in the current stage and \(b_{m}^{\prime }\) the corresponding bias vectors. The decoder then processes the output of the previous stage:

$$\begin{aligned} D_{m}=s\left( G_{m} * F_{m}^{w^{\prime }}+b_{m}^{\prime \prime }\right) \end{aligned}$$
(B5)

where \(F_{m}^{w^{\prime }}\) denotes the convolutional filters used in the decoding stage and \(b_{m}^{\prime \prime }\) the corresponding bias vectors. Scene segmentation aims to produce a label for each pixel according to the categories appearing in a scene. In our work, we use a Softmax layer as the object classifier to detect all objects, which is expressed as follows:

$$\begin{aligned} B_{m}^{*}={\text {softmax}}\left\{ D_{1}, D_{2}, \ldots , D_{m}\right\} \end{aligned}$$
(B6)

where \(m\) is the number of categories.
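
A minimal sketch of the per-pixel classification step in Eq. (B6), assuming the decoder outputs one map \(D_k\) per category (function name is hypothetical):

```python
import torch

def pixelwise_labels(decoder_maps: torch.Tensor) -> torch.Tensor:
    """Eq. (B6): decoder_maps has shape (m, H, W), one map D_k per category.
    Softmax over the category dimension gives per-pixel class probabilities;
    argmax yields the segmentation label map B*."""
    probs = torch.softmax(decoder_maps, dim=0)
    return probs.argmax(dim=0)      # (H, W) label map
```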

Fig. 19 Scene segmentation model

Table 8 The structure of scene segmentation model

We carried out semantic segmentation experiments to verify the effectiveness of the proposed model by comparing it with similar methods. To speed up model reasoning, we construct a lightweight semantic segmentation model. The results in Table 9 show that the proposed model is second only to LedNet in segmentation performance, while its inference speed is much higher than that of LedNet.

Table 9 Experimental results for semantic segmentation on Cityscapes dataset

Appendix C. Pose estimation model

To estimate the robot’s pose, we design a pose estimator based on multi-sensor fusion. It first uses a visual encoder and an IMU encoder to encode images and inertial measurements, respectively, and then uses an LSTM to fuse the two encoded streams into a vector representation of the pose. Fully connected layers are then used to regress the robot’s pose from this fused vector. The whole structure of the pose estimator is shown in Fig. 20.
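
The sketch below outlines this fusion pipeline in PyTorch. The layer widths, the 6-channel IMU input, and the single-step fusion LSTM are illustrative assumptions; only the overall structure (CNN visual encoder, IMU encoder, LSTM fusion, fully connected pose regression) follows the description above and Fig. 20.

```python
import torch
import torch.nn as nn

class PoseEstimator(nn.Module):
    """Multi-sensor pose estimator sketch: CNN visual encoder + IMU LSTM encoder,
    LSTM fusion, and fully connected regression of translation (3-D) and
    rotation as a quaternion (4-D). Hidden sizes are illustrative."""
    def __init__(self, feat_dim=256, hidden=128):
        super().__init__()
        self.visual_encoder = nn.Sequential(      # five conv layers, as in the text
            *[nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.ReLU())
              for c_in, c_out in [(3, 16), (16, 32), (32, 64), (64, 128), (128, feat_dim)]],
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.imu_encoder = nn.LSTM(input_size=6, hidden_size=feat_dim, batch_first=True)
        self.fusion = nn.LSTM(input_size=2 * feat_dim, hidden_size=hidden, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 7))

    def forward(self, image, imu_seq):
        a_v = self.visual_encoder(image)           # scene descriptor a_v, shape (B, feat_dim)
        _, (h_i, _) = self.imu_encoder(imu_seq)    # IMU descriptor a_i from last hidden state
        fused = torch.cat([a_v, h_i[-1]], dim=1).unsqueeze(1)
        out, _ = self.fusion(fused)
        pose = self.fc(out[:, -1])                 # [t_x, t_y, t_z, q_w, q_x, q_y, q_z]
        return pose[:, :3], pose[:, 3:]            # translation, rotation quaternion
```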

Fig. 20 Pose estimation model

We first employ a CNN with five convolutional layers and an LSTM to encode the image and IMU features, respectively, as expressed in Eqs. 5 and 6. We treat translation and rotation as motion on manifolds and decompose the camera’s motion into a rotation and a translation in the vector space. The rotation and translation are then expressed as a rotation matrix and a translation vector through the special Euclidean group SE(3):

$$\begin{aligned} SE(3)=\left\{ T=\left[ \begin{array}{cc} R\left( a_{v}, a_{i}\right) &{} t\left( a_{v}, a_{i}\right) \\ 0^{T} &{} 1 \end{array}\right] \in \mathbb {R}^{4 \times 4}\right\} \end{aligned}$$
(C7)

where \(R\) and \(t\) are the rotation matrix and translation vector (\(t \in \mathbb {R}^{3}\)) of the camera in space motion, \(a_{v}\) and \(a_{i}\) are the scene descriptor and the IMU descriptor, and the LSTM serves as the fusion function. Every moment of the camera’s motion has a corresponding element of SE(3), and recording these over time forms a trajectory flow. The fused pose vector is fed into three fully connected layers for pose regression, which regress the translation and the rotation (as a quaternion) and map the fused pose vector into the robot’s pose, as expressed in Eq. C7.

We carried out pose estimation experiments to verify the effectiveness of the proposed model by comparing it with similar methods on the KITTI dataset. The selected comparison methods include a handcrafted engineering method (VINS [45]) and deep learning methods (PoseNet [46] and DeepVO [47]). Comparing the average pose errors (APE) of these methods, Fig. 21 shows that the proposed pose estimation method obtains a smaller APE, indicating higher pose estimation accuracy. Figures 22 and 23 show the translation and pose-angle changes on the six KITTI sequences, respectively. The experimental curves are generally consistent with the ground truth, as described in detail in our published work [48].
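
For completeness, the following sketch shows how a regressed quaternion and translation can be assembled into the SE(3) transform of Eq. C7. The function name and the (w, x, y, z) quaternion convention are assumptions; this is a post-hoc construction rather than a differentiable layer of the model.

```python
import torch

def se3_from_quaternion(q: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Build T = [[R, t], [0^T, 1]] in SE(3) from a quaternion q = (w, x, y, z)
    and a translation t, e.g. the outputs of the fully connected regression head.
    Note: .tolist() detaches from the computation graph."""
    w, x, y, z = (q / q.norm()).tolist()          # normalize, then unpack as floats
    R = torch.tensor([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    T = torch.eye(4)
    T[:3, :3] = R                                  # rotation block
    T[:3, 3] = t                                   # translation column
    return T
```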

Fig. 21 Experimental results for pose estimation on the KITTI dataset

Fig. 22 Translation changes on the KITTI dataset

Fig. 23 Pose angle changes on the KITTI dataset

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Li, Z., Zhou, A. RDDRL: a recurrent deduction deep reinforcement learning model for multimodal vision-robot navigation. Appl Intell 53, 23244–23270 (2023). https://doi.org/10.1007/s10489-023-04754-7

Keywords

Navigation