Abstract
Existing deep reinforcement learning-based mobile robot navigation relies largely on single-modal visual perception to perform local-scale navigation, while multimodal visual fusion for global navigation remains under technical exploration. Visual navigation requires agents to drive safely in structured, changing, and even unpredictable environments; otherwise, inappropriate operations may result in mission failure or even irreversible damage to life and property. To address these issues, we propose a recurrent deduction deep reinforcement learning model (RDDRL) for multimodal vision-robot navigation. We incorporate a recurrent reasoning mechanism (RRM) into the reinforcement learning model, which allows the agent to store memory, predict the future, and aid policy learning. Specifically, the RRM first stores current observations and states by learning a parameterized environment model and then predicts future transitions. It then performs a self-assessment of the predicted behavior and perceives the consequences of the current policy, producing a more reliable decision-making process. Furthermore, to obtain global-scale behavioral decisions, information from scene recognition, semantic segmentation, and pose estimation is fused and used as partial observations of the RDDRL. Extensive simulated experiments in CARLA scenarios, as well as tests in real-world scenarios, show that RDDRL outperforms state-of-the-art RL methods in terms of driving stability and safety. The results show that, by training the agent, the collision rate in the unmanned vehicle's global decision-making decreases from 0.2% in the training state to 0.0% in the test state.
Code Availability
Not applicable.
References
Zhu K, Zhang T (2021) Deep reinforcement learning based mobile robot navigation: A review. Tsinghua Sci Technol 674-691
Möller R, Furnari A, Battiato S, Härmä A, Farinella GM (2021) A survey on human-aware robot navigation. Robot Auton Syst 1-31
Iberraken D, Adouane L, Denis D (2018) Multi-level bayesian decision-making for safe and flexible autonomous navigation in highway environment. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) pp. 3984-3990
Guo J, Chen Y, Hao Y, Yin Z, Yu Y, Li S (2022) Towards Comprehensive Testing on the Robustness of Cooperative Multi-agent Reinforcement Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition pp. 115-122
Tu S, Waqas M, Rehman SU, Mir T, Abbas G, Abbas ZH, Ahmad I (2021) Reinforcement learning assisted impersonation attack detection in device-to-device communications. IEEE Trans Veh Technol 70(2):1474–1479
Halim Z, Sulaiman M, Waqas M, Aydın D (2022) Deep neural network-based identification of driving risk utilizing driver dependent vehicle driving features: A scheme for critical infrastructure protection. J Ambient Intell Humaniz Comput 1-19
Zhang W, Li X, Ma H, Luo Z, Li X (2021) Universal domain adaptation in fault diagnostics with hybrid weighted deep adversarial learning. IEEE Trans Industr Inf 17(12):7957–7967
Li X, Xu Y, Li N, Yang B, Lei Y (2022) Remaining useful life prediction with partial sensor malfunctions using deep adversarial networks. IEEE/CAA Journal of Automatica Sinica 1-14
Zhang HT, Hu BB, Xu Z, Cai Z, Liu B, Wang X, Zhao J (2021) Visual navigation and landing control of an unmanned aerial vehicle on a moving autonomous surface vehicle through adaptive learning. IEEE Trans Neural Netw Learn Syst 5345-5355
Lin B, Zhu Y, Long Y, Liang X, Ye Q, Lin L (2021) Retreat for advancing: Dynamic reinforced instruction attacker for robust visual navigation. IEEE Trans Pattern Anal Mach Intell 1-15
Mousavi HK, Motee N (2020) Estimation with fast feature selection in robot visual navigation. IEEE Robot Autom Lett 3572-3579
Gronauer S, Diepold K (2022) Multi-agent deep reinforcement learning: A survey. Artif Intell Rev 895-943
Lin E, Chen Q, Qi X (2020) Deep reinforcement learning for imbalanced classification. Appl Intell 2488-2502
Qin Y, Wang H, Yi S, Li X, Zhai L (2020) Virtual machine placement based on multi-objective reinforcement learning. Appl Intell 2370-2383
Wen S, Wen Z, Zhang D, Zhang H, Wang T (2021) A multi-robot path-planning algorithm for autonomous navigation using meta-reinforcement learning based on transfer learning. Appl Soft Comput 110:1–15
Lu Y, Chen Y, Zhao D, Li D (2021) MGRL: Graph neural network based inference in a Markov network with reinforcement learning for visual navigation. Neurocomputing 140-150
Zeng F, Wang C, Ge SS (2020) Tutor-guided interior navigation with deep reinforcement learning. IEEE Trans Cogn Develop Syst 1-11
Tolani V, Bansal S, Faust A, Tomlin C (2021) Visual navigation among humans with optimal control as a supervisor. IEEE Robot Autom Lett 2288-2295
Fang B, Mei G, Yuan X, Wang L, Wang Z, Wang J (2021) Visual SLAM for robot navigation in healthcare facility. Pattern Recogn 1-12
Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020) Neural topological slam for visual navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 12875-12884
Ok K, Liu K, Frey K, How JP, Roy N (2019) Robust object-based slam for high-speed autonomous navigation. In International Conference on Robotics and Automation (ICRA) pp 669-675
Dor M, Skinner KA, Driver T, Tsiotras P (2021) Visual SLAM for asteroid relative navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2066-2075
Karkus P, Cai S, Hsu D (2021) Differentiable slam-net: Learning particle slam for visual navigation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2815-2825
Nguyen A, Nguyen N, Tran K, Tjiputra E, Tran QD (2020) Autonomous navigation in complex environments with deep multimodal fusion network. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) pp. 5824-5830
Li J, Qin H, Wang J, Li J (2021) OpenStreetMap-based autonomous navigation for the four wheel-legged robot via 3D-Lidar and CCD camera. IEEE Trans Ind Electron 2708-2717
Unlu HU, Patel N, Krishnamurthy P, Khorrami F (2019) Sliding-window temporal attention based deep learning system for robust sensor modality fusion for ugv navigation. IEEE Robot Autom Lett 4216-4223
Lin Y, Gao F, Qin T, Gao W, Liu T, Wu W, Shen S (2018) Autonomous aerial navigation using monocular visual-inertial fusion. J Field Rob 23-51
Eckenhoff K, Geneva P, Huang G (2021) Mimc-vins: A versatile and resilient multi-imu multi-camera visual-inertial navigation system. IEEE Trans Robot 1360-1380
Seok H, Lim J (2020) ROVINS: Robust omnidirectional visual inertial navigation system. IEEE Robot Autom Lett 6225-6232
Zhu Y, Mottaghi R, Kolve E, Lim JJ, Gupta A, Fei-Fei L, Farhadi A (2017) Target-driven visual navigation in indoor scenes using deep reinforcement learning. In IEEE international conference on robotics and automation (ICRA) pp. 3357-3364
Fang Q, Xu X, Wang X, Zeng Y (2020) Target-driven visual navigation in indoor scenes using reinforcement learning and imitation learning. CAAI Transactions on Intelligence Technology pp. 167-176
Jin YL, Ji ZY, Zeng D, Zhang XP (2022) VWP: An Efficient DRL-based Autonomous Driving Model. IEEE Trans Multimedia 1-13
Huang Z, Wu J, Lv C (2022) Efficient deep reinforcement learning with imitative expert priors for autonomous driving. IEEE Trans Neural Netw Learn Syst 1-13
Wu Y, Liao S, Liu X, Li Z, Lu R (2021) Deep reinforcement learning on autonomous driving policy with auxiliary critic network. IEEE Trans Neural Netw Learn Syst 1-11
Liu X, Liu Y, Chen Y, Hanzo L (2020) Enhancing the fuel-economy of V2I-assisted autonomous driving: A reinforcement learning approach. IEEE Trans Veh Technol 8329-8342
Kastner L, Cox J, Buiyan T, Lambrecht J (2022) All-in-one: A DRL-based control switch combining state-of-the-art navigation planners. In International Conference on Robotics and Automation (ICRA) pp. 2861-2867
Morad SD, Mecca R, Poudel RP, Liwicki S, Cipolla R (2021) Embodied visual navigation with automatic curriculum learning in real environments. IEEE Robot Autom Lett 683-690
Seymour Z, Thopalli K, Mithun N, Chiu HP, Samarasekera S, Kumar R (2021) Maast: Map attention with semantic transformers for efficient visual navigation. In IEEE International Conference on Robotics and Automation (ICRA) pp. 13223-13230
Huang C, Zhang R, Ouyang M, Wei P, Lin J, Su J, Lin L (2021) Deductive reinforcement learning for visual autonomous urban driving navigation. IEEE Trans Neural Netw Learn Syst 5379-5391
Sun Y, Yuan B, Zhang Y, Zheng W, Xia Q, Tang B, Zhou X (2021) Research on Action Strategies and Simulations of DRL and MCTS-based Intelligent Round Game. Int J Control Autom Syst 2984-2998
Mo K, Tang W, Li J, Yuan X (2022) Attacking deep reinforcement learning with decoupled adversarial policy. IEEE Trans Dependable Secure Comput 1-12
Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley K (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning 1928-1937
Huang X, Deng H, Zhang W, Song R, Li Y (2021) Towards multi-modal perception-based navigation: A deep reinforcement learning method. IEEE Robot Autom Lett 6(3):4986–4993
Li Z, Zhou A, Wang M, Shen Y (2019) Deep fusion of multi-layers salient CNN features and similarity network for robust visual place recognition. In IEEE International Conference on Robotics and Biomimetics (ROBIO) pp. 22-29
Qin T, Li P, Shen S (2018) Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans Rob 34(4):1004–1020
Kendall A, Grimes M, Cipolla R (2015) Posenet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision pp. 2938-2946
Wang S, Clark R, Wen H, Trigoni N (2017) Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In IEEE international conference on robotics and automation (ICRA) pp. 2043-2050
Li Z, Zhou A, Pu J, Yu J (2021) Multi-modal neural feature fusion for automatic driving through perception-aware path planning. IEEE Access 9:142782–142794
Author information
Authors and Affiliations
Contributions
Zhenyu Li and Aiguo Zhou conceived and designed the approach. Zhenyu Li contributed in the construction of simulation and real experimental environment. The first draft of the manuscript was written by Zhenyu Li. Aiguo Zhou corrected the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Conflict of interest
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Appendix A. Place recognition model
In this appendix, we describe in detail how the environment perception model is built, including the place recognition model, the scene segmentation model, and the pose estimation model. To improve the overall operational efficiency of the proposed RDDRL model, we follow a lightweight design principle throughout; the place recognition, scene segmentation, and pose estimation models are therefore all lightweight. Fig. 17 depicts the place recognition model, which consists of five convolutional modules with a total of 13 convolutional layers.
Convolutional modules 1 to 5 use 64, 128, 256, 512, and 512 filters, respectively. To extract salient features, all pooling layers use max pooling with a \(2 \times 2\) kernel and a stride of 2. The detailed structure is given in Table 7.
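As a minimal sketch (not the authors' code), the resolution of the feature maps produced by this backbone can be traced through the five pooling stages: each \(2 \times 2\) max pooling with stride 2 halves the spatial size, while the filter counts follow Table 7. The input size of 224 is an illustrative assumption.

```python
# Illustrative sketch: feature-map shapes through the place recognition
# backbone of Table 7 -- five convolutional modules with 64, 128, 256,
# 512, and 512 filters, each ending in 2x2 max pooling with stride 2.

FILTERS = [64, 128, 256, 512, 512]

def feature_map_shapes(h, w):
    """Return (channels, height, width) after each pooling stage."""
    shapes = []
    for c in FILTERS:
        h, w = h // 2, w // 2  # 2x2 max pooling, stride 2, halves resolution
        shapes.append((c, h, w))
    return shapes

# For an assumed 224x224 input, the final feature map is 512 x 7 x 7.
print(feature_map_shapes(224, 224))
```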
To speed up scene recognition, the method of feature aggregation is used to aggregate salient features in a certain region into salient regions, which is expressed as follows:
where \(X_{i}\) is the two-dimensional tensor of activated responses of the \(i^{th}\) convolutional layer, and \(f_{\Omega , i}\) is the \(i^{th}\) valid spatial position representing a salient region in the image. These salient regions are then weighted to obtain the global scene representation:
For the query scene \(\textbf{I}^q\) and the reference scene \(\textbf{I}^r\), the global scene representations obtained after convolutional encoding are \(L^q(\textbf{I})\) and \(L^r(\textbf{I})\), respectively. The cosine distance between the two vectors is used to judge place similarity, as expressed in Eq. 1.
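The cosine-distance matching step can be sketched as follows. This is an illustrative implementation, not the authors' code, and the decision threshold of 0.8 is a hypothetical value for demonstration only.

```python
import numpy as np

def cosine_similarity(q, r):
    """Cosine similarity between a query descriptor q and a
    reference descriptor r (higher means more similar places)."""
    return float(np.dot(q, r) / (np.linalg.norm(q) * np.linalg.norm(r)))

def is_same_place(q, r, threshold=0.8):
    """Declare a place match when similarity exceeds a threshold
    (0.8 here is an illustrative assumption, not from the paper)."""
    return cosine_similarity(q, r) >= threshold
```

In practice, the query descriptor would be compared against every reference descriptor in the map, with the best-scoring reference taken as the recognized place.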
We carried out place recognition experiments to verify the effectiveness of the proposed model by comparing it with similar methods on four challenging datasets. The experimental results are shown in Fig. 18 and are described in detail in our published paper [44].
Appendix B. Scene segmentation model
We design a lightweight scene segmentation model that adopts a full-bottleneck structure, as shown in Fig. 19. To speed up computation, atrous convolutions are embedded in the bottleneck structure. The whole architecture is shown in Table 8.
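The benefit of atrous (dilated) convolution is that it enlarges the receptive field without adding parameters: a \(k \times k\) kernel with dilation \(d\) behaves like a kernel of effective size \(k + (k-1)(d-1)\). A small sketch of this standard relation (not tied to the paper's specific layer settings):

```python
def effective_kernel(k, d):
    """Effective kernel size of a k x k atrous convolution
    with dilation rate d: k + (k - 1) * (d - 1)."""
    return k + (k - 1) * (d - 1)

def conv_output_size(i, k, d=1, s=1, p=0):
    """Standard convolution output size for input size i, kernel k,
    dilation d, stride s, and padding p."""
    return (i + 2 * p - effective_kernel(k, d)) // s + 1
```

For example, a 3x3 kernel with dilation 2 covers a 5x5 region, so padding 2 preserves the input resolution, which is why atrous convolutions suit dense prediction tasks like segmentation.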
Let \(I\) denote the matrix representation of the image; feature encoding (the encoder) then yields:
where \(s\) is the RReLU activation function, \(b\) is the bias vector for the \(k^{th}\) feature map, and \(F_{m}^{w}\) denotes the convolutional filters. The decoder also outputs \(m\) intermediate features via the previous down-sampling \(T_m\). The convolutions in block 3 process the encoded features and output:
where \(F_{m}^{w^{\prime }}\) denotes the convolutional filters used in the current stage and \(b_{m}^{\prime }\) the corresponding bias vectors. The decoder then processes the output of the previous stage:
where \(F_{m}^{w^{\prime }}\) denotes the convolutional filters used in the decoding stage and \(b_{m}^{\prime \prime }\) the corresponding bias vectors. Scene segmentation assigns a label to each pixel according to the categories appearing in the scene. In our work, we use a Softmax layer as the classifier to detect all objects, which is expressed as follows:
where \(m\) is the number of categories.
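The per-pixel Softmax classification described above can be sketched as follows; this is an illustrative numpy version, not the model's actual implementation.

```python
import numpy as np

def pixelwise_softmax(logits):
    """logits: (m, H, W) array of per-class scores for each pixel.
    Returns per-pixel probabilities over the m categories."""
    z = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def predict_labels(logits):
    """Assign each pixel the category with the highest probability."""
    return pixelwise_softmax(logits).argmax(axis=0)
```

Each pixel's probabilities sum to 1, and the predicted label map has the same spatial resolution as the input feature map.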
We carried out semantic segmentation experiments to verify the effectiveness of the proposed model by comparing it with similar methods. To speed up model inference, we constructed a lightweight semantic segmentation model. The results show that the proposed model is second only to LEDNet in performance, while its inference speed is much higher than that of LEDNet.
Appendix C. Pose estimation model
To estimate the robot’s pose, we design a pose estimator based on multi-sensor fusion. It first uses a visual encoder and an IMU encoder to encode images and inertial measurements, respectively, and then uses an LSTM to fuse the data collected by the two sensors into a vector representation of the pose; a fully connected network then regresses the robot’s pose. The whole structure of the pose estimator is shown in Fig. 20.
We first employ a CNN with five convolutional layers and an LSTM to encode the image and IMU features, respectively, as expressed in Eqs. 5 and 6. We regard translation and rotation as motion on manifolds and decompose the camera’s motion into a rotation and a translation in vector space. The rotation and translation are then represented as a rotation matrix and a translation matrix through the special Euclidean group SE(3):
where \(R\) and \(T\) are the rotation and translation matrices of the camera’s motion in space, \(\mathbb {R}^{3}\) denotes the 3D vector space, \(a_{v}\) and \(a_{i}\) are the scene descriptor and IMU descriptor, and \(lstm\) is the fusion function. Every moment of the camera’s motion has a matching element of SE(3), and recording these over time forms a trajectory flow. The pose vector is fed into three fully connected layers for pose regression; they regress the rotation as a quaternion and map the fused pose vector into the robot’s pose vector, as expressed in Eq. B7. We carried out pose estimation experiments to verify the effectiveness of the proposed model by comparing it with similar methods on the KITTI dataset. The selected comparison methods include a handcrafted engineering method (VINS [45]) and deep learning methods (PoseNet [46] and DeepVO [47]). Comparing the average pose errors (APE) of these methods, Fig. 21 shows that our pose estimation method obtains a smaller APE value, indicating higher pose estimation accuracy. Figures C6 and C7 show the translation and pose-angle changes on the six KITTI sequences, respectively. The experimental curves are generally consistent with the ground truth, as described in detail in our published paper [48].
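The SE(3) representation used above can be illustrated with a small sketch (generic group operations, not the paper's estimator): a pose is the homogeneous matrix \([R \mid T;\ 0\ 1]\), and chaining relative poses by matrix multiplication produces the trajectory flow.

```python
import numpy as np

def se3_matrix(R, t):
    """Assemble an SE(3) pose from a 3x3 rotation matrix R
    and a translation vector t, as a 4x4 homogeneous matrix."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def compose(T_ab, T_bc):
    """Chain two relative poses: the pose of frame c in frame a."""
    return T_ab @ T_bc
```

Recording the composed pose at each time step yields the estimated trajectory that is compared against the KITTI ground truth.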
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Z., Zhou, A. RDDRL: a recurrent deduction deep reinforcement learning model for multimodal vision-robot navigation. Appl Intell 53, 23244–23270 (2023). https://doi.org/10.1007/s10489-023-04754-7