Safe-Nav: learning to prevent PointGoal navigation failure in unknown environments

Training robots to safely navigate (Safe-Nav) in uncertain complex environments using the RGB-D sensor is quite challenging as it involves the performance of different tasks such as obstacle avoidance, optimal path planning, and control. Traditional navigation approaches cannot generate suitable paths which guarantee enough visible features. Recent learning-based methods are still not mature enough due to their proneness to collisions and prohibitive computational cost. This paper focuses on generating safe trajectories to the desired goal while avoiding collisions and tracking failure in unknown complex environments. We present Safe-Nav, a hierarchical framework composed of the visual simultaneous localization and mapping (SLAM) module, the global planner module and the local planner module. The visual SLAM module generates the navigation map and the robot pose. The global planner module plans a local waypoint on the real-time navigation map. In the local planner module, a deep-reinforcement-learning-based (DRL-based) policy is presented for taking safe actions towards local waypoints. Our DRL-based policy can learn different navigation skills (e.g., avoiding collisions and avoiding tracking failure) through specialized modes without any supervisory signals when the PointGoal-navigation-specied reward is provided. We have demonstrated the performance of our proposed Safe-Nav in the Habitat simulation environment. Our approach outperforms the recent learning-based method and conventional navigation approach with relative improvements of over 205% (0.55 vs. 0.18) and 139% (0.55 vs. 0.23) in the success rate, respectively.


Introduction
In the research of robot autonomous navigation, visual navigation has gradually become a primary and promising research direction with the rapid development of robotic vision [1,2]. Visual navigation tasks can be divided into many forms. For example, PointGoal navigation defines a task where an intelligent robot equipped with a visual/RGB-D sensor should navigate to the target location, provided by the relative position to the robot without prior maps [3,4]. In the PointGoal navigation task, the robot must learn a good environment representation and be robust to state-estimation errors in unknown environments. Therefore, PointGoal navigation can also reflect the automation level of the robotic system. The traditional navigation pipeline uses a modular and hierarchical fashion to decompose the navigation task into a series of subtasks, such as localization, mapping and planning [5]. The robot relies on the visual/RGB-D sensor to map and localize within the environment and then plan a suitable route from the starting position to the desired goal [6][7][8]. Localization and mapping using visual simultaneous localization and mapping (SLAM) methods are fundamental for correct visual navigation. Recently, visual SLAM systems have been widely studied in the literature [9][10][11]. However, visual SLAM systems still suffer from robustness issues. For example, feature-based visual SLAM systems often suffer from tracking failure in low-texture scenes due to the insufficient visible features, which can result in inaccurate pose estimation. Path planning is also a key component for visual navigation, which is often composed of global planners [12,13] and local planners [14,15]. The global planners obtain the optimal path from the starting location to the target location based on the environmental map and robot pose. Dijkstra's algorithm and A* algorithm are two widely used global planner algorithms. The former searches the map in a full range to obtain the optimal path, but a large amount of calculation is required, and the latter adds a heuristic term to accelerate the searching process of the algorithm [16]. With the increasing complexity and uncertainty of unknown environments, the efficiency of global path planning decreases significantly. Local planner computes the path through the local information, which has real-time planning ability. Dynamic Window Algorithm (DWA) [14] is a commonly used local planner algorithm, which only considers the environmental information within a speed window. From the sample range of speed, the trajectory with the highest score is selected as the output. Timed Elastic Band (TEB) [17] is also a widely used local path planning algorithm, which optimizes the constraints in the framework of General Graph Optimization (g2o) [15]. However, these methods require tuning numerous parameters manually. The second disadvantage of local planner algorithms is that they easily lead to tracking failure when combined with visual SLAM. The localization quality of visual SLAM is affected by the number of extracted feature points. Tracking failure easily occurs when there are not enough feature points observed in the current image, resulting in inaccurate pose estimation.
Currently, deep reinforcement learning (DRL) has shown tremendous potential in solving complex decision-making problems. DRL has spurred a lot of significant breakthroughs in many applications, such as manipulator controlling [18], autonomous driving [19,20], and games [21,22]. Considering the strengths of DRL, researchers attempt to apply it to tackle the PointGoal navigation problem [4,[23][24][25][26]. Compared with the previous traditional methods, the DRL-based methods avoid extensive hand-engineering but only need to learn the complete navigation system directly from the data. Potentially, these methods can make the best use of the regularity in the data to achieve better performance than hand-crafted systems.
The methods in [23,[27][28][29][30] use data-driven techniques to learn navigation strategies that can be deployed in previously unknown environments without the need of building an explicit spatial representation of the environment. However, these approaches are inherently sample-inefficient because the information provided by mature mapping, localization, and path-planning methods are ignored. Specially, Wijmans et al. [23] have trained their model to achieve good performance by using 2.5 billion frames of experience, which requires more than 6 months of GPU time! Although Ye et al. [27] propose that the sample efficiency can be improved using multiple auxiliary tasks, their training process still needs 40 M frames and 1 GPU-week. Meanwhile, some researchers propose to decouple the perception and policy to speed up the DRL training process [31][32][33]. However, the perceptual part also requires a lot of training and needs to be fine-tuned when facing a new scene.
Moreover, it is found that most learning-based models [23,27,28,[31][32][33][34] can use the simulator's flaws to slide along the obstacles instead of avoiding them [29,35]. Such policies may lead to catastrophic failures in the real world because collisions may damage the robot. Kadian et al. [29] find this flaw for the first time and modify the simulator to disable sliding along the obstacles. However, they only limit the number of collisions rather than learn how to avoid them. Sax et al. [30] and Bigazzi et al. [35] penalize obstacle collision with a large negative constant in the training process. However, Morad et al. [36] find that an effective collision-free policy cannot be learned when the negative constant is enforced in the training process due to reward sparsity and catastrophic forgetting [37][38][39]. To mitigate this elevated reward sparsity, they propose to use an automatic curriculum learning technique to improve the data sampling.
In summary, approaches to solve PointGoal navigation can broadly be classified into two categories: (1) traditional navigation methods that decompose navigation into localization, mapping and planning [12][13][14][15] or (2) learning neural policies using DRL [4,[23][24][25][26][27][28][29]35]. Traditional navigation methods have the weakness of intensive computational demand and involve numerous parameters that need to be tuned manually. Moreover, these methods cannot constrain a path such that the robot can see enough features. For the learning-based methods, simultaneously implementing motion planning, scene layouts and obstacle avoidance are extremely challenging and prohibitively expensive.
Based on these problems, we propose a PointGoal navigation framework to train and integrate safe DRL-based policy into the traditional navigation pipeline. In our method, the PointGoal navigation framework comprises a visual SLAM module, a global planner module and a DRL-based local planner module. The visual SLAM module produces the robot pose and two-dimensional (2D) navigation map from raw sensory inputs. To achieve long-range navigation, the global planner module is used as a local waypoint generator that connects the visual SLAM module and the DRL-based local planner module. The global planner module consumes the 2D navigation map and the robot pose from the visual SLAM module, and uses a conventional global planning algorithm to generate local waypoints for the DRL-based local planner module. The DRL-based local planner module can directly map both the non-spatial input (i.e., the relative distance and relative angle) and spatial input (i.e., RGB image, depth image and feature points image) to safe actions towards local waypoints while avoiding collisions and tracking failure. Three specialized movement modes are included in the local planner module to ensure safety, each corresponding to a specific navigation skill (e.g., obstacles avoidance, tracking failure avoidance).
In summary, the main contributions of this paper are as follows: 1. We introduce Safe-Nav, a hierarchical navigation framework to train and integrate the DRL-based method into the traditional robot navigation pipeline. 2. We combine three specialist movement modes in the local planner module to actively plan safe paths of a robot to a goal while avoiding collisions and tracking failure. 3. Extensive simulation results on complex and unseen scenarios, and systematic research on the safety and ablation demonstrate the effectiveness of our proposed framework.

Methods
Overview Figure 1 presents a flowchart of our proposed framework Safe-Nav. The autonomous navigation system consists of a visual SLAM module (Nav-SLAM), a global planner module and a DRL-based local planner module. After sampling the start position and goal position, a PointGoal navigation task is started. Our visual SLAM module predicts the 2D navigation map of the environment and the robot poses from incoming RGB observations and depth observations in real time. The global planner module uses the predicted navigation map and the robot pose to generate a feasible trajectory from the current position to the goal position and select the local waypoint along the path. Next, the DRL-based local planner module executes the navigation process of moving to the local waypoint. Upon reaching the goal, colliding with the environment or happening tracking failure, the PointGoal navigation task ends. The visual SLAM module, the global planner module and the DRL-based local planner module are closely connected in the proposed navigation framework Safe-Nav. The relationship between these three modules can be further elaborated as follows: • The visual SLAM module is the basis of the navigation framework Safe-Nav. On the one hand, it provides the 2D navigation map and robot pose for the global planner module. On the other hand, it generates the feature point maps for the DRL-based local planner module as one of the spatial observation inputs. • The global planner module is responsible for planning a local waypoint on the 2D navigation map based on the current robot pose. Then, the relative distance and angle between the current robot pose and the local waypoint are computed as the non-spatial observation inputs for the DRL-based local planner module. • The DRL-based local planner module can directly map both non-spatial inputs and spatial inputs from the RGB-D camera and other modules to safe actions towards local waypoints while avoiding collisions and tracking failure.
From section "Visual SLAM module" to section "DRLbased local planner module", each module is described in detail. In section "Evaluation metrics", we give a detailed description of the evaluation metrics. Then, in section "Simulation setup", we illustrate the simulation platform for training and evaluation.

Visual SLAM module
The visual SLAM module processes the RGB-D data in sequence and produces the robot pose and navigation map at each key frame. Due to the excellent performance of the Oriented FAST and Rotated BRIEF-SLAM2 (ORB- Fig. 1 The overview of the proposed framework Safe-Nav. Each module is represented by a grey box. The DRL-based local planner module learns to approach a desired local waypoint while avoiding collisions and tracking failure

LOOP CLOSING
SLAM2) [40] in most application scenarios, it is adopted in our framework to provide a feature-based visual SLAM solution. However, the original ORB-SLAM2 method can only construct sparse point cloud maps, which only meets the needs of robot localization rather than robot navigation. Therefore, a complete visual SLAM system for the robot navigation task, named Nav-SLAM, is proposed based on ORB-SLAM2, enabling us to estimate the robot pose and produce the navigation map. The architecture of Nav-SLAM is shown in Fig. 2. Four threads run in parallel in Nav-SLAM: tracking, local mapping, loop closing and navigation map creation. The navigation map creation thread gets new key frames and the key frames' poses from the tracking thread. The key frames' poses and the RGB-D images are used to generate the point cloud. Based on the camera projection model and the key frames' poses, the coordinate of the three-dimensional (3D) point corresponding to each pixel from the key frame in the world coordinate system can be obtained [41]. The spatial point P c = (x c , y c , z c ) in the camera coordinate system, which corresponds to the pixel coordinate (u, v), can be expressed as where c x , c y , f x and f y are internal camera parameters. d stands for the depth of the pixel (the distance from the pixel to the imaging plane). s indicates the depth scale.
To obtain the coordinates of the pixels in the real world, it is necessary to use the pose of the key frame to transform P c into the coordinate in the world coordinate system P w : where T i is the pose of the i-th key frame. Performing the same operation on all pixels in each key frame can generate a 3D dense point cloud map of the environment. Compared with the point cloud map, the octree map stores the map in an octree hierarchical structure [42], which saves more memory space than the point cloud map. Meanwhile, the octree map also has excellent properties such as adjustable resolution and updating nodes in probability. In the process of creating the point cloud map, we construct the octree map of the 3D environment in real time based on the octree map library. Then, we project the 3D octree map onto the 2D plane. Only those obstacles within a range of [0.3, 1.2] times the camera height (1.25 m) are projected to create the 2D navigation map, which avoids suspended objects or ground planes well.

Global planner module
This module aims to determine the global path from the current position to the target location along the local waypoints on the path. As there is no initial information about the environment, it is impossible to obtain the optimal global path or even ensure that the robot can successfully complete the PointGoal navigation task without collisions and tracking failure by simply following the global path. Therefore, to accomplish long-distance PointGoal navigation, it is neces-sary to find intermediate points from the global path as local waypoints for the local planner. Considering that A* [43] is a simple and efficient global path planning algorithm, we use A* as the global planner to compute the global path. Given the global path and the current robot's localization, a local waypoint is selected along the path that lies at a distance of d s (which is set to be 0.6 m) from the robot's current position.

Problem formulation
The task of the DRL-based local planner module is to navigate successfully from the current position to the local waypoint as fast as possible without collisions and tracking failure. The robot interacts with the environment by executing the translation velocity (ν) and rotation velocity (ω). We formulate this process as a partially observable Markov decision problem (POMDP), which can be defined as the 6-tuple (S, O, A, T, R, γ ). Here, S represents the robot's states, A stands for the action space, O is the observation space, T indicates the transition probability from one state to another, R means the reward and γ expresses the discount factor. The robot cannot fully obtain the state s t , so it has to rely on observations o t ∈ O (e.g., an RGB image). The state s t ∈ S changes based on the robot's actions a t ∈ A according to the transition probability T (s t+1 |s t , a t ), where a t ∼ π(·|o t ). After each state transition, the robot receives a reward r t . The goal of DRL is to find the optimal policy π * that maximizes the expected reward as follows: where the discount factor γ balances the trade-off between exploration and exploitation.
In this paper, we use an on-policy gradient algorithm: Proximal Policy Optimization (PPO) [44] to find the optimal policy π * . Given a series of collected trajectories (commonly named as 'rollout') and a θ -parameterized policy π θ , the policy π θ can be updated using PPO algorithm as fol- Let r t (θ ) = π θ /π θ old be the ratio of the current policy and the old policy used to collect the rollout, which can be used to achieve stability. Here, π θ old is the policy before the update. To avoid large policy updates, the ratio r t (θ ) is constrained to the range of [1 − ε, 1 + ε] via the clip term. Then, the policy parameters are updated by maximizing the object function: where the expectation E t [. . . ] indicates the empirical average over the rollout.

Observation space
There are five different observation types which can be described as follows. The first type and the second type are the RGB images (denoted by o rgb ) and the depth images (denoted by o depth ), respectively. The third type is the relative distance between the local waypoint and the robot's position, which can be denoted as o dis . Given the coordinate position of the robot on the navigation map p robot = (x 1 , y 1 ) and the coordinate position of the local waypoint p waypoint = (x 2 , y 2 ), o dis can be computed as follows: The fourth type is the relative angle between the local waypoint and the robot's position, which can be denoted as o angle . Given the deflection angle in the world coordinate system A θ , o angle can be computed as follows: where 2π constrain the relative angle o angle within [−π, π].
In addition, we use a representation of ORB feature points in the observation. In ORB-SLAM2, ORB features [45] are used for tracking. The performance of ORB-SLAM2 is greatly affected by the number of ORB feature points in the observed images, because the small number of feature points can lead to inaccurate pose estimation. In each step, we extract ORB feature points from the RGB image. Then, the binary feature point image o f pm is constructed, which serves as the fifth observation type. Each pixel on o f pm is 0 or 1, which, respectively, represents the presence or absence of the ORB feature point.

Action space
In this paper, we train our DRL-based local planner module on a discrete action space composed of four discrete actions move forward, turn left, turn right, move backwards in total, which apply different translational and angular velocities to the agents. As mentioned earlier, tracking in visual SLAM systems usually fails during the period of pure rotational motions [46,47]. Handling pure rotational motion problem for visual SLAM has always been very challenging. To avoid this problem, our method avoids the pure rotational motion by applying a small translational velocity when turning left or right. Given the limits for the translational velocity (−ν max , ν min , ν max ) and angular velocity (−ω, ω), four discrete actions can be defined as: (ν max , 0), (ν min , −ω), (ν min , ω) and (−ν max , 0). The execution time of each action is 0.5 s.

Movement modes
Avoiding collisions and tracking failure is important for our local planner module but are much harder to train simultaneously. Here, we propose three specialized movement modes: the normal mode, the collision-avoidance mode and the tracking-failure-avoidance mode. One mode might specialize in approaching the local waypoints fast, another in avoiding collisions and a third in getting enough visible features. By combining these modes, the robot can reach the local waypoints faster and learn to avoid collisions and tracking failure better.
At each time step, the robot's observation acts as a switch that selects a specialized mode. We assume a case in which the robot is very close to the obstacles. In this case, the robot's task is to approach a local waypoint safely while avoiding collisions. We define this movement mode as the collision-avoidance mode. Tracking failure can arise from observing limited visual features in featureless areas. We define the tracking-failure-avoidance mode, in which the robot is required to move to an area that can guarantee enough visual features. When the robot is far away from the obstacles and in a feature-rich area, the robot should be able to reach the local waypoint as fast as possible and try to avoid entering those dangerous areas. We define this case as the normal mode. During navigation, the normal mode is used more frequently than the other two modes.
Next, we will further describe the network architecture of each movement mode in detail.
1. Collision-avoidance mode. In this mode, a depth image along with the relative distance and angle between the robot position and the local waypoint, can be used as the input. In this way, a better capability of perceiving the distance to the obstacles and the relative position of the local waypoint can be available. Thus, we propose a two-branch feature extraction network that splits the input into non-spatial inputs (relative distance and relative angle) and spatial input (depth image). The non-spatial branch network is a single fully connected (FC) layer. It encodes the relative distance and angle into a high dimensional feature ( yellow rectangle in the upper branch in Fig. 3), further processed in the selfattention module. The depth image is fed into the spatial branch network to extract deep features, which is a ResNet18 [48] network structure (green layers in the left lower branch in Fig. 3. ResNet18 is a lightweight architecture and can achieve robust performance to the vanishing/exploding gradient problem during training, which has already shown its strengths in the task of visual navigation [49]. The adaptive average pooling layer replaces the output of the second last average pooling layer. The output of this layer is flattened and reduced to a one-dimensional output ( green rectangle in the lower branch in Fig. 3), which is further processed in the following self-attention module.
Then, we concatenate the outputs of both branches and process them together in the following PPO-based neural network architecture, which is shown on the right side of Fig. 3. As mentioned above, a PPO-based neural network architecture is used to train the DRL-based local planner. PPO is an actor-critic network architecture. For the actor and critic networks, we use the same head. The output action space is discrete and consists of three actions -move forward, turn left, turn right. Rectified linear unit (ReLU) activation follows after the first two FC layers. The output of the last FC layer is used to produce a softmax distribution over the action space. In the critic network, ReLU activation is applied on all the FC layers. The output of the last FC layer is used to produce an estimate of the value function.
2. Tracking-failure-avoidance mode. It is important to avoid tracking failure in this mode instead of reaching the local waypoint since the number of associated visual features can affect the robot's localization capabilities. Less feature points often correlate to inaccurate relative position estimation by the visual SLAM module. Therefore, we only have visual sensing capabilities in the tracking-failure-avoidance mode, such as RGB images, depth images and feature point images, and lack any position sensing capabilities such as the relative distance and relative angle. We concatenate the outputs of the RGB image branch, depth image branch and the feature point image branch into a new feature, as shown in Fig. 4. Different from the collision-avoidance mode, we add the move backwards action in the tracking-failureavoidance mode, which can provide back and forth motions to better avoid tracking failure.
3. Normal mode. In this mode, the robot is required to move to the local waypoint in the shortest possible time and try to stay away from those dangerous areas that may lead to collisions and tracking failure. Therefore, a broader range sensing capability is required in this mode. The RGB-D images, the feature point images, the relative distance and the relative angle are all used as the input, which is shown in Fig. 5.

Self-attention modules
Recently, the visual attention mechanism has been widely used in computer vision, making the deep learning network find the correlation between features and highlighting important features to improve the feature extraction ability [50][51][52].
For non-spatial input, i.e., the relative distance and relative angle, a common self-attention method is adopted [53]. We first feed the input to a fully connected layer, which encodes   Softmax Fig. 6 The architecture of self-attention module. a Non-spatial input. b Spatial input the relative distance and angle into a high dimensional feature space to process them in the following self-attention module. The self-attention module for non-spatial input is composed of two additional fully connected layers. We feed the obtained high dimensional features to the first fully connected layer followed by Tanh activation. The result is inserted into a second fully connected layer followed by a softmax activation. The produced feature weights and the original high dimensional features are combined through element-wise multiplication. The process is presented in Fig. 6a. For the spatial input, such as RGB image, depth image and feature points image, we adopt a non-local style self-attention method, which has already been used in visual localization [54], video analysis [55] and image generation [56]. It can capture the long-range dependencies and global correlation of the image features, which is presented in Fig. 6b. Given a feature vector, its self-attention can be calculated using the softmax function. Then, the residual connection will be added to the linear embedding.

Reward function
1. In the normal mode, our reward function provides negative constants to discourage collisions and tracking failure, and positive rewards to encourage the robot to go to local waypoints as soon as possible. Therefore, our reward function considers five summands regarding the local waypoint wp, the relative distance ds, the relative angle ag, the obstacles o and the tracking status of the visual SLAM module tr:

R t sa f e = R t sa f e (wp) + R t sa f e (ds) + R t sa f e (ag) +R t sa f e (o) + R t sa f e (tr).
The reward R t sa f e (wp) has a high positive reward R wp if the position of the robot P t r reaches the local waypoint position P t wp within radius D wp : The robot is rewarded for getting closer to the local waypoint P t wp , but it is punished for moving away from it. The diff(·) function in Eq. (10) computes the difference between the relative distance at the previous time step and the current time step: Moreover, to reach the local waypoint as soon as possible, the robot should learn to adjust the relative angle and move towards the local waypoint. Therefore, given the relative angle at the previous time step α t−1 and the relative angle at the current time step α t , the robot is rewarded for reducing the relative angle and punished for increasing the relative angle, which can be expressed as Of course, the robot is punished for colliding with obstacles. This enables the robot try to keep away from obstacles. If the robot collides with obstacles, it results in a high negative constant R C O : The robot is also punished for tracking failure. This allows the robot to avoid moving to areas where there are no features. If tracking failure occurs, the robot will get a high negative constant R T F : 2. In the collision-avoidance mode, our reward function provides positive rewards to encourage the robot to move away from obstacles and get closer to the local waypoints.
We compute the minimum depth value d t min and the average depth value d t avg from the depth image at each step. The robot is rewarded for increasing the minimum depth value, but it is punished for decreasing the minimum depth value in Eq. (14). To avoid the sparse reward when the robot turns right or turns left due to the small translational velocity, we reward the robot for increasing d t avg [see Eq. (15)]. In addition, we limit the sum of R t oa (mindepth) and R t oa (avgdepth) to [−5, 5] to avoid excessive changes of these rewards: In addition, the robot is punished for colliding with obstacles. If the robot collides with obstacles, it results in a high negative constant R C O : Similarly as the normal mode, the robot is rewarded for getting closer to the local waypoint, but it is punished for moving away from it. It is important for the robot to pass through narrow areas: The robot is rewarded for reaching the local waypoint position P t wp within the radius D wp without collisions: The total reward of the collision-avoidance mode can be calculated as the sum of all sub-rewards: 3. If the number of extracted ORB feature points is small during navigation in the normal mode, it will enter the tracking-failure-avoidance mode. There is no separate reward for successfully reaching the local waypoint as our aim is to avoid visual SLAM failure along the way rather than optimally reach the local waypoint. Therefore, our reward function in the tracking-failure-avoidance mode only considers three summands regarding the number of feature points po, the tracking of the visual SLAM module tr and the back motions back, as described as follows: Let N t track be the number of feature points at the current time step. The robot is rewarded for obtaining more feature points, but it is punished for getting fewer feature points in Eq. (22). In addition, we limit R t track ( po) to [−6, 6] to avoid excessive changes of this reward: where N t−1 track and N t track are the number of feature points visible at times (t-1) and t, respectively.
Since the action space includes back motions in the tracking-failure-avoidance mode, it is dangerous for the robot to select continuous back motions. Since continuous back motions will easily lead to collisions. For that reason, the robot gets punished for selecting continuous back motions for some time steps D back : Of course, the robot is punished for tracking failure and colliding with obstacles: Training Three modes were trained: the normal mode, the collisionavoidance mode and the tracking-failure-avoidance mode. The normal mode was first trained. For each PointGoal navigation task, the feasible start and goal positions were given. After obtaining the local waypoint position from the global planner module, a local planner process was started from the robot's current position to the local waypoint position. Each local planner episode was limited to k time steps. Each local planner episode ended when the local waypoint was reached within the radius D wp or collision occurred or tracking failure occurred or a given time steps were reached. For each step, the obtained reward r t was stored into the buffer R. These values were used to calculate the value loss L value and the policy loss L policy . Softmax was applied to obtain the policy values into probabilities of each action. Finally, the policy gradient and value gradient were calculated using the respective losses. The network parameters were updated accordingly. When either the goal was reached, or collision occurred, or tracking failure occurred, the PointGoal navigation task was terminated. Suppose N PointGoal navigation tasks were given, the training algorithm can be described in Algorithm 1 shown below.

Algorithm 1 Training algorithm of the normal mode
Initialize: Define actor network π θ and critic network V θ . Initialize network parameters. To better avoid possible collisions and tracking failure, we then trained the collision-avoidance mode and the tracking-failure-avoidance mode. During the navigation process, given a local waypoint, we chose the suitable movement mode according to the observation of the robot at each step. When the minimum depth value d t min in the current depth image was less than the threshold T h depth for some time steps D depth , we chose to enter the collision-avoidance mode. When the number of feature points N t track in the current binary feature point image was less than the threshold T h points for some time steps D points and the current local waypoint was not the goal, we chose to enter the trackingfailure-avoidance mode. Table 1 gives the concrete parameter values used in our evaluation. We used an Intel Core i9-10900X processor, with 64 GB RAM and NVIDIA RTX 3090 GPU for model training and testing.

Evaluation metrics
We used four evaluation metrics: success, success weighted by path length (SPL) [3], average Collision Rate (CR) and average Tracking Failure Rate (TFR). A PointGoal navigation task was considered successful if the robot reached the goal within 0.2 m without collisions or tracking failure. Given the geodesic distance of the shortest path l and the length of the path p taken by the robot for each task, the popular SPL metric was calculated as follows: where S i is the binary indicator of success in the task i.
In addition, we can calculate the average Collision Rate (CR) and average Tracking Failure Rate (TFR) by counting the number of collisions and tracking failure.

Simulation setup
We used the Habitat simulator [28] as our training and evaluation platform, which are 3D reconstructions of real indoor spaces from the Gibson dataset [57] . We used the train split of the dataset from the Habitat repository. We evaluated our approach over five trials across five unseen test environments (see Fig. 7). In the Habitat simulator, we set the camera height to 0.45 m. The spatial resolution of the RGB image and depth image were both set to be (640, 480). Considering that many mainstream RGB-D cameras have narrow horizontal field of view (FoV). For example, Kinect v1 has only a horizontal FOV of 57-degree. Therefore, the horizontal FOV was set to be 57-degree, which is very important for practical PointGoal navigation applications.

Training analysis
In the training process, we randomly sampled the PointGoal navigation task from the training dataset. In the first 500 tasks, we only trained the normal mode. In the following 1000 tasks, we compared the difference between continuously training the normal mode and freezing the weights of the normal mode to train the other two modes. The comparison results are shown in Fig. 8. For convenience, we define the model trained with and without the collision-avoidance mode and the tracking-failure-avoidance mode as Safe-Nav and Baseline, respectively.  1. First, we compare the success rate during the training process of Safe-Nav and Baseline. Figure 8a presents the results, where the brown curve represents the training process of Baseline. At the 500th task, the collisionavoidance mode and the tracking-failure-avoidance mode were added to the training process for comparison, shown in the green curve. The success rate of Safe-Nav first decreases and then increases rapidly from the 600th task to the 1500th task. This is because the other two modes were not well trained from the 500th task to the 600th task. After the 900th task, the advantages of adding the collision-avoidance mode and the tracking-failureavoidance mode can be reflected. Safe-Nav has better performance than Baseline from the 900th task to the end of the training. 2. Second, we compare the collision rate during the training process of Safe-Nav and Baseline. Figure 8b shows the comparison results. After adding the collision-avoidance mode, the collision rate of Safe-Nav first increases slightly. As the training continues, the collision rate of Safe-Nav tends to decrease and reaches a minimum when the training ends. However, with the increase of training tasks, the collision rate of Baseline decreases very slowly. There are two possible reasons. On the one hand, compared with the normal mode, there is only one type of spatial input (depth image) in the collision-avoidance mode, which reduces the interference of irrelevant information. On the other hand, the collision-avoidance mode benefits from the specially designed multi-constraint reward function, which further improves the performance of avoiding obstacles. 3. Third, we compare the tracking failure rate during the training process of Safe-Nav and Baseline. Figure 8c shows the comparison results. The tracking failure rate of Safe-Nav first increases from the 500th task to the 700th task, which indicates that the tracking-failure-avoidance mode is unstable in the early stage of the training process. After the 700th task, the tracking failure rate declines significantly and finally achieves almost the same result as Baseline. It is noted that the tracking failure rate is calculated in the whole training process, which reflects the changing trend of the tracking failure rate during the training process, including the early unstable stage of the tracking-failure-avoidance mode. Therefore, the final tracking failure rate of Safe-Nav is actually much lower than Baseline.
The following conclusions can be obtained based on the results described above: . 8 The success rate, collision rate and tracking failure rate during training. The green and brown curves are training process of our model trained with and without the collision-avoidance mode and the tracking-failure-avoidance mode Remark 1 As the number of training tasks increases, the performance of Baseline grows very slowly. This is because a lot of interaction with the environment is needed in the training process of deep reinforcement learning, which increases the cost of training.

Remark 2
In Safe-Nav, benefited from the specialized collision-avoidance mode and the specialized trackingfailure-avoidance mode, the training time is obviously shortened, and the collision rate and the tracking failure rate can be significantly reduced.

Comparison with the existing DRL-based methods
We compared Safe-Nav with two state-of-the-art DRL-based methods: PPO [28] and DD-PPO [23]. The average success rate, SPL scores and average collision rate are reported in Table 2. Different from the works in [23,28], a PointGoal navigation task using DRL-based methods was considered successful only when the robot reached the goal position without collisions in our research. From Table 2, we can observe that uniformly across the datasets, our method performs best. Our method effectively improves the success rate and SPL scores, while decreasing the collision rate by a large margin. In addition, both the PPO and DD-PPO methods have very high collision rates. During the navigation process, we observed that the robot using the PPO and DD-PPO methods tended to hit the obstacles instead of avoiding them. However, this behavior was not realistic, a real-world robot would bump into obstacles and simply stop upon colliding. In contrast, our method performed much better in most PointGoal navigation tasks.

Comparison with classic methods
This section includes an extensive comparison with the wellknown classic local planner DWA. We kept the visual SLAM module and the global planner module unchanged and only changed the local planner module. The representative qualitative trajectories of our Safe-Nav method and DWA on each test scene are illustrated in Figs. 9, 10, 11, 11 and 13. In all the cases, the DWA trajectories are shown in red. The trajectories of our normal mode, collision-avoidance mode and trackingfailure-avoidance mode are shown in blue, green and pink, respectively. As shown in Fig. 9, the left and right sides of the Edgemere scene contain many featureless areas, such as white walls, increasing the risk of tracking failure. In Fig. 9a, the DWA method selects to move close to a wall and slide along the wall, which leads to a collision eventually. Instead of moving close to obstacles, the proposed method selects to move along the wall at a distance. When the minimum depth value is less than the threshold, the robot will adjust the angle in time and then continue to move forward. In Fig. 9b, there are two major featureless regions in this scene: the white wall in the lower left corner of the starting point and the white wall on the right side of the small room in the lower left corner. The selected path by the DWA method traverses through the featureless region, which results in tracking failure. In contrast, the robot enters the tracking-failure-avoidance mode around the starting point using the proposed method. Instead of moving close to the local waypoint, the robot chooses to move backward and turn right to get more feature points.
As shown in Fig. 10, the Pablo scene contains two narrow corridors on the right side, which increase the risk of collisions. In Fig. 10a and b, the robot is easy to collide with obstacles at the corner. This may be because DWA only relies on the navigation map and location to plan local paths, and lacks the ability of using rich visual information to perceive surrounding obstacles. Moreover, the estimated navigation map and location often contain errors inevitably, resulting in collisions easily. However, the proposed method can reach the goal position successfully without collisions in both cases, which reflects that it has better ability of avoiding obstacles.
As shown in Fig. 11, the Greigsville scene contains a narrow and featureless corridor on the right side of the center. In Fig. 11a, with the help of the tracking-failure-avoidance mode, the robot successfully passes the featureless corridor by turning one circle to adjust its orientation. However, the robot has already collided with obstacles before it gets close to this corridor using the DWA method. In Fig. 11b, the robot collides in the first few steps using the DWA method. On the contrary, our robot can successfully pass the narrow corridor and finally reaches the goal position.
As shown in Fig. 12, the size of the Ribera scene is large, which includes many featureless areas and narrow corridors, increasing the difficulty of PointGoal navigation tasks. Figure 12a illustrates a long-distance PointGoal navigation task.  The robot finally approaches a featureless sofa, resulting in tracking failure using the DWA method. On the contrary, the proposed method can successfully reach the goal position without collisions and tracking failure. In Fig. 12b, the left side of the start position is a featureless region, which is a white wall. In contrast, many feature-rich regions are distributed along the walls surrounding the environment, especially the walls of the upper and the bottom sides. However, the local waypoint planned by the A* algorithm is in the featureless region when the robot is at the starting point. Tracking failure occurs when the robot turns left using the DWA method. Instead of naively adjusting the orientation towards the local waypoint, the robot chooses to move backward and turn right to get more feature points. As shown in Fig. 13, the Denmark test scene contains one major featureless region: the white wall at the end of the longitudinal corridor on the right of the center. In contrast, many feature-rich regions are distributed along the corridor, especially the cabinets of the left and right side. In Fig. 13a, the robot approaches the wall and collides with it using the DWA method. Similar to the previous environments, DWA chooses to naively move towards the local waypoints in Fig. 13b. However, since the robot looks towards featureless regions, it results in tracking failure. On the contrary, the proposed algo-   Table 3 lists the quantitative evaluations for DWA and the proposed method. In total, we conducted five test runs for each PointGoal navigation task. We evaluate the performance of the local planner by calculating the success rate, the average collision rates with all obstacles, the average tracking failure rates and SPL. The results show that the proposed method outperforms the classic local planner DWA method in terms of all the indicators, and the safety becomes more evident. However, we found that the robot often had trouble passing through narrow featureless corridors, producing dozens of time steps in length. We believe that actively adjusting the RGB-D camera's orientation may help the robot pass through these areas easier. Another limitation was that the robot often easily collided with the corners of obstacles due to the narrow FOV. As Fig. 14a shows, once the robot observes that the depth value is very small, it will immediately enter into the collision-avoidance mode. Then, it will turn right and move forward to avoid obstacles. Collisions can be avoided in many cases after adding the collision-avoidance mode. However, as Fig. 14b shows, the current position of the robot is very close to the obstacles and there are no obstacles in the historical observations along the way due to a narrow FOV. In this situation, the robot is very likely to move forward and collide with obstacles. Regarding this issue, we believe that the proposed method could be further improved with the help of memory modules, which are active areas in the research of computer vision [58,59].

Ablative analysis
To demonstrate the effectiveness of the collision-avoidance mode and the tracking-failure-avoidance mode, we compared Safe-Nav with an ablated version: Baseline policy which only uses the normal mode without other modes. The summarized results (Table 4) shows that the proposed Safe-Nav performs better than Baseline policy, as it introduces two extra special-ized movement modes, which can better deal with avoiding obstacles and featureless areas.

Conclusion
In this paper, we present and validate a framework Safe-Nav for planning a path by a mobile robot toward a goal through an unknown environment without collisions and tracking failure. The DRL-based local planner is integrated into the conventional navigation pipeline. We can directly train and deploy DRL-based algorithms as local planners by combining the visual SLAM module and global planner module. First, the method Nav-SLAM is proposed for navigation mapping under complex indoor environments using the RGB-D sensors. Second, the proposed global planner module generates local waypoints. Third, three different movement modes in the local planner module are combined to plan a path free of collisions and tracking failure.
In the future, the end-to-end DRL-based local planner approaches by integrating some long-term memory modules [58,59] will be our focus. Moreover, how to actively adjust the RGB-D camera's orientation to choose the best viewing directions to pass through featureless areas will also be our interest. In addition, DRL is treated as black-box models, lacking interpretability. We believe that more explainable and robust models could be obtained using some explainable AI technologies [60]. Code Availability Not applicable.

Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval This article does not contain any studies with human participants or animals performed by any of the authors.

Consent for publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.