1 Introduction

Autonomous navigation is an important capability for intelligent robots to accomplish path-finding tasks. In unknown environments, robots usually need to traverse the environment exhaustively to build a map before navigating [1]. With the development of deep learning, it has become possible to achieve autonomous navigation without building a map in advance [2]. A key challenge for autonomous navigation in unknown environments is how to explore the unseen scenario quickly and effectively. The robot needs to identify its current position and build the map directly from camera data, without any manual assistance.

Chen et al. [3] used end-to-end learning to realize robot navigation in unknown environments. This work proposes that the learning approach can extract geometric shapes directly from RGB images and is more flexible than traditional navigation methods. However, the method may not be effective for exhaustive exploration or long-term planning in large environments. The exponential growth of the search space with the distance to goals can be a significant challenge for this approach. The high cost associated with performing map matching, pose estimation, and path planning tasks due to the large search space may limit its practical application [4].

Devendra et al. [5] presented an autonomous exploration model called Active Neural SLAM (ANS). Its modular and hierarchical strategy reduces the search space during training and achieves better performance and sampling efficiency than end-to-end learning. However, the ANS model still has some problems. First, the model may fail to recognize whether the floor is an obstacle if it has a color similar to the furniture. Second, the agent may repeatedly visit already-explored areas due to improper local goal selection. Figure 1 (the blue line) shows that ANS produces many errors in the predicted map, and the agent wanders around in the explored room.

Fig. 1
figure 1

Disadvantage of ANS. The gray part is the real map and the colored part is the predicted map. ANS has many localization errors and mapping errors in the blue circle. (Color figure online)

Recognition difficulties can arise when the environment is complex or when there are similar features that are difficult to distinguish. Attention mechanisms can help the robot to clearly identify targets [6, 7]. In addition, more sophisticated decision-making strategies can help robots navigate more efficiently and reduce wandering. Hierarchical planning and reinforcement learning are examples of decision-making strategies that can take into account both global and local information to help robots make more informed and efficient navigation decisions [8].

To address the above challenges, we propose both spatial and channel attention mechanisms to improve the efficiency of map matching and navigation. Building on the ANS framework proposed by Devendra et al., we add several attention mechanisms and use the CBAM (Convolutional Block Attention Module) [9] to gain better access to semantic information in the environment.

  • We use a spatial attention mechanism to increase the weight of the main obstacles when processing RGB images, which helps the robot to identify obstacles (such as a similarly colored floor or tea table) and predict its position.

  • We develop a channel attention mechanism to increase the weight of channels with unknown space during navigation. It gives higher priority to exploring unvisited regions and avoids revisiting explored areas as much as possible.

In summary, our contributions are as follows:

  • Our work proposes a deep learning SLAM (Simultaneous Localization and Mapping) strategy to explore unknown 3D environments. It increases the priority of unknown regions to avoid repeatedly exploring the same areas.

  • We present a mechanism with a spatial attention and a channel attention to improve the matching efficiency and the prediction accuracy. This method is able to quickly identify obstacles and reduce the traversal time.

  • We implement our model on the Habitat simulator platform [10, 11] and evaluate its performance on the Gibson datasets [12]. The experimental results demonstrate that our model can build a more accurate map than ANS and other baseline methods with less runtime. The exploration coverage of our model is higher than that of the baseline methods in unknown environments, as measured by the summary metric SPL (Success weighted by (normalized inverse) Path Length) [13].

The structure of the paper is arranged as follows. We study the related work in Sect. 2. Our model and detailed design are presented in Sects. 3 and 4. The experimental evaluation and comparison are conducted in Sect. 5, and we conclude this paper in Sect. 6.

2 Related Work

In active SLAM, we need to maximize both environment exploration (to better understand the environment) and environment exploitation (to revisit previously traversed regions for loop-closure detection) for better map estimation and localization. There are several approaches to active SLAM, which we summarize below.

The geometric approaches treat active SLAM as a task in which the robot selects the optimal path and trajectory. These methods reduce the uncertainty in pose estimation and mapping to achieve efficient autonomous navigation in unknown environments. They enable the use of frontier-based exploration as well as conventional path planners, e.g., RRT* (Rapidly-exploring Random Tree) [14], D* (D-star) [15], and A* (A-star) [16].

Information-theoretic (IT) methods formulate exploration as a joint entropy minimization problem. DRRT (Distance Rapidly-exploring Random Tree) [17] uses distance as the cost function, while ERRT (Entropy Rapidly-exploring Random Tree) [18] uses the entropy change over the traveled distance as the cost function. These algorithms aim to improve exploration efficiency in path planning by considering the information gain and uncertainty reduction in the environment.

Frontier-based explorations characterize the boundary as a set of unknown points, where each point has at least one known spatial neighbor [19]. A hybrid control-switching exploration method with a particle-filtered SLAM backend uses frontier-based exploration with A* as the global planner and the Dynamic Window Approach (DWA) [20] reaction algorithm as the local planner [21]. Another work uses graph-based SLAM (i-SLAM) [22] optimization as a SLAM backend for autonomous exploration using frontier exploration.

In contrast to traditional path planners (RRT*, D*, A*), the dynamics-based approaches formulate active SLAM as the problem of selecting a sequence of control inputs that generate collision-free trajectories and cover as much area as possible [23]. These methods minimize the uncertainty in state estimation, which improves localization and mapping in the environment. For selecting the optimal target location, these methods employ a Linear Quadratic Regulator (LQR) [24] or Model Predictive Control (MPC) [25] to choose the optimal future trajectory, balancing the need to explore new regions against the need to revisit known regions.

Several deep learning-based methods have been used in SLAM research. Direct methods use deep neural networks to directly estimate the pose of robot and construct a map from raw sensor data, while indirect methods extract features from sensor data and use them to estimate the pose of robot and map [26]. End-to-end methods learn the entire SLAM pipeline from raw sensor data to the final map [27]. Hybrid methods combine deep learning with traditional SLAM approaches [28]. While traditional multi-view learning mechanisms cannot handle multiple views with variable features, deep learning-based methods use multi-view image stitching to stitch images taken from different viewpoints into an image with a broader field of view [29, 30].

Some studies define the active SLAM problem as an approximation of a partially observable Markov decision process (POMDP) [31, 32]. They also seek to find a sequence of actions that minimizes the map uncertainty [33, 34]. Wang et al. used a depth network and an optical flow network to improve scale accuracy in unsupervised monocular visual odometry [35]. Savva et al. [10] used simple convolutional layers with a gated recurrent unit (GRU) [36] layer as a reinforcement learning policy, and Mirowski et al. [37] added ResNet18 [38] to extract image information. Chen et al. [3] tried to solve this problem with an end-to-end RL policy. Devendra et al. [5] took advantage of both classical and learning-based approaches to analyze paths by adding the Proximal Policy Optimization (PPO) [39] policy and the Fast Marching Method (FMM) [40]. In follow-up studies, Devendra et al. used self-supervised learning to improve decision making in active learning [41, 42]. They implemented goal-oriented navigation while retaining the semantic information in the images [43, 44]. Meanwhile, they used topological representations of space or transformer-based spatial planning for navigation [45, 46]. An attention network injected into the neural network enables a greater focus on useful images and rapid prediction of objects, augmenting the representations of the depth map in detail [47, 48].

3 Overall Design

We propose a modular model called Neural Attention-based Visual SLAM (NAVS) for navigation and mapping. The model combines traditional navigation strategies and deep reinforcement learning with spatial and channel attention mechanisms to address the above challenges. The agent maximizes the reward obtained from the environment through a reinforcement learning policy, simulating the motion of a robot in an unknown indoor environment. As shown in Fig. 2, the model consists of three main components: SLAM, Global Policy, and Local Policy.

Fig. 2
figure 2

Overview of NAVS. We add the CBAM module to the neural network structure of the SLAM module and the Local Policy module

3.1 SLAM Module

The SLAM module predicts the current map of the environment and the pose of the agent. The CBAM module increases the weight of the main obstacles when processing RGB images, which helps the agent to identify obstacles (such as a similarly colored floor or tea table) and predict its position. As shown in Fig. 3, the SLAM module takes the RGB observations as input and outputs the predicted map. The SLAM module spatially transforms the self-centered map of the previous moment into the coordinate system of the current moment, according to the change in sensor reading \(x_t\). The agent compares the current self-centered map with the transformed previous self-centered map and predicts the change in relative pose. The agent then transforms the current self-centered map into a part of the global map according to the change in the pose estimate.

Fig. 3
figure 3

Overview of the SLAM module. The SLAM module mainly consists of the Mapper module and the pose estimator module. We add the CBAM attention module to the Mapper module to increase the weight of the main obstacles when building the map

To ensure continuity in building the map, we use the map and the pose of the agent from the previous moment to infer the current values, which can be established by

$$\begin{aligned} \begin{aligned} {\ M_t, X_t = f_{\text {SLAM}}(m_t, d_x, X_{t-1}, M_{t-1})}\ \end{aligned} \end{aligned}$$
(1)

where M denotes the global spatial map, m denotes the self-centered map, X denotes the pose of the agent, and \(d_x\) denotes the amount of change in the pose of the agent. The Mapper module predicts the self-centered 2D spatial map from the observation of the agent, and it is given by

$$\begin{aligned} \begin{aligned} {\ m_t = f_{\text {Map}}(S_t, \theta _M), \quad m_t \in [0,1]^{2 \times V \times V} }\ \end{aligned} \end{aligned}$$
(2)

where \(S_t\) denotes the input image of the sensor, V denotes the field of view, and \(\theta _M\) is the hyperparameter of the Mapper module.

We collect the change in sensor readings \(p_{t-1}=f_{ST}(m_{t-1}|d_x)\), and use this parameter to predict the relative pose change of the agent, \(dx_t=f_{PE}(p_{t-1} |\theta _P)\), where \(\theta _P\) is the hyperparameter of the pose estimator. Finally, we can predict the current pose of the agent as \(x_t=x_{t-1}+dx_t\).

We use a spatial transformation to convert the current self-centered map of the agent into the global spatial map, and merge the past map with the observed map to obtain the current global map. The global map and the pose of the agent can be described by

$$\begin{aligned} X_t = X_{t-1} + f_{\text {PE}}\left( f_{\text {ST}}(m_{t-1}, X_{t-1:t}),\ m_t \mid \theta _P\right) \end{aligned}$$
(3)
$$\begin{aligned} M_t = M_{t-1} + f_{\text {ST}}(m_t, x_t) \end{aligned}$$
(4)

where \(\theta _P\) is the hyperparameter for neural network training.
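To make the recurrence concrete, the following minimal sketch shows one step of the update in Eqs. (1)–(4). The callables `mapper`, `spatial_transform`, and `pose_estimator` are hypothetical stand-ins for \(f_{\text {Map}}\), \(f_{\text {ST}}\), and \(f_{\text {PE}}\), and an element-wise maximum is used as a simple stand-in for registering the egocentric map into the global map.

```python
import numpy as np

def slam_step(obs_t, sensor_delta, ego_prev, pose_prev, map_prev,
              mapper, spatial_transform, pose_estimator):
    """One recurrent SLAM update (Eqs. 1-4); a minimal sketch with hypothetical callables."""
    ego_t = mapper(obs_t)                                  # Eq. (2): egocentric map m_t
    # Warp the previous egocentric map into the current frame using the sensor reading change.
    ego_warped = spatial_transform(ego_prev, sensor_delta)
    d_pose = pose_estimator(ego_warped, ego_t)             # predicted relative pose change dx_t
    pose_t = pose_prev + d_pose                            # Eq. (3): X_t = X_{t-1} + dx_t
    # Register the current egocentric map into the global map at the estimated pose (Eq. 4);
    # an element-wise maximum is used here as a simple stand-in for map fusion.
    map_t = np.maximum(map_prev, spatial_transform(ego_t, pose_t))
    return ego_t, pose_t, map_t
```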

3.2 Global Policy

The Global Policy module generates long-term goals that the agent can reach using the current predicted environment map and the pose of the agent. The Global Policy module uses the PPO policy to set long-term goals and maximizes the explored area of the environment as the reward for the policy. The advantage function of this policy is shown in Eq. (5).

$$\begin{aligned} \begin{aligned} {A(s_t,a_t) = Q(s_t,a_t) - V(s_t)} \end{aligned} \end{aligned}$$
(5)

where \(s_t\) denotes the state, \(a_t\) denotes the action, \(Q(s_t,a_t)\) denotes the state action value function, \(A(s_t,a_t)\) is the advantage of taking action \(a_t\) in state \(s_t\), and \(V(s_t)\) denotes the state value function. The objective function is shown in Eqs. (6) and (7).

$$\begin{aligned} L^{PPO}(\theta )= & {} \text {mean}[\min (r_t(\theta )A(s_t,a_t),\textrm{clip}(r_t(\theta ),1-\epsilon ,1+\epsilon )A(s_t,a_t))] \end{aligned}$$
(6)
$$\begin{aligned} r_t(\theta )= & {} \frac{\pi _\theta (a_t|s_t)}{\pi _{\theta _\textrm{old}}(a_t|s_t)} \end{aligned}$$
(7)

where \(\theta \) denotes the policy parameter, \(\pi _{\theta }(a_t|s_t)\) denotes the action probability of a given state under the current policy, \(\pi _{\theta _\textrm{old}}(a_t|s_t)\) denotes the action probability of a given state under the old policy, \(\epsilon \) denotes a hyperparameter, and \(L^{PPO}(\theta )\) is the clipped surrogate objective. The value function is trained by minimizing the squared error between the predicted value and the actual return, as shown in Eq. (8).

$$\begin{aligned} \begin{aligned} {L^{value}(\theta )=\frac{1}{N} \sum _i\left( V\left( s_i, \theta \right) -R_i\right) ^2} \end{aligned} \end{aligned}$$
(8)

where \(L^{value}(\theta )\) is the loss of value, \(V(s_i, \theta )\) is the predicted value, and \(R_i\) is the actual cumulative reward obtained in the episode.
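As an illustration, a minimal PyTorch sketch of the clipped surrogate objective in Eqs. (6)–(7) and the value loss in Eq. (8) is given below; the tensor names and the clipping constant are assumptions made for the example.

```python
import torch

def ppo_losses(log_probs, old_log_probs, advantages, values, returns, clip_eps=0.2):
    """Clipped PPO surrogate (Eqs. 6-7) and squared-error value loss (Eq. 8)."""
    ratio = torch.exp(log_probs - old_log_probs)                      # r_t(theta), Eq. (7)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()               # negated Eq. (6) for gradient descent
    value_loss = (values - returns).pow(2).mean()                     # Eq. (8)
    return policy_loss, value_loss
```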

The Global Policy module uses \(h_t\in [0,1]^{4 \times M \times M}\) as input. The first two channels of \(h_t\) contain the spatial map, the third channel encodes the current pose estimate of the agent, and the fourth channel marks the visited region. The \(h_t\) can be described by

$$\begin{aligned} \begin{aligned}&\forall i, j \in \left\{ 1,2,\ldots ,m\right\} : \\&h_t[c,i,j] = m_t[c,i,j] \quad \quad \quad \quad \quad \forall c \in \{0,1\} \\&h_t[2,i,j] = 1 \quad \quad \quad \quad \quad \text {if } i=x_t[0] \text { and } j=x_t[1] \\&h_t[3,i,j] = 1 \quad \quad \quad \quad \quad \text {if } i=x_k[0] \text { and } j=x_k[1] \text { for some } k \in \{1,2,\ldots ,t\} \end{aligned} \end{aligned}$$
(9)
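A minimal sketch of assembling this four-channel input from the predicted map, the pose estimate, and the visited mask is shown below; the grid indexing and helper name are assumptions made for illustration.

```python
import numpy as np

def build_global_input(m_t, pose_cell, visited_mask):
    """Assemble the 4-channel Global Policy input h_t (Eq. 9); a minimal sketch."""
    grid = m_t.shape[-1]
    h_t = np.zeros((4, grid, grid), dtype=np.float32)
    h_t[0:2] = m_t                            # channels 0-1: predicted spatial map
    h_t[2, pose_cell[0], pose_cell[1]] = 1.0  # channel 2: current pose estimate of the agent
    h_t[3] = visited_mask                     # channel 3: cells visited up to time t
    return h_t
```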

The CBAM module increases the weight of channels with unknown space before the Global Policy module. It gives higher priority to exploring unvisited regions and avoids revisiting explored areas as much as possible.

3.3 Path Planning

The agent translates long-term goals into short-term goals using the path planning approach. The path planner takes the long-term goal (\(g_t^l\)), the spatial obstacle map (\(m_t\)), and the pose estimate of the agent (\(x_t\)) as inputs and generates the short-term goal (\(g_t^s\)), i.e., \(g_t^s=\ f_{Plan}\left( g_t^l,\ m_t,\ x_t\right) \). The path planner uses the current environment map to compute the shortest path from the current location of the agent to the long-term goal with the FMM method. We choose the point 0.25 m along this path from the agent as the short-term goal. Unexplored areas are assumed to be open space.
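A minimal sketch of this step is shown below, using the scikit-fmm library to compute the geodesic distance field and a greedy descent to pick the point roughly 0.25 m along the path; treating unexplored cells as free space and the specific cell size are assumptions of the example.

```python
import numpy as np
import skfmm  # scikit-fmm implements the Fast Marching Method

def short_term_goal(obstacle_map, agent_cell, goal_cell, cell_size=0.05, step_m=0.25):
    """Pick a short-term goal ~0.25 m along the FMM shortest path toward the long-term goal."""
    phi = np.ones_like(obstacle_map, dtype=float)
    phi[goal_cell] = -1.0                                    # zero level set around the long-term goal
    phi = np.ma.MaskedArray(phi, mask=(obstacle_map > 0.5))  # obstacles block the marching front
    dist = skfmm.distance(phi, dx=cell_size).filled(np.inf)  # geodesic distance to the goal (meters)

    # Greedy descent on the distance field for roughly step_m meters.
    cur, traveled = tuple(agent_cell), 0.0
    while traveled < step_m and cur != tuple(goal_cell):
        r, c = cur
        neighbors = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr, dc) != (0, 0)
                     and 0 <= r + dr < dist.shape[0] and 0 <= c + dc < dist.shape[1]]
        nxt = min(neighbors, key=lambda rc: dist[rc])
        if dist[nxt] >= dist[cur]:
            break                                            # no further descent possible
        traveled += np.hypot(nxt[0] - r, nxt[1] - c) * cell_size
        cur = nxt
    return cur
```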

3.4 Local Policy

The Local Policy module transforms the coordinates of the short-term goal into a relative distance and angle from the position of the agent. It generates the sequence of actions the agent takes to navigate to the short-term goal according to the current observations. The Local Policy module takes the current RGB observation (\(s_t\)) and the short-term goal (\(g_t^s\)) as input, and outputs a navigation action \(a_t\ ={\ \pi }_L\left( s_t,{\ g}_t^s|\theta _L\right) \), where \(\theta _L\) is a parameter of the Local Policy module. The input RGB images are feature-optimized by the CBAM module, as in the SLAM module.
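A minimal sketch of this coordinate transformation is shown below; the pose convention (x, y, heading in radians) is an assumption made for the example.

```python
import numpy as np

def goal_to_polar(pose, goal):
    """Relative distance and heading from the agent pose (x, y, theta) to a short-term goal (x, y)."""
    dx, dy = goal[0] - pose[0], goal[1] - pose[1]
    rho = np.hypot(dx, dy)                      # relative distance
    phi = np.arctan2(dy, dx) - pose[2]          # relative heading
    phi = (phi + np.pi) % (2 * np.pi) - np.pi   # wrap to [-pi, pi]
    return rho, phi
```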

4 Methods

The goal of the attentional mechanism is to selectively focus on goals and ignore distractors as much as possible. In Neural Attention-based Visual SLAM, we use deep neural networks in a modular and hierarchical way. Meanwhile, we add various attention mechanisms into the deep neural network to adjust the different feature weights and optimize the performance.

4.1 Spatial Attention Mechanism in Mapping

In computer vision, it is not easy to finely identify objects in an image, because texture features, object clutter, lighting interference, and other factors can make identification more difficult. However, we do not need such detailed processing to construct an obstacle map: we only need to recognize the outlines of obstacles and walls, not detailed patterns or placed objects. The spatial attention module we use is shown in Fig. 4.

Fig. 4
figure 4

Spatial attention module. The input feature map is first subjected to maximum pooling and average pooling, followed by a \(7\times 7\) convolution to obtain a feature map that captures the position of the goal in the image. Finally, the module uses a sigmoid activation function to obtain the spatial attention feature map, which increases the weight of the main obstacles when processing RGB images

We focus on large pieces of furniture and walls, rather than the decorations and small household items in the room. The key to accurate mapping is to clearly identify the contours of large objects. We set up a weight mask layer for the input images based on Eq. (10), where \(\sigma \) is the sigmoid operation and \(7\times 7\) is the size of the convolution kernel. Spatial regions containing large objects receive higher weights.

$$\begin{aligned} \ F_S(X) = \sigma (f^{7 \times 7}(\text {AvgPool}(X), \text {MaxPool}(X))) = \sigma (f^{7 \times 7}(X_{\text {avg}}, X_{\text {max}})) \end{aligned}$$
(10)

Because spatial attention improves the accuracy of the obstacle map, it also helps the agent predict its location more accurately.
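A minimal PyTorch sketch of this spatial attention module (Eq. 10) is shown below; it follows the standard CBAM formulation, and the class name is our own for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of CBAM (Eq. 10): pool over channels, 7x7 convolution, sigmoid."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_pool = torch.mean(x, dim=1, keepdim=True)    # average over the channel dimension
        max_pool, _ = torch.max(x, dim=1, keepdim=True)  # max over the channel dimension
        attn = self.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                                  # reweight spatial positions
```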

4.2 Channel Attention Mechanism in Navigation

Each channel of a feature map carries different features. As shown in Fig. 5, the channel attention mechanism can assign different weights along the channel dimension.

Fig. 5
figure 5

Channel attention module. The input feature maps are processed through parallel maximum pooling and average pooling layers, and then through a shared MLP module, which differentiates the importance of different channels. The two outputs are summed element-wise, and the module applies a sigmoid activation function to obtain the channel attention feature map, which increases the weight of channels with unknown space

In the Global Policy module, we have four channel dimensions for the spatial map and the agent location. We reduce the weight of the mapping channels and increase the weight of the channels with unknown space during navigation. As shown in Eq. (11), where \(\sigma \) is the sigmoid operation, we identify the channels that carry key features for different tasks. We find that the improvement is more obvious when the interference of unimportant features is reduced.

$$\begin{aligned} \begin{aligned} F_C(X)&= \sigma (\text {MLP}(\text {AvgPool}(X)) + \text {MLP}(\text {MaxPool}(X))) \\&= \sigma (W_1W_0X_{\text {avg}} + W_1W_0X_{\text {max}}) \end{aligned} \end{aligned}$$
(11)
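A minimal PyTorch sketch of this channel attention module (Eq. 11), following the standard CBAM formulation, is shown below; the reduction ratio of 16 is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of CBAM (Eq. 11): parallel average/max pooling, shared MLP, sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP: W_1 (W_0 x)
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # MLP(AvgPool(X))
        mx = self.mlp(x.amax(dim=(2, 3)))               # MLP(MaxPool(X))
        attn = self.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * attn                                 # reweight channels
```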

4.3 CBAM Module in Active SLAM

As shown in Fig. 6, the CBAM module infers the attention map sequentially along two independent dimensions (channel and spatial). The CBAM module multiplies the attention map with the input feature map to perform adaptive feature optimization. The overhead of the module is negligible, so it can be integrated into the CNN architecture for joint training.

Fig. 6
figure 6

CBAM module. We combine the advantages of spatial attention in mapping and the advantages of channel attention in navigation. In our neural attention-based visual SLAM, we add the CBAM module to the network that uses RGB images as input in order to improve the accuracy of mapping and increase the relative coverage of navigation

Taking into account the size of the whole system, we choose to use a pre-trained ResNet18 network. To avoid changing the network structure of ResNet18, we do not insert the CBAM module inside its residual blocks. Instead, we add the CBAM module at the first convolutional layer and the last convolutional layer of the ResNet18 network. This allows us to reuse the pre-trained parameters of ResNet18 to accelerate convergence on the goal task. Table 1 shows the network structure of the ResNet18 network with the CBAM module.
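The sketch below illustrates one way such a wrapper could look in PyTorch, assuming the `SpatialAttention` and `ChannelAttention` modules sketched above are available; the exact placement and the torchvision weight identifier are assumptions made for the example.

```python
import torch.nn as nn
from torchvision.models import resnet18

class CBAM(nn.Module):
    """CBAM block: channel attention followed by spatial attention (see the sketches above)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_att = ChannelAttention(channels, reduction)  # from the Sect. 4.2 sketch
        self.spatial_att = SpatialAttention(kernel_size=7)        # from the Sect. 4.1 sketch

    def forward(self, x):
        return self.spatial_att(self.channel_att(x))

class ResNet18WithCBAM(nn.Module):
    """Hypothetical sketch: pre-trained ResNet18 with CBAM around the unchanged residual blocks."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")  # reuse pre-trained parameters
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool)
        self.cbam_in = CBAM(64)                       # after the first convolutional layer
        self.blocks = nn.Sequential(backbone.layer1, backbone.layer2,
                                    backbone.layer3, backbone.layer4)
        self.cbam_out = CBAM(512)                     # after the last convolutional layer

    def forward(self, x):
        x = self.cbam_in(self.stem(x))
        return self.cbam_out(self.blocks(x))
```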

Table 1 Neural network

The Mapper module in the SLAM module consists of 2 CBAM modules, 1 ResNet18 convolutional layer, 2 fully connected layers with a dropout of 0.5, and 3 deconvolutional layers. We use 3 convolutional layers and 3 fully connected layers for the pose estimator. The Global Policy module uses 5 convolutional layers and 3 fully connected layers. The Local Policy module consists of 2 CBAM modules, 1 ResNet18 convolutional layer, 1 fully connected layer, and a recurrent GRU (Gated Recurrent Unit) layer. The network structure of the Mapper module is shown in Fig. 7.

Fig. 7
figure 7

The network structure of the Mapper module. We use the CBAM module for feature optimization of the input and output to improve the weight of obstacles in the image. This improves the accuracy of the predicted map and preserves semantic information for pose estimation

5 Results

5.1 Experimental Setup

Datasets. We use the Gibson datasets to evaluate our models. The Gibson datasets are 3D reconstructions of real-world environments captured with different camera rigs, and are widely adopted as navigation benchmarks. They contain 572 buildings composed of 1447 floors and cover a total area of 211,000 m\(^2\). Each building has a certain level of clutter and navigation complexity, as shown in Fig. 8. The agent encounters problems such as obstacle occlusion and lighting variations while exploring, just as in a real environment. These scenes provide significant diversity in size and complexity, allowing us to evaluate the efficiency and performance of our model in environments of different sizes.

Fig. 8
figure 8

The Gibson datasets. The Gibson dataset renders 3D scenes of various styles. The scenes contain multiple rooms with varying lighting conditions and object complexity

We use a subset of the Gibson dataset (indoor scenes) for the point-goal navigation task in our experiments. We train models on the Gibson training set, which contains 72 scenes consisting mostly of indoor scenes, and validate models on the Gibson Val set, which contains 14 different indoor scenes. We train the model on specific datasets and test it on unseen scenarios. This simulates agents performing active SLAM tasks in unseen environments and helps us evaluate the generalization ability of the model.

Implementation. We deploy the Habitat simulator platform on an Alibaba Cloud server with 4 NVIDIA V100 GPUs. The operating system is Ubuntu 18.04 Server Edition. We use the sensor and actuation noise models in Habitat to simulate the real world. Habitat-Sim is used to process the Gibson datasets to render scenes and simulate robot interaction. Meanwhile, we use Habitat-Lab to define the navigation tasks, train the map matching models, and evaluate benchmark performance with standard metrics.

We conduct experiments with a camera frame size of \(128\times 128\) for the Local Policy and \(256\times 256\) for the Global Policy. We use Adam optimization with a learning rate of 0.0001 for the Local Policy and SLAM module, and \(2.5 \times 10^{-5}\) for the Global Policy. We use the SLAM module with a batch size of 72 and a memory size of 300000. The experience loss coefficient is set to 1.0, the projection loss coefficient is set to 1.0, and the pose loss coefficient is set to 10000.0. The value loss coefficient is set to 0.5, and the entropy coefficient is set to 0.001. We enable the pose estimator instead of Generalized Advantage Estimation (GAE). These specific parameter configurations played a crucial role in shaping the learning behavior and overall performance of the agent.
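For reference, these settings could be collected into a single configuration, as in the hypothetical sketch below; the key names are our own and do not correspond to any particular codebase.

```python
# Hypothetical training configuration mirroring the settings reported above.
config = dict(
    frame_size_local=(128, 128),
    frame_size_global=(256, 256),
    optimizer="Adam",
    lr_local=1e-4,            # Local Policy and SLAM module
    lr_global=2.5e-5,         # Global Policy
    slam_batch_size=72,
    slam_memory_size=300_000,
    exp_loss_coeff=1.0,
    proj_loss_coeff=1.0,
    pose_loss_coeff=1e4,
    value_loss_coeff=0.5,
    entropy_coeff=0.001,
    use_pose_estimator=True,  # used instead of Generalized Advantage Estimation (GAE)
)
```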

Baseline. In active SLAM, perception is used not only for map construction and localization, but also for proactively choosing the next action to maximize some goal. Traditional SLAM approaches such as ORB-SLAM, which rely only on sensor data, may not provide enough information to guide these proactive decisions. In addition, traditional SLAM approaches typically divide perception, mapping, and localization into separate steps, making it difficult to achieve real-time, tightly integrated proactive decision making. Therefore, we use some end-to-end Reinforcement Learning methods and Active Neural SLAM as baselines.

RL + 3LConv: The model described by Savva et al., which uses a 3-layer convolutional network with a GRU as the RL policy.

RL + Res18: The model uses a pre-trained ResNet18 network with a GRU as the RL policy.

RL + Res18 + AuxDepth: This baseline is proposed by Mirowski et al. The model uses the convolutional layers from ResNet18 with a deconvolutional layer for depth prediction, and adds 3 convolutional layers and a GRU.

RL + Res18 + ProjDepth: This baseline is proposed by Chen et al. The model projects the predicted depth before passing to the GRU based on the architecture of RL + Res18 + AuxDepth.

Active Neural SLAM: We implement the Active Neural SLAM model as our comparative method. The model exploits the strengths of classical and learning-based navigation methods. Our evaluation shows that our re-trained model is almost identical to the pre-trained model released by Devendra et al.

Our Models (NAVS): We also report the results of our Neural Attention-based Visual SLAM with channel attention (NAVSca) only, with spatial attention (NAVSsa) only, and with both attention techniques (NAVS).

The baseline models and our Neural Attention-based Visual SLAM (NAVS) algorithms are implemented in the Habitat simulator with the same datasets in our experiments.

5.2 Evaluation Metrics

We use the summary measure SPL to report the average absolute coverage (Cov) and relative coverage (% Cov). The SPL is shown in Eq. (12), where N is the number of episodes, \(l_i\) is the shortest-path distance from the starting position of the agent to the goal, \(p_i\) is the length of the path actually taken by the agent in this episode, and \(S_i\) is a binary success indicator. The value of SPL equals the average relative coverage, and the value of \(p_i\) equals the average absolute coverage.

$$\begin{aligned} \begin{aligned} SPL = \frac{1}{N}\sum _{i=1}^{N}{S_i\frac{l_i}{\max {\left( p_i,l_i\right) }}} \end{aligned} \end{aligned}$$
(12)
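A minimal sketch of computing this metric over a set of episodes is shown below.

```python
import numpy as np

def spl(successes, shortest_dists, path_lengths):
    """Success weighted by Path Length (Eq. 12) averaged over N episodes."""
    s = np.asarray(successes, dtype=float)       # binary success indicators S_i
    l = np.asarray(shortest_dists, dtype=float)  # shortest-path distances l_i
    p = np.asarray(path_lengths, dtype=float)    # actually traveled path lengths p_i
    return float(np.mean(s * l / np.maximum(p, l)))
```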

We calculate the relative improvement of the relative coverage in the medium term as the average relative difference. The rate of relative improvement is given by Eq. (13), where i indexes timesteps, N is the number of timesteps considered, RC is the relative coverage, and \(\Delta RC\) is the average change in relative coverage.

$$\begin{aligned} \begin{aligned} \Delta RC = \frac{1}{N}\sum _{i=1}^{N} \frac{RC_{\text {NAVS}} - RC_{\text {ANS}}}{RC_{\text {ANS}}} \end{aligned} \end{aligned}$$
(13)
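The corresponding computation of the average relative improvement (Eq. 13) can be sketched as follows.

```python
import numpy as np

def relative_improvement(rc_navs, rc_ans):
    """Average relative improvement of NAVS over ANS in relative coverage (Eq. 13)."""
    rc_navs = np.asarray(rc_navs, dtype=float)
    rc_ans = np.asarray(rc_ans, dtype=float)
    return float(np.mean((rc_navs - rc_ans) / rc_ans))
```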

5.3 Experimental Evaluation

We measure the performance of the model in map matching, navigation, and operation, and provide detailed comparisons of the data and visualized maps. These experiments offer a more comprehensive evaluation of the NAVS model and further demonstrate its strengths and limitations.

5.3.1 The Performance of Matching the Map

In Fig. 9, we observe the predicted maps of ANS, NAVSca and NAVSsa at about 900 timesteps. NAVSca is the model that adds only channel attention. The exploration paths of NAVS overlap much less than those of ANS when the agent is in an explored area. This is because the channel attention marks the channels that contain location information. When generating goal points, the agent gives preference to unexplored areas. This means that goals do not appear frequently in the same area. As a result, the agent avoids visiting the explored areas as much as possible.

Fig. 9
figure 9

Visualization map of ANS, NAVSca and NAVSsa. In the blue circle, the exploration routes of agent in NAVS overlap much less than in ANS, and the map of NAVSsa is more accurate than ANS. (Color figure online)

NAVSsa is the model that adds only spatial attention. The performance of NAVSsa is more stable and accurate than that of ANS, as shown in Fig. 9. The trajectory result shows that our model can match the real map while traversing less space, because the spatial attention in NAVSsa increases the weight of large obstacles. This means that the agent relies almost exclusively on large obstacles for mapping and pose estimation. As a result, the agent can build accurate maps without interference from decorations.

The experiment uses the setup of a point goal navigation task in Habitat. It is difficult to quantitatively assess the accuracy of the predicted maps because Habitat-Sim cannot provide the agent with information about the category and location of objects in the environment. An obstacle predicted at an incorrect location that happens to coincide with a real obstacle will still be counted as correct in this task, even though it is actually offset. Nevertheless, the superior performance of NAVSsa allows us to clearly see the gap between it and ANS, and the mapping accuracy of NAVSsa is about 80%.

Fig. 10
figure 10

Visualization map of ANS and NAVS after 600 timesteps. In the blue circle, the predicted map of ANS has much error and the predicted map of NAVS is more accurate. (Color figure online)

Figure 10 shows the predicted maps of NAVS compared with ANS at about 900 timesteps. NAVS adds channel attention and spatial attention as the CBAM module. In the figure, ANS shows significant mapping and pose errors as the cumulative error increases. The predicted maps and pose estimates of the agent deviate from the real data. Meanwhile, the agent explores the same area several times during exploration. These are the problems we want to solve in active SLAM. The visualization shows the performance improvements due to the CBAM module in NAVS. Compared to ANS, the pose estimate of the agent is closer to the real position and the predicted map has fewer errors. In the same area, the agent avoids visiting the explored areas as much as possible. As a result, the agent can explore more area while traversing less space and build accurate maps.

We run only the Global Policy module and the SLAM module, so the agent does not move to the goal point but keeps turning in place. We compare the predicted maps of ANS and NAVS over 1000 timesteps of in-place rotation in Fig. 11a and b.

Fig. 11
figure 11

Predicted maps of ANS and NAVS rotated in place. ANS does not recognize the object within the red circle, while NAVS recognizes it

The scanning range of the agent at a fixed position is the same given the same camera parameters. Under identical conditions, NAVS can better extract feature information from the highly weighted obstacle images and effectively build accurate predicted maps. This means that the NAVS agent can explore more area faster and more efficiently in large environments. As a result, NAVS matches the map better than ANS in active SLAM.

5.3.2 The Performance of Navigation

Table 2 The results of all models

In Table 2, we report the average results of all models over 14 different unknown scenes of the Gibson Val dataset, for a total of 994 episodes. As a baseline, ANS achieves an average absolute and relative coverage of 32.701 m\(^2\)/0.948. NAVSca and NAVSsa achieve an average absolute and relative coverage of 31.542 m\(^2\)/0.943 and 31.653 m\(^2\)/0.948, respectively. NAVSca and NAVSsa fall noticeably short of ANS in large environments and are about the same as ANS in small environments. NAVS achieves an average absolute and relative coverage of 31.981 m\(^2\)/0.953. In relative coverage, NAVS achieves an improvement of 0.5% over ANS in overall environments and an improvement of 1.1% over ANS in large environments.

Fig. 12
figure 12

Relative coverage growth of ANS, NAVSca, NAVSsa and NAVS

In Fig. 12, we plot the change of relative coverage (% Cov) in different Gibson Val scenarios as the timestep progresses. We find that NAVSca has the highest growth rate of relative coverage. However, NAVSca does not achieve the best relative coverage because the errors in its predicted map affect the exploration. On the whole, the relative coverage of NAVS is the highest among all models. NAVS has a higher relative coverage than ANS at all times and achieves a relative improvement of about 20% over ANS between episode lengths of 300 and 600 in large environments.

Fig. 13
figure 13

Box plots of all models. The box plots show the median, interquartile range (IQR) as a box, whiskers indicating the data range, and outliers in the data distribution

We make box plots to observe the distribution of the coverage of each model, as shown in Fig. 13a and b. There are a total of 274 episodes in the large scenes, and 710 episodes in the small scenes.

The box consists of a rectangle and two whiskers. The lower and upper boundaries of the rectangle indicate the first quartile (Q1) and the third quartile (Q3), respectively, while the height of the rectangle indicates the interquartile range (IQR) between these two quartiles. Inside the box, a horizontal line segment indicates the median of the data (Q2). Above and below the box, whiskers extend to the maximum and minimum values. Outside the whiskers, outliers that deviate from the regular distribution are shown as separate small circles.

As can be seen in Fig. 13a, NAVS has a smaller box height. This indicates that the distribution of results for NAVS is more stable and concentrated in many experiments. We can determine that the exploration results of NAVS are better than those of ANS by comparing the median line, the lower boundary of the box, and the minimum value. In Fig. 13b, all models perform well in small scenes. However, we notice that the number of outliers with SPL below 0.7 is much lower for NAVS than for ANS. This means that NAVS is more robust compared to ANS.

5.3.3 The Performance of Operation

In Table 3, we count the runtime and FPS (Frames Per Second) of all models to complete running 71 episodes in the Gibson Val datasets from the log files. The runtime of ANS is 5 h, 34 min and 20 s at 49 FPS. The runtime of NAVSca is 6 h, 26 min and 52 s at 42 FPS, and the runtime of NAVSsa is 1 h, 54 min and 35 s at 144 FPS.

Table 3 The operation of all models

We consider that the channel attention mechanism marks different weights for the same channel at different stages. In mapping, it marks a high weight on map information and a low weight on pose information. In navigation, it is the opposite. That leads to a mismatch of channel information during running. As a result, NAVSca takes longer to run than ANS. On the other hand, NAVSsa improves the weight of obstacles and drastically reduces the amount of data to be processed in the input. The agent can perform accurate mapping and pose estimation faster using the processed RGB images. As a result, NAVSsa can complete the task at higher FPS and in less runtime.

In Table 3, the runtime of NAVS is 5 h, 34 min and 7 s. The runtime of NAVS is about 1 h less than that of NAVSca, even though the mismatch of channel weights still exists. This means that we can mark important goal information in mismatched channels to reduce the runtime. It is clear that even with mismatched channels, NAVS still operates at the efficiency level of ANS.

As shown in Fig. 14a and b, we examine the log files and plot the comparison of rewards in the test phase. In active SLAM, the models aim to maximize the explored area of the environment as the reward for the Global Policy.

Fig. 14
figure 14

The comparison of rewards. Global step mean rew refers to the average reward value across all steps, and is a measure of overall model performance. In reinforcement learning, a model typically completes a series of episodes by interacting with the environment and receiving a reward value. Global eps mean rew refers to the average reward value across all episodes

In Fig. 14a, all models obtain higher rewards in the early stages. However, NAVS maintains the highest reward acquisition as the timesteps increase. This means that each step of NAVS explores the largest area after a large number of areas have already been explored. NAVS obtains the highest and most consistent rewards compared to ANS across all episodes, as shown in Fig. 14b. This shows that NAVS achieves the largest explored area across all episodes.

The above experiments show that the CBAM module, which incorporates multiple attention mechanisms, achieves a more stable overall improvement for active SLAM models. We consider that simplifying and clarifying the object goals in images may be more important for active SLAM.

6 Conclusion

Navigation in unknown 3D environments suffers from large deviations in the predicted maps and low positioning accuracy. In this paper, we propose a modular and hierarchical navigation model that combines a classical navigation policy, a deep reinforcement learning approach, and attention mechanisms. This makes the agent focus more on identifying large furniture during mapping and more on pose estimation during navigation. Experiments show that our model outperforms previous methods: the predicted map of the agent matches the contours of the room with fewer prediction errors, and the trajectory has few overlapping segments, which improves the exploration rate of the room. The model can also be studied further for use in realistic home environments. In the future, we will work on extending the model to more complex semantic tasks, such as semantic goal navigation.