Introduction

Autonomous exploration, which couples sensing with decision-making, requires sensor-equipped robots to actively navigate through unknown environments in order to fully perceive and reconstruct them. In recent years, this robotic exploration task has received extensive attention in academic research and has been applied in many domains, including, for example, search and rescue missions [1]; intelligence, surveillance, and reconnaissance [2, 51]; and geological exploration [3,4,5, 52, 53]. Deploying mobile robots in such complicated contexts depends on a robust and efficient exploration strategy.

Several situations require robotic autonomy when no map or model of the environment is available in advance. Intelligent exploration in unknown and complex environments is crucial because it allows unsupervised preparation for unspecified downstream tasks (e.g. reconstruction, rescue, and reconnaissance) in new environments. Robots that can explore are flexible to deploy, since they can use prior knowledge from previously seen settings to quickly gather useful information in a new one without relying on humans.

In the past few decades, various human-designed exploration methods have been proposed, such as the well-known frontier-based exploration [6] and artificial potential fields [7]. However, manually specifying an appropriate, resilient exploration strategy is hard in complex real-world settings. The reason is that human-designed exploration techniques usually make strong assumptions about the environments and the tasks, which may restrict their adaptability to complex environments and thus limit their use in real-world practice.

In that context, the employment of Deep Reinforcement Learning (DRL) for robotic exploration tasks is being increasingly investigated [8], driven by the idea of letting the robots automatically learn skills from interactions with the environment instead of receiving explicit instructions. It is worth mentioning that DRL does not require dataset labeling, which represents a large cost for supervised learning solutions. DRL methods directly learn a mapping from observations to actions through trial-and-error interactions with their workspace. Model-based DRL methods use learning and a model of the environment’s transitions to approximate a global value or policy function. However, the environments tend to be too complex to be precisely modeled. In contrast, model-free DRL methods learn the policy or value directly through trial-and-error with the physical or simulated systems.

However, applying DRL to real robots is not straightforward: it faces problems such as low sampling efficiency, the curse of dimensionality, reward shaping, and awkward sim-to-real transfer [8]. Previous DRL-based exploration strategies [9, 10] usually set the entire current reconstruction as the action space and thus suffer from low sampling efficiency. Moreover, as the exploration range gradually expands with lidar sensing, these methods are hampered by awkward transfer from simulation to reality and by the dimensional curse of the observation and action spaces. Because their exploration goals are mostly regressed and may be far from the frontiers, they often lead to incomplete exploration. Here, a frontier is defined as the boundary where free space meets unmapped space.

To alleviate the above problems, we propose a novel model-free DRL-based method based on (1) Voronoi Domain Conversion (VDC), (2) Generalized Voronoi Networks (GVN) and (3) Invalid Action Masking (IAM) to learn a lidar-based exploration strategy in a trial-and-error manner. Firstly, the VDC is achieved by utilizing the Generalized Voronoi Diagram (GVD) [11] to incrementally obtain a high-dimensional Topological Environmental Representation (TER) from a historic scanned area, which preserves the main structural clues of the environment with minimal feature dimensions. The VDC is mainly proposed to improve the sampling efficiency of the action space and further break the dimensional curse.

Then, the GVN with spatial awareness and episodic memory is designed to learn smart exploration strategies with the TER as input. A Graph Convolutional Network (GCN) [48] based encoder is leveraged to capture the spatial features from the TER. Episodic memory is another part of the GVN, which empowers robots with temporal awareness for active exploration reasoning. The historic features encoded by the spatial encoder at different time steps in the past are collected as spatial memory fragments and stored in a memory buffer. These spatial memory fragments are used to enhance exploration strategy learning and reduce the number of trials and errors. A Temporal Convolutional Network (TCN) [49] is utilized to comprehensively manage the time-series episodic memory.

Although works based on variants of Recurrent Neural Networks (RNNs) [50] can perform well on time dependency problems, they have trouble learning over long sequences [12, 13]. In the context of exploring partially observable environments, we propose to utilize the topological map-like memory (i.e. our TER), which associates a robot’s lidar sensing with position information and summarizes the entire map as the robot executes a task. Empirical studies [14, 15] demonstrate that map-like memory outperforms RNNs. However, the storage consumption of map-like memory is related to the scale of exploration environments, which limits the lightweight and scalability of exploration strategies. This is the reason why we store the high-dimensional features extracted by the spatial encoder as episodic memory instead of storing the TER directly.

Furthermore, we propose to leverage IAM [16] to reshape the configuration space of exploration tasks to further cope with the low sampling efficiency and the explosion of action space caused by the expansion of the sensing range. Specifically, the selection of candidate exploration goals is restricted to the frontier nodes with large information gain in the TER, which ensures faster and more complete exploration. An illustration of the exploration process is shown in Fig. 1. Overall, our main contributions include:

Fig. 1 a–d Illustrate the exploration result, the exploration process, the shape of the high-dimensional TER, and the bird's-eye view of the simulated environment, respectively. e, f Illustrate the 3D structure of a realistic simulation scene

(1) We propose a VDC method to obtain an efficient, high-dimensional TER. A GVN with spatial awareness and episodic memory is then designed for lidar-based smart exploration strategy learning.

(2) We propose an IAM method to cope with the low sampling efficiency and the explosion of the action space caused by the expansion of the sensing range during DRL.

(3) We introduce a carefully designed reward function for learning exploration policies, and verify the superiority and feasibility of our approach through extensive simulation comparisons and real-robot experiments.

The remainder of the paper is structured as follows: “Related work” introduces related work. “Problem statement” defines the lidar-based autonomous exploration problem. “Episodic memory enhanced autonomous exploration with VDC and IAM” specifically describes our exploration strategy based on VDC, GVN, and IAM, followed by simulation comparisons, ablation studies, and mobile robot experiments in “Simulation analysis and experimental verification”. Finally, we summarize the work in “Conclusion and future work”.

Related work

Classic human-designed exploration strategies

Human-designed autonomous exploration strategies classically fall into two categories, i.e., frontier-based [6, 17,18,19,20,21] and sampling-based [22, 23] methods.

Frontier-based methods first detect the frontiers in the map and extract the Frontier Points (FPs) as candidate exploration goals. The original idea of frontier-based exploration is introduced in [6], where the nearest frontier is always greedily chosen as the exploration goal. Subsequently, by employing local and global Rapidly-exploring Random Trees (RRTs) to detect the frontiers, Umari et al. [17] propose a new exploration strategy to achieve efficient exploration. By incrementally constructing and maintaining a graph structure during exploration, a novel path planning algorithm is proposed by Wang et al. [18]. Meanwhile, a utility function that integrates path cost and information gain is introduced in [18] for selecting the next best frontier.

Recently, Cavinato et al. [19] formulate a novel frontier-based exploration strategy capable of explicitly handling dynamic obstacles, thus leading to complete and reliable exploration outcomes in populated environments. Yu et al. [20] propose an exploration strategy based on Multi-target Potential Field (MMPF), which can eliminate the goal’s back-and-forth changes and overlapping trajectories. It is worth mentioning that MMPF is suitable for multi-robot collaborative exploration. In addition, Julia et al. [22] present a frontier-based hybrid approach to the multi-robot integrated exploration problem.

In the most advanced sampling-based method, Zhu et al. [21] divide autonomous exploration into an exploration phase and a relocation phase. A local RRT is grown in the free space of the workspace during the exploration phase. On this basis, the relocation phase explicitly moves the robot to different sub-areas so as to maximize information gain. Unlike the 2D lidar-based exploration strategy proposed in [21], Zhang et al. [23] propose a novel depth camera-based informative exploration strategy with uniform sampling to efficiently reduce the unknown volume of 3D environments and provide accurate reconstruction for quadrotors. However, although sampling-based methods are probabilistically complete, they take a long time to sample exploration goals before achieving complete exploration. In our work, we propose an exploration strategy that combines the advantages of DRL-based and frontier-based approaches for efficient and complete exploration.

Learning-based exploration strategies

Learning-based exploration strategies are mainly divided into two categories: map prediction-based [24,25,26] and DRL-based [27,28,29,30] methods. To remedy the lack of prior knowledge and enhance the exploration, Zhong et al. [24] and Shrestha et al. [25] propose similar methods to predict map occupancy by using generative neural networks. Saroya et al. [26] propose a method for learning subtle structural cues, which may be exploited by a decision-maker to improve the speed of exploration.

The works in [27] and [28] combine classical frontier-based exploration with DRL to empower robots to autonomously explore unknown cluttered environments. The difference is that a hierarchical approach is proposed in [28] for both effective exploration and collision-free navigation in crowded environments. Zhu et al. [29] propose an exploration architecture that integrates a DRL model, a Next Best View (NBV) selection approach, and a structural integrity measurement to improve exploration performance. Li et al. [30] construct an exploration framework by decomposing the exploration process into decision, planning, and mapping modules. They then propose a DRL-based decision-making algorithm that uses a deep neural network to learn an exploration strategy from the partial map.

For cutting-edge learning-based approaches, Chen et al. [31] combine GCN with DRL to achieve exploratory decision-making under uncertainty. Lee et al. [32] propose the Extendable Navigation Network (ENN) to encode the partially observed high-dimensional indoor Euclidean space to a sparse graph representation. Then, the robot’s motion is generated by a learned Q-network whose input is the ENN. Xu et al. [33] propose a benchmark for autonomous exploration tasks of wheeled robots, which inspired our work. More comprehensive insights into DRL-based autonomous exploration are summarized in [8].

However, with the expansion of the exploration scale, most of the above methods fall into the dilemma of configuration space explosion. In our work, GVD-based TER and IAM are used to alleviate such problems for faster and higher-quality exploration. Then a GVN with spatial awareness and episodic memory is designed for lidar-based smart exploration strategy learning.

Topological map-based robotic navigation

Classic iterative learning control is a high-performance control scheme that is widely applied to point-to-point motion tasks (e.g. robotic navigation) in existing work [56,57,58]. In recent years, topological maps have been employed to break the dimensional curse of the search space while preserving the main structural clues of the workspace, which provides many novel schemes for robotic navigation. By performing multiple start-to-goal navigation tasks on a topological roadmap, Tsang et al. [34] propose a method that captures information from past executions to learn a motion policy for handling obstacles that the robot has seen before. Francis et al. [35] combine a probabilistic roadmap-based planner with RL methods to address long-term navigation problems in indoor contexts.

To obtain a more refined topological map representation, Hou et al. [36] propose a novel approach to the automatic generation of the indoor hierarchical topological map based on GVD, which considerably reduces human intervention in robotic navigation. Since partial planning algorithms are prone to fall into traps in complex environments, resulting in the failure of motion planning, Chi et al. [37] propose a feature tree algorithm based on GVD to generate heuristic paths to guide partial motion planning.

The works in [38, 39] employ the GVD to explicitly guide sampling-based path planning algorithms for optimal robot navigation. The difference is that the method in [38] employs multiple potential functions, which provide a nonuniform sampling distribution for the path planner. In [39], the GVD is adopted to build an equidistant roadmap and rapidly generate an initial collision-free solution. A secure tunnel is then established via the minimum distance from the obstacles to the initial solution to concentrate subsequent sampling. Chi et al. [40] present a GVD-based heuristic path planning algorithm to generate a heuristic path and guide the sampling process of RRT to address the trap-space problem. Benefiting from the GVD, the algorithm proposed in [40] can automatically identify environmental features and provide a reasonable heuristic path.

In our work, the GVD-based high-dimensional TER is proposed to achieve domain conversion, which breaks the dimensional curse from action space and observation space without losing the structural properties of the workspace.

Problem statement

The autonomous exploration problem we consider involves a nonholonomic mobile robot equipped with a 2D lidar with a limited sensing range. The robot chooses exploration goals and navigates continuously in a closed environment \(M \in \mathcal {F}^{w \times h}\), incrementally sensing and constructing an estimated map \(\hat{M} \in \mathcal {F}^{w \times h}\), in which \(\mathcal {F} = \{ F_{\text {o}}: {\text {occupied}}, F_{\text {f}}: {\text {free}}, F_{\text {u}}: {\text {unknown}} \}\) is the set of possible cell states. w and h are the maximum width and height of the map, respectively. Unlike previous fine-grained grid map-based exploration decision-making [7, 17, 27], we employ a domain conversion \(T: \mathcal {F}^{w \times h} \times \mathbb {R}^2 \rightarrow \mathcal {G}\) from Euclidean space to an undirected topological graph to maintain high-dimensional spatial clues and mitigate the explosion of the configuration space:

$$\begin{aligned} G = T(\hat{M}, x), \end{aligned}$$
(1)

where \(G = (\mathcal {V}, X, \mathcal {E}, E)\) consists of \(\vert \mathcal {V}\vert \) nodes and \(\vert \mathcal {E}\vert \) edges. In particular, \(\mathcal {V} = \{ v_1, \ldots , v_{\vert \mathcal {V}\vert } \}\) is the set of nodes, \(X \in \mathbb {R}^{\vert \mathcal {V}\vert \times d_{N}}\) is the collection of node features, \(\mathcal {E} = \{ e_1, \ldots , e_{\vert \mathcal {E}\vert } \}\) is the set of edges, and \(E \in \mathbb {R}^{\vert \mathcal {E}\vert \times d_{E}}\) is the collection of edge features. \(d_N\) and \(d_E\) denote the dimensions of the node features and the edge features, respectively. \(x \in \mathbb {R}^2 \) indicates the position of the robot.

We model autonomous exploration as a Partially Observable Markov Decision Process (POMDP) in the RL framework [42], which is represented as a tuple \((\mathcal {S}, \mathcal {O}, \mathcal {A}, {\Lambda }, \mathcal {T}, \mathcal {R}, {\rho _{0}}, {\gamma }, {L_{{\text {exp}}}})\) with state space \(\mathcal {S}\), observation space \(\mathcal {O}\), action space \(\mathcal {A}\), state-conditioned observation distribution \({\Lambda }\), transition distribution \(\mathcal {T}\), reward function \(\mathcal {R}\), initial state distribution \({\rho _{0}}\), discount factor \({\gamma }\), and finite episode length \(L_{{\text {exp}}}\). When a new exploration episode begins, the sensor-equipped robot is spawned at an initial state \(s_0 = (\hat{G}_0, x_0) \sim {\rho _{0}}\) in a previously unknown environment. At time t, the agent at state \(s_t = (\hat{G}_t, x_t)\) receives an observation \(o_t \sim {\Lambda }(\cdot \vert s_t)\), executes an exploration goal action \(a_t \sim \pi _\theta (\hat{s}_t)\), receives a reward \(r_t \sim \mathcal {R}(\cdot \vert s_t,a_t,s_{t+1})\), and reaches state \(s_{t+1} \sim \mathcal {T}(\cdot \vert s_t,a_t)\). \(\pi _\theta \) denotes the robot's exploration policy.

The goal is to learn an optimal exploration policy \(\pi ^{*}_\theta \) that maximizes the expected cumulative sum of the discounted rewards over an exploration episode:

$$\begin{aligned} \pi ^{*}_\theta = \mathop {\arg \max }\limits _{\pi _\theta }\mathbb {E}_{\tau \sim \pi _\theta }\left[ \sum _{t=0}^{L_{{\text {exp}}}}\gamma ^{t}r_t\right] , \end{aligned}$$
(2)

where \(\tau \) denotes an exploration trajectory \(\{(s_t, a_t, r_t)\}^{L_{{\text {exp}}}}_{t=0}\), i.e., a sequence of \((s_t, a_t, r_t)\) tuples generated by starting at \(s_0\) and acting according to policy \(\pi _\theta \) at each time step. The reward function \(\mathcal {R}\) captures the method-specific incentive for exploration. The detailed domain conversion mechanism, policy learning method, and reward shaping are described in the following sections.

Episodic memory enhanced autonomous exploration with VDC and IAM

In this section, GVD-based domain conversion is first introduced to extract the main structural features of the workspace and deal with the ambiguity caused by the expansion of the exploration scale. Then, the spatio-temporal GVN for policy learning and the IAM for configuration space reshaping are described in detail. Finally, the navigation pipeline for autonomous exploration and the reward function for guiding policy learning are presented at the end of this section.

GVD-based domain conversion

Assume that the robot operates in a 2-dimensional workspace \(\mathcal {W}\) populated by convex obstacles \(C_1, C_2, \ldots , C_i\). Nonconvex obstacles can be modeled as unions of convex obstacles, and the boundary of \(\mathcal {W}\) is treated as a collection of convex sets that also belong to the obstacle set. The distance \(d_i(x)\) between a point x and an obstacle \(C_i\) is defined as:

$$\begin{aligned} d_i(x) = \mathop {\min }\limits _{c_0 \in C_i}\vert \vert x-c_0 \vert \vert , \end{aligned}$$
(3)

and the gradient of \(d_i(x)\) is denoted as:

$$\begin{aligned} \nabla d_i(x) = \frac{x - c_0}{\vert \vert x - c_0 \vert \vert }, \end{aligned}$$
(4)

where \(\nabla d_i(x)\) is a unit vector pointing from \({c_0}\) towards x, and \({c_0}\) is the nearest point to x in \(C_i\). Typically, there are multiple obstacles in the environment. As shown in Fig. 2a, equidistant points are found among the obstacles, and the shortest distance from a point to the obstacles is stored as \(D(x) = \mathop {\min }\limits _{i}d_i(x)\). The set of points equidistant from two obstacles \(C_i\) and \(C_j\), on which the generalized Voronoi edges lie, is called a two-equidistant face and is defined as:

$$\begin{aligned} F_{ij} = \{x \in \mathcal {W}: d_i(x) = d_j(x)\ {\text {and}}\ \nabla d_i(x) \ne \nabla d_j(x)\}. \end{aligned}$$
(5)

As shown in Fig. 2b, the intersection of the two-equidistant faces \(F_{ij}\), \(F_{ik}\), and \(F_{jk}\) is called a three-equidistant face, which is defined as:

$$\begin{aligned} F_{ijk} = F_{ij} \cap F_{ik} \cap F_{jk} = \{ x \in \mathcal {W}: d_i(x) = d_j(x) = d_k(x)\ {\text {and}}\ \nabla d_i(x) \ne \nabla d_j(x) \ne \nabla d_k(x) \}. \end{aligned}$$
(6)

In the 2D case, \(F_{ijk}\) could be a point, called the generalized Voronoi vertex. Accordingly, the GVD consists of a set of generalized Voronoi edges and vertices. A well-connected roadmap can be obtained after initializing the environment map by the GVD method. In practice, the construction of a roadmap is mainly based on the detected environmental information and mathematical analysis, where the lidar is adopted for mobile robot environment perception. Unlike previous applications [37,38,39], in our work, the GVD is obtained via an incremental construction procedure to reduce computational cost.
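To make the construction concrete, the sketch below (a minimal illustration, assuming each obstacle is approximated by a finite set of sampled boundary points and using NumPy) shows how \(d_i(x)\), \(\nabla d_i(x)\), and membership in a two-equidistant face from Eqs. (3)–(5) could be evaluated; it is not the incremental GVD construction used in our implementation.

```python
import numpy as np

def distance_and_gradient(x, obstacle_points):
    """d_i(x) (Eq. 3) and its gradient (Eq. 4) w.r.t. one obstacle C_i,
    here approximated by a finite set of boundary points."""
    x = np.asarray(x, dtype=float)
    pts = np.asarray(obstacle_points, dtype=float)
    dists = np.linalg.norm(pts - x, axis=1)
    c0 = pts[np.argmin(dists)]                          # nearest point of C_i to x
    grad = (x - c0) / (np.linalg.norm(x - c0) + 1e-9)   # unit vector from c0 towards x
    return dists.min(), grad

def in_two_equidistant_face(x, C_i, C_j, tol=1e-3):
    """Membership test for the two-equidistant face F_ij of Eq. (5)."""
    d_i, g_i = distance_and_gradient(x, C_i)
    d_j, g_j = distance_and_gradient(x, C_j)
    return abs(d_i - d_j) < tol and not np.allclose(g_i, g_j, atol=tol)
```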

Fig. 2 Principle of the GVD. a Set of points equidistant from obstacles \(\{ C_i \}\) and \(\{ C_j \}\). Grey areas represent obstacles and white areas are free. b GVD graph composed of generalized Voronoi edges and vertices

For the implementation of the VDC, we detect frontiers in the incrementally constructed map \(\hat{M}\); these frontiers serve as candidate exploration goals. The information gain \(G_{{\text {info}}}\) at a frontier f measures the amount of free space left to explore beyond f, i.e., navigable cells that are unexplored in the current partial map. To calculate \(G_{{\text {info}}}\), we first group the unexplored free-space cells into connected components \(\Omega = \{\omega _1, \ldots ,\omega _n\}\) using the connected-component labeling (CCL) method in OpenCV, and then associate each connected component \(\omega _i\) with the map frontiers. A component \(\omega \) is associated with frontier f only if at least one pixel in \(\omega \) is an 8-connected neighbor of some pixel in f. For each frontier f, we can then calculate the information gain \(G_{{\text {info}}}\) as the sum of the areas of the connected components associated with f, normalized by the total free space on the complete map.
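The following sketch outlines one way to compute \(G_{{\text {info}}}\) with OpenCV's connected-component labeling; the occupancy encoding, the use of the complete layout for the normalization, and the frontier representation are assumptions made for illustration.

```python
import cv2
import numpy as np

# Assumed grid encoding: 0 = free, 100 = occupied, -1 = unknown
def information_gain(partial_map, gt_map, frontiers):
    """G_info for each frontier: area of unexplored free-space components
    8-adjacent to the frontier, normalized by the total free space."""
    unexplored_free = np.logical_and(gt_map == 0, partial_map == -1).astype(np.uint8)
    _, labels = cv2.connectedComponents(unexplored_free, connectivity=8)
    total_free = float(np.count_nonzero(gt_map == 0))

    gains = []
    for f in frontiers:                      # f: (N, 2) array of (row, col) frontier pixels
        comp_ids = set()
        for r, c in f:                       # components 8-adjacent to some frontier pixel
            patch = labels[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            comp_ids.update(int(k) for k in np.unique(patch) if k > 0)
        area = sum(np.count_nonzero(labels == k) for k in comp_ids)
        gains.append(area / total_free)
    return gains
```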

By connecting the frontiers and the robot’s position with the GVD on the principle of the shortest distance, we obtain the final TER. The connections between all nodes are collision-free, and the above operations formulate the domain conversion \(G = T(\hat{M}, x)\). The edge feature of the topological graph is represented as the Euclidean distance between nodes. The node feature of the topological graph is represented as:

$$\begin{aligned} S = [p_x, p_y, G_{{\text {info}}}, \phi _{{\text {diff}}}, D, {\text {flag}}], \end{aligned}$$
(7)

where \(p_x\) and \(p_y\) denote the current position of the robot and \(G_{{\text {info}}}\) denotes the information gain of the node. D and \(\phi _{{\text {diff}}}\) denote the Euclidean distance and the relative orientation between the current robot pose and each node in the graph, respectively. flag is an indicator variable that is 0 for the node at the current robot pose, 1 for frontier nodes, and \(-1\) for all other nodes. To alleviate reward sparsity in DRL, we assign a discounted information gain to nodes whose indicator variable is \(-1\), according to the distance from these nodes to the frontiers. The distance threshold for assigning information gain is determined by \(D_{{\text {gain}}}\).
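A hedged sketch of how the node feature of Eq. (7) could be assembled is shown below; the dictionary keys, the angle convention, and the discount applied to non-frontier nodes within \(D_{{\text {gain}}}\) of a frontier are illustrative assumptions.

```python
import numpy as np

def node_feature(node, robot_pose, d_gain=5.0, gain_discount=0.5):
    """Per-node feature S = [p_x, p_y, G_info, phi_diff, D, flag] of Eq. (7)."""
    px, py, yaw = robot_pose
    nx, ny = node['pos']
    D = np.hypot(nx - px, ny - py)
    phi = (np.arctan2(ny - py, nx - px) - yaw + np.pi) % (2 * np.pi) - np.pi
    if node.get('is_robot', False):
        flag, gain = 0, 0.0
    elif node.get('is_frontier', False):
        flag, gain = 1, node['gain']
    else:
        flag = -1
        # discounted information gain for nodes close to a frontier, else zero
        near = node.get('dist_to_frontier', np.inf) < d_gain
        gain = gain_discount * node.get('nearest_frontier_gain', 0.0) if near else 0.0
    return np.array([px, py, gain, phi, D, flag], dtype=np.float32)
```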

Generalized Voronoi network configuration

As shown in Fig. 3, our GVN mainly consists of two components: a spatial feature encoder \(E_{\text {S}}\) and a temporal feature encoder \(E_{\text {T}}\). \(E_{\text {S}}\) is designed to encode the structural clues of the TER G. As GCNs have achieved great success in processing graph-structured data, a GCN is employed to handle spatial feature interaction between the nodes in the TER. Specifically, a 3-layer GCN with residual connections is utilized to form the spatial encoder. The node features at layer l are denoted as \(G^{(l)} = [g^{(l)}_1, \ldots ,g^{(l)}_N ]^T\); the node aggregation at layer l is then calculated as:

$$\begin{aligned} G^{(l+1)}=\sigma _{\text {S}}(D^{-\frac{1}{2}}\hat{A}D^{-\frac{1}{2}}G^{(l)}W^l_{\text {S}}), \quad l=0,1,2, \end{aligned}$$
(8)

where \(\hat{A} = A + I\) and I denotes the identity matrix. A and D denote the adjacency matrix and the degree matrix of the topological graph, respectively, which are utilized to maintain the topological properties of the TER. \(W^l_{\text {S}}\) denotes the learnable weight matrix of layer l and \(\sigma _{\text {S}}(\cdot )\) denotes the ReLU function.
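A minimal PyTorch sketch of such a spatial encoder is given below; the hidden width and the use of the degree of \(\hat{A}\) for the symmetric normalization (the standard GCN choice) are assumptions.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """3-layer GCN with residual connections implementing Eq. (8) (sketch)."""

    def __init__(self, d_in, d_h=64):
        super().__init__()
        dims = [d_in, d_h, d_h, d_h]
        self.weights = nn.ModuleList(
            [nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(3)])

    def forward(self, X, A):
        # X: (|V|, d_N) node features, A: (|V|, |V|) adjacency matrix of the TER
        A_hat = A + torch.eye(A.size(0), device=A.device)      # \hat{A} = A + I
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))    # D^{-1/2}
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt
        G = X
        for W in self.weights:
            H = torch.relu(A_norm @ W(G))                      # Eq. (8)
            G = H + G if H.shape == G.shape else H             # residual connection
        return G
```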

Robotic autonomous exploration is a long sequential decision-making task that requires the continuous selection of exploration goals. To handle dependencies over longer time horizons, episodic memory is realized with a buffer \(\mathcal {M}\) that the robot reads from and writes to at each time step. To trade off storage consumption and convergence speed, we employ a sliding window of length \(\epsilon \), which results in a fixed-length memory buffer. If the buffer is full, the oldest memory fragment is deleted when a new memory fragment is added. \(\mathcal {M}\) stores the memory fragments of the previous \(\epsilon \) time steps, where each memory fragment consists of a high-dimensional spatial feature encoded by \(E_{\text {S}}\) and a frontier feature, as shown in Fig. 3.
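The sliding-window buffer itself amounts to a fixed-length queue, e.g. the following minimal sketch.

```python
from collections import deque

class EpisodicMemory:
    """Fixed-length buffer M: a sliding window of the last epsilon fragments,
    each pairing a spatial feature from E_S with a frontier feature."""

    def __init__(self, epsilon=10):
        self.buffer = deque(maxlen=epsilon)   # oldest fragment dropped when full

    def write(self, spatial_feature, frontier_feature):
        self.buffer.append((spatial_feature, frontier_feature))

    def read(self):
        return list(self.buffer)              # m_0, ..., m_{t-1} for the temporal encoder
```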

A multi-layer TCN is utilized to comprehensively manage the time-series episodic memory, and the input of the TCN is \(\mathcal {M} = \{ m_0, m_1, \ldots , m_{t-1} \}\). Note that a Transformer [54] could also benefit temporal feature fusion through its self-attention module; however, since we aim to deploy the exploration strategy on the robot's embedded controller, we only employ a multi-layer TCN in this work to keep the model simple and reduce computational cost. We denote the t-th memory fragment in \(\mathcal {M}\) as \(m_t\). The update of \(m^{(l)}_t\) at layer l of the TCN is represented as:

$$\begin{aligned} m^{(l+1)}_t = \mathcal {F}_{d_l}(m^{(l)})\vert _{m^{(l)}_t} = \sum _{k=1}^K \eta _{k} m^{(l)}_{t-(K-k)d_l}, \end{aligned}$$
(9)

where K is the one-dimensional convolution kernel size, \(d_l\) is the dilation factor controlling the skip connections of layer l, and \(\eta _1, \ldots , \eta _K\) are the learnable parameters of the convolution kernel.
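One layer of such a dilated causal convolution can be sketched in PyTorch as follows; the channel count, kernel size, and ReLU placement are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvLayer(nn.Module):
    """One dilated causal 1-D convolution of the TCN, matching Eq. (9)."""

    def __init__(self, channels=64, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, m):
        # m: (batch, channels, time) sequence of memory fragments m_0, ..., m_{t-1}
        m = F.pad(m, (self.pad, 0))                        # pad only on the past side
        return torch.relu(self.conv(m))
```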

Fig. 3 Architecture of the Generalized Voronoi Network. \(f_1, f_2, \ldots , f_n\) represent frontiers in the TERs \(G_0, G_1, \ldots , G_t\) at different time steps. R and \(f_{{\text {cur}}}\) represent the robot's position and the selected exploration goal at different time steps, respectively. At time step t, \(E_{\text {S}}\) is used to encode the TER \(G_t\) (the blue pipeline). Before time step t, the spatial features extracted by \(E_{\text {S}}\) are inserted into a memory buffer \(\mathcal {M}\) as partial episodic memory. \(\mathcal {M}\) stores the memory fragments of the previous \(\epsilon \) time steps, where each memory fragment consists of a high-dimensional spatial feature encoded by \(E_{\text {S}}\) and a frontier feature (the green pipeline). The temporal feature encoder \(E_{\text {T}}\) is used to manage the time-series episodic memory. An attention layer is utilized to fuse the spatio-temporal features and to retrieve information related to the current state from \(\mathcal {M}\). An aggregator is used to integrate the current spatial features and the corresponding query on exploratory memory. The actor head and the critic head are both realized by MLPs

An attention-based query operation is utilized to fuse the spatio-temporal features and to retrieve information related to the current state from long-term memory. The output of the attention layer can be represented as:

$$\begin{aligned} \hat{h}^\mathcal {M}_t = \alpha _1 *r^G_t + \alpha _2 *h_t^\mathcal {M}, \end{aligned}$$
(10)

where

$$\begin{aligned} \alpha _1 = \frac{{\text {exp}}(W r^G_t + b)}{{\text {exp}}(W r^G_t + b) + {\text {exp}}(W h_t^\mathcal {M} + b)}, \end{aligned}$$
(11)

and the calculation of \(\alpha _2\) is similar to (11). \(r^G_t\) and \(h_t^\mathcal {M}\) denote the outputs of \(E_{\text {S}}\) and \(E_{\text {T}}\), respectively. \({\text {exp}}(\cdot )\) is the exponential function, and W and b are the parameters of the attention mechanism. Finally, an aggregator is used to integrate the current spatial features and the corresponding query on the exploratory memory:

$$\begin{aligned} z_t = {\text {MLP}}({\text {cat}}(\hat{h}^\mathcal {M}_t, r^G_t)), \end{aligned}$$
(12)

where \({\text {MLP}}\) represents a multi-layer perceptron followed by a ReLU activation. We model the policy as a diagonal Gaussian distribution, i.e. \(\pi _{\theta }(a_t\vert s_t) = \mathcal {N}(\mu , \sigma ^2)\). As shown in Fig. 3, our model employs the Actor-Critic architecture [44]. The distribution's mean \(\mu \) and log-standard-deviation \(\log \sigma \) are learned by the actor head. The critic head estimates the state value function \(V(s_t) = \mathbb {E}_{\pi _{\theta }}\left[ \sum _{k=t}^{L_{{\text {exp}}}}\gamma ^{k-t} r_k\right] \) of the current policy. The actor head and the critic head are both realized by MLPs.
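The following sketch combines the attention-based fusion of Eqs. (10)–(12) with the actor and critic heads; the hidden sizes, the action dimensionality, and the single shared scoring layer (W, b) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionActorCritic(nn.Module):
    """Attention fusion (Eqs. 10-12) followed by actor and critic heads (sketch)."""

    def __init__(self, d_feat=64, d_h=128, action_dim=2):
        super().__init__()
        self.score = nn.Linear(d_feat, 1)                   # shared W, b of Eq. (11)
        self.aggregator = nn.Sequential(nn.Linear(2 * d_feat, d_h), nn.ReLU())
        self.actor = nn.Linear(d_h, 2 * action_dim)         # mean and log-std
        self.critic = nn.Linear(d_h, 1)                     # state value V(s_t)

    def forward(self, r_G, h_M):
        # r_G, h_M: (batch, d_feat) outputs of E_S and E_T
        scores = torch.cat([self.score(r_G), self.score(h_M)], dim=1)
        w = torch.softmax(scores, dim=1)                    # alpha_1, alpha_2 of Eq. (11)
        fused = w[:, :1] * r_G + w[:, 1:] * h_M             # Eq. (10)
        z = self.aggregator(torch.cat([fused, r_G], dim=1)) # Eq. (12)
        mu, log_std = self.actor(z).chunk(2, dim=1)
        return mu, log_std, self.critic(z)
```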

Configuration space reshaping using invalid action masking

Previous DRL-based exploration strategies [9, 10] usually set the entire current map \(\hat{M}\) as the action space \(\mathcal {A}\) and the observation space \(\mathcal {O}\). Their candidate exploration goals are mostly regressed and may not lay on the frontiers, thus resulting in low sampling efficiency and incomplete map construction. Due to the curse of dimensionality, \(\mathcal {A}\) and \(\mathcal {O}\) gradually expand with lidar sensing and mapping, which makes the problem even more ambiguous. To alleviate the above problems, we employ IAM to restrict actions (the selection of exploration goals) to frontiers and their neighbors in the TER.

Classic policy gradient algorithms for DRL use the following policy gradient estimate of the objective J:

$$\begin{aligned} \begin{aligned} g_{{\text {policy}}}&= \mathbb {E}_\tau [\nabla _\theta {\text {log}} \pi _\theta (a_\tau \vert s_\tau )R_\tau ] \\&= \mathbb {E}_\tau \left[ \nabla _\theta \sum ^{T-1}_{t=0} {\text {log}}\pi _\theta (a_t\vert s_t)R_t\right] , \end{aligned} \end{aligned}$$
(13)

where \(R_t = \sum _{k=0}^{T_{{\text {exp}}}-t} \gamma ^k r_{t+k}\) denotes the discounted return following time step t. The expected discounted reward is maximized by gradient ascent \(\theta = \theta + \nabla _\theta J\). Specifically, we treat nodes far away from frontiers as invalid actions with no information gain. Then, IAM is utilized to avoid sampling invalid actions by masking out the logits corresponding to the invalid actions. This is usually accomplished by replacing the logits of the actions to be masked in (13) with a large negative number N (e.g. \(N = -1 \times 10^8\)) [16]. Given state \(s \in \mathcal {S}\), the IAM process is treated as a differentiable function \({\text {inv}}_s\) applied to the logits l(s) output by policy \(\pi _\theta \). The IAM process is defined as:

$$\begin{aligned} \begin{aligned} \pi _{\theta }^{'}(\cdot \vert s_t) = {\text {softmax}}({\text {inv}}_s(l(s))), \end{aligned} \end{aligned}$$
(14)

where

$$\begin{aligned} {\text {inv}}_s(l(s))_i=\left\{ \begin{aligned} l_i,&\ {\text {if}}\ a_i\ {\text {is}}\ {\text {valid}}\ {\text {in}}\ s, \\ N,&\ {\text {otherwise}}. \end{aligned} \right. \end{aligned}$$
(15)

The IAM process does more than just renormalize the probability distribution; it in fact makes the gradient corresponding to the logits of the invalid actions zero [16]. In other words, taking invalid actions does not contribute to maximizing the cumulative reward, and therefore the robot tends not to take them.
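As a minimal sketch (variable names are hypothetical), the masking of Eqs. (14)–(15) reduces to overwriting the logits of non-frontier nodes before the softmax:

```python
import torch

def masked_policy(logits, valid_mask, big_negative=-1e8):
    """Invalid action masking (Eqs. 14-15): logits of invalid actions are replaced
    by a large negative number, so both their probability and gradient vanish."""
    masked = torch.where(valid_mask, logits, torch.full_like(logits, big_negative))
    return torch.softmax(masked, dim=-1)

# Example: only nodes 1 and 3 are frontiers with information gain (valid actions)
logits = torch.tensor([0.2, 1.5, -0.3, 0.7])
valid = torch.tensor([False, True, False, True])
print(masked_policy(logits, valid))   # probability mass only on nodes 1 and 3
```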

Navigation pipeline and reward function

For robotic navigation during autonomous exploration, path planning and mapping are implemented based on \(move\_base\) and Gmapping, respectively. It is worth mentioning that our TER, constructed and maintained in real time, can provide safe and efficient motion paths for the robot. Once the robot navigates to a position whose distance to the current exploration goal is less than a threshold \(\zeta \), the decision-making mechanism is triggered to select the next exploration goal.
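As a small illustration of this trigger (variable names are hypothetical), the decision-making mechanism fires as soon as the robot is within \(\zeta \) of the current goal:

```python
import math

def need_new_goal(robot_xy, goal_xy, zeta=1.0):
    """True once the robot is closer than zeta (1 m in our simulations) to the goal,
    i.e. the next exploration goal should be selected."""
    return math.hypot(robot_xy[0] - goal_xy[0], robot_xy[1] - goal_xy[1]) < zeta
```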

To encourage robots to better trade off map quality and exploration efficiency, we design the following reward function:

$$\begin{aligned} \begin{aligned} R = \beta _1 G_{{\text {info}}}^t - \beta _2 L_{{\text {path}}}^t + \beta _3 r_{{\text {Acc}}}^t - \beta _4 N_T, \end{aligned} \end{aligned}$$
(16)

where \(G_{{\text {info}}}^t\) and \(L_{{\text {path}}}^t\) represent the total information gain and total path length accumulated up to time step t, respectively, and \(N_T\) represents the total number of time steps elapsed up to time step t. These terms encourage the robot to achieve maximum exploration coverage with as little path length and time consumption as possible. \(r_{{\text {Acc}}}^t\) is utilized to reward or penalize the effect of the currently selected action on map accuracy:

$$\begin{aligned} \begin{aligned} r_{{\text {Acc}}}^t = {\text {Accuracy}}(\hat{M}_t, \widetilde{M}) - {\text {Accuracy}}(\hat{M}_{t-1}, \widetilde{M}), \end{aligned} \end{aligned}$$
(17)

where

$$\begin{aligned} \begin{aligned} {\text {Accuracy}}(\hat{M}, \widetilde{M}) = \sum ^{w \times h}_{i=1}\varvec{1}[\hat{M}_i = \widetilde{M}_i], \end{aligned} \end{aligned}$$
(18)

where \(\varvec{1}[\hat{M}_i = \widetilde{M}_i]\) is an indicator function that returns one if \(\hat{M}_i = \widetilde{M}_i\) and zero otherwise. \(\widetilde{M}\) represents the ground-truth layout of the workspace. Additionally, this reward provides a better learning signal while training under noisy conditions by accounting for mapping errors arising from noisy pose and noisy sensing [43].
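A sketch of the reward of Eqs. (16)–(18) is given below, assuming the estimated and ground-truth maps share the same integer grid encoding; the weights follow the simulation settings reported later.

```python
import numpy as np

def exploration_reward(g_info, path_len, n_steps, map_t, map_prev, gt_map,
                       betas=(1.0, 0.6, 0.5, 0.3)):
    """Reward of Eq. (16); the accuracy bonus follows Eqs. (17)-(18)."""
    def accuracy(est, gt):
        return float(np.count_nonzero(est == gt))                  # Eq. (18)
    r_acc = accuracy(map_t, gt_map) - accuracy(map_prev, gt_map)   # Eq. (17)
    b1, b2, b3, b4 = betas
    return b1 * g_info - b2 * path_len + b3 * r_acc - b4 * n_steps
```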

Fig. 4 Illustrations for the augmented data set for policy learning. The data set contains typical basic scenarios and their combinations such as loop, narrow corridor, multiple rooms, etc.

Simulation analysis and experimental verification

Simulation setup and implementation details

We evaluate the proposed exploration strategy in multiple simulation scenarios on Ubuntu 16.04 using ROS Kinetic and the Gazebo simulator. The robot model used in the simulations is the Turtlebot3 Waffle, which is equipped with a 2D lidar sensor with a sensing distance of 6 m to capture the distance information of the simulated scenarios in real time. The maximum linear velocity and angular velocity of the mobile robot are 0.5 m/s and 1.8 rad/s, respectively. The capacity of the scenario memory buffer is set to \(\epsilon = 10\). The \(\beta _1\), \(\beta _2\), \(\beta _3\), and \(\beta _4\) in Eq. (16) are set to 1.0, 0.6, 0.5, and 0.3, respectively. The decision frequency for autonomous exploration is uniformly set to 5 Hz for all methods. All simulations run on a computer with a Core i7-9700K CPU at 4.9 GHz and 16 GB of memory.

We employ the scenarios published in [33] for the learning and evaluation of different exploration strategies, which contain typical basic scenarios and their combinations such as loops, narrow corridors, multiple rooms, etc. To increase the complexity of the data set, several large-scale environments are further added, as shown in Fig. 4. We select nine scenes for policy learning (Fig. 4) and three for evaluation (Fig. 6). For the evaluation of the different methods, we consider three challenging synthetic scenes of increasing complexity and scale.

(1) Scene 1, Loop with Corridor: As shown in Fig. 6a, a loopback scene with a scale of 24 \(\times \) 24 \(\times \) 3 m\(^3\). This type of scene often results in inevitable back-and-forth movements.

(2) Scene 2, Maze: As shown in Fig. 6b, a hand-crafted maze-shaped scene with a scale of 20 \(\times \) 20 \(\times \) 3 m\(^3\). This type of scene often requires exploration strategies to deal with dead-end dilemmas during the exploration process.

(3) Scene 3, Office: As shown in Fig. 6c, a hand-crafted large-scale scene with a scale of 40 \(\times \) 25 \(\times \) 3 m\(^3\), consisting of many rooms. This type of scene often results in incomplete exploration and inefficient back-and-forth movement.

We compare the proposed method with multiple baselines that use various types of exploration strategies. The considered baselines and our methods are as follows:

  • Cost [6]: This approach detects frontiers in real-time in the background and always selects the nearest frontier during exploration.

  • Sample [17]: This method detects frontiers utilizing the RRT and selects the highest-scoring frontier for exploration utilizing a carefully designed utility function.

  • Potential Field [7]: This approach builds the potential field from the surroundings and selects the frontier with the largest potential gradient for exploration.

  • Naive Exp [27]: This technique deploys a DRL-based policy to determine the global goal and then the nearest frontier to the global goal is selected. This baseline simulation emulates the work in [27], employing CNNs to extract features from the original grid map. In keeping with [27], we also adopt the Actor-Critic [44] architecture for reinforcement learning.

  • GVN (ours): This method employs GVN and IAM for exploratory decision-making. The maximum side length of the GVD is set to 2 m in our method. The parameters \(D_{\text {gain}}\) and \(\zeta \) during the exploration simulations are set to 5 m and 1 m, respectively.

For the DRL-based strategies, the policies are trained using Proximal Policy Optimization (PPO) [45], and the training procedure builds upon the open-source PPO implementation provided by the Stable Baselines framework [46]. We train the exploration policies with 16 rollout processes on a computer with a Core i7-9700K CPU at 4.9 GHz and 16 GB of memory.

To quantify the above methods, we follow the evaluation indicators from previous work [23, 33, 43] to measure both map quality and exploration efficiency:

(1) IoU [43]: the intersection over union between the global map built during exploration and the ground-truth layout of the environment, which indicates the accuracy of the constructed map.

(2) Exploration Ratio (ER) [23]: the ratio of the area in the global map built during exploration (both free and occupied) to the ground-truth layout of the environment, which indicates the completeness of the constructed map.

(3) Total Exploration Time (ET) [33]: the total time spent until the end of the exploration, which indicates the time efficiency of completing the exploration task.

(4) Total Path Length (PL) [23]: the total path length traveled until the end of the exploration, which indicates, to some extent, the energy consumption of completing the exploration task. In our exploration missions, longer paths mean more energy consumption.

Table 1 Statistics of comparisons with baselines

Among them, the first two metrics measure the quality of the created map, while the latter two indicate exploration efficiency. In our exploration missions, a higher IoU and ER achieved in less ET or a shorter PL is better.
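For reference, the two map-quality metrics could be computed roughly as in the sketch below; the grid encoding and the exact IoU definition (here over occupied cells) are assumptions, and [43] and [23] give the precise definitions.

```python
import numpy as np

# Assumed grid encoding: 0 = free, 100 = occupied, -1 = unknown
def map_metrics(est_map, gt_map):
    """IoU of the built map against the ground-truth layout, and the
    exploration ratio of mapped area to ground-truth area."""
    est_occ, gt_occ = est_map == 100, gt_map == 100
    iou = np.count_nonzero(est_occ & gt_occ) / max(np.count_nonzero(est_occ | gt_occ), 1)
    er = np.count_nonzero(est_map != -1) / np.count_nonzero(gt_map != -1)
    return iou, er
```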

Simulation comparisons

According to the simulation settings in “Simulation setup and implementation details”, each exploration strategy is used to conduct 10 exploration simulations in each evaluation environment. An exploration mission ends when the scene is fully explored or the exploration terminates unexpectedly. Unexpected termination of exploration refers to the robot lingering in a local area for an extended period of time. According to the evaluation metrics in “Simulation setup and implementation details”, Table 1 and Fig. 5 show the quantitative analysis of simulation comparisons.

In most scenarios, an \(80\%\) exploration ratio indicates that the robot has acquired the essential topological information, including the overall terrain structure and connectivity. From this viewpoint, Fig. 5 illustrates that our method improves exploration speed by 7.0–\(11.1\%\) over the state-of-the-art Potential Field based method in [7]. When the exploration ratio reaches \(98.5\%\), we consider a previously unknown environment to be fully explored. Under this criterion, Fig. 5 illustrates that our method improves exploration speed by 4.0–\(16.2\%\) over the Potential Field based method in [7]. Naive Exp [33] exhibits performance comparable to Cost [6] in small-scale environments, and its relative performance gradually decreases as the scene scale increases. The performance of the classic Cost and Sample [17] methods is overall inferior to the Potential Field based method [7] and to our method.

The superiority of our method in terms of map quality and exploration efficiency is shown in Table 1. Specifically, our method results in the shortest average exploration time and a competitive exploration ratio. Benefiting from the map accuracy-based rewards and penalties mentioned in “Navigation pipeline and reward function”, our method also exhibits competitive IoU metrics. In terms of total path length, our method is dominant in small-scale scenarios and slightly inferior to the classical Cost [6] in large-scale scenarios. The reason for this may be that Cost always selects the nearest frontier for exploration, resulting in the lowest distance consumption; a similar phenomenon is also observed in the simulation tests of [47]. Figure 6 shows the qualitative performance of our method in the three evaluation environments. The smooth fluorescent exploration trajectories in Fig. 6 indicate smooth motion with little jitter. A more specific kinematic analysis is presented in “Mobile robot experiments”.

Fig. 5 a–c Illustrate the trend of exploration ratio over time for different methods in three different evaluation scenarios, respectively, including mean and variance fluctuations over 10 experiments. These figures further highlight the exploration time consumption at \(80\%\) and \(98.5\%\) exploration ratios, respectively

Fig. 6 a–c Illustrate the exploration performance of our method in three different evaluation scenarios, respectively. Fluorescent green marks the smooth motion trajectories of the robots. In the middle layer of the figures, the frontiers are marked in blue

Table 2 Ablation studies for invalid action masking

Ablation studies for invalid action masking

To examine the empirical importance of invalid action masking, we compare the following four strategies to handle invalid actions.

  • Without IAM: All nodes of a GVD can be selected as exploration goals for the robot and invalid action masking is not considered.

  • Naive IAM: At each time step t, the robot receives the same mask on the node features of the GVD as described for invalid action masking. The action is still sampled according to invalid action masking, but the gradient is updated according to Eq. (13), that is, no differentiable function \({\text {inv}}_s\) is applied to the logits. We call this implementation naive invalid action masking because its gradient does not replace the gradient of the logits corresponding to invalid actions with zero.

  • Masking Removed: At each time step t, the robot receives the same mask on the node features of a GVD as described for invalid action masking, and trains in the same way as the robot trained under invalid action masking. However, we then evaluate the robot without providing the mask. In other words, in this scenario, we evaluate what happens if we train with a mask, but then perform without it.

  • IAM: At each time step t, the robot receives a mask on the node features of a GVD such that only frontiers (valid actions) with information gain can be selected as exploration goals.

Fig. 7 a–c Illustrate the exploration process of our method in the mobile robot experiment. The motion trajectories for three different strategies are shown in c: blue, ours; orange, MMPF; and green, Cost. d Illustrates the experimental ground and the robot platform

Table 3 Statistics of mobile robot experiments

The above four strategies are each employed to conduct 5 exploration simulations in Scene 1 and Scene 2, respectively. Under the premise of exploring as completely as possible, we focus on the contribution of IAM to map quality and exploration efficiency. The quantitative results of the ablation studies for IAM are shown in Table 2, which shows that IAM scales well in both scenes, while the case Without IAM performs the worst. IAM leads to the shortest exploration time and total path length while guaranteeing a high IoU metric. It is worth noting that Masking Removed is still able to perform reasonably well, which shows that policies trained with IAM can still produce useful behaviors when the mask can no longer be provided. In addition, the performance of Naive IAM is second only to IAM, reflecting the necessity of masking invalid actions to improve the performance of exploration strategies.

Mobile robot experiments

To verify the effectiveness of our method, multiple mobile robot experiments are conducted in a complex real-world environment. As shown in Fig. 7, a lobby with a size of 17.5 \(\times \) 4.0 m is used as the experimental ground. The main onboard computer and the low-level control board of the robot are an NVIDIA Jetson AGX Xavier and an OpenCR, respectively. The robot is equipped with a 360\(^\circ \) lidar sensor with a sensing distance of 5 m. The maximum linear velocity and angular velocity of the mobile robot are 0.18 m/s and 1.8 rad/s, respectively.

By deploying different exploration strategies to the same mobile robot, we mainly compare the performance of GVN with the state-of-the-art exploration method proposed in [7]. Each exploration strategy is used to conduct 5 experiments in the real scenario. The decision frequency for autonomous exploration using different methods is uniformly set to 5 Hz. The maximum side length of the GVD in our method is set to 2 m. Each round of GVD update in the experiment only takes tens of milliseconds.

Fig. 8 An illustration of the trend of exploration ratio over time in the mobile robot experiments

Fig. 9 a–c Illustrate the kinematic metrics during exploration using three different methods in the mobile robot experiments, respectively. a GVN, b Potential Field, and c Cost

Taking Cost [6] as a reference, the quantitative analysis of the experiments is shown in Table 3 and Fig. 8. Figure 8 shows the trend of exploration ratio over time, confirming the advantage of our method in terms of exploration speed. The specific experimental results in Table 3 confirm the advantages of our method in terms of map quality and exploration efficiency. Specifically, our exploration strategy leads to shorter exploration time and total path length while ensuring the highest IoU and exploration ratio. The qualitative analysis of the experiments is presented in Figs. 7 and 9. Figure 7a–c illustrate the exploration process of our method. The motion trajectories for three different strategies are shown in Fig. 7c, where our strategy results in the shortest and smoothest trajectory. Figure 9 illustrates the kinematic metrics during exploration using three different methods, respectively. With the same robot configuration and decision frequency, our method results in the smallest velocity and acceleration fluctuations, which reflects that our exploration strategy reduces the frequency of back-and-forth movements and wandering.

Conclusion and future work

In this work, we propose a model-free DRL-based smart exploration strategy built on VDC, GVN, and IAM to sense and map the workspace efficiently. Our lidar-based exploration strategy is free from human intervention and task-related assumptions, learning the policy by interacting with the environment in a trial-and-error manner. We adopt GVD-based domain conversion to obtain a high-dimensional TER, overcoming the dimensional curse of the action and observation spaces. The GVN with spatial awareness and episodic memory is utilized to learn exploration strategies online. Additionally, IAM is used to reshape the configuration space and cope with low sampling efficiency for faster and more complete exploration.

Extensive baseline tests and comparative simulations confirm the feasibility and superiority of our method, and by deploying our method on a mobile robot we alleviate the awkward transfer from simulation to reality. However, the proposed method still faces some limitations. For example, the ability of our model to cope with dynamically changing environments is unknown. In addition, considering the complexity of 3D point cloud processing, the construction of the TER may become computationally expensive when extended to a 3D workspace. In future work, we will investigate these two aspects, refine the existing work, and try to extend the proposed method to 3D exploration tasks and crowded environments with dynamic pedestrians.