Introduction

Autonomous exploration, which couples sensing with decision-making, requires sensor-equipped robots to actively navigate through unknown environments in order to fully perceive and reconstruct them. In recent years, this robotic exploration task has received extensive attention in academic research and has been applied in many domains, including, for example, search and rescue missions [1]; intelligence, surveillance, and reconnaissance [2, 51]; and geological exploration [3,4,5, 52, 53]. Deploying mobile robots in such complicated contexts depends on a robust and efficient exploration strategy.

Several situations require robotic autonomy when no map or model of the environment is available in advance. Intelligent exploration in unknown and complex environments is crucial because it allows unsupervised preparation for unspecified downstream tasks (e.g. reconstruction, rescue, and reconnaissance) in new environments. Robots that can explore are flexible to deploy, since they can use prior knowledge from previously seen settings to quickly gather useful information in a new one without relying on humans.

In the past few decades, various human-designed exploration methods have been proposed, such as the well-known frontier-based exploration [6] and artificial potential fields [7]. However, manually specifying an appropriate, resilient exploration strategy is hard in complex real-world settings. The reason is that human-designed exploration techniques usually make strong assumptions about the environments and the tasks, which may restrict their adaptability to complex environments and thus limit their use in real-world practice.

In that context, the employment of Deep Reinforcement Learning (DRL) for robotic exploration tasks is being increasingly investigated [8], driven by the idea of letting the robots automatically learn skills from interactions with the environment instead of receiving explicit instructions. It is worth mentioning that DRL does not require dataset labeling, which represents a large cost for supervised learning solutions. DRL methods directly learn a mapping from observations to actions through trial-and-error interactions with their workspace. Model-based DRL methods use learning and a model of the environment’s transitions to approximate a global value or policy function. However, the environments tend to be too complex to be precisely modeled. In contrast, model-free DRL methods learn the policy or value directly through trial-and-error with the physical or simulated systems.

However, applying DRL to real robots is not straightforward: it faces problems such as low sampling efficiency, the curse of dimensionality, reward shaping, and awkward sim-to-real transfer [8]. Previous DRL-based exploration strategies [9, 10] usually set the entire current reconstruction as the action space and thus suffer from low sampling efficiency. Moreover, as the exploration range gradually expands with lidar sensing, these methods are hampered by awkward transfer from simulation to reality and by the dimensional curse of the observation and action spaces. Because their exploration goals are mostly regressed and may be far from the frontiers, they often lead to incomplete exploration. Here, a frontier is defined as the boundary where free space meets unmapped space.

To alleviate the above problems, we propose a novel model-free DRL-based method based on (1) Voronoi Domain Conversion (VDC), (2) Generalized Voronoi Networks (GVN) and (3) Invalid Action Masking (IAM) to learn a lidar-based exploration strategy in a trial-and-error manner. Firstly, the VDC is achieved by utilizing the Generalized Voronoi Diagram (GVD) [11] to incrementally obtain a high-dimensional Topological Environmental Representation (TER) from a historic scanned area, which preserves the main structural clues of the environment with minimal feature dimensions. The VDC is mainly proposed to improve the sampling efficiency of the action space and further break the dimensional curse.

Then, the GVN with spatial awareness and episodic memory is designed to learn smart exploration strategies with the TER as input. A Graph Convolutional Network (GCN) [48] based encoder is leveraged to capture the spatial features from the TER. Episodic memory is another part of the GVN, which empowers robots with temporal awareness for active exploration reasoning. The historic features encoded by the spatial encoder at different time steps in the past are collected as spatial memory fragments and stored in a memory buffer. These spatial memory fragments are used to enhance exploration strategy learning and reduce the number of trials and errors. A Temporal Convolutional Network (TCN) [49] is utilized to comprehensively manage the time-series episodic memory.

Although works based on variants of Recurrent Neural Networks (RNNs) [50] can perform well on time dependency problems, they have trouble learning over long sequences [12, 13]. In the context of exploring partially observable environments, we propose to utilize the topological map-like memory (i.e. our TER), which associates a robot’s lidar sensing with position information and summarizes the entire map as the robot executes a task. Empirical studies [14, 15] demonstrate that map-like memory outperforms RNNs. However, the storage consumption of map-like memory is related to the scale of exploration environments, which limits the lightweight and scalability of exploration strategies. This is the reason why we store the high-dimensional features extracted by the spatial encoder as episodic memory instead of storing the TER directly.

Furthermore, we propose to leverage IAM [16] to reshape the configuration space of exploration tasks to further cope with the low sampling efficiency and the explosion of action space caused by the expansion of the sensing range. Specifically, the selection of candidate exploration goals is restricted to the frontier nodes with large information gain in the TER, which ensures faster and more complete exploration. An illustration of the exploration process is shown in Fig. 1. Overall, our main contributions include:

Fig. 1 a–d Illustrate the exploration result, the exploration process, the shape of the high-dimensional TER, and the bird's-eye view of the simulated environment, respectively. e, f Illustrate the 3D structure of a realistic simulation scene

(1) We propose a VDC method to obtain an efficient, high-dimensional TER. A GVN with spatial awareness and episodic memory is then designed for lidar-based smart exploration strategy learning.

(2) We propose an IAM method to cope with the low sampling efficiency and the explosion of the action space caused by the expansion of the sensing range during DRL.

(3) We introduce a carefully designed reward function for learning exploration policies, and verify the superiority and feasibility of our approach through extensive simulation comparisons and real-robot experiments.

The remainder of the paper is structured as follows: “Related work” introduces related work. “Problem statement” defines the lidar-based autonomous exploration problem. “Episodic memory enhanced autonomous exploration with VDC and IAM” specifically describes our exploration strategy based on VDC, GVN, and IAM, followed by simulation comparisons, ablation studies, and mobile robot experiments in “Simulation analysis and experimental verification”. Finally, we summarize the work in “Conclusion and future work”.

Related work

Classic human-designed exploration strategies

Human-designed autonomous exploration strategies classically fall into two categories, i.e., frontier-based [6, 17,18,19,20,21] and sampling-based [22, 23] methods.

Frontier-based methods first detect the frontiers in the map and extract the Frontier Points (FPs) as candidate exploration goals. The original idea of frontier-based exploration is introduced in [6], where the nearest frontier is always greedily chosen as the exploration goal. Subsequently, by employing local and global Rapidly-exploring Random Trees (RRTs) to detect the frontiers, Umari et al. [17] propose a new exploration strategy to achieve efficient exploration. By incrementally constructing and maintaining a graph structure during exploration, a novel path planning algorithm is proposed by Wang et al. [18]. Meanwhile, a utility function that integrates path cost and information gain is introduced in [18] for selecting the next best frontier.

Recently, Cavinato et al. [19] formulate a novel frontier-based exploration strategy capable of explicitly handling dynamic obstacles, thus leading to complete and reliable exploration outcomes in populated environments. Yu et al. [20] propose an exploration strategy based on Multi-target Potential Field (MMPF), which can eliminate the goal’s back-and-forth changes and overlapping trajectories. It is worth mentioning that MMPF is suitable for multi-robot collaborative exploration. In addition, Julia et al. [22] present a frontier-based hybrid approach to the multi-robot integrated exploration problem.

In the most advanced sampling-based method, Zhu et al. [21] divide autonomous exploration into an exploration phase and a relocation phase. A local RRT is grown in the free space of the workspace during the exploration phase. On this basis, the relocation phase explicitly moves the robot to different sub-areas so as to maximize information gain. Unlike the 2D lidar-based exploration strategy proposed in [21], Zhang et al. [23] propose a novel depth camera-based informative exploration strategy with uniform sampling to efficiently reduce the unknown volume of 3D environments and provide accurate reconstruction for quadrotors. However, although sampling-based methods are probabilistically complete, they take a long time to sample exploration goals before achieving complete exploration. In our work, we propose an exploration strategy that combines the advantages of DRL-based and frontier-based approaches for efficient and complete exploration.

Learning-based exploration strategies

Learning-based exploration strategies are mainly divided into two categories: map prediction-based [24,25,26] and DRL-based [27,28,29,30] methods. To remedy the lack of prior knowledge and enhance the exploration, Zhong et al. [24] and Shrestha et al. [25] propose similar methods to predict map occupancy by using generative neural networks. Saroya et al. [26] propose a method for learning subtle structural cues, which may be exploited by a decision-maker to improve the speed of exploration.

The works in [27] and [28] combine classical frontier-based exploration with DRL to empower robots to autonomously explore unknown cluttered environments. The difference is that a hierarchical approach is proposed in [28] for both effective exploration and collision-free navigation in crowded environments. Zhu et al. [29] propose an exploration architecture that integrates a DRL model, a Next Best View (NBV) selection approach, and a structural integrity measurement to improve exploration performance. Li et al. [30] construct an exploration framework by decomposing the exploration process into decision, planning, and mapping modules. They then propose a DRL-based decision-making algorithm that uses a deep neural network to learn an exploration strategy from the partial map.

For cutting-edge learning-based approaches, Chen et al. [31] combine GCN with DRL to achieve exploratory decision-making under uncertainty. Lee et al. [32] propose the Extendable Navigation Network (ENN) to encode the partially observed high-dimensional indoor Euclidean space to a sparse graph representation. Then, the robot’s motion is generated by a learned Q-network whose input is the ENN. Xu et al. [33] propose a benchmark for autonomous exploration tasks of wheeled robots, which inspired our work. More comprehensive insights into DRL-based autonomous exploration are summarized in [8].

However, with the expansion of the exploration scale, most of the above methods fall into the dilemma of configuration space explosion. In our work, GVD-based TER and IAM are used to alleviate such problems for faster and higher-quality exploration. Then a GVN with spatial awareness and episodic memory is designed for lidar-based smart exploration strategy learning.

Topological map-based robotic navigation

Classic iterative learning control is a high-performance control scheme that is widely applied to point-to-point motion tasks (e.g. robotic navigation) in existing work [56,57,58]. In recent years, topological maps have been employed to break the dimensional curse of the search space while preserving the main structural clues of the workspace, which provides many novel schemes for robotic navigation. By performing multiple start-to-goal navigation tasks on a topological roadmap, Tsang et al. [34] propose a method that captures information from past executions to learn a motion policy for handling obstacles that the robot has seen before. Francis et al. [35] combine a probabilistic roadmap-based planner with RL methods to address long-term navigation problems in indoor contexts.

To obtain a more refined topological map representation, Hou et al. [36] propose a novel approach to the automatic generation of the indoor hierarchical topological map based on GVD, which considerably reduces human intervention in robotic navigation. Since partial planning algorithms are prone to fall into traps in complex environments, resulting in the failure of motion planning, Chi et al. [37] propose a feature tree algorithm based on GVD to generate heuristic paths to guide partial motion planning.

The works in [38, 39] employ the GVD to explicitly guide sampling-based path planning algorithms for optimal robot navigation. The difference is that the method in [38] employs multiple potential functions, which provide a nonuniform sampling distribution for the path planner. In [39], the GVD is adopted to build an equidistant roadmap and rapidly generate an initial collision-free solution. A secure tunnel is then established via the minimum distance from the obstacles to the initial solution to concentrate subsequent sampling. Chi et al. [40] present a GVD-based heuristic path planning algorithm to generate a heuristic path and guide the sampling process of RRT to address the trap-space problem. Benefiting from the GVD, the algorithm proposed in [40] can automatically identify environmental features and provide a reasonable heuristic path.

In our work, the GVD-based high-dimensional TER is proposed to achieve domain conversion, which breaks the dimensional curse from action space and observation space without losing the structural properties of the workspace.

Problem statement

The autonomous exploration problem we consider involves a nonholonomic mobile robot equipped with a 2D lidar with a limited sensing range. The robot chooses exploration goals and navigates continuously in a closed environment \(M \in \mathcal {F}^{w \times h}\), incrementally sensing and constructing an estimated map \(\hat{M} \in \mathcal {F}^{w \times h}\), in which \(\mathcal {F} = \{ F_{\text {o}}: {\text {occupied}}, F_{\text {f}}: {\text {free}}, F_{\text {u}}: {\text {unknown}} \}\) is the set of possible cell states. w and h are the maximum width and height of the map, respectively. Unlike previous fine-grained grid map-based exploration decision-making [7, 17, 27], we employ a domain conversion \(T: \mathcal {F}^{w \times h} \times \mathbb {R}^2 \rightarrow \mathcal {G}\) from Euclidean space to an undirected topological graph to maintain high-dimensional spatial clues and mitigate the explosion of the configuration space:

$$\begin{aligned} G = T(\hat{M}, x), \end{aligned}$$
(1)

where \(G = (\mathcal {V}, X, \mathcal {E}, E)\) consists of \(\vert \mathcal {V}\vert \) nodes and \(\vert \mathcal {E}\vert \) edges. In particular, \(\mathcal {V} = \{ v_1, \ldots , v_{\vert \mathcal {V}\vert } \}\) is the set of nodes, \(X \in \mathbb {R}^{\vert \mathcal {V}\vert \times d_{N}}\) is the collection of node features, \(\mathcal {E} = \{ e_1, \ldots , e_{\vert \mathcal {E}\vert } \}\) is the set of edges, and \(E \in \mathbb {R}^{\vert \mathcal {E}\vert \times d_{E}}\) is the collection of edge features. \(d_N\) and \(d_E\) denote the dimensions of the node features and the edge features, respectively. \(x \in \mathbb {R}^2 \) indicates the position of the robot.

We model autonomous exploration as a Partially Observable Markov Decision Process (POMDP) in the RL framework [42], which is represented as a tuple \((\mathcal {S}, \mathcal {O}, \mathcal {A}, {\Lambda }, \mathcal {T}, \mathcal {R}, {\rho _{0}}, {\gamma }, {L_{{\text {exp}}}})\) with state space \(\mathcal {S}\), observation space \(\mathcal {O}\), action space \(\mathcal {A}\), state-conditioned observation distribution \({\Lambda }\), transition distribution \(\mathcal {T}\), reward function \(\mathcal {R}\), initial state distribution \({\rho _{0}}\), discount factor \({\gamma }\), and finite episode length \(L_{{\text {exp}}}\). When a new exploration episode begins, the sensor-equipped robot is spawned at an initial state \(s_0 = (\hat{G}_0, x_0) \sim {\rho _{0}}\) in a previously unknown environment. At time t, the agent at state \(s_t = (\hat{G}_t, x_t)\) receives an observation \(o_t \sim {\Lambda }(\cdot \vert s_t)\), executes an exploration goal action \(a_t \sim \pi _\theta (\hat{s}_t)\), receives a reward \(r_t \sim \mathcal {R}(\cdot \vert s_t,a_t,s_{t+1})\), and reaches state \(s_{t+1} \sim \mathcal {T}(\cdot \vert s_t,a_t)\). \(\pi _\theta \) denotes the robot's exploration policy.

The goal is to learn an optimal exploration policy \(\pi ^{*}_\theta \) that maximizes the expected cumulative sum of the discounted rewards over an exploration episode:

$$\begin{aligned} \pi ^{*}_\theta = \mathop {\arg \max }\limits _{\pi _\theta }\mathbb {E}_{\tau \sim \pi _\theta }\left[ \sum _{t=0}^{L_{{\text {exp}}}}\gamma ^{t}r_t\right] , \end{aligned}$$
(2)

where \(\tau \) denotes an exploration trajectory \(\{(s_t, a_t, r_t)\}^{L_{{\text {exp}}}}_{t=0}\), i.e., a sequence of \((s_t, a_t, r_t)\) tuples generated by starting at \(s_0\) and acting according to policy \(\pi _\theta \) at each time step. The reward function \(\mathcal {R}\) captures the method-specific incentive for exploration. The detailed domain conversion mechanism, policy learning method, and reward shaping are described in the following sections.

Episodic memory enhanced autonomous exploration with VDC and IAM

In this section, GVD-based domain conversion is first introduced to extract the main structural features of the workspace and deal with the ambiguity caused by the expansion of the exploration scale. Then, the spatio-temporal GVN for policy learning and the IAM for configuration space reshaping are described in detail. Finally, the navigation pipeline for autonomous exploration and the reward function for guiding policy learning are presented at the end of this section.

GVD-based domain conversion

Assume that the robot operates in a 2-dimensional workspace \(\mathcal {W}\) populated by convex obstacles \(C_1, C_2, \ldots , C_i\). Nonconvex obstacles can be modeled as unions of convex obstacles, and the boundary of \(\mathcal {W}\) is treated as a collection of convex sets that also belong to the obstacle set. The distance \(d_i(x)\) between a point x and an obstacle \(C_i\) is defined as:

$$\begin{aligned} d_i(x) = \mathop {\min }\limits _{c_0 \in C_i}\vert \vert x-c_0 \vert \vert , \end{aligned}$$
(3)

and the gradient of \(d_i(x)\) is denoted as:

$$\begin{aligned} \nabla d_i(x) = \frac{x - c_0}{\vert \vert x - c_0 \vert \vert }, \end{aligned}$$
(4)

where \(\nabla d_i(x)\) is a unit vector pointing from \({c_0}\) towards x, and \({c_0}\) is the nearest point to x in \(C_i\). Typically, there are multiple obstacles in the environment. As shown in Fig. 2a, equidistant points are found among the obstacles, and the shortest distance from a point to the obstacles is stored as \(D(x) = \mathop {\min }\limits _{i}d_i(x)\). The set of points equidistant from two obstacles \(C_i\) and \(C_j\), on which the generalized Voronoi edges lie, is called a two-equidistant face and is defined as:

$$\begin{aligned} F_{ij} = \{x \in \mathcal {W}: d_i(x) = d_j(x)\ {\text {and}}\ \nabla d_i(x) \ne \nabla d_j(x)\}. \end{aligned}$$
(5)

As shown in Fig. 2b, the intersection of the two-equidistant faces \(F_{ij}\), \(F_{ik}\), and \(F_{jk}\) is called a three-equidistant face, which is defined as:

$$\begin{aligned} F_{ijk} = F_{ij} \cap F_{ik} \cap F_{jk} = \{ x \in \mathcal {W}: d_i(x) = d_j(x) = d_k(x)\ {\text {and}}\ \nabla d_i(x) \ne \nabla d_j(x) \ne \nabla d_k(x) \}. \end{aligned}$$
(6)

In the 2D case, \(F_{ijk}\) could be a point, called the generalized Voronoi vertex. Accordingly, the GVD consists of a set of generalized Voronoi edges and vertices. A well-connected roadmap can be obtained after initializing the environment map by the GVD method. In practice, the construction of a roadmap is mainly based on the detected environmental information and mathematical analysis, where the lidar is adopted for mobile robot environment perception. Unlike previous applications [37,38,39], in our work, the GVD is obtained via an incremental construction procedure to reduce computational cost.
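To make the construction concrete, the sketch below (a minimal illustration, assuming each obstacle is approximated by a finite set of sampled boundary points and using NumPy) shows how \(d_i(x)\), \(\nabla d_i(x)\), and membership in a two-equidistant face from Eqs. (3)–(5) could be evaluated; it is not the incremental GVD construction used in our implementation.

```python
import numpy as np

def distance_and_gradient(x, obstacle_points):
    """d_i(x) (Eq. 3) and its gradient (Eq. 4) w.r.t. one obstacle C_i,
    here approximated by a finite set of boundary points."""
    x = np.asarray(x, dtype=float)
    pts = np.asarray(obstacle_points, dtype=float)
    dists = np.linalg.norm(pts - x, axis=1)
    c0 = pts[np.argmin(dists)]                          # nearest point of C_i to x
    grad = (x - c0) / (np.linalg.norm(x - c0) + 1e-9)   # unit vector from c0 towards x
    return dists.min(), grad

def in_two_equidistant_face(x, C_i, C_j, tol=1e-3):
    """Membership test for the two-equidistant face F_ij of Eq. (5)."""
    d_i, g_i = distance_and_gradient(x, C_i)
    d_j, g_j = distance_and_gradient(x, C_j)
    return abs(d_i - d_j) < tol and not np.allclose(g_i, g_j, atol=tol)
```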

Fig. 2 Principle of the GVD. a Set of points equidistant from obstacles \(\{ C_i \}\) and \(\{ C_j \}\). Grey areas represent obstacles and white areas are free. b GVD graph composed of generalized Voronoi edges and vertices

For the implementation of the VDC, we detect frontiers in the incrementally constructed map \(\hat{M}\); these frontiers serve as candidate exploration goals. The information gain \(G_{{\text {info}}}\) at a frontier f measures the amount of free space left to explore beyond f, i.e., navigable cells that are unexplored in the current partial map. To calculate \(G_{{\text {info}}}\), we first group the unexplored free-space cells into connected components \(\Omega = \{\omega _1, \ldots ,\omega _n\}\) using the connected-component labeling (CCL) method in OpenCV, and then associate each connected component \(\omega _i\) with the map frontiers. A component \(\omega \) is associated with frontier f only if at least one pixel in \(\omega \) is an 8-connected neighbor of some pixel in f. For each frontier f, we can then calculate the information gain \(G_{{\text {info}}}\) as the sum of the areas of the connected components associated with f, normalized by the total free space on the complete map.
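The following sketch outlines one way to compute \(G_{{\text {info}}}\) with OpenCV's connected-component labeling; the occupancy encoding, the use of the complete layout for the normalization, and the frontier representation are assumptions made for illustration.

```python
import cv2
import numpy as np

# Assumed grid encoding: 0 = free, 100 = occupied, -1 = unknown
def information_gain(partial_map, gt_map, frontiers):
    """G_info for each frontier: area of unexplored free-space components
    8-adjacent to the frontier, normalized by the total free space."""
    unexplored_free = np.logical_and(gt_map == 0, partial_map == -1).astype(np.uint8)
    _, labels = cv2.connectedComponents(unexplored_free, connectivity=8)
    total_free = float(np.count_nonzero(gt_map == 0))

    gains = []
    for f in frontiers:                      # f: (N, 2) array of (row, col) frontier pixels
        comp_ids = set()
        for r, c in f:                       # components 8-adjacent to some frontier pixel
            patch = labels[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            comp_ids.update(int(k) for k in np.unique(patch) if k > 0)
        area = sum(np.count_nonzero(labels == k) for k in comp_ids)
        gains.append(area / total_free)
    return gains
```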

By connecting the frontiers and the robot’s position with the GVD on the principle of the shortest distance, we obtain the final TER. The connections between all nodes are collision-free, and the above operations formulate the domain conversion \(G = T(\hat{M}, x)\). The edge feature of the topological graph is represented as the Euclidean distance between nodes. The node feature of the topological graph is represented as:

$$\begin{aligned} S = [p_x, p_y, G_{{\text {info}}}, \phi _{{\text {diff}}}, D, {\text {flag}}], \end{aligned}$$
(7)

where \(p_x\) and \(p_y\) denote the current position of the robot and \(G_{{\text {info}}}\) denotes the information gain of the node. D and \(\phi _{{\text {diff}}}\) denote the Euclidean distance and the relative orientation between the current robot pose and each node in the graph, respectively. flag is an indicator variable that is 0 for the node at the current robot pose, 1 for frontier nodes, and \(-1\) for all other nodes. To alleviate reward sparsity in DRL, we assign a discounted information gain to nodes whose indicator variable is \(-1\), according to the distance from these nodes to the frontiers. The distance threshold for assigning information gain is determined by \(D_{{\text {gain}}}\).
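A hedged sketch of how the node feature of Eq. (7) could be assembled is shown below; the dictionary keys, the angle convention, and the discount applied to non-frontier nodes within \(D_{{\text {gain}}}\) of a frontier are illustrative assumptions.

```python
import numpy as np

def node_feature(node, robot_pose, d_gain=5.0, gain_discount=0.5):
    """Per-node feature S = [p_x, p_y, G_info, phi_diff, D, flag] of Eq. (7)."""
    px, py, yaw = robot_pose
    nx, ny = node['pos']
    D = np.hypot(nx - px, ny - py)
    phi = (np.arctan2(ny - py, nx - px) - yaw + np.pi) % (2 * np.pi) - np.pi
    if node.get('is_robot', False):
        flag, gain = 0, 0.0
    elif node.get('is_frontier', False):
        flag, gain = 1, node['gain']
    else:
        flag = -1
        # discounted information gain for nodes close to a frontier, else zero
        near = node.get('dist_to_frontier', np.inf) < d_gain
        gain = gain_discount * node.get('nearest_frontier_gain', 0.0) if near else 0.0
    return np.array([px, py, gain, phi, D, flag], dtype=np.float32)
```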

Generalized Voronoi network configuration

As shown in Fig. 3, our GVN mainly consists of two components: a spatial feature encoder \(E_{\text {S}}\) and a temporal feature encoder \(E_{\text {T}}\). \(E_{\text {S}}\) is designed to encode the structural clues of the TER G. As GCNs have achieved great success in processing graph-structured data, a GCN is employed to handle spatial feature interaction between the nodes in the TER. Specifically, a 3-layer GCN with residual connections is utilized to form the spatial encoder. The node features at layer l are denoted as \(G^{(l)} = [g^{(l)}_1, \ldots ,g^{(l)}_N ]^T\); the node aggregation at layer l is then calculated as:

$$\begin{aligned} G^{(l+1)}=\sigma _{\text {S}}(D^{-\frac{1}{2}}\hat{A}D^{-\frac{1}{2}}G^{(l)}W^l_{\text {S}}), \quad l=0,1,2, \end{aligned}$$
(8)

where \(\hat{A} = A + I\) and I denotes the identity matrix. A and D denote the adjacency matrix and the degree matrix of the topological graph, respectively, which are utilized to maintain the topological properties of the TER. \(W^l_{\text {S}}\) denotes the learnable weight matrix of layer l and \(\sigma _{\text {S}}(\cdot )\) denotes the ReLU function.
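A minimal PyTorch sketch of such a spatial encoder is given below; the hidden width and the use of the degree of \(\hat{A}\) for the symmetric normalization (the standard GCN choice) are assumptions.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """3-layer GCN with residual connections implementing Eq. (8) (sketch)."""

    def __init__(self, d_in, d_h=64):
        super().__init__()
        dims = [d_in, d_h, d_h, d_h]
        self.weights = nn.ModuleList(
            [nn.Linear(dims[l], dims[l + 1], bias=False) for l in range(3)])

    def forward(self, X, A):
        # X: (|V|, d_N) node features, A: (|V|, |V|) adjacency matrix of the TER
        A_hat = A + torch.eye(A.size(0), device=A.device)      # \hat{A} = A + I
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))    # D^{-1/2}
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt
        G = X
        for W in self.weights:
            H = torch.relu(A_norm @ W(G))                      # Eq. (8)
            G = H + G if H.shape == G.shape else H             # residual connection
        return G
```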

Robotic autonomous exploration is a long sequential decision-making task that requires the continuous selection of exploration goals. To handle dependencies over longer time horizons, episodic memory is realized with a buffer \(\mathcal {M}\) that the robot reads from and writes to at each time step. To trade off storage consumption and convergence speed, we employ a sliding window of length \(\epsilon \), which results in a fixed-length memory buffer. If the buffer is full, the oldest memory fragment is deleted when a new memory fragment is added. \(\mathcal {M}\) stores the memory fragments of the previous \(\epsilon \) time steps, where each memory fragment consists of a high-dimensional spatial feature encoded by \(E_{\text {S}}\) and a frontier feature, as shown in Fig. 3.
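The sliding-window buffer itself amounts to a fixed-length queue, e.g. the following minimal sketch.

```python
from collections import deque

class EpisodicMemory:
    """Fixed-length buffer M: a sliding window of the last epsilon fragments,
    each pairing a spatial feature from E_S with a frontier feature."""

    def __init__(self, epsilon=10):
        self.buffer = deque(maxlen=epsilon)   # oldest fragment dropped when full

    def write(self, spatial_feature, frontier_feature):
        self.buffer.append((spatial_feature, frontier_feature))

    def read(self):
        return list(self.buffer)              # m_0, ..., m_{t-1} for the temporal encoder
```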

A multi-layer TCN is utilized to comprehensively manage the time-series episodic memory, and the input of the TCN is \(\mathcal {M} = \{ m_0, m_1, \ldots , m_{t-1} \}\). Note that a Transformer [54] could also benefit temporal feature fusion through its self-attention module; however, since we aim to deploy the exploration strategy on the robot's embedded controller, we only employ a multi-layer TCN in this work to keep the model simple and reduce computational cost. We denote the t-th memory fragment in \(\mathcal {M}\) as \(m_t\). The update of \(m^{(l)}_t\) at layer l of the TCN is represented as:

$$\begin{aligned} m^{(l+1)}_t = \mathcal {F}_{d_l}(m^{(l)})\vert _{m^{(l)}_t} = \sum _{k=1}^K \eta _{k} m^{(l)}_{t-(K-k)d_l}, \end{aligned}$$
(9)

where K is the one-dimensional convolution kernel size, \(d_l\) is the dilation factor controlling the skip connections of layer l, and \(\eta _1, \ldots , \eta _K\) are the learnable parameters of the convolution kernel.
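One layer of such a dilated causal convolution can be sketched in PyTorch as follows; the channel count, kernel size, and ReLU placement are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvLayer(nn.Module):
    """One dilated causal 1-D convolution of the TCN, matching Eq. (9)."""

    def __init__(self, channels=64, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left padding keeps causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, m):
        # m: (batch, channels, time) sequence of memory fragments m_0, ..., m_{t-1}
        m = F.pad(m, (self.pad, 0))                        # pad only on the past side
        return torch.relu(self.conv(m))
```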

Fig. 3 Architecture of the Generalized Voronoi Network. \(f_1, f_2, \ldots , f_n\) represent frontiers in the TERs \(G_0, G_1, \ldots , G_t\) at different time steps. R and \(f_{{\text {cur}}}\) represent the robot's position and the selected exploration goal at different time steps, respectively. At time step t, \(E_{\text {S}}\) is used to encode the TER \(G_t\) (the blue pipeline). Before time step t, the spatial features extracted by \(E_{\text {S}}\) are inserted into a memory buffer \(\mathcal {M}\) as partial episodic memory. \(\mathcal {M}\) stores the memory fragments of the previous \(\epsilon \) time steps, where each memory fragment consists of a high-dimensional spatial feature encoded by \(E_{\text {S}}\) and a frontier feature (the green pipeline). The temporal feature encoder \(E_{\text {T}}\) is used to manage the time-series episodic memory. An attention layer is utilized to fuse the spatio-temporal features and to retrieve information related to the current state from \(\mathcal {M}\). An aggregator is used to integrate the current spatial features and the corresponding query on exploratory memory. The actor head and the critic head are both realized by MLPs

An attention-based query operation is utilized to fuse the spatio-temporal features and to retrieve information related to the current state from long-term memory. The output of the attention layer can be represented as:

$$\begin{aligned} \hat{h}^\mathcal {M}_t = \alpha _1 *r^G_t + \alpha _2 *h_t^\mathcal {M}, \end{aligned}$$
(10)

where

$$\begin{aligned} \alpha _1 = \frac{{\text {exp}}(W r^G_t + b)}{{\text {exp}}(W r^G_t + b) + {\text {exp}}(W h_t^\mathcal {M} + b)}, \end{aligned}$$
(11)

and the calculation of \(\alpha _2\) is similar to (11). \(r^G_t\) and \(h_t^\mathcal {M}\) denote the outputs of \(E_{\text {S}}\) and \(E_{\text {T}}\), respectively. \({\text {exp}}(\cdot )\) is the exponential function, and W and b are the parameters of the attention mechanism. Finally, an aggregator is used to integrate the current spatial features and the corresponding query on the exploratory memory:

$$\begin{aligned} z_t = {\text {MLP}}({\text {cat}}(\hat{h}^\mathcal {M}_t, r^G_t)), \end{aligned}$$
(12)

where \({\text {MLP}}\) represents a multi-layer perceptron followed by a ReLU activation. We model the policy as a diagonal Gaussian distribution, i.e. \(\pi _{\theta }(a_t\vert s_t) = \mathcal {N}(\mu , \sigma ^2)\). As shown in Fig. 3, our model employs the Actor-Critic architecture [44]. The distribution's mean \(\mu \) and log-standard-deviation \(\log \sigma \) are learned by the actor head. The critic head estimates the state value function \(V(s_t) = \mathbb {E}_{\pi _{\theta }}\left[ \sum _{k=t}^{L_{{\text {exp}}}}\gamma ^{k-t} r_k\right] \) of the current policy. The actor head and the critic head are both realized by MLPs.
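The following sketch combines the attention-based fusion of Eqs. (10)–(12) with the actor and critic heads; the hidden sizes, the action dimensionality, and the single shared scoring layer (W, b) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionActorCritic(nn.Module):
    """Attention fusion (Eqs. 10-12) followed by actor and critic heads (sketch)."""

    def __init__(self, d_feat=64, d_h=128, action_dim=2):
        super().__init__()
        self.score = nn.Linear(d_feat, 1)                   # shared W, b of Eq. (11)
        self.aggregator = nn.Sequential(nn.Linear(2 * d_feat, d_h), nn.ReLU())
        self.actor = nn.Linear(d_h, 2 * action_dim)         # mean and log-std
        self.critic = nn.Linear(d_h, 1)                     # state value V(s_t)

    def forward(self, r_G, h_M):
        # r_G, h_M: (batch, d_feat) outputs of E_S and E_T
        scores = torch.cat([self.score(r_G), self.score(h_M)], dim=1)
        w = torch.softmax(scores, dim=1)                    # alpha_1, alpha_2 of Eq. (11)
        fused = w[:, :1] * r_G + w[:, 1:] * h_M             # Eq. (10)
        z = self.aggregator(torch.cat([fused, r_G], dim=1)) # Eq. (12)
        mu, log_std = self.actor(z).chunk(2, dim=1)
        return mu, log_std, self.critic(z)
```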

Configuration space reshaping using invalid action masking

Previous DRL-based exploration strategies [9, 10] usually set the entire current map \(\hat{M}\) as the action space \(\mathcal {A}\) and the observation space \(\mathcal {O}\). Their candidate exploration goals are mostly regressed and may not lay on the frontiers, thus resulting in low sampling efficiency and incomplete map construction. Due to the curse of dimensionality, \(\mathcal {A}\) and \(\mathcal {O}\) gradually expand with lidar sensing and mapping, which makes the problem even more ambiguous. To alleviate the above problems, we employ IAM to restrict actions (the selection of exploration goals) to frontiers and their neighbors in the TER.

Classic policy gradient algorithms for DRL use the following policy gradient estimate of the objective J:

$$\begin{aligned} \begin{aligned} g_{{\text {policy}}}&= \mathbb {E}_\tau [\nabla _\theta {\text {log}} \pi _\theta (a_\tau \vert s_\tau )R_\tau ] \\&= \mathbb {E}_\tau \left[ \nabla _\theta \sum ^{T-1}_{t=0} {\text {log}}\pi _\theta (a_t\vert s_t)R_t\right] , \end{aligned} \end{aligned}$$
(13)

where \(R_t = \sum _{k=0}^{T_{{\text {exp}}}-t} \gamma ^k r_{t+k}\) denotes the discounted return following time step t. The expected discounted reward is maximized by gradient ascent \(\theta = \theta + \nabla _\theta J\). Specifically, we treat nodes far away from frontiers as invalid actions with no information gain. Then, IAM is utilized to avoid sampling invalid actions by masking out the logits corresponding to the invalid actions. This is usually accomplished by replacing the logits of the actions to be masked in (13) with a large negative number N (e.g. \(N = -1 \times 10^8\)) [16]. Given state \(s \in \mathcal {S}\), the IAM process is treated as a differentiable function \({\text {inv}}_s\) applied to the logits l(s) output by policy \(\pi _\theta \). The IAM process is defined as:

$$\begin{aligned} \begin{aligned} \pi _{\theta }^{'}(\cdot \vert s_t) = {\text {softmax}}({\text {inv}}_s(l(s))), \end{aligned} \end{aligned}$$
(14)

where

$$\begin{aligned} {\text {inv}}_s(l(s))_i=\left\{ \begin{aligned} l_i,&\ {\text {if}}\ a_i\ {\text {is}}\ {\text {valid}}\ {\text {in}}\ s, \\ N,&\ {\text {otherwise}}. \end{aligned} \right. \end{aligned}$$
(15)

The IAM process does more than just renormalize the probability distribution; it in fact makes the gradient corresponding to the logits of the invalid actions zero [16]. In other words, taking invalid actions does not contribute to maximizing the cumulative reward, and therefore the robot tends not to take them.
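As a minimal sketch (variable names are hypothetical), the masking of Eqs. (14)–(15) reduces to overwriting the logits of non-frontier nodes before the softmax:

```python
import torch

def masked_policy(logits, valid_mask, big_negative=-1e8):
    """Invalid action masking (Eqs. 14-15): logits of invalid actions are replaced
    by a large negative number, so both their probability and gradient vanish."""
    masked = torch.where(valid_mask, logits, torch.full_like(logits, big_negative))
    return torch.softmax(masked, dim=-1)

# Example: only nodes 1 and 3 are frontiers with information gain (valid actions)
logits = torch.tensor([0.2, 1.5, -0.3, 0.7])
valid = torch.tensor([False, True, False, True])
print(masked_policy(logits, valid))   # probability mass only on nodes 1 and 3
```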

Navigation pipeline and reward function

For robotic navigation during autonomous exploration, path planning and mapping are implemented based on \(move\_base\) and Gmapping, respectively. It is worth mentioning that our TER, constructed and maintained in real time, can provide safe and efficient motion paths for the robot. Once the robot navigates to a position whose distance to the current exploration goal is less than a threshold \(\zeta \), the decision-making mechanism is triggered to select the next exploration goal.
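As a small illustration of this trigger (variable names are hypothetical), the decision-making mechanism fires as soon as the robot is within \(\zeta \) of the current goal:

```python
import math

def need_new_goal(robot_xy, goal_xy, zeta=1.0):
    """True once the robot is closer than zeta (1 m in our simulations) to the goal,
    i.e. the next exploration goal should be selected."""
    return math.hypot(robot_xy[0] - goal_xy[0], robot_xy[1] - goal_xy[1]) < zeta
```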

To encourage robots to better trade off map quality and exploration efficiency, we design the following reward function:

$$\begin{aligned} \begin{aligned} R = \beta _1 G_{{\text {info}}}^t - \beta _2 L_{{\text {path}}}^t + \beta _3 r_{{\text {Acc}}}^t - \beta _4 N_T, \end{aligned} \end{aligned}$$
(16)

where \(G_{{\text {info}}}^t\) and \(L_{{\text {path}}}^t\) represent the total information gain and total path length accumulated up to time step t, respectively, and \(N_T\) represents the total number of time steps elapsed up to time step t. These terms encourage the robot to achieve maximum exploration coverage with as little path length and time consumption as possible. \(r_{{\text {Acc}}}^t\) is utilized to reward or penalize the effect of the currently selected action on map accuracy:

$$\begin{aligned} \begin{aligned} r_{{\text {Acc}}}^t = {\text {Accuracy}}(\hat{M}_t, \widetilde{M}) - {\text {Accuracy}}(\hat{M}_{t-1}, \widetilde{M}), \end{aligned} \end{aligned}$$
(17)

where

$$\begin{aligned} \begin{aligned} {\text {Accuracy}}(\hat{M}, \widetilde{M}) = \sum ^{w \times h}_{i=1}\varvec{1}[\hat{M}_i = \widetilde{M}_i], \end{aligned} \end{aligned}$$
(18)

where \(\varvec{1}[\hat{M}_i = \widetilde{M}_i]\) is an indicator function that returns one if \(\hat{M}_i = \widetilde{M}_i\) and zero otherwise. \(\widetilde{M}\) represents the ground-truth layout of the workspace. Additionally, this reward provides a better learning signal while training under noisy conditions by accounting for mapping errors arising from noisy pose and noisy sensing [43].
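A sketch of the reward of Eqs. (16)–(18) is given below, assuming the estimated and ground-truth maps share the same integer grid encoding; the weights follow the simulation settings reported later.

```python
import numpy as np

def exploration_reward(g_info, path_len, n_steps, map_t, map_prev, gt_map,
                       betas=(1.0, 0.6, 0.5, 0.3)):
    """Reward of Eq. (16); the accuracy bonus follows Eqs. (17)-(18)."""
    def accuracy(est, gt):
        return float(np.count_nonzero(est == gt))                  # Eq. (18)
    r_acc = accuracy(map_t, gt_map) - accuracy(map_prev, gt_map)   # Eq. (17)
    b1, b2, b3, b4 = betas
    return b1 * g_info - b2 * path_len + b3 * r_acc - b4 * n_steps
```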

Fig. 4 Illustrations for the augmented data set for policy learning. The data set contains typical basic scenarios and their combinations such as loop, narrow corridor, multiple rooms, etc.

Simulation analysis and experimental verification

Simulation setup and implementation details

We evaluate the proposed exploration strategy in multiple simulation scenarios on Ubuntu 16.04 using ROS Kinetic and the Gazebo simulator. The robot model used in the simulations is the Turtlebot3 Waffle, which is equipped with a 2D lidar sensor with a sensing distance of 6 m to capture the distance information of the simulated scenarios in real time. The maximum linear velocity and angular velocity of the mobile robot are 0.5 m/s and 1.8 rad/s, respectively. The capacity of the scenario memory buffer is set to \(\epsilon = 10\). The \(\beta _1\), \(\beta _2\), \(\beta _3\), and \(\beta _4\) in Eq. (16) are set to 1.0, 0.6, 0.5, and 0.3, respectively. The decision frequency for autonomous exploration is uniformly set to 5 Hz for all methods. All simulations run on a computer with a Core i7-9700K CPU at 4.9 GHz and 16 GB of memory.

We employ the scenarios published in [33] for the learning and evaluation of different exploration strategies, which contain typical basic scenarios and their combinations such as loops, narrow corridors, multiple rooms, etc. To increase the complexity of the data set, several large-scale environments are further added, as shown in Fig. 4. We select nine scenes for policy learning (Fig. 4) and three for evaluation (Fig. 6). For the evaluation of the different methods, we consider three challenging synthetic scenes of increasing complexity and scale.

(1) Scene 1, Loop with Corridor: As shown in Fig. 6a, a loopback scene with a scale of 24 \(\times \) 24 \(\times \) 3 m\(^3\). This type of scene often results in inevitable back-and-forth movements.

(2) Scene 2, Maze: As shown in Fig. 6b, a hand-crafted maze-shaped scene with a scale of 20 \(\times \) 20 \(\times \) 3 m\(^3\). This type of scene often requires exploration strategies to deal with dead-end dilemmas during the exploration process.

(3) Scene 3, Office: As shown in Fig. 6c, a hand-crafted large-scale scene with a scale of 40 \(\times \) 25 \(\times \) 3 m\(^3\), consisting of many rooms. This type of scene often results in incomplete exploration and inefficient back-and-forth movement.

We compare the proposed method with multiple baselines that use various types of exploration strategies. The considered baselines and our methods are as follows:

  • Cost [6]: This approach detects frontiers in real-time in the background and always selects the nearest frontier during exploration.

  • Sample [17]: This method detects frontiers utilizing the RRT and selects the highest-scoring frontier for exploration utilizing a carefully designed utility function.

  • Potential Field [7]: This approach builds the potential field from the surroundings and selects the frontier with the largest potential gradient for exploration.

  • Naive Exp [27]: This technique deploys a DRL-based policy to determine the global goal and then the nearest frontier to the global goal is selected. This baseline simulation emulates the work in [27], employing CNNs to extract features from the original grid map. In keeping with [27], we also adopt the Actor-Critic [44] architecture for reinforcement learning.

  • GVN (ours): This method employs GVN and IAM for exploratory decision-making. The maximum side length of the GVD is set to 2 m in our method. The parameters \(D_{\text {gain}}\) and \(\zeta \) during the exploration simulations are set to 5 m and 1 m, respectively.

For the DRL-based strategies, the policies are trained using Proximal Policy Optimization (PPO) [45], and the training procedure builds upon the open-source PPO implementation provided by the Stable Baselines framework [46]. We train the exploration policies with 16 rollout processes on a computer with a Core i7-9700K CPU at 4.9 GHz and 16 GB of memory.

To quantify the above methods, we follow the evaluation indicators from previous work [23, 33, 43] to measure both map quality and exploration efficiency:

(1) IoU [43]: the intersection over union between the global map built during exploration and the ground-truth layout of the environment, which indicates the accuracy of the constructed map.

(2) Exploration Ratio (ER) [23]: the ratio of the area in the global map built during exploration (both free and occupied) to the ground-truth layout of the environment, which indicates the completeness of the constructed map.

(3) Total Exploration Time (ET) [33]: the total time spent until the end of the exploration, which indicates the time efficiency of completing the exploration task.

(4) Total Path Length (PL) [23]: the total path length traveled until the end of the exploration, which indicates, to some extent, the energy consumption of completing the exploration task. In our exploration missions, longer paths mean more energy consumption.

Table 1 Statistics of comparisons with baselines

Among them, the first two metrics measure the quality of the created map, while the latter two indicate exploration efficiency. In our exploration missions, a higher IoU and ER achieved in less ET or a shorter PL is better.
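For reference, the two map-quality metrics could be computed roughly as in the sketch below; the grid encoding and the exact IoU definition (here over occupied cells) are assumptions, and [43] and [23] give the precise definitions.

```python
import numpy as np

# Assumed grid encoding: 0 = free, 100 = occupied, -1 = unknown
def map_metrics(est_map, gt_map):
    """IoU of the built map against the ground-truth layout, and the
    exploration ratio of mapped area to ground-truth area."""
    est_occ, gt_occ = est_map == 100, gt_map == 100
    iou = np.count_nonzero(est_occ & gt_occ) / max(np.count_nonzero(est_occ | gt_occ), 1)
    er = np.count_nonzero(est_map != -1) / np.count_nonzero(gt_map != -1)
    return iou, er
```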

Simulation comparisons

According to the simulation settings in “Simulation setup and implementation details”, each exploration strategy is used to conduct 10 exploration simulations in each evaluation environment. An exploration mission ends when the scene is fully explored or the exploration terminates unexpectedly. Unexpected termination of exploration refers to the robot lingering in a local area for an extended period of time. According to the evaluation metrics in “Simulation setup and implementation details”, Table 1 and Fig. 5 show the quantitative analysis of simulation comparisons.

In most scenarios, an \(80\%\) exploration ratio indicates that the robot has acquired the essential topological information, including the overall terrain structure and connectivity. From this viewpoint, Fig. 5 illustrates that our method improves exploration speed by 7.0–\(11.1\%\) over the state-of-the-art Potential Field based method in [7]. When the exploration ratio reaches \(98.5\%\), we consider a previously unknown environment to be fully explored. Under this criterion, Fig. 5 illustrates that our method improves exploration speed by 4.0–\(16.2\%\) over the Potential Field based method in [7]. Naive Exp [33] exhibits performance comparable to Cost [6] in small-scale environments, and its relative performance gradually decreases as the scene scale increases. The performance of the classic Cost and Sample [17] methods is overall inferior to the Potential Field based method [7] and to our method.

The superiority of our method in terms of map quality and exploration efficiency is shown in Table 1. Specifically, our method results in the shortest average exploration time and a competitive exploration ratio. Benefiting from the map accuracy-based rewards and penalties mentioned in “Navigation pipeline and reward function”, our method also exhibits competitive IoU metrics. In terms of total path length, our method is dominant in small-scale scenarios and slightly inferior to the classical Cost [6] in large-scale scenarios. The reason for this may be that Cost always selects the nearest frontier for exploration, resulting in the lowest distance consumption; a similar phenomenon is also observed in the simulation tests of [47]. Figure 6 shows the qualitative performance of our method in the three evaluation environments. The smooth fluorescent exploration trajectories in Fig. 6 indicate smooth motion with little jitter. A more specific kinematic analysis is presented in “Mobile robot experiments”.

Fig. 5 a–c Illustrate the trend of exploration ratio over time for different methods in three different evaluation scenarios, respectively, including mean and variance fluctuations over 10 experiments. These figures further highlight the exploration time consumption at \(80\%\) and \(98.5\%\) exploration ratios, respectively

Fig. 6 a–c Illustrate the exploration performance of our method in three different evaluation scenarios, respectively. Fluorescent green marks the smooth motion trajectories of the robots. In the middle layer of the figures, the frontiers are marked in blue

Table 2 Ablation studies for invalid action masking

Ablation studies for invalid action masking

To examine the empirical importance of invalid action masking, we compare the following four strategies to handle invalid actions.

  • Without IAM: All nodes of a GVD can be selected as exploration goals for the robot and invalid action masking is not considered.

  • Naive IAM: At each time step t, the robot receives the same mask on the node features of the GVD as described for invalid action masking. The action is still sampled according to invalid action masking, but the gradient is updated according to Eq. (13), that is, no differentiable function \({\text {inv}}_s\) is applied to the logits. We call this implementation naive invalid action masking because its gradient does not replace the gradient of the logits corresponding to invalid actions with zero.

  • Masking Removed: At each time step t, the robot receives the same mask on the node features of a GVD as described for invalid action masking, and trains in the same way as the robot trained under invalid action masking. However, we then evaluate the robot without providing the mask. In other words, in this scenario, we evaluate what happens if we train with a mask, but then perform without it.

  • IAM: At each time step t, the robot receives a mask on the node features of a GVD such that only frontiers (valid actions) with information gain can be selected as exploration goals.

Fig. 7 a–c Illustrate the exploration process of our method in the mobile robot experiment. The motion trajectories for three different strategies are shown in c: blue, ours; orange, MMPF; and green, Cost. d Illustrates the experimental ground and the robot platform

Table 3 Statistics of mobile robot experiments

The above four strategies are each employed to conduct 5 exploration simulations in Scene 1 and Scene 2, respectively. Under the premise of exploring as completely as possible, we focus on the contribution of IAM to map quality and exploration efficiency. The quantitative results of the ablation studies for IAM are shown in Table 2, which shows that IAM scales well in both scenes, while the case Without IAM performs the worst. IAM leads to the shortest exploration time and total path length while guaranteeing a high IoU metric. It is worth noting that Masking Removed is still able to perform reasonably well, which shows that policies trained with IAM can still produce useful behaviors when the mask can no longer be provided. In addition, the performance of Naive IAM is second only to IAM, reflecting the necessity of masking invalid actions to improve the performance of exploration strategies.

Mobile robot experiments

To verify the effectiveness of our method, multiple mobile robot experiments are conducted in a complex real-world environment. As shown in Fig. 7, a lobby with a size of 17.5 \(\times \) 4.0 m is used as the experimental ground. The main onboard computer and the low-level control board of the robot are an NVIDIA Jetson AGX Xavier and an OpenCR, respectively. The robot is equipped with a 360\(^\circ \) lidar sensor with a sensing distance of 5 m. The maximum linear velocity and angular velocity of the mobile robot are 0.18 m/s and 1.8 rad/s, respectively.

By deploying different exploration strategies to the same mobile robot, we mainly compare the performance of GVN with the state-of-the-art exploration method proposed in [7]. Each exploration strategy is used to conduct 5 experiments in the real scenario. The decision frequency for autonomous exploration using different methods is uniformly set to 5 Hz. The maximum side length of the GVD in our method is set to 2 m. Each round of GVD update in the experiment only takes tens of milliseconds.

Fig. 8 An illustration of the trend of exploration ratio over time in the mobile robot experiments

Fig. 9 a–c Illustrate the kinematic metrics during exploration using three different methods in the mobile robot experiments, respectively. a GVN, b Potential Field, and c Cost

Taking Cost [6] as a reference, the quantitative analysis of the experiments is shown in Table 3 and Fig. 8. Figure 8 shows the trend of exploration ratio over time, confirming the advantage of our method in terms of exploration speed. The specific experimental results in Table 3 confirm the advantages of our method in terms of map quality and exploration efficiency. Specifically, our exploration strategy leads to shorter exploration time and total path length while ensuring the highest IoU and exploration ratio. The qualitative analysis of the experiments is presented in Figs. 7 and 9. Figure 7a–c illustrate the exploration process of our method. The motion trajectories for three different strategies are shown in Fig. 7c, where our strategy results in the shortest and smoothest trajectory. Figure 9 illustrates the kinematic metrics during exploration using three different methods, respectively. With the same robot configuration and decision frequency, our method results in the smallest velocity and acceleration fluctuations, which reflects that our exploration strategy reduces the frequency of back-and-forth movements and wandering.

Conclusion and future work

In this work, we propose a model-free DRL-based smart exploration strategy built on VDC, GVN, and IAM to sense and map the workspace efficiently. Our lidar-based exploration strategy is free from human intervention and task-related assumptions, learning the policy by interacting with the environment in a trial-and-error manner. We adopt GVD-based domain conversion to obtain a high-dimensional TER, overcoming the dimensional curse of the action and observation spaces. The GVN with spatial awareness and episodic memory is utilized to learn exploration strategies online. Additionally, IAM is used to reshape the configuration space and cope with low sampling efficiency for faster and more complete exploration.

Extensive baseline tests and comparative simulations confirm the feasibility and superiority of our method, and by deploying our method on a mobile robot we alleviate the awkward transfer from simulation to reality. However, the proposed method still faces some limitations. For example, the ability of our model to cope with dynamically changing environments is unknown. In addition, considering the complexity of 3D point cloud processing, the construction of the TER may become computationally expensive when extended to a 3D workspace. In future work, we will investigate these two aspects, refine the existing work, and try to extend the proposed method to 3D exploration tasks and crowded environments with dynamic pedestrians.