Abstract
Robotic motion planning in dense and dynamic indoor scenarios constantly challenges researchers because of the unpredictable motion of obstacles. Recent progress in reinforcement learning enables robots to better cope with dense and unpredictable obstacles by encoding complex features of the robot and obstacles with encoders such as the long short-term memory (LSTM). These features are then learned by the robot using reinforcement learning algorithms, such as the deep Q network and the asynchronous advantage actor-critic algorithm. However, existing methods depend heavily on expert experience to speed up network convergence by initializing the networks via imitation learning. Moreover, approaches that use LSTM to encode the obstacle features are not always efficient and robust enough, and therefore sometimes cause the network to overfit during training. This paper focuses on the advantage actor-critic algorithm and introduces an attention-based actor-critic algorithm with experience replay to improve on existing algorithms from two perspectives. First, the LSTM encoder is replaced by a robust attention-weight encoder to better interpret the complex features of the robot and obstacles. Second, the robot learns from its past prioritized experiences to initialize the networks of the advantage actor-critic algorithm. This is achieved by applying the prioritized experience replay method, which makes the best of useful past experiences to improve convergence speed. As a result, the network based on our algorithm takes only around 15% and 30% of the experiences to get through early-stage training without expert experience in cases with five and ten obstacles, respectively. It then converges faster to a better reward with fewer experiences (near 45% and 65% of the experiences in cases with ten and five obstacles, respectively) compared with the baseline LSTM-based advantage actor-critic algorithm.
Our source code is freely available on GitHub (https://github.com/CHUENGMINCHOU/AWPERA2C).
Introduction
Service robots, especially indoor service robots, nowadays appear in scenarios such as airports, restaurants, and train stations to provide simple services to visitors, for instance luggage delivery, food delivery, and direction consulting. However, these robots suffer from poor motion-planning performance in scenarios with dense and dynamic obstacles (pedestrians) because of the unpredictable motion of these obstacles. This barricades the further commercial use of these service robots. The motions of some robots are controlled by classical path-planning algorithms such as graph search [e.g., A* (Hart et al., 1968)], sample-based algorithms [e.g., the rapidly-exploring random tree (RRT) (Bry & Roy, 2011)], and interpolating-curve algorithms (Farouki & Sakkalis, 1994; Funke et al., 2012; González et al., 2014; Reeds & Shepp, 1990; Xu et al., 2012). These algorithms work well in static environments or low-speed scenarios with few obstacles. However, they lead to more collisions in cases with dense and dynamic obstacles, because these algorithms generate motions or paths in an online way. Online motion generation depends on updates of environmental maps, which require considerable computing resources. Reaction-based algorithms such as the dynamic window approach (DWA) (Fox et al., 1997) and optimal reciprocal collision avoidance (ORCA) (Van Berg et al., 2008) react fast to obstacle unpredictability, enabling the robot to better avoid slow-moving obstacles. However, these algorithms still require online updates of environmental information, whose cost cannot be ignored in cases with dense and dynamic obstacles.
Deep learning (DL) algorithms generate robotic motions by running a trained model, whose inference time is short enough to be ignored. Classical DL such as the convolutional neural network (CNN) (Bai et al., 2019) can generate instant motions in response to dynamic obstacles. These motions are one-step predictions that do not consider the task goal, and therefore eventually yield suboptimal solutions or trajectories. Recent progress in deep reinforcement learning (RL), such as optimal-value RL [e.g., dueling deep Q network (DQN) (Wang et al., 2016)] and policy-gradient RL [e.g., the asynchronous advantage actor-critic algorithm (A3C) (Mnih et al., 2016)], enables the robot to consider the task goal and instant obstacle avoidance simultaneously. These algorithms obtain near-optimal solutions and better cope with dense and dynamic obstacles. However, RL algorithms face many challenges, such as overfitting and slow convergence caused by high bias or variance. Improving the RL algorithms alone is not enough to reach the desired motion-planning performance; it can be further improved through other factors, such as the input quality.
Input quality in this paper is defined as the efficacy of data and the efficiency of making the best of useful data. This means: (1) the input data of the algorithm should fully represent the environmental information or environmental state (e.g., the relationships among obstacles, and the speed, radius, moving direction, and position of obstacles); (2) the algorithm should first select high-quality input data (e.g., a trajectory in which the robot reaches the goal), and then make the best of it via data replay and replay strategies. The efficacy of data denotes how well the data represents the environmental information. High data efficacy depends on suitable methods to describe the environmental information, which provide qualified, comprehensive data without noise for RL algorithms to learn from.
Some works use source data from the environment, such as raw images (Bai et al., 2019), directly as the input of algorithms. This type of data includes much noise (e.g., background information), which may cause the algorithm to overfit. Other works generate datasets with methods (e.g., LSTM) that only partially interpret the environmental information (Everett et al., 2018). Algorithms based on these methods cannot fully learn the core information from the dataset, and overfitting or slow convergence follows. High efficiency in making the best of data relies on methods to reuse the dataset generated by running the algorithm itself on the robot. Recent RL algorithms based on experience replay (Wang et al., 2017) and prioritized experience replay (PER) (Schaul et al., 2016) make it possible to better exploit a dataset, thereby achieving decent performance in many continuous-control problems. These methods are more data-efficient and more independent than online RL [e.g., online A3C (Mnih et al., 2016)] and imitation-learning (IL) based methods (Chen et al., 2019b). Online RL is a data guzzler that discards the data after training. IL-based methods depend heavily on other algorithms or artificial data to generate expert experiences (e.g., trajectories generated via ORCA, or trajectories planned by humans) to initialize the RL algorithms.
To cope with the shortcomings mentioned above, this paper first focuses on the advantage actor-critic algorithm (A2C), a robust policy-gradient RL algorithm for sequential-decision robotic motion-planning problems. The performance of robotic motion planning is then improved by enhancing the input quality (the efficacy of data, and the efficiency of making the best of useful data). The main contributions of this paper include:

(1)
Improvement in the efficacy of data: improving the state-of-the-art long short-term memory (LSTM) based actor-critic algorithm by replacing the LSTM encoder with an attention-weight (AW) encoder for the interpretation of environmental information.

(2)
Combination of online and offline A2C with PER: fitting PER into a combined online and offline A2C to quickly improve its convergence speed, so that the AW-based A2C converges steeply to a better reward without expert experience from other methods. To our knowledge, our method is the first to specifically focus on these two aspects of input quality to improve robotic motion planning in dense and dynamic scenarios. Other works are data guzzlers, or they depend heavily on expert experience.
The remainder of this paper consists of four sections: Sect. 2 is the research background, which describes related works, RL preliminaries, and the problem formulation of complex robotic motion planning; Sect. 3 presents the research methods, which consist of the principles for designing the RL network, the combination of online/offline A2C, the LSTM/AW encoders, and the AW-PER-based A2C with its training strategies; Sect. 4 presents the experimental results, including the design of the network architecture, model training, and model evaluation; Sect. 5 analyzes the problems found in training and their solutions, followed by future research directions.
Research background
The research background includes three parts: RL preliminaries, related works, and the problem formulation. Some RL preliminaries are listed to provide a basic understanding and definitions of RL-related terms. Related works focus on recent progress of RL algorithms in solving continuous-control problems. This progress is elaborated at two levels: the algorithm level and the input-quality level. The algorithm level denotes state-of-the-art RL algorithms for solving time-sequential problems in games. The input-quality level covers recent algorithms to interpret and reuse data, improving data efficacy and the efficiency of making the best of it.
Preliminary of RL
Markov decision process
The Markov decision process (MDP) is a sequential decision process based on the Markov chain (Bas, 2019), which is a variable set \({\varvec{X}}=\left\{{X}_{n}:n>0\right\}\) with \(p\left({X}_{t+1}\mid {X}_{t},\dots ,{X}_{1}\right)=p({X}_{t+1}\mid {X}_{t})\). This means the state and action of the next step only depend on the state and action of the current step. An MDP is described as a tuple \(<S,A,P,R>\). State \(S\): \(S\) denotes the state, here referring to the state of the robot and obstacles. Action \(A\): \(A\) denotes an action taken by the robot. Reward \(R\): \(R\) denotes the reward or punishment received by the robot after executing actions. State transition probability \(P\): \(P\) denotes the probability of transiting from one state to the next state.
Value function
The values denote how good a state is, or how good an action is in a state; they are called the state value (V value) and the state-action value (Q value), respectively. Values are defined as the expectation of accumulated rewards \(V\left(s\right)={\mathbb{E}}\left[{R}_{t+1}+\gamma {R}_{t+2}+\dots +{\gamma }^{T-1}{R}_{T}\mid {s}_{t}\right]\) or \(Q\left(s,a\right)={\mathbb{E}}\left[{R}_{t+1}+\gamma {R}_{t+2}+\dots +{\gamma }^{T-1}{R}_{T}\mid {(s}_{t},{a}_{t})\right]\), where \(\gamma \) is a discount factor. The value function in deep RL is represented by neural networks that estimate the value of the environmental state via function approximation (Baird, 1995).
Policy function
The policy denotes the way to select actions. The policy function is also represented by neural networks in deep RL, and actions are decided either indirectly [e.g., \(a\leftarrow \arg {\max }_{a}R\left(s,a\right)+Q(s,a;\theta )\) in DQN (Mnih et al., 2013; Mnih et al., 2015)] or directly [e.g., \({\pi }_{\theta }:s\to a\) in the actor-critic algorithm (Konda & Tsitsiklis, 2000)].
Related works
Algorithm level
RL algorithms basically include optimal-value RL and policy-gradient RL. They are primarily tested in games such as the Atari games, and their representatives are DQN (Mnih et al., 2013) and the actor-critic algorithm (Konda & Tsitsiklis, 2000). These two algorithms have been continuously improved from different perspectives, and many variants followed. DQN evolved into double DQN (Van Hasselt et al., 2016) and dueling DQN (Wang et al., 2016), while the actor-critic algorithm evolved along three lines: multi-thread policy improvement, deterministic policy improvement, and monotonic policy improvement. Variants based on multi-thread policy improvement focus on making the best of multiple threads and the policy entropy to accelerate convergence; typical examples are A3C and A2C (Mnih et al., 2016). Deterministic policy improvement shows that the policy to select actions is stable in a state s and that actions can be directly decided by this state, \(a\leftarrow {\mu }_{\theta }\left(s\right)\), while its counterpart, the stochastic policy, selects actions by probability, \(a\leftarrow {\pi }_{\theta }\left(a\mid s\right)\). Typical examples based on deterministic policy improvement are the deterministic policy gradient (DPG) (Silver et al., 2014) and the deep deterministic policy gradient (DDPG) (Munos et al., 2016). Monotonic policy improvement introduces a trust-region constraint, a surrogate objective, and an adaptive penalty to ensure the monotonic update of the policy. Typical instances are trust region policy optimization (TRPO) (Schulman et al., 2015) and proximal policy optimization (PPO) (Schulman et al., 2017).
Input quality level
(1) Data interpretation: early-stage RL research uses CNNs (Bai et al., 2019) to preprocess source images for feature extraction. These methods act like a “black box”, because it is hard to know what the CNN-based RL algorithms learn from the source images. Source images also include much noise, which can cause the network to overfit. More recently, the features of the agents in the environment (e.g., robots, obstacles, or pedestrians) have been defined in explicit forms (e.g., vectors or tensors) according to the requirements of the motion-planning task, and these features are encoded to form an integrated description or interpretation of the environmental information for RL algorithms to learn from. Representative encoding methods include LSTM (Everett et al., 2018; Inoue et al., 2019), AW (Chen et al., 2019b; Lin et al., 2017), and the relation graph (RG) (Chen et al., 2019a). They are all robust methods, but the motion-planning performance based on them varies, because it depends not only on the robustness of these methods but also on how they are used. For example, LSTM (Everett et al., 2018) only encodes partial environmental information, so RL algorithms also only partially learn the information required by motion-planning tasks, which may cause overfitting and suboptimal solutions in training. (2) Experience replay: online RL is a data guzzler, so many researchers turn to offline or batch learning to make the best of a limited dataset and improve convergence speed. A milestone work is DQN (Mnih et al., 2013), which stochastically samples and learns from a dataset stored in memory.
However, stochastic sampling ignores the importance or priority of data, which is essential for further improvement of convergence. A greedy sampling method (Schaul et al., 2016) was therefore introduced to speed up convergence, but its results are often suboptimal. PER (Schaul et al., 2016) solves this problem by finding a better trade-off between stochastic and greedy sampling. Following works on experience replay justify PER from mathematical perspectives (Li et al., 2021). There are also some variants (Jiang et al., 2020; Zha et al., 2019), but their improvements in convergence are limited.
Problem formulation
Recent work (Van Berg et al., 2008) introduced a competent simulation environment (Fig. 1) that includes a dynamic robot and obstacles in a fixed-size 2D indoor area. The robot and obstacles move towards their own goals simultaneously while avoiding collisions with each other. They obey the same or different motion-planning policies to avoid collisions. This simulation environment creates circle-crossing and square-crossing scenarios that add predictable complexity to the environment. It is therefore a good platform to evaluate the algorithms adopted by the robot or obstacles.
The robot and obstacles plan motions towards their goals and avoid collisions by sequential decision making. Let \(s\) represent the state of the robot. Let \(a\) and \(v\) represent the action and velocity of the robot, with \(a=v=\left[{v}_{x},{v}_{y}\right]\). Let \(p=\left[{p}_{x},{p}_{y}\right]\) represent the position of the robot, and let \({s}_{t}\) represent the state of the robot at time step t. \({s}_{t}\) is composed of observable and hidden parts, \({s}_{t}=\left[{s}_{t}^{obs},{s}_{t}^{h}\right]\), \({s}_{t}\in {R}^{9}\). The observable part refers to the factors of state that can be measured by others; it is composed of the position, velocity, and radius, \({s}^{obs}=\left[{p}_{x},{p}_{y},{v}_{x},{v}_{y},r\right]\), \({s}^{obs}\in {R}^{5}\). The hidden part refers to the factors of state that cannot be seen by others; it is composed of the planned goal position, preferred speed, and heading angle, \({s}^{h}=\left[{p}_{gx},{p}_{gy},{v}_{pref},\theta \right],{s}^{h}\in {R}^{4}\). The state, position, and radius of obstacles are denoted by \(\widehat{s}\), \(\widehat{p}\), and \(\widehat{r}\).
To analyze this decision-making process, we first introduce the one-robot one-obstacle case and then extend it to the one-robot multi-obstacle case. The robot plans its motion by obeying policy \(\pi :\left({s}_{0:t},{\widehat{s}}_{0:t}^{obs}\right)\to {a}_{t}\) while obstacles obey \(\widehat{\pi }:\left({\widehat{s}}_{0:t},{s}_{0:t}^{obs}\right)\to {a}_{t}\). The objective of the robot is to minimize the time to its goal \(E\left[{t}_{g}\right]\) (Eq. 1) under the policy \(\pi \) without collisions with obstacles. The constraints of the robot's motion planning in this sequential decision problem can be formulated via Eqs. 2–5, which represent the collision avoidance constraint, the goal constraint, the kinematics of the robot, and the kinematics of obstacles, respectively. The collision avoidance constraint denotes that the distance between the robot and obstacles \({\Vert {p}_{t}-{\widehat{p}}_{t}\Vert }_{2}\) should be greater than or equal to the radius sum of the robot and obstacles \(r+\widehat{r}\). The goal constraint denotes that the position of the robot \({p}_{tg}\) should equal the goal position \({p}_{g}\) when the robot reaches the goal. The kinematics of the robot denotes that the position of the robot at time step t, \({p}_{t}\), equals the sum of the robot position at time step t−1, \({p}_{t-1}\), and the change of the robot position \(\Delta t\cdot \pi \left({s}_{0:t},{\widehat{s}}_{0:t}^{obs}\right)\), where \(\pi \left({s}_{0:t},{\widehat{s}}_{0:t}^{obs}\right)\) is a velocity decided by the policy \(\pi \). The kinematics of obstacles is the same as that of the robot.
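The kinematic update above can be sketched numerically. A minimal example, assuming the policy has already returned a velocity \(v=[v_x,v_y]\) (the function name and time step below are illustrative):

```python
def step_kinematics(p_prev, velocity, dt):
    """Robot kinematics p_t = p_{t-1} + dt * v, where v is the velocity
    chosen by the policy pi at the current step."""
    return [p_prev[0] + dt * velocity[0], p_prev[1] + dt * velocity[1]]

# One step from the origin with velocity (1.0, 0.5) and dt = 0.1.
p_t = step_kinematics([0.0, 0.0], velocity=[1.0, 0.5], dt=0.1)
```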
The constraints of the one-robot one-obstacle case can easily be extended to the one-robot N-obstacle case, where the objective (Eq. 1) is replaced by \(\mathrm{minimize}\; E\left[{t}_{g}\mid {s}_{0},\left\{{\widehat{s}}_{0}^{obs}\dots {\widehat{s}}_{N}^{obs}\right\},\pi ,\widehat{\pi }\right]\), assuming that all obstacles use the same policy \(\widehat{\pi }\). The collision avoidance constraint (Eq. 2) is replaced by
assuming that obstacles have the same radius \(\widehat{r}\). \({\widehat{p}}_{N-1:t}\) denotes the position of the Nth obstacle at time step t. The kinematics of the robot is replaced by \({p}_{t}={p}_{t-1}+\Delta t \cdot \pi \left({s}_{0:t},\{{\widehat{s}}_{0:t}^{obs}\dots {\widehat{s}}_{N-1:t}^{obs}\}\right)\). The kinematics of obstacles is replaced by
Methods
This section first describes the principles for designing the RL network. These principles concern strategies to configure the number of layers and the number of nodes in each layer (Sect. 3.1). Then, the principles of combining online A2C and offline A2C with PER are given (Sect. 3.2), followed by the principles of the LSTM encoder and AW encoder (Sect. 3.3). Last, the AW-based A2C is further improved by our algorithm, the AW-PER-based A2C, and some training strategies are also presented (Sect. 3.4).
Our contributions are the AW-based A2C and the AW-PER-based A2C (Sect. 3.4). Note that the AW-based A2C is just a part of the AW-PER-based A2C; hence, we only present the pseudocode of the AW-PER-based A2C (Algorithms 1–4). The AW-based A2C is the optimized version of the LSTM-based A2C, obtained by replacing the LSTM encoder with the AW encoder. The AW-PER-based A2C is the optimized version of the AW-based A2C, obtained by combining online and offline learning (batch learning) to update the network. The architectures of the AW-based A2C and AW-PER-based A2C are shown in Fig. 2, and their workflows are almost the same:

(1)
Both algorithms collect the same source environment information (the robot and obstacles), which is encoded by the attention encoder.

(2)
The encoded environment information is then fed to the A2C algorithm to update the network (model). However, the network of the AW-based A2C is updated only by online A2C, while that of the AW-PER-based A2C is updated by online and offline A2C simultaneously.

(3)
The initial network configurations of these two algorithms are generally suboptimal. Hence, the number of layers and the number of nodes in each layer need to be tuned to obtain the optimal network configuration.
Principles to design networks for RL
Principles of configuration for number of layer and node
A network with one layer is only used for simple linear problems (e.g., two-class classification). A multilayer perceptron (MLP) with one hidden layer can approximate any function that we require (Hornik et al., 1989). A feedforward network can approximate any Borel measurable function, provided it has a linear output layer and at least one hidden layer with any “squashing” activation function (e.g., the logistic sigmoid activation function). Such a feedforward network can approximate any mapping from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units (Goodfellow et al., 2016). An artificial neural network (e.g., an MLP) with two hidden layers is sufficient to create classification regions of any desired shape (Lippmann, 1988). Although a single hidden layer is optimal for some functions, there are others for which a single-hidden-layer solution is very inefficient compared to solutions with two or more hidden layers (Reed & Marks II, 1999).
However, the question of how many nodes should be used in each layer has not been solved yet. Currently, selecting the number of layers and nodes is more art than science: configuring layers and nodes relies on empirical findings in the literature or intuitions from experience (Goodfellow et al., 2016).
Strategies in configuration of layer and node
These empirical findings are an efficient starting point, but finding better configurations of layers and nodes still requires robust test harnesses and controlled experiments, which include some basic strategies (Brownlee, 2018):

(1)
Random test: random configurations for the number of layers and the number of nodes in each layer.

(2)
Grid test: systematic search for the number of layers and the number of nodes in each layer.

(3)
Heuristic test: directed search for the number of layers and the number of nodes in each layer according to search algorithms [e.g., genetic algorithm (Stathakis, 2009)].

(4)
Exhaustive test: all combinations of the number of layers and the number of nodes in each layer; this strategy is only feasible for simple networks or datasets.
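The grid test in strategy (2) is straightforward to script. A minimal sketch, where `evaluate` is a hypothetical user-supplied function that trains a network with the given configuration and returns a score to maximize (the toy stand-in below is purely illustrative):

```python
import itertools

def grid_test(layer_options, node_options, evaluate):
    """Grid test: systematically try every (layers, nodes) combination
    and keep the configuration with the best score."""
    best_config, best_score = None, float("-inf")
    for n_layers, n_nodes in itertools.product(layer_options, node_options):
        score = evaluate(n_layers, n_nodes)
        if score > best_score:
            best_config, best_score = (n_layers, n_nodes), score
    return best_config, best_score

# Toy stand-in for a real training/evaluation run: prefers 2 layers of 64 nodes.
dummy_eval = lambda L, n: -abs(L - 2) - abs(n - 64) / 64
config, score = grid_test([1, 2, 3], [32, 64, 128], dummy_eval)
```

The random and heuristic tests differ only in how the candidate configurations are generated; the evaluation harness stays the same.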
Combination of online A2C and offline A2C with PER
Online A2C
A2C is a variant of A3C (Mnih et al., 2016), and they share the same objective. The difference is that A3C collects experience and updates the weight of the network individually and asynchronously in each thread, while A2C only collects experience in each thread and updates the weight of the network synchronously. The objective (loss function) of A2C is defined as the expectation of the losses of the policy function and value function:
where \({\beta }^{online}\) is a discounted factor. Losses of policy function and value function are defined by
where \({\mathcal{H}}_{t}^{{\pi }_{\theta }}={\sum }_{a}\pi \left(a\mid {s}_{t}\right)\mathrm{log}\pi \left(a\mid {s}_{t}\right)\) is the policy entropy term to encourage exploration for finding better potential actions, and \(\alpha \) is a discount factor. \({V}_{t}^{n}={\gamma }^{n}{V}_{\theta }({s}_{t+n})+{\sum }_{m=0}^{n-1}{\gamma }^{m}{r}_{t+m}\) is the n-step discounted accumulated state value, representing how good the state \({s}_{t}\) is. In our experiment, the data are episodic experiences, so n is defined as the number of steps in each episode.
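The n-step discounted value \(V_t^n\) above can be computed directly from an episode's rewards and the critic's bootstrap estimate. A minimal sketch (the function name is ours):

```python
def n_step_value(rewards, bootstrap_value, gamma=0.99):
    """n-step discounted state value:
    V_t^n = gamma^n * V(s_{t+n}) + sum_{m=0}^{n-1} gamma^m * r_{t+m}.

    `rewards` holds r_t ... r_{t+n-1}; `bootstrap_value` is the critic's
    estimate V(s_{t+n}) at the episode cut-off (0.0 for a terminal state).
    """
    n = len(rewards)
    value = gamma ** n * bootstrap_value
    for m, r in enumerate(rewards):
        value += gamma ** m * r
    return value

# Example: two steps of reward 1.0, terminal state afterwards.
v = n_step_value([1.0, 1.0], bootstrap_value=0.0, gamma=0.5)  # 1.0 + 0.5 = 1.5
```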
Offline A2C and the combination of online/offline A2C
Oh et al. (2018), inspired by PER (Schaul et al., 2016), propose an offline actor-critic loss, so PER can easily be integrated into actor-critic based algorithms. The offline actor-critic loss is defined by
where \({\beta }^{offline}\) is a discounted factor. \({L}_{policy}^{offline}\) and \({L}_{value}^{offline}\) are defined accordingly by
where \({R}_{t}={\sum }_{k=t}^{\infty }{\gamma }^{k-t}{r}_{k}\) is the Monte-Carlo return used instead of the accumulated return \({V}_{t}^{n}\), and \({\left(\cdot \right)}_{+}=\mathrm{max}(\cdot ,0)\).
To make the best of offline A2C to improve the convergence speed of online A2C, two problems should be solved: (1) selection of useful data; (2) the different ways of weight update caused by different data distributions in online learning and batch learning. Online learning updates its weights in a stochastic way; its dataset need not fit any distribution, so the weight update is unbiased. The dataset in batch learning, however, is expected to be independently and identically distributed (IID). Therefore, the weight update in batch learning mismatches that in online learning, introducing bias if online learning and batch learning are naively combined.
PER solves these two problems by setting the priority of data and applying the importance-sampling weight in batch learning. The priority is defined by the temporal-difference (TD) error \(\delta \):
where \(\varepsilon \) is a small positive constant, while the priority in offline A2C is set to the lower bound of the TD error:
Hence, the probability to sample the experience is defined by
where \(\alpha \) is the discount of the priority, k is the set of transitions in one episode, and \(i\in k\). The bias caused by batch learning is reduced by the importance-sampling weight, defined by
where \(\frac{1}{{max }_{i}{w}_{i}}\) is a normalization weight to stabilize the weight update, N is the sample size, and \(\beta \in [\mathrm{0,1}]\) compensates for \(P\left(i\right)\). Hence, the weight update in offline/batch learning turns into
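The prioritized sampling probability and importance-sampling weight above can be sketched as follows, assuming the proportional-priority variant with typical constants \(\alpha=0.6\) and \(\beta=0.4\) (the function name and defaults are illustrative, not the authors' exact code):

```python
import numpy as np

def per_sample(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-2, rng=None):
    """Prioritized sampling sketch:
    p_i = |delta_i| + eps,  P(i) = p_i^alpha / sum_k p_k^alpha,
    w_i = (N * P(i))^-beta, normalized by the largest weight."""
    rng = rng or np.random.default_rng(0)
    priorities = np.abs(np.asarray(td_errors, dtype=float)) + eps
    probs = priorities ** alpha / np.sum(priorities ** alpha)
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** -beta
    weights /= weights.max()          # normalize so the largest weight is 1
    return idx, weights

idx, w = per_sample([0.1, 2.0, 0.5, 0.0], batch_size=2)
```

Transitions with large TD errors are sampled more often, while the weights shrink their gradient contribution to correct the induced bias.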
LSTM and AW encoders
The LSTM encoder uses an LSTM to encode the obstacles near the robot, forming the description of the environmental state \({S}^{lstm}\), which includes the features of the robot and obstacles
where \({S}_{o}^{lstm}\) denotes the features of the obstacles in the environment. Obstacles are encoded by the LSTM according to their distances to the robot (Everett et al., 2018):
where N denotes the number of obstacles, \({d}_{i}\) the distance from obstacle \({o}_{i}\) to the robot, and \({e}_{{d}_{min}}^{lstm}\) and \({e}_{{d}_{max}}^{lstm}\) the pairwise features of the obstacles with the shortest and largest distances to the robot. A pairwise feature is defined as the combined feature of the robot and one of the obstacles:
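The order-dependence of this encoding can be illustrated with a short sketch: pairwise features are sorted by distance (farthest first, so the closest obstacle has the final say) and folded through a recurrent cell. A plain tanh RNN cell stands in for the LSTM to keep the example dependency-free, and the weight shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def lstm_style_encode(pairwise, distances, w_x, w_h):
    """Order-dependent encoding of obstacle pairwise features: obstacles
    are fed in order of decreasing distance to the robot, and the final
    hidden state serves as S_o^lstm."""
    order = np.argsort(distances)[::-1]           # farthest ... closest
    h = np.zeros(w_h.shape[0])
    for i in order:
        h = np.tanh(w_x @ pairwise[i] + w_h @ h)  # recurrent update
    return h

pairwise = rng.normal(size=(5, 6))   # 5 obstacles, 6-dim pairwise features
dists = np.array([1.0, 3.0, 0.5, 2.0, 4.0])
w_x = rng.normal(size=(8, 6))        # input weights (hypothetical sizes)
w_h = rng.normal(size=(8, 8))        # recurrent weights
s_o = lstm_style_encode(pairwise, dists, w_x, w_h)
```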
AW encoder is based on the attention weight (Chen et al., 2019b; Lin et al., 2017). It also defines the environmental state as the feature combination of the robot and obstacles:
where \({S}_{o}^{aw}\) denotes the features of the obstacle encoded by AW, and it is defined by
where \({\alpha }_{i}\) and \({h}_{i}\) denote the attention score and the interaction feature of the robot and obstacle \({o}_{i}\) respectively. The interaction feature is defined as
where \({f}_{h}(\cdot )\) and \({w}_{h}\) denote the neural network and its weight. \({e}_{i}\) denotes the embedded feature of the robot and obstacle \({o}_{i}\). The attention score is defined by
where \({f}_{\alpha }(\cdot )\) and \({w}_{a}\) denote the neural network and its weight. \({e}_{mean}\) denotes the mean of all embedded features. The embedded feature and \({e}_{mean}\) are defined by
where \({f}_{e}(\cdot )\) and \({w}_{e}\) denote the neural network and its weight. \({M}_{i}\) denotes the occupancy map of obstacle \({o}_{i}\), defined by
where \({w}_{j}^{^{\prime}}\) is a local state vector of obstacle \({o}_{j}\), and \({\mathcal{N}}_{i}\) denotes the other obstacles near obstacle \({o}_{i}\). The indicator function \({\delta }_{ab}\left[{x}_{j}-{x}_{i},{y}_{j}-{y}_{i}\right]=1\) if \(({x}_{j}-{x}_{i},{y}_{j}-{y}_{i})\in (a,b)\), where \((a,b)\) is a two-dimensional cell.
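The AW pooling above (embedding, attention scores from \([e_i; e_{mean}]\), weighted sum of interaction features) can be sketched with simple linear maps standing in for the small networks \(f_\alpha\) and \(f_h\) (all weight shapes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def aw_encode(embedded, w_alpha, w_h):
    """Attention-weight pooling over N obstacle features:
    scores from [e_i ; e_mean], alpha_i via softmax, S_o = sum_i alpha_i h_i."""
    e_mean = embedded.mean(axis=0)                        # mean embedded feature
    scores = np.array([np.concatenate([e, e_mean]) @ w_alpha for e in embedded])
    alpha = softmax(scores)                               # attention scores
    h = embedded @ w_h                                    # interaction features
    return alpha @ h                                      # fixed-size S_o^aw

embedded = rng.normal(size=(5, 8))      # 5 obstacles, 8-dim embedded features
w_alpha = rng.normal(size=16)           # score weights for [e_i ; e_mean]
w_h = rng.normal(size=(8, 4))           # interaction-feature weights
s_o = aw_encode(embedded, w_alpha, w_h)
```

Unlike the LSTM encoding, the output here is a permutation-aware but order-free weighted sum, so the encoding length does not depend on the number of obstacles.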
AWPERbased A2C and its training strategies
This part elaborates our proposed algorithm, which features the new encoder (AW) and a combined online and offline (batch) learning approach. We first recall and clarify the definition of the environmental description, and then present the proposed algorithm.
The definition of environmental description (environmental state)
Before introducing our algorithm, it is necessary to clarify the different descriptions of state. Let us first recall the definitions of state from the Problem Formulation section:
where \({s}_{agent}\) denotes the state of an agent (obstacle or robot), consisting of its observable state \({s}^{obs}\) and hidden state \({s}^{h}\). Hence, the source state of the robot in the environment is described by
where \({\widehat{s}}_{N}^{obs}\) denotes the observable state of the Nth obstacle. However, the source state cannot be used directly for training; it is therefore replaced by robot-centric states transformed from the source state of the robot. The robot-centric states are defined by
where \({s}_{r}\) denotes the new state of the robot and \({o}_{i}\) denotes the robot-centric observable state of the ith obstacle. Note that i denotes the obstacle's order, which is generated randomly in the simulator when an episode of the experiment starts. In the definition of the robot-centric observable state \({o}_{i}\), \({r}_{i}+r\) denotes the collision constraint of each obstacle with respect to the robot. The collision constraint varies among obstacles, and to some degree it represents how “dangerous” the obstacle is. With its collision constraint, the robot easily learns how to keep a safe distance to each obstacle; otherwise, more data is required for the robot to learn the safe-distance strategy through trial and error. Finally, the state of the robot in the environment (the environmental description) used for training is defined by
Note that \({o}_{i}\) in \(\{{o}_{0}\dots {o}_{N}\}\) is the source description of the obstacles; it is not encoded by any encoder.
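A robot-centric transform of an obstacle's observable state might look like the sketch below. This is an illustrative rotation into a goal-aligned frame plus the \(r_i+r\) collision constraint, not the paper's exact \(o_i\) layout:

```python
import numpy as np

def robot_centric(robot_p, robot_goal, obs_p, obs_v, obs_r, robot_r):
    """Rotate an obstacle's observable state into a robot-centric frame
    whose x-axis points from the robot to its goal, and append the
    per-obstacle collision constraint r_i + r."""
    heading = np.arctan2(robot_goal[1] - robot_p[1], robot_goal[0] - robot_p[0])
    c, s = np.cos(heading), np.sin(heading)
    rot = np.array([[c, s], [-s, c]])                 # world -> robot frame
    rel_p = rot @ (np.asarray(obs_p) - np.asarray(robot_p))
    rel_v = rot @ np.asarray(obs_v)
    return np.concatenate([rel_p, rel_v, [obs_r + robot_r]])

# An obstacle straight ahead of a robot heading along +x.
o_i = robot_centric([0, 0], [1, 0], obs_p=[2, 0], obs_v=[0, 0],
                    obs_r=0.3, robot_r=0.3)
```

Expressing obstacles relative to the robot and its goal removes the absolute-position dependence, so the learned policy generalizes across start and goal placements.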
AWPERbased A2C and its training strategies
Our algorithm improves the description of the environmental state and the convergence speed by applying the AW encoder and PER to the A2C algorithm. Hence, the robot in a dense and dynamic scenario can better understand the obstacles nearby via a robust environmental state encoded by AW. Our AW-based A2C also improves on the convergence speed of the baseline LSTM-based A2C in early-stage training. PER further improves the convergence speed of the AW-based A2C by combining online learning and batch learning to learn from data generated by the AW-based A2C itself. We therefore introduce the AW-PER-based A2C, which converges steeply to a better reward without expert experience. Our algorithm is shown in detail in Algorithm 1, while Algorithms 2–4 are its subfunctions.
First, episodic actions are executed via the policy network \(\pi ({a}_{t}\mid {s}_{t};\theta )\) to generate experiences \(<{s}_{t},{a}_{t},{r}_{t},{s}_{t+1}>\) (lines 4–6). Second, Monte-Carlo returns \({R}_{t}\) are computed from the stored rewards \({r}_{t}\) of this episode. Episodic experiences \(({{s}_{t},a}_{t}, {R}_{t})\) are stored in \(\mathcal{E}\), and also in the prioritized replay buffer \(\mathcal{D}\) if \({R}_{t}>0\) (lines 7–12). Third, the AW-based A2C is trained in a combined manner via online learning and PER-based batch learning (lines 13–17). Fourth, these three steps repeat until the network gets through the first-stage training (lines 3–18). Fifth, the weight from the first-stage training is saved to initialize the network for the second-stage training, in which the AW-based A2C is trained only in an online manner (lines 20–32). The third-stage training is almost the same as the second-stage training; the difference is that the third-stage training uses a smaller step size (learning rate), which encourages a stable convergence of the network. Note that the state \({s}_{t}\) here refers to the source state \(s=[{s}_{r},\{{o}_{0}\dots {o}_{N}\}]\), which does not use any encoder to encode the states of nearby obstacles.
The subfunction \(\mathbf{Train\_A2C_{online}}(\mathcal{E})\) first computes the probability distribution of the actions and the values via the subfunction \(\mathbf{NN_{a2c}}(s)\) (line 1). Second, the TD error \(\delta\) is obtained by \(\delta = R_t - v(\cdot;\theta)\), which is used for computing the policy loss and value loss (lines 2–4). Third, the weight \(\theta\) of the AW-based A2C is updated according to the gradients of the value loss and the policy loss (line 5, where \(\eta\) is the learning rate).
Unlike \(\mathbf{Train\_A2C_{online}}(\mathcal{E})\), in which the network is trained once in an online manner, the network in the subfunction \(\mathbf{Train\_A2C_{offline}}(\overline{\mathcal{D}})\) can be trained M times (line 1) to further update the network intensively in a batch-learning manner. \(\mathbf{Train\_A2C_{offline}}(\overline{\mathcal{D}})\) cannot work independently; it must work with \(\mathbf{Train\_A2C_{online}}(\mathcal{E})\) as its supplement. First, a batch of experiences \(\overline{\mathcal{D}}\) is sampled from the prioritized replay buffer \(\mathcal{D}\) according to the probability \(P(j)=\frac{p_j^{\alpha}}{\sum_k p_k^{\alpha}}\) (line 2). Second, the importance-sampling weight is obtained by \(w_j=\frac{(N\cdot P(j))^{-\beta}}{\max_k w_k}\), where N denotes the number of experiences \(<s_j, a_j, R_j, p_j, idx>\) in \(\overline{\mathcal{D}}\) (line 3). Third, the policy distribution and value are obtained by executing the subfunction \(\mathbf{NN_{a2c}}(s)\). The TD errors \(\delta_j\) and new priorities \((\delta_j)_+\) are obtained accordingly by \(\delta_j = R_j - v(\cdot;\theta)\) and \((\cdot)_+=\max(\cdot,0)\) (lines 4–6).
Fourth, the policy loss and value loss in the batch learning of the AW-based A2C are obtained and used to update the network \(\theta\) (lines 7–9). Fifth, the priorities of the experiences are updated for the next training (line 10).
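The priority-proportional sampling and importance weights described above can be sketched like this; the function signature and default exponents are our assumptions:

```python
import random

# Hedged sketch of the PER sampling step (Algorithm 3, lines 2-3):
# P(j) = p_j^alpha / sum_k p_k^alpha, w_j = (N * P(j))^(-beta),
# normalised by the largest weight in the batch for stability.
def sample_batch(buffer, priorities, batch_size, alpha=0.6, beta=0.4):
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    probs = [s / total for s in scaled]
    idxs = random.choices(range(len(buffer)), weights=probs, k=batch_size)
    n = len(buffer)
    weights = [(n * probs[i]) ** (-beta) for i in idxs]
    w_max = max(weights)
    weights = [w / w_max for w in weights]
    return [buffer[i] for i in idxs], weights, idxs
```

With uniform priorities every experience is equally likely and all normalised weights are 1, recovering ordinary uniform replay.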
The subfunction \(\mathbf{NN_{a2c}}(s)\) performs the forward propagation of the AW-based A2C network, which consists of two parts: the AW network and the A2C network. First, the AW network computes the embedded feature \(e_i\) and interaction feature \(h_i\) by MLP layers \(f_e\) and \(f_h\) without activation functions (lines 1–2). Second, the mean of the embedded features \(e_{mean}\) is obtained from all embedded features (line 3). Third, the attention score \(\alpha_i\) of the robot to the obstacle \(o_i\) and its surrounding obstacles is obtained by an MLP layer \(f_{\alpha}\) (line 4). Fourth, the description of surrounding obstacles \(S_o^{aw}\) is obtained by \(S_o^{aw}=\sum_{i=1}^{n}[softmax(\alpha_i)]\cdot h_i\) (line 5). Fifth, the state of the robot and the description of surrounding obstacles form the full description of the environmental state \(S^{aw}\) (line 6). Finally, the policy distribution over actions and the state value are obtained (line 7) by the A2C network \(f_{a2c}\), whose configuration needs to be found experimentally for better performance of the AW-based A2C.
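The attention-weighted pooling on line 5 can be illustrated as below; this is a generic softmax-weighted sum, not the authors' exact layer implementation:

```python
import math

# Illustrative sketch of line 5 of NN_a2c: the crowd description
# S_o = sum_i softmax(alpha_i) * h_i over the obstacle interaction features.
def aw_pool(scores, interaction_feats):
    """scores: one attention score per obstacle; interaction_feats: h_i vectors."""
    m = max(scores)
    exps = [math.exp(a - m) for a in scores]     # numerically stable softmax
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(interaction_feats[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, interaction_feats))
              for d in range(dim)]
    return pooled, weights
```

With equal scores, each obstacle contributes equally; larger scores shift the pooled description toward the more important obstacles.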
Experiments and results
Experimental environment
The experiments are conducted in simulators with one robot and multiple obstacles. The behaviors of the robot are controlled by the trained policies, while the behaviors of the obstacles are controlled by the ORCA algorithm. The policies of the robot are trained in the circle-crossing simulator and tested in both the square-crossing and circle-crossing simulators (Fig. 3). Other environment settings: (1) action space: the action space consists of 81 actions, including no action and 5 speed choices in each of 16 moving directions; (2) time limit and time cost per step: the time limit of each episode is set to 25 s, and one step/action costs 0.25 s.
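The discrete action space above can be enumerated as follows (the maximum speed `v_max` is an assumed parameter); one stop action plus 5 speeds × 16 headings gives 81 actions:

```python
import math

# Sketch of the 81-action discrete action space described above:
# one "no action" plus 5 speed levels in each of 16 evenly spaced headings.
def build_action_space(v_max=1.0, n_speeds=5, n_headings=16):
    actions = [(0.0, 0.0)]                       # stay still
    for k in range(n_headings):
        theta = 2 * math.pi * k / n_headings
        for s in range(1, n_speeds + 1):
            v = v_max * s / n_speeds
            actions.append((v * math.cos(theta), v * math.sin(theta)))
    return actions
```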
Design of network architecture
Configurations of layer and node
Before designing the network architecture, we first introduce our expectations for the algorithm training: (1) as little training data as possible; (2) as short a training time as possible. Note that these goals must be achieved without affecting the accuracy and reliability of the initial algorithms. The number of layers should be kept within a reasonable range. A network with a large number of hidden layers is not feasible for our task because a large network requires more data to converge, and its training process is also time-consuming. According to principles for configuring the layers and the nodes in each layer, a network with two hidden layers is sufficient to solve any nonlinear classification or regression problem. Hence the learning inefficiency caused by a single hidden layer is better avoided (Brownlee, 2018; Goodfellow et al., 2016; Hornik et al., 1989; Lippmann, 1988; Reed & Marks II, 1999). The learning efficiency may improve as the number of hidden layers increases, but the training time consumed may increase accordingly.
Given the number of nodes per layer reported in the literature (Everett et al., 2018), a grid test is used to find the optimal or near-optimal number of layers and number of nodes in each layer. Let \(N_{layer}\) denote the number of hidden layers; the grid test is conducted under \(N_{layer}\in\{1,2\}\). Let \(N_{node}\) denote the number of nodes in each layer; the grid test is conducted under \(N_{node}\in\{64,128,256\}\). An activation function is required to activate the hidden layers, and the Tanh function is used in this experiment. Hence, two types of architecture for A2C are obtained, shown in Fig. 4. Note that the actor network and critic network in these two architectures share only the same input and output. Parameters in the layers cannot be shared in these architectures because the networks cannot converge when the task is a continuous control problem (Schulman et al., 2017). Recall that the AW-based A2C consists of the AW network and the A2C network. We keep the configurations of the AW network as in (Chen et al., 2019b) and then find the near-optimal configurations of the A2C network. That means the configurations of the AW network are set to [150, 100], [100, 50] and [100, 100] respectively for the MLP layers of the embedded feature \(f_e\), interaction feature \(f_h\) and attention score \(f_{\alpha}\). The AW network is included in the propagation of the AW-based A2C network; in other words, the AW network provides two layers (one input layer and one hidden layer) to the AW-based A2C network. Hence the A2C network theoretically needs only two layers (one hidden layer and one output layer), according to the claim that two hidden layers are sufficient and efficient for creating classification regions of any desired shape (Goodfellow et al., 2016; Reed & Marks II, 1999).
Grid test (grid search)
To better select the optimal or near-optimal architecture for the A2C algorithm, we consider seven factors (Table 1), of which the value coefficient, entropy coefficient, optimizer, and discount factor are held fixed. The learning rate, input dimension (the \(S^{aw}\) dimension, i.e., the output dimension of the AW network), and multiple obstacles (complex features) are set to different values. Given \(N_{node}\in\{64,128,256\}\), the first architecture (Fig. 4a) has three configurations, while the second architecture (Fig. 4b) has nine configurations (Table 2). The grid test would thus involve \(96\;(12\times 2\times 2\times 2)\) experiments. It is time-consuming to train and evaluate all configurations, so we first use a simple one-robot one-obstacle case to test the efficiency and efficacy of each possible configuration before the grid test, and then select the best configurations for further tests. The training results are shown in Fig. 5.
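The experiment count above can be checked by enumerating the architecture configurations; the tuple encoding of hidden-layer widths is our own:

```python
from itertools import product

# Sketch of the grid-test configurations: N_layer in {1, 2} and
# N_node in {64, 128, 256} give 3 one-hidden-layer and 9 two-hidden-layer
# architectures (12 in total); combined with the 2 x 2 x 2 factor settings
# this yields the 96 experiments mentioned above.
def grid_configs():
    one_layer = [(n,) for n in (64, 128, 256)]
    two_layer = list(product((64, 128, 256), repeat=2))
    return one_layer + two_layer
```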
The first grid test (impact of the learning rate): the learning rate is the most important hyperparameter in training (Goodfellow et al., 2016). A larger learning rate may lead to fast convergence to a suboptimum and to weight fluctuation, while a smaller learning rate makes training relatively slow. Hence, it is necessary to tune the learning rate to find the best trade-off. However, instead of finding the best learning rate, our aim here is to figure out the impact of the learning rate on the convergence of the network by changing it from the default larger value to a smaller one (3e−4). The test results with the smaller learning rate are shown in Fig. 6b, while Fig. 6a shows the results with the larger learning rate. The second grid test (impact of the input dimension): the performance of each configuration is further tested by changing the input dimension from 13 to 56 for the one-robot one-obstacle case by applying the AW encoder (AW network) to describe the environmental state. Figure 6b serves as the benchmark, with input dimension 13 and learning rate 3e−4. Figure 6c shows the test results with input dimension 56, keeping the learning rate at 3e−4. The third grid test (impact of complex features): the number of obstacles is increased from 1 to 3 to add complexity to the environment. Figure 6d shows the test results of these configurations, again with learning rate 3e−4.
Efficacy and efficiency analysis of possible architectures
According to Fig. 6, all configurations of the candidate architectures reach the goal of convergence with high rewards, except Arc2-4 in Fig. 6c. Arc1-2 outperforms the other configurations in convergence efficiency and converged reward, regardless of the learning rate, input dimension and feature complexity. Hence, Arc1-2 is selected as the network configuration for the further experiments. Detailed observations of the test results include: (1) a small learning rate slows down convergence slightly but leads to a stable convergence; (2) a large input dimension has less impact on cases with fewer hidden layers (Arc1-1 and Arc1-2), while convergence in cases with more hidden layers (Arc2-1, Arc2-4, and Arc2-7) is slightly slowed down, because a large input dimension creates a larger number of learning units, especially in networks with more layers; (3) a complex feature causes slow and fluctuating convergence for configurations with more layers in the early-stage training, while cases with fewer layers are more robust as the number of obstacles in the environment increases.
Model training
Basic settings of training
The behaviors of the robot are controlled by the trained policies, while the behaviors of the obstacles are controlled by the ORCA policy during training. The circle-crossing simulator is used for model training, while the square-crossing simulator is used for model evaluation. The algorithms trained include the LSTM-based A2C, AW-based A2C and AW-PER-based A2C. We first implement the LSTM-based A2C and then optimize it by applying the AW encoder to describe the environmental information, improving convergence speed and reducing overfitting. Finally, the convergence speed of the AW-based A2C is further accelerated by applying batch learning with PER and a new training strategy.
LSTMbased A2C
Existing LSTM-based algorithms for the complex indoor motion planning problem depend heavily on expert experience for network initialization (Chen et al., 2019b; Everett et al., 2018). Otherwise, their networks cannot converge even in the one-obstacle case, because their reward function is designed for networks initialized by imitation learning. We first implement the online A2C with the reward function reported in (Everett et al., 2018; Chen et al., 2017, 2019b). In this implementation, the states of the agents are encoded by the LSTM encoder to form a new description of the environment, and this LSTM-based description of the environmental state is concatenated with the state of the robot as the input of the A2C algorithm (Fig. 7). The reported reward function leads to difficulty in network convergence, hence it is replaced by a new reward function (Eqs. 3–4). The reward received by the robot is 1 if the robot reaches the goal (\(\mathrm{p}_{\mathrm{current}}=\mathrm{p}_{\mathrm{g}}\), where \(\mathrm{p}_{\mathrm{current}}\) denotes the position of the robot and \(\mathrm{p}_{\mathrm{g}}\) the position of the goal) within the time limit. If \(0<\mathrm{d}_{\mathrm{min}}<0.2\), where \(\mathrm{d}_{\mathrm{min}}\) denotes the minimum distance between the robot and the obstacles, the reward is set to \(-0.1+\frac{d_{min}}{2}\). If the robot collides with the obstacles, the reward is set to −0.25. If the experimental time reaches the time limit (\(\mathrm{t}=\mathrm{t}_{\mathrm{max}}\)) and the robot has not reached the goal (\(\mathrm{p}_{\mathrm{t}}\ne\mathrm{p}_{\mathrm{g}}\)), the reward is set to \(\frac{d_{start\_to\_goal}-\Vert p_{g}-p_{current}\Vert}{d_{start\_to\_goal}}\cdot 0.5\), where \(d_{start\_to\_goal}\) denotes the distance from the start to the goal.
The new reward function accelerates convergence by attaching a reward to final positions of the robot that are closer to the goal. The training results of the LSTM-based A2C with the reported and modified reward functions in the one-obstacle case are shown in Fig. 8.
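The reward function above can be sketched as a single dispatch; the case ordering, helper names and the reconstructed signs of the penalty terms are our reading of the description:

```python
# Hedged sketch of the modified reward function described above.
def reward(reached_goal, d_min, collided, timed_out,
           d_start_to_goal=None, d_current_to_goal=None):
    if reached_goal:
        return 1.0                               # goal reached within time limit
    if collided:
        return -0.25                             # collision penalty
    if timed_out:
        # progress-shaped terminal reward: closer final positions score higher
        return (d_start_to_goal - d_current_to_goal) / d_start_to_goal * 0.5
    if 0 < d_min < 0.2:
        return -0.1 + d_min / 2.0                # discomfort penalty near obstacles
    return 0.0
```

For example, a robot that times out three quarters of the way to a goal 4 m away receives (4 − 1)/4 · 0.5 = 0.375.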
Note that in current reinforcement learning and deep learning, the constants in the reward are set according to intuition and trial and error; the parameter values that lead to better performance are then selected. Here the reward parameters (e.g., 1, −0.25 and 0) are the feedback the robot receives when it interacts with the environment, and their setting comes from trial and error. The rewards are used indirectly in backpropagation to update the neural network towards convergence (the global optimum); otherwise, the network may not converge or may converge to a local optimum. The safe distance of 0.2 m is set artificially according to the requirements of the task (e.g., cases with high-speed obstacles); it could be 0.1 m if the obstacles move slowly.
The experiments on the LSTM-based A2C algorithm are extended to multiple-obstacle cases in which the number of obstacles is \(N\in\{2,3,4,5,10,15\}\) (Table 3). The motion of the obstacles is controlled by the ORCA policy, and the circle-crossing simulator is used for training. The LSTM-based A2C algorithm can successfully accomplish the motion planning tasks in multiple-obstacle cases. However, the training of the robot's policy still suffers from suboptimal rewards and slower convergence as the number of nearby obstacles increases (Fig. 9a, b). Moreover, the LSTM-based A2C is prone to overfitting because the LSTM encodes the states of the agents according to the distances of the obstacles to the robot. This distance-based encoding means that closer obstacles have a larger impact on the robot. However, it does not always work, e.g., the failed training case in Fig. 9c, because the distance-based encoding describes the environmental state only partially; other factors such as the speed and moving direction of the obstacles should be considered as well. Hence, a robust description of the environmental state is needed to better solve the overfitting, slow convergence and suboptimal reward problems. This is achieved by applying the AW encoder to replace the LSTM encoder.
AWbased online A2C
The new environmental state is described or interpreted by the AW encoder, which has four versions: the basic version, global-state version, occupancy-map version and full version. Note that the global feature \(e_{mean}=\frac{1}{n}\sum_{k=1}^{n}e_k\) and the occupancy map \(M_i(a,b,:)\) are not included in the basic version, while the full version includes both. The performances of the four versions of the AW-based A2C in multiple-obstacle training are shown in Fig. 10, in which (a–d) represent the training results of the full version, occupancy-map version, global-state version and basic version respectively.
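The global-state feature is simply the element-wise mean of the embedded obstacle features, which can be computed as:

```python
# Sketch of the global feature e_mean = (1/n) * sum_k e_k: the element-wise
# mean over the embedded obstacle features e_1..e_n.
def global_feature(embedded):
    n, dim = len(embedded), len(embedded[0])
    return [sum(e[d] for e in embedded) / n for d in range(dim)]
```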
Their training results all follow the same trend: there is little difference in convergence speed and converged reward between the four versions of AW in cases with fewer obstacles. Their convergence is fast, and the rewards obtained are high. However, convergence slows down and the rewards obtained become smaller as the number of obstacles increases.
The training curves of the AW-based A2C are also compared with those of the LSTM-based A2C (Fig. 11). The criteria for comparison are the number of episodes used for the early-stage training and the peak reward within the same number of episodes. Note that the parameters/hyperparameters in the training of the AW-based A2C are the same as those of the LSTM-based A2C in Table 3. The early-stage training here is defined as lasting until the moment at which the robot can find its goal consistently; this means the average accumulative reward and success rate reach around 0.1 and 40% respectively. Detailed comparisons of the two algorithms are listed in Table 4.
According to the comparisons, nearly all versions of the AW-based A2C largely outperform the LSTM-based A2C in completing the early-stage training, not only in cases with few obstacles but also in cases with dense obstacles. The global-state version works better in cases with dense obstacles, while the occupancy-map version has no obvious effect compared with the basic version. However, obvious improvements appear once the global state and occupancy map work together to form the full version of the AW-based A2C. The training curves and peak reward of the AW-based A2C are also slightly better than those of the LSTM-based A2C, and the full version performs more robustly than the other versions. However, the AW-based A2C still converges slowly in cases with dense obstacles, although it outperforms the LSTM-based A2C in convergence speed. Its convergence speed can be further improved by applying batch learning with PER to make the best of the experiences discarded in online learning.
AWPERbased A2C and its training strategy
Batch learning with PER sharply improves the convergence speed of the AW-based A2C in the early-stage training, but the network eventually converges to a suboptimal reward, because the distribution of the experiences collected from the dense and dynamic environment for batch learning changes during training. This may cause the final converged policy or network to differ slightly from that of the online AW-based A2C, and some unexpected collisions follow. Hence, we choose a training strategy that consists of three stages: (1) the AW-based A2C is trained with both online learning and batch learning in the early-stage training; (2) the model from the first-stage training is trained in an online manner; (3) the model from the second-stage training continues to be trained online with a smaller learning rate. A suboptimal model is obtained from the first-stage training with a smaller dataset. The model then converges from suboptimal to near-optimal in the second-stage training, and its performance can be further improved slightly in the third-stage training by applying a smaller learning rate.
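The three-stage strategy can be summarized as a small schedule; the concrete learning rates here are illustrative placeholders, not values fixed by the text:

```python
# Schematic of the three-stage training strategy: stage 1 mixes online and
# PER batch learning, stages 2-3 are online only, with stage 3 using a
# smaller learning rate for a stable final convergence.
def training_schedule():
    return [
        {"stage": 1, "online": True, "per_batch": True,  "lr": 1e-4},
        {"stage": 2, "online": True, "per_batch": False, "lr": 1e-4},
        {"stage": 3, "online": True, "per_batch": False, "lr": 1e-5},
    ]
```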
The experiments on the AW-PER-based A2C include cases with 5 and 10 obstacles, representing two levels of obstacle density: normal density and high density. The parameters/hyperparameters in the training of the AW-PER-based A2C are shown in Table 5.
The results of the first- and second-stage trainings with 10 obstacles are shown in Fig. 12. We find that batch learning is sensitive to a large learning rate (1e−2) in training, hence contributing little to the convergence speed; this problem is better solved by applying a smaller learning rate (1e−4) (Fig. 12a). According to the training results, the AW-PER-based A2C needs nearly 10,000 episodes to complete the early-stage training, whereas the online AW-based A2C and LSTM-based A2C spend around 25,000 and 32,000 episodes respectively in the first-stage training. For the second-stage training of the AW-PER-based A2C, in which the model is initialized with a model trained for 10,000 episodes, 15,000 episodes are used to reach a near-optimal reward; the AW-based A2C and LSTM-based A2C need 47,000 and 53,000 episodes respectively to reach that level.
The results of the first- and second-stage trainings with 5 obstacles are shown in Fig. 13. It is clear that the AW-PER-based A2C takes nearly 2000 episodes to complete the early-stage training, while the AW-based A2C and LSTM-based A2C take 8000 and 15,000 episodes respectively to achieve that result. The model of the AW-PER-based A2C trained with 2000 episodes then continues to be trained in an online manner, and a near-optimal model is obtained after 18,000 episodes. By contrast, the AW-based A2C and LSTM-based A2C each need around 30,000 episodes to reach almost the same result.
The results of the third-stage training in cases with 5 and 10 obstacles are shown in Fig. 14. The performance of the AW-PER-based A2C can be further improved by applying a smaller learning rate. The model improved considerably in the case with 5 obstacles in the third-stage training (nearly 0.06 in reward), while the reward increased only a little in the case with 10 obstacles (around 0.03). Little further improvement is seen after 10,000 episodes for the cases with 10 and 5 obstacles, hence the third-stage models after 10,000 episodes are saved for the model evaluations shown in the next section.
Model evaluations
We selected five evaluation indicators, consisting of training evaluations, quantitative evaluations, qualitative evaluations, computational evaluations, and robustness evaluations, to assess the performance of the algorithms. These indicators are widely used in reinforcement learning, yet many works select only one or two of them, which is not sufficient. Hence, we summarize these indicators from many reinforcement learning works and use all of them to evaluate our algorithms, giving readers a comprehensive understanding.
This section first summarizes the results of training (training evaluations). Then the trained models are evaluated from four perspectives: (1) quantitative evaluations; (2) qualitative evaluations; (3) computational evaluations; (4) robustness evaluations. Note that our experiments did not involve much fine-tuning of the three algorithms in training; we believe there is still room for further enhancement of their performance.
Training evaluations
This part considers the converged average accumulative reward and the number of episodes spent in the three training stages. Note that the LSTM-based A2C and AW-based A2C do not have a 2nd-stage training: they are first trained with 60,000 episodes and then retrained with 15,000 episodes for a stable convergence. However, we still use the 2nd-stage training label for these two algorithms for clear comparison with the AW-PER-based A2C. Detailed comparisons of the three algorithms are shown in Table 6. According to the results, our AW-based A2C and AW-PER-based A2C outperform the LSTM-based A2C, especially the AW-PER-based A2C, which needs merely around 30%/15% of the data (10,000/2000 episodes) to complete the 1st-stage training compared with the LSTM-based A2C (32,000/15,000 episodes). In the 2nd-stage training, the AW-based A2C converges slightly more slowly than the LSTM-based A2C, but the convergence speed of the AW-PER-based A2C is faster than that of the LSTM-based A2C, especially in cases with 10 obstacles. In the 3rd-stage training, the AW-PER-based A2C outperforms the other two algorithms not only in convergence speed but also in converged reward in cases with 5 obstacles, while its converged reward (0.30) is slightly lower than those of the other two algorithms in cases with 10 obstacles (0.31 and 0.33). Overall, the AW-PER-based A2C takes nearly 50%/30% of the data (35,000/30,000 episodes) to converge. Its converged reward is almost the same as those of the other two algorithms in cases with 10 obstacles, and better in cases with 5 obstacles.
Settings for model evaluations
The settings of the experiments for model evaluation are listed in Table 7. The models of the three algorithms are evaluated in simulators with 10 and 5 obstacles. All evaluation experiments are conducted in the circle-crossing simulator, except for the robustness evaluations, which are conducted in the square-crossing simulator. Obstacles in the simulators are controlled by the ORCA policy, while the robot is controlled by the trained policies of the three algorithms. The total number of tests for each algorithm is set to 500 episodes.
Quantitative evaluations
Six criteria are selected for the quantitative comparison of the three algorithms: success rate, average time to goal, collision rate, timeout rate, mean distance to obstacles, and accumulative discounted reward. Detailed comparisons are shown in Table 8. According to the comparisons, the accumulative rewards received by the three algorithms are almost the same as those in training. The robots based on the three algorithms learnt how to keep a safe distance to the obstacles (in successful cases). The AW-based A2C and AW-PER-based A2C outperform the LSTM-based A2C in success rate and timeout rate; this means these two algorithms more easily find their goals within the time limit (25 s), but the AW-PER-based A2C seems likely to suffer more collisions in cases with 10 obstacles. There is little difference between the three algorithms in the time spent to reach the goal in cases with 10 obstacles, while the robot controlled by the AW-PER-based A2C outperforms the other two in time to goal in cases with 5 obstacles.
Qualitative evaluations
The qualitative evaluation concerns how good the policy is, or what strategies the robot learnt, and is carried out by observing the trajectories of the robot. The policies of the AW-PER-based A2C are shown in Fig. 15, while Fig. 16 presents the trajectory comparisons of the LSTM-based A2C, AW-based A2C and AW-PER-based A2C.
According to Fig. 15, the robot tends to learn a "Recede-Wait-Forward" strategy in the environment with high-density obstacles: the robot first moves out of the crowd of obstacles and keeps a distance from them [Fig. 15a (a1–a5)]; it then waits for a while until the obstacles start to move away [Fig. 15a (a6–a7)]; finally, it moves straight towards the goal at the highest speed [Fig. 15a (a8–a10)]. The robot in the normal-density environment instead learns a "Wait-Forward" strategy: it first waits for a short time until the obstacles start to move away [Fig. 15b (b1–b4)], then heads directly to its goal at the fastest speed [Fig. 15b (b5–b10)]. Further experiments are conducted to confirm the strategies learnt by the robot; the results are shown in Fig. 15c, d, in which the learnt strategies are illustrated by the trajectory of the robot instead of separate states of the environment.
According to Fig. 16, in cases with 10 obstacles, the robot controlled by the AW-PER-based A2C learnt a fixed policy (Recede-Wait-Forward), as shown in Fig. 15. However, this fixed policy did not appear in the robots based on the AW-based A2C and LSTM-based A2C, which seem to generate flexible policies; most of them are good, except for a few cases like the second trajectory in Fig. 16a. It is hard for the robot based on the LSTM-based A2C to handle an obstacle approaching it head-on, so the robot gets stuck at the beginning. The robot based on the AW-based A2C behaves more intelligently, performing a "Forward-Avoid-Forward" policy (the second trajectory in Fig. 16b) to reach a high reward. In cases with 5 obstacles, the AW-PER-based A2C generates a "Wait-Forward" policy, as shown in Fig. 15, while the AW-based A2C and LSTM-based A2C generate different "Recede-Wait-Forward" policies. Their difference is that the robot based on the LSTM-based A2C recedes too far, while the robot based on the AW-based A2C recedes only a short distance. Hence, the robot based on the AW-based A2C is expected to achieve a higher reward than the robot based on the LSTM-based A2C.
Computational evaluations
This part considers the time cost of training. Detailed comparisons of the time cost (hours) are shown in Table 9. Note that only a single thread is used for data collection in all three algorithms. According to the results, the AW-PER-based algorithm converges much faster (0.1 h) than the other two algorithms (nearly 1 h) in the 1st-stage training, while the LSTM-based A2C is more time-efficient than the other two in the 2nd- and 3rd-stage training. The LSTM-based A2C performs almost the same as the AW-PER-based A2C in the time cost of the entire training process, but it needs more than 2/1.5 times the episodes to converge in cases with 10/5 obstacles compared with the AW-PER-based A2C. The AW-based A2C costs the most time in training, both for the case with 10 obstacles and for the case with 5 obstacles.
Robustness evaluations (extreme tests)
This part evaluates the performance of the three algorithms in a different environment (the square-crossing environment) to test their robustness. Note that the behaviors of the obstacles in the square-crossing environment differ greatly from those in the circle-crossing environment, and much noise is added to further increase the randomness of the obstacles. The policies of the three algorithms are trained in the circle-crossing environment and tested in the square-crossing environment. Six criteria are considered in the robustness evaluations: success rate, time to goal, collision rate, timeout rate, mean distance to obstacles and received accumulative reward, as in the quantitative evaluations. Detailed evaluations are shown in Table 10. According to the results, the AW-based A2C outperforms the other two algorithms on nearly all criteria, except for the time to goal and timeout rate, which are slightly worse than those of the AW-PER-based A2C. The performance of the AW-PER-based A2C drops more than that of the other two algorithms in success rate and collision rate, which also causes the subsequently low accumulative rewards. The relatively lower robustness of the AW-PER-based A2C is caused by the PER: PER requires less data to make the A2C converge, but it also leads to less exploration for finding potentially useful actions. These actions may be useless for improving performance in the circle-crossing environment but useful in the square-crossing environment, hence the relatively low performance of the robot in a completely new environment.
Discussion and future work
This section first analyzes the problems found in the experiments and then discusses some future research directions. Some problems came up during the experiments, for example, the slow convergence of the AW-based A2C in the 2nd-stage training and the distribution drift caused by the PER.
Problem 1
Let us recall some results of the 2nd-stage training from Table 6, partially shown in Table 11. The LSTM-based A2C and AW-based A2C reach the same reward of 0.23 (from 0.1 to 0.23) in the cases with 10 obstacles. The LSTM-based A2C spends 28,000 episodes to reach it, while the AW-based A2C costs more episodes (35,000). Experiments with the LSTM-based A2C and AW-based A2C in the cases with 5 obstacles follow the same trend. We conjecture that this problem relates to the architecture of the AW network, which consists of four layers with complex connections among them (Fig. 4), while the LSTM network consists of merely one recurrent layer. More layers with complex connections in a neural network sometimes cause the vanishing gradient problem in backpropagation (Kolbusz et al., 2017): the early layers next to the input layer receive less gradient to update their weights, leading to slow or even zero weight updates. The more layers there are, the less gradient the early layers receive during backpropagation. However, the AWPER-based A2C better reduces the impact of the vanishing gradient problem by providing a pre-trained model that already yields a high success rate (near 0.4) in the robot's motion planning during the 2nd-stage training. A better success rate means a higher reward and gradient, so a larger gradient reaches the early layers of the AWPER-based A2C, even though these layers are still slightly affected by vanishing gradients.
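The vanishing gradient effect described above can be illustrated with a toy chain of sigmoid units (this is a deliberately simplified model, not the AW network): each layer multiplies the backpropagated gradient by a sigmoid derivative of at most 0.25, so the gradient reaching the earliest layer shrinks roughly geometrically with depth.

```python
import math

# Toy illustration of the vanishing gradient problem: a chain of scalar
# sigmoid layers. Each layer contributes a chain-rule factor of at most 0.25,
# so deeper chains pass back far smaller gradients to their first layer.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def early_layer_gradient(depth, w=1.0, x=0.5):
    """|d(output)/d(input)| after backpropagating through `depth` layers."""
    a, grad = x, 1.0
    for _ in range(depth):
        z = w * a
        s = sigmoid(z)
        grad *= w * s * (1.0 - s)  # chain-rule factor from this layer
        a = s
    return abs(grad)

# The gradient reaching the first layer of an 8-layer chain is orders of
# magnitude smaller than for a 1-layer chain.
assert early_layer_gradient(8) < 0.01 * early_layer_gradient(1)
```

In a real multi-layer network the same multiplicative shrinkage applies per layer, which is consistent with the slower early updates conjectured for the four-layer AW network.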
Problem 2
The distribution drift problem is reflected in the weight updates and causes three consequences: (1) the weight update is sensitive to a larger learning rate; (2) the final converged policy is slightly unpredictable; (3) the AWPER-based A2C has relatively low robustness.
The first consequence is shown in Fig. 17, in which a pre-trained model is retrained in the 2nd-stage training. A larger learning rate (3e−4) is applied in this period of training. However, the AWPER-based model (green line) converges from a worse point (reward = −0.9) instead of from the result of the pre-trained model (reward = 0.1). Replacing the larger learning rate with a smaller one (3e−5) largely resolves this problem.
The second consequence is shown in the 3rd-stage rewards received by the AW-based A2C and AWPER-based A2C in Table 6, partially shown in Table 12. In the 2nd- and 3rd-stage training, the AWPER-based A2C updates its weights in an online manner like the AW-based A2C. Intuitively, these two algorithms should converge to the same reward. However, the 3rd-stage results show that their converged rewards are slightly different, and the policies learnt by the robot differ slightly as well. For example, the robot using the AWPER-based A2C tends to choose the "Recede" and "Wait" strategies in the early stage of motion planning with 10 obstacles, whereas the robot based on the AW-based A2C is apt to select the "Follow" and "Wait" strategies. In the cases with 5 obstacles, the robot based on the AWPER-based A2C is likely to choose the "Wait" strategy at the beginning, while the robot using the AW-based A2C prefers the "Recede" strategy. A higher reward is expected once the "Follow" strategy is selected, while it is hard to obtain a better reward if the robot recedes at the beginning.
The third consequence is shown in Table 10, partially shown in Table 13. The performance of the AWPER-based A2C drops slightly faster than that of the AW-based A2C and LSTM-based A2C in the success rate and collision rate of the robustness evaluations (extreme tests). We find that these three consequences are caused by the PER, which introduces bias into the weight updates by changing the data distribution, therefore changing the solution or policy the network should converge to in online learning (Schaul et al., 2016). The importance-sampling weight reduces the impact of the distribution drift but cannot eliminate it, which is why slight differences are found between the converged policies of the AW-based A2C and AWPER-based A2C. Moreover, the AWPER-based A2C needs less data (35,000/30,000 episodes) to converge, while the AW-based A2C and LSTM-based A2C spend 75,000/45,000 episodes to reach convergence. This means less exploration is done by the AWPER-based A2C compared with the other two algorithms, so its converged policy slightly lacks flexibility and robustness in the qualitative evaluations and robustness evaluations, respectively. In summary, algorithms based on the PER converge faster and consume less data, but the flexibility and robustness of their converged policies drop slightly, especially in challenging scenarios. We believe all PER-based algorithms in other fields share the same consequence in the flexibility and robustness of the policy. Hence, finding a better trade-off between the advantages and disadvantages of the PER matters.
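The mechanism behind these consequences can be seen in a minimal sketch of prioritized sampling with importance-sampling (IS) correction, after Schaul et al. (2016). The class layout and hyperparameter values below are illustrative assumptions, not the paper's implementation: prioritized sampling makes the sampling distribution non-uniform (the source of the drift), and the IS weight shrinks the update on over-sampled transitions to partially compensate.

```python
import random

# Minimal sketch of prioritized experience replay with IS correction
# (Schaul et al., 2016). Buffer layout and hyperparameters are illustrative.
class TinyPER:
    def __init__(self, alpha=0.6, beta=0.4, eps=1e-6):
        self.data, self.prio = [], []
        self.alpha, self.beta, self.eps = alpha, beta, eps

    def add(self, transition, td_error):
        # larger TD error -> higher sampling priority
        self.data.append(transition)
        self.prio.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, k):
        total = sum(self.prio)
        probs = [p / total for p in self.prio]  # non-uniform: biases the data distribution
        idx = random.choices(range(len(self.data)), weights=probs, k=k)
        n = len(self.data)
        # IS weights down-weight over-sampled transitions; they reduce the
        # distribution-drift bias but cannot eliminate it
        w = [(1.0 / (n * probs[i])) ** self.beta for i in idx]
        w_max = max(w)
        return [self.data[i] for i in idx], [x / w_max for x in w]
```

In an A2C update, each sampled transition's loss would then be scaled by its normalized IS weight before the gradient step.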
Future research directions
Future work is expected to focus on four aspects: (1) methods for better feature interpretation; (2) methods to reduce the distribution drift; (3) combination of global path planning algorithms with local motion planning algorithms; (4) 3-dimensional real-world implementation.
The recent relation graph also performs robustly in describing environmental states (Chen et al., 2019a). The impact of the distribution drift problem may be further reduced by changing the way data are stored and sampled in the replay buffer (Bu & Chang, 2020). However, the algorithms discussed in this paper are suited to local motion planning problems; they cannot replace global path planning algorithms for outdoor path planning tasks. It would be interesting to combine our local motion planning with global path planning algorithms for more challenging tasks, such as path/motion planning in a large and complex airport like Daxing airport in Beijing. Our work does not consider the complex 3-dimensional case. A 3-dimensional implementation would be more challenging because of the irregular or complex shapes and geometries of obstacles, and applying our algorithm to the 3-dimensional case directly would cause potential problems, for instance, in collision detection. Our work simplifies the obstacle shape as a circle with a dynamic radius. The collision detection constraint is obtained by computing the minimum distance between two agents (\(d_{min} = r_{robot} + r_{obstacle}\)): if the distance \(D\) between the robot and an obstacle satisfies \(D < d_{min}\), they collide. This is not sufficient in the 3-dimensional case. Our future work on the 3-dimensional implementation will consider fusing other methods (Redon et al., 2005; Husty et al., 2007; Barton et al., 2009; Choi et al., 2009) to ensure an accurate collision detection constraint. In real-world implementations, unexpected faults, disturbances, and other uncertainties of real systems are challenges that cannot be ignored. Future work will also consider the uncertainties of real-world implementation, especially fault detection (Cheng et al., 2021; Dong et al., 2020).
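The 2-D collision constraint stated above reduces to a simple circle-overlap check, sketched below (function and parameter names are ours, not the repository's). This is exactly the check that becomes insufficient for irregular 3-D geometries.

```python
import math

# The paper's 2-D constraint: both agents are modeled as circles, and a
# collision occurs when the centre distance D drops below
# d_min = r_robot + r_obstacle.
def collides(p_robot, r_robot, p_obstacle, r_obstacle):
    d = math.dist(p_robot, p_obstacle)  # centre-to-centre distance D
    return d < r_robot + r_obstacle     # D < d_min -> collision
```

For 3-D obstacles, a single bounding sphere per agent would be far too conservative for elongated or articulated shapes, which motivates the continuous collision detection methods cited above.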
Conclusion
This paper first implements the LSTM-based A2C without expert experience by modifying the reward function, but the LSTM-based A2C suffers from slow convergence and overfitting in training. The AW encoder is then applied to replace the LSTM encoder to better describe the environmental state, which reduces the problems of the LSTM-based A2C. The convergence speed of the AW-based A2C is further improved by combining online learning with batch learning based on the PER. As a result, the AWPER-based A2C takes only near 15% and 30% of the data to get through the early-stage training, and spends around 45% and 65% of the data to reach convergence compared with the LSTM-based A2C in the cases with 10 and 5 obstacles, respectively. The AWPER-based A2C converges to almost the same reward as the LSTM-based A2C and AW-based A2C in the cases with 10 and 5 obstacles (even better with 5 obstacles), at the expense of slightly sacrificing its robustness in the extreme test (robustness evaluations). Our AWPER-based A2C and AW-based A2C are easy to apply to real motion planning tasks once the features of the agents (the robot and obstacles) are acquired by sensors [e.g., light detection and ranging (LiDAR), depth camera and encoder]. For example, position, velocity and moving direction are obtained by the LiDAR and encoder; radius is obtained by the depth camera, while the goal's position and preferred velocity are set artificially.
Data availability
Our source code is available on the website (https://github.com/CHUENGMINCHOU/AWPERA2C).
Abbreviations
A2C: Advantage actor-critic
A3C: Asynchronous advantage actor-critic
AW: Attention weight
CNN: Convolutional neural network
DDPG: Deep deterministic policy gradient
DL: Deep learning
DPG: Deterministic policy gradient
DQN: Deep Q network
DWA: Dynamic window approach
IID: Independent and identically distributed
IL: Imitation learning
LSTM: Long short-term memory
MDP: Markov decision process
MLP: Multilayer perceptron
ORCA: Optimal reciprocal collision avoidance
PER: Prioritized experience replay
PPO: Proximal policy optimization
RL: Reinforcement learning
RRT: Rapidly-exploring random tree
RG: Relation graph
TD: Temporal difference
TRPO: Trust region policy optimization
References
Bai, Z., Cai, B., Shangguan, W., & Chai, L. (2019). Deep learning based motion planning for autonomous vehicle using spatiotemporal LSTM network. In Proceedings 2018 Chinese Automation Congress, CAC 2018 (pp. 1610–1614). https://doi.org/10.1109/CAC.2018.8623233
Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. Machine Learning Proceedings, 1995, 30–37. https://doi.org/10.1016/b9781558603776.50013x
Barton, M., Shragai, N., & Elber, G. (2009). Kinematic simulation of planar and spatial mechanisms using a polynomial constraints solver. Computer-Aided Design and Applications, 6(1), 115–123. https://doi.org/10.3722/cadaps.2009.115123
Bas, E. (2019). An introduction to Markov chains. Basics of Probability and Stochastic Processes. https://doi.org/10.1007/9783030323233_12
Brownlee, J. (2018). Better deep learning: train faster, reduce overfitting, and make better predictions. Machine Learning Mastery. Retrieved from https://machinelearningmastery.com/betterdeeplearning/.
Bry, A. & Roy, N. (2011). Rapidly-exploring random belief trees for motion planning under uncertainty. In Proceedings—IEEE international conference on robotics and automation. https://doi.org/10.1109/ICRA.2011.5980508.
Bu, F. & Chang, D. E. (2020). Double prioritized state recycled experience replay. In 2020 IEEE international conference on consumer electronics—Asia, ICCE-Asia 2020. https://doi.org/10.1109/ICCEAsia49877.2020.9276975.
Chen, Y. F., Liu, M., Everett, M. & How, J. P. (2017). Decentralized non-communicating multi-agent collision avoidance with deep reinforcement learning. In Proceedings—IEEE international conference on robotics and automation (pp. 285–292). https://doi.org/10.1109/ICRA.2017.7989037.
Chen, C., Hu, S., Nikdel, P., Mori, G. & Savva, M. (2019a). Relational graph learning for crowd navigation. ArXiv. http://arxiv.org/abs/1909.13165.
Chen, C., Liu, Y., Kreiss, S., & Alahi, A. (2019b). Crowd-robot interaction: Crowd-aware robot navigation with attention-based deep reinforcement learning. In Proceedings—IEEE international conference on robotics and automation (pp. 6015–6022). https://doi.org/10.1109/ICRA.2019.8794134
Cheng, P., Wang, H., Stojanovic, V., He, S., Shi, K., Luan, X., Liu, F., & Sun, C. (2021). Asynchronous fault detection observer for 2D Markov jump systems. IEEE Transactions on Cybernetics. https://doi.org/10.1109/TCYB.2021.3112699
Choi, Y. K., Chang, J. W., Wang, W., Kim, M. S., & Elber, G. (2009). Continuous collision detection for ellipsoids. IEEE Transactions on Visualization and Computer Graphics, 15(2), 311–324. https://doi.org/10.1109/TVCG.2008.80
Dong, X., He, S., & Stojanovic, V. (2020). Robust fault detection filter design for a class of discrete-time conic-type nonlinear Markov jump systems with jump fault signals. IET Control Theory & Applications, 14(14), 1912–1919. https://doi.org/10.1049/ietcta.2019.1316
Everett, M., Chen, Y. F., & How, J. P. (2018). Motion planning among dynamic, decisionmaking agents with deep reinforcement learning. IEEE International Conference on Intelligent Robots and Systems, Iii. https://doi.org/10.1109/IROS.2018.8593871
Farouki, R. T., & Sakkalis, T. (1994). Pythagorean-hodograph space curves. Advances in Computational Mathematics, 2(1), 41–66. https://doi.org/10.1007/BF02519035
Fox, D., Burgard, W., & Thrun, S. (1997). The dynamic window approach to collision avoidance. IEEE Robotics & Automation Magazine, 4(1), 23–33. https://doi.org/10.1109/100.580977
Funke, J., Theodosis, P., Hindiyeh, R., Stanek, G., Kritatakirana, K., Gerdes, C., Langer, D., Hernandez, M., MüllerBessler, B., & Huhnke, B. (2012). Up to the limits: Autonomous Audi TTS. IEEE Intelligent Vehicles Symposium, 2012, 541–547. https://doi.org/10.1109/IVS.2012.6232212
González, D., Pérez, J., Lattarulo, R., Milanés, V. & Nashashibi, F. (2014). Continuous curvature planning with obstacle avoidance capabilities in urban scenarios. In 2014 17th IEEE international conference on intelligent transportation systems, ITSC 2014 (pp. 1430–1435). https://doi.org/10.1109/ITSC.2014.6957887.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep learning. MIT press, Cambridge, MA. Retrieved from http://www.deeplearningbook.org.
Hart, P. E., Nilsson, N. J., & Raphael, B. (1968). A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2), 100–107. https://doi.org/10.1109/TSSC.1968.300136
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366. https://doi.org/10.1016/08936080(89)900208
Husty, M. L., Pfurner, M., & Schröcker, H.P. (2007). A new and efficient algorithm for the inverse kinematics of a general serial 6R manipulator. Mechanism and Machine Theory, 42(1), 66–81. https://doi.org/10.1016/j.mechmachtheory.2006.02.001
Inoue, M., Yamashita, T., & Nishida, T. (2019). Robot path planning by LSTM network under changing environment. Advances in Intelligent Systems and Computing, 759, 317–329. https://doi.org/10.1007/9789811303418_29
Jiang, M., Grefenstette, E. & Rocktäschel, T. (2020). Prioritized level replay. ArXiv. http://arxiv.org/abs/2010.03934.
Kolbusz, J., Rozycki, P., & Wilamowski, B. M. (2017). The study of architecture MLP with linear neurons in order to eliminate the "vanishing gradient" problem. In L. Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. A. Zadeh, & J. M. Zurada (Eds.), International conference on artificial intelligence and soft computing 2017: Artificial intelligence and soft computing (pp. 97–106). Springer. https://doi.org/10.1007/9783319590639_9
Konda, V. R. & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems (pp. 1008–1014).
Li, A. A., Lu, Z. & Miao, C. (2021). Revisiting prioritized experience replay: A value perspective. ArXiv. http://arxiv.org/abs/2102.03261.
Lin, Z., Feng, M., Dos Santos, C. N., Yu, M., Xiang, B., Zhou, B. & Bengio, Y. (2017). A structured self-attentive sentence embedding. In 5th international conference on learning representations, ICLR 2017—conference track proceedings (pp. 1–15).
Lippmann, R. P. (1988). An introduction to computing with neural nets. ACM SIGARCH Computer Architecture News, 16(1), 7–25. https://doi.org/10.1145/44571.44572
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. (2013). Playing atari with deep reinforcement learning, pp. 1–9. ArXiv, http://arxiv.org/abs/1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Humanlevel control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In M. F. Balcan & K. Q. Weinberger (Eds.), Proceedings of machine learning research (Vol. 48, pp. 1928–1937). PMLR. http://proceedings.mlr.press/v48/mniha16.pdf
Munos, R., Stepleton, T., Harutyunyan, A., & Bellemare, M. G. (2016). Safe and efficient off-policy reinforcement learning. Advances in Neural Information Processing Systems, 26, 1054–1062.
Oh, J., Guo, Y., Singh, S., & Lee, H. (2018). Self-imitation learning. 35th International Conference on Machine Learning, ICML, 9(2), 6214–6223.
Redon, S., Lin, M. C., Manocha, D., & Kim, Y. J. (2005). Fast continuous collision detection for articulated models. Journal of Computing and Information Science in Engineering, 5(2), 126–137. https://doi.org/10.1115/1.1884133
Reed, R. & Marks II, R. J. (1999). Neural smithing: Supervised learning in feedforward artificial neural networks. MIT Press. Retrieved from https://mitpress.mit.edu/books/neuralsmithing.
Reeds, J. A., & Shepp, L. A. (1990). Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics, 145(2), 367–393. https://doi.org/10.2140/pjm.1990.145.367
Schaul, T., Quan, J., Antonoglou, I. & Silver, D. (2016). Prioritized experience replay. In 4th international conference on learning representations, ICLR 2016—conference track proceedings (pp. 1–21).
Schulman, J., Levine, S., Moritz, P., Jordan, M. & Abbeel, P. (2015). Trust region policy optimization. In 32nd international conference on machine learning, ICML 2015, (vol. 3, pp. 1889–1897).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A. & Klimov, O. (2017). Proximal policy optimization algorithms. (pp. 1–12) ArXiv.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D. & Riedmiller, M. (2014). Deterministic policy gradient algorithms. In 31st international conference on machine learning, ICML 2014, (vol. 1, pp. 605–619).
Stathakis, D. (2009). How many hidden layers and nodes? International Journal of Remote Sensing, 30(8), 2133–2147. https://doi.org/10.1080/01431160802549278
Van Den Berg, J., Lin, M., & Manocha, D. (2008). Reciprocal velocity obstacles for real-time multi-agent navigation. Proceedings—IEEE International Conference on Robotics and Automation, 2(4), 100–107. https://doi.org/10.1109/ROBOT.2008.4543489
Van Hasselt, H., Guez, A. & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In 30th AAAI Conference on Artificial Intelligence, AAAI 2016, (pp. 2094–2100).
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. 33rd International Conference on Machine Learning, ICML 2016, 4(9), 2939–2947.
Wang, Z., Mnih, V., Bapst, V., Munos, R., Heess, N., Kavukcuoglu, K. & De Freitas, N. (2017). Sample efficient actor-critic with experience replay. In 5th international conference on learning representations, ICLR 2017—conference track proceedings, 2016. Retrieved from https://static.aminer.cn/upload/pdf/239/1521/964/58d82fc8d649053542fd5854.pdf.
Xu, W., Wei, J., Dolan, J. M., Zhao, H., & Zha, H. (2012). A real-time motion planner with trajectory optimization for autonomous vehicles. IEEE International Conference on Robotics and Automation, 2012, 2061–2067. https://doi.org/10.1109/ICRA.2012.6225063
Zha, D., Lai, K. H., Zhou, K. & Hu, X. (2019). Experience replay optimization. In IJCAI international joint conference on artificial intelligence (Vols. 2019Augus). https://doi.org/10.24963/ijcai.2019/589
Funding
Open access funding provided by University of Eastern Finland (UEF) including Kuopio University Hospital. The authors did not receive support from any organization for the submitted work.
Author information
Contributions
CZ, PF, HH, BH: conceptualization; CZ: methodology; CZ: formal analysis and investigation; CZ: writing—original draft preparation; PF, BH, HH: writing—review and editing; CZ, HH, BH, PF: funding acquisition, resources; PF, BH: Supervision.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or nonfinancial interest in the subject matter or materials discussed in this manuscript.
Consent to participate
Not applicable.
Consent to publish
Not applicable.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Zhou, C., Huang, B., Hassan, H. et al. Attention-based advantage actor-critic algorithm with prioritized experience replay for complex 2D robotic motion planning. J Intell Manuf 34, 151–180 (2023). https://doi.org/10.1007/s1084502201988z