1 Introduction

Hierarchical reinforcement learning (HRL) has significantly advanced the capabilities of reinforcement learning (RL) in tackling complex, temporally extended problems that involve sparse rewards and long-term credit assignment [1,2,3,4]. HRL involves a high-level policy that decomposes the original task into multiple subtasks with corresponding subgoals [5, 6], as well as a low-level policy that accomplishes these subtasks by achieving the subgoals [6,7,8]. Despite their potential, traditional HRL algorithms often suffer from low training efficiency [9], making them impractical for real-world applications. Using expert demonstrations to provide prior knowledge and guide learning agents is one of the most popular ways to improve learning efficiency [8, 10,11,12,13,14]. Hierarchical Reinforcement Learning from Demonstrations (HRLfD) is one efficient way to improve the training efficiency of HRL [8]. HRLfD leverages demonstrations as an extra supervision signal to provide prior knowledge, a reward function, and a subtask structure. However, the expert demonstration influences the agent’s learning performance and the generalization of the learned policy, especially in complex and high-dimensional environments [10]. Therefore, if the demonstration is suboptimal or noisy, it may lead the agent to suboptimal or even incorrect action decisions, or fail to capture the task structure and reward function accurately, resulting in poor learning outcomes and slow convergence.

Current approaches to addressing the suboptimality of demonstrations fall into three categories: preference-based reinforcement learning [15,16,17], self-supervised reward regression [18,19,20], and forgetful experience replay [10]. The first category, preference-based reinforcement learning, utilizes pairwise rankings of demonstrations to infer a better policy. However, it relies heavily on the quality of the preference model and the preference elicitation process [21]; if the preferences are overly complicated or noisy, the learning agent may acquire a large number of suboptimal states and actions, impairing learning performance. The second category, self-supervised reward regression, leverages suboptimal demonstrations to generate synthetic data and train an idealized reward function [18,19,20]. It achieves a better correlation with the ground-truth reward and improves over preference-based reinforcement learning when dealing with suboptimal demonstrations, but it requires a large number of demonstrations to generate the synthetic data. The third category, forgetful experience replay, reduces the ratio of expert data in the replay buffer and learning batches over time [10]. This allows the agent to overcome the influence of low-quality expert actions and adapt the trajectories to its own action space. However, forgetful experience replay may lose valuable information from the expert data as the ratio decreases, depends on the quality of sub-goal extraction from the expert data, and has difficulties improving both the imitation and forging phases [10]. In summary, constructing a simple but efficient model that mitigates the negative interference of suboptimal demonstrations in HRL remains an open challenge.

In this paper, we propose a novel HRLfD framework to learn efficient goal-conditioned HRL policies from suboptimal demonstrations. Our framework incorporates the idea of nearest-neighbor search to guide the HRL agent in improving its policies based on a single suboptimal trajectory used as the demonstration. The main idea of our approach is illustrated in Fig. 1a: the high-level policy generates subgoals that are constrained within the m-reachability proximal region around the single demonstration trajectory. This ensures that the agent not only exploits the demonstration to accelerate learning but also explores the m-reachability proximal region to find trajectories better than the demonstration. By employing the m-reachability proximal constraint, we reduce the agent’s reliance on the quality of demonstrations: the high-level policy only needs to explore subgoals whose reachability distance from the demonstration is less than m. To formalize this constraint, we introduce an m-step reachability-based reward shaping (RbRS) method into the HRL framework, which incorporates the m-reachability proximal constraint into the learning process of the high-level policy. Based on this constraint, we design an HRL algorithm utilizing m-step RbRS to enhance the training efficiency of HRLfD. In the experimental section, our method is evaluated on various tasks, including discrete and continuous robot control tasks in the MuJoCo simulator [22], which is widely used in HRL research [12, 23,24,25]. The evaluation results demonstrate the superiority of our method in terms of both asymptotic performance and sample efficiency compared to current state-of-the-art HRLfD and RLfD algorithms. Our method uses only one demonstration while remaining free to explore within the reachable area, constraining the search space of the learning agent. This allows it to quickly learn effective policies while also exploring trajectories better than the demonstration offers. The PyTorch implementation of our method is open-sourced on GitHub: https://github.com/GaoXZ1807/HRLfD-RbRS.

Fig. 1

a Illustration of our method: the light green curve represents the sub-optimal demonstration trajectory and the orange curve represents the better trajectory generated by the learning agent; the Ant Maze task is used as an example. Each transition from state to subgoal is constrained by the m-step reachability constraint. b The goal-conditioned HRL framework and the reachability constraint implemented by the m-step reachability proximal space (dashed orange box labeled "Reachability Space"). The m-step reachability proximal space is the intersection of the adjacency space (the pink circle) and the reachability space (the yellow rectangle)

2 Related Work

Data-efficient learning of hierarchical policies is a long-standing problem in HRL. Goal-conditioned HRL [24, 26,27,28,29,30,31,32,33] utilizes a framework in which high-level policies generate subgoals and low-level policies learn to reach them. One effective approach to enhance the learning efficiency of HRL is to incorporate expert demonstrations as an additional supervision signal, thereby facilitating the learning process and mitigating the issue of sparse rewards [8, 34,35,36,37]. However, creating expert demonstrations often requires manual effort, which can be inconvenient. Moreover, when the demonstrations are suboptimal, they can lead to suboptimal convergence and poor learning outcomes.

There exist approaches that reduce the dependence on the quality of demonstrations, such as preference-based reinforcement learning (PbRL) [15,16,17], self-supervised reward regression [18,19,20], and forgetful experience replay [7, 10, 27, 38]. PbRL uses human preferences as feedback from experts instead of numeric rewards. In detail, PbRL offers the learning agent an alternative: learning policies from a teacher’s preferences instead of pre-defined rewards, so that the agent can avoid the pitfalls of reward engineering, such as reward hacking and infinite rewards [15, 16, 39]. However, when the preferences are suboptimal, the learning agent acquires a suboptimal bias, which means the quality and usefulness of this bias strongly depend on the quality of the preferences [21]; moreover, PbRL still lacks a coherent framework [15]. To learn from suboptimal demonstrations, self-supervised reward regression [18,19,20] characterizes the relationship between the performance of a policy and the quantity of injected noise, and synthesizes rewards via self-supervision. However, it requires a large number of suboptimal demonstrations to generate the synthetic data and cannot use high-level spatial features to penalize mismatches adaptively. In some tasks, especially long-horizon and sparse-reward tasks, it is hard to collect so many demonstrations, particularly usable ones. Forgetful experience replay [10] reduces the proportion of expert data in the replay buffer and learning batches over time, which allows the agent to overcome the influence of low-quality expert actions and adapt the trajectories to its own action space. However, one limitation of forgetful experience replay is the potential loss of valuable information from expert data as their proportion decreases over time; it also has difficulties improving both the imitation and forging phases. In our comparison experiments, ForgER performs very poorly, far below expectations. None of the above methods eliminates the dependence on the demonstration.

Previous works have developed different forms of demonstrations to avoid a strong dependence on the traditional paradigm [26, 40,41,42,43]. While traditional demonstrations typically consist of state-action pairs, abstract demonstrations focus on capturing key information, such as a dataset of skills or different orderings of step indexes. Yang et al. [26] introduce a unified hierarchical reinforcement learning framework called the universal option framework (UOF), which utilizes abstract demonstrations to enable the agent to learn diverse outcomes in multi-step tasks. However, abstract demonstrations are more suitable for tasks involving a small number of steps and may not perform well in more complex tasks that require hundreds of steps, such as Maze tasks. Taylor et al. [42] improve both learning time and policy performance by transferring human demonstrations into a baseline policy for an agent and refining it using reinforcement learning. However, the neural network used in this method is complex, which makes it hard to transfer to other approaches. Nair et al. [43] use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control, such as stacking blocks with a robot arm. However, this approach still requires a large amount of experience, which makes it impractical for real-world applications.

3 Preliminary

3.1 Goal-Conditioned HRL

The Markov decision process (MDP) of goal-conditioned HRL is defined as a tuple \(<S, G, A, P, R, \gamma>\), where S is the set of states, G is the goal set, A is the set of actions, \(P: S\times A\times S\rightarrow [0,1]\) is the state transition function, R is the reward function, and \(\gamma \) is a discount factor.

\(\pi ^h_{\theta _h}(g|s)\) is defined as the high-level policy. The goal of the high level is to maximize the external reward \(r_{kt}^h\) by generating subgoals \(g_t \sim \pi _h(\cdot |s_t)\) as high-level actions. A new subgoal is generated whenever \(t \equiv 0(\bmod \; k)\), where \(k>1\) is a pre-determined hyper-parameter; the high level utilizes a mapping function \(\varphi : S \rightarrow G\), since the goal space G is considered a subspace of the state space S [22, 37, 44]. When \(t \not \equiv 0(\bmod \; k)\), the high-level controller utilizes the goal transition process \(g_t=h(g_{t-1}, s_{t-1},s_t)\). The external reward function of the high level is defined as:

$$\begin{aligned} r_{kt}^h = \sum _{i=kt}^{kt+k-1}R(s_i,a_i),\; \;\;\; t=0,1,2..., \end{aligned}$$
(1)

i.e., the accumulation of the external reward over the time interval \([kt,\; kt+k-1]\), where \(R(s_i,a_i)\) is the reward from the environment.

\(\pi ^l_{\theta _l}(a|s,g)\) is defined as the low-level policy. The goal of the low level is to maximize the intrinsic reward provided by the high-level policy, which describes the subgoal-reaching performance. The low-level policy takes the current state \(s_t\) and the corresponding subgoal \(g_t\) as input and performs a primitive action \(a_t \sim \pi _l(a|s_t,g_t) \in A \) at every time step. To induce the low-level policy to reach the subgoal \(g_t\), an intrinsic reward is provided to measure the subgoal-reaching performance; this low-level reward function is defined as \(r_t^l=-D(g_t,\; \varphi (s_{t+1}))\), where D is the Euclidean distance in practice.

The goal-conditioned HRL framework allows the low-level policy to receive learning signals before it reaches the desired final goal. Besides, goal-conditioned HRL enables concurrent end-to-end training of both the high-level and low-level policies [45].
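To make the two-level interaction concrete, the following minimal sketch (in Python, using hypothetical `env`, `high_policy`, and `low_policy` placeholders that are not taken from our released implementation) illustrates how a subgoal is emitted every k steps, kept fixed between high-level decisions, and how the intrinsic low-level reward \(r_t^l=-D(g_t,\varphi (s_{t+1}))\) is computed.

```python
import numpy as np

def rollout(env, high_policy, low_policy, phi, k=10, T=500):
    """One episode of goal-conditioned HRL (illustrative sketch only).

    phi maps a state to the goal space G (here, the (x, y) position);
    high_policy(s) returns a subgoal g; low_policy(s, g) returns a primitive
    action a. All three are placeholders, not the paper's released code.
    """
    s = env.reset()
    g = None
    high_reward, low_transitions = 0.0, []
    for t in range(T):
        if t % k == 0:                      # high level acts every k steps
            g = high_policy(s)              # g_t ~ pi_h(. | s_t)
        # otherwise the subgoal is carried over, one simple choice of the
        # goal transition g_t = h(g_{t-1}, s_{t-1}, s_t)

        a = low_policy(s, g)                # a_t ~ pi_l(. | s_t, g_t)
        s_next, r_env, done, _ = env.step(a)

        high_reward += r_env                          # accumulated into r^h_{kt}, Eq. (1)
        r_low = -np.linalg.norm(g - phi(s_next))      # intrinsic reward r^l_t
        low_transitions.append((s, g, a, r_low, s_next))

        s = s_next
        if done:
            break
    return high_reward, low_transitions
```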

4 Methods

This section presents a novel reward shaping method, named reachability-based reward shaping, to reduce the reliance on the quality of demonstrations. It constrains the high-level policy of the HRL agent to produce subgoals near the demonstration trajectory, making the agent perform a proximal search around the demonstration to find a better trajectory. We then incorporate this reward shaping into the goal-conditioned HRL framework so that the HRL agent converges to policies better than the demonstration.

4.1 Reachability-Based Reward Shaping

Following the preliminaries above, we require a metric to measure the distance between a subgoal and the demonstration. The state space is not a Euclidean space [9, 46] but a state manifold that incorporates environmental structure, so we adopt the shortest transition distance (ST distance) [46], which measures the reachability between two states and takes the environmental manifold into account. The ST distance is the minimum expected number of steps the agent needs to reach state \(s_2\) starting from state \(s_1\), rather than the Euclidean distance between \(s_1\) and \(s_2\). It is defined as:

$$\begin{aligned} d_{st}(s_1, s_2):= \mathop {\min }_{\pi \in \Pi }E[{\mathcal {T}}_{s_1s_2}|\pi ]=\mathop {\min }_{\pi \in \Pi }\sum _{t=0}^{\infty }tP({\mathcal {T}}_{s_1s_2}=t|\pi ), \end{aligned}$$
(2)

where \({\mathcal {T}}_{s_1s_2}\) denotes the first hit time of reaching \(s_2\) from \(s_1\), and \(\Pi \) is the set of all policies.
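For intuition, in a deterministic, discrete environment such as the Maze task, the ST distance reduces to the shortest-path length on the transition graph, which the sketch below computes with breadth-first search. The `neighbors` helper is an assumed environment-specific function; in continuous tasks the ST distance is instead approximated by a learned model, as in [46], so this is only an illustration.

```python
from collections import deque

def st_distance(start, target, neighbors):
    """Shortest transition distance d_st(start, target) in a deterministic,
    discrete environment: the minimum number of steps needed to first reach
    `target` from `start`. `neighbors(s)` returns the states reachable from
    s in one step (an assumed environment-specific helper)."""
    if start == target:
        return 0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        s, d = frontier.popleft()
        for s_next in neighbors(s):
            if s_next == target:
                return d + 1
            if s_next not in seen:
                seen.add(s_next)
                frontier.append((s_next, d + 1))
    return float("inf")    # target unreachable
```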

One effective approach to leveraging demonstrations for the high-level agent is to generate subgoals in close proximity to the demonstration, which enables the agent to produce a trajectory similar to the demonstrated one. However, if the subgoal is too far from the demonstration, the demonstration cannot offer efficient guidance to the HRL agent; if the generated subgoal is too close to the demonstration, the learning agent cannot explore better trajectories. We thus leverage the ST distance to constrain the high-level policy to generate subgoals in the proximal region around the demonstration, as illustrated in Fig. 1a, b, which also shows how the reachability constraint reshapes the reward function so that the learning agent gains sufficient guidance from the demonstration while remaining able to explore better policies.

To generate subgoals that are not too far from the demonstration, we introduce the concept of the m-step reachable region: the ST distance between every state in this region and the demonstration is at most m steps. We call the states in this region reachable.

Definition 1

Let \(s^D \in S^D\), where \(S^D\) is the set of states on the demonstration, and let g be the subgoal of the learning agent at state s. The m-step reachable region around \(s^D\) is defined as:

$$\begin{aligned} G_R(s^D,m):=\{g \in G\mid d_{st}(s^D,\varphi ^{-1}(g))\leqslant m\} \end{aligned}$$
(3)

Incorporating the reachable region, the high-level objective becomes:

$$\begin{aligned} \mathop {\max }_{\theta _h}E_{\pi _{\theta _h}^h}\sum _{t=0}^{T-1}\gamma ^t r_{kt}^h, \end{aligned}$$
(4)

subject to \(d_{st}(s^D_{kt},\varphi ^{-1}(g_{kt}))\leqslant m,\; t=0,1,2,...,T-1\), where \(g_{kt} \sim \pi ^h_{\theta _h}(g|s_{kt})\) and \(r_{kt}^h\) is the high-level reward. In practice, we employ the shortest transition distance from the learning agent’s subgoal to the demonstration and derive the following objective through reward shaping:

$$\begin{aligned} \mathop {\max }_{\theta _h}E_{\pi _{\theta _h}^h}\sum _{t=0}^{T-1}\gamma ^t\left( r_{kt}^h- \eta _{1} \cdot \mathop {\min }_{s' \in {\mathcal {T}}_{kt}}\left( d_{st}\left( \varphi ^{-1}(g_{kt}),s'\right) -m\right) \right) , \end{aligned}$$
(5)

where \(\mathop {\min }_{s' \in {\mathcal {T}}_{kt}}(d_{st}(\varphi ^{-1}(g_{kt}),s')-m)\) selects the shortest transition distance from the agent’s subgoal to the demonstration.

However, constraining the subgoal to be generated near the demonstration alone risks convergence to a suboptimal policy. To avoid this, the k-step adjacent region is introduced into our method, defined as:

$$\begin{aligned} G_A(s,k):=\{g \in G\mid d_{st}(s,\varphi ^{-1}(g))\leqslant k\}, \quad s \in S. \end{aligned}$$
(6)

The k-step adjacent region identifies the subset of states that can be reliably reached with an optimal goal-conditioned policy; the subgoal set is mapped from this reachable subset of states. It makes use of the property of \(\pi ^*\) in a deterministic MDP: subgoals falling in the k-step adjacent region of the current state can represent all optimal subgoals in the entire goal space in terms of the induced k-step low-level action sequence, given an optimal low-level policy \(\pi _l^*=\pi ^*\). Zhang et al. [46] have proved that this reduction of the goal space preserves the policy’s optimality.

To ensure that the subgoal is generated in both the adjacent region and the reachable region, we define a new reachable region \({G'}_R\) as the intersection of the adjacent region \(G_A\) and the original reachable region \(G_R\): \({G'}_R = G_A \cap G_R\), and we use this region to constrain the generation of subgoals.
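As a minimal illustration before the soft reward-shaping form introduced next, membership of a candidate subgoal in the intersected region \({G'}_R\) can be checked as below. The helpers `st_distance` and `phi_inv` (an inverse of the mapping \(\varphi \)) are assumptions for this sketch rather than parts of our released implementation.

```python
def in_constrained_region(g, s, demo_states, st_distance, phi_inv, m, k):
    """True iff subgoal g lies in G'_R = G_A(s, k) ∩ G_R(demo, m)."""
    g_state = phi_inv(g)
    # reachability: within m steps of some demonstration state, Eq. (3)
    reachable = min(st_distance(s_d, g_state) for s_d in demo_states) <= m
    # adjacency: within k steps of the current state s, Eq. (6)
    adjacent = st_distance(s, g_state) <= k
    return reachable and adjacent
```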

In this case, the generation of subgoals is constrained by both the adjacency constraint from the adjacent region and the reachability constraint from the reachable region. We reshape the objective function as:

$$\begin{aligned} \mathop {\max }_{\theta _h}E_{\pi _{\theta _h}^h}\sum _{t=0}^{T-1}\gamma ^t \left( r_{kt}^h-\eta _{1} \cdot \mathop {\min }_{s' \in {\mathcal {T}}_{kt}}\left( d_{st}(\varphi ^{-1}(g_{kt}),s')-m\right) - \eta _{2} \cdot H\left( d_{st}(s_{kt},\varphi ^{-1}(g_{kt})),k\right) \right) \nonumber \\ \end{aligned}$$
(7)

where \(H(d_{st}(s_{kt},\varphi ^{-1}(g_{kt})),k)\) is the adjacency constraint defined by Zhang et al. [46], \(H(x,k)=\max (x/k-1,0)\) is a hinge loss function, and \(\eta _{2}\) is a balancing coefficient.
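A minimal sketch of the shaped reward inside Eq. (7): the reachability term penalizes subgoals whose shortest ST distance to the demonstration exceeds m, and the hinge term \(H(x,k)=\max (x/k-1,0)\) penalizes subgoals outside the k-step adjacent region of the current state. The use of an exact `st_distance` and the helper names are assumptions for illustration; in practice the distances are approximated.

```python
def hinge(x, k):
    """H(x, k) = max(x / k - 1, 0)."""
    return max(x / k - 1.0, 0.0)

def shaped_high_reward(r_env, s, g, demo_states, st_distance, phi_inv,
                       m, k, eta1, eta2):
    """Per-step shaped high-level reward corresponding to Eq. (7)."""
    g_state = phi_inv(g)
    # reachability term: shortest ST distance from the subgoal to the demonstration
    d_demo = min(st_distance(g_state, s_d) for s_d in demo_states)
    reach_penalty = eta1 * (d_demo - m)
    # adjacency term: keep the subgoal within k steps of the current state
    adj_penalty = eta2 * hinge(st_distance(s, g_state), k)
    return r_env - reach_penalty - adj_penalty
```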

4.2 HRLfD Algorithm Through RbRS

In the previous sections, we have clarified the use of reward shaping (RS) and learning from demonstration (LfD) in RL; these kinds of methods have made significant progress in accelerating the convergence of the learning process. In this section, we introduce our method in detail. Its main procedure is shown in Fig. 1.

Unlike some methods that combine LfD and HRL, which impose policy constraints only on the high level or only on the low level, our method imposes constraints on both levels. The high level uses the state to generate a subgoal for that specific state, and the low level uses the state and the subgoal from the high level to produce an action; our experiments show that constraining both levels yields better learning results.

Simply stated, our method uses only one suboptimal demonstration to supervise the agent in learning a better trajectory. The trajectory of the agent can be viewed as a finite collection of states, subgoals, and actions. We stipulate that each state has its corresponding subgoal and action, forming a triplet \(<s_t,sg_t,a_t>\); besides, every state on the trajectory has a corresponding state \(s_t^{demo}\) and triplet \(<s_t^{demo},sg_t^{demo},a_t^{demo}>\) on the demonstration. When the two corresponding triplets \(<s_t,sg_t,a_t>\) and \(<s_t^{demo},sg_t^{demo},a_t^{demo}>\) have high similarity, we call the triplet from the agent’s trajectory a well-trained triplet. If every, or most, of the triplets of a trajectory are well-trained, we call it a well-trained trajectory. Meanwhile, the subgoal is an important input for the low level to generate a correct action to reach the goal G. Therefore, it is reasonable and necessary to use the subgoal as part of the demonstration and to impose constraints on both the high level and the low level. Our demonstration thus maintains states, subgoals, and actions. We define a full demonstration that fits the hierarchical structure of HRL as \(Full_{demo}=\lbrace Hi_{demo},Lo_{demo} \rbrace \), where \(Hi_{demo}\) contains the state, subgoal, and next state, while \(Lo_{demo}\) contains the state, subgoal, action, next state, and next subgoal. \(Full_{demo}\), \(Hi_{demo}\), and \(Lo_{demo}\) are written as follows:

$$\begin{aligned} Full_{demo}=&\lbrace \lbrace Hi_{demo} \rbrace ,\lbrace Lo_{demo} \rbrace \rbrace \end{aligned}$$
(8)
$$\begin{aligned} Hi_{demo} =&\lbrace (s_0^{demo},sg_0^{demo}),(s_1^{demo},sg_1^{demo}),...,(s_{t-1}^{demo},sg_{t-1}^{demo}) \rbrace \end{aligned}$$
(9)
$$\begin{aligned} Lo_{demo}=&\lbrace (s_0^{demo},sg_0^{demo},a_0^{demo}),(s_1^{demo},sg_1^{demo},a_1^{demo}),...,\end{aligned}$$
(10)
$$\begin{aligned}&(s_{t-1}^{demo},sg_{t-1}^{demo},a_{t-1}^{demo})\rbrace \end{aligned}$$
(11)

Regarding how the demonstration is obtained, we take it from one successful but suboptimal trajectory in the replay buffer. Besides, we formulate different judgment criteria for different environments, and we explain each criterion in the experiments section. Next, we elaborate on our constraint approach at each level.
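The hierarchical demonstration can be stored as two aligned lists, mirroring Eqs. (8)-(11) and the description above (next state and next subgoal included). The sketch below extracts \(Hi_{demo}\) and \(Lo_{demo}\) from one successful \(<state, subgoal, action>\) trajectory taken from the replay buffer; the function and variable names are chosen here only for illustration.

```python
def build_demo(trajectory):
    """trajectory: list of (state, subgoal, action) triplets from one
    successful but possibly suboptimal episode in the replay buffer.
    Hi_demo keeps (s, sg, s_next); Lo_demo keeps (s, sg, a, s_next, sg_next)."""
    hi_demo, lo_demo = [], []
    for t in range(len(trajectory) - 1):
        s, sg, a = trajectory[t]
        s_next, sg_next, _ = trajectory[t + 1]
        hi_demo.append((s, sg, s_next))
        lo_demo.append((s, sg, a, s_next, sg_next))
    return {"high": hi_demo, "low": lo_demo}   # Full_demo, Eq. (8)
```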

The high-level approach has already been presented in Sect. 4.1; the high-level objective function is given by Eq. (7).

To take full advantage of every item in the triplet \(<s,sg,a>\), especially the subgoal sg, we need to consider how this important intermediate product is generated on the high level and how it is used on the low level. Since the reachability constraint on the high level strongly affects the generation of reachable subgoals, including these subgoals in the demonstration provides the agent with a positive constraint and reduces the amount of training data required. This gives us full reason to use subgoals as part of the demonstration and to give them a higher constraint weight on the low level.

On the low level, we first restate the low-level reward function:

$$\begin{aligned} r_t^l(s_t,a_t)=-D(g_t, \varphi (s_{t+1})), \end{aligned}$$
(12)

This reward describes the subgoal-reaching performance and supervises the low-level learning process, where D is the Euclidean distance function. From this reward function, we obtain the low-level objective function:

$$\begin{aligned} J_{low}(\pi _l):=E [\sum _{t=0}^{\infty } \gamma ^t r_t^l(s_t,a_t)]\end{aligned}$$
(13)

where \(\gamma \) is the discount coefficient.

Algorithm 1

HRLfD through Reachability-based Reward Shaping

In the spirit of shaping reinforcement learning using demonstrations [19], we adopt the concept of “similarity”. Similarity describes how similar a triplet from the training trajectory is to a triplet from the demonstration; its value lies in the range [0, 1], and the more similar the two triplets are, the larger the similarity h is. To measure the similarity, we first check whether the actions \(a_{t_0}\) and \(a_{t_0}^{demo}\) at timestep \(t_0\) are the same. If not, we set the similarity h to 0; otherwise we compute:

$$\begin{aligned} h(s,s^{demo})=1/(d(s,s^{demo})+1), \end{aligned}$$
(14)

where d is the Euclidean distance; \(h(s,s^{demo})\) equals 1 when s and \(s^{demo}\) coincide and tends to 0 as s and \(s^{demo}\) move further apart. Calculating the similarity makes the demonstration available to the learning agent as a bias for its exploration. After calculating the similarity, we use it to define a potential function. To compute the potential of a given triplet \(<s_t,sg_t,a_t>\), we check the set of demonstrations and find the sample with the same action that yields the highest similarity:

$$\begin{aligned} \Phi ^{demo}(s,sg,a)=\max _{(s^{demo},sg^{demo})}[h(s,s^{demo})+h(sg,sg^{demo})] \end{aligned}$$
(15)

After creating the potential function, we need to integrate it into the learning process by creating a reward shaping function \(F^D\):

$$\begin{aligned} F^D(s,sg,a,s',sg',a')=\gamma \Phi ^{demo}(s',sg',a')-\Phi ^{demo}(s,sg,a). \end{aligned}$$
(16)

Then we add this reward shaping function to the objective function of the low level:

$$\begin{aligned} J_{low}(\pi _l):=E\sum _{t=0}^{\infty }\gamma ^t(r_t^l(s_t,a_t)+F_t^D) \end{aligned}$$
(17)
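The following sketch ties Eqs. (14)-(16) together: the similarity is computed only between samples whose actions match, the potential \(\Phi ^{demo}\) takes the best-matching demonstration sample, and the potential-based term \(F^D\) is added to the intrinsic reward as in Eq. (17). Exact action equality and the Euclidean norm as the distance d are illustrative assumptions; `lo_demo` holds \((s, sg, a, s', sg')\) tuples as in the earlier sketch.

```python
import numpy as np

def similarity(x, x_demo):
    """h(x, x_demo) = 1 / (d(x, x_demo) + 1), Eq. (14)."""
    return 1.0 / (np.linalg.norm(np.asarray(x) - np.asarray(x_demo)) + 1.0)

def potential(s, sg, a, lo_demo):
    """Phi^demo(s, sg, a): best state + subgoal similarity among demonstration
    samples with the same action, Eq. (15); 0 if no action matches."""
    best = 0.0
    for s_d, sg_d, a_d, _, _ in lo_demo:
        if np.array_equal(a, a_d):
            best = max(best, similarity(s, s_d) + similarity(sg, sg_d))
    return best

def shaping_term(s, sg, a, s_next, sg_next, a_next, lo_demo, gamma=0.99):
    """F^D = gamma * Phi(s', sg', a') - Phi(s, sg, a), Eq. (16), added to the
    intrinsic reward r^l_t in the low-level objective, Eq. (17)."""
    return (gamma * potential(s_next, sg_next, a_next, lo_demo)
            - potential(s, sg, a, lo_demo))
```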

5 Experiments

We have presented our method of HRLfD through reachability-based reward shaping. The experiments are designed to answer the following questions: (1) Can our method improve the overall performance and sample efficiency of demonstration-guided HRL? (2) Can our method outperform other methods that also aim to enhance the learning effectiveness of hierarchical agents? (3) Can our method perform stably across different demonstrations?

5.1 Environment Setup

We first introduce how we generate the demonstration. Our demonstration is set up as a series of \(<state, subgoal, action>\) tuples. All demonstrations reach the final goal; the shortest ones are called optimal demonstrations, and the rest are called suboptimal demonstrations. If only incomplete demonstrations are available, we append the goal state to the end of the demonstration.

The effectiveness of the method is evaluated on two sorts of tasks with continuous and discrete action and state spaces. The discrete task is Maze, which is adopted from HRAC and requires the agent to accomplish tasks that entail both low-level control and high-level planning in grid worlds with injected stochasticity. The continuous tasks are Ant Maze and Ant Gather, which are widely used benchmarks in the HRL community. In all these tasks, we employ a pre-defined 2D goal space that represents the agent's (x, y) position. The structure of each environment is shown in Fig. 2a–c. In addition, to demonstrate the transferability of our method and the diversity of tasks it can adapt to, we apply our method to a robotic arm for further validation; these manipulation tasks are shown in Fig. 2d, e.

Fig. 2

The environments used in the experiments are Maze, Ant Gather, Ant Maze, and two robotic arm tasks. In Maze (a), the agent starts from a given position and aims to reach the final goal with dense rewards. In Ant Gather (b), the ant robot starts from a fixed position and needs to avoid bombs (red) and collect apples (green). In Ant Maze (c), the ant robot starts from a fixed position and needs to reach a target position in a maze with dense rewards. In Door Open (d), the robotic arm should reach and open the door. In Drawer Open (e), the robotic arm should reach and pull out the drawer

Maze This environment is \(13 \times 17\) in size, with a discrete 4D action space and a discrete 2D state space representing the agent's (x, y) position. A substantial reward is given to the agent to encourage exploration. The maximum duration of each episode is 200 steps.

Ant Gather This environment features a continuous state space and is \(20 \times 20\) in size. The ant starts from a fixed position, gathers green apples for a +1 reward, avoids red bombs that incur a \(-1\) reward, and moves for a maximum of 500 steps.

Ant Maze This environment has a continuous state space with a size of \(24 \times 24\). During the training stage, the target position is sampled randomly by the environment at the beginning of each episode, and the agent receives a dense reward equal to the negative Euclidean distance from its current state to the target position at each time step. During the evaluation stage, (0, 16) is fixed as the target position, and an episode is considered successful if the agent reaches a position within a Euclidean distance of 5 from the target. Each episode ends when the time steps reach 500.

Door Open Door Open is selected from Meta-World, and this task can be separated into three steps: reach the door \(\rightarrow \) roll the lock \(\rightarrow \) pull the door. In the practical experiments, we train this task for 500 iterations of 25 timesteps each. During the evaluation, we test with 50 timesteps.

Drawer Open Drawer Open is selected from Meta-World, and this task can be separated into two steps: reach the drawer and pull it out. In the practical experiments, we train this task for 500 iterations of 25 timesteps each. During the evaluation, we test with 50 timesteps.

5.2 Comparative Experiments

To evaluate the performance of the method comprehensively with different HRL implementations, we follow two separate HRL setups for the different task types. On discrete tasks, on-policy A2C is employed for low-level training and off-policy TD3 for high-level training, while on continuous tasks TD3 is used for both the low-level and high-level training.

Fig. 3

Learning curves of our method (HRLfD-RbRS) and baselines on all tasks. Each curve represents mean episode reward, averaged over 5 independent trials. All curves have been smoothed equally for visual clarity

We compare our method with the following baselines. (1) UOF: a state-of-the-art method of HRLfD [47]. (2) LfD-HRL: the most widely used learning-from-demonstration method based on HRL, extended from RLfD through shaping [37]. (3) ForgER [10]: a method that allows an agent to use low-quality expert demonstrations in complex environments by effectively handling expert data errors and adapting the action space and state representation to the agent's capabilities. (4) \(A^2\) [48]: a method that integrates two components inspired by human experiences: abstract demonstrations and adaptive exploration.

The learning curves of HRLfD-RbRS and the baselines across all tasks are plotted in Fig. 3. HRLfD-RbRS outperforms all of these baselines on the Maze tasks and achieves our goals of increasing learning efficiency and preserving optimality. In particular, on the Maze tasks, HRLfD-RbRS even performs better than the demonstration. We can conclude that our method overcomes the dependence on the quality of the demonstration.

In the Door Open and Drawer Open tasks, our method still performs the best among all the baselines, while LfD-HRL performs noticeably worse. We can conclude that our method not only performs well in ordinary tasks such as the Maze tasks, but can also be adapted to robotic arm tasks.

5.3 Ablation Experiments

In this section, we examine (1) how well the adjacency constraint and the reward shaping work, and (2) the stability of our method under different suboptimal or optimal demonstrations.

To verify separately the effect of the adjacency constraint and the reward shaping constraint, and to examine the overall performance of HRLfD-RbRS, we set up (1) HRL-RS: our framework without the adjacency constraint, (2) HRL-AC: our framework with only the adjacency constraint, (3) HRLfD-RbRS: our full framework, and (4) HRL-0: a blank control group, i.e., a framework without any constraint. We test these frameworks on the Ant Gather task and the Maze task; the learning curves are shown in Fig. 4.

Fig. 4

The learning curves in the ablation study are averaged over 5 independent trials. a Ablation test on Maze. b Ablation test on Ant Gather. c Different demonstrations test on Maze. d Different demonstrations test on Ant Gather

From the learning curves in Fig. 4a and b, HRL-AC and HRL-RS both perform better than HRL-0: they learn faster and gain more rewards. However, HRLfD-RbRS has slightly better learning efficiency and an overall better and more stable performance than HRL-AC and HRL-RS.

Fig. 5

Visualization of the training process of Ant Maze

To verify the performance of our method under different demonstrations, we set up a set of comparative experiments based on the optimal demonstration and suboptimal demonstrations and test them on Maze and Ant Gather. In the Maze task, the highest reward we can sample is 5.8, while it should be 6 theoretically; we therefore set the optimal demonstration with \(reward = 5\) and the suboptimal demonstrations with \(reward = 4\) and \(reward = 3\). In the Ant Gather task, we set the optimal demonstration with \(reward = 8\) and the suboptimal demonstrations with \(reward = 5\) and \(reward = 4\). The learning curves are shown in Fig. 4c, d.

According to the learning curves in Fig. 4c, d, when the quality of the demonstration is not too far from optimal, our method performs similarly in learning efficiency and learning effect. On the other hand, when the demonstration quality is far from optimal (shown in the Maze task with \(reward = 3\)), the learning efficiency is slightly worse than with near-optimal demonstrations. We suppose this is because the poor quality or complexity of the demonstration increases the difficulty of searching for better trajectories.

Figure 5 shows the visualization of the training process. In each panel, we randomly pick 20 trajectories of the agent at 0% trained, 50% trained, and 100% trained. When the agent has not been trained (Fig. 5a), it can only travel around the start position. When it is half-trained, it reaches the goal with 65% probability, and the light spots in Fig. 5b represent the agent wandering at those positions. When the agent is fully trained, it reliably reaches the goal, and the visualization shows no light spots like those in Fig. 5b.

Fig. 6

Trajectories of 100% trained agent and demonstration

Figure 6 shows the trajectories of the learning agent and the demonstration when the agent is 100% trained. The red curve, which is a certain distance away from the others, is the demonstration, while the other trajectories are collected from the agent. From this figure, we can see that the agent learns a better and shorter trajectory than the demonstration.

6 Conclusion and Discussion

To address the problem that learning performance in the HRLfD field is heavily dependent on the quality of the demonstration, we present a novel HRLfD framework that reduces this reliance. Moreover, based on suboptimal demonstrations, our method can explore better decision trajectories than the demonstrations themselves. The practical applicability of the method is demonstrated through experiments on several tasks with continuous and discrete action and state spaces.

HRLfD is one of the most promising avenues to scale up reinforcement learning (RL), and it provides a compelling paradigm to address long-horizon and sparse-reward issues. However, some important problems remain, such as how to design understandable and effective demonstrations and how to enhance the method's generalization capacity. For instance, whether the demonstrations are produced by humans or robots, it is still difficult to develop demonstrations for complicated action patterns and challenging tasks in the real world, especially in the domain of massive industrial production. Making progress on randomized settings, such as random start positions and random reward positions, in order to adapt to the circumstances that occur in reality, may be another area of future development.