1 Introduction

Hierarchical reinforcement learning (HRL) has significantly advanced the capabilities of reinforcement learning (RL) in tackling complex, temporally extended problems that involve sparse rewards and long-term credit assignment [1,2,3,4]. HRL involves a high-level policy that decomposes the original task into multiple subtasks with corresponding subgoals [5, 6], as well as a low-level policy that accomplishes these subtasks by achieving the subgoals [6,7,8]. Despite their potential, traditional HRL algorithms often suffer from low training efficiency [9], making them impractical for real-world applications. Using expert demonstrations to provide prior knowledge and guide learning agents is one of the most popular ways to improve learning efficiency [8, 10,11,12,13,14]. Hierarchical Reinforcement Learning from Demonstrations (HRLfD) is one efficient way to improve the training efficiency of HRL [8]. HRLfD leverages demonstrations as an extra supervision signal to provide prior knowledge, a reward function, and a subtask structure. However, the expert demonstration influences the agent’s learning performance and the generalization of the learned policy, especially in complex and high-dimensional environments [10]. Therefore, if the demonstration is suboptimal or noisy, it may lead the agent to suboptimal or even incorrect action decisions, or fail to capture the task structure and reward function accurately, resulting in poor learning outcomes and slow convergence.

Current approaches to addressing the suboptimality of demonstrations fall into three categories: preference-based reinforcement learning [15,16,17], self-supervised reward regression [18,19,20], and forgetful experience replay [10]. The first category, preference-based reinforcement learning, utilizes pairwise rankings of demonstrations to infer a better policy. However, it relies heavily on the quality of the preference model and the preference elicitation process [21]; if the preferences are overly complicated or noisy, the learning agent may acquire a large number of suboptimal states and actions, impairing learning performance. The second category, self-supervised reward regression, leverages suboptimal demonstrations to generate synthetic data and train an idealized reward function [18,19,20]. It achieves a better correlation with the ground-truth reward and improves over preference-based reinforcement learning when dealing with suboptimal demonstrations, but it requires a large number of demonstrations to generate the synthetic data. The third category, forgetful experience replay, reduces the ratio of expert data in the replay buffer and learning batches over time [10]. This allows the agent to overcome the influence of low-quality expert actions and adapt the trajectories to its own action space. However, forgetful experience replay may lose valuable information from the expert data as the ratio decreases, depends on the quality of sub-goal extraction from the expert data, and has difficulties improving both the imitation and forging phases [10]. In summary, constructing a simple but efficient model that mitigates the negative interference of suboptimal demonstrations in HRL remains an open challenge.

In this paper, we propose a novel HRLfD framework to learn efficient goal-conditioned HRL policies from suboptimal demonstrations. Our framework incorporates the idea of nearest-neighbor search to guide the HRL agent in improving its policies based on a single suboptimal trajectory used as the demonstration. The main idea of our approach is illustrated in Fig. 1a: the high-level policy generates subgoals that are constrained within the m-reachability proximal region around the single demonstration trajectory. This ensures that the agent not only exploits the demonstration to accelerate learning but also explores the m-reachability proximal region to find trajectories better than the demonstration. By employing the m-reachability proximal constraint, we reduce the agent’s reliance on the quality of demonstrations: the high-level policy only needs to explore subgoals whose reachability distance from the demonstration is less than m. To formalize this constraint, we introduce an m-step reachability-based reward shaping (RbRS) method into the HRL framework, which incorporates the m-reachability proximal constraint into the learning process of the high-level policy. Based on this constraint, we design an HRL algorithm utilizing m-step RbRS to enhance the training efficiency of HRLfD. In the experimental section, our method is evaluated on various tasks, including discrete and continuous robot control tasks in the MuJoCo simulator [22], which is widely used in HRL research [12, 23,24,25]. The evaluation results demonstrate the superiority of our method in terms of both asymptotic performance and sample efficiency compared to current state-of-the-art HRLfD and RLfD algorithms. Our method uses only one demonstration while remaining free to explore within the reachable area, constraining the search space of the learning agent. This allows it to quickly learn effective policies while also exploring trajectories better than the demonstration offers. The PyTorch implementation of our method is open-sourced on GitHub: https://github.com/GaoXZ1807/HRLfD-RbRS.

Fig. 1

a Illustration of our method: the light green curve represents the sub-optimal demonstration trajectory and the orange curve represents the better trajectory generated by the learning agent; the Ant Maze task is used as an example. Each transition from state to subgoal is constrained by the m-step reachability constraint. b The goal-conditioned HRL framework and the reachability constraint implemented by the m-step reachability proximal space (dashed orange box labeled "Reachability Space"). The m-step reachability proximal space is the intersection of the adjacency space (the pink circle) and the reachability space (the yellow rectangle)

2 Related Work

Data-efficient learning of hierarchical policies is a long-standing problem in HRL. Goal-conditioned HRL [24, 26,27,28,29,30,31,32,33] utilizes a framework in which high-level policies generate subgoals and low-level policies learn to reach them. One effective approach to enhance the learning efficiency of HRL is to incorporate expert demonstrations as an additional supervision signal, thereby facilitating the learning process and mitigating the issue of sparse rewards [8, 34,35,36,37]. However, creating expert demonstrations often requires manual effort, which can be inconvenient. Moreover, when the demonstrations are suboptimal, they can lead to suboptimal convergence and poor learning outcomes.

There exist approaches that reduce the dependence on the quality of demonstrations, such as preference-based reinforcement learning (PbRL) [15,16,17], self-supervised reward regression [18,19,20], and forgetful experience replay [7, 10, 27, 38]. PbRL uses human preferences as feedback from experts instead of numeric rewards. In detail, PbRL offers the learning agent an alternative: learning policies from a teacher’s preferences instead of pre-defined rewards, so that the agent can avoid the pitfalls of reward engineering, such as reward hacking and infinite rewards [15, 16, 39]. However, when the preferences are suboptimal, the learning agent acquires a suboptimal bias, which means the quality and usefulness of this bias strongly depend on the quality of the preferences [21]; moreover, PbRL still lacks a coherent framework [15]. To learn from suboptimal demonstrations, self-supervised reward regression [18,19,20] characterizes the relationship between the performance of a policy and the quantity of injected noise, and synthesizes rewards via self-supervision. However, it requires a large number of suboptimal demonstrations to generate the synthetic data and cannot use high-level spatial features to penalize mismatches adaptively. In some tasks, especially long-horizon and sparse-reward tasks, it is hard to collect so many demonstrations, particularly usable ones. Forgetful experience replay [10] reduces the proportion of expert data in the replay buffer and learning batches over time, which allows the agent to overcome the influence of low-quality expert actions and adapt the trajectories to its own action space. However, one limitation of forgetful experience replay is the potential loss of valuable information from expert data as their proportion decreases over time; it also has difficulties improving both the imitation and forging phases. In our comparison experiments, ForgER performs very poorly, far below expectations. None of the above methods eliminates the dependence on the demonstration.

Previous works have developed different forms of demonstrations to avoid a strong dependence on the traditional paradigm [26, 40,41,42,43]. While traditional demonstrations typically consist of state-action pairs, abstract demonstrations focus on capturing key information, such as a dataset of skills or different orderings of step indexes. Yang et al. [26] introduce a unified hierarchical reinforcement learning framework called the universal option framework (UOF), which utilizes abstract demonstrations to enable the agent to learn diverse outcomes in multi-step tasks. However, abstract demonstrations are more suitable for tasks involving a small number of steps and may not perform well in more complex tasks that require hundreds of steps, such as Maze tasks. Taylor et al. [42] improve both learning time and policy performance by transferring human demonstrations into a baseline policy for an agent and refining it using reinforcement learning. However, the neural network used in this method is complex, which makes it hard to transfer to other approaches. Nair et al. [43] use demonstrations to overcome the exploration problem and successfully learn to perform long-horizon, multi-step robotics tasks with continuous control, such as stacking blocks with a robot arm. However, this approach still requires a large amount of experience, which makes it impractical for real-world applications.

3 Preliminary

3.1 Goal-Conditioned HRL

The Markov decision process (MDP) of goal-conditioned HRL is defined as a tuple \(<S, G, A, P, R, \gamma>\), where S is the set of states, G is the goal set, A is the set of actions, \(P: S\times A\times S\rightarrow [0,1]\) is the state transition function, R is the reward function, and \(\gamma \) is a discount factor.

\(\pi ^h_{\theta _h}(g|s)\) is defined as the high-level policy. The goal of the high level is to maximize the external reward \(r_{kt}^h\) by generating subgoals \(g_t \sim \pi _h(\cdot |s_t)\) as high-level actions. A new subgoal is generated whenever \(t \equiv 0(\bmod \; k)\), where \(k>1\) is a pre-determined hyper-parameter; the high level utilizes a mapping function \(\varphi : S \rightarrow G\), since the goal space G is considered a subspace of the state space S [22, 37, 44]. When \(t \not \equiv 0(\bmod \; k)\), the high-level controller utilizes the goal transition process \(g_t=h(g_{t-1}, s_{t-1},s_t)\). The external reward function of the high level is defined as:

$$\begin{aligned} r_{kt}^h = \sum _{i=kt}^{kt+k-1}R(s_i,a_i),\; \;\;\; t=0,1,2..., \end{aligned}$$
(1)

i.e., the accumulation of the external reward over the time interval \([kt,\; kt+k-1]\), where \(R(s_i,a_i)\) is the reward from the environment.

\(\pi ^l_{\theta _l}(a|s,g)\) is defined as the low-level policy. The goal of the low level is to maximize the intrinsic reward provided by the high-level policy, which describes the subgoal-reaching performance. The low-level policy takes the current state \(s_t\) and the corresponding subgoal \(g_t\) as input and performs a primitive action \(a_t \sim \pi _l(a|s_t,g_t) \in A \) at every time step. To induce the low-level policy to reach the subgoal \(g_t\), an intrinsic reward is provided to measure the subgoal-reaching performance; this low-level reward function is defined as \(r_t^l=-D(g_t,\; \varphi (s_{t+1}))\), where D is the Euclidean distance in practice.

The goal-conditioned HRL framework allows the low-level policy to receive learning signals before it reaches the desired final goal. Besides, goal-conditioned HRL enables concurrent end-to-end training of both the high-level and low-level policies [45].
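To make the two-level interaction concrete, the following minimal sketch (in Python, using hypothetical `env`, `high_policy`, and `low_policy` placeholders that are not taken from our released implementation) illustrates how a subgoal is emitted every k steps, kept fixed between high-level decisions, and how the intrinsic low-level reward \(r_t^l=-D(g_t,\varphi (s_{t+1}))\) is computed.

```python
import numpy as np

def rollout(env, high_policy, low_policy, phi, k=10, T=500):
    """One episode of goal-conditioned HRL (illustrative sketch only).

    phi maps a state to the goal space G (here, the (x, y) position);
    high_policy(s) returns a subgoal g; low_policy(s, g) returns a primitive
    action a. All three are placeholders, not the paper's released code.
    """
    s = env.reset()
    g = None
    high_reward, low_transitions = 0.0, []
    for t in range(T):
        if t % k == 0:                      # high level acts every k steps
            g = high_policy(s)              # g_t ~ pi_h(. | s_t)
        # otherwise the subgoal is carried over, one simple choice of the
        # goal transition g_t = h(g_{t-1}, s_{t-1}, s_t)

        a = low_policy(s, g)                # a_t ~ pi_l(. | s_t, g_t)
        s_next, r_env, done, _ = env.step(a)

        high_reward += r_env                          # accumulated into r^h_{kt}, Eq. (1)
        r_low = -np.linalg.norm(g - phi(s_next))      # intrinsic reward r^l_t
        low_transitions.append((s, g, a, r_low, s_next))

        s = s_next
        if done:
            break
    return high_reward, low_transitions
```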

4 Methods

This section presents a novel reward shaping method, named reachability-based reward shaping, to reduce the reliance on the quality of demonstrations. It constrains the high-level policy of the HRL agent to produce subgoals near the demonstration trajectory, making the agent perform a proximal search around the demonstration to find a better trajectory. We then incorporate this reward shaping into the goal-conditioned HRL framework so that the HRL agent converges to policies better than the demonstration.

4.1 Reachability-Based Reward Shaping

Following the preliminaries above, we require a metric to measure the distance between a subgoal and the demonstration. The state space is not a Euclidean space [9, 46] but a state manifold that incorporates environmental structure, so we adopt the shortest transition distance (ST distance) [46], which measures the reachability between two states and takes the environmental manifold into account. The ST distance is the minimum expected number of steps the agent needs to reach state \(s_2\) starting from state \(s_1\), rather than the Euclidean distance between \(s_1\) and \(s_2\). It is defined as:

$$\begin{aligned} d_{st}(s_1, s_2):= \mathop {\min }_{\pi \in \Pi }E[{\mathcal {T}}_{s_1s_2}|\pi ]=\mathop {\min }_{\pi \in \Pi }\sum _{t=0}^{\infty }tP({\mathcal {T}}_{s_1s_2}=t|\pi ), \end{aligned}$$
(2)

where \({\mathcal {T}}_{s_1s_2}\) denotes the first hit time of reaching \(s_2\) from \(s_1\), and \(\Pi \) is the set of all policies.
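For intuition, in a deterministic, discrete environment such as the Maze task, the ST distance reduces to the shortest-path length on the transition graph, which the sketch below computes with breadth-first search. The `neighbors` helper is an assumed environment-specific function; in continuous tasks the ST distance is instead approximated by a learned model, as in [46], so this is only an illustration.

```python
from collections import deque

def st_distance(start, target, neighbors):
    """Shortest transition distance d_st(start, target) in a deterministic,
    discrete environment: the minimum number of steps needed to first reach
    `target` from `start`. `neighbors(s)` returns the states reachable from
    s in one step (an assumed environment-specific helper)."""
    if start == target:
        return 0
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        s, d = frontier.popleft()
        for s_next in neighbors(s):
            if s_next == target:
                return d + 1
            if s_next not in seen:
                seen.add(s_next)
                frontier.append((s_next, d + 1))
    return float("inf")    # target unreachable
```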

One effective approach to leveraging demonstrations for the high-level agent is to generate subgoals in close proximity to the demonstration, which enables the agent to produce a trajectory similar to the demonstrated one. However, if the subgoal is too far from the demonstration, the demonstration cannot offer efficient guidance to the HRL agent; if the generated subgoal is too close to the demonstration, the learning agent cannot explore better trajectories. We thus leverage the ST distance to constrain the high-level policy to generate subgoals in the proximal region around the demonstration, as illustrated in Fig. 1a, b, which also shows how the reachability constraint reshapes the reward function so that the learning agent gains sufficient guidance from the demonstration while remaining able to explore better policies.

To generate subgoals that are not too far from the demonstration, we introduce the concept of the m-step reachable region: the ST distance between every state in this region and the demonstration is at most m steps. We call the states in this region reachable.

Definition 1

Let \(s^D \in S^D\), where \(S^D\) is the set of states on the demonstration, and let g be the subgoal of the learning agent at state s. The m-step reachable region around \(s^D\) is defined as:

$$\begin{aligned} G_R(s^D,m):=\{g \in G\mid d_{st}(s^D,\varphi ^{-1}(g))\leqslant m\} \end{aligned}$$
(3)

Incorporating the reachable region, the high-level objective becomes:

$$\begin{aligned} \mathop {\max }_{\theta _h}E_{\pi _{\theta _h}^h}\sum _{t=0}^{T-1}\gamma ^t r_{kt}^h, \end{aligned}$$
(4)

subject to \(d_{st}(s^D_{kt},\varphi ^{-1}(g_{kt}))\leqslant m,\; t=0,1,2,...,T-1\), where \(g_{kt} \sim \pi ^h_{\theta _h}(g|s_{kt})\) and \(r_{kt}^h\) is the high-level reward. In practice, we employ the shortest transition distance from the learning agent’s subgoal to the demonstration and derive the following objective through reward shaping:

$$\begin{aligned} \mathop {\max }_{\theta _h}E_{\pi _{\theta _h}^h}\sum _{t=0}^{T-1}\gamma ^t\left( r_{kt}^h- \eta _{1} \cdot \mathop {\min }_{s' \in {\mathcal {T}}_{kt}}\left( d_{st}\left( \varphi ^{-1}(g_{kt}),s'\right) -m\right) \right) , \end{aligned}$$
(5)

where \(\mathop {\min }_{s' \in {\mathcal {T}}_{kt}}(d_{st}(\varphi ^{-1}(g_{kt}),s')-m)\) selects the shortest transition distance from the agent’s subgoal to the demonstration.

However, constraining the subgoal to be generated near the demonstration alone risks convergence to a suboptimal policy. To avoid this, the k-step adjacent region is introduced into our method, defined as:

$$\begin{aligned} G_A(s,k):=\{g \in G\mid d_{st}(s,\varphi ^{-1}(g))\leqslant k\}, \quad s \in S. \end{aligned}$$
(6)

The k-step adjacent region identifies the subset of states that can be reliably reached with an optimal goal-conditioned policy; the subgoal set is mapped from this reachable subset of states. It makes use of the property of \(\pi ^*\) in a deterministic MDP: subgoals falling in the k-step adjacent region of the current state can represent all optimal subgoals in the entire goal space in terms of the induced k-step low-level action sequence, given an optimal low-level policy \(\pi _l^*=\pi ^*\). Zhang et al. [46] have proved that this reduction of the goal space preserves the policy’s optimality.

To ensure that the subgoal is generated in both the adjacent region and the reachable region, we define a new reachable region \({G'}_R\) as the intersection of the adjacent region \(G_A\) and the original reachable region \(G_R\): \({G'}_R = G_A \cap G_R\), and we use this region to constrain the generation of subgoals.
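As a minimal illustration before the soft reward-shaping form introduced next, membership of a candidate subgoal in the intersected region \({G'}_R\) can be checked as below. The helpers `st_distance` and `phi_inv` (an inverse of the mapping \(\varphi \)) are assumptions for this sketch rather than parts of our released implementation.

```python
def in_constrained_region(g, s, demo_states, st_distance, phi_inv, m, k):
    """True iff subgoal g lies in G'_R = G_A(s, k) ∩ G_R(demo, m)."""
    g_state = phi_inv(g)
    # reachability: within m steps of some demonstration state, Eq. (3)
    reachable = min(st_distance(s_d, g_state) for s_d in demo_states) <= m
    # adjacency: within k steps of the current state s, Eq. (6)
    adjacent = st_distance(s, g_state) <= k
    return reachable and adjacent
```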

In this case, the generation of subgoals is constrained by both the adjacency constraint from the adjacent region and the reachability constraint from the reachable region. We reshape the objective function as:

$$\begin{aligned} \mathop {\max }_{\theta _h}E_{\pi _{\theta _h}^h}\sum _{t=0}^{T-1}\gamma ^t \left( r_{kt}^h-\eta _{1} \cdot \mathop {\min }_{s' \in {\mathcal {T}}_{kt}}\left( d_{st}(\varphi ^{-1}(g_{kt}),s')-m\right) - \eta _{2} \cdot H\left( d_{st}(s_{kt},\varphi ^{-1}(g_{kt})),k\right) \right) \nonumber \\ \end{aligned}$$
(7)

where \(H(d_{st}(s_{kt},\varphi ^{-1}(g_{kt})),k)\) is the adjacency constraint defined by Zhang et al. [46], \(H(x,k)=\max (x/k-1,0)\) is a hinge loss function, and \(\eta _{2}\) is a balancing coefficient.
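A minimal sketch of the shaped reward inside Eq. (7): the reachability term penalizes subgoals whose shortest ST distance to the demonstration exceeds m, and the hinge term \(H(x,k)=\max (x/k-1,0)\) penalizes subgoals outside the k-step adjacent region of the current state. The use of an exact `st_distance` and the helper names are assumptions for illustration; in practice the distances are approximated.

```python
def hinge(x, k):
    """H(x, k) = max(x / k - 1, 0)."""
    return max(x / k - 1.0, 0.0)

def shaped_high_reward(r_env, s, g, demo_states, st_distance, phi_inv,
                       m, k, eta1, eta2):
    """Per-step shaped high-level reward corresponding to Eq. (7)."""
    g_state = phi_inv(g)
    # reachability term: shortest ST distance from the subgoal to the demonstration
    d_demo = min(st_distance(g_state, s_d) for s_d in demo_states)
    reach_penalty = eta1 * (d_demo - m)
    # adjacency term: keep the subgoal within k steps of the current state
    adj_penalty = eta2 * hinge(st_distance(s, g_state), k)
    return r_env - reach_penalty - adj_penalty
```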

4.2 HRLfD Algorithm Through RbRS

In the previous sections, we have clarified the use of reward shaping (RS) and learning from demonstration (LfD) in RL; these kinds of methods have made significant progress in accelerating the convergence of the learning process. In this section, we introduce our method in detail. Its main procedure is shown in Fig. 1.

Unlike some methods that combine LfD and HRL, which impose policy constraints only on the high level or only on the low level, our method imposes constraints on both levels. The high level uses the state to generate a subgoal for that specific state, and the low level uses the state and the subgoal from the high level to produce an action; our experiments show that constraining both levels yields better learning results.

Simply stated, our method uses only one suboptimal demonstration to supervise the agent in learning a better trajectory. The trajectory of the agent can be viewed as a finite collection of states, subgoals, and actions. We stipulate that each state has its corresponding subgoal and action, forming a triplet \(<s_t,sg_t,a_t>\); besides, every state on the trajectory has a corresponding state \(s_t^{demo}\) and triplet \(<s_t^{demo},sg_t^{demo},a_t^{demo}>\) on the demonstration. When the two corresponding triplets \(<s_t,sg_t,a_t>\) and \(<s_t^{demo},sg_t^{demo},a_t^{demo}>\) have high similarity, we call the triplet from the agent’s trajectory a well-trained triplet. If every, or most, of the triplets of a trajectory are well-trained, we call it a well-trained trajectory. Meanwhile, the subgoal is an important input for the low level to generate a correct action to reach the goal G. Therefore, it is reasonable and necessary to use the subgoal as part of the demonstration and to impose constraints on both the high level and the low level. Our demonstration thus maintains states, subgoals, and actions. We define a full demonstration that fits the hierarchical structure of HRL as \(Full_{demo}=\lbrace Hi_{demo},Lo_{demo} \rbrace \), where \(Hi_{demo}\) contains the state, subgoal, and next state, while \(Lo_{demo}\) contains the state, subgoal, action, next state, and next subgoal. \(Full_{demo}\), \(Hi_{demo}\), and \(Lo_{demo}\) are written as follows:

$$\begin{aligned} Full_{demo}=&\lbrace \lbrace Hi_{demo} \rbrace ,\lbrace Lo_{demo} \rbrace \rbrace \end{aligned}$$
(8)
$$\begin{aligned} Hi_{demo} =&\lbrace (s_0^{demo},sg_0^{demo}),(s_1^{demo},sg_1^{demo}),...,(s_{t-1}^{demo},sg_{t-1}^{demo}) \rbrace \end{aligned}$$
(9)
$$\begin{aligned} Lo_{demo}=&\lbrace (s_0^{demo},sg_0^{demo},a_0^{demo}),(s_1^{demo},sg_1^{demo},a_1^{demo}),...,\end{aligned}$$
(10)
$$\begin{aligned}&(s_{t-1}^{demo},sg_{t-1}^{demo},a_{t-1}^{demo})\rbrace \end{aligned}$$
(11)

Regarding how the demonstration is obtained, we take it from one successful but suboptimal trajectory in the replay buffer. Besides, we formulate different judgment criteria for different environments, and we explain each criterion in the experiments section. Next, we elaborate on our constraint approach at each level.
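The hierarchical demonstration can be stored as two aligned lists, mirroring Eqs. (8)-(11) and the description above (next state and next subgoal included). The sketch below extracts \(Hi_{demo}\) and \(Lo_{demo}\) from one successful \(<state, subgoal, action>\) trajectory taken from the replay buffer; the function and variable names are chosen here only for illustration.

```python
def build_demo(trajectory):
    """trajectory: list of (state, subgoal, action) triplets from one
    successful but possibly suboptimal episode in the replay buffer.
    Hi_demo keeps (s, sg, s_next); Lo_demo keeps (s, sg, a, s_next, sg_next)."""
    hi_demo, lo_demo = [], []
    for t in range(len(trajectory) - 1):
        s, sg, a = trajectory[t]
        s_next, sg_next, _ = trajectory[t + 1]
        hi_demo.append((s, sg, s_next))
        lo_demo.append((s, sg, a, s_next, sg_next))
    return {"high": hi_demo, "low": lo_demo}   # Full_demo, Eq. (8)
```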

The high-level approach has already been presented in Sect. 4.1; the high-level objective function is given by Eq. (7).

To take full advantage of every item in the triplet \(<s,sg,a>\), especially the subgoal sg, we need to consider how this important intermediate product is generated on the high level and how it is used on the low level. Since the reachability constraint on the high level strongly affects the generation of reachable subgoals, including these subgoals in the demonstration provides the agent with a positive constraint and reduces the amount of training data required. This gives us full reason to use subgoals as part of the demonstration and to give them a higher constraint weight on the low level.

On the low level, we first restate the low-level reward function:

$$\begin{aligned} r_t^l(s_t,a_t)=-D(g_t, \varphi (s_{t+1})), \end{aligned}$$
(12)

This reward describes the subgoal-reaching performance and supervises the low-level learning process, where D is the Euclidean distance function. From this reward function, we obtain the low-level objective function:

$$\begin{aligned} J_{low}(\pi _l):=E [\sum _{t=0}^{\infty } \gamma ^t r_t^l(s_t,a_t)]\end{aligned}$$
(13)

where \(\gamma \) is the discount coefficient.

Algorithm 1

HRLfD through Reachability-based Reward Shaping

In the spirit of shaping reinforcement learning using demonstrations [19], we adopt the concept of “similarity”. Similarity describes how similar a triplet from the training trajectory is to a triplet from the demonstration; its value lies in the range [0, 1], and the more similar the two triplets are, the larger the similarity h is. To measure the similarity, we first check whether the actions \(a_{t_0}\) and \(a_{t_0}^{demo}\) at timestep \(t_0\) are the same. If not, we set the similarity h to 0; otherwise we compute:

$$\begin{aligned} h(s,s^{demo})=1/(d(s,s^{demo})+1), \end{aligned}$$
(14)

where d is the Euclidean distance; \(h(s,s^{demo})\) equals 1 when s and \(s^{demo}\) coincide and tends to 0 as s and \(s^{demo}\) move further apart. Calculating the similarity makes the demonstration available to the learning agent as a bias for its exploration. After calculating the similarity, we use it to define a potential function. To compute the potential of a given triplet \(<s_t,sg_t,a_t>\), we check the set of demonstrations and find the sample with the same action that yields the highest similarity:

$$\begin{aligned} \Phi ^{demo}(s,sg,a)=\max _{(s^{demo},sg^{demo})}[h(s,s^{demo})+h(sg,sg^{demo})] \end{aligned}$$
(15)

After creating the potential function, we need to integrate it into the learning process by creating a reward shaping function \(F^D\):

$$\begin{aligned} F^D(s,sg,a,s',sg',a')=\gamma \Phi ^{demo}(s',sg',a')-\Phi ^{demo}(s,sg,a). \end{aligned}$$
(16)

Then we add this reward shaping function to the objective function of the low level:

$$\begin{aligned} J_{low}(\pi _l):=E\sum _{t=0}^{\infty }\gamma ^t(r_t^l(s_t,a_t)+F_t^D) \end{aligned}$$
(17)
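The following sketch ties Eqs. (14)-(16) together: the similarity is computed only between samples whose actions match, the potential \(\Phi ^{demo}\) takes the best-matching demonstration sample, and the potential-based term \(F^D\) is added to the intrinsic reward as in Eq. (17). Exact action equality and the Euclidean norm as the distance d are illustrative assumptions; `lo_demo` holds \((s, sg, a, s', sg')\) tuples as in the earlier sketch.

```python
import numpy as np

def similarity(x, x_demo):
    """h(x, x_demo) = 1 / (d(x, x_demo) + 1), Eq. (14)."""
    return 1.0 / (np.linalg.norm(np.asarray(x) - np.asarray(x_demo)) + 1.0)

def potential(s, sg, a, lo_demo):
    """Phi^demo(s, sg, a): best state + subgoal similarity among demonstration
    samples with the same action, Eq. (15); 0 if no action matches."""
    best = 0.0
    for s_d, sg_d, a_d, _, _ in lo_demo:
        if np.array_equal(a, a_d):
            best = max(best, similarity(s, s_d) + similarity(sg, sg_d))
    return best

def shaping_term(s, sg, a, s_next, sg_next, a_next, lo_demo, gamma=0.99):
    """F^D = gamma * Phi(s', sg', a') - Phi(s, sg, a), Eq. (16), added to the
    intrinsic reward r^l_t in the low-level objective, Eq. (17)."""
    return (gamma * potential(s_next, sg_next, a_next, lo_demo)
            - potential(s, sg, a, lo_demo))
```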

5 Experiments

We have presented our method of HRLfD through reachability-based reward shaping. The experiments are designed to answer the following questions: (1) Can our method improve the overall performance and sample efficiency of demonstration-guided HRL? (2) Can our method outperform other methods that also aim to enhance the learning effectiveness of hierarchical agents? (3) Can our method perform stably across different demonstrations?

5.1 Environment Setup

We first introduce how we generate the demonstration. Our demonstration is set up as a series of \(<state, subgoal, action>\) tuples. All demonstrations reach the final goal; the shortest ones are called optimal demonstrations, and the rest are called suboptimal demonstrations. If only incomplete demonstrations are available, we append the goal state to the end of the demonstration.

The effectiveness of the method is evaluated on two sorts of tasks with continuous and discrete action and state spaces. The discrete task is Maze, which is adopted from HRAC and requires the agent to accomplish tasks that entail both low-level control and high-level planning in grid worlds with injected stochasticity. The continuous tasks are Ant Maze and Ant Gather, which are widely used benchmarks in the HRL community. In all these tasks, we employ a pre-defined 2D goal space that represents the agent's (x, y) position. The structure of each environment is shown in Fig. 2a–c. In addition, to demonstrate the transferability of our method and the diversity of tasks it can adapt to, we apply our method to a robotic arm for further validation; these manipulation tasks are shown in Fig. 2d, e.

Fig. 2

The environments used in the experiments are Maze, Ant Gather, Ant Maze, and two robotic arm tasks. In Maze (a), the agent starts from a given position and aims to reach the final goal with dense rewards. In Ant Gather (b), the ant robot starts from a fixed position and needs to avoid bombs (red) and collect apples (green). In Ant Maze (c), the ant robot starts from a fixed position and needs to reach a target position in a maze with dense rewards. In Door Open (d), the robotic arm should reach and open the door. In Drawer Open (e), the robotic arm should reach and pull out the drawer

Maze This environment is \(13 \times 17\) in size, with a discrete 4D action space and a discrete 2D state space representing the agent's (x, y) position. A substantial reward is given to the agent to encourage exploration. The maximum duration of each episode is 200 steps.

Ant Gather This environment features a continuous state space and is \(20 \times 20\) in size. The ant starts from a fixed position, gathers green apples for a +1 reward, avoids red bombs that incur a \(-1\) reward, and moves for a maximum of 500 steps.

Ant Maze This environment has a continuous state space with a size of \(24 \times 24\). During the training stage, the target position is sampled randomly by the environment at the beginning of each episode, and the agent receives a dense reward equal to the negative Euclidean distance from its current state to the target position at each time step. During the evaluation stage, (0, 16) is fixed as the target position, and an episode is considered successful if the agent reaches a position within a Euclidean distance of 5 from the target. Each episode ends when the time steps reach 500.

Door Open Door Open is selected from Meta-World, and this task can be separated into three steps: reach the door \(\rightarrow \) roll the lock \(\rightarrow \) pull the door. In the practical experiments, we train this task for 500 iterations of 25 timesteps each. During the evaluation, we test with 50 timesteps.

Drawer Open Drawer Open is selected from Meta-World, and this task can be separated into two steps: reach the drawer and pull it out. In the practical experiments, we train this task for 500 iterations of 25 timesteps each. During the evaluation, we test with 50 timesteps.

5.2 Comparative Experiments

To evaluate the performance of the method comprehensively with different HRL implementations, we follow two separate HRL setups for the different task types. On discrete tasks, on-policy A2C is employed for low-level training and off-policy TD3 for high-level training, while on continuous tasks TD3 is used for both the low-level and high-level training.

Fig. 3

Learning curves of our method (HRLfD-RbRS) and baselines on all tasks. Each curve represents mean episode reward, averaged over 5 independent trials. All curves have been smoothed equally for visual clarity

We compare our method with the following baselines. (1) UOF: a state-of-the-art method of HRLfD [47]. (2) LfD-HRL: the most widely used learning-from-demonstration method based on HRL, extended from RLfD through shaping [37]. (3) ForgER [10]: a method that allows an agent to use low-quality expert demonstrations in complex environments by effectively handling expert data errors and adapting the action space and state representation to the agent's capabilities. (4) \(A^2\) [48]: a method that integrates two components inspired by human experiences: abstract demonstrations and adaptive exploration.

The learning curves of HRLfD-RbRS and the baselines across all tasks are plotted in Fig. 3. HRLfD-RbRS outperforms all of these baselines on the Maze tasks and achieves our goals of increasing learning efficiency and preserving optimality. In particular, on the Maze tasks, HRLfD-RbRS even performs better than the demonstration. We can conclude that our method overcomes the dependence on the quality of the demonstration.

In the Door Open and Drawer Open tasks, our method still performs the best among all the baselines, while LfD-HRL performs noticeably worse. We can conclude that our method not only performs well in ordinary tasks such as the Maze tasks, but can also be adapted to robotic arm tasks.

5.3 Ablation Experiments

In this section, we examine (1) how well the adjacency constraint and the reward shaping work, and (2) the stability of our method under different suboptimal or optimal demonstrations.

To verify separately the effect of the adjacency constraint and the reward shaping constraint, and to examine the overall performance of HRLfD-RbRS, we set up (1) HRL-RS: our framework without the adjacency constraint, (2) HRL-AC: our framework with only the adjacency constraint, (3) HRLfD-RbRS: our full framework, and (4) HRL-0: a blank control group, i.e., a framework without any constraint. We test these frameworks on the Ant Gather task and the Maze task; the learning curves are shown in Fig. 4.

Fig. 4

The learning curves in the ablation study are averaged over 5 independent trials. a Ablation test on Maze. b Ablation test on Ant Gather. c Different demonstrations test on Maze. d Different demonstrations test on Ant Gather

From the learning curves in Fig. 4a and b, HRL-AC and HRL-RS both perform better than HRL-0: they learn faster and gain more rewards. However, HRLfD-RbRS has slightly better learning efficiency and an overall better and more stable performance than HRL-AC and HRL-RS.

Fig. 5

Visualization of the training process of Ant Maze

To verify the performance of our method under different demonstrations, we set up a set of comparative experiments based on the optimal demonstration and suboptimal demonstrations and test them on Maze and Ant Gather. In the Maze task, the highest reward we can sample is 5.8, while it should be 6 theoretically; we therefore set the optimal demonstration with \(reward = 5\) and the suboptimal demonstrations with \(reward = 4\) and \(reward = 3\). In the Ant Gather task, we set the optimal demonstration with \(reward = 8\) and the suboptimal demonstrations with \(reward = 5\) and \(reward = 4\). The learning curves are shown in Fig. 4c, d.

According to the learning curves in Fig. 4c, d, when the quality of the demonstration is not too far from optimal, our method performs similarly in learning efficiency and learning effect. On the other hand, when the demonstration quality is far from optimal (shown in the Maze task with \(reward = 3\)), the learning efficiency is slightly worse than with near-optimal demonstrations. We suppose this is because the poor quality or complexity of the demonstration increases the difficulty of searching for better trajectories.

Figure 5 shows the visualization of the training process. In each panel, we randomly pick 20 trajectories of the agent at 0% trained, 50% trained, and 100% trained. When the agent has not been trained (Fig. 5a), it can only travel around the start position. When it is half-trained, it reaches the goal with 65% probability, and the light spots in Fig. 5b represent the agent wandering at those positions. When the agent is fully trained, it reliably reaches the goal, and the visualization shows no light spots like those in Fig. 5b.

Fig. 6

Trajectories of 100% trained agent and demonstration

Figure 6 shows the trajectories of the learning agent and the demonstration when the agent is 100% trained. The red curve, which is a certain distance away from the others, is the demonstration, while the other trajectories are collected from the agent. From this figure, we can see that the agent learns a better and shorter trajectory than the demonstration.

6 Conclusion and Discussion

To address the problem that learning performance in the HRLfD field is heavily dependent on the quality of the demonstration, we present a novel HRLfD framework that reduces this reliance. Moreover, based on suboptimal demonstrations, our method can explore better decision trajectories than the demonstrations themselves. The practical applicability of the method is demonstrated through experiments on several tasks with continuous and discrete action and state spaces.

HRLfD is one of the most promising avenues to scale up reinforcement learning (RL), and it provides a compelling paradigm to address long-horizon and sparse-reward issues. However, some important problems remain, such as how to design understandable and effective demonstrations and how to enhance the method's generalization capacity. For instance, whether the demonstrations are produced by humans or robots, it is still difficult to develop demonstrations for complicated action patterns and challenging tasks in the real world, especially in the domain of massive industrial production. Making progress on randomized settings, such as random start positions and random reward positions, in order to adapt to the circumstances that occur in reality, may be another area of future development.