1 Introduction

Autonomous navigation is a crucial challenge that mobile robots must overcome to successfully execute complex tasks. Traditional path-planning algorithms often depend on pre-existing global maps for navigation (Hart et al. 1968; Dijkstra 2022; LaValle 1998). This dependency implies that the robot requires complete knowledge of its environment beforehand. These approaches result in a clear division between perception and decision-making, which can be problematic when dealing with environments where the map is unknown or subject to change. Furthermore, real-world applications frequently demand that mobile robots navigate towards multiple goals, such as automated robotic sorting (Shah and Pandey 2018), delivery systems (Hernandez et al. 2017) and museum tour guide robots (Burgard et al. 1999). To address the above problems together, we formulate the task as a multi-goal navigation problem in an unknown environment with sparse rewards.

It is reasonable to consider the navigation task as a sequential decision problem, and thus RL-based approaches have played an important role in solving robot navigation problems (Faust et al. 2017; Garaffa et al. 2021; Wang et al. 2020; Zhu et al. 2017; Li et al. 2020). However, it is worth noting that the majority of current navigation tasks are designed for a single goal, and multi-goal navigation remains an open problem.

Typically, in a single-goal navigation task, the reward function is set up via artificial reward shaping, e.g. \(R= - \Vert (x_t,y_t)-(x_d,y_d)\Vert _{2}\), where \((x_t,y_t)\) and \((x_d,y_d)\) denote the coordinates of the agent and the goal at step t, respectively. However, extending such a reward function to a multi-goal navigation task has significant limitations: (i) practical multi-goal navigation environments are often highly complex, making it difficult to precisely define reward functions that accurately reflect the real-world environment; (ii) reward shaping for multi-goal tasks leads to sub-optimal outcomes. Under typical distance-based reward settings, the reliance on distance rewards often traps the agent in suboptimal solutions and prevents task completion. For example, consider a multi-goal navigation task with n goals, \(G=\{ g_1,g_2,\ldots,g_n \}\). If the reward is shaped with an \(L_2\)-norm distance, the agent tends to hover around an already-reached goal \(g_i\) \(\left( g_i\in G, i\in \left[ 1,n\right] \right)\). Exploration requires the agent to first move away from its current location (i.e. increase the Euclidean distance \(d(s_t,g_i)\) to the current goal),Footnote 1 at which point the shaped reward \(R=-d(s_t,g_i)\) (Trott et al. 2019) discourages the agent from exploring outwards, drawing it into a sub-optimal policy and preventing it from completing the multi-goal navigation task.

In this paper, the multi-goal navigation problem is formulated in an unknown environment with sparse rewards. This implies that a positive reward can only be earned after all desired goals have been completed:

$$\begin{aligned} R_{final} = {\left\{ \begin{array}{ll} 1, & \quad \text{achieved}\_\text{goal}\_\text{num} = n, (n\in N_+) \\ 0, & \quad \text{otherwise} \end{array}\right. } \end{aligned}$$
(1)

where n is the number of end goals and ‘achieved_goal_num’ is the number of goals that have been reached so far.
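To make the contrast with distance-based shaping concrete, the following minimal Python sketch implements both the shaped reward discussed above and the sparse reward of Eq. (1). The function names and the example coordinates are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def shaped_reward(agent_pos, goal_pos):
    # Distance-based shaping: R = -||(x_t, y_t) - (x_d, y_d)||_2.
    # Once one goal is reached, moving away from it is penalised,
    # which discourages exploration toward the remaining goals.
    return -np.linalg.norm(np.asarray(agent_pos) - np.asarray(goal_pos))

def sparse_final_reward(achieved_goal_num, n):
    # Eq. (1): positive reward only after all n desired goals are reached.
    return 1.0 if achieved_goal_num == n else 0.0

# Illustrative usage with hypothetical coordinates:
goals = [(4.0, 0.0), (0.0, 4.0), (-4.0, 0.0)]
agent = (3.9, 0.1)                          # hovering near the first goal
print(shaped_reward(agent, goals[0]))       # close to 0: shaping rewards staying put
print(sparse_final_reward(1, len(goals)))   # 0.0: task not finished yet
```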

While the definition of sparse rewards may be straightforward, dealing with reward scarcity and the resulting training difficulties remains intractable. Presently, a series of methods are used to address sparse reward problems. Inverse RL (IRL) learns a reward function from expert demonstrations and then uses the learned reward function for forward training of the exploration policy (Abbeel and Ng 2004; Ho and Ermon 2016; Ng et al. 2000; Ziebart et al. 2008; Fu et al. 2017; Kretzschmar et al. 2016). In other words, instead of learning a policy or value function directly from the environment’s rewards, IRL aims to understand what rewards would best explain the observed behaviour of an agent. However, the learned reward function usually carries a bias: there may be a mismatch between the assumed reward function model and the true underlying reward structure of the environment. Counting-based (Bellemare et al. 2016; Tang et al. 2017) or curiosity-based approaches (Pathak et al. 2017; Burda et al. 2018; Reizinger and Szemenyei 2020; Zhelo et al. 2018) provide additional intrinsic rewards that encourage the agent to explore unseen states. Hindsight experience replay (HER) (Andrychowicz et al. 2017; Fang et al. 2019) increases the agent’s intermediate rewards by retrospectively relabeling the goal \(g_t\) and reward \(r_t\) in the experience replay transitions \(\tau = \big \{(s_{t}, g_{t}, a_{t},\,s_{t}^{\prime },\, r_{t})\big \}^{T-1}_{t=0}\). Hierarchical RL (HRL) makes it easier for the agent to reach nearer goals by splitting long-sequence decision tasks into shorter ones, thus reducing the difficulty of sparse rewards in long-horizon decisions (Barto and Mahadevan 2003; Kulkarni et al. 2016; Levy et al. 2017; Nachum et al. 2018). In HRL, the agent learns a hierarchy of policies, each responsible for accomplishing a specific sub-task. This reduces the search space and allows for more efficient learning, as the agent can focus on mastering smaller components of the task before integrating them into larger policies.
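As a brief illustration of the HER idea referenced above, the sketch below relabels the transitions of one episode with a goal that was actually achieved and recomputes the binary reward. The transition layout, the `achieved_positions` argument and the distance test are simplifying assumptions for illustration only.

```python
import numpy as np

def her_relabel(trajectory, achieved_positions, eps=0.1):
    """Relabel transitions (s, g, a, s', r) with a hindsight goal.

    trajectory: list of (s, g, a, s_next, r) tuples from one episode.
    achieved_positions: positions actually visited, in the goal space.
    Returns additional relabelled transitions with recomputed rewards.
    """
    hindsight_goal = achieved_positions[-1]   # 'final' strategy: last achieved position
    relabelled = []
    for (s, g, a, s_next, r), pos in zip(trajectory, achieved_positions):
        reached = np.linalg.norm(np.asarray(pos) - np.asarray(hindsight_goal)) <= eps
        relabelled.append((s, hindsight_goal, a, s_next, 1.0 if reached else 0.0))
    return relabelled
```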

The above approaches are effective for most single-goal tasks with sparse rewards, but there is a lack of research on multi-goal navigation tasks under similar sparse reward conditions. Furthermore, these methods are difficult to extend directly to navigation scenarios where the robot must reach multiple desired goals.

In this paper, we propose a novel composite HRL framework called Hierarchical Reinforcement Learning with Multi-Goal (HRL-MG). Our main contributions are as follows:

  • We propose a novel HRL framework that integrates both a temporal abstraction architecture and a continuous goal-oriented architecture to address the challenges posed by hybrid action spaces in multi-goal tasks. The HRL-MG framework is designed to decompose long-sequence decisions in multi-goal scenarios.

  • We propose a dynamic goal-testing approach within the hierarchical architecture to further overcome the complex sparse reward problem encountered by both the selector and the actuator in multi-goal navigation tasks.

2 Related works

2.1 Hierarchical architecture in HRL techniques

A range of approaches have been developed in the field of HRL to address the problems of sparse rewards and intractable training in RL (Sutton et al. 1999; Parr and Russell 1997; Dietterich 2000; Bacon et al. 2017; Vezhnevets et al. 2017; Kulkarni et al. 2016; Nachum et al. 2018; Levy et al. 2017; Botvinick 2012). Current HRL architectures fall into two main categories: (i) discrete temporal abstraction architectures and (ii) continuous goal-oriented architectures. A discrete temporal abstraction hierarchy usually contains a high-level meta-controller and a low-level controller (Kulkarni et al. 2016). This framework organizes tasks into hierarchical structures, allowing actions to be abstracted over different time scales. In a continuous goal-oriented hierarchy, goals are continuously updated based on the interaction between the agent and the environment, and actions are selected to achieve these dynamic goals (Levy et al. 2017; Nachum et al. 2018). This framework emphasizes real-time adaptation and goal pursuit, where the agent continuously adjusts its actions to achieve evolving goals. However, both types of hierarchical approaches target tasks with a single desired goal, such as reaching a single destination in a navigation task or grasping a single object with a robotic arm. Our work extends these hierarchical approaches to the multi-goal navigation task, which has widespread application potential in practical scenarios.

2.2 Tasks with hybrid action space

In practical robot navigation tasks, the action space is often not a single discrete or continuous action space but a hybrid one in which both coexist. For RL methods based on environment-provided rewards, mastering a complex task requires precise shaping of the reward function, which must accurately formalize the task description into a mathematical representation. Because reward shaping is challenging in complex scenarios, a simpler approach is to split the composite task. For example, in a multi-goal navigation scenario, a destination to be reached is selected first, involving a discrete action space; then primitive actions such as the robot’s velocity and position are chosen, involving a continuous action space. For hybrid action spaces, the most naive approach is to convert the hybrid action space into a unified one through discretization or continuization; however, this changes the structure of the original action space. A parametric action space has been proposed in which each discrete action corresponds to a set of continuous parameters (Hausknecht and Stone 2015; Fan et al. 2019; Fu et al. 2019; Xiong et al. 2018). The authors of Li et al. (2021) construct an implicit action space containing both discrete and continuous actions via a conditional variational autoencoder (CVAE). As can be seen, current approaches to hybrid action spaces either target the representation of the action space or adopt a hierarchy over it. Below we analyse the advantages of hierarchical decision-making, explaining why we adopt a hierarchical framework in this work to address implicit hybrid action space tasks.

  (i) Complexity Management: Hybrid tasks involve multiple decision factors, such as local actions and global strategies. Hierarchical decision-making breaks tasks into simpler sub-tasks or sub-goals, effectively managing complexity and reducing the search space.

  (ii) Iterative Optimization: Hierarchical decision-making allows systems to independently optimize and learn at each level, and then integrate the results into higher-level decisions. This iterative optimization approach is often more effective than directly parameterized searches across the entire action space, especially in tasks with high dimensionality and large-scale action spaces.

  (iii) Adaptability and Intuition: Hierarchical decision-making aligns with human intuition and decision-making sequences, making it more intuitive and easier to understand. Additionally, it fosters adaptability, allowing flexible adjustment of strategies at different levels, making systems better suited for complex and changing environments.

In summary, hierarchical decision-making excels in managing complexity, achieving iterative optimization, and aligning with human intuition, thereby providing greater adaptability and flexibility in facing hybrid action space tasks.

3 Preliminaries

3.1 Semi-Markov decision processes

HRL builds upon the theoretical underpinning of semi-Markov decision processes (SMDPs) (Feinberg 1994; Pateria et al. 2021; Sutton et al. 1999), a framework that extends the fundamental concept of time present in the MDP \(<\mathcal {S},\mathcal {A},P,r,\gamma>\) (Sutton and Barto 2018). An MDP is defined by the state space \(\mathcal {S}\), action space \(\mathcal {A}\), reward function \(R:{\mathcal {S}}\times {\mathcal {A}}\rightarrow \mathbb {R}\), and transition function \(P:{\mathcal {S}}\times {\mathcal {A}}\rightarrow \mathcal{S}\). At each discrete time step t, the agent and the environment interact: the agent observes the state \(s_t\in \mathcal {S}\) from the environment and on this basis chooses an action \(a_t\in \mathcal {A}\). At the next time step \(t+1\), as a result of its action, the agent receives a numerical reward \(r_{t+1}\in \mathbb {R}\) and moves to the next state \(s_{t+1}\). The objective is to maximize the expected discounted cumulative reward \(G(\pi _\theta )=\mathbb {E}\left[ \sum _{t=0}^{\infty }\gamma ^{t}r_{t}\mid a_{t}\sim \pi _\theta ,s_{t}\sim P\right]\), where \(\gamma\) is a discount factor. In the options framework over SMDPs, an option is defined as \(\mathcal {O}=<\chi ,\pi ,\beta>\), consisting of an initiation set \(\chi \subseteq \mathcal {S}\), a policy \(\pi : \mathcal {S}\times \mathcal {A} \rightarrow [0,1]\) and a termination condition \(\beta: {\mathcal{S}}^{+} \rightarrow [0,1]\). For any option \(o\in \mathcal {O}\), let \({\mathcal {E}}(o,s,t)\) denote the event of option o being initiated in state \(s\in \chi\) at time t. Then the reward of option o for any state \(s\in \mathcal{S}\) is \(r_{s}^{o}=\mathbb{E}\big \{r_{t+1}+\gamma r_{t+2}+\cdots +\gamma ^{k-1}r_{t+k}\,\big |\,\mathcal {E}(o,s,t)\big \}\).
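A minimal sketch of the option tuple \(<\chi ,\pi ,\beta>\) described above, written as a small Python container; the attribute names and the discounted-return helper are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Set

@dataclass
class Option:
    initiation_set: Set[Any]               # chi: states where the option may start
    policy: Callable[[Any], Any]           # pi(s) -> action
    termination: Callable[[Any], float]    # beta(s) -> probability of terminating

def option_return(rewards: List[float], gamma: float) -> float:
    # r_s^o = r_{t+1} + gamma * r_{t+2} + ... + gamma^{k-1} * r_{t+k}
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```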

3.2 Goal-oriented reinforcement learning (GoRL)

GoRL is an RL paradigm that aims to train a policy that empowers the agent to achieve various goals; it is also referred to as goal-conditioned RL or multi-goal RL. In this paper, we refer to it simply as GoRL. GoRL tells the agent what to do by giving it an additional goal input. The value function approximator normally used is \(V(s;\theta )\;\approx \;V^{\ast}(s)\). To extend the value function to both state and goal inputs, we follow the universal value function approximator (UVFA) (Schaul et al. 2015), in which the value function is \(V(s,g;\theta )\;\approx \;V_{g}^{\ast}(s)\) or \(Q(s,a,g;\theta )\,\approx \,Q_{g}^{\ast}(s,a)\). In this paper, we take the most direct approach, \(F:{\mathcal {S}}\times {\mathcal {G}}\rightarrow \mathbb {R}\), and simply concatenate the state and the goal into a joint input, where F denotes the connection function. GoRL naturally has the property of binary sparse rewards, i.e., the agent is given a positive reward when the input goal is reached and 0 otherwise. Thus, GoRL is by default accompanied by the problem of reward sparsity (Liu et al. 2022; Plappert et al. 2018).
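The joint state–goal input used here can be realised by simple concatenation before the value network, as in the hedged sketch below; the network sizes and class name are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, a, g): a UVFA-style critic that takes the concatenation [state, goal, action]."""

    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, goal, action):
        # F: simply concatenate state, goal and action into one joint input.
        return self.net(torch.cat([state, goal, action], dim=-1))
```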

4 Method

We propose an innovative hierarchical framework called HRL-MG, specifically designed to tackle the complex problem of multi-goal navigation for robots operating in sparse reward scenarios. Our hierarchical framework applies to goal-oriented tasks, i.e., tasks where both the state and the goal are given as vector inputs. HRL-MG consists of two components: (i) a specific hierarchical framework compounded of temporal abstraction and continuous goal orientation and (ii) an HER-based dynamic goal testing method. The two components are described in turn in the remainder of this section.

4.1 HRL-MG hierarchical framework

The HRL-MG framework is divided into two sections. The outer loop is a discrete action-space decision mechanism referred to as the selector, which is responsible for choosing a desired goal from the multi-goal set. The inner loop is a continuous action-space decision process based on primitive actions, referred to as the actuator, as shown in Fig. 1. The interaction between the selector and the actuator forms a temporally abstracted hierarchy that splits the multi-goal navigation task into individual desired-goal navigation tasks. At the same time, the actuator itself is a continuous goal-oriented hierarchy for the inner loop of a single desired-goal navigation task. The inner loop continuously selects sub-goals and uses them as the expected goals at the primitive action layer to guide the agent towards the desired goal provided by the selector. Once the actuator reaches the desired goal provided by the selector, control returns to the outer loop. The selector then selects the next desired goal, and this process repeats until all desired goals have been reached by the agent.

Fig. 1
figure 1

An overview of the HRL-MG hierarchical framework: a hierarchical architecture combining temporal abstraction and continuous goal orientation, consisting of two parts: a selector and an actuator. The outer loop selector is responsible for selecting the desired goal g (current end goal) among the multi-goal set G and the inner loop actuator is responsible for reaching the desired goal g given by the selector
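The control flow between the outer-loop selector and the inner-loop actuator can be summarised by the following Python-style sketch. Here `env`, `selector`, `actuator`, `phi` and the method names are placeholders standing in for the environment, the learned policies and the position-extraction function; this is a sketch of the loop structure, not the authors' implementation.

```python
import numpy as np

def reached(position, goal, eps=0.1):
    # Goal test: the agent is within eps of the goal position.
    return np.linalg.norm(np.asarray(position) - np.asarray(goal)) <= eps

def hrl_mg_episode(env, selector, actuator, goals, phi, eps=0.1):
    """One episode of the HRL-MG outer/inner control loop (illustrative sketch)."""
    state = env.reset()
    remaining = list(goals)
    while remaining:                                    # outer loop: selector
        g = selector.select_goal(state, remaining)      # desired goal from the multi-goal set
        while not reached(phi(state), g, eps):          # inner loop: actuator
            sub_g = actuator.propose_subgoal(state, g)          # upper layer of the actuator
            action = actuator.primitive_action(state, sub_g)    # bottom layer of the actuator
            state = env.step(action)
        remaining.remove(g)                             # hand control back to the selector
    return state
```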

4.1.1 Selector

In this phase, our focus is on developing a policy for the outer loop of the HRL-MG framework, called the selector. The selector chooses the current navigation goal from among multiple desired goals. The detailed network structure of the selector is shown in the left part of Fig. 2. The details of the Markov model setup for the selector are as follows:

  • Goal: The goal of the selector is to reach all desired goals \(\mathcal {G}_{final}=\{g_0,g_1,\ldots,g_n\}\) as quickly as possible under the sparse reward condition in a multi-goal navigation task.

  • State: According to the GoRL paradigm, the input of the selector is the concatenation of the state \(s_t\) and the multi-goal set \(\mathcal {G}_{final}\), thus the joint input is \((s_t,\mathcal {G}_{final})\). As shown in Fig. 2, the state of the continuous robot environment ‘Ant’ is represented as \(\mathcal {S}[\theta _p,\theta _v,\theta _g]\), where the \(\theta _p\), \(\theta _v\), \(\theta _g\) represent the parameters of the agent’s position vector, velocity vector and goal vector respectively.

  • Action: The selector’s action space is a discrete action space consisting of the goals in \(\mathcal {G}_{final}=\{g_0,g_1,\ldots,g_n\}\), so its dimension depends on the number of goals n. The selector outputs a specific goal \(g_n \in \mathcal {G}_{final}\) as the action \(a_t\).

  • Reward: The selector’s reward design follows a sparse reward mechanism: the selector receives a reward of 1 only when all desired goals have been accomplished, 0 when the agent reaches the current desired goal \(g_n\), and \(-1\) otherwise.

$$\begin{aligned} R_{selector} = {\left\{ \begin{array}{ll} 1, & \quad \text{mission accomplished} \\ 0, & \quad \Vert \phi (s_t) - g_n \Vert _2 \le \epsilon \\ -1, & \quad \text{otherwise} \end{array}\right. } \end{aligned}$$
(2)

where \(\phi\) is a mapping function that retrieves the position information in the current state, and \(\epsilon\) is a very small positive constant threshold.
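A direct transcription of Eq. (2) into Python, assuming \(\phi\) is available as a function `phi` that extracts the agent's position from the state and that the remaining unreached goals are tracked in a container; these names are illustrative.

```python
import numpy as np

def selector_reward(state, goal, remaining_goals, phi, eps=0.1):
    """Sparse selector reward following Eq. (2)."""
    if not remaining_goals:                                 # mission accomplished: all goals reached
        return 1.0
    if np.linalg.norm(phi(state) - np.asarray(goal)) <= eps:
        return 0.0                                          # the current desired goal was reached
    return -1.0                                             # otherwise, a per-step penalty
```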

Fig. 2
figure 2

Diagram of the HRL-MG network architecture: the \(Q\_net\) on the left represents the process by which the selector gives a desired goal \(g_n\) to be reached by the actuator, and the two actor-critic structures on the right represent the division of the actuator into two layers, with \(Actor\_1\) outputting a sub-goal \(g^{\prime }\) based on the \(g_n\) given by the selector and \(Actor\_0\) outputting primitive actions to reach the sub-goal \(g^{\prime }\)

The selector aims to learn a policy \(\pi _s(a_t|s_t, \mathcal {G}_{final};\theta _2)\) that guides the agent to accomplish all desired goals under the sparse reward condition. Note that within an episode, at time step t, once the selector has given the desired goal \(g_n\), control is handed over to the actuator for execution; after N time steps, control returns to the selector at the next state \(s_{t+N}\). We store the transition \((s_t,\mathcal {G}_{final},g_n,R_{t+N},s_{t+N})\) from this episode in the experience memory pool \(D_{selector}\).

For a goal-oriented task with multiple desired goals, in each episode, the set \(\mathcal {G}_{final}=\{g_0,g_1,\ldots,g_n\}\) is randomly initialized by the environment such that each desired goal is at least a minimum distance from the agent’s initial position, i.e., each \(g_n\) satisfies the following equation:

$$\begin{aligned} \Vert \phi (s_0) - g_n \Vert _2 \ge D_{threshold} \end{aligned}$$
(3)

\(s_0\) is the random initial position of the agent and \(D_{threshold}\) is the minimum distance between the agent and each desired goal.

The selector chooses a deterministic desired goal \(g_n\) from \(\mathcal {G}_{final}\) using the Deep Q-Network (DQN) algorithm with parameters \(\theta _2\). The objective function of the selector can be stated as:

$$\begin{aligned} L_{selector}(\theta _2)&= \textrm{E}_{(s,a,\mathcal {G},R,s^{\prime })\sim D_2}[(y-Q_{selector}(s,\mathcal {G},a;\theta _2))^{2}] \end{aligned}$$
(4)
$$\begin{aligned} y&= R+\gamma \ \textrm{max}_{a^{\prime }}Q_{selector}(s^{\prime },\mathcal {G},a^{\prime };\theta _2) \end{aligned}$$
(5)

The selector’s desired-goal selection acts as the outer loop of the HRL-MG hierarchical framework, decomposing a multi-goal navigation task in a sparse-reward scenario into individual desired-goal tasks. The sparsity of multi-goal navigation rewards can thus be effectively mitigated by hierarchical splitting.
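For concreteness, the selector's temporal-difference target of Eqs. (4)–(5) can be computed as in the sketch below; the replay-batch layout, tensor shapes and default discount factor are assumptions, not the authors' exact implementation.

```python
import torch

def selector_td_loss(q_net, batch, gamma=0.98):
    """DQN loss for the selector over joint (state, multi-goal) inputs, Eqs. (4)-(5)."""
    s, goals, a, r, s_next = batch                    # tensors sampled from the selector's memory
    q_all = q_net(torch.cat([s, goals], dim=-1))      # Q(s, G, .; theta_2), shape [B, n_goals]
    q_sa = q_all.gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(torch.cat([s_next, goals], dim=-1)).max(dim=1).values
        y = r + gamma * q_next                        # Eq. (5)
    return torch.nn.functional.mse_loss(q_sa, y)      # Eq. (4)
```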

4.1.2 Actuator

In this phase, our focus is on developing a policy for the inner loop, named the actuator, that enables reaching the goal \(g_n\) given by the selector. The actuator consists of a continuous subgoal-directed hierarchical structure. The upper layer of the actuator gives a more attainable subgoal \(g^{\prime }\) based on the current desired goal \(g_n\) generated by the selector, guiding the agent to explore the environment as much as possible. The bottom layer of the actuator reaches the subgoal \(g^{\prime }\) by performing primitive actions in the simulation environment, as shown in the right part of Fig. 2. The details of the Markov model setup for the actuator are as follows:

  • Goal: The goal for the upper level of the actuator is to reach the desired goal \(g_n\) given by the selector. The goal for the bottom level of the actuator is to reach the subgoal \(g^{\prime }\) given by the upper level of the actuator.

  • State: The actuator’s observation of the environment \(\mathcal {S}_a\) is identical to the selector’s observation \(\mathcal {S}_e\). In the HRL-MG hierarchical framework, every layer shares the same state space, so the agent, whether acting as the selector or the actuator, inhabits identical state configurations. This shared state space simplifies the analysis and management of the hierarchical framework. Similar to the selector’s input, the input to the upper level of the actuator is the concatenation of the current state \(s_t\) and the desired goal \(g_n\) given by the selector, and the input to the bottom level is the concatenation of \(s_t\) and the subgoal \(g'\) given by the upper level. The joint input is therefore \((s_t,g_n)\) for the upper layer and \((s_t,g')\) for the bottom layer.Footnote 2

  • Action: Let \(A_u\) and \(A_b\) denote the action space for the upper and bottom layers of the actuator, respectively. Both \(A_u\) and \(A_b\) are continuous action spaces. \(A_u\) is a subset of the state space \(\mathcal {S}_a\) (\(A_u\subseteq \mathcal {S}_a\)). \(A_b\) is the primitive action space for navigation tasks.

  • Reward: The actuator’s reward settings still follow the sparse reward mechanism, and the reward functions for the upper and bottom layers are represented as follows:

    $$\begin{aligned} R_u = {\left\{ \begin{array}{ll} 0, & \quad \Vert \phi (s_t) - g_n \Vert _2 \le \epsilon \\ -1, & \quad \text{otherwise} \end{array}\right. } \end{aligned}$$
    (6)
    $$\begin{aligned} \displaystyle R_b = {\left\{ \begin{array}{ll} 0, &\quad \Vert \phi (s_t) - g^{\prime } \Vert _2 \le \epsilon \\ -1, & \quad \text{otherwise}\\ \end{array}\right. } \end{aligned}$$
    (7)

where \(\phi\) is a mapping function that retrieves the position information in the current state, and \(\epsilon\) is a very small positive constant threshold.

The actuator aims to learn two policies \(\pi _u(a_t|s_t, g_n;\theta _1)\) and \(\pi _b(a_b|s_t, g^{\prime };\theta _0)\) corresponding to the upper and bottom layers, respectively. The actuator splits the single desired-goal navigation again, decomposing the long sequence of decisions into shorter subgoal navigation decisions. Short-term subgoals are very effective in guiding the agent’s exploration of the environment, thus further alleviating the sparse reward dilemma.

The actuator uses the deep deterministic policy gradient (DDPG) algorithm with parameters \(\theta _1\) and \(\theta _0\) for the upper and the bottom level, respectively. The objective function for the upper level of the actuator can be stated as:

$$\begin{aligned} L_{upper}(\theta _1)&= \textrm{E}_{(s,a,g_n,R_u,s^{\prime })\sim D_1}[(y_{upper}-Q_u(s,g_n,a;\theta _1^{Q_u}))^{2}] \end{aligned}$$
(8)
$$\begin{aligned} y_{upper}&= R_u+\gamma \ Q_u^{\prime }(s^{\prime },g_n,\mu _u^{\prime }(s^{\prime };\theta _1^{\mu _u^{\prime }});\theta _1^{Q_u^{\prime }}) \end{aligned}$$
(9)

The objective function for the bottom level of the actuator can be stated as:

$$\begin{aligned} L_{bottom}(\theta _0)&= \textrm{E}_{(s,a,g_{sub},R_b,s^{\prime })\sim D_1}[(y_{bottom}-Q_b(s,g_{sub},a;\theta _0^{Q_b}))^{2}] \end{aligned}$$
(10)
$$\begin{aligned} y_{bottom}&= R_b+\gamma \ Q_b^{\prime }(s^{\prime },g_{sub},\mu _b^{\prime } (s^{\prime };\theta _0^{\mu _b^{\prime }});\theta _0^{Q_b^{\prime }}) \end{aligned}$$
(11)

where Q represents the critic network and \(Q^{\prime }\) represents the target-critic network. \(\mu\) represents the actor network and \(\mu ^{\prime }\) represents the target-actor network.
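The same target computation applies to both actuator levels. The hedged sketch below shows it for one level, with the target actor \(\mu^{\prime}\) and target critic \(Q^{\prime}\) passed in explicitly; the batch layout, network signatures and discount value are illustrative assumptions.

```python
import torch

def ddpg_critic_loss(critic, target_critic, target_actor, batch, gamma=0.98):
    """Critic update for one actuator level, in the spirit of Eqs. (8)-(11)."""
    s, g, a, r, s_next = batch                          # transitions with goal/subgoal g
    q = critic(s, g, a).squeeze(-1)                     # Q(s, g, a; theta)
    with torch.no_grad():
        a_next = target_actor(s_next, g)                # mu'(s', g): deterministic next action
        target_q = target_critic(s_next, g, a_next).squeeze(-1)
        y = r + gamma * target_q                        # bootstrapped target, Eqs. (9)/(11)
    return torch.nn.functional.mse_loss(q, y)
```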

4.2 Dynamic goal testing

Once we have disentangled the task within the HRL-MG hierarchical framework, to further address the challenging convergence issues during training in a sparse reward environment, we implement a dynamic goal-testing strategy for both the selector and the actuator, leveraging the principles of HER.

For the selector, at time step t, the selector begins with the state \(s_t\) and the desired goals \(\mathcal {G}_{final}\). A desired goal \(g_n\) is chosen as the current endpoint for the actuator based on the \(\epsilon\)-greedy algorithm. Normally, at each time step t, we would store the transition \((s_t,g_n,a_t,r_t,s_{t+1})\) in experience memory \(D_1\), where \(r_t=1\) if the actuator reaches \(g_n\) and 0 otherwise. After N time steps, we would store the transition \((s_t,\mathcal {G}_{final},g_n,R_{t+N},s_{t+N})\) in experience memory \(D_2\), where \(R_{t+N}=1\) if the selector reaches all desired goals \(\mathcal {G}_{final}\), 0 if one desired goal is achieved and \(-1\) otherwise. In this scheme, if the actuator reaches a goal \(g_n^{\prime }\in \mathcal {G}_{final}\) other than \(g_n\) during execution, it receives no positive reward, which makes the selector much less efficient at completing the overall task. Therefore, after the actuator performs each primitive action, the current state \(s^{\prime }\) is checked. If \(g_n\) has been reached, \(s^{\prime }\) is returned to the selector so that it can choose another desired goal \(g_n^{\prime} \in {\mathcal{G}}_{final}\). If instead some other goal \(g_n^{\prime} \in {\mathcal{G}}_{final}\) has been reached, \(s^{\prime }\) is likewise returned, the transition of the selector is re-labelled from \((s_t,\mathcal {G}_{final},g_n,R,s^{\prime })\) into \((s_t,\mathcal {G}_{final},g_n^{\prime },R,s^{\prime })\), and the transition of the actuator is re-labelled from \((s_t,g_n,a_t,r_t,s^{\prime })\) into \((s_t,g_n^{\prime },a_t,r_t^{\prime },s^{\prime })\). Due to the re-labelling of the actuator’s goal, the recalculated reward is \(r_t^{\prime }=1\).
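A condensed sketch of this dynamic goal test is given below: after every primitive action it checks whether any remaining desired goal (not only \(g_n\)) has been reached, and relabels the stored transitions accordingly. The transition tuples mirror the ones above; `phi` and the function names are illustrative assumptions.

```python
import numpy as np

def dynamic_goal_test(s_next, remaining_goals, phi, eps=0.1):
    """Return the desired goal actually reached after one primitive action, or None."""
    pos = phi(s_next)
    for g in remaining_goals:
        if np.linalg.norm(pos - np.asarray(g)) <= eps:
            return g
    return None

def relabel_transitions(selector_tr, actuator_tr, reached_goal):
    """Hindsight-relabel the selector and actuator transitions with the reached goal."""
    s, G_final, g_n, R, s_next = selector_tr
    s_a, g_a, a, r, s_a_next = actuator_tr
    selector_tr = (s, G_final, reached_goal, R, s_next)    # goal g_n -> reached goal
    actuator_tr = (s_a, reached_goal, a, 1.0, s_a_next)    # recomputed reward r' = 1
    return selector_tr, actuator_tr
```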

4.3 Algorithm

Algorithm 1 shows the overall procedure of the Hierarchical RL with Multi-Goal algorithm (HRL-MG). Algorithm 2 shows the pseudocode of the dynamic goal testing procedure.

Algorithm 1
figure a

HRL-MG

Algorithm 2
figure b

Dynamic goal test

5 Experiments

We perform experiments in both discrete and continuous multi-goal navigation environments. The two-dimensional maze game with a discrete state and action space serves as a motivating example to test the feasibility of our idea. Thereafter, we experiment in a continuous robot environment based on MuJoCo, called ‘Ant’ (Schulman et al. 2015). Next, we will show the results of our experiments in discrete and continuous environments respectively.

Fig. 3
figure 3

Maze_2D: Discrete 2D maze tasks, where the green squares represent the desired goals and the red squares represent the agent’s position. The left side of figures (a–d) shows the agent’s local observation and the right side shows “God’s view”. m represents the difficulty of the map, and n is the corresponding state space dimension. The agent’s task is to navigate through the maze to all the green squares to be successful. (Color figure online)

5.1 A discrete motivating example

Consider a discrete two-dimensional maze environment with the state space \(\mathcal {S}=n \times n\) (n specified by the user), the observation space \(\mathcal {O}\): the surrounding \(\left( 2k+1\right) \times \left( 2k+1\right)\) grids (k specified by the view_grid parameter), and the action space \(\mathcal {A}=\left\{ 0,1,2,3\right\}\) corresponding to the four directions of movement. In every episode, we randomly sample an initial state as well as two desired goals, and the policy receives a reward of \(-1\) at every step until the agent has reached both desired goals.

In the practical experimental setup for the maze environment, the default is k = 1, so the agent’s observation is a 3 × 3 local view that can be used to choose an action that avoids obstacles in the next step. In this work, the parameter n is set to \(n=11+2 \times m\) to realize navigation tasks of increasing difficulty, where m denotes the number of state-space expansions. As m increases, the state-space dimension n starts from the initial value of 11 and the task difficulty increases by 2 dimensions at a time. The parameter m ranges over the integers [0, 19], so the dimension of the environment’s state space n gradually increases from 11 to 49. This implies that the difficulty of the environment continually increases, as the number of states grows quadratically with n. At this point, the task difficulty far exceeds that of a typical discrete exploration task, which is sufficient to verify the effectiveness of our algorithm. Excluding the space occupied by the walls around the maze map, the overall state space size is \((n-2) \times (n-2)\) and the agent’s observation is \(3 \times 3\), as shown in Fig. 3.
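The map scaling described above can be reproduced with a few lines of arithmetic; the grid-size formula is taken from the text, and the interior-cell count assumes a one-cell wall on each side of the map.

```python
# Map difficulty m -> grid dimension n and interior state-space size (walls excluded).
for m in (0, 5, 10, 15, 19):
    n = 11 + 2 * m                 # grid dimension grows by 2 per difficulty step
    interior = (n - 2) ** 2        # cells inside the surrounding walls
    print(f"m={m:2d}  n={n:2d}  states={interior}")
# m=19 gives n=49 and 47*47 = 2209 interior cells, matching the maximum quoted in Fig. 4.
```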

We perform the two-goal tasks in the maze environment using our hierarchical framework with the subgoal guidance method and the Q-Learning (Watkins and Dayan 1992) algorithm, respectively. In this setup, for each difficulty level m, we conduct 20 episodes of experiments. We record the number of successful episodes out of the 20 and calculate the success rate for that difficulty level m. Subsequently, we increase the difficulty level m and repeat the process. The experiments use 4 random seeds. The smoothing parameter during image processing is 0.6. The results of the experiment are presented in Fig. 4.

From the experimental results, it is evident that around \(m=5\), both Q-Learning and our method achieve a 100% success rate in completing tasks with two goals. As m increases, around \(m=10\), the success rate of the Q-Learning algorithm gradually decreases to approximately 80%, while our method experiences a slight decline to around 90%. When m reaches 15, the success rate of the Q-Learning algorithm significantly drops, even reaching 0, while our method maintains a success rate of above 80%. Q-Learning can only solve for \(m<10\) while our method can easily keep the success rate around 80% for m up to 19. The primary challenge leading to the declining efficacy of Q-Learning algorithms is the exploration dilemma in large state spaces. As the state space continues to grow, the difficulty of Q-Learning’s exploration of the environment continues to rise. Due to the large state space, random exploration becomes very inefficient or even impractical. Consequently, Q-Learning may converge to local optima and fail to reach the global optimal solution. By structuring the learning process into hierarchical levels and setting intermediate subgoals, our method effectively decomposes the task into more manageable sub-tasks. This hierarchical organization facilitates more targeted exploration, allowing the agent to focus on relevant regions of the state space and thereby mitigating the adverse effects of its expansion.

Next, we extend our experiments toward continuous robotics tasks.

Fig. 4
figure 4

Mission success rates for maps with different levels of difficulty m. Graph a shows how the state-space size of the map grows as the difficulty m increases, reaching a maximum of 2209, making navigation over such a large map area extremely hard. Graph b shows the success rates of our method and Q-Learning as m increases. The number of random seeds for the experiment was 4

5.2 Continuous robot environment

The continuous robot simulation environment is based on the MuJoCo (Todorov et al. 2012) physics engine. The ‘Ant’ is a 3D robot consisting of a freely rotatable torso and four legs, each connected to the torso by hinged joints.

In this paper, we consider two different ‘Ant’ environments for the navigation task:

  • Ant_Reacher: In the Ant_Reacher task, the agent’s mission is to perform a multi-goal navigation task in a \(24 \times 24\) barrier-free map under the sparse reward condition. At the beginning of each episode, the distance from each initialized goal to the agent’s initial position in the environment is at least 8 units.

  • Ant_Four_Rooms: In the Ant_Four_Rooms task, the agent is required to perform multi-goal navigation on a \(16.5 \times 16.5\) map that is divided into four rooms by walls. This environment adds the difficulty of wall obstacles to the Ant_Reacher setting, further emphasizing the challenge of the sparse reward setting. At the beginning of each episode, the environment initialization ensures that no goal is placed in the same room as the agent.

Table 1 Details of the experimental setup on the ‘Ant’ environment

The details of the Markov model setup for the continuous robot environment ‘Ant’ are as follows. For readability, all the details of the HRL-MG hierarchical framework are also summarized in Table 1.

  • State: The ‘Ant’ robot’s state contains a 15-dimensional position vector and a 14-dimensional velocity vector. At each time step, the agent can obtain the 29-dimensional state value of the current time step from the observation of the environment.

  • Goal: We designed three levels of difficulty for both the Ant_Reacher and the Ant_Four_Rooms environments, with multi-goal tasks ranging from 2 goals to 5 goals and eventually up to 10 goals (Ant_Reacher) or 8 goals (Ant_Four_Rooms), to validate our method.

  • Action: In the selector network, the action space is an n-dimensional discrete action space, where n is the number of final goals. In the actuator network, the action space is continuous. In the upper-level network, the agent’s action space is a subset of the state space: a 5-dimensional vector comprising the 3D coordinates and the 2D velocity of the sub-goal. In the bottom-level network, the agent’s action space corresponds to the primitive action space of the ‘Ant’ environment based on the MuJoCo physics engine, an 8-dimensional action vector controlling the hip and ankle joints of the Ant robot’s four legs.

  • Reward: In all experimental setups, we used a reward function designed with the sparse reward mechanism. The reward function of the selector is shown in Eq. 2. The reward function for the upper level of the actuator is shown in Eq. 6 and the reward function for the bottom level of the actuator is shown in Eq. 7.

5.3 Results analysis

To verify whether HRL-MG improves the performance of multi-goal navigation tasks, we compare its performance with the HAC algorithm, a strong hierarchical algorithm for sparse rewards (Levy et al. 2017), and with the non-hierarchical DDPG+HER algorithm (Andrychowicz et al. 2017), under equivalent reward and environment settings (Table 2).

Table 2 Hyperparameters of the HRL-MG framework

5.3.1 Baselines

In this subsection, we present two baseline algorithms used for comparison experiments.

  • Hierarchical Actor-Critic (HAC) (Levy et al. 2017): The HAC hierarchical framework employs a continuous goal-oriented hierarchical architecture that accelerates learning and improves sampling efficiency by simultaneously learning multiple levels of policies.

  • DDPG+HER (Andrychowicz et al. 2017): The DDPG+HER method combines hindsight experience replay with off-policy RL algorithms to cope with the sparse binary reward problem in RL algorithms by sampling and relabeling experience.

5.3.2 Results

Figure 5 shows the success rate of the navigation task given two desired goals. Figures 6 and 7 demonstrate the navigation paths in the simulation environment for the Ant_Reacher and Ant_Four_Rooms tasks under different numbers of desired goals, respectively.

  • DDPG+HER: The DDPG+HER approach is completely incapable of succeeding at the multi-goal navigation task in both the Ant_Reacher and Ant_Four_Rooms environments. We also tried modifying the DDPG+HER setting so that a reward of 10 is returned after one desired goal is achieved; even with this setting and 5k training episodes, the agent was still unable to complete the multi-goal task. Our analysis is that learning a good policy for walking without tipping over in a robotic environment is already difficult, and adding multi-goal navigation on top turns the whole process into a very long-horizon decision problem, leading to the failure of the approach.

  • HAC: The HAC algorithm is able to perform multi-goal navigation tasks, but its success rate does not increase after stabilizing at about 60–70%. The algorithm only started to produce successful episodes around the fourth training batch (each batch containing 100 episodes of training and updating), so its overall convergence rate was slow. Meanwhile, when training HAC we found that its synchronous learning across multiple levels occupies a particularly large amount of GPU memory, which can easily lead to out-of-memory (OOM) problems and thus interrupt the program.

  • HRL-MG: Our method effectively overcomes the challenges of multi-goal navigation with sparse rewards in both the Ant_Reacher and Ant_Four_Rooms environments, achieving a success rate of over 80%. By the tenth batch of training, the success rate of our method was already over 80%, and approaching the 20th batch it was close to 95%. Our method demonstrates near-perfect task completion on continuous robotics tasks with two desired goals. Therefore, in the next subsection we gradually increase the number of desired goals to verify the processing capability of HRL-MG.

Fig. 5
figure 5

The success rate of the two desired goals in the Ant_Reacher environment. The blue line represents the DDPG+HER algorithm, which has a task success rate of 0 because it is unable to complete the multi-goal navigation task, as can be seen in the bottom right corner of the figure. All results were smoothed with a parameter of 0.3. (Color figure online)

Fig. 6
figure 6

Ant_Reacher: Demonstration of the navigation task in the Ant_Reacher environment with the number of the desired goals of 2, 5, and 10 respectively. The yellow spheres represent the desired goals, the two green spheres represent the subgoals and the small red spheres represent the path walked by the agent. The left column, labelled “Pathway_Begin”, represents the initial state of the task. The middle column, labelled “Pathway_Mid”, represents the intermediate state during the task. The right column, labelled “Pathway_End”, represents the final state upon task completion. (Color figure online)

Fig. 7
figure 7

Ant_Four_Rooms: Demonstration of the navigation task in the Ant_Four_Rooms environment with the number of the desired goals of 2, 5, and 8 respectively. The yellow spheres represent the desired goals, the two green spheres represent the subgoals and the small red spheres represent the path walked by the agent. The left column, labelled “Pathway_Begin”, represents the initial state of the task. The middle column, labelled “Pathway_Mid”, represents the intermediate state during the task. The right column, labelled “Pathway_End”, represents the final state upon task completion. (Color figure online)

5.4 Increase the number of desired goals

Based on the validation of the effectiveness of the HRL-MG algorithm, we further explored whether the number of goals has an impact on the performance of the algorithm. Table 3 shows the average success rates for different numbers of desired goals in both the Ant_Reacher and Ant_Four_Rooms environments.

For the Ant_Reacher environment, we set 2, 5, and 10 different desired goals respectively. For Ant_Four_Rooms, since its map is slightly smaller than the Ant_Reacher environment, we set 2, 5, and 8 different desired goals respectively. In the obstacle-free Ant_Reacher environment, the average success rates for tasks with 2, 5, and 10 desired goals all exceed 80%. In the Ant_Four_Rooms environment, due to the presence of wall obstacles, the success rates of multi-goal navigation tasks are slightly lower than in Ant_Reacher, with an average success rate of around 60%. When the number of desired goals increased to 8, the success rate plummeted to 37%; we speculate that the wall obstructions combined with the eight goals cause the number of steps needed to complete the task to grow sharply, and with the step limit kept constant, the task success rate plummets. Another possible reason lies in sub-goal selection: monitoring the training process revealed that sub-goal selection cannot ignore the location of walls, and a robot trying to reach a sub-goal keeps moving closer to a wall. This tends to cause the ‘Ant’ robot to roll over, and once the robot has rolled over it is difficult for it to return to a normal stance in a short time; such behaviour also conflicts with the penalty for colliding with obstacles.

5.5 Dynamic goal testing

In the training phase, we adopt the dynamic goal testing mechanism mentioned earlier, where each step in an episode is tested based on the current state and all the desired goals, to avoid the problem of an agent passing by another goal in the middle of navigation without realizing it.

Therefore, during the testing phase, when the agent detects that it has passed another goal on its way, it directly ends the current inner loop of the actuator and returns to the outer loop of the selector to select a desired goal again. Meanwhile, the current experience is relabelled and stored in the corresponding experience replay buffer, which dramatically improves the speed of completing multi-goal navigation tasks. For specific experimental demonstrations, please refer to our experimental video at https://youtu.be/28HB0CxnZPE.

Table 3 The average success rate using the HRL-MG under various settings of environment and number of goals

6 Discussion

6.1 Comparison to the TSP

The multi-goal navigation task proposed in this paper is similar to the travelling salesman problem (TSP) in terms of top-level control. However, we tackle the multi-goal navigation problem in a situation where the map environment is unknown, the goal locations are unknown, and the robot’s behavioural learning must also be taken into account. A hierarchical reinforcement learning architecture with reward feedback allows the robot to learn how to perform the multi-goal navigation task in such a situation and to adapt the top-level path to the current state. This is fundamentally different from the approach of route planning followed by traversal in navigation with maps. Our hierarchical framework has the potential to be extended to real task scenarios with multi-goal navigation problems in more complex situations.

6.2 Task scenario generality analysis

The HRL-MG hierarchical framework is not only applicable to the ‘Ant_Reacher’ and ‘Ant_Four_Rooms’ environments used in the simulation experiments but also generalizes to other goal-oriented RL tasks. In particular, it performs well in tasks with vector-represented goals, such as goal-oriented manipulation and navigation tasks in simulated environments, including mechanical arm manipulation, swinging, and various navigation tasks.

Additionally, the versatility of the HRL-MG framework extends to scenarios where tasks require complex decision-making processes. Its ability to handle diverse environments and tasks underscores its robustness and adaptability. Furthermore, the framework’s hierarchical structure facilitates efficient learning and decision-making by decomposing complex tasks into manageable sub-tasks, enabling more effective exploration and exploitation strategies. These characteristics make the HRL-MG framework a promising approach for addressing a wide range of real-world problems in robotics, autonomous systems, and beyond.

7 Conclusion

In this work, we have proposed a novel, composite hierarchical reinforcement learning framework called HRL-MG. The main idea of HRL-MG is to divide and conquer the hybrid action space, splitting long-sequence decisions into a series of short-sequence decision tasks. The HRL-MG framework consists of two main loops, the outer loop called the selector and the inner loop called the actuator, each playing a crucial role in the multi-goal navigation decision-making process. First, the outer-loop selector chooses among the multiple desired goals and passes the current desired goal to the inner loop for execution. After the inner-loop actuator receives the desired goal, it further decomposes it into subgoals, turning the task into a short-term decision process. Finally, the bottom layer of the actuator outputs the primitive actions. At the same time, we propose a dynamic goal-testing method to relabel the experience of the selector and the actuator when some other desired goal is reached in the middle of the process without the agent being aware of it. Through a series of experiments conducted in goal-oriented tasks within a robotic environment based on the MuJoCo physics engine, we have demonstrated the efficacy of the HRL-MG framework. Our results showcase its robust performance in various scenarios, including tasks with different numbers of desired goals.