Goal Exploration Augmentation via Pre-trained Skills for Sparse-Reward Long-Horizon Goal-Conditioned Reinforcement Learning

Reinforcement learning (RL) often struggles to accomplish sparse-reward long-horizon tasks in complex environments. Goal-conditioned reinforcement learning (GCRL) has been employed to tackle this difficult problem via a curriculum of easy-to-reach sub-goals. In GCRL, exploring novel sub-goals is essential for the agent to ultimately find the pathway to the desired goal, and doing so efficiently is one of the most challenging issues. Several goal exploration methods have been proposed to address this issue but still struggle to find the desired goals efficiently. In this paper, we propose a novel learning objective that optimizes the entropy of both achieved goals and new goals to be explored, for more efficient goal exploration in sub-goal-selection-based GCRL. To optimize this objective, we first explore and exploit the frequently occurring goal-transition patterns mined in environments similar to the current task to compose skills via skill learning. Then, the pre-trained skills are applied in goal exploration. Evaluation on a variety of sparse-reward long-horizon benchmark tasks suggests that incorporating our method into several state-of-the-art GCRL baselines significantly boosts their exploration efficiency while improving or maintaining their performance. The source code is available at: https://github.com/GEAPS/GEAPS.


Introduction
Reinforcement learning (RL) has successfully solved some complex problems, e.g., board games [1], protein prediction [2] and robotic locomotion tasks [3], where rewards as supervision signals play a crucial role in the learning process. Generally, it is possible to solve most if not all tasks via RL as long as the rewards are designed properly [4]. In contrast to non-trivial reward design principles, assigning valuable rewards only to states that reach the desired goals is easier and generalizes across different tasks. Such tasks can therefore be easily framed as goal-conditioned reinforcement learning (GCRL) problems that target reaching the desired goals. However, this simple reward design also makes it extremely hard for RL to learn how to reach the goals, because the agent rarely encounters them and thus rarely obtains valuable rewards for learning. The problem becomes more severe in long-horizon tasks, where the goals are reachable only after a long sequence of steps. Thus, under the sparse-reward design, how to explore the goals efficiently in long-horizon tasks remains a key problem for the wider application of RL.
In sparse-reward long-horizon GCRL tasks, instead of directly targeting the desired goals, the agent often learns to reach an implicit curriculum of sub-goals that are easier to reach and help the agent discover the pathway to the desired goals. Following the curriculum, the agent gradually expands its reachable sub-goals to cover the desired goals. In this process, the efficiency of exploring new sub-goals for the agent to learn is essential for discovering the desired goals efficiently. Several strategies have been proposed to explore new sub-goals efficiently [5][6][7][8][9]. However, there still exists a large gap to the level of efficiency required by wider RL applications.
Efficient exploration by human beings often builds on various patterns in their interactions with the environment. Even a baby learns to explore a room more efficiently via crawling, a behavior pattern that enables the baby to move to nearby positions. We hypothesize that a key component of efficient goal exploration is to utilize the behavior patterns by which the agent transitions to nearby goals, like the baby crawling. However, existing GCRL strategies do not take such patterns into consideration. In our work, we learn such behavior patterns in the form of skills [10,11] that are pre-trained in environments sharing properties with the downstream tasks. Each skill corresponds to an individual policy for the agent to conduct specific behavior patterns. The agent is trained in the pre-training environments to visit a set of different nearby goals following each skill, and those skills are transferred to downstream tasks for more efficient exploration. From the viewpoint of exploration, we are interested in behavior patterns that visit goals as widely as possible, as they tend to discover more novel goals. Thus, we propose a maximum entropy objective on the distribution of achieved goals induced by following those skills.
Our main contributions are summarized as follows: 1) We propose a maximum entropy goal exploration method, goal exploration augmentation via pre-trained skills (GEAPS), to augment exploration in GCRL. 2) We introduce the entropy of goals in skill learning, which stabilizes skill learning and helps the agent gain more efficiency in goal exploration on challenging downstream tasks. Furthermore, we conduct a theoretical analysis of this entropy-based skill learning method. 3) We provide theoretical analyses of the benefits of utilizing pre-trained skills and of the effectiveness achieved through our exploration strategy under specific conditions. 4) We demonstrate that incorporating our GEAPS algorithm into state-of-the-art GCRL methods boosts their exploration efficiency on several sparse-reward long-horizon benchmark tasks.

Related Work
Exploration for New Goals. Using uniformly sampled actions, as in the ϵ-greedy algorithm, and adding noise to policy actions are common exploration strategies in RL. However, they are not sufficient to solve sparse-reward long-horizon tasks. Different goal exploration methods have been proposed to accomplish those challenging tasks. One class of methods focuses on a sub-goal selection strategy that supports better goal exploration. Skew-Fit [6] samples sub-goals from a skewed distribution that is approximately uniform over historical achieved goals, and OMEGA [7] selects sub-goals by maximizing the entropy of achieved goals from low-density regions. Goal GAN [5] and AMIGO [12] select sub-goals of intermediate difficulty, which prevents the agent from getting trapped in tasks that are too easy while avoiding those that are too difficult. By and large, however, such methods still rely on uniformly sampled actions and action noise to find new goals while pursuing sub-goals, which restricts goal exploration to neighboring states along the trajectory to the sub-goal. To overcome this limitation, [7,13,14] additionally explore goals via random actions after reaching the specific sub-goal, which gives the agent more freedom to explore around the sub-goal. Nevertheless, random actions do not involve any learned knowledge about tasks other than the action space, which restricts them from exploring a wide range of goals. In contrast, we incorporate behavior patterns for transitioning to nearby goals, in the form of skills pre-trained on similar tasks. The pre-trained skills enable quicker transitions to nearby goals so that a wider range of goals can be explored within the same number of time steps. As a model-based method, LEXA [8] trains an exploration policy in a world model of the environment to discover novel goals and performs exploration in the environment via the trained exploration policy. However, a notable lack of experience around the novel goals makes the simulated dynamics inaccurate around them. The inaccurate simulated dynamics also prevent the exploration policy from exploring a wider range of novel goals. As a model-free method, our method does not rely on the exact dynamics around the novel goals. Instead, we explore new goals with the behavior patterns for transitioning to nearby goals, which increases the chance that the agent reaches nearby goals faster than methods without any knowledge of goal transitions.
Skill Learning. To learn the behavioral patterns for transitioning to nearby goals, we perform skill learning in pre-training environments, with each skill learning to reach a different set of goals. To achieve this, a well-known idea is to maximize the mutual information between skills and the goals that are going to be visited, which can be expressed as follows:

$$
\begin{aligned}
I(G; Z) &= H(Z) - H(Z \mid G) && \text{(1a)} \\
        &= H(G) - H(G \mid Z) && \text{(1b)}
\end{aligned}
$$

where $G$ is the goal space and $Z$ denotes the latent space of the skill policy, in which each skill is represented by the skill policy conditioned on an individual latent vector. As the state itself can be considered as a goal, we review the related works below in terms of goals for simplicity. With Eq. 1a, SNN4HRL [10] and DIAYN [11] learn skills by fixing the distribution of latent vectors and minimizing the conditional entropy $H(Z \mid G)$. DADS [15] estimates $H(G)$ and $H(G \mid Z)$ with the help of a skill-dynamics model and learns the skills via Eq. 1b. However, their learned skills can cover only a small portion of reachable goals because mutual information may have many optima and covering more goals does not necessarily yield higher mutual information, as shown in Section 4.2. EDL [16] first explores the goal space, then encodes those goals into discrete latent vectors $Z$ via a trained VQ-VAE [17], and finally learns each skill from rewards based on the likelihoods of the achieved goals predicted by the VQ-VAE decoder. Though the skills learned via EDL can reach goals further away, they are not optimized to reach all reachable goals. As the pre-training environments do not reveal the exact structures of downstream tasks, some behavioral patterns for transitioning to nearby goals may not work as expected. Thus, we expect the learned behavioral patterns to support as many transitions to nearby goals as possible so that they are more robust to different situations in downstream tasks.
To achieve this, we introduce an alternative objective for skill learning based on mutual information maximization. The maximized entropy of goals ensures that skills can reach a wider range of nearby goals and avoid bad local optima in goal covering, as demonstrated in our experiments reported in Section 5.3.4.

Preliminary
While traditional reinforcement learning is often modeled as a Markov decision process (MDP), GCRL augments the MDP with a goal state to form a goal-augmented MDP (GA-MDP) [18]. A GA-MDP $M_G$ is denoted by a tuple $(S, A, \mathcal{T}, G, r, \gamma, \phi, p_{dg}, T)$, where $S$, $A$, $\gamma$ and $T$ are the state space, action space, discount factor and horizon, respectively. $\mathcal{T} : S \times A \times S \to [0, 1]$ is the transition function, $G$ is the goal space, $p_{dg}$ is the desired goal distribution and $\phi : S \to G$ is a tractable mapping function that maps a state to its corresponding achieved goal. The reward function $r : S \times G \times A \to \mathbb{R}$ provides the learning signals for the agent, but in the sparse-reward setting valuable rewards can only be obtained when the agent reaches the desired goals. GCRL requires the agent to learn a policy $\pi : S \times G \times A \to [0, 1]$ to maximize the expected cumulative return:

$$J(\pi) = \mathbb{E}_{g \sim p_{dg},\; a_t \sim \pi(\cdot \mid s_t, g),\; s_{t+1} \sim \mathcal{T}(\cdot \mid s_t, a_t)} \left[ \sum_{t=0}^{T-1} \gamma^{t}\, r(s_t, g, a_t) \right].$$

In GCRL$^1$, the agent takes actions either in pursuit of a goal or to explore more goals. As depicted in Figure 1, we divide the entire interaction process in an iteration into goal pursuit and goal exploration, depending on whether the decision policies are conditioned on goals. During policy training in the $k$th iteration, let $\pi^g_k$, $\pi^c_k$ and $\pi^e_k$ denote the policy deciding a behavioral goal for the agent to pursue, the goal-conditioned policy for the agent to achieve a goal and the exploration policy for the agent to explore new goals, respectively. Before goal pursuit, a goal $g_k$ is sampled from $\pi^g_k(G)$, i.e., $g_k \sim \pi^g_k(G)$. During goal pursuit, the agent takes an action $a_t \sim \pi^c_k(s_t, g_k)$ at each time step $t$ for $T^c_k \le T$ steps, where $T^c_k$ refers to the number of steps required to reach the goal $g_k$ at iteration $k$. In the goal exploration process, the agent takes actions $a_t \sim \pi^e_k(s_t)$ for $T^e_k \le T$ steps. To make the best use of interaction steps, we perform goal exploration immediately after the agent achieves the goal during goal pursuit [7,13,14] instead of conducting goal exploration separately. Thus, the total number of steps in iteration $k$ is $T = T^c_k + T^e_k$, meaning that the number of steps taken for goal exploration, $T^e_k$, depends on $T^c_k$. The data collected from both goal pursuit and exploration are stored in the replay buffer $B_k$ at iteration $k$. In the $(k+1)$th iteration, $\pi^g_k$, $\pi^c_k$ and $\pi^e_k$ are updated to $\pi^g_{k+1}$, $\pi^c_{k+1}$ and $\pi^e_{k+1}$, respectively, based on the training data in the current replay buffer $B_k$. Then, the updated policies are used in the new round of data collection. Furthermore, we denote the achieved goals as the set $G_B := \{\phi(s) \mid s \in B_k\}$, their distribution in the goal space $G$ as $p_{ag,k}(G)$ and their entropy as $H_{ag,k}(G)$ at iteration $k$. To simplify the presentation, we shall drop the explicit iteration index, $k$, from the subscripts of the above notation in the rest of the paper.

$^1$ Some GCRL methods may not have all the components in the generic framework shown in Figure 1.
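To make the step budget $T = T^c + T^e$ concrete, the following is a minimal sketch of one data-collection iteration on a toy one-dimensional chain. The function name `collect_iteration`, the toy dynamics and the hand-written policies are illustrative assumptions and are not part of any GCRL codebase; the sketch only shows how goal pursuit hands its remaining steps to goal exploration.

```python
import random

def collect_iteration(select_goal, goal_policy, explore_policy, step, phi, s0, horizon):
    """One iteration: pursue a sampled goal, then explore with the remaining steps."""
    buffer, s, t = [], s0, 0
    g = select_goal()  # behavioral goal g ~ pi^g
    # Goal pursuit: goal-conditioned actions until g is achieved or the budget ends.
    while t < horizon and phi(s) != g:
        a = goal_policy(s, g)
        s_next = step(s, a)
        buffer.append((s, a, s_next, g))
        s, t = s_next, t + 1
    t_c = t
    # Goal exploration: the remaining T_e = T - T_c steps use the goal-free policy.
    while t < horizon:
        a = explore_policy(s)
        s_next = step(s, a)
        buffer.append((s, a, s_next, None))
        s, t = s_next, t + 1
    return buffer, t_c, horizon - t_c

# Toy chain (assumption): integer states in [0, 20], actions -1 or +1, phi is the identity.
rng = random.Random(0)
buffer, t_c, t_e = collect_iteration(
    select_goal=lambda: 5,
    goal_policy=lambda s, g: 1 if g > s else -1,
    explore_policy=lambda s: rng.choice([-1, 1]),
    step=lambda s, a: max(0, min(20, s + a)),
    phi=lambda s: s,
    s0=0,
    horizon=12,
)
```

In this toy run the goal is five steps away, so the exploration phase receives the remaining seven steps of the twelve-step budget.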

Method
In this section, we propose a new learning objective for goal exploration, then present skill learning via the goal-transition patterns to optimize our learning objective, which leads to our GEAPS algorithm.

Learning Objective for Goal Exploration
Unlike the previous works reviewed in Section 2, we focus on the goal exploration associated with goal-independent behavior. As it is hard to directly explore desired goals in long-horizon sparse-reward tasks, a well-known learning objective is to maximize the entropy of historical achieved goals, $H(G)$. OMEGA [7] has shown how to optimize the entropy of already achieved goals, $H_{ag}(G)$, in the goal pursuit process. We take a step forward by analyzing how to further optimize $H_{ag}(G)$ via goal exploration immediately after goal pursuit in each trial. Let $p_e(G)$ and $H_e(G)$ denote the distribution of goals encountered in goal exploration and its entropy, respectively. In the goal exploration process starting from the initial state $s_0$ and going through $T_e$ transitions, we have

$$p_e(g) = \mathbb{E}_{\tau \sim \Pi_e(\tau)} \left[ \frac{1}{T_e} \sum_{i=1}^{T_e} \mathbb{I}\big(\phi(s_i) = g\big) \right], \qquad (2)$$

where $\tau = (s_0, a_0, \ldots, s_{T_e})$ denotes a trajectory that adheres to the distribution $\Pi_e(\tau) = \prod_{i=0}^{T_e - 1} \mathcal{T}(s_{i+1} \mid s_i, a_i)\, \pi_e(a_i \mid s_i)$ under the exploration policy $\pi_e$. The indicator function $\mathbb{I}(\phi(s_i) = g)$ indicates whether the state $s_i$ achieves the goal $g$. After goal exploration (cf. Figure 1), the updated distribution of achieved goals $p'_{ag}(G)$ is a weighted mixture of $p_{ag}(G)$ and $p_e(G)$:

$$p'_{ag}(G) = c\, p_{ag}(G) + (1 - c)\, p_e(G),$$

where $c = \frac{|B| + T_c}{|B| + T_c + T_e}$ and $|B|$ is the size of the current replay buffer. To develop our learning objective for goal exploration augmentation, we formulate a proposition as follows:

Proposition 1 Let $H'_{ag}(G)$ represent the updated entropy of achieved goals following the goal exploration. This entropy is bounded from below by the sum of the weighted entropies of the original achieved goals and the goals encountered during goal exploration, namely, $c\, H_{ag}(G)$ and $(1 - c)\, H_e(G)$. That is,

$$H'_{ag}(G) \ge c\, H_{ag}(G) + (1 - c)\, H_e(G). \qquad (3)$$

The proof of Proposition 1 can be found in Appendix A. According to Eq. 3, an increase in $H_{ag}(G)$ and $H_e(G)$ elevates the lower bound of the resulting entropy $H'_{ag}(G)$. As OMEGA [7] asserts, $H_{ag}(G)$ can be maximized by selecting low-density goals as sub-goals. However, optimizing $H_e(G)$ is challenging due to the agent's limited understanding of the dynamics around new sub-goals, which may necessitate arbitrary exploration.
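The lower bound in Proposition 1 follows from the concavity of entropy and can be checked numerically. The snippet below is a small sketch with made-up distributions over four goals; the numbers for `p_ag`, `p_e`, the buffer size and the step counts are arbitrary assumptions chosen only for illustration.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def mixture(p, q, c):
    """Weighted mixture c * p + (1 - c) * q of two distributions on the same support."""
    return [c * a + (1 - c) * b for a, b in zip(p, q)]

# Illustrative (assumed) distributions over four goals.
p_ag = [0.7, 0.2, 0.1, 0.0]   # goals already achieved
p_e = [0.0, 0.1, 0.3, 0.6]    # goals met during exploration

buffer_size, t_c, t_e = 80, 10, 10
c = (buffer_size + t_c) / (buffer_size + t_c + t_e)

p_new = mixture(p_ag, p_e, c)
h_new = entropy(p_new)                                  # H'_ag(G)
lower = c * entropy(p_ag) + (1 - c) * entropy(p_e)      # right-hand side of the bound
print(f"c={c:.2f}  H'_ag={h_new:.4f}  lower bound={lower:.4f}")
```

Because the exploration distribution places mass on goals the buffer has rarely seen, the mixed entropy exceeds the weighted sum in this example, which is the situation the exploration phase aims to create.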
Despite unknown dynamics, we observe that overlapping elements may exist between the agent's transition mechanisms in the current task and in a pre-training environment. These shared features form goal-transition patterns, which are beneficial for exploring unfamiliar goals. When all goal-transition patterns are available at a new sub-goal, the generic entropy of explored goals $H_e(G)$ is denoted by $\hat{H}_e(G)$. To optimize $\hat{H}_e(G)$, an exploration policy must aim to visit as many goals as feasible within a given time frame, while avoiding revisits and maintaining stochasticity. Backed by the theoretical justification presented in Section 4.5.1, we suggest developing an exploration policy based on an array of stochastic pre-trained skills. Each skill targets a maximum set of sub-goals, leading to a maximized $\hat{H}_e(G)$. Although this assumption may not hold during actual exploration, where missing goal-transition patterns lead to failed transitions, $\hat{H}_e(G)$ still acts as an upper bound of $H_e(G)$. Hence, enhancing $\hat{H}_e(G)$ could significantly improve the agent's exploration efficiency.

Skill Acquisition
For a given environment, optimizing our learning objective in Eq. 3 leads to the maximum entropy of goals to be explored in goal exploration (cf. Figure 1). However, the exact dynamics around the current state are often unknown, hence it is infeasible to directly maximize the entropy of goals to be explored via $p_e(G)$ in Eq. 2. Fortunately, this issue can be addressed with auxiliary information named goal-transition patterns. A goal transition always has a starting goal $g_s$ and an end goal $g_e$, but goal transitions with the same $g_s$ and $g_e$ may involve different intermediate states. Here, we define a goal-transition pattern as a goal transition process that can transit across different states with actions but preserves the same properties independent of $g_s$ and $g_e$ in the goal space $G$. It is analogous to image recognition, where an object's identity is independent of its location in the image. Exploring with a goal-transition pattern from a state tends to make the changes specified by the pattern via goal-independent actions in $G$. Goal-transition patterns enable planning in $G$ to avoid the canceling-out effect of different actions used for goal exploration. Composing a set of frequently occurring inherent goal-transition patterns, named skills, in a manner that maximizes the entropy of goals to be explored enables an agent to expand its achieved goal space more efficiently for better goal covering. Such skills can be learned via another policy as described below.
Although we cannot find all the frequently occurring goal-transition patterns without traversing the entire environment, we observe that there are many goal-transition patterns in common that can be mined from similar environments via pre-training. A pre-training environment should share both the same agent space $S^{agent}$ [10,19] and the same goal space $G$ with the current task. The agent space $S^{agent}$ is simply a shared subspace of the state space $S$ and is semantically the same across a collection of relevant tasks. $S^{agent}$ generally does not convey goal information since the transition dynamics in the goal spaces often differ across the pre-training tasks. In our work, $S^{agent}$ needs to be independent of the goal space $G$ of any task. Thus, the goal-transition patterns can be transferred to a GCRL task within $S^{agent}$ via learned policies that execute the inherent goal-transition patterns mined in the pre-training environments. As our ultimate goal is to learn the composition of goal-transition patterns, or skills, we can directly learn another policy that maximizes the expected entropy of goals to be explored in the pre-training environments without modeling the behavior for each goal-transition pattern explicitly. Thus, the behavior of frequently occurring goal-transition patterns is automatically encoded by the policy via learning. We formulate such policy learning as a skill learning process.

Skill Learning
We denote a skill by a latent vector $\mathbf{z}$, the set of all the pre-trained skills by $Z$, and the corresponding multi-modal skill policy by $\pi_Z$. For each skill, $\pi_Z$ selects an action $a_t \sim \pi_Z(a_t \mid s^{agent}_t, \mathbf{z})$. To learn a set of diverse skills, we formulate the learning objective as the mutual information between the skills and the goals conditioned on initial goal states, as in Eqs. 1a and 1b. However, previous skill learning methods often fail to learn a wide coverage of goals, which is attributed to the fact that there exist many optima in the mutual information function and covering more goals does not always lead to higher mutual information. Without loss of generality, we assume both the goal space $G$ and the latent space $Z$ for skills are discrete, and it is common to have $|G| > |Z|$. Even when the mutual information $I(G; Z)$ has been maximized to $\log|Z|$ via Eq. 1a, the entropy of goals $H(G)$ can still vary from $\log|Z|$ to $\log|G|$. When $H(G)$ takes low values, the goal coverage is poor, which motivates us to develop an alternative skill learning strategy.
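The claim that maximal mutual information does not pin down goal coverage can be illustrated with a small discrete example. In the sketch below, two skills act on four goals with uniformly sampled skills; the two tables `narrow` and `wide` are hypothetical choices of $p(g \mid \mathbf{z})$ made up for illustration. Both reach $I(G; Z) = \log 2$, yet $H(G)$ is $\log 2$ in one case and $\log 4$ in the other.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)

def goal_entropy_and_mi(p_g_given_z):
    """Return (H(G), I(G;Z)) for uniformly sampled skills.

    p_g_given_z[z][g] holds p(g | z); I(G;Z) is computed as H(G) - H(G|Z) (Eq. 1b).
    """
    n_z = len(p_g_given_z)
    n_g = len(p_g_given_z[0])
    p_g = [sum(row[g] for row in p_g_given_z) / n_z for g in range(n_g)]
    h_g = entropy(p_g)
    h_g_given_z = sum(entropy(row) for row in p_g_given_z) / n_z
    return h_g, h_g - h_g_given_z

# Each skill always ends at a single goal: maximal MI, poor coverage.
narrow = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0]]
# Each skill spreads uniformly over its own disjoint pair of goals: same MI, full coverage.
wide = [[0.5, 0.5, 0.0, 0.0],
        [0.0, 0.0, 0.5, 0.5]]

h_narrow, mi_narrow = goal_entropy_and_mi(narrow)
h_wide, mi_wide = goal_entropy_and_mi(wide)
print(f"narrow: H(G)={h_narrow:.3f}, I={mi_narrow:.3f}")
print(f"wide:   H(G)={h_wide:.3f}, I={mi_wide:.3f}")
```

The difference between the two cases is entirely in $H(G \mid Z)$, which is zero for `narrow` and $\log 2$ for `wide`.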
Unlike the prior skill learning works, e.g., SNN4HRL [10] and DIAYN [11], we want a diverse set of skills obtained by maximizing both $I(Z; G)$ and $H(G)$. In our work, we do not maximize $H(G)$ directly but $H(G \mid Z)$ instead, given the fact that Eq. 1b leads to

$$H(G) = I(Z; G) + H(G \mid Z), \qquad (4)$$

so that, when $I(Z; G)$ is maximized, increasing $H(G \mid Z)$ increases $H(G)$.

Algorithm 1 Goal Exploration Augmentation via Pre-trained Skills (GEAPS)
Given: Skill space $Z$, pre-trained skill policy $\pi_Z$, initial state for goal exploration $s_0$, skill horizon $T_s$, goal exploration horizon $T_e$, replay buffer $B$.
procedure GEAPS
    while $t \le T_e$ do
        if $t \bmod T_s = 0$ then
            sample a skill $\mathbf{z} \sim p(\mathbf{z} \mid s_t)$
        end if
        sample the action $a_t \sim \pi_Z(s^{agent}_t, \mathbf{z})$
        execute $a_t$ and observe $s_{t+1}$
        save $(s_t, a_t, s_{t+1}, \emptyset)$ in replay buffer $B$
        $t \leftarrow t + 1$
    end while
end procedure

In the skill learning process, however, we still cannot obtain the exact $p(\mathbf{z} \mid g)$ and $p(g \mid \mathbf{z})$, which require integration over all reachable goals and skills. We approximate $p(\mathbf{z} \mid g)$ and $p(g \mid \mathbf{z})$ with $q(\mathbf{z} \mid g)$ and $q(g \mid \mathbf{z})$ by using the Monte Carlo method. Motivated by the previous works [10,11], we set the reward for mutual information maximization, $r^I_z$, as in Eq. 5. To maximize the entropy $H(G \mid Z)$, the distribution $p(g \mid \mathbf{z})$ is expected to be as uniform as possible. Thus, we design the reward $r^H_z$ in Eq. 6 to encourage visiting less-explored goals, so that learning converges when the distribution of goals to be explored is uniform. Combining the rewards specified in Eqs. 5 and 6, we obtain the pseudo reward for our skill training in Eq. 7, where $\beta$ is a coefficient that trades off $r^I_z$ against $r^H_z$. In GCRL, transitioning between goals $g$ and $g'$ is often represented as $g \to g + \Delta(g, g')$, where $\Delta G = \Delta(g, g')$ signifies the desired goal transition. To optimize $\hat{H}_e(G)$, we actually learn the skills by maximizing $H(\Delta G)$ in a pre-training environment.
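As an illustration of how a pseudo reward of this kind could be computed from Monte Carlo counts, the sketch below uses discrete skills and goals. The specific forms are assumptions, not taken from the text: the information term is taken as $\log q(\mathbf{z} \mid g)$ in the spirit of [10,11], the entropy term as $-\log q(g \mid \mathbf{z})$ so that rarely visited goals are rewarded, and the two are combined as $r^I_z + \beta\, r^H_z$. The class name `SkillRewardEstimator` and the additive smoothing are likewise hypothetical.

```python
import math
from collections import Counter

class SkillRewardEstimator:
    """Count-based estimates of q(z|g) and q(g|z) for discrete skills and goals."""

    def __init__(self, n_skills, n_goals, beta=0.1, smoothing=1.0):
        self.n_skills, self.n_goals = n_skills, n_goals
        self.beta, self.smoothing = beta, smoothing
        self.counts = Counter()

    def update(self, z, g):
        """Record that skill z reached goal g."""
        self.counts[(z, g)] += 1

    def _count(self, z, g):
        return self.counts[(z, g)] + self.smoothing

    def q_z_given_g(self, z, g):
        return self._count(z, g) / sum(self._count(k, g) for k in range(self.n_skills))

    def q_g_given_z(self, z, g):
        return self._count(z, g) / sum(self._count(z, h) for h in range(self.n_goals))

    def info_reward(self, z, g):
        # Assumed form: high when goal g identifies skill z.
        return math.log(self.q_z_given_g(z, g))

    def entropy_reward(self, z, g):
        # Assumed form: high when skill z has rarely visited goal g.
        return -math.log(self.q_g_given_z(z, g))

    def pseudo_reward(self, z, g):
        # Assumed combination with trade-off coefficient beta.
        return self.info_reward(z, g) + self.beta * self.entropy_reward(z, g)

# Skill 0 reached goal 0 eight times and goal 1 once; skill 1 has no data yet.
est = SkillRewardEstimator(n_skills=2, n_goals=3)
for _ in range(8):
    est.update(0, 0)
est.update(0, 1)
```

Under these assumed forms, the entropy term for skill 0 is larger at the rarely visited goal 1 than at the frequently visited goal 0, which pushes $q(g \mid \mathbf{z})$ toward uniformity.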

Goal Exploration Augmentation Strategy
The trained skill policy is used in $\pi_e$ for goal exploration (cf. Figure 1). During goal exploration, goal-transition patterns are not available in all states, hence the desired transition of a skill is sometimes unreachable. Thus, we switch skills every $T_s$ ($T_s < T_e$) steps. We summarize our GEAPS algorithm in Algorithm 1, which enables the trained skills $\pi_Z$ to be used for learning an exploration policy $\pi_e$ during goal exploration.
As depicted in Figure 1, our GEAPS algorithm can be easily incorporated into the generic GCRL framework to improve exploration efficiency. In the goal pursuit process, an existing GCRL algorithm, e.g., Goal GAN [5], Skew-Fit [6] or OMEGA [7] as used in our experiments, is employed to learn a policy, $\pi^g_k$, for the agent to decide a behavioral goal to pursue and a goal-conditioned policy, $\pi^c_k$, for the agent to achieve a goal. After the $k$th round of goal pursuit is completed, our GEAPS algorithm is invoked in the goal exploration stage to learn an exploration policy, $\pi^e_k$, for the agent to explore new goals. Alternating the goal pursuit and exploration processes lets the two algorithms work in synergy to solve a sparse-reward long-horizon reinforcement learning task.
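The exploration loop of Algorithm 1 can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: the function name `geaps_explore`, the two-dimensional toy dynamics and the four displacement "skills" are hypothetical stand-ins for a pre-trained skill policy, the loop runs exactly $T_e$ steps, and the full state is passed where the algorithm uses the agent-space observation.

```python
import random

# Toy stand-in for pre-trained skills (assumption): each skill is one fixed displacement.
SKILLS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def geaps_explore(s0, sample_skill, skill_policy, step, t_e, t_s, buffer):
    """Explore for t_e steps, re-sampling the active skill every t_s steps."""
    s, z, skills_used = s0, None, []
    for t in range(t_e):
        if t % t_s == 0:              # switch skills every T_s steps
            z = sample_skill(s)
            skills_used.append(z)
        a = skill_policy(s, z)        # a_t ~ pi_Z(. | s_t^agent, z)
        s_next = step(s, a)
        buffer.append((s, a, s_next, None))  # no goal is attached during exploration
        s = s_next
    return s, skills_used

rng = random.Random(0)
buffer = []
final_state, skills_used = geaps_explore(
    s0=(0, 0),
    sample_skill=lambda s: rng.randrange(len(SKILLS)),
    skill_policy=lambda s, z: SKILLS[z],
    step=lambda s, a: (s[0] + a[0], s[1] + a[1]),
    t_e=10,
    t_s=4,
    buffer=buffer,
)
```

With $T_e = 10$ and $T_s = 4$, a skill is sampled at steps 0, 4 and 8, so three skills are used and the last one runs for only two steps.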

Theoretical Analysis
In this subsection, we provide theoretical analyses concerning our entropy-maximization-based methods for skill learning and goal exploration. The proofs of the propositions can be found in Appendix A.

On the Role of Skill Composition in Learning Optimal Exploration Policy
Under the exploration policy $\pi_e$, we characterize $\Omega$ as the set of all possible trajectories within the exploration horizon $T_e$. Each exploration trajectory, represented by $\tau$, follows the distribution $\tau \sim p(\Omega)$. Consequently, the entropy $\hat{H}_e(G)$ can be written as the sum of the mutual information $I(\Omega; G)$ and the conditional entropy $H(G \mid \Omega)$:

$$\hat{H}_e(G) = I(\Omega; G) + H(G \mid \Omega).$$

To maximize the mutual information $I(\Omega; G)$, we strive for each trajectory to cover a distinct subset of goals. To optimize $H(G \mid \Omega)$, we aim for each trajectory to visit as many goals as feasible, with uniform probability of visiting each goal within its respective subset. In exploration scenarios deploying uniform primitive actions, each trajectory has an equal likelihood $1/|\Omega|$ of being generated. It is crucial to note that some trajectories may be limited to a single goal, whereas a specific goal could be visited by numerous trajectories.
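The decomposition of the goal entropy into $I(\Omega; G)$ and $H(G \mid \Omega)$ can be verified on a toy trajectory set. In the sketch below, trajectories are given as lists of visited goals, each trajectory has probability $1/|\Omega|$, and a goal is drawn uniformly from the steps of the chosen trajectory; the three example trajectories are arbitrary assumptions.

```python
import math
from collections import Counter

def decompose(trajectories):
    """Return (H(G), I(Omega;G), H(G|Omega)) for uniformly likely trajectories."""
    n = len(trajectories)
    joint = Counter()                      # p(tau, g)
    for i, traj in enumerate(trajectories):
        for g in traj:
            joint[(i, g)] += 1.0 / (n * len(traj))
    p_g = Counter()                        # marginal p(g)
    for (_, g), p in joint.items():
        p_g[g] += p
    h_g = -sum(p * math.log(p) for p in p_g.values())
    # I(Omega;G) = sum p(tau,g) log[ p(tau,g) / (p(tau) p(g)) ], with p(tau) = 1/n.
    mi = sum(p * math.log(p * n / p_g[g]) for (_, g), p in joint.items())
    # H(G|Omega) = -sum p(tau,g) log p(g|tau), with p(g|tau) = n * p(tau,g).
    h_cond = -sum(p * math.log(p * n) for p in joint.values())
    return h_g, mi, h_cond

# One trajectory covers two goals, one stays at a single goal, one covers two new goals.
trajectories = [["a", "b"], ["a", "a"], ["c", "d"]]
h_g, mi, h_cond = decompose(trajectories)
print(f"H(G)={h_g:.4f}  I={mi:.4f}  H(G|Omega)={h_cond:.4f}")
```

The second trajectory illustrates the remark above: it is limited to a single goal and therefore contributes nothing to $H(G \mid \Omega)$, while goal `a` is visited by more than one trajectory.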
For the optimal exploration policy, we denote the set of trajectories following the policy as $\Omega^* \subseteq \Omega$, with their corresponding distribution represented as $p_{\Omega^*}$. Consequently, this optimal exploration policy yields the optimal entropy of prospective goals, denoted by $\hat{H}^*_e(G)$, as follows: $\hat{H}^*_e(G) = I(\Omega^*; G) + H(G \mid \Omega^*)$.
However, directly optimizing the exploration policy within the trajectory space $\Omega$ to obtain $\Omega^*$ and $p_{\Omega^*}$ poses significant computational challenges. This complexity primarily arises from the exponential growth of $|\Omega| = |A|^{T_e}$ as a function of $T_e$. The difficulty is further compounded when the action space is continuous. To mitigate these challenges, we consider simplifying the optimization problem. Following this proposition, we advocate pre-trained skills as a viable mechanism for approaching the optimal settings. This approach offers an alternative to direct optimization and a promising route to efficient exploration.
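The exponential growth of $|\Omega| = |A|^{T_e}$ is easy to quantify for a discrete action space. The sketch below also counts the number of distinct sequences of skill choices when a skill is re-sampled every $T_s$ steps; this second count is only an illustrative assumption about how a skill-level search space might be sized, since stochastic skills can still generate many trajectories per skill sequence. Both function names are hypothetical.

```python
import math

def num_action_trajectories(n_actions: int, t_e: int) -> int:
    """|Omega| = |A|^{T_e}: primitive-action sequences of length T_e."""
    return n_actions ** t_e

def num_skill_sequences(n_skills: int, t_e: int, t_s: int) -> int:
    """Sequences of skill choices when a skill is re-sampled every T_s steps."""
    return n_skills ** math.ceil(t_e / t_s)

for t_e in (4, 8, 12):
    print(t_e, num_action_trajectories(4, t_e), num_skill_sequences(4, t_e, t_s=2))
```

With four actions, four skills and $T_s = 2$, the exponent of the skill-level count is half that of the action-level count, which is the sense in which composing skills shrinks the space to be searched in this toy setting.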

On the Role of Pre-trained Skills in Improving Exploration Efficiency
In a given environment, a goal transition $\Delta G$ can recur, and this recurrence is evident through various distinctive goal-transition patterns. To systematically understand these patterns, we formulate a goal-transition pattern as $\psi = \{s^{agent}_{start}, s^{agent}_{end}, \Delta G, \Delta A\}$. Within this formulation, $s^{agent}_{start}$ and $s^{agent}_{end}$, which belong to the agent state space $S^{agent}$, are the initial and terminal states of the agent for the goal transition $\Delta G$, respectively. $\Delta A$ denotes a sequence of actions that results in the accomplishment of $\Delta G$. The term $|\Delta A|$ signifies the length of the action sequence $\Delta A$, and the cardinality of $\psi$ is denoted by $|\psi| = |\Delta A|$. The set of all existing goal-transition patterns is denoted by $\Psi$.
We now analyze how goal-transition patterns can enhance exploration efficiency. Under the assumption that all goal-transition patterns in the pre-training environment are accessible, we aim to optimize $\hat{H}_e(G)$. Our initial focus lies on discerning the associations between these goal-transition patterns and individual episodes. With the established connection between each trajectory and its corresponding goal-transition patterns in Proposition 3, we proceed to analyze the potential of enhancing exploration efficiency based on the goal-transition decomposition of each trajectory. [...] cardinality can yield equivalent exploration outcomes using an average number of steps that is less than or equal to the specified $T_e$. Proposition 4 presents a potential avenue for enhancing exploration efficiency over an exploration policy that relies on uniform primitive actions. In a practical scenario, goal-transition patterns of different cardinality coexist for the same goal transition, so employing goal-transition patterns typically requires fewer time steps than $T_e$. Furthermore, the skill learning methodologies described in Sections 4.2 and 4.3 are designed to exploit the potential of goal-transition patterns, thereby further improving the exploration efficiency.

Experiment
In this section, we evaluate the advantage of our GEAPS in terms of success rate and sampling efficiency on a set of sparse-reward and long-horizon GCRL benchmark tasks and demonstrate the effectiveness of our pre-trained skills in our GEAPS algorithm via a comparative study.

Environments
As shown in Figure 2, we select four common long-horizon sparse-reward environments in our experiments. i) PointMaze [7,20]: a 2-D maze task in which a point navigates through a 10×10 maze from the bottom left corner to the top right one. Its observation is a two-dimensional vector indicating its position in the maze. The agent may easily be trapped in dead ends and hence hardly explore new goals. ii) AntMaze [7,20]: a robotic locomotion task that controls a 3-D four-legged robot through a long U-shaped hallway to reach the desired goal position. Its thirty-dimensional observation includes the robot's status and its location. The agent can only move in a slow and jittery manner, hence it is hard to explore new goals and learn goal-reaching behavior. iii) FetchPickAndPlace (hard version) [21]: a robot arm task where a robot arm grasps a box and moves it to a target position. The agent observes the positions of both the gripper and the target box as a 30-D vector, and its goal is a 3-D vector giving the target position for the box. iv) FetchStack2 [22]: another robot arm task, which aims to stack two boxes at a target location, requiring the agent to move them to target positions in order. Its observation and goal are 40-D and 6-D, respectively. The agent receives no reward until placing both boxes in the correct positions, and involving two boxes makes it more difficult to explore desired goals. Reaching the desired goals once on PointMaze and AntMaze is considered a success, while the agent is only considered to succeed on FetchPickAndPlace and FetchStack2 if it still satisfies the conditions of the desired goals at the end of an episode.

Baselines
i) Goal GAN [5]: a typical heuristic-driven method where intermediate difficulty is used as a heuristic to select sub-goals via a generative model. ii) Skew-Fit [6]: an effective exploration-based method where sub-goals are generated by sampling from a learned skewed distribution that is approximately uniform over achieved goals. iii) OMEGA [7]: another effective exploration-based method where sub-goals are generated by sampling from the low-density region of an achieved goal distribution. While our GEAPS algorithm is applied to the above baselines to augment their goal exploration, we also compare the three augmented GCRL models to the state-of-the-art (SOTA) model-based exploration in LEXA [8], whose explorer can augment suitable baselines, e.g., GCSL [23] and DDPG [24].

Experimental Settings
Our experiments study the following questions: Q1) How much sampling efficiency is gained when our GEAPS is incorporated into a baseline, on the condition that its performance is maintained or even improved? Q2) What are the behavioral changes resulting from incorporating our GEAPS into a baseline? Q3) Can our augmented models match the performance of the model-based SOTA exploration in LEXA [8]? Q4) What are the pre-trained skills resulting from our skill learning objective, in contrast to those generated by established skill learning methods such as SNN4HRL [10] and EDL [16]?
For each baseline, we apply our GEAPS in goal exploration while keeping its original settings unchanged. Thus, we obtain three augmented models: Goal GAN+GEAPS, Skew-Fit+GEAPS and OMEGA+GEAPS, corresponding to the three baselines. Five trials with different random seeds are conducted in each environment for reliability. Evaluation is made with a fixed budget; i.e., the training is terminated after a pre-set number of steps if an agent still fails to reach the ultimate goals. The performance is evaluated in terms of the success rate, the most important performance evaluation criterion in reinforcement learning, and the entropy of achieved goals, a widely used evaluation criterion for sampling efficiency in GCRL.
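For intuition about the second criterion, the entropy of achieved goals can be estimated from the goals stored in a replay buffer. The sketch below bins 2-D goals into a regular grid and computes the entropy of the resulting histogram; the binning scheme, the function name `achieved_goal_entropy` and its parameters are assumptions made for illustration, and the evaluation in the experiments may rely on a different density estimator.

```python
import math
from collections import Counter

def achieved_goal_entropy(goals, low, high, bins):
    """Entropy (nats) of 2-D achieved goals after binning into a bins x bins grid."""
    width = (high - low) / bins

    def cell(x):
        # Clamp so that points on or beyond the boundary fall into the edge cells.
        return min(bins - 1, max(0, int((x - low) / width)))

    counts = Counter((cell(x), cell(y)) for x, y in goals)
    n = len(goals)
    return -sum(c / n * math.log(c / n) for c in counts.values())

# An agent stuck in one corner versus one that has reached all four quadrants.
stuck = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 1.3)]
spread = [(1.0, 1.0), (1.0, 6.0), (6.0, 1.0), (6.0, 6.0)]
h_stuck = achieved_goal_entropy(stuck, low=0.0, high=10.0, bins=2)
h_spread = achieved_goal_entropy(spread, low=0.0, high=10.0, bins=2)
```

A higher value indicates that the achieved goals cover the goal space more evenly, which is why the quantity serves as a proxy for exploration progress.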

Pre-training Settings
i) PointMaze: We generate 20 small 5 × 5 mazes as pre-training environments, which include various topographies. In each episode, the agent is initialized at a random position in the central grid. In Figure 3(a), we exemplify four typical pre-training environments. The observation for the skill policy is the position relative to the grid in which the agent is located. We pre-train skills of horizon two over those mazes by maximizing the average cumulative return defined by the learning objective. ii) AntMaze: We pre-train the skills on the Ant environment shown in Figure 3(b), which places the same 3-D four-legged robot in an open environment, and set the skill horizon to 100. iii) FetchPickAndPlace and FetchStack2: The goals of the two robot arm tasks are defined as the target positions of the relevant objects. Achieving a sub-goal during goal pursuit, typically inferred when the arm continues to hold the object, provides the basis for subsequent skill development.
Consequently, each skill is deliberately handcrafted to direct the object along a random trajectory within a predetermined range, ensuring that, collectively, the skills span all possible directions. This strategy guarantees an equal likelihood of encountering all potential goals within the boundary established by the exploration horizon $T_e$.
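The handcrafted skills for the robot arm tasks can be pictured as straight-line object displacements along uniformly sampled directions. The following is only a sketch under that reading: the names `sample_direction` and `handcrafted_skill`, the 3-D goal space, and the `max_range` value are assumptions made for illustration and are not taken from the released code.

```python
import math
import random

def sample_direction(rng):
    """Draw a unit vector uniformly on the sphere via normalized Gaussians."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            return [x / norm for x in v]

def handcrafted_skill(start, horizon, max_range, rng):
    """Return object waypoints for one skill: a straight move along a random direction."""
    direction = sample_direction(rng)
    # Hypothetical per-step displacement; the total move stays within max_range.
    step = rng.uniform(0.0, max_range) / horizon
    return [[s + direction[i] * step * t for i, s in enumerate(start)]
            for t in range(horizon + 1)]

rng = random.Random(0)
waypoints = handcrafted_skill([0.0, 0.0, 0.0], horizon=8, max_range=0.2, rng=rng)
print(waypoints[-1])
```

Because the direction is drawn uniformly, repeated skill executions cover all directions with equal likelihood, which matches the intent stated above.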

Implementation
Our GEAPS is implemented with the mrl modular RL codebase [25], and all the baselines adopt DDPG [24] to train goal-conditioned behavior. For the three baselines [5][6][7], we use the source code provided by the authors and strictly adhere to their instructions in our experiments. LEXA is composed of a model-based exploration policy and a model-based goal-conditioned policy. As our focus is goal exploration, we use its exploration policy only and adopt model-free goal-conditioned policies optimized via GCSL [23] and DDPG [24] for a fair comparison. To obtain the pre-trained skills for PointMaze and AntMaze, we use a multi-layer perceptron trained with TRPO [26] for stability in skill learning, while the skills for the two robot arm tasks are handcrafted as moving along a direction sampled uniformly in various ranges. To pre-train the skills, we fix β = 0.1 in Eq. 7 for all our experiments. For goal exploration, we set the skill horizon $T_s$ to 2, 25, 8 and 5 for PointMaze, AntMaze, FetchPickAndPlace and FetchStack2, respectively.
Appendixes B and C describe more technical and implementation details regarding the baselines and the skill learning used in our comparative study.

Experimental Results
We report the main experimental results to provide the answers to four questions posed in Section 5.2.1.

Results on Goal Exploration Augmentation
To answer the first question, we report the results yielded by the baselines and their corresponding augmented models to gauge the gain made by our GEAPS. In our experiments, we terminate the training at one million steps for PointMaze, two million steps for FetchPickAndPlace, and three million steps for AntMaze and FetchStack2. An episode consists of 50 steps for PointMaze, FetchPickAndPlace and FetchStack2, and 500 steps for AntMaze. We report statistics (mean and standard deviation) over five seeds in each environment.
As shown in Figure 4, our GEAPS improves the three baselines to different degrees across the four environments. On PointMaze, Goal GAN is unable to solve the environment and Skew-Fit only manages to reach 20% success at maximum. OMEGA achieves 100% success in about 0.2 million steps. In contrast, our GEAPS enables both Goal GAN and Skew-Fit to solve PointMaze, achieving 100% success in about 0.3 and 0.7 million steps, respectively. OMEGA+GEAPS is approximately twice as fast as OMEGA in achieving 100% success. On AntMaze, we observe similar results: both Goal GAN and Skew-Fit fail within three million steps, while our GEAPS enables Skew-Fit to solve the environment with a 60% success rate and even boosts Goal GAN to results comparable to those of OMEGA+GEAPS. OMEGA+GEAPS is about three times faster than OMEGA in reaching over 90% success. On FetchPickAndPlace, our GEAPS boosts the success rates of Goal GAN and Skew-Fit by around 10% and 20%, respectively, for the same number of time steps. Although OMEGA+GEAPS reaches 100% success at almost the same time as the baseline, it is 40% faster than the baseline in reaching 80% success. On FetchStack2, Goal GAN and Skew-Fit, along with their augmented versions, hardly solve the problem, with at most 7% success observed for Skew-Fit+GEAPS, while OMEGA+GEAPS yields results comparable to OMEGA, which could be explained by the entropy of the achieved goal distribution.
As the entropy of the achieved goal distribution reflects the coverage of achieved goals in an environment, we show the empirical entropy of the achieved goals for the baselines and the augmented models in Figure 5. On PointMaze and AntMaze, the entropy increases faster with the help of our GEAPS, and the improvements are especially dramatic for Goal GAN and Skew-Fit. On FetchPickAndPlace, GEAPS boosts the entropy of all three baselines by a small amount at the beginning. On FetchStack2, GEAPS improves the entropy of OMEGA while marginally deteriorating the entropy of Goal GAN and Skew-Fit. The unimproved entropy on FetchStack2 can be attributed to the fact that GEAPS controls the goal-transition patterns for one object while leaving the other almost unchanged, which leads to a slightly lower entropy of achieved goals.
We notice that in OMEGA [7], several classical or SOTA baselines were tested on the same environments used in our work. OMEGA beats those baselines by a huge margin; e.g., OMEGA is around 100 and 10 times faster than the best performer, PPO+SR [20], in solving PointMaze and AntMaze, respectively. Thus, the performance of OMEGA+GEAPS allows us to claim a bigger gain over those baselines used for comparison in OMEGA [7].

Visualization of Exploration Behavior
To answer the second question, we visualize the final achieved goals and trajectories of goal selection at the end of the episodes. The visualization vividly exhibits the behavioral changes resulting from our goal exploration augmentation.
As shown in the top row of Figure 6, none of the three baselines can reach the entire area of PointMaze at the end of 1,600 episodes. Most goals reached by Goal GAN are located in the left half of the maze near the starting location. Most goals reached by Skew-Fit are located in a smaller area in the left half of the maze, and goals mainly get stuck in two small areas close to or at a moderate distance from the starting location. OMEGA performs much better than the other baselines as it covers the entire maze except those goals from the desired goal distribution in the top right corner. In contrast, it is evident from the bottom row of Figure 6 that our GEAPS makes all baselines cover a larger area and alleviates the so-called "rich get richer" problem by sampling goals uniformly towards covering the entire maze. In particular, our GEAPS helps the baselines reach goals that spread outward from easy to hard goals as training advances and enables OMEGA to quickly transition to goals in the desired goal area.
It is observed from Figure 7 that, by incorporating our GEAPS into the three baselines, their behavioral changes on AntMaze are similar to those on PointMaze. At the end of 1,000 episodes, no baseline can reach goals beyond the bottom of the hallway, and goals reached by Goal GAN and Skew-Fit are even trapped in the small areas close to the starting location. In contrast, the augmented models take advantage of our GEAPS and hence are able to reach goals in much larger areas. It is evident from the bottom row of Figure 7 that OMEGA+GEAPS has already explored the goals within the desired goal area and Goal GAN+GEAPS has reached quite close to this area.
As OMEGA is the best performer among the three baselines and also uses a Go-Explore [27] style strategy for exploration, we further visualize the goal pursuit and exploration trajectories made by OMEGA and OMEGA+GEAPS at different training steps. As described in Section 3, reaching a behavioral goal in goal pursuit triggers goal exploration. A trajectory can intuitively exhibit the goal transitions at different training steps, allowing us to better understand the behavioral change resulting from our GEAPS. As shown in Figures 8 and 9, OMEGA generally explores goals close to the reached behavioral goals "conservatively" with a heuristic [7] in goal exploration, while OMEGA+GEAPS explores goals in a larger area around the reached behavioral goals "aggressively" by means of the frequently occurring goal-transition patterns encoded in the pre-trained skills, which vividly demonstrates the advantage of our proposed method in improving the exploration effectiveness during learning.

Comparison to LEXA Explorer
To answer the third question, we compare the state-of-the-art LEXA explorer-based goal exploration augmentation to our augmented models. As shown in Figure 10, within the same training budget, we only observe up to a 7% success rate on PointMaze with LEXA+DDPG and no success achieved by the LEXA explorer-based models in the other experiments. As shown in Figure 11, the entropy of the LEXA explorer-based models is far below that of our augmented models. The reasons may be two-fold: a) The LEXA explorer performs exploration via the disagreement of an ensemble of one-step world models, and the disagreement is based on the novelty of states. Except for PointMaze, the state space is not equivalent to the goal space, and exploring more states does not necessarily contribute to exploring more goals. b) LEXA performs the goal-pursuit behavior only on those goals uniformly sampled from the replay buffer, which prevents it from enhancing the experience around the explored goals. From Figure 12, we observe that the LEXA explorer is able to explore those areas near the desired goal distribution on PointMaze but fails to keep exploring those areas later. In contrast, our augmented models use different strategies to select novel sub-goals to pursue, which enhances the experience around those novel goals. Our augmented models prioritize pursuing those novel goals to broaden the relevant experience in the replay buffer.
In addition, training a world model in LEXA is time-consuming and requires abundant data, while our GEAPS only needs the skills pre-trained with our alternative learning objective. Those skills acquired by pre-training are applicable to any relevant downstream tasks. In summary, the above results demonstrate that our model-free GEAPS is highly competitive with the SOTA model-based explorer in LEXA, especially for tasks whose state space is not equivalent to their goal space.

Results on Skill Learning
To answer the fourth question, we first visualize the pre-trained skills resulting from our learning objective presented in Section 4.3. Figure 13 illustrates the trajectories of pre-trained skills for PointMaze and AntMaze. In Figure 13(a), we plot 50 trajectories for each skill pre-trained for PointMaze in an empty 5 × 5 maze. It is evident that the learned skills guide the agent to navigate along different diagonal directions so that the agent can transit to another grid quickly. The skills are intuitive and their effectiveness in boosting the ...
For comparison, we further illustrate the trajectories of pre-trained skills by SNN4HRL [10] and EDL [16] on the same pre-training environment for AntMaze in Figure 14. In contrast to the skills acquired by our method in Figure 13(b), it is evident from Figure 14 that both SNN4HRL and EDL cover much smaller areas and leave numerous directions uncovered. In our experiment, we observe that the skills acquired by SNN4HRL appear unstable and depend highly on the random seeds.
To investigate the impact of pre-trained skills in our GEAPS, we employ the skills acquired by the different skill learning methods in OMEGA+GEAPS for performance evaluation on AntMaze. In OMEGA [7], the factor α calculated via Eq. B6 in Appendix B is inversely proportional to the KL divergence between the distribution of achieved goals $p_{ag}$ and that of desired goals $p_{dg}$, and its value is capped at one. It serves as a dynamic weight to balance both distributions in a mixture distribution for sub-goal sampling and is recalculated at the end of each episode. When α reaches one, the agent only samples sub-goals from the desired goal distribution, which marks the end of exploration of sub-goals other than desired goals. It is evident from Figure 15(a) that the skills pre-trained by our method allow for reaching α = 1 at around 0.7 million steps, while the SNN4HRL skills take around one million steps and the EDL skills never lead to α = 1. It is further observed from Figure 15(b) that our skill learning objective results in earlier success on AntMaze; i.e., the agent with our skills starts to explore the desired goal distribution within 0.2 million steps, while the agents with the SNN4HRL and the EDL skills take around 0.5 million and 0.3 million steps, respectively. In the later training stage, the agent with our skills maintains high success rates regardless of the random seed. In contrast, the performance of the agents with the SNN4HRL and EDL skills degrades substantially. Moreover, we observe that the agent with the EDL skills always fails to solve the AntMaze task. In summary, the above results suggest that our skill learning objective yields the quality skills required by our GEAPS.
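The role of α as a mixture weight can be sketched as follows. Eq. B6 is not reproduced in this section, so the functional form below, α = 1 / max(b + KL, 1), is an assumption chosen to match the description (inversely related to the KL divergence, capped at one, with the bias b set to -3.0 in Appendix B); the names `omega_alpha` and `sample_subgoal` are hypothetical.

```python
import random

def omega_alpha(kl, b=-3.0):
    """Assumed form of the weight: inversely related to the KL term and capped at one."""
    return 1.0 / max(b + kl, 1.0)

def sample_subgoal(alpha, sample_desired, sample_achieved, rng):
    """Draw from the desired-goal distribution with probability alpha, else from achieved goals."""
    return sample_desired() if rng.random() < alpha else sample_achieved()

rng = random.Random(0)
for kl in (10.0, 6.0, 2.0):
    alpha = omega_alpha(kl)
    goal = sample_subgoal(alpha, lambda: "desired", lambda: "achieved", rng)
    print(kl, alpha, goal)
```

Under this assumed form, α stays below one while the achieved goals are far from the desired goals and reaches one once the KL term falls to 1 - b or below, after which only desired goals are sampled.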

Discussion
In this section, we discuss the limitations/issues arising from our work and make a connection between our method and other related works.
While the advantages of our approach have been demonstrated, several limitations and open problems remain. First, our approach relies on the pre-trained skills obtained by skill learning in environments similar to a target task. Our approach will not work if such environments are unavailable. It is also worth stating that the skill learning incurs an additional computational overhead but is rewarded with greater exploration efficiency in GCRL when accomplishing a sparse-reward long-horizon task. Next, our theoretical analysis establishes the theoretical justification for the benefits of utilizing pre-trained skills and the effectiveness achieved through our exploration strategy under specific conditions. However, further theoretical analyses concerning broader conditions are still pending. Then, the environments used for evaluation have pre-defined yet well-behaved goal spaces, and goals have to be in a vectorial form. It is unclear whether our approach works in the same manner for other scenarios, e.g., where an agent has to specify and model/learn its own goal space [6], or where goals are in other forms [9], e.g., image and language goals. After that, all the baselines used in our experiments are sub-goal selection based GCRL algorithms [9]. Without considerable extra effort, our GEAPS method cannot be applied to other types of GCRL algorithms, such as the optimization-based and the relabelling GCRL algorithms [9], for goal exploration augmentation. Finally, our approach is memoryless and thus treats both achieved and new goals to be explored equally during data rollout. Equipped with a memory mechanism, our approach would prevent any visited states from being revisited to further improve the sampling efficiency. With memory and proper pre-trained skills, an agent may accomplish new tasks via searching without any further learning.
It is well known that skills and options have been used in hierarchical reinforcement learning (HRL) for exploration and task simplification [28]. However, in the context of GCRL, the direct applicability of pre-trained skills for goal attainment and maintenance is quite limited. This is due to the potential for overshooting goals or stochastic reaching, as well as the narrow focus of skills on specific goals [29]. In contrast, our GEAPS method combines the benefits of pre-trained skills with the precision of primitive actions, aiming to enhance goal exploration and achieve goals effectively. Below, we summarize several key distinctions between our GEAPS method and existing works that utilize skills/options for exploration in HRL. First, our GEAPS expands the utilization of entropy maximization as a new learning objective in GCRL. By optimizing the entropy of both achieved and prospective goals, our GEAPS enhances the efficiency of goal exploration. We specifically emphasize goal exploration and incorporate goal-transition patterns into the learning process, enabling more effective exploration even in the absence of precise dynamic knowledge. To the best of our knowledge, these distinctive features cannot be found in existing works on HRL in the context of GCRL. Next, in HRL, a higher-level agent selects from the options, treating them as indivisible or atomic actions. Despite exploring goals while executing a skill, HRL often necessitates revisiting goals using more granular options. In contrast, the skills trained in our GEAPS maximize their exploration capabilities based on goal-transition patterns specific to GCRL, allowing for interactions with a broader array of goals during execution. Our method enhances the efficiency of goal exploration and distinguishes our work from conventional HRL practices that prioritize re-engaging with the same set of goals. Even if the skills are pre-trained as sub-policies for specific sub-tasks in HRL [29], each skill tends to primarily focus on a single goal associated with one of the sub-tasks. During execution, this narrow focus can severely limit the skill's ability to interact with the much wider range of goals that arise in GCRL. Then, distinct from the HRL approach, which typically presumes task decomposition through options, our method does not mandate the completion of tasks strictly through pre-trained skills. Rather, within the context of our GEAPS, these skills are intentionally trained to enhance their efficacy in goal exploration, drawing upon goal-transition patterns particular to GCRL. Pre-trained skills, developed with a focus on these specific goal-transition patterns, empower our GEAPS to foster efficient exploration. Our method aligns closely with the innate exploratory behaviors observed in humans and animals, thus encouraging more intuitive interactions with the environment. Finally, we acknowledge theoretical analyses of the exploration benefits of skills and options in HRL, such as the UCRL-SMDP framework [30], which provides rigorous regret bounds for MDPs with options. However, the direct transfer of UCRL-SMDP to GCRL poses challenges due to disparities in reward mechanisms and the lack of historical data for novel goals. In contrast, our GEAPS addresses these challenges by efficiently navigating exploration in the absence of precise dynamic knowledge. While UCRL-SMDP may not directly aid in exploring unknown areas, a major challenge encountered in our work, it holds promise for enhancing policy optimization to efficiently reach already explored goals in the goal pursuit stage within the generic GCRL framework.

Conclusion
In this paper, we have proposed a novel learning objective that optimizes the entropy of both achieved and new goals in sub-goal selection based goal-conditioned reinforcement learning (GCRL). By optimizing this objective, we enhance the efficiency of goal exploration in complex environments, ultimately improving the performance of GCRL algorithms.
Our method incorporates skill learning, where frequently occurring goal-transition patterns are mined and composed into skills. These pre-trained skills are then utilized in goal exploration, allowing the agent to efficiently discover novel sub-goals. Through extensive evaluation on various sparse-reward long-horizon benchmark tasks and a theoretical analysis, we have demonstrated that integrating our method into state-of-the-art GCRL baselines significantly ...

Proof We can cluster $\Omega^*$ into $|Z|$ clusters, and each cluster is represented by a latent vector $\mathbf{z} \sim Z$. Then, we have the corresponding distributions related to $\mathbf{z}$.
In the above expressions, $\mathbb{1}(\tau \in \mathbf{z})$ denotes the indicator function, which equals 1 if $\tau$ belongs to the cluster represented by $\mathbf{z}$ and 0 otherwise. In this setting, we can transform $\hat{H}^*_e(G)$ with Eqs. A1 and A2 into $\hat{H}^*_e(G) = I(Z; G) + H(G|Z)$. Although the mutual information term, $I(Z; G)$, may decrease, the conditional entropy term, $H(G|Z)$, increases, maintaining the sum unchanged. To generate the optimal trajectories within each cluster, we can train a skill to produce those trajectories. The total number of such skills is $|Z|$, and the condition $|Z| \ll |\Omega^*|$ can be fulfilled with appropriate clustering. During exploration, each skill corresponding to $\mathbf{z} \sim Z$ is sampled with probability $p(\mathbf{z})$. In the execution of each skill, the trajectory $\tau$ is generated with probability $p(\tau|\mathbf{z})$. □

Proposition 3 Given the horizon $T$, every trajectory $\tau$ can be decomposed into a sequence of goal-transition patterns.
Proof Our proof begins by deconstructing the trajectory $\tau$ into two distinct sequences: the state sequence $S_\tau = (s_i)_{i=0}^{T}$ and the action sequence $A_\tau = (a_i)_{i=0}^{T-1}$. Upon acquiring $S_\tau$, we derive the corresponding goal sequence $G_\tau = (\phi(s_i))_{i=0}^{T}$. This goal sequence is subsequently partitioned into its maximal homogeneous segments, each embodying repetitions of a single unique goal. The number of such segments is denoted as $N_g(\tau)$. For each of these segments, we annotate the specific goal and the time step of its first occurrence, denoted as $((g_i, t_i))_{i=0}^{N_g(\tau)-1}$. Following this, we append the tuple $(\phi(s_T), T)$ to the sequence, resulting in $((g_i, t_i))_{i=0}^{N_g(\tau)}$. Consequently, the trajectory can be decomposed into a sequence of goal-transition patterns symbolized as $\{\psi_i\}_{i=0}^{N_g(\tau)-1}$, where each pattern $\psi_i$ is defined as ...

Given an exploration horizon of $T_e$, the substitution of goal-transition patterns within each trajectory $\tau \in \Omega$ with alternative patterns of smaller cardinality can yield equivalent exploration outcomes using an average number of steps that is less than or equal to the specified $T_e$.
Proof For any trajectory $\tau \in \Omega$, it can be decomposed into a sequence of goal-transition patterns $\{\psi_i\}_{i=0}^{N_g(\tau)-1}$ as outlined in Proposition 3. There exists an alternative goal-transition pattern to $\psi_i$ for the transition $\Delta(g_{t_i}, g_{t_{i+1}})$ as follows: ...

... trained to explore curious states via a world model. The explorer is trained with unsupervised rewards based on the disagreement of an ensemble of one-step transition models that predict the next world model state from the current model state. The ensemble of the one-step models and the resulting exploration reward are specified in Eq. B8. For the achiever, we use the rewards from the environment to replace the unsupervised rewards used in [8] for a fair comparison in exploration. The achiever in our experiments is trained via the standard GCSL [23] in the open-source code provided by the authors, where DDPG is used in the baselines.

B.2.1 DDPG
All baselines are implemented on the basis of DDPG [24].The details of relevant hyperparameters used in DDPG are listed in Table B1.The training frequency varies over different tasks as reported in Table B2.

B.2.2 Relabelling Techniques
During training, we adopt the same relabelling strategy, rfaab, used in [7]: mixing the different relabelling techniques real, future, actual, achieved and behavioral.

B.2.3 Goal GAN
The neural network used as the discriminator has the same architecture as that of the critic in DDPG except that the sigmoid activation is used in the output layer. The discriminator is trained every 250 steps with a batch of 100 trajectories sampled from the 200 most recent ones. $R_{min}$ and $R_{max}$ are set to 0.25 and 0.75, respectively.
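To illustrate how the two thresholds are typically used, the sketch below labels goals as being of intermediate difficulty when their empirical success rate falls within [$R_{min}$, $R_{max}$]; such labels would then serve as training targets for the discriminator. This follows our reading of the Goal GAN formulation [5]; the function name `label_goals` and the example success rates are hypothetical.

```python
def label_goals(success_rates, r_min=0.25, r_max=0.75):
    """Return 1 for goals of intermediate difficulty and 0 otherwise."""
    return [1 if r_min <= r <= r_max else 0 for r in success_rates]

# Hypothetical per-goal success rates estimated from recent rollouts.
labels = label_goals([0.0, 0.3, 0.5, 0.9])
print(labels)
```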

B.2.4 Skew-Fit
Following the same settings in Skew-Fit [6], we employ the β-VAE as the generative model. Both the encoder and decoder of the β-VAE have two hidden layers with [400, 300] ReLU units. Its latent dimension size is set to be the same as the size of the goal in the environment. In the β-VAE, we set β = 10 and α 1 = 2.5 in Eqs. B4 and B5. We set the batch size to 64 for training the β-VAE and adopt the same training setting as in Skew-Fit [6]: training every 4,000 steps for 1,000 batches in the first 40,000 steps and every 4,000 steps for 200 batches afterwards.

B.2.5 OMEGA
We adopt the same settings used in OMEGA [7] as follows. We set b in Eq. B6 to -3.0. To approximate the probability $p_{ag}(\hat{g})$ for a given $\hat{g}$ in Eq. B7, we use the kernel density estimator (KDE) [33] with a bandwidth of 0.1 and a Gaussian kernel as our density model. We fit the KDE model to 10,000 normalized achieved goals sampled from the replay buffer at every optimization step.
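The density model described above can be sketched with a small, dependency-free Gaussian KDE. This is an illustrative sketch rather than the implementation used in the experiments: the function names are hypothetical, the toy data stand in for the normalized achieved goals, and choosing the lowest-density candidate is a simplification of the sub-goal selection in OMEGA [7].

```python
import math

def gaussian_kde_log_density(query, samples, bandwidth=0.1):
    """Log-density of `query` under an isotropic Gaussian KDE fitted to `samples`."""
    d = len(query)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    log_terms = []
    for s in samples:
        sq_dist = sum((q - x) ** 2 for q, x in zip(query, s))
        log_terms.append(-0.5 * sq_dist / bandwidth ** 2)
    m = max(log_terms)
    # Log-sum-exp over kernels for numerical stability.
    log_sum = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_norm + log_sum - math.log(len(samples))

def lowest_density_goal(candidates, samples, bandwidth=0.1):
    """Pick the candidate goal with the lowest estimated density."""
    return min(candidates, key=lambda g: gaussian_kde_log_density(g, samples, bandwidth))

# Toy achieved goals: three near the origin, one far away.
achieved = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (0.5, 0.5)]
candidates = [(0.0, 0.0), (0.5, 0.5)]
print(lowest_density_goal(candidates, achieved))
```

The rarely visited goal receives the lower density and is therefore preferred, which is the mechanism by which low-density sampling pushes the agent toward the frontier of achieved goals.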

B.2.6 LEXA
We adopt RSSM [34] as the world model. There are three hidden layers with [128, 128, 64] ReLU units in both the encoder and the decoder. The hidden layer size for the recurrent model is set to 128. The sizes of the deterministic state and stochastic state are 128 and 32, respectively. We use 10 one-step world models (i.e., M = 10) to construct an ensemble world model that calculates the exploration rewards specified in Eq. B8. Each component world model consists of four hidden layers, where each hidden layer has 400 ELU units [35]. In the GCSL implementation, we use the same actor architecture and the same learning rate as used in DDPG, as shown in Table B1, and only the future relabelling technique is used during training.
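The ensemble-disagreement reward referenced in Eq. B8 can be sketched as the per-dimension variance of the ensemble predictions averaged over the state dimensions. In the minimal example below, the function name `disagreement_reward` and the toy predictions are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption.

```python
from statistics import pvariance

def disagreement_reward(predictions):
    """predictions[m][d]: next model state predicted by ensemble member m, dimension d."""
    dims = len(predictions[0])
    # Variance across ensemble members for each dimension, then the mean over dimensions.
    per_dim_var = [pvariance([pred[d] for pred in predictions]) for d in range(dims)]
    return sum(per_dim_var) / dims

agree = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
disagree = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]
print(disagreement_reward(agree))
print(disagreement_reward(disagree))
```

States on which the ensemble members disagree yield larger rewards, so the explorer is drawn to states whose dynamics are not yet well modelled.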

Proposition 2
The optimal exploration policy leading to Ω * with the distribution p Ω * can be composed via a set of skills Z (|Z| << |Ω * |).
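The proof of this proposition relies on the identity H(G) = I(Z; G) + H(G|Z), which holds for any joint distribution over skills and goals. The numerical check below uses a hypothetical joint distribution over two skills and three goals purely for illustration; none of the numbers come from the experiments.

```python
import math

def entropy(p):
    """Shannon entropy in nats, skipping zero-probability entries."""
    return -sum(x * math.log(x) for x in p if x > 0)

# joint[z][g]: hypothetical joint distribution over 2 skills and 3 goals.
joint = [[0.30, 0.15, 0.05],
         [0.05, 0.15, 0.30]]
p_z = [sum(row) for row in joint]
p_g = [sum(col) for col in zip(*joint)]

h_g = entropy(p_g)
h_g_given_z = sum(pz * entropy([x / pz for x in row]) for pz, row in zip(p_z, joint))
mi = sum(x * math.log(x / (p_z[i] * p_g[j]))
         for i, row in enumerate(joint) for j, x in enumerate(row) if x > 0)

print(h_g, mi + h_g_given_z)
```

The two printed values agree up to floating-point error, illustrating why a drop in the mutual information term can be offset by a rise in the conditional entropy term.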

Proposition 3
Given the horizon T , every trajectory τ can be decomposed into a sequence of goal-transition patterns.
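The segmentation step used in the proof of this proposition can be sketched as a run-length grouping of the goal sequence. Since the exact definition of each pattern is not reproduced here, pairing consecutive (goal, first time step) tuples is an assumption of this sketch, and the function name `decompose_goal_sequence` is hypothetical.

```python
def decompose_goal_sequence(goals):
    """Split a goal sequence into maximal runs and pair consecutive runs as transitions."""
    segments = []  # (goal, first time step) for each maximal homogeneous segment
    for t, g in enumerate(goals):
        if not segments or segments[-1][0] != g:
            segments.append((g, t))
    # Append the terminal tuple (goal at the final step, final step index).
    marked = segments + [(goals[-1], len(goals) - 1)]
    # One transition per segment: (goal, start) -> (next goal, next start).
    return segments, list(zip(marked[:-1], marked[1:]))

goal_seq = ["A", "A", "B", "B", "B", "C"]
segments, transitions = decompose_goal_sequence(goal_seq)
print(segments)
print(transitions)
```

In this toy sequence there are three segments and therefore three transitions, matching the count of patterns stated in the proof.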

Fig. 4 Fig. 5
Fig. 4 Test success on the desired goal distribution throughout training on four environments for the baselines and the augmented models (x-axis: environment steps, in millions).

Fig. 6
Fig. 6 Visualization of the final achieved goals in PointMaze: the baselines (top) vs. the augmented models (bottom), where the training evolution process is indicated with the heatmap.

Fig. 7
Fig. 7 Visualization of the final achieved goals in AntMaze: the baselines (top) vs. the augmented models (bottom), where the training evolution process is indicated with the heatmap.

Fig. 10 Fig. 11
Fig. 10 Test success on the desired goal distribution throughout training on four environments for our augmented models and LEXA explorer-based models.

Fig. 12
Fig. 12 Visualization of the final achieved goals on PointMaze: LEXA+GCSL and LEXA+DDPG, where the training evolution process is indicated by the heatmap.

Fig. 13 Fig. 14 Fig. 15
Fig. 13 Trajectories of pre-trained skills acquired by our skill learning method. (a) PointMaze in an empty maze. (b) Ant in an empty maze.
Ensemble: $f(s_t, \theta_m) = \hat{z}^m_{t+1}$, $m = 1, \dots, M$, where $\hat{z}^m_{t+1}$ indicates the next model state predicted by model $m$ in the ensemble of $M$ models. Assuming that the model state has $D$ dimensions in total, the reward of state $s_t$ is the variance of the states predicted by the ensemble, averaged across all dimensions:

$$r_e(s_t) = \frac{1}{D} \sum_{d=1}^{D} \mathrm{Var}_m\left[f(s_t, \theta_m)\right]_d. \quad \text{(B8)}$$

Fig. 8 Visualization of the goal pursuit and goal exploration trajectories made by OMEGA (top) and OMEGA+GEAPS (bottom) at different training steps for PointMaze.
Fig. 9 Visualization of the goal pursuit and goal exploration trajectories made by OMEGA (top) and OMEGA+GEAPS (bottom) at different training steps for AntMaze.

Table B2
Hyperparameters for Different Tasks. Real stands for no relabelling. Future, actual, achieved and behavioral indicate relabelling with goals from future achieved goals in the same trajectory, all historically desired goals, all historically achieved goals and all historically behavioral goals, respectively. Their relative ratios are used to specify the specific technique. For example, rfaab 1 4 3 1 1 denotes no relabelling on 10% of the data and relabelling 40% with future, 30% with actual, 10% with achieved and 10% with behavioral goals. The relabelling strategies vary across environments (see Table B2 for details).
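As a small worked example of the ratio notation, the sketch below converts a specification into per-technique fractions. It assumes that the five numbers follow the order of the letters in rfaab (real, future, actual, achieved, behavioral) and that the specification is written with underscores; both the string format and the function name `parse_relabel_ratios` are hypothetical.

```python
def parse_relabel_ratios(spec):
    """Turn a string such as 'rfaab_1_4_3_1_1' into per-technique fractions."""
    name, *weights = spec.split("_")
    # Assumed order of the techniques, following the letters in "rfaab".
    labels = ["real", "future", "actual", "achieved", "behavioral"]
    if name != "rfaab" or len(weights) != len(labels):
        raise ValueError(f"unexpected relabelling spec: {spec}")
    values = [float(w) for w in weights]
    total = sum(values)
    return {label: v / total for label, v in zip(labels, values)}

ratios = parse_relabel_ratios("rfaab_1_4_3_1_1")
print(ratios)
```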