Goal-conditioned offline reinforcement learning through state space partitioning

Wang, Mianchu; Jin, Yue; Montana, Giovanni

doi:10.1007/s10994-023-06500-z

Goal-conditioned offline reinforcement learning through state space partitioning

Open access
Published: 05 February 2024

Volume 113, pages 2435–2465, (2024)
Cite this article

Download PDF

You have full access to this open access article

Machine Learning Aims and scope Submit manuscript

Goal-conditioned offline reinforcement learning through state space partitioning

Download PDF

1273 Accesses
1 Altmetric
Explore all metrics

Abstract

Offline reinforcement learning (RL) aims to create policies for sequential decision-making using exclusively offline datasets. This presents a significant challenge, especially when attempting to accomplish multiple distinct goals or outcomes within a given scenario while receiving sparse rewards. Prior methods using advantage weighting for offline goal-conditioned learning improve policies monotonically. However, they still face challenges from distribution shift and multi-modality that arise due to conflicting ways to reach a goal. This issue is especially challenging in long-horizon tasks, where the presence of multiple, often conflicting, solutions makes it hard to identify a single optimal policy for transitioning from a state to a desired goal. To address these challenges, we introduce a complementary advantage-based weighting scheme that incorporates an additional source of inductive bias. Given a value-based partitioning of the state space, the contribution of actions expected to lead to target regions that are easier to reach, compared to the final goal, is further increased. Our proposed approach, Dual-Advantage Weighted Offline Goal-conditioned RL, outperforms several competing offline algorithms in widely used benchmarks. Furthermore, we provide a theoretical guarantee that the learned policy will not be inferior to the underlying behavior policy.

An offline-to-online reinforcement learning approach based on multi-action evaluation with policy extension

Article 11 September 2024

Goal exploration augmentation via pre-trained skills for sparse-reward long-horizon goal-conditioned reinforcement learning

Article Open access 05 February 2024

Model-Based Multi-objective Reinforcement Learning with Unknown Weights

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Goal-conditioned reinforcement learning (GCRL) aims to learn policies capable of reaching a wide range of distinct goals, effectively creating a vast repertoire of skills (Liu et al., 2022; Plappert et al., 2018; Andrychowicz et al., 2017). When extensive historical training datasets are available, it becomes possible to infer decision policies that surpass the unknown behavior policy (i.e., the policy that generated the data) in an offline manner, without necessitating further interactions with the environment (Eysenbach et al., 2022; Mezghani et al., 2022; Chebotar et al., 2021). A primary challenge in GCRL lies in the reward signal’s sparsity: an agent only receives a reward when it achieves the goal, providing a weak learning signal. This becomes especially challenging in long-horizon problems where reaching the goals by chance alone is difficult.

In an offline setting, the challenge of learning with sparse rewards becomes even more complex due to the inability to explore beyond the already observed states and actions. When the historical data comprises expert demonstrations, imitation learning presents a straightforward approach to offline GCRL (Ghosh et al., 2021; Emmons et al., 2022): in goal-conditioned supervised learning (GCSL), offline trajectories are iteratively relabeled, and a policy learns to imitate them directly. Furthermore, GCSL’s objective lower bounds a function of the original GCRL objective (Yang et al., 2022). However, in practice, the available demonstrations can often contain sub-optimal examples, leading to inferior policies. A simple yet effective solution involves re-weighting the actions during policy training within a likelihood maximization framework. A parameterized advantage function is employed to estimate the expected quality of an action conditioned on a target goal, so that higher-quality actions receive higher weights (Yang et al., 2022). This method is known as goal-conditioned exponential advantage weighting (GEAW).

Although GEAW is effective, we contend in this paper that it grapples with the pervasive multi-modality issue, especially in tasks with extended horizons. The challenge lies in pinpointing an optimal policy to achieve any set goal, given the multiple, sometimes conflicting, paths leading to that goal. While a goal-conditioned advantage function emphasizes actions likely to achieve the goal during training, we believe that introducing an extra layer of inductive bias can offer a shorter learning horizon, a robust learning signal, and more achievable objectives. This, in turn, aids the policy in discerning and adopting the best short-term trajectories amidst conflicting ones.

We propose a complementary advantage weighting scheme that also utilizes the goal-conditioned value function. This provides additional guidance to address multi-modality. During training, the state space is divided into a fixed number of regions, ensuring that all states within the same region have approximately the same goal-conditioned value. These regions are then ranked from the lowest to the highest value. Given the current state, the policy is encouraged to reach the immediately higher-ranking region, relative to the state’s present region, in the fewest steps possible. This target region offers a state-dependent, short-horizon objective that is easier to achieve compared to the final goal, leading to generally shorter successful trajectories. Our proposed algorithm, Dual-Advantage Weighted Offline GCRL (DAWOG), seamlessly integrates the original goal-conditioned advantage weight with the new target-based advantage to effectively address the multi-modality issue.

A prime example is showcased in Fig. 1, depicting the performance of three pre-trained policies in maze-based navigation tasks (Fu et al., 2020). A quadruped robot has been trained to navigate these mazes. It’s tasked with reaching new, unseen goals (red circles) from a starting point (orange circles). These policies were trained via supervised learning: a baseline with no action weighting (left), goal-conditioned advantage weighting (middle), and our proposed dual-advantage weighting (right). While the goal-conditioned advantage weighting often outperforms the baseline, it can occasionally guide the robot into sub-optimal areas, causing delays before redirecting towards the goal. A closer look, as shown in Fig. 2, indicates that dual-advantage weighting better distinguishes goal-aligned actions from sub-optimal ones by assigning them different weights. Consequently, our dual-advantage weighting approach mitigates the multi-modality challenge, resulting in policies that offer more direct and efficient routes to the goal.

In this work, we address the challenges of multi-modality in goal-conditioned offline RL, introducing a novel approach to tackle them. The main contributions of our paper are:

1.
A proposed dual-advantage weighted supervised learning approach, tailored to mitigate the multi-modality challenges inherent in goal-conditioned offline RL.
2.
Theoretical assurances that our method’s performance matches or exceeds that of the underlying behavior policy.
3.
Empirical evaluations across diverse benchmarks (Fu et al., 2020; Plappert et al., 2018; Yang et al., 2022) showcasing DAWOG’s consistent edge over other leading algorithms.
4.
A series of studies highlighting DAWOG’s unique properties and its robustness against hyperparameter variations.

2 Related work

In this section, we offer a brief overview of methodologically related approaches. In goal-conditioned RL (GCRL) In goal-conditioned RL (GCRL), one of the main challenges is the sparsity of the reward signal. An effective solution is hindsight experience replay (HER) (Andrychowicz et al., 2017), which relabels failed rollouts that have not been able to reach the original goals and treats them as successful examples for different goals thus effectively learning from failures. HER has been extended to solve different challenging tasks in synergy with other learning techniques, such as curriculum learning (Fang et al., 2019), model-based goal generation (Yang et al., 2021; Jurgenson et al., 2020; Nasiriany et al., 2019; Nair et al., 2018), and generative adversarial learning (Durugkar et al., 2021; Charlesworth & Montana, 2020). In the offline setting, GCRL aims to learn goal-conditioned policies using only a fixed dataset. The simplest solution has been to adapt standard offline reinforcement learning algorithms (Kumar et al., 2020; Fujimoto & Gu, 2021) by simply concatenating the state and the goal as a new state. Chebotar et al., (2021) propose goal-conditioned conservative Q-learning and goal-chaining to prevent value over-estimation and increase the diversity of the goal. Some of the previous works design offline GCRL algorithms from the perspective of state-occupancy matching (Eysenbach et al., 2022). Mezghani et al., (2022) propose a self-supervised reward shaping method to facilitate offline GCRL.

Our work is most related to goal-conditioned imitation learning (GCIL). Emmons et al., (2022) study the importance of concatenating goals with states showing its effectiveness in various environments. Ding et al., (2019) extend generative adversarial imitation learning (Ho & Ermon, 2016) to goal-conditioned settings. Ghosh et al., (2021) extend behaviors cloning (Bain and Sammut, 1995) to goal-conditioned settings and propose goal-conditioned supervised learning (GCSL) to imitate relabeled offline trajectories. Yang et al., (2022) connect GCSL to offline GCRL, and show that the objective function in GCSL is a lower bound of a function of the original GCRL objective. They propose the GEAW algorithm, which re-weights the offline data based on advantage function similarly to Peng et al., (2019); Wang et al., (2018). Additionally, Yang et al., (2022) identify the multi-modality challenges in GEAW and introduce the best-advantage weight (BAW) to exclude state-actions with low advantage during the learning process. In parallel, our DAWOG was developed to address this very challenge, offering a novel advantage-based action re-weighting approach.

Some connections can also be found with goal-based hierarchical reinforcement learning methods (Li et al., 2022; Chane-Sane et al., 2021; Kim et al., 2021; Zhang et al., 2021; Nasiriany et al., 2019). These works feature a high-level model capable of predicting a sequence of intermediate sub-goals and learn low-level policies to achieve them. Instead of learning to reach a specific sub-goals, our policy learns to reach an entire sub-region of the state space containing states that are equally valuable and provide an incremental improvement towards the final goal.

Lastly, there have been other applications of state space partitioning in reinforcement learning, such as facilitating exploration and accelerating policy learning in online settings (Ma et al., 2020; Wei et al., 2018; Karimpanal and Wilhelm, 2017; Mannor et al., 2004). Ghosh et al., (2018) demonstrate that learning a policy confined to a state partition instead of the whole space can lead to low-variance gradient estimates for learning value functions. In their work, states are partitioned using K-means to learn an ensemble of locally optimal policies, which are then progressively merged into a single, better-performing policy. Instead of partitioning states based on their geometric proximity, we partition states according to the proximity of their corresponding goal-conditioned values. We then use this information to define an auxiliary reward function and, consequently, a region-based advantage function.

3 Preliminaries

Goal-conditioned MDPs Goal-conditioned tasks are usually modeled as Goal-Conditioned Markov Decision Processes (GCMDP), denoted by a tuple $<{\mathcal {S}}, {\mathcal {A}}, {\mathcal {G}}, P, R>$ where ${\mathcal {S}}$, ${\mathcal {A}}$, and ${\mathcal {G}}$ are the state, action and goal space, respectively. For each state $s \in {\mathcal {S}}$, there is a corresponding achieved goal, $\phi (s) \in {\mathcal {G}}$, where $\phi :{\mathcal {S}} \rightarrow {\mathcal {G}}$ Liu et al., (2022). At a given state $s_t$, an action $a_t$ taken towards a desired goal g results in a visited next state $s_{t+1}$ according to the environment’s transition dynamics, $P(s_{t+1} \mid s_t, a_t)$. The environment then provides a reward, $r_t = R(s_{t+1}, g)$, which is non-zero only when the goal has been reached, i.e.,

$$\begin{aligned} R(s, g) = {\left\{ \begin{array}{ll} 1, &{} \text {if }\mid \mid \phi (s) - g \mid \mid ^2_2 \le \text {threshold},\\ 0, &{} \text {otherwise.}\\ \end{array}\right. } \end{aligned}$$

(1)

Offline goal-conditioned RL In offline GCRL, the agent aims to learn a goal-conditioned policy, $\pi : {\mathcal {S}} \times {\mathcal {G}} \rightarrow {\mathcal {A}}$, using an offline dataset containing previously logged trajectories that might be generated by any number of unknown behaviors policies. The objective is to maximize the expected and discounted cumulative returns,

$$\begin{aligned} J_{GCRL}(\pi )={\mathbb {E}}_{\begin{array}{c} g \sim P_g, s_0 \sim P_0, \\ a_t \sim \pi (\cdot \mid s_t, g), \\ s_{t+1} \sim P(\cdot \mid s_t, a_t) \end{array}} \left[ \sum ^T_{t=0} \gamma ^t r_t \right] , \end{aligned}$$

(2)

where $\gamma \in (0, 1]$ is a discount factor, $P_g$ is the distribution of the goals, $P_0$ is the distribution of the initial state, and T corresponds to the time step at which an episode ends, i.e., either the goal has been achieved or timeout has been reached.

Goal-conditioned value functions A goal-conditioned state-action value function (Schaul et al., 2015) quantifies the value of an action a taken from a state s conditioned on a goal g using the sparse rewards of Eq. 1,

$$\begin{aligned} Q^\pi (s, a, g)={\mathbb {E}}_{\pi } \left[ \sum ^T_{t=0} \gamma ^t r_t \mid s_0=s, a_0=a \right] \end{aligned}$$

(3)

where ${\mathbb {E}}_{\pi }[\cdot ]$ denotes the expectation taken with respect to $a_t \sim \pi (\cdot \mid s_t, g)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. Analogously, the goal-conditioned state value function quantifies the value of a state s when trying to reach g,

$$\begin{aligned} V^\pi (s, g)={\mathbb {E}}_{\pi } \left[ \sum ^T_{t=0} \gamma ^t r_t \mid s_0=s\right] . \end{aligned}$$

(4)

The goal-conditioned advantage function,

$$\begin{aligned} A^\pi (s, a, g)=Q^\pi (s, a, g)-V^\pi (s, g), \end{aligned}$$

(5)

then quantifies how advantageous it is to take a specific action a in state s towards g over taking the actions sampled from $\pi (\cdot \mid s, g)$ (Yang et al., 2022).

Goal-conditioned supervised learning (GCSL) GCSL (Ghosh et al., 2021) relabels the desired goal in each data tuple $(s_t, a_t, g)$ with the goal achieved henceforth in the trajectory to increase the diversity and quality of the data (Andrychowicz et al., 2017; Kaelbling, 1993). The relabeled dataset is denoted as ${\mathcal {D}}_R=\{(s_t, a_t, g=\phi (s_i)) \mid T \ge i > t \ge 0\}$. GCSL learns a policy that mimics the relabeled transitions by maximizing

$$\begin{aligned} J_{GCSL}(\pi ) = {\mathbb {E}}_{(s_t, a_t, g) \sim {\mathcal {D}}_R} \left[ \pi (a_t \mid s_t, g) \right] . \end{aligned}$$

(6)

Yang et al., (2022) have connected GCSL to GCRL and demonstrated that $J_{GCSL}$ lower bounds $\frac{1}{T}\log J_{GCRL}$.

Goal-conditioned exponential advantage weighting (GEAW) GEAW, as discussed in Yang et al., (2022); Wang et al., (2018), extends GCSL by incorporating a goal-conditioned exponential advantage as the weight for Eq. 6. Its design ensures that samples with higher advantages receive larger weights and vice versa. Specifically, GEAW trains a policy that emulates relabeled transitions, but with varied weights:

$$\begin{aligned} J_{GEAW}(\pi ) = {\mathbb {E}}_{(s_t, a_t, g) \sim {\mathcal {D}}_R} \left[ \exp _{clip} (A(s_t, a_t, g)) \pi (a_t \mid s_t, g) \right] . \end{aligned}$$

(7)

Here, $\exp _{clip}(\cdot )$ clips values within the range (0, M] to ensure numerical stability. This weighting approach has been demonstrated as a closed-form solution to an offline RL problem, guaranteeing that the resultant policy aligns closely with the behavior policy (Wang et al., 2018).

4 Methods

In this section, we formally present the proposed methodology and analytical results. First, we introduce a notion of target region advantage function in Sect. 4.1, which we use to develop the learning algorithm in Sect. 4.2. In Sect. 4.3 we provide a theoretical analysis offering guarantees that DAWOG learns a policy that is never worse than the underlying behaviors policy.

4.1 Target region advantage function

For any state $s \in {\mathcal {S}}$ and goal $g \in {\mathcal {G}}$, the domain of the goal-conditioned value function in Eq. 4 is the unit interval due to the binary nature of the reward function in Eq. 1. Given a positive integer K, we partition [0, 1] into K equally sized intervals, $\{ \beta _i \}_{i=1,\ldots ,K}$. For any goal g, this partition induces a corresponding partition of the state space.

Definition 1

(Goal-conditioned state space partition) For a fixed desired goal $g \in {\mathcal {G}}$, the state space is partitioned into K equally sized regions according to $V^\pi (\cdot , g)$. The $k^{th}$ region, notated as $B_{k}(g)$, contains all states whose goal-conditioned values are within $\beta _k$, i.e.,

$$\begin{aligned} B_{k} (g) = \{s \in {\mathcal {S}} \mid V^\pi (s, g) \in \beta _k \}. \end{aligned}$$

(8)

Our ultimate objective is to up-weight actions taken in a state $s_t \in B_{k}(g)$ that are likely to lead to a region only marginally better (but never worse) than $B_{k}(g)$ as rapidly as possible.

Definition 2

(Target region) For $s \in B_{k}(g)$, the mapping $b(s, g): {\mathcal {S}} \times {\mathcal {G}} \rightarrow \{1, \ldots , K\}$ returns the correct index k. The goal-conditioned target region is defined as

$$\begin{aligned} G(s, g) = B_{\min \{b(s, g)+1, K\}}(g), \end{aligned}$$

(9)

which is the set of states whose goal-conditioned value is not less than the states in the current region. For $s \in B_k(g)$, G(s, g) is the current region $B_k(g)$ if and only if $k=K$.

We now introduce two target region value functions.

Definition 3

(Target region value functions) For a state s, action $a \in {\mathcal {A}}$, and the target region G(s, g), we define a target region V-function and a target region Q-function based on an auxiliary reward function that returns a non-zero reward only when the next state belongs to the target region, i.e.,

$$\begin{aligned} {\tilde{r}}_t = {\tilde{R}}(s_t, s_{t+1}, G(s_t, g)) = {\left\{ \begin{array}{ll} 1, &{} \text {if } s_{t+1} \in G(s_t,g)\\ 0, &{} \text {otherwise.} \\ \end{array}\right. } \end{aligned}$$

(10)

The target region Q-value function is

$$\begin{aligned} \tilde{Q}^\pi (s, a, G(s, g)) = {\mathbb {E}}_{\pi } \left[ \sum ^{{\tilde{T}}}_{t=0} \gamma ^t {\tilde{r}}_t \mid s_0=s, a_0=a \right] , \end{aligned}$$

(11)

where ${\tilde{T}}$ corresponds to the time step at which the target region is achieved or timeout is reached, ${\mathbb {E}}_{\pi }[\cdot ]$ denotes the expectation taken with respect to the policy $a_t \sim \pi (\cdot \mid s_t, g)$ and the transition dynamics $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The target region Q-function estimates the expected cumulative return when starting in $s_t$, taking an action $a_t$, and then following the policy $\pi$, based on the auxiliary reward. The discount factor $\gamma$ reduces the contribution of delayed target achievements. Analogously, the target region value function is defined as

$$\begin{aligned} \tilde{V}^\pi (s, G(s, g)) = {\mathbb {E}}_{\pi } \left[ \sum ^{{\tilde{T}}}_{t=0} \gamma ^t {\tilde{r}}_t \mid s_0=s \right] \end{aligned}$$

(12)

and quantifies the quality of a state s according to the same criterion.

Using the above value functions, we are in a position to introduce the corresponding target region advantage function.

Definition 4

(Target region advantage function) The target region-based advantage function is defined as

$$\begin{aligned} {\tilde{A}}^\pi (s, a, G(s,g)) = {\tilde{Q}}^\pi (s, a, G(s, g)) - {\tilde{V}}^\pi (s, G(s, g)). \end{aligned}$$

(13)

It estimates the advantage of action a towards the target region in terms of the cumulative return by taking a in state s and following the policy $\pi$ thereafter, compared to taking actions sampled from the policy.

4.2 The DAWOG algorithm

The proposed DAWOG belongs to the family of WGCSL algorithms, i.e. it is designed to optimize the following objective function

$$\begin{aligned} J_{DAWOG}(\pi ) = {\mathbb {E}}_{(s_t, a_t, g) \sim {\mathcal {D}}_R} \left[ w_t \log \pi (a_t \mid s_t, g) \right] \end{aligned}$$

(14)

where the role of $w_t$ is to re-weight each action’s contribution to the loss. In DAWOG, $w_t$ is an exponential weight of form

$$\begin{aligned} w_t = \exp _{clip} (\beta A^{\pi _b}(s_t, a_t, g) + \tilde{\beta } \tilde{A}^{\pi _b}(s_t, a_t, G(s_t, g)), \end{aligned}$$

(15)

where $\pi _b$ is the underlying behavior policy that generate the relabeled dataset ${\mathcal {D}}_R$. The contribution of the two advantage functions, $\tilde{A}^{\pi _b}(s_t, a_t, g)$ and $\tilde{A}^{\pi _b}(s_t, a_t, G(s_t,g))$, is controlled by positive scalars, $\beta$ and $\tilde{\beta }$, respectively. However, empirically, we have found that using a single shared parameter generally performs well across the tasks we have considered (see Sect. 5.5). The clipped exponential, $\exp _{clip}(\cdot )$, is used for numerical stability and keeps the values within the (0, M] range, for a given $M>0$ threshold.

The algorithm combines the originally proposed goal-conditioned advantage (Yang et al., 2022) with the novel target region advantage. The former ensures that actions likely to lead to the goal are up-weighted. However, when the goal is still far, there may still be several possible ways to reach it, resulting in a wide variety of favorable actions. The target region advantage function provides additional guidance by further increasing the contribution of actions expected to lead to a higher-valued sub-region of the state space as rapidly as possible. Both $A^{\pi _b}(s_t, a_t, g)$ and $\tilde{A}^{\pi _b}(s_t, a_t, G(s_t,g))$ are beneficial in a complementary fashion: whereas the former is more concerned with long-term gains, which are more difficult and uncertain, the latter is more concerned with short-term gains, which are easier to achieve. As such, these two factors are complementary and their combined effect plays an important role in the algorithm’s final performance (see Sect. 5.5). An illustration of the dual-advantage weighting scheme is shown in Fig. 3.

In the remainder, we explain the entire training procedure. The advantage $A^{\pi _b}(s_t, a_t, g)$ is estimated via

$$\begin{aligned} A^{\pi _b}(s_t, a_t, g)= r_t + \gamma V^{\pi _b}(s_{t+1}, g) - V^{\pi _b}(s_t, g). \end{aligned}$$

(16)

In practice, the goal-conditioned V-function is approximated by a deep neural network with parameter $\psi _1$, which is learned by minimizing the temporal difference (TD) error (Sutton & Barto, 2018):

$$\begin{aligned} {\mathcal {L}}(\psi _1) = {\mathbb {E}}_{(s_t, s_{t+1}, g) \sim {\mathcal {D}}_R} \left[ (V_{\psi _1}(s_t, g) - y_t)^2 \right] , \end{aligned}$$

(17)

where $y_t$ is the target value given by

$$\begin{aligned} y_t=r_t + \gamma (1 - d(s_{t+1}, g)) V_{\psi ^-_1}(s_{t+1}, g). \end{aligned}$$

(18)

Here $d(s_{t+1}, g)$) indicates whether the state $s_{t+1}$ has reached the goal g. The parameter vector $\psi ^-_1$ is a slowly moving average of $\psi _1$ to stabilize training (Mnih et al., 2015). Analogously, the target region advantage function is estimated by

$$\begin{aligned} {\tilde{A}}^{\pi _b}(s_t, a_t, G(s_t, g))= {\tilde{r}}_t + \gamma {\tilde{V}}^{\pi _b}(s_{t+1}, G(s_t, g)) - {\tilde{V}}^{\pi _b}(s_t, G(s_t, g)), \end{aligned}$$

(19)

where the target region V-function is approximated with a deep neural network parameterized with $\psi _2$. The relevant loss function is

$$\begin{aligned} {\mathcal {L}}(\psi _2) = {\mathbb {E}}_{(s_t, s_{t+1}, g) \sim {\mathcal {D}}_R} \left[ ({\tilde{V}}_{\psi _2}(s_t, G(s_t, g)) - {\tilde{y}}_t)^2 \right] , \end{aligned}$$

(20)

where the target value is

$$\begin{aligned} {\tilde{y}}_t={\tilde{r}}_t + \gamma (1 - {\tilde{d}}(s_{t+1}, G(s_t, g))) {\tilde{V}}_{\psi ^-_2}(s_{t+1}, G(s_t, g)). \end{aligned}$$

(21)

and ${\tilde{d}}(s_{t+1}, G(s_t, g))$) indicates whether the state $s_{t+1}$ has reached the target region $G(s_t, g)$. $\psi ^-_2$ is a slowly moving average of $\psi _2$. The full procedure is presented in Algorithm 1 where the two value functions are jointly optimized and contribute to optimizing Eq. 14.

4.3 Policy improvement guarantees

In this section, we demonstrate that our learned policy is never worse than the underlying behavior policy $\pi _b$ that generates the relabeled data. First, we express the policy learned by our algorithm in an equivalent form, as follows.

Proposition 1

DAWOG learns a policy $\pi _\theta$ to minimize the KL-divergence from

$$\begin{aligned} {\tilde{\pi }}_{dual}(a \mid s, g) = \exp (w + N(s, g)) \pi _b(a \mid s, g), \end{aligned}$$

(22)

where $w = \beta A^{\pi _b}(s, a, g) + {\tilde{\beta }} {\tilde{A}}^{\pi _b}(s, a, G(s,g))$, G(s, g) is the target region, and N(s, g) is a normalizing factor to ensuring that $\sum _{a \in {\mathcal {A}}} {\tilde{\pi }}_{dual}(a \mid s, g)=1$.

Proof

According to Eq. 14, DAWOG maximizes the following objective with the policy parameterized by $\theta$:

$$\begin{aligned} \begin{aligned} \arg \max _\theta J(\theta ) =&\arg \max _\theta {\mathbb {E}}_{(s, a, g) \sim {\mathcal {D}}_R} \left[ \exp (w + N(s, g)) \log \pi _\theta (a \mid s, g)\right] \\ =&\arg \max _\theta {\mathbb {E}}_{(s, g) \sim {\mathcal {D}}_R} \left[ \sum _{a} \exp (w + N(s, g)) \pi _b(a \mid s, g) \log \pi _\theta (a \mid s, g) \right] \\ =&\arg \max _\theta {\mathbb {E}}_{(s, g) \sim {\mathcal {D}}_R} \left[ \sum _{a} {\tilde{\pi }}_{dual}(a \mid s, g) \log \pi _\theta (a \mid s, g) \right] \\ =&\arg \max _\theta {\mathbb {E}}_{(s, g) \sim {\mathcal {D}}_R} \left[ \sum _{a} {\tilde{\pi }}_{dual}(a \mid s, g) \log \frac{\pi _\theta (a \mid s, g)}{{\tilde{\pi }}_{dual}(a \mid s, g)} \right] \\ =&\arg \min _\theta {\mathbb {E}}_{(s, g) \sim {\mathcal {D}}_R} \left[ D_{KL} ({\tilde{\pi }}_{dual}(\cdot \mid s, g) \mid \mid \pi _\theta (\cdot \mid s, g)) \right] \end{aligned} \end{aligned}$$

(23)

$J(\theta )$ reaches its maximum when

$$\begin{aligned} D_{KL} ({\tilde{\pi }}_{dual}(\cdot \mid s, g) \mid \mid \pi _{\theta }(\cdot \mid s, g)) = 0, \forall s \in {\mathcal {S}}, g \in {\mathcal {G}}. \end{aligned}$$

(24)

$\square$

Then, we propose Proposition 2 to show the condition for policy improvement.

Proposition 2

(Wang et al., 2018; Yang et al., 2022) Suppose two policies $\pi _1$ and $\pi _2$ satisfy

$$\begin{aligned} h_1(\pi _2(a \mid s, g)) = h_1(\pi _1(a \mid s, g)) + h_2(s, g, A^{\pi _1}(s, a, g)) \end{aligned}$$

(25)

where $h_1(\cdot )$ is a monotonically increasing function, and $h_2(s, g, \cdot )$ is monotonically increasing for any fixed s and g. Then we have

$$\begin{aligned} V^{\pi _2}(s, g) \ge V^{\pi _1}(s, g), \forall s \in {\mathcal {S}} \text { and } g \in {\mathcal {G}}. \end{aligned}$$

That is, $\pi _2$ is uniformly as good as or better than $\pi _1$.

We want to leverage this result to demonstrate that $V^{{\tilde{\pi }}_{dual}}(s, g) \ge V^{\pi _b}(s, g)$ for any state s and goal g. Firstly, we need to obtain a monotonically increasing function $h_1(\cdot )$. This is achieved by taking the logarithm of the both sides of Eq. 22, i.e.,

$$\begin{aligned} \begin{aligned} \log {\tilde{\pi }}_{dual}(a \mid s, g) =&\log \pi _b(a \mid s, g) + \beta A^{\pi _b}(s, a, g) \\ {}&+ {\tilde{\beta }} {\tilde{A}}^{\pi _b}(s, a, G(s,g)) + N(s, g). \end{aligned} \end{aligned}$$

(26)

so that $h_1(\cdot )= \log (\cdot )$. The following proposition establishes that we also have a function $h_2(s, g, A^{\pi _b}(s, a, g)) = \beta A^{\pi _b}(s, a, g) + {\tilde{\beta }} {\tilde{A}}^{\pi _b}(s, a, G(s,g)) + N(s, g)$, which is monotonically increasing for any fixed s and g. Since $\beta , {\tilde{\beta }} \ge 0$ and N(s, g) is independent of the action, it is equivalent to prove that for any fixed s and g, there exists a monotonically increasing function l satisfying

$$\begin{aligned} l(s, g, {\tilde{A}}^{\pi _b}(s, a, G(s,g))) = A^{{\pi _b}}(s, a, g). \end{aligned}$$

(27)

Proposition 3

Given fixed s, g and the target region G(s, g), the goal-conditioned advantage function $A^\pi$ and the target region-conditioned advantage function ${\tilde{A}}^\pi$ satisfy $l(s, g, A^\pi (s, a, g)) = {\tilde{A}}^\pi (s, a, G(s,g))$, where $l(s, g,\cdot )$ is monotonically increasing for any fixed s and g.

Proof

By the definition of monotonically increasing function, if for all $a', a'' \in {\mathcal {A}}$ such that $A^\pi (s, a', g) \ge A^\pi (s, a'', g)$ and we can reach ${\tilde{A}}^\pi (s, a', G(s,g)) \ge {\tilde{A}}^\pi (s, a'', G(s,g))$, then the proposition can be proved.

We start by having any two actions $a', a'' \in {\mathcal {A}}$ such that

$$\begin{aligned} A^\pi (s, a', g) \ge A^\pi (s, a'', g). \end{aligned}$$

(28)

By adding $V^\pi (s, g)$ on both sides, the inequality becomes

$$\begin{aligned} Q^\pi (s, a', g) \ge Q^\pi (s, a'', g). \end{aligned}$$

(29)

By Definition 3, the goal-conditioned Q-function can be written as

$$\begin{aligned} Q^\pi (s, a, g)={\mathbb {E}}_{\pi } \left[ R_{t,\tau _i} \mid s_t=s, a_t=a \right] , \end{aligned}$$

(30)

where $\tau _i$ represents a trajectory: $s_t, a_t, r_t^i, s_{t+1}^i, a_{t+1}^i, r_{t+1}^i, \ldots , s_{T}^i$

$$\begin{aligned} R_{t,\tau _i}=r_t^i + \gamma r_{t+1}^i + \ldots + \gamma ^{t_{tar}^i}V(s_{tar}^i, g), \end{aligned}$$

(31)

$s_{tar}^i$ corresponds to the state where $\tau _i$ gets into the target region, $t_{tar}^i$ is the corresponding time step. Because the reward is zero until the desired goal is reached, Eq. 30 can be written as

$$\begin{aligned} Q^\pi (s, a, g)={\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^i}V(s_{tar}^i, g) \mid s_t=s, a_t=a \right] . \end{aligned}$$

(32)

Similarly,

$$\begin{aligned} \begin{aligned} {\tilde{Q}}^\pi (s, a, G(s,g))&={\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^i}{\tilde{V}}(s_{tar}^i, G(s,g)) \mid s_t=s, a_t=a \right] \\&={\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^i} \mid s_t=s, a_t=a \right] . \end{aligned} \end{aligned}$$

(33)

According to Eq. 29 and Eq. 32, we have

$$\begin{aligned} {\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^{i'}}V(s_{tar}^{i'}, g) \mid s_t=s, a_t=a' \right] \ge {\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^{i''}}V(s_{tar}^{i''}, g) \mid s_t=s, a_t=a'' \right] . \end{aligned}$$

(34)

Given the valued-based partitioning of the state space, we assume that the goal-conditioned values of states in the target region are sufficiently close such that $\forall i', i'', V(s_{tar}^{i'}, g) \approx v, V(s_{tar}^{i''}, g) \approx v$. Then, Eq. 34 can be approximated as

$$\begin{aligned} v \cdot {\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^{i'}} \mid s_t=s, a_t=a' \right] \ge v \cdot {\mathbb {E}}_{\pi } \left[ \gamma ^{t_{tar}^{i''}} \mid s_t=s, a_t=a'' \right] . \end{aligned}$$

(35)

Removing v on both sides of Eq. 35 and according to Eq. 33, we have

$$\begin{aligned} {\tilde{Q}}^\pi (s, a', G(s, g)) \ge {\tilde{Q}}^\pi (s, a'', G(s, g)). \end{aligned}$$

(36)

Then,

$$\begin{aligned} {\tilde{A}}^\pi (s, a', G(s, g)) \ge {\tilde{A}}^\pi (s, a'', G(s, g)). \end{aligned}$$

(37)

$\square$

Since GCSL aims to mimic the underlying behavior policy using maximum likelihood estimation, DAWOG inherently offers guarantees in relation to GCSL.

5 Experimental results

In this section, we examine DAWOG’s performance relative to existing state-of-the-art algorithms using environments of increasing complexity. The remainder of this section is organized as follows. The benchmark tasks and datasets are presented in Sect. 5.1. The implementation details are provided in Sect. 5.2. A list of competing methods is presented in Sect. 5.3, and the comparative performance results are found in Sect. 5.4. Here, we also qualitatively inspect the policies learned by DAWOG in an attempt to characterize the improvements that can be achieved over other methods. Section 5.5 presents extensive ablation studies to appreciate the relative contribution of the different advantage weighting factors. Finally, in Sect. 5.6, we study how the dual-advantage weight depends on its hyperparameters.

5.1 Tasks and datasets

5.1.1 Grid World

We designed two $16 \times 16$ grid worlds to assess the performance on a simple navigation task. From its starting position on the grid, an agent needs to reach a goal that has been randomly placed in one of the available cells. Only four actions are available to move left, right, up, and down. The agent accrues a positive reward when it reaches the cell containing the goal. To generate the benchmark dataset, we trained a Deep Q-learning algorithm (Mnih et al., 2015), whose replay buffer, containing 4, 000 trajectories of 50 time steps, was used as the benchmark dataset.

5.1.2 AntMaze navigation

The AntMaze suite used in our experiment is obtained from the D4RL benchmark (Fu et al., 2020), which has been widely adopted by offline GCRL studies (Eysenbach et al., 2022; Emmons et al., 2022; Li et al., 2022). The task requires to control an 8-DoF quadruped robot that moves in a maze and aims to reach a target location within an allowed maximum of 1, 000 steps. The suite contains three kinds of different maze layouts: umaze (a U-shape wall in the middle), medium and large, and provides three training datasets. The datasets differ in the way the starting and goal positions of each trajectory were generated: in umaze the starting position is fixed and the goal position is sampled within a fixed-position small region; in diverse the starting and goal positions are randomly sampled in the whole environment; finally, in play, the starting and goal positions are randomly sampled within hand-picked regions. In the sparse-reward environment, the agent obtains a reward only when it reaches the target goal. We use a normalized score as originally proposed in Fu et al., (2020), i.e.,

$$\begin{aligned} s_n = 100 \cdot \frac{s - s_r}{ s_e - s_r} \end{aligned}$$

where s is the unnormalized score, $s_r$ is a score obtained using a random policy and $s_e$ is the score obtained using an expert policy.

In our evaluation phase, the policy is tested online. The agent’s starting position is always fixed, and the goal position is generated using one of the following methods:

fixed goal: the goal position is sampled within a small and fixed region in a corner of the maze, as in previous work (Eysenbach et al., 2022; Emmons et al., 2022; Li et al., 2022);
diverse goal: the goal position is uniformly sampled over the entire region. This evaluation scheme has not been adopted in previous works, but helps assess the policy’s generalization ability in goal-conditioned settings.

5.1.3 Gym robotics

Gym Robotics (Plappert et al., 2018) is a popular robotic suite used in both online and offline GCRL studies (Yang et al., 2022; Eysenbach et al., 2022). The agent to be controlled is a 7-DoF robotic arm, and several tasks are available: in FetchReach, the arm needs to touch a desired location; in FetchPush, the arm needs to move a cube to a desired location; in FetchPickAndPlace a cube needs to be picked up and moved to a desired location; finally, in FetchSlide, the arm needs to slide a cube to a desired location. Each environment returns a reward of one when the task has been completed within an allowed time horizon of 50 time steps. For this suite, we use the expert offline dataset provided by Yang et al., (2022). The dataset for FetchReach contains $1 \times 10^5$ time steps whereas all the other datasets contain $2 \times 10^6$ steps. The datasets are collected using a pre-trained policy using DDPG and hindsight relabeling (Lillicrap et al., 2016; Andrychowicz et al., 2017); the actions from the policy were perturbed by adding Gaussian noise with zero mean and 0.2 standard deviation.

5.2 Implementation details

DAWOG’s training procedure is shown in Algorithm 1. In our implementation, for continuous control tasks, we use a Gaussian policy following previous recommendations Raffin et al., (2021). When interacting with the environment, the actions are sampled from the above distribution. All the neural networks used in DAWOG are 3-layer multi-layer perceptrons with 512 units in each layer and ReLU activation functions. The parameters are trained using the Adam optimizer Kingma & Ba (2014) with a learning rate $1 \times 10^{-3}$. The training batch size is 512 across all networks. To represent G(s, g) we use a K-dimensional one-hot encoding vector where the $i^{th}$ position is non-zero for the target region and zero everywhere else along with the goal g. Four hyper-parameters need to be chosen: the state partition size, K, the two coefficients controlling the relative contribution of the two advantage functions, $\beta$ and ${\tilde{\beta }}$, and the clipping bound, M. In our experiments, we use $K=20$ for umaze and medium maze, $K=50$ for large maze, and $K=10$ for all other tasks. In all our experiments, we use fixed values of $\beta = {\tilde{\beta }} = 10$. The clipping bound is always kept at $M=10$.

5.3 Competing methods

Several competing algorithms have been selected for comparison with DAWOG, including offline DRL methods that were not originally proposed for goal-conditioned tasks and required some minor adaptation. In the remainder of this Section, the nomenclature ’g-’ indicates that the original algorithm has been implemented to operate in a goal-conditioned setting by concatenating the state and the goal as a new state and with hindsight relabeling. In all experiments, we independently optimize the hyper-parameters for every algorithm.

The first category of algorithms comprises regression-based methods that imitate the relabeled offline dataset using various weighting strategies:

GCSL (Ghosh et al., 2021) imitates the relabeled transitions without any weighting strategies.
GEAW (Wang et al., 2018; Yang et al., 2022) uses goal-conditioned advantage to weight the actions in the offline data.
WGCSL (Yang et al., 2022) employs a combination of weighting strategies: discounted relabeling weighting (DRW), goal-conditioned exponential advantage weighting (GEAW), and best-advantage weighting (BAW).

We also include three actor-critic methods:

Contrastive RL (Eysenbach et al., 2022) estimates a Q-function by contrastive learning;
g-CQL (Kumar et al., 2020) learns a conservative Q-function;
g-BCQ (Fujimoto et al., 2019) learns a Q-function by clipped double Q-learning with a restricted policy;
g-TD3-BC (Fujimoto & Gu, 2021) combines TD3 algorithm (Fujimoto et al., 2018) with a behavior cloning regularizer.

Finally, we include a hierarchical learning method, IRIS (Mandlekar et al., 2020), which employs a low-level imitation learning policy to reach sub-goals commanded by a high-level goal planner.

5.4 Performance comparisons and analysis

To appreciate how state space partitioning works, we provide examples of valued-based partition for the grid worlds environments in Fig. 4. In these cases, the environmental states simply correspond to locations in the grid. Here, the state space is divided with darker colors indicating higher values. As expected, these figures clearly show that states can be ordered based on the estimated value function, and that higher-valued states are those close to the goal. We also report the average return across five runs in Table 1 where we compare DAWOG against GCSL and GEAW - two algorithms that are easily adapted for discrete action spaces.

Table 2 presents the results for the Gym Robotics suite. We detail the average return and the standard deviation for each algorithm, derived from four independent runs with unique seeds. As can be seen from the results, most of the competing algorithms reach a comparable performance with DAWOG. However, DAWOG generally achieves higher scores and the most stable performance in different tasks.

Table 1 Experiment results for the two Grid World navigation environments. The mean and the standard deviation are calculated by 4 independent runs

Full size table

Table 2 Experiment results in Gym Robotics. The mean and the standard deviation are calculated by 4 independent runs

Full size table

Table 3 displays similar findings for the AntMaze suite. In these more complex, long-horizon environments, DAWOG consistently surpasses all baseline algorithms. In scenarios with diverse goals, while all algorithms exhibit lower performance, DAWOG still manages to secure the highest average score. This setup requires better generalization performance given that the test goals are sampled from every position within the maze.

Table 3 Experiment results in AntMaze environments. The results are normalized by the expert score from D4RL paper. The mean and the standard deviation are calculated by 4 independent runs

Full size table

To gain an appreciation for the benefits introduced by the target region approach, in Fig. 1 we visualize 100 trajectories realized by three different policies for AntMaze tasks: dual-advantage weighting (DAWOG), equally-weighting and goal-conditioned advantage weighting. The trajectories generated by equally-weighting occasionally lead to regions in the maze that should have been avoided, which results in sub-optimal solutions. The policy from goal-conditioned advantage weighting is occasionally less prone to making the same mistakes, although it still suffers from the multi-modality problem. This can be appreciated, for instance, by observing the antmaze-medium case. In contrast, DAWOG is generally able to reach the goal with fewer detours, hence in a shorter amount of time.

5.5 Further studies

In this Section we take a closer look at how the two advantage-based weights featuring in Eq. 15 perform, both separately and jointly taken, when used in the loss of Eq. 14. We compare learning curves, region occupancy times (i.e. time spent in each region of the state space whilst reaching the goal), and potential overestimation biases. We also study the effects of using different target region and using entropy to regularize policy learning.

5.5.1 Learning curves

In the AntMaze environments, we train DAWOG using no advantage ($\beta ={\tilde{\beta }}=0$), only the goal-conditioned advantage ($\beta =10, {\tilde{\beta }}=0$), only the target region advantage ($\beta =0, {\tilde{\beta }}=10$), and the proposed dual-advantage ($\beta ={\tilde{\beta }}=10$). Over the course of 50, 000 gradient updates, Fig. 5 clearly illustrates the distinct learning trajectories of each algorithm. Both the goal-advantage and region-based advantage perform better than using no advantage, and their performance is generally comparable, with the latter often achieving higher normalized returns. Combining the two advantages leads to significantly higher returns than any advantage weight individually taken.

5.5.2 Region occupancy times

In this study, we set out to confirm that the dual-advantage weighting scheme results in a policy favoring actions leading to the next higher ranking target region rapidly, i.e. by reducing the occupancy time in each region. Using the AntMaze environments, Fig. 6 shows the average time spent in a region of the state space partitioned with $K=50$ regions. As shown here, the dual-advantage weighting allows the agent to reach the target (next) region in fewer time steps compared to the goal-conditioned advantage alone. As the episode progresses, the ant’s remaining time to complete its task diminishes, influencing its decision-making process. Hence, as the ant progressively moves to higher ranking regions closer to the goal, the occupancy times decrease.

5.5.3 Over-estimation bias

We assess the extent of potential overestimation errors affecting the two advantage weighting factors used in our method (see Eq. 15). This is done by studying the error that occurred in the estimation of the corresponding V-functions (see Eq. 16 and Eq. 19). Given a state s and goal g, we compute the goal-conditioned V-value estimation error as $V_{\psi _1}(s, g) - V^{\pi }(s, g)$, where $V_{\psi _1}(s, g)$ is the parameterized function learned by our algorithm and $V^{\pi }(s, g)$ is an unbiased Monte-Carlo estimate of the goal-conditioned V-function’s true value (Sutton & Barto, 2018). Since $V^{\pi }(s, g)$ represents the expected discounted return obtained by the underlying behavior policy that generates the relabeled data, we use a policy pre-trained with the GCSL algorithm to generate 1, 000 trajectories to calculate the Monte-Carlo estimate (i.e. the average discounted return). Analogously, the target region V-value estimation error is ${\tilde{V}}_{\psi _2}(s, g, G(s, g)) - {\tilde{V}}^{\pi }(s, g, G(s,g))$. We use the learned target region V-value function to calculate ${\tilde{V}}_{\psi _2}(s, g, G(s, g))$, and Monte-Carlo estimation to approximate ${\tilde{V}}^{\pi }(s, g, G(s,g))$.

Our investigation focuses on the Grid World environment, specifically analyzing two distinct layouts: grid-umaze and grid-wall. For each layout, we randomly sample s and g uniformly within the entire maze and ensure that the number of regions separating them is uniformly distributed in $\{1, \ldots , K\}$. Then, for each k in that range: 1) 1, 000 goal positions are sampled randomly within the whole layout; 2) for each goal position, the state space is partitioned according to $V_{\psi }(\cdot ,g)$; and 3) a state is sampled randomly within the corresponding region. Since there may exist regions without any states, the observed total number of regions is smaller than $K=10$. The resulting estimation errors are shown in Fig. 7. As can be seen here, both the mean and standard deviation of the ${\tilde{V}}$-value errors are consistently smaller than those corresponding to the V-value errors. This indicates the target region value function is more robust against over-estimation bias, which may help improve the generalization performance in out-of-distribution settings.

5.5.4 Effects of different target regions

As outlined in Definition 2, the target region comprises states with goal-conditioned values marginally exceeding the current state’s value. Within DAWOG, when the current region is denoted as $B_k(g)$, the subsequent target region becomes $B_{k+1}(g)$. Yet, for immediate benefits, regions beyond $B_{k+1}(g)$ can also be contemplated. This section delves into the implications of varying target regions. Figure 8 demonstrates that targeting the immediate neighboring region with higher values using DAWOG consistently yields superior performance compared to other configurations with varied target regions. There is only one instance (antmaze-large-play) where targeting a slightly further region yields a marginally better outcome. Nonetheless, as a general trend, the performance advantage diminishes as the target region becomes increasingly distant from the current region.

5.5.5 Policy learning with entropy regularization

The concept of target region advantage can be perceived as a regularization technique. In this analysis, we juxtapose DAWOG with a version of GEAW enhanced by entropy regularization. The refined objective function is expressed as:

$$\begin{aligned} J_{reg}(\pi ) = {\mathbb {E}}_{(s_t, a_t, g) \sim {\mathcal {D}}_R} \left[ \exp _{clip} (\beta A^{\pi _b}(s_t, a_t, g)) \log \pi (a_t \mid s_t, g) + \alpha {\mathcal {H}}(\pi (\cdot \mid s_t, g))\right] \end{aligned}$$

where ${\mathcal {H}}(\pi (\cdot \mid s_t, g))$ is defined as $\frac{1}{2} \ln 2\pi e \sigma ^2$, with $\sigma$ representing the standard deviation of the Gaussian distribution conditioned as $\pi (\cdot \mid s_t, g)$. Initially, we set $\alpha$ values from the set 0, 0.01, 0.1. Subsequently, we employ a dynamic approach for $\alpha$, allowing it to decrease progressively from 0.1 to 0.01. The outcomes of these experiments are depicted in Fig. 9.

Although strategically adjusted regularization can slightly improve the GEAW baseline, it is evident that DAWOG maintains a consistent edge in performance. This superior performance of DAWOG can be attributed to the unique manner in which it introduces short-term goals.

5.6 Sensitivity to hyperparameters

Lastly, we examine the impact of the number of partitions (K) and the coefficients $\beta$ and ${\tilde{\beta }}$, which control the relative contribution of the two advantage functions on DAWOG’s overall performance. In the AntMaze task, we report the distribution of normalized returns as K increases. Figure 10 reveals that an optimal parameter yielding high average returns with low variance often depends on the specific task and is likely influenced by the environment’s complexity.

The performance of DAWOG, as depicted in Fig. 11, varies in response to different settings of $\beta$ and ${\tilde{\beta }}$, highlighting the algorithm’s sensitivity to these parameters. The plot demonstrates some minimal sensitivity to various parameter combinations but also exhibits a good degree of symmetry. In all our experiments, including those in Tables 1 and 2, we opted for a shared value, $\beta ={\tilde{\beta }}=10$, rather than optimizing each parameter combination for each task. This choice suggests that strong performance can be achieved even without extensive hyperparameter optimization.

6 Discussion and conclusions

Our study introduces a novel dual-advantage weighting scheme in supervised learning, specifically designed to tackle the complexities of multi-modality and distribution shifts in goal-conditioned offline reinforcement learning (GCRL). The corresponding algorithm, DAWOG (Dual-Advantage Weighting for Offline Goal-conditioned learning), prioritizes actions that lead to higher-reward regions, introducing an additional source of inductive bias and enhancing the ability to generalize learned skills to novel goals. Theoretical support is provided by demonstrating that the derived policy is never inferior to the underlying behavior policy. Empirical evidence shows that DAWOG learns highly competitive policies and surpasses several existing offline algorithms on demanding goal-conditioned tasks. Significantly, the ease of implementing and training DAWOG underscores its practical value, contributing substantially to the evolving understanding of offline GCRL and its interplay with goal-conditioned supervised learning (GCSL).

The potential for future research in refining and expanding upon our proposed approach is multifaceted. Firstly, our current method partitions states into equally-sized bins for the value function. Implementing an adaptive partitioning technique that does not assume equal bin sizes could provide finer control over state partition shapes (e.g., merging smaller regions into larger ones), potentially leading to further performance improvements.

Secondly, considering DAWOG’s effectiveness in alleviating the multi-modality problem in offline GCRL, it may also benefit other GCRL approaches beyond advantage-weighted GCSL. Specifically, our method could extend to actor-critic-based offline GCRL, such as TD3-BC (Fujimoto & Gu, 2021), which introduces a behavior cloning-based regularizer into the TD3 algorithm (Fujimoto et al., 2018) to keep the policy closer to actions experienced in historical data. The dual-advantage weighting scheme could offer an alternative direction for developing a TD3-based algorithm for offline GCRL.

Lastly, given our method’s ability to accurately weight actions, it might also facilitate exploration in online GCRL, potentially in combination with self-imitation learning (Oh et al., 2018; Ferret et al., 2021; Li et al., 2022). For example, a recent study demonstrated that advantage-weighted supervised learning is a competitive method for learning from good experiences in GCRL settings (Li et al., 2022). These promising directions warrant further exploration.

Availability of data and materials

All the data will be made available upon paper publication.

Availability of code

All code will be made available upon paper publication.

References

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., Zaremba, W. (2017). Hindsight experience replay. In: Advances in neural information processing systems.
Bain, M., & Sammut, C. (1995). A framework for behavioural cloning. Machine Intelligence, 15, 103–129.
Google Scholar
Chane-Sane, E., Schmid, C., Laptev, I. (2021). Goal-conditioned reinforcement learning with imagined subgoals. In: International conference on machine learning.
Charlesworth, H., & Montana, G. (2020). Plangan: Model-based planning with sparse rewards and multiple goals. Advances in Neural Information Processing Systems, 33, 8532–8542.
Google Scholar
Chebotar, Y., Hausman, K., Lu, Y., Xiao, T., Kalashnikov, D., Varley, J., Irpan, A., Eysenbach, B., Julian, R., Finn, C., Levine, S. (2021). Actionable models: Unsupervised offline reinforcement learning of robotic skills. In: International conference on machine learning.
Ding, Y., Florensa, C., Abbeel, P., Phielipp, M. (2019). Goal-conditioned imitation learning. In: Advances in Neural Information Processing Systems.
Durugkar, I., Tec, M., Niekum, S., & Stone, P. (2021). Adversarial intrinsic motivation for reinforcement learning. Advances in Neural Information Processing Systems, 34, 8622–8636.
Google Scholar
Emmons, S., Eysenbach, B., Kostrikov, I., Levine, S. (2022). RvS: What is essential for offline RL via supervised learning? In: International conference on learning representations.
Eysenbach, B., Zhang, T., Levine, S., & Salakhutdinov, R. (2022). Contrastive learning as goal-conditioned reinforcement learning. Advances in Neural Information Processing Systems, 35, 35603–35620.
Google Scholar
Fang, M., Zhou, T., Du, Y., Han, L., Zhang, Z. (2019). Curriculum-guided hindsight experience replay. In: Advances in neural information processing systems.
Ferret, J., Pietquin, O., Geist, M. (2021). Self-imitation advantage learning. In: International conference on autonomous agents and multiagent systems.
Fu, J., Kumar, A., Nachum, O., Tucker, G., Levine, S. (2020).D4RL: Datasets for deep data-driven reinforcement learning.
Fujimoto, S., Hoof, H., Meger, D. (2018). Addressing function approximation error in actor-critic methods. In: International conference on machine learning.
Fujimoto, S., Meger, D., Precup, D. (2019). Off-policy deep reinforcement learning without exploration. In: International conference on machine learning.
Fujimoto, S., & Gu, S. (2021). A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 34, 20132–20145.
Google Scholar
Ghosh, D., Gupta, A., Reddy, A., Fu, J., Devin, C.M., Eysenbach, B., Levine, S. (2021). Learning to reach goals via iterated supervised learning. In: International conference on learning representations.
Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., Levine, S. (2018). Divide-and-conquer reinforcement learning. In: International conference on learning representations.
Ho, J., Ermon, S. (2016). Generative adversarial imitation learning. In: Advances in Neural Information Processing Systems.
Jurgenson, T., Avner, O., Groshev, E., Tamar, A. (2020). Sub-goal trees a framework for goal-based reinforcement learning. In: International conference on machine learning.
Kaelbling, L.P. (1993). Learning to achieve goals. In: International joint conference on artificial intelligence.
Karimpanal, T. G., & Wilhelm, E. (2017). Identification and off-policy learning of multiple objectives using adaptive clustering. Neurocomputing, 263, 39–47.
Article Google Scholar
Kim, J., Seo, Y., & Shin, J. (2021). Landmark-guided subgoal generation in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 34, 28336–28349.
Google Scholar
Kingma, D.P., Ba, J. (2014). Adam: A method for stochastic optimization.
Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33, 1179–1191.
Google Scholar
Li, Y., Gao, T., Yang, J., Xu, H., Wu, Y. (2022). Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning. In: International conference on machine learning.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N.M.O., Erez, T., Tassa, Y., Silver, D., Wierstra, D. (2016). Continuous control with deep reinforcement learning. In: International conference on learning representations.
Li, J., Tang, C., Tomizuka, M., & Zhan, W. (2022). Hierarchical planning through goal-conditioned offline reinforcement learning. IEEE Robotics and Automation Letters, 7(4), 10216–10223.
Article Google Scholar
Liu, M., Zhu, M., Zhang, W. (2022). Goal-conditioned reinforcement learning: Problems and solutions. In: International joint conference on artificial intelligence.
Ma, X., Zhao, S.-Y., Yin, Z.-H., Li, W.-J. (2020). Clustered reinforcement learning.
Mandlekar, A., Ramos, F., Boots, B., Savarese, S., Fei-Fei, L., Garg, A., Fox, D. (2020). Iris: Implicit reinforcement without interaction at scale for learning control from offline robot manipulation data. In: International conference on robotics and automation.
Mannor, S., Menache, I., Hoze, A., Klein, U. (2004). Dynamic abstraction in reinforcement learning via clustering. In: International conference on machine learning.
Mezghani, L., Sukhbaatar, S., Bojanowski, P., Lazaric, A., Alahari, K. (2022). Learning goal-conditioned policies offline with self-supervised reward shaping. In: Conference on robot learning.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al., (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Article Google Scholar
Nair, A.V., Pong, V., Dalal, M., Bahl, S., Lin, S., Levine, S. (2018). Visual reinforcement learning with imagined goals. In: Advances in neural information processing systems.
Nasiriany, S., Pong, V., Lin, S., Levine, S. (2019). Planning with goal-conditioned policies. In: Advances in neural information processing systems.
Oh, J., Guo, Y., Singh, S., Lee, H. (2018). Self-imitation learning. In: International conference on machine learning.
Peng, X.B., Kumar, A., Zhang, G., Levine, S. (2019). Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.
Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., Zaremba, W. (2018). Multi-goal reinforcement learning: Challenging robotics environments and request for research.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268), 1–8.
Google Scholar
Schaul, T., Horgan, D., Gregor, K., Silver, D. (2015). Universal value function approximators. In: International conference on machine learning.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
Google Scholar
Wang, Q., Xiong, J., Han, L., sun, p., Liu, H., Zhang, T. (2018). Exponentially weighted imitation learning for batched historical data. In: Advances in neural information processing systems.
Wei, H., Corder, K., Decker, K. (2018). Q-learning acceleration via state-space partitioning. In: International conference on machine learning and applications.
Yang, R., Fang, M., Han, L., Du, Y., Luo, F., Li, X. (2021). MHER: Model-based hindsight experience replay. In: Deep RL Workshop NeurIPS 2021.
Yang, R., Lu, Y., Li, W., Sun, H., Fang, M., Du, Y., Li, X., Han, L., Zhang, C. (2022). Rethinking goal-conditioned supervised learning and its connection to offline RL. In: International conference on learning representations.
Zhang, L., Yang, G., Stadie, B.C. (2021). World model as a graph: Learning latent landmarks for planning. In: International conference on machine learning.

Download references

Acknowledgements

Giovanni Montana acknowledges support from a UKRI AI Turing Acceleration Fellowship (EPSRC EP/V024868/1).

Funding

Giovanni Montana acknowledges support from a UKRI AI Turing Acceleration Fellowship (EPSRC EP/V024868/1).

Author information

Authors and Affiliations

University of Warwick, Coventry, UK
Mianchu Wang, Yue Jin & Giovanni Montana
Alan Turing Institute, London, UK
Giovanni Montana

Authors

Mianchu Wang
View author publications
You can also search for this author in PubMed Google Scholar
Yue Jin
View author publications
You can also search for this author in PubMed Google Scholar
Giovanni Montana
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

Authors’ contributions follow the authors’ order convention.

Corresponding author

Correspondence to Giovanni Montana.

Ethics declarations

Conflict of interest

No competing and financial interests to disclose.

Ethical approval

Not applicable.

Consent to participate

The authors give their consent to participate.

Consent for publication

The authors give their consent for publication.

Additional information

Editors: Fabio Vitale, Tania Cerquitelli, Marcello Restelli, Charalampos Tsourakakis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this section, we show an example to motivate our algorithm and provide more evidence to support the argument in the main paper.

1.1 A numerical example

In this section, we use numerical examples to highlight the advantages of our dual-advantage weighting scheme over the GEAW approach. At its core, our method introduces the concept of ’region-advantage’. This involves segmenting states into ordered regions based on ascending state values. By doing so, we can harness both short-term and traditional goal-conditioned advantages to more effectively weight different behavioral actions. A key strength of our approach is its ability to identify the most effective action towards a goal using the region-advantage value. This is something the GEAW approach might miss, especially when the optimal action doesn’t possess the highest goal-conditioned advantage value due to the complexities of action multi-modality.

In Fig. 12a, four behavioral trajectories are depicted, originating from the starting state (represented by an orange triangle) and aiming for the goal (indicated by a purple star). Due to constraints on time, only two of these trajectories successfully reach the goal. At the starting state, $s_1$, three potential actions are available: $a_1$, $a_2$, and $a_3$. The goal-conditioned advantage values for these actions can be determined through the subsequent calculations.

Given the reward function that only gives $r=1$ at the goal state otherwise $r=0$, and discount factor $\gamma =0.99$. The goal-conditioned state-action values for $s_2, g$ and $a_2, a_3$ are:

$$\begin{aligned} \begin{array}{ll} &{} Q^{\pi _b}(s_2, a_2, g) = 0.99^3 \\ &{} Q^{\pi _b}(s_2, a_3, g) = 0 \\ \end{array} \end{aligned}$$

(38)

The goal-conditioned state value of $s_2$:

$$\begin{aligned} V^{\pi _b}(s_2, g) = \frac{0.99^3 + 0}{2} \approx 0.485 \end{aligned}$$

(39)

The goal-conditioned state-action value for $s_1, g$ and $a_1, a_2$:

$$\begin{aligned} \begin{array}{ll} &{} Q^{\pi _b}(s_1, a_1, g) = 0.99^{10} \approx 0.904 \\ &{} Q^{\pi _b}(s_1, a_2, g) = 0.99^2 \cdot 0.485 \approx 0.475 \\ \end{array} \end{aligned}$$

(40)

According to $A^{\pi _b}(s, a, g) = Q^{\pi _b}(s, a, g) - V^{\pi _b}(s)$, we have:

$$\begin{aligned} \begin{array}{ll} A^{\pi _b}(s_1, a_1, g) > A^{\pi _b}(s_1, a_2, g) \end{array} \end{aligned}$$

(41)

According to the state values, the states can be roughly divided into three regions colored by blue, yellow, and gray. For $s_1$, the next region is the blue one (the states in this region have bigger state values). The target region $Q-$values of $s_1, G$, and $a_1, a_2$ are:

$$\begin{aligned} \begin{array}{ll} &{} {\tilde{Q}}^{\pi _b}(s_1, a_1, G) = 0.99^7 \approx 0.932 \\ &{} {\tilde{Q}}^{\pi _b}(s_1, a_2, G) = 0.99^3 \approx 0.970 \end{array} \end{aligned}$$

(42)

Thus, we have:

$$\begin{aligned} {\tilde{A}}^{\pi _b}(s_1, a_1, G) < {\tilde{A}}^{\pi _b}(s_1, a_2, G). \end{aligned}$$

(43)

From the example, it’s evident that $a_2$ is the optimal action towards the goal. However, due to the trajectory’s failure post $s_2$, its goal-conditioned advantage value, $A^{\pi _b}(s_1,a_2, g)$, is less than that of $A^{\pi _b}(s_1,a_1, g)$. In contrast, our region-advantage value focuses on the action’s advantage towards the subsequent region. This provides a nuanced, short-term assessment of an action’s quality, enabling a more accurate identification of optimal actions compared to relying solely on the goal-conditioned advantage value.

Figure 12b underscores the advantages of integrating both goal-conditioned and region-conditioned advantages. In this illustration, state $s_3$ presents two actions, both of which exhibit identical region-advantage values: ${\tilde{A}}^{\pi _b}(s_1, a_1, G) = {\tilde{A}}^{\pi _b}(s_1, a_3, G)$. Additionally, the goal-conditioned advantage values reveal that $A^{\pi _b}(s_3, a_1, g) > A^{\pi _b}(s_3, a_3, g)$. This suggests that blending goal-conditioned and region-conditioned advantages for re-weighting behavioral actions potentially offers enhanced outcomes compared to solely employing the region-conditioned advantage.

It is important to note that the numerical examples in this section serve primarily as illustrative tools. A more comprehensive and rigorous exploration, encompassing both theoretical and empirical analyses, is detailed in the main body of the paper.

1.2 Assessing GEAW with two value functions

In this experiment, we aim to ascertain if the performance of DAWOG is primarily due to the employment of two value networks with unique initializations. To test this, we substituted DAWOG’s target region advantage with a goal-conditioned advantage. Specifically, we initialized two goal-conditioned state value functions, represented as $\{V_{\theta _i}\}^2_{i=1}$, using distinct random seeds. These were then updated following the GEAW protocol. The policy was optimized using

$$\begin{aligned} J(\pi ) = {\mathbb {E}}_{(s_t, a_t, r_t, g) \sim {\mathcal {D}}_R} \left[ \exp _{clip} (\frac{\beta }{2} \sum ^2_{i=1} A^{\pi _b}_i(s_t, a_t, g)) \log \pi (a_t \mid s_t, g) \right] \end{aligned}$$

where the advantage is calculated as $A^{\pi _b}_i(s_t, a_t, g)= r_t + \gamma V_{\theta _i}(s_{t+1}, g) - V_{\theta _i}(s_t, g)$. We benchmarked this approach against both GEAW and DAWOG across four environments, as depicted in Fig. 13. Our findings did not indicate that DAWOG’s superior performance is solely due to the dual value networks with different initializations.

1.3 Assessing the effects of DAWOG’s target region

In this subsection, we provide further evidence highlighting the efficacy of the dual-advantage in promoting short-term success for the agent. Figure 14 depicts the success rate at which the agent reaches the subsequent target region within a span of ten time steps. When comparing GEAW with DAWOG, it becomes evident that by harnessing the target region advantage, the agent more frequently reaches the target region in fewer steps, thereby progressing closer to the intended goal.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, M., Jin, Y. & Montana, G. Goal-conditioned offline reinforcement learning through state space partitioning. Mach Learn 113, 2435–2465 (2024). https://doi.org/10.1007/s10994-023-06500-z

Download citation

Received: 15 February 2023
Revised: 15 November 2023
Accepted: 16 December 2023
Published: 05 February 2024
Issue Date: May 2024
DOI: https://doi.org/10.1007/s10994-023-06500-z

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Goal-conditioned offline reinforcement learning through state space partitioning

Abstract

Similar content being viewed by others

An offline-to-online reinforcement learning approach based on multi-action evaluation with policy extension

Goal exploration augmentation via pre-trained skills for sparse-reward long-horizon goal-conditioned reinforcement learning

Model-Based Multi-objective Reinforcement Learning with Unknown Weights

Explore related subjects

1 Introduction

2 Related work

3 Preliminaries

4 Methods

4.1 Target region advantage function

Definition 1

Definition 2

Definition 3

Definition 4

4.2 The DAWOG algorithm

4.3 Policy improvement guarantees

Proposition 1

Proof

Proposition 2

Proposition 3

Proof

5 Experimental results

5.1 Tasks and datasets

5.1.1 Grid World

5.1.2 AntMaze navigation

5.1.3 Gym robotics

5.2 Implementation details

5.3 Competing methods

5.4 Performance comparisons and analysis

5.5 Further studies

5.5.1 Learning curves

5.5.2 Region occupancy times

5.5.3 Over-estimation bias

5.5.4 Effects of different target regions

5.5.5 Policy learning with entropy regularization

5.6 Sensitivity to hyperparameters

6 Discussion and conclusions

Availability of data and materials

Availability of code

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Consent to participate

Consent for publication

Additional information

Publisher's Note

Appendix

Appendix

1.1 A numerical example

1.2 Assessing GEAW with two value functions

1.3 Assessing the effects of DAWOG’s target region

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation