1 Introduction

Reinforcement learning (RL) is concerned with optimising sequential decision-making in dynamic environments (Tesauro, 1995; Sutton & Barto, 2018). Typically, RL is used to train autonomous agents to perform complex tasks that rely on long-term decision-making, where the decisions themselves impact future decisions as well as the environment the agent learns in. The agent identifies the optimal sequence of decisions, or actions, through trial-and-error learning, constantly interacting with the environment and adjusting its behaviour based on the rewards received. The end goal is to discover a policy that maximises environmental rewards. By combining RL with the powerful predictive capabilities of neural networks, deep reinforcement learning has produced notable successes in areas such as gaming (Mnih et al., 2013; Hessel et al., 2018), robotics (Kalashnikov et al., 2018; Mahmood et al., 2018) and autonomous driving (Kiran et al., 2022), advancing each year as it garners increasing interest and attention.

Despite the remarkable achievements of RL, its reliance on continuous interaction with the environment restricts its application in areas where data collection is expensive, time-consuming, or hazardous. While simulators can partially alleviate this issue in fields such as robotics and autonomous driving (Todorov et al., 2012), there are numerous situations where these are unavailable, and the trial-and-error nature of RL is clearly unsuitable or even unethical (e.g. in healthcare). Furthermore, these settings often already possess a wealth of data amassed through routine data collection or experimentation, offering a rich information source before an agent even engages in any environmental interaction (Komorowski et al., 2018; Liu et al., 2020; Yu et al., 2021).

The ambition to extend RL into such domains has given rise to offline reinforcement learning (offline-RL) (Lange et al., 2012), a paradigm where agents are restricted from interacting with the environment and must learn exclusively from pre-existing interactions. Conventional RL algorithms typically falter in this offline setting, as the primary method for rectifying errors in action value estimates (i.e. online interaction) is no longer available. This often leads to a complete collapse of the learning process as these errors propagate and compound during training (Fujimoto et al., 2019). Essentially, it is difficult for an agent to accurately assess the value of actions never encountered before, undermining the process of learning a policy based on value estimation.

The most common approach for overcoming this problem is to perform some kind of regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to actions in the underlying data (Levine et al., 2020). To date, numerous approaches have been proposed, ranging from methods that directly target the policy and/or value estimates (Kumar et al., 2019; Wu et al., 2019; Kumar et al., 2020; Nair et al., 2020; Kostrikov et al., 2021; Brandfonbrener et al., 2021) through to those which incorporate models of the environment (Kidambi et al., 2020; Yu et al., 2021; Argenson & Dulac-Arnold, 2020; Janner et al., 2022), each with their own strengths and weaknesses in terms of performance, computational efficiency, reproducibility, hyperparameter optimisation and ease of implementation.

One such approach centres around uncertainty quantification with respect to the estimated value of actions (Abdar et al., 2021). For actions absent from the data, commonly referred to as out-of-distribution (OOD) actions, value estimates are subject to higher uncertainty than for actions present in the data. In online settings, this is often used to improve exploration by being optimistic in the face of uncertainty (Ciosek et al., 2019; Chen et al., 2017). Offline, it is used to stay closer to actions in the data by, conversely, being pessimistic in the face of uncertainty (Buckman et al., 2020). Specifically, action-value estimates are penalised based on their level of uncertainty, in effect guiding the agent towards actions that are high-value/low-variance.

Although there are several techniques available for uncertainty quantification, ensemble-based methods in particular have found favour in offline-RL. SAC-N (An et al., 2021), for example, utilises an ensemble of value functions to approximate a value distribution, using the minimum value across the ensemble to penalise estimates pessimistically, attaining strong performance on offline benchmarks. However, the ensemble size needed to realise this minimum can be excessively large, resulting in substantial computational overhead and scalability issues. While alternative approaches attempt to alleviate this by promoting greater diversification across the ensemble (An et al., 2021) or incorporating elements of conservative value estimation (Ghasemipour et al., 2022), they still remain relatively computationally demanding.

Recognising the potential of ensemble-based approaches to offline-RL, in this work we aim to address this practical obstacle through the use of policy constraints. In offline-RL, policy constraints have been extensively employed as a method for ensuring OOD policy actions stay close to data actions. Here, we investigate their role as a simple method for controlling the effective sample size of OOD actions, thus directly regulating the degree of epistemic uncertainty in the value functions assessed for these actions.

Our findings indicate that when using unconstrained policies, the level of uncertainty in value estimates for OOD actions is relatively low, necessitating the use of large ensemble sizes to accurately estimate the tails of value distributions, and thus achieve the minimal values required for sufficient penalisation. Using a constrained policy, on the other hand, results in increased epistemic uncertainty, proportional to the strength of the constraint and the distance from data actions. Due to the heightened uncertainty, the tails of the value distribution become elongated, allowing similar minimal values to be obtained with a considerably reduced ensemble size. We find this to be the case under two alternative methods for training the ensemble of value functions, namely shared and independent target values.

We leverage these findings as part of two distinct implementations based on existing offline-RL algorithms: TD3-BC-N (an extension of TD3-BC (Fujimoto & Gu, 2021)) and SAC-BC-N (an extension of SAC-N). In both cases, the policy constraint takes the form of behavioural cloning (BC), avoiding the need to explicitly model the behaviour of data actions, with inherent benefits in terms of simplicity and efficiency. Moreover, we use BC to extend these approaches to online fine-tuning, gradually diminishing its influence as the agent interacts with the environment.

Through an extensive empirical evaluation using the D4RL benchmarking suite (Fu et al., 2020), we show both implementations are able to produce state-of-the-art policies in a computationally efficient manner, which can then be fine-tuned during deployment while largely mitigating severe performance drops during the offline-to-online transition. In addition, we find this can be achieved without having to adjust hyperparameters based on data quality, an arguably necessary feature for real-world application where the performance properties of the data may be undetermined. We hope our work highlights the potential of such an approach and provides a useful benchmark for future advancements to be evaluated against. For the purpose of transparency and reproducibility, the code base for this work is made freely available.

The remainder of this manuscript is structured as follows. In Sect. 2 we outline related work on behavioural cloning, uncertainty quantification and online fine-tuning before providing background material in Sect. 3. We present our offline learning and online fine-tuning procedures in Sect. 4 and evaluate them in Sect. 5. We end with a discussion and concluding comments in Sect. 6.

2 Related work

In this section, we provide an overview of related literature on offline-RL and online fine-tuning. With respect to offline-RL, we focus on methods that utilise behavioural cloning and uncertainty estimation as strategies to counteract overestimation bias for out-of-distribution actions. For online fine-tuning, we review methodologies that prioritise both stability and performance.

2.1 Methods based on behavioural cloning

In its most vanilla form, behavioural cloning (BC) is a form of imitation learning designed to mimic the actions of a demonstrator, most commonly an expert (Bain & Sammut, 1995). Its use in offline-RL is primarily to act as a policy constraint, preventing agents from choosing actions that stray too far from the source data.

One way of incorporating BC into offline-RL is through modelling the distribution of actions in the data, commonly referred to as the behaviour policy. In BCQ (Fujimoto et al., 2019), this is achieved using a Variational AutoEncoder (VAE) (Sohn et al., 2015), whose generated actions form the basis of a policy which is then optimally perturbed by a separate network in the DDPG (Lillicrap et al., 2015) framework. This approach is modified by PLAS (Zhou et al., 2020) to train policies within the latent space of the VAE, naturally constraining policies as they are decoded from latent to action space. VAEs are also utilised by BRAC (Wu et al., 2019) and BEAR (Kumar et al., 2019), which instead seek to minimise divergence metrics (Kullback–Leibler, Wasserstein, Maximum Mean Discrepancy) between the behaviour and the learned policy. To account for multimodality, Fisher-BRC (Kostrikov et al., 2021) clones a behaviour policy using Gaussian mixtures and uses this for critic regularisation via the Fisher divergence metric. Implicit Q-learning (IQL) (Kostrikov et al., 2021) combines expectile regression and advantage-weighted BC to train agents without having to evaluate actions outside the data. TD3-BC (Fujimoto & Gu, 2021) favours a minimalist approach, directly incorporating BC into policy updates via a mean squared error between data and policy actions.

Despite their diversity, each of these methodologies effectively addresses overestimation bias, facilitating the learning of a policy that either matches or surpasses the original behaviour. Additionally, they achieve this in a computationally efficient manner, requiring only a limited number of networks and relatively few gradient updates. However, these approaches tend to be overly restrictive, hindering agents’ abilities to discern optimal behaviour from suboptimal data. Consequently, their performance is often inferior to alternative methods (An et al., 2021; Ghasemipour et al., 2022). Nonetheless, as we suggest, these techniques can still be employed in a complementary capacity alongside ensemble-based approaches, improving computational efficiency by fostering uncertainty in OOD value estimates.

2.2 Methods based on uncertainty quantification

As is customary in machine learning, we distinguish between two distinct sources of uncertainty: aleatoric and epistemic (Hullermeier & Waegeman, 2021). The former stems from inherent stochasticity while the latter arises due to incomplete information. In deep learning, various techniques for quantifying both sources of uncertainty have been proposed [for extensive reviews see e.g. (Abdar et al., 2021; Zhou et al., 2022)] and several studies have endeavoured to provide insights in the context of RL [for instance (Eriksson et al., 2022; Charpentier et al., 2022; Lee et al., 2021)]. These preliminary attempts have sought to address various challenges, including mitigating Q-learning instability, achieving equilibrium between exploration and exploitation, and facilitating risk-sensitive sequential decision-making.

In model-free RL, ensemble methods have garnered considerable interest for estimating epistemic uncertainty for action-value estimates. In online-RL, ensembles are frequently employed to enhance exploration by encouraging agents to seek out actions whose estimated values vary the most. This is achieved by constructing a distribution of action-value estimates using the ensemble and acting optimistically with respect to the upper bound, as demonstrated by Chen et al. (2017). In offline-RL these distributions direct agents towards actions within the dataset by, conversely, acting pessimistically with respect to the lower bound, prioritizing actions characterized by high value and low variance.

SAC-N (An et al., 2021), for example, adapts SAC (Haarnoja et al., 2018a, b) to the offline setting by increasing the number of critics from 2 to N, choosing the minimum across the ensemble to penalise action-value estimates that vary the most. While very effective in terms of performance, in some cases the size of the ensemble needed to estimate this minimum is excessively large (up to 500), as is the number of gradient steps required to reach peak performance (up to 3M). Even with parallelisation, this results in considerable computational overhead, both in terms of training time and memory requirements, affecting the capacity to scale up to more complex, real-world problems.

EDAC (An et al., 2021) attempts to reduce ensemble size by increasing uncertainty through diversification. The authors note that, without intervention, the gradients of the critic ensemble tend to align, requiring larger and larger ensembles to achieve sufficient penalisation. To counteract this, EDAC diversifies these gradients by minimising the pair-wise cosine similarity within the ensemble, reducing its size by as much as a factor of ten without compromising performance. However, this diversity regulariser can still be relatively expensive for medium-sized ensembles and the large number of gradient updates remain. Our proposed solution is instead based on increasing uncertainty through the use of policy constraints.

The approach most similar to our own is MSG (Ghasemipour et al., 2022), which also uses an ensemble of critics for uncertainty estimation, but uses conservative Q-learning (CQL) (Kumar et al., 2020) instead of BC to steer agents towards actions in the data. In effect, CQL “pushes down” on value estimates for out-of-distribution actions and “pushes up” for actions in the data. MSG replaces the shared target of SAC-N/EDAC with independent targets to enforce pessimism, and when combined with CQL performs well on challenging benchmarks. However, this performance is still dependent on relatively large ensembles and many gradient steps, with attempts to mitigate this using more efficient means, such as multi-head (Lee et al., 2015) and multi-input/multi-output (Havasi et al., 2020) networks, leading to detrimental impacts on performance. In contrast, our proposed solution emphasises mitigation through the application of BC.

In order to specifically characterise the uncertainty in value estimates for OOD actions, PBRL (Bai et al., 2022) makes use of bootstrapping, sampling actions from the learned policy and penalising value estimates based on their deviation from the mean. Critic updates with respect to these estimates augment those based on non-bootstrapped uncertainty for in-distribution actions. This idea is extended by RORL (Yang et al., 2022), which separately characterises uncertainty for three sets of state-action pairs (those in the data, perturbed states with data actions, and perturbed states with policy actions at those states) in order to smooth Q-value estimates in regions outside the data, with the goal of learning policies that are robust to adversarial attacks. While these approaches are able to capture uncertainty effectively using a much reduced ensemble size (equal to our own), the techniques used to achieve this, most notably bootstrapping, are far less computationally efficient than the BC approach we propose, and less straightforward to implement.

In an attempt to remove the requirement for ensembles entirely, SAC-RND (Nikulin et al., 2023) estimates uncertainty using random network distillation (RND). The authors demonstrate that with an appropriate choice of prior and predictor, RND is able to discriminate between in-distribution and out-of-distribution actions well enough that anti-exploration bonuses can be used to regulate Q-value estimates, allowing agents to learn competitive policies. However, despite the fact that this approach uses only \(N=2\) critic networks, it is still less computationally efficient than our proposed approach, primarily owing to the training associated with the RND component and the comparatively large number of gradient updates.

2.3 Methods for online fine-tuning

Depending on the quality of the dataset, offline trained agents may exhibit limited performance upon deployment, necessitating further online fine-tuning through interaction with the environment. Arguably, the domains that necessitate offline learning in the first place also demand a smooth transition from offline to online learning: improvements in performance should not be preceded by periods of policy degradation. In practice, this presents a formidable challenge due to the sudden distribution shift from offline to online data, which can introduce bootstrapping errors that distort the pre-trained policy (Lee et al., 2020). While continued regularisation can potentially mitigate this issue, it can also hinder the agent’s ability to learn from newly acquired samples. As such, approaches that promote stability as well as performance are desirable.

An initial theoretical study of policy fine-tuning in episodic Markov decision processes (Xie et al., 2021) examines the potential benefits of granting online agents access to a reference policy that is, in a certain sense, already close to an optimal one. The policy expansion scheme proposed in Zhang et al. (2023) attempts to achieve stable learning by using offline-trained policies as potential candidates within a policy set, while REDQ + AdaptiveBC (Zhao et al., 2021) seeks stability through adaptively adjusting the BC component of TD3-BC based on online returns. We make use of a similar approach proposed by Beeson & Montana (2022), which adjusts the influence of BC based on exponential decay, avoiding the need for the prior domain knowledge required by REDQ + AdaptiveBC.

Other related studies have investigated different setups or aspects, such as action-free offline datasets (i.e., datasets without logged actions) (Zhu et al., 2023) or “learning on the job” (Nair et al., 2022) to improve policy generalisation. The feasibility of employing existing off-policy methods to capitalise on offline data through minimal algorithmic adjustments has been examined in Ball et al. (2023). Their findings underscore the significance of sampling mechanisms for offline data, the crucial role of normalising the critic update, and the advantages of large ensembles for improving sample efficiency.

3 Preliminaries

In this section, we present the common RL setup and outline the challenges encountered when adapting algorithms to the offline setting. We then provide details of ensemble-based uncertainty methods we adopt as part of our approach.

3.1 Offline reinforcement learning

We follow standard convention and define a Markov decision process (MDP) with state space S, action space A, transition dynamics \(T(s'\mid s, a)\), reward function \(R(s, a)\) and discount factor \(0 < \gamma \le 1\) (Sutton & Barto, 2018). An agent interacts with this MDP by following a policy \(\pi (a \mid s)\), which can be deterministic or stochastic. The goal of reinforcement learning is to discover an optimal policy \(\pi ^*(a \mid s)\) that maximises the expected discounted sum of rewards,

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}_\pi \sum _{t=0}^\infty \gamma ^t r(s_t, a_t), \end{aligned}$$

also known as the return. In actor-critic methods, this is achieved by alternating between policy evaluation and policy improvement using Q-functions \(Q^\pi (s,a)\), which estimate the value of taking action a in state s and following policy \(\pi\) thereafter. Policy evaluation consists of updating the Q-function (the critic) based on the Bellman expectation equation

$$\begin{aligned} Q^\pi (s, a)=r(s, a) + \gamma {{\,\mathrm{\mathbb {E}}\,}}_{s' \sim T, a' \sim \pi } (Q^\pi (s', a')), \end{aligned}$$

where \(s'\) and \(a'\) are used to denote the next state and next action, respectively. Policy improvement comes in the form of updating the policy (the actor) so as to maximise \(Q(s, a)\).

In terms of objective functions, policy evaluation and policy improvement are defined as, respectively,

$$\begin{aligned} Q^\pi = \underset{Q}{{{\,\mathrm{arg\,min}\,}}}\ {{\,\mathrm{\mathbb {E}}\,}}_{(s, a, s') \sim D} \Big ( Q(s, a) - r(s, a) -\gamma Q^\pi (s', \pi (s')) \Big )^2, \end{aligned}$$
(1)

and

$$\begin{aligned} \pi = \underset{\pi }{{{\,\mathrm{arg\,max}\,}}}\ {{\,\mathrm{\mathbb {E}}\,}}_{s \sim D}\Big [Q(s, \pi (s)) \Big ], \end{aligned}$$
(2)

where \(r(s, a) + \gamma Q^\pi (s', \pi (s'))\) is commonly referred to as the target value.

In practice, both actor and critic are parameterised functions, employing non-linear approximation methods such as neural networks. Parameters are updated according to sample-based estimates, with the samples themselves coming from the agent’s own interactions with the environment. To improve data efficiency, these interactions are stored in a replay buffer D which is constantly added to and sampled from during training. To encourage sufficient exploration of the environment, a level of randomness is injected into online action selection, such as by adding noise if policies are deterministic or sampling if policies are stochastic.

In offline reinforcement learning, also known as batch reinforcement learning (Lange et al., 2012), the agent no longer has access to the environment and instead must learn solely from pre-existing interactions \(D=\{(s_i, a_i, r_i, s'_i)\}\). While it is possible to adapt existing algorithms to this setting by simply removing online interaction, in practice this often leads to highly sub-optimal policies or a complete collapse of the learning process. The primary cause of this is the propagation and compounding of overestimation bias for state-action pairs absent from D (Levine et al., 2020). Such overestimation bias results from the bootstrapped nature of Q-network updates and the maximisation carried out as part of policy improvement.

This can be seen more clearly by examining the general objectives of policy evaluation and improvement. In policy evaluation (1), Q-value estimates for Q(sa) and \(Q(s', a')\) use actions sampled from different policies, namely the behaviour policy \(\pi _\beta (s)\) (i.e. the policy/policies that collected previous interactions), and the learned policy \(\pi (s)\). Errors that appear during policy evaluation propagate to policy improvement (2), biasing actions that maximise spurious Q-values estimates. This then feeds back into policy evaluation, compounding existing errors which then propagate to policy improvement, and so on. In the online setting such bias can be mitigated by trialing policy actions in the environment, observing rewards and correcting Q-value estimates accordingly. In the offline setting this is no longer permitted and hence additional measures must be implemented in order to stabilise training.

3.2 Regularisation through uncertainty estimation

A sensible approach to combating overestimation bias is to target its root cause, namely the Q-value estimates themselves. One tool for achieving this is uncertainty estimation, using the premise that Q-value estimates for out-of-distribution (OOD) actions are inherently more uncertain than those for actions in the data. This uncertainty can be used during training to favour Q-values with low variance in policy evaluation and high value/low variance in policy improvement, in effect guiding the agent towards actions in the vicinity of the data.

This idea forms the basis of approaches such as SAC-N and EDAC. Both use an ensemble of N Q-functions to approximate Q-value distributions, updating network parameters using the minimum across the ensemble for policy actions \(\pi (s)\). In terms of the general objectives for policy evaluation and improvement, these become, respectively:

$$\begin{aligned} Q^\pi _i = \underset{Q}{{{\,\mathrm{arg\,min}\,}}}\ {{\,\mathrm{\mathbb {E}}\,}}_{(s, a, s') \sim D} \left( Q_i(s, a) - r(s, a) -\gamma \min _{i=1,\ldots , N}Q^\pi _i(s', \pi (s')) \right) ^2, \end{aligned}$$
(3)

and

$$\begin{aligned} \pi = \underset{\pi }{{{\,\mathrm{arg\,max}\,}}}\ {{\,\mathrm{\mathbb {E}}\,}}_{s \sim D}\left[ \min _{i=1,\ldots , N}Q_i(s, \pi (s)) \right] . \end{aligned}$$

Alternatively, as is done in MSG, each Q-function can be updated towards its own (rather than a shared) target value, giving a modified policy evaluation objective of:

$$\begin{aligned} Q^\pi _i = \underset{Q}{{{\,\mathrm{arg\,min}\,}}}\ {{\,\mathrm{\mathbb {E}}\,}}_{(s, a, s') \sim D} \left( Q_i(s, a) - r(s, a) -\gamma Q^\pi _i(s', \pi (s')) \right) ^2. \end{aligned}$$
(4)

Using uncertainty estimation in this way constitutes a pessimistic approach to offline-RL. By using the minimum across the ensemble, Q-value estimates for OOD actions are penalised according to their level of uncertainty. By increasing the size of the ensemble, the minimum is realised more accurately, and hence with large enough N the level of penalisation is sufficient to prevent overestimation bias. In practice, such approaches attain strong performance, but the size of the ensemble required to accurately estimate this minimum is often very large, necessitating considerable computational resources.
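For illustration, the following PyTorch-style sketch shows how the pessimistic targets in (3) and (4) could be computed from an ensemble of target critics. It is a minimal sketch rather than any published implementation; each critic is assumed to be callable as q(state, action).

```python
import torch

def pessimistic_targets(target_critics, reward, next_state, next_action,
                        gamma=0.99, shared=True):
    """Compute TD targets for an ensemble of N critics.

    shared=True  -> Eq. (3): every critic regresses onto the ensemble minimum.
    shared=False -> Eq. (4): each critic regresses onto its own target.
    """
    # Evaluate all N target critics on the next state-action pair: shape (N, batch, 1).
    next_q = torch.stack([q(next_state, next_action) for q in target_critics])
    if shared:
        # Broadcast the ensemble minimum so every critic shares the same target.
        next_q = next_q.min(dim=0, keepdim=True).values
    return reward + gamma * next_q
```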

4 Policy constrained critic ensembles

The key issue we seek to address in this work is the high computational cost of ensemble-based approaches to offline reinforcement learning, approaches that are otherwise very effective due to their strong performance and straightforward implementation. These costs primarily stem from the need to use large ensembles to obtain accurate estimates of lower bounds, which form the basis of penalties applied to Q-value estimates for OOD actions.

As demonstrated by An et al. (2021), the strength of these penalties depends on both the size of the ensemble and the magnitude of the standard deviation. Using the same example for illustrative purposes [itself based on Royston (1982)], if \(Q(s, a)\) follows a Gaussian distribution with mean \(\mu (s, a)\) and standard deviation \(\sigma (s, a)\), the approximate expected minimum of a set of N realisations is given by:

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}\left[ \underset{j=1,\ldots , N}{\min }\ Q_j(s, a) \right] \approx \mu (s, a) - \Phi ^{-1} \left( \frac{N-\frac{\pi }{8}}{N - \frac{\pi }{4} + 1} \right) \sigma (s, a), \end{aligned}$$
(5)

where \(\Phi\) is the cumulative distribution function of the standard Gaussian.
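To build intuition for how the expected penalty in (5) scales with N and \(\sigma (s, a)\), the following stand-alone sketch evaluates the approximation for a few illustrative values; the numbers are purely indicative.

```python
import math
from statistics import NormalDist

def expected_min_penalty(n, sigma):
    """Approximate penalty mu - E[min_j Q_j] for a Gaussian ensemble of size n (Eq. 5)."""
    z = NormalDist().inv_cdf((n - math.pi / 8) / (n - math.pi / 4 + 1))
    return z * sigma

for n in (2, 10, 50, 500):
    print(f"N = {n:3d}: expected penalty ~ {expected_min_penalty(n, sigma=1.0):.2f} * sigma(s, a)")
```

The Gaussian quantile in (5) grows slowly with N, so compensating for a small standard deviation requires a disproportionately large ensemble, whereas increasing \(\sigma (s, a)\) scales the penalty directly.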

In general, the distribution of \(Q(s, a)\) is unknown, but the same basic principles apply. In SAC-N, the size of the ensemble needed to sufficiently penalise Q-value estimates is high, as the standard deviation across the ensemble (i.e. the level of uncertainty) is relatively small. In order to achieve similar levels of penalisation with a reduced ensemble size, the level of uncertainty across the ensemble must be increased. In EDAC this is achieved by diversifying the ensemble, and in MSG by using conservative Q-learning.

Our proposed method for increasing this uncertainty is based on policy constraints. We note that, although policy constraints are primarily used to steer agents towards actions in the data, they also affect the level of uncertainty of Q-value estimates for OOD actions. By constraining the policy, the Q-ensemble is trained on actions closer to the data, in effect reducing the effective sample size of OOD actions, which in turn increases the epistemic uncertainty of their Q-value estimates. The stronger the constraint, the greater the level of uncertainty as the tails of the value distribution expand. Thus, policy constraints provide an additional mechanism for controlling uncertainty in Q-value estimates, which can be used to achieve sufficient levels of penalisation with a much reduced ensemble size.

With this in mind, we modify existing ensemble-based approaches to directly incorporate behavioural cloning into policy updates, in a similar vein to TD3-BC (Fujimoto & Gu, 2021). While many other approaches for constraining policies exist (see Sect. 2), we favour this one in particular as it requires no explicit modelling of the behaviour policy \(\pi _\beta\), is straightforward to implement, computationally cheap and flexible enough to accommodate both deterministic and stochastic policies, and requires no changes to policy evaluation using either shared (3) or independent (4) targets.

Let \(\rho (a)\) be a function representing a divergence metric between policy and data actions a. The general policy improvement objective becomes:

$$\begin{aligned} \pi = \underset{\pi }{{{\,\mathrm{arg\,max}\,}}}\ ~ {{\,\mathrm{\mathbb {E}}\,}}_{(s, a) \sim D} \left[ \min _{i=1,\ldots , N}Q_i(s, \pi (s)) - ~ \beta \rho (a) \right] . \end{aligned}$$
(6)

The hyperparameter \(\beta\) controls the balance between RL and BC, and by extension the level of uncertainty in Q-value estimates for OOD actions. Lower values favour RL but also lead to lower levels of uncertainty. Higher values increase uncertainty, but tip the balance towards BC, making it more difficult for the agent to discover high-value actions that lie beyond the data. Thus, the aim is to find a value of \(\beta\) that induces enough uncertainty without being too restrictive, allowing sufficient penalisation of Q-value estimates using a smaller ensemble.

Regardless of the form of \(\rho (a)\), the balance in (6) is highly sensitive to Q-value estimates, which scale with rewards and vary across tasks. Therefore, to keep this balance in check, following the example of TD3-BC we normalise estimates by dividing by the mean of the absolute values, such that:

$$\begin{aligned} Q_{norm}(s, \pi (s)) = \frac{Q(s, \pi (s))}{{{\,\mathrm{\mathbb {E}}\,}}_{s \sim D} \mid Q(s, \pi (s)) \mid }. \end{aligned}$$

So far we have presented our approach within the general actor-critic framework, outlining the changes to policy evaluation and policy improvement from incorporating ensemble methods and behavioural cloning. In Sects. 4.1 and 4.2 we present two specific versions based on TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018b), respectively, which are then evaluated in Sect. 5 alongside our fine-tuning approach detailed in Sect. 4.3.

4.1 TD3-BC-N

Twin Delayed Deep Deterministic Policy Gradient (TD3) is an approach to reinforcement learning that proposes a number of techniques for addressing function approximation error in actor-critic methods, most notably DDPG. Based on a deterministic policy, TD3 makes use of a dual critic network for policy evaluation and updates Q-functions and policies at a ratio of 2:1. As is common with Q-learning approaches, target networks are used to stabilise training during policy evaluation. Exploration comes in the form of noise sampled from a Gaussian distribution.

We modify the baseline TD3 algorithm by increasing the number of critics from 2 to N and adding a BC term to policy updates in the form of a mean squared error (similar to TD3-BC). Corresponding parameter updates and notation are as follows. Let \(\theta _{i}\) and \(\theta _{i}^{'}\) represent the parameters of the ith Q-network and target Q-network, respectively, and \(\phi\) and \(\phi '\) represent the parameters for a policy network and target policy network, respectively. Let \(\beta\) represent the BC coefficient, N the ensemble size, \(\tau\) the target network update rate, \(\epsilon\) policy noise and B a sample of transitions from dataset D.

Each Q-network update is performed through gradient descent. For shared target values, we use:

$$\begin{aligned} \nabla _{\theta _i}\frac{1}{\vert B \vert } \sum _{(s, a, r, s') \sim B} \left( Q_{\theta _i}(s, a) - r - \gamma \min _{i=1,\ldots , N}Q_{\theta _i^{'}}(s', a')\right) ^2, \end{aligned}$$
(7)

and for individual target values:

$$\begin{aligned} \nabla _{\theta _i}\frac{1}{\vert B \vert } \sum _{(s, a, r, s') \sim B} \left( Q_{\theta _i}(s, a) - r - \gamma Q_{\theta _i^{'}}(s', a')\right) ^2. \end{aligned}$$
(8)

In either case \(a'=(\pi _{\phi '}(s') + \text {noise})\) with noise sampled from an \(N(0, \epsilon )\) distribution. The policy network update is performed through gradient ascent using:

$$\begin{aligned} \nabla _{\phi }\frac{1}{\vert B \vert } \sum _{(s, a) \sim B} \min _{i=1,\ldots , N}Q_{\theta _i}\left( s, \pi _\phi (s)\right) - \beta \left( \pi _\phi (s) - a \right) ^2. \end{aligned}$$
(9)

Target networks are updated using Polyak averaging:

$$\begin{aligned} \theta ^{'}_i&\leftarrow \tau \theta _i + (1 - \tau ) \theta ^{'}_i \\ \phi ^{'}&\leftarrow \tau \phi + (1 - \tau ) \phi ^{'}. \end{aligned}$$
(10)

The final procedure is presented in Algorithm 2.

(Algorithm 2: TD3-BC-N offline training procedure; pseudocode figure not reproduced)
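For concreteness, a condensed PyTorch-style sketch of a single TD3-BC-N update is given below, combining the shared-target critic update (7), the BC-regularised and Q-normalised actor update (9) and the Polyak averaging in (10). Function and variable names are illustrative only and should not be read as the released implementation.

```python
import torch
import torch.nn.functional as F

def td3_bc_n_step(batch, actor, actor_target, critics, critic_targets,
                  actor_opt, critic_opt, beta, gamma=0.99, tau=0.005,
                  policy_noise=0.2, noise_clip=0.5, update_actor=True):
    """One TD3-BC-N training step. `critic_opt` is assumed to cover all N critics."""
    s, a, r, s_next = batch  # tensors of shape (B, dim), rewards of shape (B, 1)

    # Critic update with shared target (Eq. 7).
    with torch.no_grad():
        noise = (torch.randn_like(a) * policy_noise).clamp(-noise_clip, noise_clip)
        a_next = (actor_target(s_next) + noise).clamp(-1.0, 1.0)
        q_next = torch.stack([q(s_next, a_next) for q in critic_targets])  # (N, B, 1)
        target = r + gamma * q_next.min(dim=0).values                      # shared minimum
    q_pred = torch.stack([q(s, a) for q in critics])                       # (N, B, 1)
    critic_loss = F.mse_loss(q_pred, target.expand_as(q_pred))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update with BC term (Eq. 9); 2:1 critic-to-actor ratio.
    if update_actor:
        pi = actor(s)
        q_pi = torch.stack([q(s, pi) for q in critics]).min(dim=0).values
        lam = 1.0 / q_pi.abs().mean().detach()          # Q-normalisation (Sect. 4)
        actor_loss = -(lam * q_pi).mean() + beta * F.mse_loss(pi, a)
        actor_opt.zero_grad()
        actor_loss.backward()        # stale critic grads are cleared at the next critic update
        actor_opt.step()

        # Polyak averaging of critic and actor targets (Eq. 10).
        for net, net_t in zip(list(critics) + [actor], list(critic_targets) + [actor_target]):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
    return critic_loss.item()
```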

4.2 SAC-BC-N

Soft Actor-Critic (SAC) is a maximum entropy approach to reinforcement learning. Based on a stochastic policy, SAC augments the standard policy evaluation and improvement objectives of actor-critic methods with an entropy regulariser, in effect encouraging agents to maximise returns while acting as randomly as possible. This helps boost exploration, which comes in the form of sampling actions from the policy. Like TD3, SAC uses a dual critic with target networks to promote stability but uses a critic to actor update ratio of 1:1.

We modify the baseline SAC algorithm by increasing the number of critics from 2 to N and by adding a BC term to policy updates. Since the policy is stochastic, this BC term can take the form of either a mean-squared error or log-likelihood. Corresponding parameter updates and notation are as follows. Let \(\theta _{i}\) and \(\theta _{i}^{'}\) represent the parameters of the ith Q-network and target Q-network, respectively, and \(\phi\) represent the parameters for a policy network. Let \(\alpha\) represent the entropy coefficient, \({\mathcal {H}}\) the minimum entropy, \(\beta\) the BC coefficient, N the ensemble size, \(\tau\) the target network update rate and B a sample of transitions from dataset D.

Each Q-network update is performed through gradient descent. For shared target values we use:

$$\begin{aligned} \nabla _{\theta _i} \frac{1}{\vert B \vert } \sum _{\begin{array}{c} (s, a, r, s') \sim B \\ a' \sim \pi _\phi (s') \end{array}} \left( Q_{\theta _i}(s, a) - r - \gamma \min _{i=1,\ldots , N} Q_{\theta _i^{'}}(s', a') + \gamma \alpha \log \pi _\phi (a' \mid s') \right) ^2, \end{aligned}$$
(11)

and for individual target values:

$$\begin{aligned} \nabla _{\theta _i} \frac{1}{\vert B \vert } \sum _{\begin{array}{c} (s, a, r, s') \sim B \\ a' \sim \pi _\phi (s') \end{array}} \left( Q_{\theta _i}(s, a) - r - \gamma Q_{\theta _i^{'}}(s', a') + \gamma \alpha \log \pi _\phi (a' \mid s') \right) ^2. \end{aligned}$$
(12)

The policy network update is performed through gradient ascent. For mean-squared error we use:

$$\begin{aligned} \nabla _{\phi }\frac{1}{\vert B \vert } \sum _{\begin{array}{c} (s, a) \sim B \\ a_p \sim \pi _\phi (s) \end{array}} \min _{i=1,\ldots , N}Q_{\theta _i}\left( s, a_p \right) - \alpha \log \pi _\phi (a_p \mid s) - \beta \left( \pi _\phi (s) - a \right) ^2. \end{aligned}$$
(13)

and for log-likelihood:

$$\begin{aligned} \nabla _{\phi }\frac{1}{\vert B \vert } \sum _{\begin{array}{c} (s, a) \sim B \\ a_p \sim \pi _\phi (s) \end{array}} \min _{i=1,\ldots , N}Q_{\theta _i}\left( s, a_p \right) - \alpha \log \pi _\phi (a_p \mid s) + \beta \log \pi _\phi (a \mid s). \end{aligned}$$
(14)

The entropy coefficient update is performed through gradient ascent using:

$$\begin{aligned} \nabla _{\alpha } \frac{1}{\vert B \vert } \sum _{\begin{array}{c} s \sim B \\ a_p \sim \pi _\phi (s) \end{array}} \alpha \left( \log \pi _\phi (a_p \mid s) + {\mathcal {H}} \right) . \end{aligned}$$
(15)

Target networks are updated using Polyak averaging:

$$\begin{aligned} \theta ^{'}_i \leftarrow \tau \theta _i + (1 - \tau ) \theta ^{'}_i. \end{aligned}$$
(16)

The final procedure is presented in Algorithm 4.

(Algorithm 4: SAC-BC-N offline training procedure; pseudocode figure not reproduced)
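A minimal sketch of the SAC-BC-N policy objective (negated for gradient descent) is shown below, highlighting the two forms of the BC term in (13) and (14). The actor interface is an assumption made for illustration and does not reflect any particular code base.

```python
import torch

def sac_bc_policy_loss(actor, critics, alpha, beta, s, a_data, bc="mse"):
    """SAC-BC-N policy objective, negated for gradient descent (Eqs. 13 and 14).

    Assumed (hypothetical) actor interface:
      actor.sample(s)      -> reparameterised action and its log-probability
      actor.mean_action(s) -> deterministic (tanh-squashed) mean action
      actor.log_prob(s, a) -> log-likelihood of data action a under the policy
    """
    a_pi, log_pi = actor.sample(s)
    q_min = torch.stack([q(s, a_pi) for q in critics]).min(dim=0).values
    loss = -(q_min - alpha * log_pi).mean()              # maximise Q minus entropy penalty
    if bc == "mse":                                      # Eq. (13): mean-squared error cloning
        loss = loss + beta * (actor.mean_action(s) - a_data).pow(2).mean()
    else:                                                # Eq. (14): log-likelihood cloning
        loss = loss - beta * actor.log_prob(s, a_data).mean()
    return loss
```

As discussed in Sect. 5.2, the choice between the two forms of the cloning term is made per task.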

4.3 Stable online fine-tuning

The main goal in offline-RL is to discover optimal behaviour from existing data sets, allowing agents to learn effective policies before deployment in the environment. Following deployment, however, agents can collect more information about the environment, presenting opportunities for continued improvement via online fine-tuning. As agents can now correct value estimates through online interaction, it may seem natural to remove constraints imposed during offline learning, but in practice this can often result in an initial phase of policy degradation due to the abrupt transition from constrained to unconstrained learning (see Sect. 2). In many situations, such degradation is deemed undesirable, emphasising the need for approaches that prioritise stability alongside performance.

During the transition from offline to online learning, an agent’s policy should exhibit consistent improvement, surpassing its offline performance without experiencing periods of substantial deterioration. Our approach is well-suited to accomplishing these objectives. First, by making minimal modifications to existing algorithms, we largely preserve the core characteristics that contribute to their success online. Second, our utilisation of BC offers a convenient mechanism for stabilising the transition by gradually reducing its influence over time. Numerous methods can achieve this, but for simplicity we adopt an approach based on exponential decay as in Beeson & Montana (2022). Let \(\beta _{start}\) and \(\beta _{end}\) be the initial and final values of the BC coefficient \(\beta\), respectively, and S the number of decay steps. The exponential decay rate \(\kappa _\beta\) is given by:

$$\begin{aligned} \kappa _{\beta }=\exp \left[ \frac{1}{S}\log \left( \frac{\beta _{end}}{\beta _{start}}\right) \right] . \end{aligned}$$
(17)
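For illustration, the decay rate in (17) can be applied per environment step as in the short sketch below. The values of \(\beta _{start}\) and \(\beta _{end}\) shown are placeholders (task-specific values are given in the Appendix); the decay window and interaction budget follow Sect. 5.5.

```python
import math

def beta_schedule(beta_start, beta_end, decay_steps, total_steps):
    """Exponentially decay the BC coefficient beta towards beta_end (Eq. 17)."""
    kappa = math.exp(math.log(beta_end / beta_start) / decay_steps)
    beta = beta_start
    for step in range(total_steps):
        yield beta
        if step < decay_steps:  # hold beta at beta_end once the decay window has elapsed
            beta *= kappa

# Illustrative usage: decay over the first 50k of 250k online interactions.
betas = list(beta_schedule(beta_start=1.0, beta_end=0.05,
                           decay_steps=50_000, total_steps=250_000))
```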

Determining the appropriate use of existing data is also an important aspect of online fine-tuning. One option is to supplement the existing data with new transitions, enabling a seamless transition as the agent gradually acquires new information online. However, if the original data is sub-optimal, the online fine-tuning process may be slow, as the agent’s offline-trained policy is not fully utilised. Alternatively, discarding the data allows the agent to improve its policy without being hampered by data it has already improved upon. However, this could compromise stability in the initial stages due to limited experience and a paucity of data. We propose an approach that strikes a balance, adding new transitions to a portion of the original data before training. We outline this fine-tuning procedure using TD3-BC-N in Algorithm 3. The corresponding procedure for SAC-BC-N is provided in the Appendix.

(Algorithm 3: online fine-tuning procedure for TD3-BC-N; pseudocode figure not reproduced)

5 Experimental results

In this section, we present a comprehensive evaluation of our offline learning and online fine-tuning procedures using the open-source D4RL benchmarking suite. Section 5.1 provides an overview of this benchmark and the domains we consider, with Sect. 5.2 outlining implementation details. In Sect. 5.3 we investigate our claims regarding the impact of policy constraints on uncertainty estimation, and examine the trade-off between ensemble size and level of constraint. This is followed by a comparison of performance and computational efficiency in Sect. 5.4, as well as a number of supplementary experiments to highlight the importance of individual components and implementation choices. We end in Sect. 5.5 with an assessment of our fine-tuning strategy.

5.1 Benchmark datasets

D4RL is a popular resource for benchmarking offline reinforcement learning algorithms. The suite contains a wide range of tasks and data sets designed to test an agent’s ability to learn effective policies in various settings. We outline the domains considered in this work, and refer the reader to the original paper for further details (Fu et al., 2020).

  • MuJoCo. This setting makes use of the hopper, halfcheetah and walker2d environments of the MuJoCo physics simulator (Todorov et al., 2012), assessing how well agents learn from sub-optimal and/or narrow data distributions. Each environment has four associated data sets: “expert”, which contains transitions collected from an agent trained to expert level using SAC; “medium”, which contains transitions collected from an agent trained to 1/3 of expert level using SAC; “medium-replay”, which contains the transitions used to train the medium-level agent; and “medium-expert”, which contains the combined transitions from “medium” and “expert”. In general this setting is considered one of the easier in the benchmark, with environments having well-defined reward structures and data sets comprising a decent proportion of near-optimal trajectories.

  • Maze2D. This setting involves moving a force-actuated ball to a fixed target location. Data is collected via a controller which starts and ends at random goal locations. The purpose of this setting is to test an agent’s ability to stitch together previous trajectories to reach the evaluation goal. There are three increasingly difficult mazes: “umaze”, “medium” and “large”. We focus on the more challenging sparse reward setting, in which the agent receives a reward of 1 when within a 0.5 unit radius of the target goal and 0 otherwise.

  • AntMaze. This setting replaces the ball from Maze2D with a more complex Ant robot, with episodes terminating once the Ant reaches the goal location. Data is collected via a controller using two different methods: “play”, in which the controller moves from hand-picked starting locations to hand-picked goals; and “diverse”, in which the controller moves from random starting locations to random goals. This setting is considered one of the more challenging, as agents must learn to both control the Ant and stitch trajectories together using only sparse rewards.

  • Adroit. This setting makes use of the Adroit environment, controlling a high-dimensional robotic hand to perform specific tasks. The aim is to assess whether agents can learn from narrow data distributions (“cloned”) and human demonstrations (“human”) with sparse rewards. We focus on the “pen” task as, similar to other approaches, this is the only task in which notable performance is achieved (see Appendix).

5.2 Implementation details

Following the protocol of D4RL, we train agents using offline data sets and evaluate their performance in the simulated environment. Performance is measured in terms of normalised score, with 0 and 100 representing random and expert policies, respectively. Each experiment is repeated across five random seeds with reported results the mean normalised score ± one standard error across 50 evaluations for MuJoCo and 500 evaluations for Maze2d/AntMaze/Adroit (10 and 100 evaluations per seed, respectively).

For both TD3-BC-N and SAC-BC-N, each Q-network comprises a 3-layer MLP with ReLU activation functions and 256 nodes per layer, taking as input a state-action pair and outputting a Q-value. For TD3-BC-N, the policy network comprises a 3-layer MLP with ReLU activation functions and 256 nodes per layer, taking as input a state and outputting an action bound to [\(-1, 1\)] via a tanh transformation. For SAC-BC-N, the policy network comprises the same architecture but instead outputs the mean and standard deviation of a Gaussian distribution, whose samples are also bound to [\(-1, 1\)] via a tanh transformation. Each approach retains the hyperparameter values of its online counterpart (full details are provided in the Appendix).
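For reference, one possible reading of the critic and deterministic policy architectures described above is sketched below in PyTorch; the stochastic SAC-BC-N policy head and all training logic are omitted, and details such as weight initialisation may differ from the released code.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q-network: 3-layer MLP, ReLU activations, 256 hidden units per layer."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class DeterministicActor(nn.Module):
    """TD3-BC-N policy: 3-layer MLP, ReLU, action bounded to [-1, 1] via tanh."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)
```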

Across all data sets, we train agents for 1M gradient steps using an ensemble size of \(N=10\). To help stabilise training for narrow data distributions, we inflate the value of the BC coefficient \(\beta\) by a factor of 10 for the first 50k gradient steps (alternatively, the policy can be updated using only BC). We use shared targets for MuJoCo and Maze2D tasks and independent targets for AntMaze and Adroit. We investigate the impact of each of these design decisions as part of our ablation studies.

For the BC component, we find the characteristics of each environment necessitate varying intensities, and for SAC-BC-N dictate its form (mean-squared error or log-likelihood). We therefore adjust its intensity and/or form based on task type, but to better reflect real-world scenarios where the quality of the data is often unknown, we prohibit adjustments within the same task. Values for each task and data set are provided in the Appendix.

5.3 The impact of policy constraints on uncertainty

Before we consider the full range of tasks and data sets, we first investigate the claims made in previous sections relating to the impact of policy constraints on the level of uncertainty in Q-value estimates for OOD actions. To do this, we train a number of agents using TD3-BC-N with shared target values across a range of N and \(\beta\) on the “hopper-medium-expert” dataset, and examine the performance of the resulting policies and the uncertainty of Q-value estimates from the resulting ensembles.

Beginning with performance, we summarise this via a heatmap in Fig. 1, using shade to represent mean normalised score. For the lowest values of \(\beta\) we see that larger ensembles are required to prevent overestimation bias through sufficient penalisation of OOD actions. As the value of \(\beta\) increases, the size of the ensemble required to achieve this level of penalty decreases, allowing the same level of performance to be attained as for larger ensembles. We also see that the larger the value of N, the smaller the value of \(\beta\) at which performance starts to degrade. In these cases, the level of uncertainty resulting from both a large ensemble and a strong policy constraint leads to over-penalisation of Q-value estimates, in effect driving the agent towards actions in the data at an increased rate.

Fig. 1

Performance as a function of N and \(\beta\). Lower values of \(\beta\) require larger values of N and smaller values of N require higher values of \(\beta\). If both N and \(\beta\) are large, the uncertainty in Q-value estimates for OOD actions is too high, and thus the penalty applied too severe, leading the agent to prefer actions similar to those of the data

In terms of uncertainty of Q-value estimates, we consider both the standard deviation across the ensemble and the clip penalty \(Q_{clip}(s, a)\), which measures the size of the difference between the mean and minimum:

$$\begin{aligned} Q_{clip}(s, a) = \frac{1}{N} \sum _{j=1}^{N} Q_j(s, a) - \min _{j=1,\ldots , N}Q_j(s, a). \end{aligned}$$

In particular, we examine how each of these measures of uncertainty varies according to how far actions are from the data and the values of N and \(\beta\).

To this effect, we sample 50,000 states from the data and 50,000 actions from a random policy and calculate (a) the Euclidean distance between random and data actions and (b) the standard deviation/clip penalty. We then group distances into equally sized bins and within each bin calculate the average standard deviation/clip penalty. We summarise results for \(N=10\) and \(N=50\) in Fig. 2 via heatmaps, using shade to represent the size of the corresponding uncertainty metric. Similar plots for \(N=[2, 5, 20]\) can be found in the Appendix. In general, we see that as the distance between random and data actions increases, so too does the level of uncertainty (standard deviation and clip penalty), and this becomes more pronounced as the value of \(\beta\) increases. This supports our hypothesis that policy constraints can be used to control uncertainty in Q-value estimates. We also see that the highest levels of uncertainty occur when both N and \(\beta\) are large, supporting our explanation of the declining performance observed in Fig. 1.
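The sketch below illustrates how such a diagnostic could be computed; the helper is hypothetical, uses equal-width distance bins for simplicity and assumes each critic is callable as q(state, action).

```python
import numpy as np
import torch

@torch.no_grad()
def binned_uncertainty(critics, states, data_actions, random_actions, n_bins=20):
    """Ensemble standard deviation and clip penalty, averaged within distance bins.

    `critics` is the trained Q-ensemble; `random_actions` are drawn from a random
    policy and compared against the corresponding `data_actions`.
    """
    q = torch.stack([c(states, random_actions) for c in critics]).squeeze(-1)  # (N, M)
    std = q.std(dim=0).cpu().numpy()
    clip_penalty = (q.mean(dim=0) - q.min(dim=0).values).cpu().numpy()
    dist = torch.norm(random_actions - data_actions, dim=-1).cpu().numpy()

    edges = np.linspace(dist.min(), dist.max(), n_bins + 1)
    idx = np.clip(np.digitize(dist, edges) - 1, 0, n_bins - 1)
    # Empty bins yield NaN averages and can simply be ignored when plotting.
    return [(std[idx == b].mean(), clip_penalty[idx == b].mean()) for b in range(n_bins)]
```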

Fig. 2

Uncertainty as a function of distance, N and \(\beta\). Top row: standard deviation; bottom row: clip penalty. As the distance between random and data actions increases, so too does the level of uncertainty, becoming more pronounced as \(\beta\) and N get larger. White space is used to represent erroneous values due to unreliable Q-value estimates resulting from divergent critic loss during training

Finally, we also examine the distribution of the minimum across the ensemble, \(Q_{min}\), as this is the value used in updates during policy evaluation and policy improvement. Using the same format as for uncertainty, we summarise results for \(N=10\) and \(N=50\) in Fig. 3, using shade to represent the value of \(Q_{min}\). In general, we see that \(Q_{min}\) decreases as the distance between random and data actions increases, with the decrease more pronounced as either N or \(\beta\) increases. This culminates in the lowest \(Q_{min}\) values when the size of the ensemble and the level of constraint are at their highest, mirroring the findings based on the uncertainty measures.

Fig. 3

\(Q_{min}\) as a function of distance, N and \(\beta\). As the distance between random and data actions increases, \(Q_{min}\) decreases, with this decrease more pronounced as \(\beta\) and N get larger. White space is used to represent erroneous values due to unreliable Q-value estimates resulting from divergent critic loss during training

For completeness, we reproduce these plots for agents trained using independent target values in the Appendix, finding in general the same features. We also provide additional plots examining (a) the distribution of \(Q_{min}\) for policy actions and (b) the shape of the distribution of Q-value estimates for individual actions, providing more insights into the impact of \(\beta\) on uncertainty.

5.4 Performance and efficiency comparisons

As one of our objectives is to attain the same level of performance as ensemble-based methods, we compare to published results from SAC-N, EDAC and MSG. As the leading BC-based approaches, we also compare to published results from IQL and TD3-BC. Finally, since MSG makes use of CQL, we also compare to the updated CQL results published in the IQL paper. For completeness, we provide additional comparisons to approaches outlined in Sect. 2 in the Appendix.

We present results for all tasks and data sets in Table 1. Where figures are not published for a given task we denote the entry as “–”. To help better visualise performance levels, we compare our results to the best performing method in Fig. 4, which with a few exceptions is SAC-N/EDAC for MuJoCo and Adroit, and MSG for maze tasks. For the MuJoCo and Adroit environments, we see that in general both TD3-BC-N and SAC-BC-N can match the performance of SAC-N and EDAC, and for Maze2D and AntMaze they can match the performance of MSG. Note that this is achieved without adjusting hyperparameters within the same task, in contrast to SAC-N, EDAC and MSG. In the Appendix we investigate the effect of removing this restriction using the MuJoCo environments, finding performance can be slightly enhanced.

For the MuJoCo domain in particular we note there is very little variation in performance across seeds/evaluations, demonstrating our approach is able to learn robust as well as performant policies. This is further evidenced in Fig. 5 where we plot the percentage difference between the mean and worst score across the 50 evaluations, which in most cases is negligible. Since real-world application will typically only involve single policy deployment, such a property is highly desirable.

Table 1 Performance comparison across D4RL benchmark
Fig. 4

Comparing the performance of TD3-BC-N (green) and SAC-BC-N (red) against the best method from Table 1 (blue). Performance is competitive across all tasks (Color figure online)

Fig. 5

Evaluating the robustness of learnt policies for MuJoCo tasks. Each plot shows the percentage difference between the mean and worst performing episode across 50 evaluations (10 evaluations per 5 seeds). With the exception of one data set, both TD3-BC-N and SAC-BC-N are able to produce robust policies regardless of data quality

After demonstrating our approach can match state-of-the-art alternatives in terms of performance, we turn our attention to computational efficiency. To ensure a fair comparison, we implement our own versions of the baselines based on the authors’ published source code and the CORL repository (Tarasov et al., 2022), and run them on the same hardware/software configuration. We use exactly the same network architecture across ensemble-based approaches, training each member of the ensemble in parallel. For CQL, IQL and TD3-BC we use the network architectures as described in their respective papers. Full details are provided in the Appendix.

In Fig. 6 we plot the training time in hours of each approach, considering several variations of SAC-N, EDAC and MSG based on ensemble size, which varies according to task type. We see that TD3-BC-N and SAC-BC-N are easily the most efficient among the ensemble-based approaches, a direct consequence of a smaller ensemble size and the need for fewer gradient updates to reach peak performance. In particular, the computation time for TD3-BC-N is comparable to that of the minimalist TD3-BC.

Fig. 6

Computational efficiency. A smaller ensemble size coupled with fewer gradient updates allows TD3-BC-N and SAC-BC-N to significantly reduce computation time to levels similar to that of more minimalist approaches such as TD3-BC

To get a clearer sense of how performance and efficiency compare across algorithms, in Fig. 7 we plot the average training time and normalised score for MuJoCo and AntMaze tasks. We see that ensemble-based approaches (SAC-N, EDAC, MSG) are the most performant, but also the most computationally expensive. Conversely, BC-based approaches (TD3-BC, IQL) are the most computationally efficient, but least performant. TD3-BC-N and SAC-BC-N, on the other hand, are able to retain the advantages of both approaches while diminishing their individual deficiencies.

Fig. 7

Performance and efficiency. Average training time and normalised score across MuJoCo and AntMaze tasks. TD3-BC-N and SAC-BC-N can match the performance of ensemble-based approaches while retaining the computational efficiency of those based on behavioural cloning

5.4.1 Ablation studies

In addition to our main results, we also conduct a number of ablation studies to verify the importance of individual components of our approach, as well as implementation decisions. In Ablations 1–3, we use the MuJoCo environments to assess the impact of removing the BC component, the ensemble of critics and the inflated period of BC, respectively. In Ablation 4, we use the AntMaze and Adroit environments to show the impact of using shared targets instead of independent targets during policy evaluation. We conduct these ablations using TD3-BC-N, making no other changes than those outlined above.

We summarise results in Fig. 8, plotting the percentage difference between each ablation score and the main results of Table 1. For Ablations 1–2 we see that removing either BC or the ensemble has a detrimental impact on performance overall. While the performance for some tasks is unaffected by removing the BC component, there are others that suffer catastrophic failure and hence its inclusion is essential. For Ablation 3 we see removing the inflated period of BC has minimal impact on most data sources, but the severe impact on “walker2d-expert” warrants its inclusion. Finally, in Ablation 4 we see the use of independent targets is crucial for the more challenging “medium” and “large” AntMaze environments and is beneficial for Adroit environments.

Fig. 8

Ablation studies. Each plot shows the percentage difference in mean normalised score between each ablation and the main results from Table 1. Ablations 1 and 2 show that both behavioural cloning and an ensemble of critics are necessary to achieve strong performance. Ablations 3 and 4 show the importance of our implementation choices, namely the use of an initial period of inflated BC and independent targets for AntMaze/Adroit environments

5.5 Online fine-tuning

Starting with our offline trained agents, we perform online fine-tuning according to the procedures outlined in Algorithms 3 and 4. We populate the replay buffer R with the last 2500 transitions from D and train agents for an additional 250k environment interactions, with gradient updates commencing after the first 2500 interactions (i.e. \(K=2500\)). The offline value of \(\beta\) is used for \(\beta _{start}\) and the number of decay steps S is set to 50k. The value of \(\beta _{end}\) is set according to environment and procedure, but as with our offline experiments, its value does not change according to initial data quality. Values for each data set and procedure are provided in the Appendix. All other parameters remain the same.

For comparison purposes, we also fine-tune agents using a procedure similar to REDQ + AdaptiveBC, in which the BC component is adjusted based on online returns. Specifically, \(\beta\) is adjusted as:

$$\begin{aligned} \Delta \beta = K_P (R_{avg} - R_{target}) + K_D \max (0, R_{avg} - R_{current}), \end{aligned}$$

where, as per the original paper, \(K_P=3 \times 10^{-5}\), \(K_D=1 \times 10^{-4}\), \(R_{target}=1.05\), \(R_{current}\) is the latest normalised return and \(R_{avg}\) a running average of normalised returns. Note that, in order to calculate normalised scores, prior knowledge of random and expert performance is required. Apart from how \(\beta\) is adjusted during online training, all other conditions remain the same. We denote these curves as TD3-BC-N-Adapt and SAC-BC-N-Adapt.
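For clarity, this adjustment amounts to the following update, a direct transcription of the equation above; any clipping of \(\beta\) to a valid range is left to the implementation.

```python
def adaptive_beta_update(beta, r_avg, r_current, k_p=3e-5, k_d=1e-4, r_target=1.05):
    """Return-based adjustment of the BC coefficient used for the -Adapt baselines."""
    delta = k_p * (r_avg - r_target) + k_d * max(0.0, r_avg - r_current)
    return beta + delta
```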

For each task, we plot the corresponding learning curves in Fig. 9, evaluating policies every 5000 environment interactions (10 evaluations for MuJoCo, 100 evaluations for Adroit/AntMaze/Maze2D). The solid line represents the mean (non-normalised) score across each of the five seeds, the shaded area the standard error and the dashed line performance prior to fine-tuning. For the MuJoCo environments, in the majority of cases agents are able to improve their policies while avoiding severe performance drops during the offline-to-online transition. For TD3-BC-N, the performance for “hopper/halfcheetah-expert” declines slightly over the course of training, and for SAC-BC-N there is a sharp decline for “walker2d-expert” within the \(\beta\) decay period. For Adroit, TD3-BC-N manages a reasonable transition and subsequent improvement, but SAC-BC-N is less successful. With the exception of “antmaze-umaze”, in AntMaze both TD3-BC-N and SAC-BC-N obtain improved policies in a reasonably stable manner. Finally, for Maze2D we see continued improvement for both methods, with some minor initial deterioration in TD3-BC-N for “maze2d-umaze” and a fairly large initial slump in SAC-BC-N for “maze2d-umaze”.

Comparing to the -Adapt versions, in general we see both performance and stability are as good or better, despite the fact that our approach requires no prior domain knowledge. We note severe stability issues for SAC-BC-N-Adapt on “hopper-medium”, “hopper-medium-replay” and “walker2d-medium-replay”. This may be a result of the values of \(K_P\) and \(K_D\), which were originally set based on a mean-squared error BC term, not transferring to the log-likelihood form.

Fig. 9

Online fine-tuning for D4RL tasks. The solid line represents the mean non-normalised score across each of the five agents, shaded area the standard error and dashed line performance prior to fine-tuning. In general, agents are able to improve their policies in a stable manner, with only a few tasks/data sources causing stability issues. Note that learning curves for SAC-BC-N and -Adapt are identical for “halfcheetah” tasks as \(\beta =0\), hence we omit -Adapt versions for clarity

6 Discussion and conclusion

In this work we have investigated the role of policy constraints as a mechanism for improving the computational efficiency of ensemble-based approaches to offline reinforcement learning. Through empirical evaluation, we have shown how constraints in the form of behavioural cloning can be used to control the level of uncertainty in the estimated value of out-of-distribution actions, allowing these estimates to be sufficiently penalised to prevent overestimation bias. Through this feature, we have been able to match state-of-the-art performance across a number of challenging benchmarks while significantly reducing computational burden, cutting the size of the ensemble to a fraction of that needed when policies are unconstrained. We have also shown how behavioural cloning can be repurposed to promote stable and performant online fine-tuning, by gradually reducing its influence during the offline-to-online transition. These achievements have required only minimal changes to existing approaches, allowing for easy implementation and interpretation.

Our work highlights a number of interesting avenues for future research. Primary among these is the development of methods for selecting the size of the ensemble N and level of behavioural cloning \(\beta\) offline. While we have demonstrated our approach can achieve strong performance using consistent hyperparameters, we have also shown how performance can be further improved by allowing them to vary. Related to this is the development of approaches for automatically tuning \(\beta\) during training, possibly making use of uncertainty metrics described in Sect. 5.3. A theoretical analysis of the impact of \(\beta\) on uncertainty could also prove beneficial in this regard.

While in this work we have used ensembles for uncertainty estimation, other techniques such as multi-head, multi-input/multi-output and Monte Carlo dropout networks can just as easily be used and integrated with BC. Similarly, other forms of policy constraints and/or other divergence metrics can be incorporated into ensemble-based approaches in a relatively straightforward manner. As such, there are a number of permutations which could lead to improved performance and/or computational efficiency.

Finally, our fine-tuning procedure may benefit from incorporating elements from methods outlined in Sect. 2, allowing for greater stability during the entire duration of online learning. In addition, our approach may also prove useful in promoting greater data efficiency in online-RL.