Model-based trajectory stitching for improved behavioural cloning and its applications

Behavioural cloning (BC) is a commonly used imitation learning method to infer a sequential decision-making policy from expert demonstrations. However, when the quality of the data is not optimal, the resulting behavioural policy also performs sub-optimally once deployed. Recently, there has been a surge in offline reinforcement learning methods that hold the promise to extract high-quality policies from sub-optimal historical data. A common approach is to perform regularisation during training, encouraging updates during policy evaluation and/or policy improvement to stay close to the underlying data. In this work, we investigate whether an offline approach to improving the quality of the existing data can lead to improved behavioural policies without any changes in the BC algorithm. The proposed data improvement approach - Trajectory Stitching (TS) - generates new trajectories (sequences of states and actions) by `stitching' pairs of states that were disconnected in the original data and generating their connecting new action. By construction, these new transitions are guaranteed to be highly plausible according to probabilistic models of the environment, and to improve a state-value function. We demonstrate that the iterative process of replacing old trajectories with new ones incrementally improves the underlying behavioural policy. Extensive experimental results show that significant performance gains can be achieved using TS over BC policies extracted from the original data. Furthermore, using the D4RL benchmarking suite, we demonstrate that state-of-the-art results are obtained by combining TS with two existing offline learning methodologies reliant on BC, model-based offline planning (MBOP) and policy constraint (TD3+BC).


Introduction
Behavioural cloning (BC) [1,2] is one of the simplest imitation learning methods to obtain a decision-making policy from expert demonstrations. BC frames the imitation learning problem as a supervised learning one. Given expert trajectories - the expert's paths through the state space - a policy network is trained to reproduce the expert behaviour: for a given observation, the action taken by the policy must closely approximate the one taken by the expert. Although a simple method, BC has been shown to be very effective across many application domains [1,3-5], and has been particularly successful in cases where the dataset is large and has wide coverage [6]. An appealing aspect of BC is that it is applied in an offline setting, using only the historical data. Unlike reinforcement learning (RL) methods, BC does not require further interactions with the environment. Offline policy learning can be advantageous in many circumstances, especially when collecting new data through interactions is expensive, time-consuming or dangerous; or in cases where deploying a partially trained, sub-optimal policy in the real world may be unethical, e.g. in autonomous driving and medical applications.
BC extracts the behaviour policy which created the dataset. Consequently, when applied to sub-optimal data (i.e. when some or all trajectories have been generated by non-expert demonstrators), the resulting behavioural policy is also expected to be sub-optimal. This is due to the fact that BC has no mechanism to infer the importance of each state-action pair. Other drawbacks of BC are its tendency to overfit when given a small number of demonstrations and the state distributional shift between training and test distributions [6,7]. In the area of imitation learning, significant efforts have been made to overcome such limitations; however, the available methodologies generally rely on interacting with the environment [7-10]. So, a question arises: can we help BC infer a superior policy only from available sub-optimal data without the need to collect additional expert demonstrations?
Our investigation is related to the emerging body of work on offline RL, which is motivated by the aim of inferring expert policies with only a fixed set of sub-optimal data [11,12]. A major obstacle towards this aim is posed by the notion of action distributional shift [12-14]. This is introduced when the policy being optimised deviates from the behaviour policy, and is caused by the action-value function overestimating out-of-distribution (OOD) actions. A number of existing methods address the issue by constraining the actions that can be taken. In some cases, this is achieved by constraining the policy to

Fig. 1 Simplified illustration of Trajectory Stitching (axes: State, Value). Each original trajectory (a sequence of states and actions) in the dataset $D$ is indicated as $T_i$ with $i = 1, \ldots, 3$. A first stitching event is seen in trajectory $T_1$ whereby a transition to a state originally visited in $T_2$ takes place. A second stitching event involves a jump to a state originally visited in $T_3$. At each event, jumping to a new state increases the current trajectory's future expected returns. The resulting trajectory (in bold) consists of a sequence of states, all originally visited in $D$, but connected by imagined actions; it replaces $T_1$ in the new dataset.
In contrast to existing offline learning approaches, we turn the problem on its head: rather than trying to regularise or constrain the policy somehow, we investigate whether the data quality itself can be improved using only the available demonstrations. To explore this avenue, we propose a model-based data improvement method called Trajectory Stitching (TS). Our ultimate aim is to develop a procedure that identifies sub-optimal trajectories and replaces them with better ones. New trajectories are obtained by stitching existing ones together, without the need to generate unseen states. The proposed strategy consists of replaying each existing trajectory in the dataset: for each state-action pair leading to a particular next state along a trajectory, we ask whether a different action could have been taken instead, which would have landed at a different seen state from a different trajectory. An actual jump to the new state only occurs when generating such an action is plausible and it is expected to improve the quality of the original trajectory - in which case we have a stitching event.
An illustrative representation of this procedure can be seen in Fig. 1, where we assume that only three historical trajectories are at our disposal. In this example, a trajectory has been improved through two stitching events. To determine the stitching points, TS uses a probabilistic view of state-reachability that depends on learned dynamics models of the environment. These models are evaluated only on in-distribution states, enabling accurate prediction. In order to assess the expected future improvement introduced by a potential stitching event, we utilise a state-value function and a reward model. Thus, TS can be thought of as a data-driven, automated procedure yielding highly plausible and higher-quality demonstrations to facilitate supervised learning; at the same time, sub-optimal demonstrations are removed altogether whilst keeping the diverse set of seen states.
Our experimental results show that TS produces higher-quality data, with BC-derived policies always superior to those inferred from the original data. Remarkably, we demonstrate that TS-augmented data allow BC to compete with state-of-the-art offline RL algorithms on highly complex continuous control OpenAI Gym tasks implemented in MuJoCo using the D4RL offline benchmarking suite [26]. Furthermore, we show that integrating TS with existing offline learning methods that explicitly use BC, such as model-based planning [24] and TD3+BC [18], can significantly boost their performance.
2 Related work

Imitation learning
Imitation learning aims to emulate a policy from expert demonstrations [27]. BC is the simplest method in this category and uses supervised learning to clone the actions in the dataset. BC is a powerful method and has been used successfully in many applications such as teaching a quadrotor to fly [28], self-driving cars [29,30] and games [5]. These applications are highly complex and show that accurate policies can be estimated from high-quality offline data.
One drawback of using BC is the state distributional shift between training and test distributions. Improved imitation learning methods have been introduced to reduce this distributional shift; however, they usually require online exploration. For instance, DAgger [7] is an online learning approach that iteratively updates a deterministic policy; it addresses the state distributional shift problem of BC through an on-policy method for data collection; similarly to TS, the original dataset is augmented, but this involves online interactions. Another algorithm, GAIL [9], iteratively updates a generative adversarial network [31] to determine whether a state-action pair can be deemed as expert; a policy is then inferred using a trust region policy optimisation step [32]. TS also uses generative modelling, but this is to create data points likely to have come from the data that connect high-value regions. Whereas expert demonstrations are essential for imitation learning, TS creates higher quality datasets from existing, possibly sub-optimal data, to improve offline policy learning.

Offline reinforcement learning
Offline RL aims to learn an optimal policy from sub-optimal datasets without further interactions with the environment [11,12]. Similarly to BC, offline RL suffers from distributional shift. However, this shift comes from the policy selecting OOD actions, leading to overestimation of the value function [13,14]. In the online setting, this overestimation encourages the agent to explore, but offline this leads to a compounding of errors where the agent believes OOD actions lead to high returns. Many offline RL algorithms bias the learned policy towards the behaviour-cloned one [18,24,25] to ensure the policy does not deviate too far from the behaviour policy. Many of these offline methods are therefore expected to directly benefit from enhanced datasets yielding higher-achieving behavioural policies.

Model-free methods
Model-free offline RL methods typically deal with distributional shift either by regularising the policy to stay close to actions given in the dataset [13-18] or by pessimistically evaluating the Q-value to penalise OOD actions [19-21]. Both options involve explicitly or implicitly capturing information about the unknown underlying behaviour policy. This behaviour policy can be fully captured using BC. For instance, batch-constrained Q-learning (BCQ) [13] is a policy constraint method which uses a variational autoencoder to generate likely actions in order to constrain the policy. The TD3+BC algorithm [18] offers a simplified policy constraint approach; it adds a behavioural cloning regularisation term to the policy update, biasing actions towards those in the dataset. Alternatively, conservative Q-learning (CQL) [19] adjusts the value of the state-action pairs to "push down" on OOD actions and "push up" on in-distribution actions, so that the former are discouraged and the latter encouraged. Implicit Q-learning (IQL) [33] avoids querying OOD actions altogether by manipulating the Q-value to have a state-value function in the SARSA-style update. All the above methods try to directly deal with OOD actions, either by avoiding them or safely handling them in either the policy improvement or evaluation step. In contrast, our method rethinks the problem of learning from sub-optimal data. Rather than using RL to learn a policy, we use RL-based approaches to enrich the data, enabling BC to extract an improved policy. Our method generates unseen actions between in-distribution states; by doing so, we avoid distributional shift by evaluating a state-value function only on seen states.

Model-based methods
Model-based algorithms rely on an approximation of the environment's dynamics [34,35], that is, probability distributions over the next state and reward given the current state and action. In the online setting, model-based methods tend to improve sample efficiency [35-39]. In an offline learning context, the learned dynamics have been exploited in various ways. One approach consists of using the models to improve the policy learning. For instance, Model-based offline RL (MOReL) [40] is an algorithm which constructs a pessimistic Markov decision process (P-MDP), based on a learned forward dynamics model and a state-action detector. The P-MDP is given an additional absorbing state, which gives a large negative reward for unknown state-actions. Model-based Offline policy Optimization (MOPO) [41] augments the dataset by performing rollouts using a learned, uncertainty-penalised MDP. Unlike MOPO, TS does not introduce imagined states, but only actions between reachable unconnected states.
Another opportunity to exploit learnt models of the environment is in decision-time planning. Model-based offline planning (MBOP) [24] uses the learnt environment dynamics and a BC policy to roll out a trajectory from the current state, one transition at a time. The best trajectory from the current state is found, with the trajectory horizon extended using a value function, and its first action is selected. This process is repeated for each new state. Model-based offline planning with trajectory pruning (MOPP) [25] extends the MBOP idea, but prunes the trajectory roll-outs based on an uncertainty measure, safely handling the problem of distributional shift. Diffuser [42] uses a diffusion probabilistic model to predict a whole trajectory in one step. Rather than using a model to predict a single next state at decision-time, Diffuser can generate unseen trajectories that have high likelihood under the data and maximise the cumulative rewards of a trajectory, ensuring long-horizon accuracy. However, generating each plan with Diffuser is very slow, which limits its use in real-world applications. Our TS method can be used in direct conjunction with planning, especially with MBOP and MOPP, which both use a BC policy to guide the trajectory sampling.

State similarity metrics
A central aspect of the proposed TS approach consists of a stitching event, which uses a notion of state similarity to determine whether two states are "close" together. Relying on only geometric distances would often be inappropriate; e.g. two states may be close in Euclidean distance, yet reaching one from another may be impossible (e.g. in navigation task environments where walls or other obstacles preclude reaching a nearby state). Bisimulation metrics [43] capture state similarity based on the dynamics of the environment. These have been used in RL mainly for system state aggregation [44-46]; they are expensive to compute [47] and usually require full-state enumeration [48-50]. A scalable approach for state-similarity has recently been introduced by using a pseudometric [51] which facilitates the calculation of state-similarity in offline RL. PLOFF [50] is an offline RL algorithm that uses a state-action pseudometric to bias the policy evaluation and improvement steps. Whereas PLOFF uses a pseudometric to stay close to the dataset, we bypass this notion altogether by only using states in the dataset and generating unseen actions connecting them. Our stitching event is based on the decomposition of the trajectory distribution, which allows us to pick unseen actions that nonetheless have high likelihood, as determined by the future state.

Data re-sampling and augmentation approaches
In offline RL, data re-sampling strategies aim to only learn from high-performing transitions. For instance, best-action imitation learning (BAIL) [52] imitates state-action pairs based on the upper envelope of the dataset. Monotonic Advantage Re-Weighted Imitation Learning (MARWIL) [53] weights state-action pairs by an exponentially-weighted advantage function during policy learning by BC. Return-based data re-balance (ReD) [54] resamples the data based on the trajectory returns and then applies offline reinforcement learning methods. The proposed TS differs from BAIL, MARWIL and ReD as we augment the dataset with impactful stitching transitions while also removing the low-quality transitions. TS has the effect of re-sampling high-value transitions in the trajectory as well as supplementing the dataset with stitched transitions, connecting high-value regions.
Best action trajectory stitching (BATS) [55] is a related trajectory stitching method: it augments the dataset by adding transitions through model-based planning. TS differs from BATS in a number of fundamental ways. First, BATS takes a geometric approach to defining state similarity; state-actions are rolled out using the dynamics model until a state is found that is within a short distance of a state in the dataset. Relying exclusively on geometric distances may lead to poor results; as such, our stitching events are based on the dynamics of the environment and are only assessed between two in-distribution states. Second, BATS generates new states that are not in the dataset. Due to compounding model error, resulting in unlikely rollouts, the rewards are penalised for the generated transitions, which favours state-action pairs within the dataset. In contrast, we only allow one-step stitching between in-distribution states and use the value function to extend the horizon rather than a learned model. Finally, BATS adds all stitched actions to the original dataset, then creates a new dataset by running value iteration, which is eventually used to learn a policy through BC. In contrast, our TS method has been designed to be more directly suited to policy learning through BC: since the lower-value experiences have been removed through stitching events, the resulting dataset contains only high-quality trajectories to learn from.

Problem setup
We consider the offline RL problem setting, which consists of finding an optimal decision-making policy from a fixed dataset. The policy is a mapping from states to actions, $\pi : S \to A$, where $S$ and $A$ are the state and action spaces, respectively. The dataset is made up of transitions $D = \{(s_t, a_t, r_t, s_{t+1})\}$ that include the current state, $s_t$, the action performed in that state, $a_t$, the next state after the action has been taken, $s_{t+1}$, and the reward resulting from the transition, $r_t$. The actions are assumed to follow an unknown behavioural policy, $\pi_\beta$, acting in a Markov decision process (MDP). The MDP is defined as $M = (S, A, P, R, \gamma)$, where $P : S \times A \times S \to [0, 1]$ is the transition probability function which defines the dynamics of the environment, $R : S \times A \times S \to \mathbb{R}$ is the reward function and $\gamma \in (0, 1]$ is a scalar discount factor [56]. In offline RL, the agent must learn a policy, $\pi(a_t \mid s_t)$, that maximises the returns defined as the expected sum of discounted rewards, $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, without ever having access to $\pi_\beta$. Here we are interested in performing imitation learning through BC, which mimics $\pi_\beta$ by performing supervised learning on the state-action pairs in $D$ [1,2]. More specifically, BC finds a deterministic policy that minimises a supervised loss between the policy's output and the recorded action over the state-action pairs in $D$. This solution is known to minimise the KL-divergence between the trajectory distributions of $\pi_\beta$ and of the learned policy [57]. Our objective is to enhance the dataset, such that it has the effect of being collected by an improved behaviour policy. Thus, training a policy by BC on the improved dataset will lead to higher returns than those achieved by $\pi_\beta$.
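To make the supervised-learning view of BC concrete, the following is a minimal sketch that fits a deterministic policy to state-action pairs. It assumes a one-dimensional state and action, a linear policy class and a squared-error loss; these choices, and the names `fit_linear_bc`, `demo_states` and `demo_actions`, are illustrative assumptions and not the policy network used in this work.

```python
from statistics import fmean


def fit_linear_bc(states, actions):
    """Fit a deterministic policy pi(s) = w * s + b by least squares.

    Illustrative stand-in for BC: the loss and the policy class are assumptions.
    """
    s_bar, a_bar = fmean(states), fmean(actions)
    cov = sum((s - s_bar) * (a - a_bar) for s, a in zip(states, actions))
    var = sum((s - s_bar) ** 2 for s in states)
    w = cov / var
    b = a_bar - w * s_bar
    return lambda s: w * s + b


# Toy demonstrations from a hypothetical behaviour policy a = 2s - 1.
demo_states = [0.0, 1.0, 2.0, 3.0]
demo_actions = [2.0 * s - 1.0 for s in demo_states]
bc_policy = fit_linear_bc(demo_states, demo_actions)
```

The cloned policy reproduces whatever mapping generated the demonstrations, whether or not that mapping is a good one, which is why the quality of the data bounds the quality of the BC policy.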

Model-based Trajectory Stitching
Under our modelling assumptions, the probability distribution of any given trajectory $T = (s_0, a_0, s_1, a_1, s_2, \ldots, s_H)$ in $D$ can be expressed as

$$p(T) = p(s_0) \prod_{t=0}^{H-1} p(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t), \tag{1}$$

where $p(a_t \mid s_t)$ is the policy and $p(s_{t+1} \mid s_t, a_t)$ is the environment's dynamics. First, we note that, in the offline case, Eq. (1) can be re-written in an alternative, but equivalent form as

$$p(T) = p(s_0) \prod_{t=0}^{H-1} p(s_{t+1} \mid s_t)\, p(a_t \mid s_t, s_{t+1}), \tag{2}$$

which now depends on two different conditional distributions: $p(s_{t+1} \mid s_t)$, the environment's forward dynamics, and $p(a_t \mid s_t, s_{t+1})$, its inverse dynamics. Both distributions can be approximated using the available data, $D$ (see Section 3.3). We also pre-train a state-value function $V^{\pi_\beta}$ to estimate the future expected sum of rewards for being in a state $s$ following the behaviour policy $\pi_\beta$, as well as a reward function (see Section 3.4), which will be used to predict $r(s_t, \hat{a}_t, s_{t+1})$ for any action $\hat{a}_t$ not in $D$.
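The equivalence between the action-conditioned factorisation and the forward/inverse dynamics factorisation follows from applying the product rule twice to the joint distribution of the action and next state given the current state. The short derivation below spells out this step for a single transition; the marginal forward dynamics is understood as averaging over the actions of the behaviour policy.

```latex
\begin{align*}
p(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
  &= p(a_t, s_{t+1} \mid s_t) \\
  &= p(s_{t+1} \mid s_t)\, p(a_t \mid s_t, s_{t+1}),
\qquad
p(s_{t+1} \mid s_t) = \int p(a \mid s_t)\, p(s_{t+1} \mid s_t, a)\, \mathrm{d}a .
\end{align*}
```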
Eq. (2) informs our data-improvement strategy, as follows. For a given transition, $(s_t, a_t, s_{t+1}) \in D$, our aim is to replace $s_{t+1}$ with $\hat{s}_{t+1} \in D$ using a synthetic connecting action $\hat{a}_t$. A necessary condition for such a state swap to occur is that $\hat{s}_{t+1}$ should be plausible, conditional on $s_t$, according to the learnt forward dynamics model, $p(s_{t+1} \mid s_t)$. Furthermore, such a state swap should only happen when landing on $\hat{s}_{t+1}$ leads to higher expected returns. Accordingly, two criteria need to be satisfied in order to allow swapping states:

$$p(\hat{s}_{t+1} \mid s_t) \geq p(s_{t+1} \mid s_t) \quad \text{and} \quad V^{\pi_\beta}(\hat{s}_{t+1}) > V^{\pi_\beta}(s_{t+1}).$$

The first criterion ensures that the new next state is at least as likely to have been observed as the original next state under the learnt dynamics model. Furthermore, to be beneficial, the candidate next state must not only be likely to be reached from $s_t$ under the environment dynamics, but must also lead to higher expected returns compared to the current $s_{t+1}$. This requirement is captured by the second criterion using the pre-trained value function. In practice, finding a suitable candidate $\hat{s}_{t+1}$ involves a search amongst all the states that have been visited by any trajectory in $D$ (see Section 3.3). Where the two criteria above are satisfied, a plausible action connecting $s_t$ and the newly found $\hat{s}_{t+1}$ is obtained by generating an action that maximises the learnt inverse dynamics model. In summary, we have:

Definition 1 A candidate stitching event consists of a transition $(s_t, \hat{a}_t, \hat{s}_{t+1})$ that replaces $(s_t, a_t, s_{t+1})$, where $\hat{s}_{t+1} \in D$ satisfies the two criteria above and $\hat{a}_t = \arg\max_a p(a \mid s_t, \hat{s}_{t+1})$.

For every trajectory in the dataset, starting from the initial state, we sequentially identify candidate stitching events. For instance, in Fig. 1, two such events have been identified along the $T_1$ trajectory and eventually they yield a new trajectory, $\hat{T}_1$. When the cumulative sum of rewards along the newly formed trajectory is higher than that observed in the original trajectory, the old trajectory is replaced by the new one in $D$. This is captured by the following definition.
Definition 2 A trajectory replacement event is such that, if a new trajectory $\hat{T}$ started at the initial state $s_0$ of $T$ has been compiled after a sequence of candidate stitching events, then $\hat{T}$ replaces $T$ in $D$ when the following condition is satisfied:

$$\sum_{t} \hat{r}_t > (1 + p) \sum_{t} r_t,$$

where $\hat{r}_t$ and $r_t$ denote the rewards along $\hat{T}$ and $T$, respectively. In this definition, $p$ is a small positive constant and the $(1 + p)$ term ensures that the cumulative sum of rewards in the new trajectory improves upon the old one by a given margin. This conservative approach takes into account potential prediction errors incurred by using the learnt reward model when assessing the rewards for $\hat{T}$.
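The replacement test of Definition 2 can be sketched in a few lines. The function name `should_replace`, the default margin `p=0.1`, the use of a strict inequality and the use of undiscounted reward sums are assumptions made for illustration.

```python
def should_replace(old_rewards, new_rewards, p=0.1):
    """Return True when the new trajectory's summed rewards exceed the old
    trajectory's summed rewards by the relative margin (1 + p).

    p=0.1 is a hypothetical default, not a value taken from the paper.
    """
    return sum(new_rewards) > (1.0 + p) * sum(old_rewards)


# A marginal gain (3.2 vs 3.0) does not clear a 10% margin ...
small_gain = should_replace([1.0, 1.0, 1.0], [1.0, 1.0, 1.2], p=0.1)
# ... whereas a larger gain (4.0 vs 3.0) does.
large_gain = should_replace([1.0, 1.0, 1.0], [2.0, 1.0, 1.0], p=0.1)
```

One point worth keeping in mind with a multiplicative margin of this form: it tightens the test only when the original return is positive, since multiplying a negative return by (1 + p) lowers the threshold instead.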
The procedure above is repeated for all the trajectories in the current dataset. When any of the original trajectories are replaced by new ones, a new and improved dataset is formed. The new dataset can then be thought of as being collected by a different, and improved, behaviour policy. Using the new data, the value function is trained again, and a search for trajectory replacement events is started again. This iterative procedure is summarised below.
Definition 3 Trajectory Stitching is an iterative process whereby every trajectory in a dataset $D$ may be entirely replaced by a new one formed through trajectory replacement events. When such replacements take place, resulting in a new dataset, an updated value function is inferred and the process is repeated.
The trajectory stitching method enforces a greedy next state selection policy (Definition 1) and guarantees that the trajectories produced by this policy have higher returns than under the previous policy (Definition 2). Therefore, we obtain a new dataset (Definition 3) collected under a new behaviour policy for which a new value function can be learned and the trajectory stitching process can be repeated. This iterative data improvement process is terminated when no more trajectory replacements are possible, or earlier (see Section 4).
The TS approach is sufficiently flexible and can be implemented in various ways. In the remainder of this section we describe how we have chosen to model the two probability distributions featuring in Eq. (2), and how we estimate the state-value function and predict the environment's rewards.

Candidate next state search
The search for a candidate next state requires a learned forward dynamics model, i.e. $p(s_{t+1} \mid s_t)$. Model-based RL approaches typically use such dynamics models conditioned on the action as well as the state to make predictions [24,35,40,41]. Here, we use the model differently, only to guide the search process and identify a suitable next state to transition to. Specifically, conditional on $s_t$, the dynamics model is used to assess the relative likelihood of observing any other $s_{t+1}$ in the dataset compared to the observed one. The environment dynamics are assumed to follow a Gaussian distribution whose mean vector and covariance matrix are approximated by a neural network, i.e. $p_{\xi}(s_{t+1} \mid s_t) = \mathcal{N}\big(\mu_{\xi}(s_t), \Sigma_{\xi}(s_t)\big)$.
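In our implementation, we take an ensemble of $N$ Gaussian models, $E$; each component of $E$ is characterised by its own parameter set, $(\mu_{\xi_i}, \Sigma_{\xi_i})$. This approach has been shown to take into account epistemic uncertainty, i.e. the uncertainty in the model parameters [22,24,38,39]. Each individual model's parameter vector is estimated via maximum likelihood by optimising the Gaussian negative log-likelihood

$$L(\xi_i) = \mathbb{E}_{(s_t, s_{t+1}) \sim D}\left[ \big(\mu_{\xi_i}(s_t) - s_{t+1}\big)^{\top} \Sigma_{\xi_i}^{-1}(s_t) \big(\mu_{\xi_i}(s_t) - s_{t+1}\big) + \log \left| \Sigma_{\xi_i}(s_t) \right| \right],$$

where $|\cdot|$ is the determinant of a matrix, and each model's parameter set is initialised differently prior to estimation. Upon fitting the models, a state $s_{t+1}$ is replaced by $\hat{s}_{t+1}$ only when the candidate's likelihood under the least favourable ensemble member, $\min_i p_{\xi_i}(\hat{s}_{t+1} \mid s_t)$, is at least as large as the ensemble's likelihood for the observed next state $s_{t+1}$. Here we are taking a conservative approach as we trust the likelihood prediction of seen state-next state pairs, $p_{\xi_i}(s_{t+1} \mid s_t)$, more than that of unseen state-next state pairs, $p_{\xi_i}(\hat{s}_{t+1} \mid s_t)$.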
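A minimal sketch of the conservative ensemble check is given below. It assumes diagonal covariances, works in log space, and compares the ensemble minimum for the candidate with the ensemble mean of the log-likelihoods for the observed next state; the mean aggregation and the names `diag_gaussian_logpdf` and `passes_dynamics_check` are assumptions for illustration only.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance (var holds variances)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def passes_dynamics_check(ensemble, s, s_next, s_cand):
    """Conservative plausibility test for swapping s_next with s_cand.

    ensemble: list of callables mapping a state to (mean, var) of the next state.
    The candidate is scored by its worst ensemble member; the observed next
    state is scored by the ensemble average (an assumed aggregation).
    """
    cand = min(diag_gaussian_logpdf(s_cand, *model(s)) for model in ensemble)
    obs = [diag_gaussian_logpdf(s_next, *model(s)) for model in ensemble]
    return cand >= sum(obs) / len(obs)


# Two hypothetical ensemble members that disagree slightly on the mean.
toy_ensemble = [
    lambda s: ([x + 1.0 for x in s], [1.0] * len(s)),
    lambda s: ([x + 1.2 for x in s], [1.0] * len(s)),
]
accepted = passes_dynamics_check(toy_ensemble, [0.0], [2.0], [1.1])
rejected = passes_dynamics_check(toy_ensemble, [0.0], [1.1], [2.0])
```

Taking the minimum over the ensemble for the unseen pair means that a single sceptical model is enough to block a stitching event.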

Value and reward function estimation
Value functions are widely used in reinforcement learning to determine the quality of an agent's current position [56].
In our context, $V_\theta$ is only used to observe the value of in-distribution states, thus avoiding the OOD issue when evaluating value functions which occurs in offline RL. The value function will only be queried once to determine whether a candidate stitching event has been found (Definition 1).
Value functions require rewards for training; therefore, a reward must be estimated for unseen tuples $(s_t, \hat{a}_t, \hat{s}_{t+1})$. There are many different modelling choices available; e.g., under a Gaussian model, the mean and variance of the reward can be estimated, allowing uncertainty quantification. Other alternatives include a Wasserstein-GAN, a VAE, and a standard multilayer neural network. In practice, the impact of the specific reward model and its effects when used for TS appears negligible (e.g. see Section 4.4.1). In the remainder of this section, we provide further details for one such model, based on Wasserstein-GAN [31,58], which we have extensively used in all our experiments (Section 4) and in our early investigations [59].
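Wasserstein-GANs consist of a generator, $G_\phi$, and a discriminator, $D_\psi$, where $\phi$ and $\psi$ are the parameters of the respective neural networks. The discriminator takes in the state, action, reward and next state, and determines whether this transition is from the dataset. The generator loss function is:

$$L_G = -\mathbb{E}_{z \sim p(z),\, (s_t, a_t, s_{t+1}) \sim D}\left[ D_\psi\big(s_t, a_t, G_\phi(z, s_t, a_t, s_{t+1}), s_{t+1}\big) \right].$$

Here $z \sim p(z)$ is a noise vector sampled independently from $\mathcal{N}(0, 1)$, the standard normal. The discriminator loss function is:

$$L_D = \mathbb{E}_{z \sim p(z),\, (s_t, a_t, s_{t+1}) \sim D}\left[ D_\psi\big(s_t, a_t, G_\phi(z, s_t, a_t, s_{t+1}), s_{t+1}\big) \right] - \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim D}\left[ D_\psi(s_t, a_t, r_t, s_{t+1}) \right].$$

Once trained, a reward will be predicted for the stitching event when a new action has been generated between two previously disconnected states.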
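For orientation, the standard Wasserstein-GAN objectives can be written directly in terms of the discriminator's scores on real and generated transitions. The sketch below assumes those scores have already been computed for a batch; the function names are illustrative, and the Lipschitz constraint on the discriminator that Wasserstein-GANs require (e.g. weight clipping or a gradient penalty) is deliberately left out.

```python
from statistics import fmean


def generator_loss(scores_fake):
    """Generator loss: push the discriminator's score on generated rewards up."""
    return -fmean(scores_fake)


def discriminator_loss(scores_real, scores_fake):
    """Discriminator (critic) loss: score dataset transitions above generated ones."""
    return fmean(scores_fake) - fmean(scores_real)


# Hypothetical discriminator scores for a batch of two transitions.
g_loss = generator_loss([1.0, 3.0])
d_loss = discriminator_loss([4.0, 6.0], [1.0, 3.0])
```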

Action generation
Sampling a suitable action that leads from $s_t$ to the newly found state $\hat{s}_{t+1}$ requires an inverse dynamics model. Specifically, we require that a synthetic action must maximise the estimated conditional density, $p(a_t \mid s_t, \hat{s}_{t+1})$. Given our requirement of sampling synthetic actions, a conditional variational autoencoder (CVAE) [60,61] provides a suitable approximation for the inverse dynamics model. The CVAE consists of an encoder $q_{\omega_1}$ and a decoder $p_{\omega_2}$, where $\omega_1$ and $\omega_2$ are the respective parameters of the neural networks.
The encoder maps the input data onto a lower-dimensional latent representation $z$, whereas the decoder generates data from the latent space. We train a CVAE to maximise the conditional marginal log-likelihood, $\log p(a_t \mid s_t, \hat{s}_{t+1})$. As this quantity is intractable, the CVAE objective enables us to maximise the variational lower bound instead,

$$\max_{\omega_1, \omega_2} \log p(a_t \mid s_t, \hat{s}_{t+1}) \geq \max_{\omega_1, \omega_2} \mathbb{E}_{z \sim q_{\omega_1}}\left[ \log p_{\omega_2}(a_t \mid s_t, \hat{s}_{t+1}, z) \right] - D_{KL}\left[ q_{\omega_1}(z \mid a_t, s_t, \hat{s}_{t+1}) \,\|\, \mathcal{N}(0, 1) \right],$$

where $\mathcal{N}(0, 1)$ is the prior for the latent variable $z$, and $D_{KL}$ represents the KL-divergence [62,63]. To generate an action between two unconnected states, $s_t$ and $\hat{s}_{t+1}$, we use the decoder $p_{\omega_2}$ to sample from $p(a_t \mid s_t, \hat{s}_{t+1})$. This process ensures that the most plausible action is generated conditional on $s_t$ and $\hat{s}_{t+1}$.
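The two terms of the variational lower bound can be evaluated in closed form for a single sample under common modelling choices. The sketch below assumes a diagonal Gaussian encoder parameterised by a mean and log-variance, a standard normal prior, and a unit-variance Gaussian decoder so that the reconstruction term reduces to half the squared error (up to a constant); these choices and the function names are assumptions, not the architecture used in this work.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def negative_elbo(action, recon_action, mu, log_var):
    """Negative lower bound for one sample: reconstruction error plus KL term.

    recon_action is the decoder output for a latent drawn from the encoder;
    mu and log_var are the encoder outputs for (action, s_t, s_next).
    """
    recon = 0.5 * sum((a - r) ** 2 for a, r in zip(action, recon_action))
    return recon + kl_to_standard_normal(mu, log_var)


# An encoder that outputs exactly the prior incurs no KL penalty.
loss = negative_elbo([1.0], [0.0], [0.0], [0.0])
```

At generation time only the decoder is needed: a latent is drawn from the prior and decoded together with the two states.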

Experimental results
In this section, we first investigate whether TS can improve the quality of existing datasets for the purpose of inferring decision-making policies through BC in an offline fashion, without collecting any more data from the environment. Furthermore, we show that TS can help existing methods that explicitly use a BC term for offline learning to achieve higher performance. Specifically, we explore the use of TS in combination with two algorithms: model-based offline planning (MBOP) [24], which uses an explicit BC policy to select new actions, and TD3+BC [18], which has an explicit BC policy constraint. Our experiments rely on the D4RL datasets, a collection of commonly used benchmarking tasks, and include comparisons with selected offline RL methods. These comparisons provide an insight into the potential gains that can be achieved when TS is combined with BC-based algorithms, which often reach or even improve upon current state-of-the-art performance levels in offline RL. In Section 4.2, we show empirically that, even with a small amount of expert data, the TS+BC policies become closer to the expert policy in KL-divergence. In all experiments, we run TS for five iterations; these have been found to be sufficient to increase the quality of the data without being overly computationally expensive (Section 4.3). Finally, we provide ablation studies into the choice of reward model, as well as alternative extraction policies to BC.

Algorithm 1 Model-based Trajectory Stitching
Initialise: An action generator $p_{\omega_1}$, a reward generator $G_\phi$, an ensemble of dynamics models $\{p_{\xi_i}(s' \mid s)\}_{i=1}^{N}$, an acceptance threshold $p$, and a dataset $D_0$ made up of $T$ trajectories $(\mathcal{T}_1, \ldots, \mathcal{T}_T)$
1: for $k = 0, \ldots, K$ do
2:   Train state-value function, $V$, on $D_k$ by minimising Eq. (3).
     ...
       while not done do
7:       Create set of candidate next states from dataset
         Evaluate dynamics models for new set of states and take minimum, $\min_i p_{\xi_i}(\hat{s} \mid s)$
         ...
         Generate a new action and reward, $\tilde{a} \sim p_{\omega_1}(z, s, \hat{s}_j)$, $\tilde{r} \sim G_\phi(z, s, \tilde{a}, \hat{s}_j)$
11:      Add $(s, \tilde{a}, \tilde{r}, \hat{s}_j)$ to new trajectory $\tilde{\mathcal{T}}_t$
12:      Set $s = \hat{s}_j$
13:    else
14:      Add original transition, $(s, a, r, s')$, to the new trajectory $\tilde{\mathcal{T}}_t$
         ...
     end for
24:  Collect trajectories into dataset, $D_{k+1} = (\tilde{\mathcal{T}}_1, \ldots, \tilde{\mathcal{T}}_T)$
25: end for
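The inner stitching decision of Algorithm 1 can be sketched as follows. This is a hedged illustration only: the acceptance rule used here (a candidate must exceed the threshold under the least confident dynamics model and must have a higher estimated value than the current best next state) is an assumption consistent with the description of TS, and the callables `dynamics_ensemble` and `value_fn` are hypothetical interfaces rather than the authors' implementation.

```python
def stitch_step(s, original_next, candidates, dynamics_ensemble, value_fn, threshold):
    """Choose the next state to follow from `s`.

    dynamics_ensemble: list of callables p_i(s_next, s) -> likelihood (assumed API).
    value_fn: callable V(s) -> float.
    Returns (next_state, stitched), where `stitched` is True when a candidate
    replaced the original next state.
    """
    best, best_value, stitched = original_next, value_fn(original_next), False
    for cand in candidates:
        # Pessimistic plausibility: the minimum likelihood over the ensemble.
        plausibility = min(p(cand, s) for p in dynamics_ensemble)
        cand_value = value_fn(cand)
        if plausibility > threshold and cand_value > best_value:
            best, best_value, stitched = cand, cand_value, True
    return best, stitched
```

When `stitched` is True, a new action and reward would be generated for the pair `(s, best)`; otherwise the original transition is kept.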

Performance assessment on D4RL data
We evaluate our TS method on the D4RL [26] benchmarking datasets of the OpenAI gym MuJoCo tasks. Three complex continuous environments are tested (Hopper, Halfcheetah and Walker2d), each with different levels of difficulty. The "medium" datasets were gathered by the original authors using a single policy produced from the early stopping of an agent trained by soft actor-critic (SAC) [64,65]. The "medium-replay" datasets are the replay buffers from the training of the "medium" policies. The "expert" datasets were obtained from a policy trained to an expert level, and the "medium-expert" datasets are the combination of both the "medium" and "expert" datasets. A BC-cloned policy that used a TS dataset is denoted by TS+BC. All results and comparisons are summarised in Table 1, and detailed explanations of our methods follow. We run TS for 3 different seeds, giving 3 datasets; we then train BC over 5 seeds for each new dataset, giving 15 TS+BC policies.

Behaviour cloning: TS+BC
The first method we combine with TS on the D4RL datasets is BC. Enriching the dataset with more high-value transitions and removing low-quality ones leaves the dataset with closer-to-expert trajectories, making BC the most suitable policy extraction algorithm. From Table 1 we can see that TS+BC improves over BC in all cases, showing that TS creates a higher-quality dataset as claimed.

Model-based offline planning: TS+MBOP
Given the evidence presented above that TS improves over BC, a natural next step is to investigate whether TS can also improve on other methods that are reliant on BC. Model-based offline planning (MBOP) [24] is an offline model-based planning method that uses a BC policy to roll out multiple trajectories, picking the action that leads to the trajectory with the highest returns.
For this study, we alter MBOP slightly to obtain TS+MBOP: in this version, actions are selected using our TS-extracted policy and we use our trained value function.
As can be observed in Table 1, TS+MBOP improves over the MBOP baseline in all cases. We also compare TS+MBOP to state-of-the-art model-based algorithms such as MOPO [41], MOReL [40] and Diffuser [42]; in these comparisons, TS+MBOP achieves higher performance in 5 out of the 9 comparable tasks. Only in the hopper medium and medium-replay tasks does another model-based method outperform TS+MBOP.

Table 1 Average normalised scores of state-of-the-art offline RL methods achieved on three locomotion tasks (Hopper, Halfcheetah and Walker2d) using the D4RL v2 datasets. The results for competing methods have been gathered from the original publications. Bold scores represent the highest scores per task. TS+BC, TD3+TS+BC, TS+MBOP: in brackets we report the percentage improvement achieved by TS relative to their respective baselines.

Model-free offline RL: TD3+TS+BC
We also investigate the benefits of using TS in conjunction with a model-free offline RL algorithm. TD3+BC [18] explicitly uses BC in the policy improvement step to regularise the policy to take actions close to the dataset. As TS removes low-quality data, the learned Q-values will be inaccurate when trained solely on the new TS data. To counter this, we warm start TD3+BC on the original dataset, then use the new TS data to fine-tune both the critic and actor after the Q-values have been sufficiently trained. To keep this a fair comparison, we train the policy over the same number of iterations as reported in [18]. We make one small amendment for the Walker2d medium-replay dataset, where we train the critic using only the original data and use the TS data only to fine-tune the policy. We run TD3+TS+BC on the same 5 seeds as reported in the original dataset.
As reported in Table 1, we find that, in all cases, TD3+TS+BC outperforms the baseline method, thus solidifying the positive effect of TS in offline RL. For this comparison, we also consider two additional state-of-the-art model-free offline RL algorithms: IQL [33] and CQL [19]. In 6 out of the 9 comparable tasks, TD3+TS+BC significantly improves over the model-free baselines. In the hopper medium-replay task, we find that TD3+TS+BC under-performs compared to other model-free methods (IQL and CQL).

Expected performance on sub-optimal data
It is well known that BC minimises the KL-divergence of trajectory distributions between the learned policy and $\pi_\beta$ [57]. As TS has the effect of improving $\pi_\beta$, this suggests that the KL-divergence between the trajectory distributions of the learned policy and the expert policy would be smaller post TS. To investigate this hypothesis, we use two complex locomotion tasks, Hopper and Walker2d, in OpenAI's gym [66]. Independently for each task, we first train an expert policy, $\pi^*$, with TD3 [67], and use this policy to generate a baseline noisy dataset by sampling the expert policy in the environment and adding white noise to the actions, i.e. $a = \pi^*(s) + \epsilon$. A range of different, sub-optimal datasets are created by adding a certain amount of expert trajectories to the noisy dataset so that they make up x% of the total trajectories. Using this procedure, we create eight different datasets by controlling x, which takes values in the set {0, 0.1, 2.5, 5, 10, 20, 30, 40}. BC is run on each dataset for 5 random seeds. We run TS (for five iterations) on each dataset over three different random seeds and then create BC policies over the 5 random seeds, giving 15 TS+BC policies. Random seeds cause different TS trajectories, as they affect the latent variables sampled for the reward function and inverse dynamics model. The initialisation of weights is also randomised for the value function and BC policies; hence the robustness of the methods is tested over multiple seeds. The KL-divergences are calculated following [57].

Fig. 2 shows the scores as average returns from 10 trajectory evaluations of the learned policies. TS+BC consistently improves on BC across all levels of expertise for both the Hopper and Walker2d environments. As the percentage of expert data increases, TS is able to leverage more high-value transitions, consistently improving over the BC baseline.

Fig. 3 (left) shows the average difference in KL-divergences of the BC and TS+BC policies against the expert policy. Precisely, the y-axis represents $D_{\mathrm{KL}}(p_{\pi^*}(\mathcal{T}), p_{\pi_{BC}}(\mathcal{T})) - D_{\mathrm{KL}}(p_{\pi^*}(\mathcal{T}), p_{\pi_{TS+BC}}(\mathcal{T}))$, where $p_\pi(\mathcal{T})$ is the trajectory distribution for policy $\pi$, Eq. (1). A positive value represents the TS+BC policy being closer to the expert, and a negative value represents the BC policy being closer to the expert, with the absolute value representing the degree to which this is the case. We also scale the average KL-divergence between 0 and 1, where 0 is the smallest KL-divergence and 1 is the largest KL-divergence, per task. This makes the scale comparable between Hopper and Walker2d. The figure shows that BC can extract a behaviour policy closer to the expert after performing TS on the dataset, except in the 0% case for Walker2d, where the difference is not significant. TS seems to work particularly well with a minimum of 2.5% expert data for Hopper and 0.1% for Walker2d.

Furthermore, Fig. 3 (middle and right) shows the mean square error (MSE) between actions from the expert policy and the learned policy for the Hopper (middle) and Walker2d (right) tasks. Actions are selected by collecting 10 trajectory evaluations of an expert policy. As we expect, the TS+BC policies produce actions closer to the expert's at most levels of dataset expertise. A surprising result is that, for 0% expert data on the Walker2d environment, the BC policy produces actions closer to the expert than the TS+BC policy. This is likely due to TS not having any expert data to leverage. However, even in this case, TS still produces a higher-quality dataset than the original, as shown by the increased average returns. Overall, these results offer empirical confirmation that TS does have the effect of improving the underlying behaviour policy of the dataset.
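To make the quantity plotted in Fig. 3 (left) concrete, the following worked derivation shows why a KL-divergence between trajectory distributions depends only on the policies. It assumes the standard factorisation of a trajectory distribution into an initial-state distribution, policy terms and transition dynamics; Eq. (1) is not reproduced in this section, so this form and the horizon symbol $H$ are assumptions, and the exact estimator used in [57] may differ.

```latex
% Assumed factorisation of the trajectory distribution under policy \pi:
%   p_\pi(\mathcal{T}) = p(s_0) \prod_{t=0}^{H-1} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
\begin{align}
D_{\mathrm{KL}}\big(p_{\pi^*}(\mathcal{T}),\, p_{\pi}(\mathcal{T})\big)
  &= \mathbb{E}_{\mathcal{T} \sim p_{\pi^*}}
     \left[ \log \frac{p_{\pi^*}(\mathcal{T})}{p_{\pi}(\mathcal{T})} \right] \\
  &= \mathbb{E}_{\mathcal{T} \sim p_{\pi^*}}
     \left[ \sum_{t=0}^{H-1} \log \pi^*(a_t \mid s_t) - \log \pi(a_t \mid s_t) \right],
\end{align}
% because the initial-state and dynamics terms are shared by both
% distributions and cancel inside the logarithm.
```

Under these assumptions, the divergence can be estimated from expert rollouts alone by averaging the summed log-probability differences, which is why both the expert and the learned policy need well-defined action densities.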

On the number of TS iterations
We investigate empirically how the quality of the dataset improves after each iteration; see Definition 3. We repeat TS on each D4RL dataset, each time using a newly estimated value function to take into account the newly generated transitions. In all our experiments, we choose 5 iterations. Figure 4 shows the scores of the D4RL environments on the different iterations, with the standard deviation across seeds shown as the error bar. Iteration 0 denotes the BC score obtained on the original D4RL datasets. For all datasets, we observe that the average scores of BC increase initially over a few iterations, then remain stable with only some minor random fluctuations. We see less improvement in the expert datasets, as there are fewer trajectory improvements to be made. Conversely, for the medium-expert datasets, more iterations are required to reach an improved performance. For Hopper and Walker2d medium-replay, there is a higher degree of standard deviation across the seeds, which gives a less stable average as the number of iterations increases.

Ablation studies
In this section, we perform ablation studies to assess the impact of the reward model on TS performance and the effect of value-weighted BC.

Choice of reward model
Model-based TS requires a predictive model for the rewards associated with the stitched transitions, enabling a value function to be learned on the new dataset. Unlike some online methods [68,69], we do not have access to the true reward function at training time, and so a model must be trained to predict rewards. There are many choices of model. For example, MBPO [35], MOPO [41] and MBOP [24] use a neural network that outputs the parameters of a Gaussian distribution to predict the next state and reward. These models couple the prediction of the next state with that of the reward. We solely want to predict the reward and consider the following options: a Gaussian distribution whose parameters are modelled by a neural network, a Wasserstein-GAN, a VAE, and a multilayer neural network (MLP) that minimises the mean square error between true and predicted reward.
We evaluate the reward models on the D4RL hopper-medium dataset and perform a 95:5 training and test split. To make it a fair test, all models are trained on the same training data and each model has two hidden layers of size 512. In TS we want to predict a reward for an unseen transition, where $s$ and $s'$ are in the dataset but have never been connected by an observed action. Therefore, we evaluate the trained reward models on unseen data to test their out-of-distribution (OOD) performance. Table 2 shows the MSE between predicted and true rewards of the models on the rest of the D4RL hopper datasets: random, expert and medium-replay. The GAN, VAE and MLP perform very similarly, achieving accurate predictions on all three datasets. The VAE and MLP outperform the GAN in predicting rewards of the expert dataset. The Gaussian model performed very poorly on these datasets.
Finally, we compare TS(WGAN)+BC with TS(MLP)+BC on the D4RL datasets; here, either a WGAN or an MLP is used to predict the reward. Table 3 shows that the choice between a WGAN and an MLP makes no significant difference, as both are accurate enough at predicting rewards.

Value-weighted BC
Value-weighted BC gives larger weight to high-value states and lower weight to low-value states during training. On the Hopper medium and medium-expert datasets, training this weighted-BC method gives only a slight improvement over the original BC-cloned policy. For hopper-medium, weighted-BC achieves an average score of 59.21 (with standard deviation 3.4); this is an improvement over BC (55.3), but lower than TS+BC (64.3). Weighted-BC on hopper-medium-expert achieves an average score of 66.02 (with standard deviation 6.9); again, this is a slight improvement over BC (62.3), but significantly lower than TS+BC (94.8). The experiments indicate that using a value function to weight the relative importance of seen states when optimising the BC objective function is not sufficient to achieve the performance gains introduced by TS.
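A value-weighted BC objective of the kind discussed here might look like the sketch below. The exact weighting used in the experiments is not recoverable from this section, so the normalisation chosen here (shifting the estimated values to be non-negative and scaling them to sum to one) is an assumption, as are the callables `policy` and `value_fn`.

```python
def value_weighted_bc_loss(states, actions, policy, value_fn):
    """Weighted squared error between policy outputs and dataset actions.

    Assumption: weights are estimated state values, shifted so the smallest
    is zero and normalised to sum to one. If all values are equal, the loss
    falls back to uniform weights, i.e. ordinary BC.
    """
    values = [value_fn(s) for s in states]
    lowest = min(values)
    shifted = [v - lowest for v in values]
    total = sum(shifted)
    if total == 0.0:
        weights = [1.0 / len(states)] * len(states)
    else:
        weights = [v / total for v in shifted]

    loss = 0.0
    for s, a, w in zip(states, actions, weights):
        predicted = policy(s)
        loss += w * sum((p - t) ** 2 for p, t in zip(predicted, a))
    return loss
```

Unlike TS, this only re-weights transitions that already exist in the dataset; it cannot introduce new transitions between previously disconnected states.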

Conclusion
In this paper, we have proposed an iterative data improvement strategy, Trajectory Stitching, which can be applied to historical datasets containing demonstrations of sequential decisions taken to solve a complex task. At each iteration, TS performs one-step stitching between reachable states within the dataset that lead to higher future expected returns. We have demonstrated that, without further interactions with the environment, TS improves the quality of the historical demonstrations, which in turn has the effect of boosting the performance of BC-extracted policies significantly. Extensive experimental results using the D4RL benchmarking data have demonstrated that TS always improves the underlying behaviour policy. We have also demonstrated that TS is beneficial beyond BC, when combined with existing offline reinforcement learning methods. In particular, TS can be used to extract an improved explicit BC-based regulariser for TD3+BC, as well as an improved BC prior for offline model-based planning (MBOP). TS-based methods achieve state-of-the-art results in 10 out of the 12 D4RL datasets considered.
We believe that this work opens up a number of directions for future investigation. For example, TS could be extended to multi-agent offline policy learning by reformulating Eq. (2) in terms of actions taken by multiple agents. Beyond the realm of offline RL, TS may also be useful for learning with sub-optimal demonstrations, e.g. by inferring a reward function through inverse RL. Historical demonstrations can also be used to guide RL and improve the data efficiency of online RL [70]. In these cases, BC can be used to initialise or regularise the training policy [71,72].

The encoder takes in a tuple consisting of state, action and next state; it encodes it into a mean $\mu_q$ and standard deviation $\sigma_q$ of a Gaussian distribution $\mathcal{N}(\mu_q, \sigma_q)$. The latent variable $z$ is then sampled from this distribution and used as input for the decoder along with the state, $s$, and next state, $s'$. The decoder outputs an action that is likely to connect $s$ and $s'$. The CVAE is trained for 400,000 gradient steps with hyperparameters given in Table A1.

Reward function
The reward function is used to predict reward signals associated with new transitions, $(s, a, s')$. For this model, we use a conditional WGAN with two hidden layers of size 512. The generator, $G_\phi$, takes in a state $s$, action $a$, next state $s'$ and latent variable $z$; it outputs a reward $r$ for that transition. The discriminator takes a full transition $(s, a, r, s')$ as input to determine whether this transition is likely to have come from the dataset or not. In the reward ablation study, all models use the same number of hidden layers and layer size and are trained for 500k iterations.
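For readers unfamiliar with the Wasserstein formulation, the two training losses can be sketched as below. This shows only the standard WGAN objectives applied to scalar scores assigned to real and generated transitions; how the Lipschitz constraint is enforced (for example weight clipping or a gradient penalty) is not specified in this section and is left out. Function names are illustrative.

```python
def critic_loss(real_scores, fake_scores):
    """Wasserstein critic loss to minimise: E[D(generated)] - E[D(real)].

    real_scores: critic outputs on dataset transitions (s, a, r, s').
    fake_scores: critic outputs on transitions with generated rewards.
    """
    mean_fake = sum(fake_scores) / len(fake_scores)
    mean_real = sum(real_scores) / len(real_scores)
    return mean_fake - mean_real


def generator_loss(fake_scores):
    """Generator loss to minimise: -E[D(generated)]."""
    return -sum(fake_scores) / len(fake_scores)
```

The critic is pushed to score dataset transitions higher than generated ones, while the reward generator is pushed to produce rewards whose transitions the critic scores highly.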

Value function
Similarly to previous methods [13], our value function $V_\theta$ takes the minimum of two value functions, $\{V_{\theta_1}, V_{\theta_2}\}$. Each value function is a neural network with two hidden layers of size 256 and a ReLU activation. The value function takes in a state $s$ and estimates the sum of future rewards of being in that state and following the policy (of the dataset) thereon.
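The two ingredients described here, a pessimistic minimum over two value estimates and a squared Bellman error, can be sketched as follows. The discount factor, the `done` flag and the use of a separate target function are assumptions made for the sake of a self-contained example; they are not stated in this section.

```python
def pessimistic_value(s, v1, v2):
    """Combine two value estimates by taking their minimum."""
    return min(v1(s), v2(s))


def squared_bellman_error(batch, v, v_target, gamma=0.99):
    """Mean squared one-step TD error for a state-value function.

    batch: iterable of (s, r, s_next, done) tuples from the dataset.
    v: callable being trained; v_target: callable used for the bootstrap
    target (assumed here; it may simply be `v` itself).
    """
    total, count = 0.0, 0
    for s, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * v_target(s_next)
        total += (r + bootstrap - v(s)) ** 2
        count += 1
    return total / count
```

Because the transitions come from the dataset, minimising this error yields a value estimate for the behaviour policy that generated the data, which is what TS needs in order to compare candidate next states.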

KL-divergence experiment
As the KL-divergence requires a continuous policy, the BC policy network is a 2-layer MLP of size 256 with ReLU activation, but with the final layer outputting the parameters of a Gaussian, $\mu_s$ and $\sigma_s$. We carry out maximum likelihood estimation using a batch size of 256. For the Walker2d experiments, TS was slightly adapted to only accept new trajectories if they made fewer than ten changes. For each level of difficulty, TS is run 3 times and the scores are the average of the mean returns over 10 evaluation trajectories of 5 random seeds of BC. To compute the KL-divergence, a continuous expert policy is also required, but TD3 gives a deterministic one. To overcome this, a continuous expert policy is created by assuming a state-dependent normal distribution centred around $\pi^*(s)$ with a standard deviation of 0.01.
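With both policies represented as diagonal Gaussians, the per-state divergence has a closed form. The sketch below illustrates this for a single state; the policy interfaces are hypothetical, and how the per-state terms are aggregated over trajectories follows [57] and is not shown here.

```python
import math


def diag_gaussian_kl(mean_p, std_p, mean_q, std_q):
    """Closed-form KL( N(mean_p, diag(std_p^2)) || N(mean_q, diag(std_q^2)) )."""
    kl = 0.0
    for mp, sp, mq, sq in zip(mean_p, std_p, mean_q, std_q):
        kl += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return kl


def expert_vs_learned_kl(state, expert_policy, learned_policy, expert_std=0.01):
    """KL from the smoothed expert to the learned Gaussian policy at one state.

    expert_policy: callable returning the deterministic action pi*(s) as a list.
    learned_policy: callable returning (mean, std) lists (assumed interface).
    """
    expert_mean = expert_policy(state)
    learned_mean, learned_std = learned_policy(state)
    expert_stds = [expert_std] * len(expert_mean)
    return diag_gaussian_kl(expert_mean, expert_stds, learned_mean, learned_std)
```

The small expert standard deviation of 0.01 keeps the smoothed expert close to the deterministic TD3 policy while giving it a well-defined density.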

Search procedure for candidate next states
Calculating $p(s' \mid s)$ for all $s' \in D$ may be computationally inefficient. To speed this up in the MuJoCo environments, we initially select a smaller set of candidate next states by thresholding the Euclidean distance. Although on its own a geometric distance would not be sufficient to identify stitching events, we found that in our environments it can help reduce the set of candidate next states, thus alleviating the computational workload. To pre-select a smaller set of candidate next states, we use two criteria. Firstly, from a transition $(s, a, r, s') \in D$, a neighbourhood of states around $s$ is taken and the following state in each of their trajectories is collected. Secondly, all the states in a neighbourhood around $s'$ are collected. This process ensures that all candidate next states are geometrically similar to $s'$ or are preceded by states geometrically similar to $s$. The neighbourhood of a state is an $\epsilon$-ball around the state. When $\epsilon$ is large enough, we can retain all feasible candidate next states for evaluation with the forward dynamics model. Fig. A1 illustrates this procedure.
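The two pre-selection criteria can be sketched as a brute-force search over the dataset. This is an illustration under simplifying assumptions: states are plain tuples of floats, trajectories are lists of states, and every distance is computed explicitly, whereas a practical implementation would likely use a spatial index. The function name is hypothetical.

```python
import math


def candidate_next_states(trajectories, s, s_next, eps):
    """Collect candidate next states for the transition (s, ., s_next).

    trajectories: list of trajectories, each a list of states (tuples of floats).
    eps: radius of the Euclidean ball defining a neighbourhood.
    """
    candidates = []
    for traj in trajectories:
        for i, state in enumerate(traj):
            # Criterion 1: the state is near s, so its successor is a candidate.
            if i + 1 < len(traj) and math.dist(state, s) <= eps:
                candidates.append(traj[i + 1])
            # Criterion 2: the state itself is near the original next state.
            if math.dist(state, s_next) <= eps:
                candidates.append(state)
    # Remove duplicates while keeping first-seen order.
    return list(dict.fromkeys(candidates))
```

Note that the original next state is always returned, since it lies in its own neighbourhood; the remaining candidates are then passed to the dynamics models for the plausibility check.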

D4RL experiments
For the D4RL experiments, we run TS 3 times for each dataset and average the mean returns over 10 evaluation trajectories of 5 random seeds of BC to obtain the results for TS+BC. For the BC results, we average the mean returns over 10 evaluation trajectories of 5 random seeds. The BC policy network is a 2-layer MLP of size 256 with ReLU activation; the final layer has a tanh activation multiplied by the action dimension. We use the Adam optimiser with a learning rate of 1e-3 and a batch size of 256.
The hyperparameters we use for MBOP are given in Table A2. TD3+BC is trained for 1000k iterations; we also train TD3+TS+BC for 1000k iterations, with the actor and critic dimensions the same as in the original implementation. For TD3+TS+BC, we warm start the algorithm on the original data and train for 800k iterations, and then carry on training for the remaining 200k iterations on the new TS data. As the TS dataset contains many duplicate transitions, we remove all duplicates from the dataset when training with TD3+BC. For the hopper datasets (except medium-expert), the policy is improved if the data is swapped to the TS dataset at 600k iterations. For the walker2d medium-replay dataset, the critic is fixed and training on the TS dataset starts at 900k iterations.
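The warm-start schedule described above amounts to a simple rule that maps a gradient step to the dataset in use and to whether the critic is still being updated. The sketch below encodes that rule; the function and argument names are illustrative, and the convention that steps are counted from zero is an assumption.

```python
def training_phase(step, switch_step=800_000, total_steps=1_000_000,
                   freeze_critic_after_switch=False):
    """Return (dataset_name, update_critic) for a zero-indexed gradient step.

    Before `switch_step`, both actor and critic train on the original data.
    From `switch_step` onwards, training uses the TS data; the critic is
    frozen in that phase only if `freeze_critic_after_switch` is set.
    """
    if not 0 <= step < total_steps:
        raise ValueError("step lies outside the training budget")
    if step < switch_step:
        return "original", True
    return "ts", not freeze_critic_after_switch
```

For example, the default arguments correspond to the 800k/200k split, `switch_step=600_000` to the earlier swap reported for most hopper datasets, and `switch_step=900_000` with `freeze_critic_after_switch=True` to the walker2d medium-replay variant.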

Fig. 2 Comparative performance of BC and TS+BC as the fraction of expert trajectories increases up to 40%. For two environments, Hopper (left) and Walker2d (right), we report the average return of 10 trajectory evaluations of the best checkpoint during BC training. BC has been trained over 5 random seeds and TS has produced 3 datasets over different random seeds.

Fig. 3 Estimated KL-divergence and MSE of the BC and TS+BC policies on the Hopper and Walker2d environments as the fraction of expert trajectories increases. (Left) Relative difference between the KL-divergence of the BC policy and the expert and the KL-divergence of the TS+BC policy and the expert. Larger values represent the TS+BC policy being closer to the expert than the BC policy. (Middle and Right) MSE between actions evaluated from the expert policy and the learned policy on states from the Hopper (Middle) and Walker2d (Right) environments. The y-axes (Middle and Right) are on a log scale. All policies were collected by training BC over 5 random seeds, with TS being evaluated over 3 different random seeds. All KL-divergences were scaled between 0 and 1, depending on the minimum and maximum values per task, before the difference was taken.

Fig. 4 Returns of BC-extracted policies as the number of iterations of TS is increased. Iteration 0 corresponds to the BC scores on the original D4RL datasets. The error bars represent the standard deviation of the average returns of 10 trajectory evaluations over 5 random seeds of BC and 3 random seeds of TS.

Fig. 5 shows the mean square error (MSE) between predicted and true rewards during training, on both the training and test sets. The VAE and MLP models clearly perform best, attaining the smallest error, with training and test errors of order $10^{-5}$. The average reward for a transition in the hopper-medium dataset is 3.11, so the GAN also performs very well, attaining training and test errors of order $10^{-4}$.

Fig. A1 Visualisation of our two definitions of a neighbourhood. For a transition $(s_t, a_t, s_{t+1}) \in D$, the neighbourhoods are used to reduce the size of the set of candidate next states. (Left) All states within an $\epsilon$-ball of the current state, $s_t$, are taken and the next states in their respective trajectories (joined by an action shown as an arrow) are added to the set of candidate next states. (Right) All states within an $\epsilon$-ball of the next state, $s_{t+1}$, are added to the set of candidate next states. The full set of candidate next states is highlighted in yellow.
In our context, we use a state-value function to assess whether a candidate next state offers a potential improvement over the original next state. To accurately estimate the future returns given the current state, we calculate a state-value function dependent on the behaviour policy of the dataset. The function $V_\theta(s)$ is approximated by an MLP neural network parameterised by $\theta$. The parameters are learned by minimising the squared Bellman error [56].

Table 2
Assessment of different types of models to predict reward on the hopper-medium D4RL dataset. The MSE between predicted and true rewards is assessed during training on a test set and training set of the same size. MSE between true and predicted rewards from the reward functions evaluated on the other D4RL hopper datasets. This table shows the performance of the reward models when evaluated on unseen data. The standard deviation is over the whole dataset.

Table 3 Comparison of BC, TS(WGAN)+BC and TS(MLP)+BC on the D4RL locomotion tasks. For the TS methods, the mean performance is provided over 3 datasets of TS and 5 seeds of BC, and the standard deviation is given over the total of 15 policies.

TS uses a value function to estimate the future returns from any given state. Therefore, TS+BC has a natural advantage over plain BC, which uses only the states and actions. To check whether using a value function alone is sufficient to improve the performance of BC, we investigate a weighted version of the BC loss function in which the weights are given by the estimated value function.