Imitation Learning by State-Only Distribution Matching

Imitation learning from observation describes policy learning in a way similar to human learning: an agent's policy is trained by observing an expert performing a task. While many state-only imitation learning approaches are based on adversarial imitation learning, one main drawback is that adversarial training is often unstable and lacks a reliable convergence estimator. If the true environment reward is unknown and cannot be used to select the best-performing model, this can result in poor real-world policy performance. We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric. Our training objective minimizes the Kullback-Leibler divergence (KLD) between the policy and expert state transition trajectories, which can be optimized in a non-adversarial fashion. Such methods demonstrate improved robustness when learned density models guide the optimization. We further improve the sample efficiency by rewriting the KLD minimization as the Soft Actor-Critic objective based on a modified reward using additional density models that estimate the environment's forward and backward dynamics. Finally, we evaluate the effectiveness of our approach on well-known continuous control environments and show state-of-the-art performance while having a reliable performance estimator compared to several recent learning-from-observation methods.


INTRODUCTION
Imitation learning (IL) describes methods that learn optimal behavior that is represented by a collection of expert demonstrations. While in standard reinforcement learning (RL) the agent is trained on environment feedback using a reward signal, IL can alleviate the problem of designing effective reward functions. This is particularly useful for tasks where demonstrations are more accessible than designing a reward function. One popular example is to train traffic agents in a simulation to mimic real-world road users [Kuefler et al., 2017].
* Both authors contributed equally.
Learning-from-demonstrations (LfD) describes IL approaches that require state-action pairs from expert demonstrations [Ho and Ermon, 2016]. While actions can guide policy learning, it might be very costly or even impossible to collect actions alongside state demonstrations in many real-world setups, for example when expert demonstrations are available as video recordings without additional sensor signals. One example of such a setup is training traffic agents in a simulation, where the expert data contains recordings of traffic in bird's eye view [Kuefler et al., 2017]. No direct information on the vehicle physics, throttle, and steering angle is available. Another example is teaching a robot to pick, move, and place objects based on human demonstrations [Osa et al., 2018]. In such scenarios, actions have to be estimated based on sometimes incomplete information to train an agent to imitate the observed behavior.
Alternatively, learning-from-observations (LfO) performs state-only IL and trains an agent without actions being available in the expert dataset [Torabi et al., 2019]. While LfO is a more challenging task than LfD, it can be more practical in case of incomplete data sources. Learning an additional environment model may help to infer actions based on expert observations and the learned environment dynamics [Torabi et al., 2018a]. Like LfD, distribution matching based on an adversarial setup is commonly used in LfO [Torabi et al., 2018b]. In adversarial imitation learning (AIL), a policy is trained using an adversarial discriminator, which is used to estimate a reward that guides policy training. While AIL methods obtain better performing agents than supervised methods like behavioral cloning (BC) using less data, adversarial training often has stability issues [Miyato et al., 2018] and under some conditions is not guaranteed to converge [Jin et al., 2020]. Additionally, estimating the performance of a trained policy without access to the environment reward can be very challenging. While the duality gap [Grnarova et al., 2019; Sidheekh et al., 2021] is a convergence metric suited for GAN-based methods, it is difficult to use in the AIL setup since it relies on the gradient of the generator for an optimization process. In the AIL setup, the generator consists of the policy and the environment, and therefore the gradient is difficult to estimate with black-box environments. As an alternative for AIL setups, the predicted reward (discriminator output) or the policy loss can be used to estimate the performance.
arXiv:2202.04332v1 [cs.LG] 9 Feb 2022
To address the limitations of AIL, we propose a state-only distribution matching method that learns a policy in a non-adversarial way. We optimize the Kullback-Leibler divergence (KLD) between the actionless policy and expert trajectories by minimizing the KLD of the conditional state transition distribution of the policy and the expert for all time steps. We estimate the expert state transition distribution using normalizing flows, which can be trained offline using the expert dataset. Thus, stability issues arising from the min-max adversarial optimization in AIL methods can be avoided. This objective is similar to FORM [Jaegle et al., 2021], which was shown to be more stable in the presence of task-irrelevant features.
While maximum entropy RL methods [Ziebart, 2010] can improve policy training by increasing the exploration of the agent, they also add a bias when used to minimize the proposed KLD. To match the transition distributions of the policy and the expert exactly, the state-next-state distribution of the policy is expanded into the policy entropy, the forward dynamics, and the inverse action model of the environment. It has been shown that such dynamics models can improve convergence [Zhu et al., 2020] and are well suited to infer actions not available in the dataset [Torabi et al., 2018a]. We model these distributions using normalizing flow models, which have been demonstrated to perform very well on learning complex probability distributions [Papamakarios et al., 2019]. Combining all estimates results in an interpretable reward that can be used together with standard maximum entropy RL methods [Haarnoja et al., 2018a]. The optimization based on the KLD provides a reliable convergence metric of the training and a good estimator for policy performance.
As contributions, we derive SOIL-TDM (State Only Imitation Learning by Trajectory Distribution Matching), a non-adversarial LfO method which minimizes the KLD between the conditional state transition distributions of the policy and the expert using maximum entropy RL. We show the convergence of the proposed method using off-policy samples from a replay buffer. We develop a practical algorithm based on the SOIL-TDM objective and demonstrate its effectiveness in measuring convergence compared to several other state-of-the-art methods. Empirically, we compare our method to the recent state-of-the-art IL approaches OPOLO [Zhu et al., 2020], f-IRL [Ni et al., 2020], and FORM [Jaegle et al., 2021] in complex continuous control environments. We demonstrate that our method is superior especially if the selection of the best policy cannot be based on the true environment reward signal. This is a setting which more closely resembles real-world applications in autonomous driving or robotics, where it is difficult to define a reward function [Osa et al., 2018].

Background
In this work, we want to train a stochastic policy function $\pi_\theta(a_t \mid s_t)$ in continuous action spaces with parameters $\theta$ in a sequential decision making task considering finite-horizon environments. The problem is modeled as a Markov Decision Process (MDP), which is described by the tuple $(S, A, p, r)$ with the continuous state space $S$ and action space $A$. The transition probability is described by $p(s_{t+1} \mid s_t, a_t)$ and the bounded reward function by $r(s_t, a_t)$. At every time step $t$ the agent interacts with its environment by observing a state $s_t$ and taking an action $a_t$. This results in a new state $s_{t+1}$ and a reward signal $r_{t+1}$ based on the transition probability and reward function. We will use $\mu^{\pi_\theta}(s_t, a_t)$ to denote the state-action marginals at time step $t$ of the trajectory distribution induced by the policy $\pi_\theta(a_t \mid s_t)$.

Maximum Entropy Reinforcement Learning and Soft Actor Critic
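The standard objective in RL is the expected sum of undiscounted rewards $\sum_{t=0}^{T-1} \mathbb{E}_{(s_t, a_t) \sim \mu^{\pi_\theta}}[r(s_t, a_t)]$. The goal of the agent is to learn a policy $\pi_\theta(a_t \mid s_t)$ which maximises this objective. The maximum entropy objective introduces a modified goal for the RL agent, where the agent has to maximise the sum of the reward signal and its output entropy $H(\pi_\theta(\cdot \mid s))$ [Ziebart, 2010]:
$$J(\pi_\theta) = \sum_{t=0}^{T-1} \mathbb{E}_{(s_t, a_t) \sim \mu^{\pi_\theta}}\left[ r(s_t, a_t) + \alpha H(\pi_\theta(\cdot \mid s_t)) \right]. \quad (1)$$
The parameter $\alpha$ controls the stochasticity of the optimal policy by determining the relative importance of the entropy term versus the reward.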
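Soft Actor-Critic (SAC) [Haarnoja et al., 2018a; Haarnoja et al., 2018b] combines off-policy Q-learning with a stable stochastic actor-critic formulation. The soft Q-function parameters $\Psi$ can be trained with:
$$J_Q(\Psi) = \mathbb{E}_{(s_t, a_t) \sim D_{RB}}\left[ \tfrac{1}{2} \left( Q_\Psi(s_t, a_t) - \left( r(s_t, a_t) + \mathbb{E}_{s_{t+1} \sim p}\left[ V_{\bar{\Psi}}(s_{t+1}) \right] \right) \right)^2 \right], \quad (2)$$
where $D_{RB}$ is the replay buffer and $\bar{\Psi}$ denotes the parameters of a slowly updated target network. The soft state value function is defined by:
$$V_{\bar{\Psi}}(s_t) = \mathbb{E}_{a_t \sim \pi_\theta}\left[ Q_{\bar{\Psi}}(s_t, a_t) - \alpha \log \pi_\theta(a_t \mid s_t) \right]. \quad (3)$$
Lastly, the policy is optimized by minimizing the following objective:
$$J_\pi(\theta) = \mathbb{E}_{s_t \sim D_{RB}}\left[ \mathbb{E}_{a_t \sim \pi_\theta}\left[ \alpha \log \pi_\theta(a_t \mid s_t) - Q_\Psi(s_t, a_t) \right] \right]. \quad (4)$$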
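To make the maximum entropy objective concrete, the following minimal sketch estimates it from a single rollout. It assumes that the entropy at each step is approximated by the one-sample estimate $-\log \pi_\theta(a_t \mid s_t)$; the function name `soft_return` and the numbers in the usage line are illustrative only and are not taken from the SAC implementation.

```python
def soft_return(rewards, log_probs, alpha=1.0):
    """Single-rollout estimate of sum_t [r_t + alpha * H(pi(.|s_t))].

    The entropy at each step is approximated by the one-sample estimate
    -log pi(a_t | s_t), where a_t is the action that was actually taken.
    """
    if len(rewards) != len(log_probs):
        raise ValueError("rewards and log_probs must have the same length")
    return sum(r - alpha * lp for r, lp in zip(rewards, log_probs))


# Toy rollout with made-up numbers: two steps with rewards 1 and 2.
example = soft_return([1.0, 2.0], [-0.5, -1.0], alpha=0.5)
```

A larger `alpha` increases the weight of the entropy term relative to the reward, which is the role the text assigns to $\alpha$.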

Imitation Learning
In the IL setup, the agent does not have access to the true environment reward function $r(s_t, a_t)$ and instead has to imitate expert trajectories performed by an expert policy $\pi_E$ collected in a dataset $D_E$.
In the typical learning-from-demonstration setup the expert demonstrations $D^{LfD}_E := \{(s^k_t, a^k_t, s^k_{t+1})\}_{k=1}^{N}$ are given by state-action-next-state transitions. Distribution matching has been a popular choice among different LfD approaches. The policy $\pi_\theta$ is learned by minimizing the discrepancy between the stationary state-action distribution induced by the expert $\mu^E(s, a)$ and the policy $\mu^{\pi_\theta}(s, a)$. An overview and comparison of different LfD objectives resulting from this discrepancy minimization was done by Ghasemipour et al. [2019].
Often the backward KLD is used to measure this discrepancy [Fu et al., 2017]:
$$D_{KL}\left( \mu^{\pi_\theta}(s, a) \,\|\, \mu^E(s, a) \right). \quad (5)$$
Learning-from-observation (LfO) considers a more challenging task where expert actions are not available. Hence, the demonstrations $D^{LfO}_E := \{(s^k_t, s^k_{t+1})\}_{k=1}^{N}$ consist of state-next-state transitions. The policy learns which actions to take based on interactions with the environment and the expert state transitions. Distribution matching based on state-transition distributions is a popular choice for state-only IL [Torabi et al., 2018b; Zhu et al., 2020]:
$$D_{KL}\left( \mu^{\pi_\theta}(s', s) \,\|\, \mu^E(s', s) \right).$$
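As an illustration of how such a divergence can be estimated from samples, the sketch below computes a Monte Carlo estimate of $D_{KL}(p \,\|\, q)$ using samples drawn from the first argument, which plays the role of the policy distribution in the backward KLD. The two 1-D Gaussians are a toy assumption chosen because the closed form is available for comparison; they stand in for the learned density models and are not part of the method.

```python
import math
import random


def gaussian_log_pdf(x, mean, std):
    """Log density of a 1-D Gaussian."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def kl_monte_carlo(n_samples, mean_p, std_p, mean_q, std_q, seed=0):
    """Estimate D_KL(p || q) from samples of p (the first argument)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mean_p, std_p)
        total += (gaussian_log_pdf(x, mean_p, std_p)
                  - gaussian_log_pdf(x, mean_q, std_q))
    return total / n_samples


def kl_closed_form(mean_p, std_p, mean_q, std_q):
    """Analytic D_KL between two 1-D Gaussians, used only as a reference."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)
```

Only log-densities of both distributions and samples from the first one are needed, which is why learned density models are sufficient to evaluate such an objective.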

Method
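In a finite horizon MDP setting, the joint state-only trajectory distributions are defined by the start state distribution $p(s_0)$ and the product of the conditional state transition distributions $p(s_{i+1} \mid s_i)$. For the policy distribution $\mu^{\pi_\theta}$ and the expert distribution $\mu^E$ this becomes:
$$\mu^{\pi_\theta}(s_T, \ldots, s_0) = p(s_0) \prod_{i=0}^{T-1} \mu^{\pi_\theta}(s_{i+1} \mid s_i), \qquad \mu^E(s_T, \ldots, s_0) = p(s_0) \prod_{i=0}^{T-1} \mu^E(s_{i+1} \mid s_i).$$
Our goal is to match the state-only trajectory distribution $\mu^{\pi_\theta}$ induced by the policy with the state-only expert trajectory distribution $\mu^E$ by minimizing the Kullback-Leibler divergence (KLD) between them:
$$D_{KL}(\mu^{\pi_\theta} \,\|\, \mu^E) = \sum_{i=0}^{T-1} \mathbb{E}_{(s_{i+1}, s_i) \sim \mu^{\pi_\theta}}\left[ \log \mu^{\pi_\theta}(s_{i+1} \mid s_i) - \log \mu^E(s_{i+1} \mid s_i) \right]. \quad (7)$$
The conditional expert state transition distribution $\mu^E(s_{i+1} \mid s_i)$ can be learned offline from the demonstrations, for example by training a conditional normalizing flow on the given state/next-state pairs. The policy-induced conditional state transition distribution can be rewritten with Bayes' theorem using the environment model $p(s_{i+1} \mid a_i, s_i)$ and the inverse action distribution density $\pi'_\theta(a_i \mid s_{i+1}, s_i)$:
$$\mu^{\pi_\theta}(s_{i+1} \mid s_i) = \frac{p(s_{i+1} \mid a_i, s_i)\, \pi_\theta(a_i \mid s_i)}{\pi'_\theta(a_i \mid s_{i+1}, s_i)}. \quad (8)$$
This holds for any $a_i$ where $\pi'_\theta > 0$. Thus, one can extend the expectation over $(s_{i+1}, s_i)$ by the action $a_i$, and the KLD minimization $\min D_{KL}(\mu^{\pi_\theta} \,\|\, \mu^E)$ can be rewritten as
$$\min_\theta \sum_{i=0}^{T-1} \mathbb{E}_{(s_{i+1}, a_i, s_i) \sim \pi_\theta}\left[ \log p(s_{i+1} \mid a_i, s_i) + \log \pi_\theta(a_i \mid s_i) - \log \pi'_\theta(a_i \mid s_{i+1}, s_i) - \log \mu^E(s_{i+1} \mid s_i) \right]. \quad (9)$$
Defining the reward
$$r(a_i, s_i) := \mathbb{E}_{s_{i+1} \sim p(\cdot \mid a_i, s_i)}\left[ -\log p(s_{i+1} \mid a_i, s_i) + \log \pi'_\theta(a_i \mid s_{i+1}, s_i) + \log \mu^E(s_{i+1} \mid s_i) \right], \quad (10)$$
this is equivalent to the maximum entropy RL objective
$$\max_\theta \sum_{i=0}^{T-1} \mathbb{E}_{(a_i, s_i) \sim \pi_\theta}\left[ r(a_i, s_i) + H(\pi_\theta(\cdot \mid s_i)) \right]. \quad (11)$$
In practice the reward function $r(a_i, s_i)$ can be computed using Monte Carlo integration with a single sample from $p(s_{i+1} \mid a_i, s_i)$ using the replay buffer.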
This max-entropy RL task can be optimized with standard max-entropy RL algorithms. In this work, we applied the SAC algorithm [Haarnoja et al., 2018a] as outlined in Section 2.1.
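The one-sample reward can be sketched as follows, assuming the reward combines the three log-densities in the way the Bayes decomposition of the policy transition density suggests: the expert transition and inverse action log-densities enter positively and the forward dynamics log-density enters negatively. The function names are hypothetical, and the inputs are plain numbers standing in for the outputs of the learned density models.

```python
def soil_tdm_reward(log_expert, log_inverse, log_forward):
    """One-sample reward for a replay buffer transition (s_i, a_i, s_{i+1}).

    log_expert:  log mu_E(s_{i+1} | s_i)       (expert transition model)
    log_inverse: log pi'(a_i | s_{i+1}, s_i)   (inverse action model)
    log_forward: log p(s_{i+1} | a_i, s_i)     (forward dynamics model)
    """
    return log_expert + log_inverse - log_forward


def kld_term(log_policy, log_expert, log_inverse, log_forward):
    """Per-transition contribution to the KLD estimate.

    Equals log pi(a_i | s_i) minus the reward, i.e. the negative of
    (reward + one-sample entropy estimate).
    """
    return log_policy - soil_tdm_reward(log_expert, log_inverse, log_forward)
```

Averaging `kld_term` over transitions collected with the current policy gives, under these assumptions, a per-step estimate of the same quantity that the max-entropy RL algorithm optimizes, which is what makes it usable as a convergence indicator.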
The extension to infinite horizon tasks can be done by introducing a discount factor $\gamma$ as in the work by Haarnoja et al. [2018b]. In combination with our reward definition one obtains the following infinite horizon maximum entropy objective:

Algorithm
To evaluate the reward function, the environment model $p(s_{i+1} \mid a_i, s_i)$ and the inverse action distribution function $\pi'_\theta(a_i \mid s_{i+1}, s_i)$ have to be estimated. We model both distributions using conditional normalizing flows and train them with maximum likelihood based on expert demonstrations and rollout data from a replay buffer.
The whole training process according to Algorithm 1 is described in the following. The expert state transition model $\mu^E(s_{t+1} \mid s_t)$ is trained offline using the expert dataset $D_E$, which contains $K$ expert state trajectories. We assume the expert distribution is correctly represented by the dataset. Density modeling of the expert state transitions can still result in overfitting when only few expert demonstrations are available (in the few sample limit). We improved the expert training process by adding Gaussian noise to the state values. The standard deviation of the noise is reduced during training so that the model has a correct estimate of the density without overfitting to the explicit demonstrations. With this approach we are able to successfully train expert state transition models on as few as one expert trajectory. The influence of this improved routine is studied in Appendix A.6.
After this initial step the following process is repeated in each episode until convergence. The policy interacts with the environment for $T$ steps to collect state-action-next-state information, which is saved in the replay buffer $D_{RB}$. The conditional normalizing flows for the environment model $\mu_\phi(s_{i+1} \mid a_i, s_i)$ (policy independent) and the inverse action distribution model $\mu_\eta(a_i \mid s_{i+1}, s_i)$ (policy dependent) are optimized using samples from the replay buffer $D_{RB}$ for $N$ steps. In Appendix A.3 we show that this reduces the KLD (Equation 9) in each step. Afterwards we use the learned models together with the samples from the replay buffer to compute a one-sample Monte Carlo approximation of the reward to train the Q-function. The policy $\pi_\theta(a_t \mid s_t)$ is updated using SAC. The SAC-based Q-function training and policy optimization also minimize Equation 9 (see Equation 11) in each step. Together with the inverse action policy learning they lead to a converging algorithm, since all steps reduce the KLD, which is bounded from below by 0.
It is worth noting that the overall algorithm is non-adversarial: the inverse action policy optimization and the policy optimization using SAC both reduce the overall objective, the KLD. Contrary to AIL algorithms like OPOLO, it is not based on an adversarial nested min-max optimization. Additionally, we can estimate the similarity of state transitions from our policy to the expert during each optimization step, since we model all densities in the rewritten KLD from Equation 9. As a result we have a reliable performance estimate, enabling us to select the best performing policy based on the lowest KLD between policy and expert state transition trajectories.
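The noise augmentation used when fitting the expert transition model can be sketched as below. The text only states that the standard deviation is reduced during training; the linear schedule and the start and end values here are assumptions made for illustration, and the function names are hypothetical.

```python
import random


def noise_std(step, total_steps, start_std=0.1, end_std=0.0):
    """Linearly anneal the noise standard deviation (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start_std + frac * (end_std - start_std)


def perturb_state(state, std, rng):
    """Add independent zero-mean Gaussian noise to every state dimension."""
    return [x + rng.gauss(0.0, std) for x in state]


# Usage: noisy copy of an expert state early in training (toy values).
rng = random.Random(0)
noisy = perturb_state([0.3, -1.2], noise_std(10, 1000), rng)
```

Early in training the density model sees a smoothed version of the few expert transitions; as the noise shrinks, it is fitted to the demonstrations themselves.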

Relation to learning from observations
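The LfO objective of previous approaches like OPOLO minimizes the divergence between the joint policy state transition distribution and the joint expert state transition distribution:
$$D_{KL}\left( \mu^{\pi_\theta}(s', s) \,\|\, \mu^E(s', s) \right),$$
which can be rewritten as (see A.1)
$$\mathbb{E}_{(s', s) \sim \mu^{\pi_\theta}}\left[ \log \mu^{\pi_\theta}(s' \mid s) - \log \mu^E(s' \mid s) \right] + D_{KL}\left( \mu^{\pi_\theta}(s) \,\|\, \mu^E(s) \right).$$
Thus, this LfO objective minimizes the sum of the KLD between the state transition distributions, which is the term that makes up the KLD between the joint trajectory distributions in Equation 7, and the KLD of the marginal state distributions. The SOIL-TDM objective in comparison minimizes purely the KLD of the joint trajectory distributions. In case of perfect distribution matching, i.e. a zero KLD between the joint distributions, the KLDs of the marginals also vanish, so both objectives have the same optimum.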
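The split of a KLD over state/next-state pairs into a transition part and a marginal part is an instance of the chain rule for the KLD, and it can be checked numerically. The sketch below does this for two small discrete distributions over $(s, s')$; the probability tables are made-up toy values and the helper names are hypothetical.

```python
import math


def kl(p, q):
    """D_KL(p || q) for two discrete distributions given as dicts."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0.0)


def marginal(joint):
    """Marginal over s from a joint table {(s, s_next): prob}."""
    out = {}
    for (s, _), v in joint.items():
        out[s] = out.get(s, 0.0) + v
    return out


def conditional_kl(joint_p, joint_q):
    """E_{(s, s') ~ p}[log p(s' | s) - log q(s' | s)]."""
    mp, mq = marginal(joint_p), marginal(joint_q)
    return sum(v * math.log((v / mp[s]) / (joint_q[(s, s2)] / mq[s]))
               for (s, s2), v in joint_p.items() if v > 0.0)


# Toy joint distributions over (s, s') with made-up probabilities.
policy = {(0, 0): 0.1, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.2}
expert = {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.3}

joint_kl = kl(policy, expert)
split_kl = conditional_kl(policy, expert) + kl(marginal(policy), marginal(expert))
```

Here `joint_kl` and `split_kl` agree up to floating point error, and both parts are non-negative, so driving the joint KLD to zero forces the transition part and the marginal part to zero as well.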

Related Work
Many recent IL approaches are based on inverse RL (IRL) [Ng and Russell, 2000]. In IRL, the goal is to learn a reward signal for which the expert policy is optimal. AIL algorithms are popular methods to perform IL in an RL setup [Ho and Ermon, 2016; Fu et al., 2017; Kostrikov et al., 2020]. In AIL, a discriminator gets trained to distinguish between expert states and states coming from policy rollouts. The goal of the policy is to fool the discriminator. The policy gets optimized to match the state-action distribution of the expert using this two-player game. Based on this idea, more general approaches have been derived based on f-divergences. Ni et al. [2020] derive an analytic gradient of any f-divergence between the agent and expert state distribution w.r.t. reward parameters. Based on this gradient they present the algorithm f-IRL, which recovers a stationary reward function from the expert density by gradient descent. Ghasemipour et al. [2019] identify that IRL's state-marginal matching objective contributes most to its superior performance and apply this understanding to teach agents a diverse range of behaviours using simply hand-specified state distributions.
A key problem with AIL for LfD and LfO is optimization instability [Miyato et al., 2018]. Wang et al. [2019] avoid the instabilities resulting from adversarial optimization by estimating the support of the expert policy to compute a fixed reward function. Similarly, Brantley et al. [2020] use a fixed reward function by estimating the variance of an ensemble of policies. Both methods rely on additional behavioral cloning steps to reach expert-level performance. Liu et al. [2020] propose Energy-Based Imitation Learning (EBIL), which recovers fixed and interpretable reward signals by directly estimating the expert's energy. Neural Density Imitation (NDI) [Kim et al., 2021] uses density models to perform distribution matching. Deterministic and Discriminative Imitation (D2-Imitation) [Sun et al., 2021] requires no adversarial training by partitioning samples into two replay buffers and then learning a deterministic policy via off-policy reinforcement learning. Inverse soft-Q learning (IQ-Learn) [Garg et al., 2021] avoids adversarial training by learning a single Q-function that implicitly represents both reward and policy. The implicitly learned rewards from IQ-Learn show a high positive correlation with the ground-truth rewards.
LfO can be divided into model-free and model-based approaches.GAILfO [Torabi et al., 2018b] is a model-free approach which uses the GAIL principle with the discriminator input being state-only.Yang et al. [2019] analyzed the gap between the LfD and LfO objectives and proved that it lies in the disagreement of inverse dynamics models between the imitator and expert.Their proposed method IDDM is based on an upper bound of this gap in a model-free way.OPOLO [Zhu et al., 2020] is a sample-efficient LfO approach also based on AIL, which enables off-policy optimization.The policy update is also regulated with an inverse action model that assists distribution matching in a mode-covering perspective.
Other model-based approaches either apply forward dynamics models or inverse action models. Sun et al. [2019] proposed a solution based on forward dynamics models to learn time-dependent policies. While being provably efficient, it is not suited for infinite horizon tasks. Alternatively, behavior cloning from observations (BCO) [Torabi et al., 2018a] learns an inverse action model based on simulator interactions to infer actions based on the expert state demonstrations. GPRIL [Schroecker et al., 2019] uses normalizing flows as generative models to learn backward dynamics models, which estimate predecessor transitions and augment the expert data set with further trajectories that lead to expert states. Jiang et al. [2020] investigated IL using few expert demonstrations and a simulator with misspecified dynamics. A detailed overview of LfO was done by Torabi et al. [2019].

Method Discussion and Relation to FORM
While our proposed method SOIL-TDM was independently developed, it is most similar to the state-only approach FORM [Jaegle et al., 2021]. In FORM the policy training is guided by a conditional density estimation of the expert's observed state transitions. In addition, a state transition model $\mu^{\pi_\theta}_\Phi(s_{i+1} \mid s_i)$ of the current policy is learned. The policy reward is estimated by:
$$r(s_i, s_{i+1}) = \log \mu^E_\omega(s_{i+1} \mid s_i) - \log \mu^{\pi_\theta}_\Phi(s_{i+1} \mid s_i).$$
The approach matches conditional state transition probabilities of expert and policy, in comparison to the joint state-action (like GAIL) or joint state-next-state (like OPOLO or GAILfO) densities. The authors argue that this has consequences for the robustness of the different approaches. Namely, methods based on conditional state probabilities are less sensitive to erroneously penalizing features that may not be in the demonstrator data but lead to correct transitions. Hence, such methods may be less prone to overfit to irrelevant differences. Jaegle et al. [2021] demonstrate the benefit of such a conditional density matching approach.
In contrast to FORM, we show in Equation 8 that the policy's next-state conditional density $\mu^{\pi_\theta}(s_{i+1} \mid s_i)$ can be separated into the policy's action density and the forward and backward dynamics densities. Using this decomposition we show that the KLD minimization is equivalent to a maximum entropy RL objective (see Equation 9) with a special reward (see Equation 10). Here the entropy of the policy stemming from the decomposition of the conditional state-next-state density leads to the maximum entropy RL objective. Jaegle et al. [2021] mention that the second term in their reward objective can be viewed as an entropy-like expression. Hence, if this reward is optimized using an RL algorithm which includes some form of policy entropy regularization, this entropy is effectively weighted twice. In the experiments we show that this double accounting of the policy entropy negatively affects the sample efficiency of the algorithm in comparison to our method.

Experiments
We evaluate our proposed method described in Section 3 on a variety of different IL tasks and compare it against the baseline methods OPOLO, f-IRL, and FORM. We evaluate and compare all methods in complex and high-dimensional continuous control environments using the PyBullet physics simulation [Coumans and Bai, 2016-2019]. To evaluate the performance of all methods, the episode rewards of the trained policies are compared to the reward of the expert policy. The expert data generation as well as the baseline implementations used are described in Appendix A.5.
Since we assume no environment reward is available as an early stopping criterion, we use other convergence estimates available during training to select the best policy for each method. In adversarial training the duality gap [Grnarova et al., 2019; Sidheekh et al., 2021] is an established method to estimate the convergence of the training process. In the IL setup the duality gap can be very difficult to estimate, since the optimization process it relies on requires the gradient of the policy and the environment (i.e. the gradient of the generator). We therefore use two alternatives for model selection for OPOLO. The first approach selects the model with the lowest policy loss and the second approach selects the model based on the highest estimated reward over ten consecutive epochs. For f-IRL we selected the model with the lowest estimated Jensen-Shannon divergence over ten epochs. To estimate the convergence of SOIL-TDM, the policy loss based on the KLD from Equation 9 is used. It can be estimated using the same models used for training the policy. Similarly, we used the effect models of FORM to estimate the convergence based on the reward.
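A selection rule of this kind can be sketched as follows. The sketch assumes that "over ten consecutive epochs" means averaging the convergence estimate over a sliding window and picking the checkpoint at the end of the best window; this reading, the function name, and the example numbers are assumptions for illustration.

```python
def select_checkpoint(metric, window=10, lower_is_better=True):
    """Return the index of the last epoch of the best averaged window.

    metric: per-epoch convergence estimate (e.g. a KLD-based policy loss
    or an estimated reward), one value per epoch.
    """
    if len(metric) < window:
        raise ValueError("need at least `window` epochs")
    best_idx, best_val = None, None
    for end in range(window, len(metric) + 1):
        avg = sum(metric[end - window:end]) / window
        better = (best_val is None
                  or (avg < best_val if lower_is_better else avg > best_val))
        if better:
            best_idx, best_val = end - 1, avg
    return best_idx


# Toy loss curve (made-up values) with a window of two epochs.
chosen = select_checkpoint([5.0, 4.0, 1.0, 1.0, 3.0], window=2)
```

With `lower_is_better=True` the rule fits a loss or divergence estimate; with `lower_is_better=False` it fits an estimated reward.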
The evaluation is done by running 3 training runs with ten test episodes (in total 30 rollouts) for each trained policy and calculating the respective mean and confidence interval for all runs.We plot the rewards normalized so that 1 corresponds to expert performance.Values above 1 mean that the agent has achieved a higher reward than the mean of the expert (episode rewards of the expert are reported in Appendix A.5). Implementation details of our method are described in Appendix A.4.
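The normalization and aggregation of the test rollouts can be sketched as below. The text does not state how the confidence interval is computed, so the normal approximation with `z = 1.96` is an assumption, as are the function name and the example rewards.

```python
import math


def normalized_summary(episode_rewards, expert_mean, z=1.96):
    """Mean and confidence half-width of rewards normalized so that
    1.0 corresponds to the expert's mean episode reward.

    Uses a normal approximation for the interval (assumed here).
    """
    scores = [r / expert_mean for r in episode_rewards]
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)


# Toy example: three rollouts and an expert mean reward of 20.
mean_score, half_width = normalized_summary([10.0, 20.0, 30.0], 20.0)
```

A mean above 1.0 then corresponds to an agent that collects more reward on average than the expert.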
The evaluation results of the discussed methods on a suite of continuous control tasks with unknown true environment reward as a selection criterion are shown in Figure 1. The achieved rewards are plotted with respect to the number of expert trajectories provided for training the agent. The confidence intervals are plotted using lighter colors.
If the true environment reward is unknown, the results show that SOIL-TDM achieves or surpasses the performance of the baseline methods on all tested environments (with the exception of two and four expert trajectories in the HumanoidBulletEnv-v0 environment and two trajectories in the Walker2DBulletEnv-v0 environment). In general, the adversarial methods OPOLO and f-IRL exhibit a high variance of the achieved rewards using the proposed selection criteria. While loss and reward are well suited for selecting the best model in usual setups, the results demonstrate that they might be less expressive for estimating the convergence in adversarial training due to the min-max game of the discriminator and the policy. The stability of the SOIL-TDM training method is evident from the small confidence band of the results, which gets smaller for more expert demonstrations. While the selection of FORM is also more stable compared to the adversarial methods, it generally achieves lower rewards in the sample-efficient regime of one and two expert trajectories.
Figure 2 shows the benchmark results of OPOLO, f-IRL, FORM, and SOIL-TDM if the true environment reward is used as an early stopping criterion. In this setup, our method still achieves competitive performance or surpasses OPOLO, f-IRL, and FORM. Additional figures for a comparison of the training performance and efficiency can be found in Appendix A.7. An ablation study for our method is in Appendix A.6.

Conclusion
In this work we propose a non-adversarial state-only imitation learning approach based on the minimization of the Kullback-Leibler divergence between the policy and the expert state trajectory distribution. This objective leads to a maximum entropy reinforcement learning problem with a reward function depending on the expert state transition distribution and the forward and backward dynamics of the environment, which can be modeled using conditional normalizing flows. The proposed approach is compared to several state-of-the-art learning from observations methods in a scenario with unknown environment rewards and achieves state-of-the-art performance.

A.1 Relation to LfO
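The learning from observations (LfO) objective minimizes the divergence between the joint policy state transition distribution and the joint expert state transition distribution:
$$D_{KL}\left( \mu^{\pi_\theta}(s', s) \,\|\, \mu^E(s', s) \right),$$
where $s'$ is a successor state of $s$ given a stationary policy and stationary $s', s$ marginals. This can be rewritten as
$$\mathbb{E}_{(s', s) \sim \mu^{\pi_\theta}}\left[ \log \mu^{\pi_\theta}(s' \mid s) + \log \mu^{\pi_\theta}(s) - \log \mu^E(s' \mid s) - \log \mu^E(s) \right] = \mathbb{E}_{(s', s) \sim \mu^{\pi_\theta}}\left[ \log \mu^{\pi_\theta}(s' \mid s) - \log \mu^E(s' \mid s) \right] + D_{KL}\left( \mu^{\pi_\theta}(s) \,\|\, \mu^E(s) \right).$$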

A.2 Bounded rewards
Since we use the SAC algorithm as a subroutine, all rewards must be bounded. This is true if all subterms of our reward function are bounded, which holds if the densities $\pi'_\theta(a_i \mid s_{i+1}, s_i)$, $p(s_{i+1} \mid a_i, s_i)$, and $\mu^E(s_{i+1} \mid s_i)$ are bounded from below by some positive constant and from above by some constant $H$. This is a rather strong assumption, which requires compact action and state spaces and a nonzero probability to reach every state $s_{i+1}$ given any action $a_i$ from a predecessor state $s_i$. Since this is in general not the case in practice, we clip the logarithms of $\pi'_\theta(a_i \mid s_{i+1}, s_i)$, $p(s_{i+1} \mid a_i, s_i)$, and $\mu^E(s_{i+1} \mid s_i)$ to $[-15, 10^9]$. It should be noted that clipping the logarithms to a maximum negative value still provides a reward signal which guides the imitation learning to policies which achieve higher rewards.
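The clipping step can be written in a few lines. The bounds are the ones stated above; the function name is hypothetical, and the example values are made up.

```python
import math


def clip_log_density(log_value, low=-15.0, high=1e9):
    """Clip a log-density to [low, high]; -inf is mapped to `low`."""
    return max(low, min(high, log_value))


# A transition the expert model considers (numerically) impossible
# still yields a finite term in the reward.
clipped = clip_log_density(-math.inf)
```

Because every log-density term is mapped into a finite interval, any sum or difference of the clipped terms is bounded as well.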

A.3 Correctness of using replay buffer
Here we argue that Algorithm 1 leads to a local optimum of the KLD objective from Equation 9 under the following conditions: a) the large sample limit per iteration holds, b) appropriate density estimators and optimizers are used, c) Equation 9 is minimized by optimizing the policy using a maximum entropy RL algorithm, and d) the inverse action policy $\pi'(a \mid s', s)$ is trained using maximum likelihood from a replay buffer in an alternating fashion with the policy optimization.
In Section 3 we show that Equation 9 is a maximum entropy RL objective. Thus, optimizing the policy $\pi_{\theta(e)}$ in episode $e$ of Algorithm 1 using the maximum entropy RL algorithm SAC [Haarnoja et al., 2018a], while keeping fixed the parameters of the conditional normalizing flows $\mu^E(s' \mid s)$, $\mu_\phi(s' \mid s, a)$, and $\mu_\eta(a \mid s', s)$ that define the reward (to stay consistent with our method section, we use the distribution definitions here instead of the model definitions), implies
$\sum_{i=0}^{T-1} \mathbb{E}_{(s_i, a_i, s_{i+1}) \sim \pi_{\theta(e)}} [\log p(s_{i+1} \mid a_i, s_i)$ (19)
Now, in the next episode $e+1$, the first part of Algorithm 1, i.e. optimizing the model of the inverse action policy $\pi'_{\theta(e+1)}$ with a maximum likelihood objective using the new replay buffer data $(s_i, a_i, s_{i+1}) \sim \pi_{\theta(e)}$ obtained from rollouts in episode $e+1$ with the new policy $\pi_{\theta(e)}$ trained in episode $e$, leads to
$\sum_{i=0}^{T-1}$ (20)
due to the maximum likelihood objective for the inverse action policy.
Using this inequality one obtains
(21)
Thus Algorithm 1 optimizes Equation 9 also in the "update dynamics models" part when training $\mu_\eta(a \mid s', s)$ using maximum likelihood from a replay buffer. Hence, optimizing the policy $\pi$ using SAC and training the model of the inverse action policy $\pi'$ using the replay buffer and maximum likelihood are non-competing and non-adversarial objectives: they alternately minimize the same objective in each part of Algorithm 1 and decrease the Kullback-Leibler divergence in each step, ending in a minimum at convergence since the KLD is bounded by 0 from below.
The inequality from Equation 20 is based on the maximum likelihood objective and a "clean" replay buffer that contains only samples from the current policy. But it can be shown that it also holds for a replay buffer which contains a mixture of samples from the current and old policies: the inequality holds for the mixture distribution in which a fraction $\alpha$ of the replay buffer stems from the new policy and a fraction $1 - \alpha$ stems from the old policies ($\alpha$ depends on the size of the replay buffer and the number of new samples obtained in the current rollout), i.e. $p(RB(e+1)) = \alpha \pi_{\theta(e+1)} + (1 - \alpha) p(RB(e))$. Since the old inverse action policy $\pi'_{\theta(e)}$ is the argmax of the maximum likelihood objective on the $1 - \alpha$ fraction of $RB(e+1)$, which is $RB(e)$, it is better than or equal to any other inverse action policy with regard to that previous replay buffer $RB(e)$. Thus
(22)
Due to the maximum likelihood objective for $\pi'_{\theta(e+1)}$ on the data $RB(e+1)$, the following inequality follows:
(23)
Also, by using the mixture definition of $RB(e+1)$ one can rewrite an expectation over $RB(e+1)$ as follows:
(24)
Using the expanded expectation, Inequality 23 can be rewritten as follows:
(25)
By using Inequality 22 it can be rewritten to
(26)
which implies (by subtracting the common $1 - \alpha$ term from both sides) Inequality 20 multiplied by $\alpha$ and thus also implies (together with Inequality 19) Inequality 21. From this follows the convergence of Algorithm 1 to a minimum when using a mixed replay buffer.
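The step that rewrites an expectation over the mixed replay buffer can be checked numerically: an average over the new buffer equals the $\alpha$-weighted sum of the average over the new samples and the average over the previous buffer. The sketch below uses made-up per-sample values (standing in for, e.g., log-likelihoods of the inverse action model); the helper names are hypothetical.

```python
def mean(values):
    """Empirical expectation of a list of per-sample values."""
    return sum(values) / len(values)


def mixture_mean(new_values, old_values):
    """Expectation over the mixed buffer as a weighted sum.

    alpha is the fraction of the buffer that stems from the new policy.
    Returns (alpha * E_new + (1 - alpha) * E_old, alpha).
    """
    alpha = len(new_values) / (len(new_values) + len(old_values))
    value = alpha * mean(new_values) + (1.0 - alpha) * mean(old_values)
    return value, alpha


# Toy per-sample values for the new rollout and the previous buffer.
new_samples = [1.0, 3.0]
old_samples = [2.0, 4.0, 6.0, 8.0]
mixed, alpha = mixture_mean(new_samples, old_samples)
direct = mean(new_samples + old_samples)
```

Since the $(1 - \alpha)$ term appears on both sides once the old model's optimality on the previous buffer is used, it can be subtracted, leaving an inequality on the new samples only, scaled by $\alpha$.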

A.4 Implementation Details
We use the same policy implementation for all our SOIL-TDM experiments. The stochastic policies $\pi_\theta(a_t \mid s_t)$ are modeled as diagonal Gaussians over continuous actions, parameterized by an MLP with two hidden layers (512, 512) and ReLU nonlinearities.
To train a policy using SAC as the RL algorithm, we also need to model a Q-function. Our implementation of SAC is based on the original implementation from Haarnoja et al. [2018b], and the hyperparameters used are described in Table 1. This implementation uses two identical Q-functions with different initializations to stabilize the training process. These Q-functions are also modeled with an MLP having two hidden layers (512, 512) and Leaky ReLU. We kept the entropy parameter $\alpha$ fixed to 1 and did not apply automatic entropy tuning as described by Haarnoja et al. [2018b].
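How two Q-functions are typically combined in SAC can be sketched as follows: the soft target uses the minimum of the two target Q-values for the next state-action pair. This follows the common clipped double-Q form of SAC and is not taken from the authors' code; the function name and the discount `gamma=0.99` are assumptions, while `alpha=1.0` matches the fixed entropy parameter stated above.

```python
def soft_q_target(reward, q1_next, q2_next, log_prob_next,
                  alpha=1.0, gamma=0.99, done=False):
    """Soft Bellman target with the minimum over two Q estimates.

    q1_next, q2_next: target Q-values for (s', a') with a' ~ pi(.|s')
    log_prob_next:    log pi(a' | s')
    """
    soft_value = min(q1_next, q2_next) - alpha * log_prob_next
    return reward + (0.0 if done else gamma * soft_value)


# Toy numbers: reward 1, Q estimates 2 and 3, log-probability -1.
target = soft_q_target(1.0, 2.0, 3.0, -1.0)
```

Taking the minimum counteracts the overestimation of Q-values that a single bootstrapped estimate tends to accumulate.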
We implement all our models for SOIL-TDM using the PyTorch framework version 1.9.03 .To estimate the imita- We train and test all algorithms on a computer with 8 CPU cores, 64 GB of working memory and an RTX2080 Ti Graphics card.The compute time for the SOIL-TDM method depends on the time to convergence and is from 4h to 14h.

A.5 Expert Data Generation and Baseline Methods
The expert data is generated by training an expert policy based on conditional normalizing flows and the SAC algorithm on the environment reward. A conditional normalizing flow policy has been chosen for the expert to make the distribution matching problem for the baseline methods and SOIL-TDM, which employ a conditional Gaussian policy, more challenging and more similar to real-world IL settings.
The idea is that real-world demonstrations might be more complex, and experiments using the same policy setup for the expert and the imitator might not translate well to real-world tasks.
4 https://github.com/VLL-HD/FrEIA (Open Source MIT License)
The stochastic flow policy $\pi_\theta(a_t|s_t)$ is based on RealNVPs [Dinh et al., 2017], which have the same setup as the normalizing flow implementations used for $\mu^E(s'|s)$, $\mu_\phi(s'|s,a)$, and $\mu_\eta(a|s',s)$, also using N=16 GLOWCouplingBlocks. The state is processed as a condition by one MLP with a hidden size of 128. Each flow block has an additional fully connected layer to further process the condition to a small feature size of 8. Every flow block has 128 hidden neurons. Finally, each action is passed through a tanh layer as described in the SAC implementation, and the log-probability calculation was adapted accordingly [Haarnoja et al., 2018b]. The final episode reward of the trained expert policy is reported in Table 3.
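The adaptation of the log probability after the tanh layer follows the change-of-variables rule: for $a = \tanh(u)$, $\log \pi(a|s) = \log p(u|s) - \sum_i \log(1 - \tanh^2(u_i))$. The sketch below shows this correction in a numerically stable form. It is an illustration only: a diagonal Gaussian stands in for the flow's base log-density, and all function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diag_gaussian_log_prob(u, mean, std):
    """Log-density of a diagonal Gaussian (stand-in for the flow's log-density)."""
    return sum(
        -0.5 * ((ui - mi) / si) ** 2 - math.log(si) - 0.5 * math.log(2.0 * math.pi)
        for ui, mi, si in zip(u, mean, std)
    )


def tanh_squashed_log_prob(u, base_log_prob):
    """Log-density of a = tanh(u), given the log-density of the pre-tanh sample u."""
    # log(1 - tanh(u)^2) = 2 * (log 2 - u - softplus(-2u)), stable for large |u|.
    correction = sum(2.0 * (math.log(2.0) - ui - softplus(-2.0 * ui)) for ui in u)
    return base_log_prob - correction


u = [0.3, -1.2]
base = diag_gaussian_log_prob(u, mean=[0.0, 0.0], std=[1.0, 1.0])
action = [math.tanh(ui) for ui in u]
print(action, tanh_squashed_log_prob(u, base))
```

The stable form avoids evaluating `log(1 - tanh(u)**2)` directly, which loses precision when `tanh(u)` saturates near ±1.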
The expert trajectories are generated using the trained policy and saved as done by Ho and Ermon [2016]. For the OPOLO and F-IRL baselines, the original implementations with the official default parameters for each environment are used. Only the loading of the expert data was changed to use the demonstrations of the previously trained normalizing flow policy. Since no official code for FORM was publicly available, the FORM baseline was implemented based on our method. We changed the reward to use the state prediction effect model $\mu^{\pi_\theta}_{\Phi}(s'|s)$ as proposed by Jaegle et al. [2021]. Both effect models were implemented using the same implementation as for our expert transition model $\mu^E(s'|s)$. Training of the expert effect model $\mu^E_\omega(s'|s)$ was performed offline with the same hyperparameter setup used for our expert transition model $\mu^E(s'|s)$ (see Table 1 and Table 2). The policy optimization was done using the same SAC setup as for our method, since FORM does not depend on a specific RL algorithm [Jaegle et al., 2021]. We tested different setups for the entropy parameter $\alpha$ and found that automatic entropy tuning as described by Haarnoja et al. [2018b] worked best.
For our experiment with unknown true environment reward, the following selection criteria are used. For "OPOLO est. reward", the estimated return based on the reward $r(s, s') = -\log(1 - D(s, s'))$ is used, with the state $s$, the next state $s'$, and the discriminator output $D(s, s')$. For "OPOLO pol. loss", the original OPOLO policy loss is used, which is defined in terms of the Q-function $Q$ and the $f$-divergence function. For both estimates the original OPOLO implementation was used. For FORM, the convergence was estimated with the estimated return based on the reward $r_t = \log \mu^E_\omega(s'|s) - \log \mu^{\pi_\theta}_{\Phi}(s'|s)$ using the normalizing flow effect models. For F-IRL, we use the implementation of Ni et al. [2020] for the estimate of the Jensen-Shannon divergence between the state marginals of the expert and the policy, $\frac{1}{2}\int \left[ p(x)\log\frac{2p(x)}{p(x)+q(x)} + q(x)\log\frac{2q(x)}{p(x)+q(x)} \right] dx$.
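The three quantities behind these selection criteria can be written compactly. The sketch below is illustrative and is not the OPOLO, FORM, or F-IRL code: the Jensen-Shannon divergence is shown for discrete distributions (a sum replaces the integral over $x$), and all function names are hypothetical.

```python
import math


def discriminator_reward(d):
    """Discriminator-based reward r = -log(1 - D(s, s')) for D in [0, 1)."""
    return -math.log(1.0 - d)


def effect_model_reward(log_mu_expert, log_mu_policy):
    """FORM-style reward r_t = log mu_E(s'|s) - log mu_pi(s'|s)."""
    return log_mu_expert - log_mu_policy


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (in nats)."""
    total = 0.0
    for pi, qi in zip(p, q):
        m = pi + qi
        if pi > 0.0:  # terms with zero probability contribute nothing
            total += pi * math.log(2.0 * pi / m)
        if qi > 0.0:
            total += qi * math.log(2.0 * qi / m)
    return 0.5 * total


# Hypothetical state-visitation histograms of expert and policy.
expert_hist = [0.5, 0.3, 0.2]
policy_hist = [0.4, 0.4, 0.2]
print(js_divergence(expert_hist, policy_hist))
```

The divergence is zero for identical distributions and bounded by $\log 2$ for distributions with disjoint support, which makes it usable as a bounded convergence estimate.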

A.6 Ablation Study
In this section we investigate the influence of the different components of our proposed imitation learning setup. First, we want to answer the question of whether learning additional backward and forward dynamics models to estimate the state transition KLD improves policy performance. We compare our proposed method SOIL-TDM to an approach where we train a policy based only on the log-likelihood of the expert, defined by $r_{abl}(s, s') = \log \mu^E(s'|s)$ (29). Using this reward, the policy is optimized using SAC as described earlier. We call this simplified reward design approach "Ablation only Expert Model". By comparing the performance of this method to our approach, we can show that learning additional density models to estimate forward and backward dynamics leads to improved policy performance. The resulting rewards are plotted in Figure 3. The relative reward using this ablation method is much lower than that of SOIL-TDM. Only with a large number of expert trajectories does this method reach expert-level performance.
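As a toy illustration of the ablation reward $r_{abl}(s, s') = \log \mu^E(s'|s)$, the sketch below sums this reward along a trajectory. It is not the paper's setup: a hand-written scalar Gaussian transition model replaces the learned normalizing flow, and the function names, `drift`, and `std` are hypothetical.

```python
import math


def gaussian_transition_log_prob(s, s_next, drift=1.0, std=0.5):
    """Stand-in for log mu_E(s'|s): s' ~ N(s + drift, std^2) on scalar states."""
    z = (s_next - (s + drift)) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def ablation_return(states, log_density=gaussian_transition_log_prob):
    """Sum of r_abl(s, s') = log mu_E(s'|s) over consecutive state pairs."""
    return sum(log_density(s, s_next) for s, s_next in zip(states[:-1], states[1:]))


expert_like = [0.0, 1.0, 2.0, 3.0]  # follows the stand-in expert dynamics
static = [0.0, 0.0, 0.0, 0.0]       # does not move at all
print(ablation_return(expert_like), ablation_return(static))
```

The reward depends only on state pairs, so it can be evaluated without access to actions; transitions that the expert model considers likely receive a higher return.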
We furthermore want to evaluate how the quality of the learned normalizing flows affects the overall algorithm performance. We therefore report in Table 4 the estimated test log-likelihood of the trained expert models $\mu^E(s'|s)$ for different numbers of expert trajectories, using a separate test dataset with 20 unseen expert trajectories. The influence of the expert model quality on the overall algorithm performance can be seen by comparing the test log-likelihood to the reward of the trained policy (in Figure 3, "SOIL-TDM" for our method and "Ablation only Expert Model" for the simplified reward using the same expert model $\mu^E(s'|s)$). SOIL-TDM is less sensitive to expert model performance than "Ablation only Expert Model".

Lastly, we want to investigate the expert model training. In the improved expert model training routine, we regularize the optimization by adding Gaussian noise to the expert state values and linearly decreasing its standard deviation down to 0.005 during training. As an additional ablation, we tested two different training setups for the expert transition models $\mu^E(s'|s)$. For the first model ($\mu^E_{abl1}(s'|s)$), we omit the noise decay and only use the final constant Gaussian noise during training. For the second model ($\mu^E_{abl2}(s'|s)$), no noise is added during training of the expert model. We also evaluated the test log-likelihood of the trained models using the test dataset with 20 unseen expert trajectories. The resulting test log-likelihoods are reported in Tables 5 and 6. The results show that constant Gaussian noise already improves the performance of the expert model and that our noise scheduling routine yields further improvements. We also used the trained expert models for policy training based on the ablation reward from Equation 29. These ablation methods are called "Ablation wo. Noise Sched." and "Ablation wo. Noise", respectively. The final rewards of these methods are also plotted in Figure 3. The results indicate that adding noise to the states during the offline training of the expert transition model also improves final policy performance.
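The noise regularization described above can be sketched as a linear schedule for the standard deviation plus a perturbation step. This is a minimal illustration under assumptions: only the final value 0.005 comes from the text, while the start value `std_start=0.05`, the function names, and the per-step interface are hypothetical.

```python
import random


def noise_std(step, total_steps, std_start=0.05, std_end=0.005):
    """Linearly decay the noise standard deviation from std_start to std_end."""
    frac = min(max(step / max(total_steps, 1), 0.0), 1.0)  # clamp progress to [0, 1]
    return std_start + frac * (std_end - std_start)


def perturb_states(states, std, rng=random):
    """Add zero-mean Gaussian noise with the given std to every state dimension."""
    return [[x + rng.gauss(0.0, std) for x in state] for state in states]


rng = random.Random(0)
batch = [[0.1, -0.3], [0.5, 0.2]]  # hypothetical expert states
for step in (0, 500, 1000):
    std = noise_std(step, total_steps=1000)
    noisy_batch = perturb_states(batch, std, rng)
    print(step, std, noisy_batch)
```

In this sketch, the "Ablation wo. Noise Sched." variant corresponds to calling `perturb_states` with the constant final value 0.005, and "Ablation wo. Noise" corresponds to skipping the perturbation entirely.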

A.7 Additional Results
The following figures show the policy loss and the estimated reward together with the environment reward during training on different pybullet environments for OPOLO, F-IRL, FORM, and SOIL-TDM (our method). All plots have been generated from training runs with 4 expert trajectories and 10 test rollouts. It can be seen that the estimated reward and policy loss of SOIL-TDM correlate well with the true environment reward. The policy loss of SOIL-TDM can be lower than 0 because it is not based on the true distributions. Instead, it is based on learned and inferred estimates of the expert state conditional distribution, the policy state conditional distribution, the policy inverse action distribution, and the Q-function, each with relatively large absolute values (∼50 to 100). These estimation errors accumulate in each time step through the sums and differences and, via the Q-function, also over (on average) 500 time steps, which can lead to relatively large negative values.
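One way to quantify how well an estimated reward tracks the true environment reward is the Pearson correlation over training checkpoints, and the same estimate can be used to select a checkpoint when the true reward is unavailable. The sketch below is a hedged illustration: the function names and the example curves are hypothetical and are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select_checkpoint(estimated_returns):
    """Index of the checkpoint with the highest estimated return."""
    return max(range(len(estimated_returns)), key=estimated_returns.__getitem__)


# Hypothetical curves, one value per evaluation checkpoint.
estimated = [-40.0, -25.0, -10.0, -12.0, -5.0]
environment = [200.0, 900.0, 1800.0, 1700.0, 2100.0]
print(pearson(estimated, environment), select_checkpoint(estimated))
```

A high correlation means that picking the checkpoint with the best estimate is likely to also pick a checkpoint with high environment reward.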

Figure 1: Unknown true environment reward selection criteria: relative reward for different numbers of expert trajectories on continuous control environments. The best policies based on estimated convergence values were selected. The value 1 corresponds to expert policy performance.

Figure 2: Best true environment reward selection criterion: relative reward for different numbers of expert trajectories on continuous control environments. The value 1 corresponds to expert policy performance.

Figure 3: Best environment reward for the ablation experiments. Relative reward for different numbers of expert trajectories. The value 1 corresponds to expert policy performance.

Figure 4: The policy loss, estimated reward and the environment test reward during training in the pybullet Ant environment using our proposed SOIL-TDM and the OPOLO, F-IRL, and FORM implementations with 4 expert trajectories.

Figure 5: The policy loss, estimated reward and the environment test reward during training in the pybullet Hopper environment using our proposed SOIL-TDM and the OPOLO, F-IRL, and FORM implementations with 4 expert trajectories.

Figure 6: The policy loss, estimated reward and the environment test reward during training in the pybullet Walker2D environment using our proposed SOIL-TDM and the OPOLO, F-IRL, and FORM implementations with 4 expert trajectories.

Figure 7: The policy loss, estimated reward and the environment test reward during training in the pybullet HalfCheetah environment using our proposed SOIL-TDM and the OPOLO, F-IRL, and FORM implementations with 4 expert trajectories.

Figure 8: The policy loss, estimated reward and the environment test reward during training in the pybullet Humanoid environment using our proposed SOIL-TDM and the OPOLO, F-IRL, and FORM implementations with 4 expert trajectories.

Table 1 :
To estimate the imitation reward in SOIL-TDM, a model for the expert transitions $\mu^E(s'|s)$ as well as a forward dynamics model $\mu_\phi(s'|s,a)$ and a backward dynamics model $\mu_\eta(a|s',s)$ have to be learned. All three density models are based on RealNVPs [Dinh et al., 2017], consisting of several flow blocks in which MLPs preprocess the conditions to a smaller condition size. The RealNVP transformation parameters are also calculated using MLPs, which process a concatenation of the input and the condition features.

Table 3: Episode reward of the expert policy.