1 Introduction

Imitation learning (IL) comprises methods that learn optimal behavior from a collection of expert demonstrations. In standard reinforcement learning (RL), the agent is trained on environment feedback in the form of a reward signal. IL alleviates the problem of designing effective reward functions by relying on demonstrations instead, which is particularly useful for tasks where demonstrations are easier to obtain than a suitable reward function. One popular example is training traffic agents in a simulation to mimic real-world road users [1].

Learning-from-demonstrations (LfD) describes IL approaches that require state-action pairs from expert demonstrations [2]. Although actions can guide policy learning, collecting them alongside state demonstrations can be very costly or even impossible in many real-world setups, for example when expert demonstrations are available only as video recordings without additional sensor signals. One such setup is training traffic agents in a simulation where the expert data consists of traffic recordings in bird's eye view [1]: no direct information on the vehicle physics, throttle, or steering angle is available. Another example is teaching a robot to pick, move, and place objects based on human demonstrations [3]. In such scenarios, actions have to be estimated from possibly incomplete information in order to train an agent that imitates the observed behavior.

Alternatively, learning-from-observations (LfO) performs state-only IL and trains an agent without actions being available in the expert dataset [4]. Although LfO is more challenging than LfD, it can be more practical when data sources are incomplete. As in LfD, distribution matching based on an adversarial setup is commonly used in LfO [5]. In adversarial imitation learning (AIL), a policy is trained against an adversarial discriminator whose output is used to estimate a reward that guides policy training. AIL methods obtain better-performing agents than supervised methods like behavioral cloning (BC) while using less data. However, adversarial training often suffers from stability issues [6] and, under some conditions, is not guaranteed to converge [7]. One way to improve the stability of generative adversarial networks is to use normalization layers like spectral normalization [6]. Jin et al. [7] give further insights into improving the stability of the minimax optimization used in adversarial training. Additionally, estimating the performance of a trained policy without access to the environment reward can be very challenging. Although the duality gap [8, 9] is a convergence metric suited to GAN-based methods, it is difficult to use in the AIL setup since it relies on the gradient of the generator for an internal optimization process. In AIL, the generator consists of the policy and the environment, so this gradient is difficult to estimate with black-box environments. As an alternative for AIL setups, the predicted reward (discriminator output) or the policy loss can be used to estimate performance. We evaluate this approach empirically in Section 5.

To address the limitations of AIL, we propose a state-only distribution matching method that learns a policy in a non-adversarial way. We optimize the match between the actionless policy and expert trajectories by minimizing the Kullback-Leibler divergence (KLD) between the conditional state transition distributions of the policy and the expert over all time steps. We estimate the expert state transition distribution using normalizing flows, which can be trained offline on the expert dataset. Thus, stability issues arising from the min-max adversarial optimization in AIL methods are avoided. This objective is similar to FORM [10], which was shown to be more stable in the presence of task-irrelevant features.

Although maximum entropy RL methods [11] can improve policy training by increasing the exploration of the agent, they also add a bias if used to minimize the proposed KLD. To match the transition distributions of the policy and the expert exactly, the state-next-state distribution of the policy is decomposed into the policy entropy, the forward dynamics, and the inverse action model of the environment. It has been shown that such dynamics models can improve convergence [12] and are well suited to infer actions not available in the dataset [13]. We model these distributions using normalizing flows, which have been demonstrated to perform very well in learning complex probability distributions [14]. Combining all estimates results in an interpretable reward that can be used with standard maximum entropy RL methods [15]. The optimization based on the KLD provides a reliable convergence metric for training and a good estimator of policy performance.

As our contribution, we derive SOIL-TDM (State-Only Imitation Learning by Trajectory Distribution Matching), a non-adversarial LfO method that minimizes the KLD between the conditional state transition distributions of the policy and the expert using maximum entropy RL. We show the convergence of the proposed method using off-policy samples from a replay buffer. We develop a practical algorithm based on the SOIL-TDM objective and demonstrate the effectiveness of its convergence measure compared to several other state-of-the-art methods. Empirically, we compare our method to the recent state-of-the-art IL approaches OPOLO [12], F-IRL [16], and FORM [10] in complex continuous control environments. We demonstrate that our method is superior especially if the selection of the best policy cannot be based on the true environment reward signal, a setting that more closely resembles real-world applications in autonomous driving or robotics where it is difficult to define a reward function [3].

2 Background

In this work, we want to train a stochastic policy \(\pi _{\theta }(a_i\vert s_i)\) with parameters \(\theta \) for continuous action spaces in a sequential decision-making task, considering finite-horizon environments.

The problem is modeled as a Markov Decision Process (MDP), which is described by the tuple \((S, A, p, r)\) with the continuous state space S and action space A. The transition probability is described by \(p(s_{i+1}\vert s_i, a_i)\) and the bounded reward function by \(r(s_i, a_i)\). At every time step i the agent interacts with its environment by observing a state \(s_i\) and taking an action \(a_i\). This results in a new state \(s_{i+1}\) and a reward signal \(r_{i+1}\) according to the transition probability and reward function. We will use \(\mu ^{\pi _{\theta }}(s,a)\) and \(\mu ^{\pi _{\theta }}(s',s)\) to denote the state-action and state-next-state marginals of the trajectory distribution induced by the policy \(\pi _{\theta }\). These marginal distributions describe the state-action and state-next-state frequencies over all time steps of a given policy \(\pi _{\theta }\).

2.1 Maximum entropy reinforcement learning and soft actor critic

The standard objective in RL is the expected sum of undiscounted rewards \(\sum _{i=0}^{T}\mathbb {E}_{(s_i,a_i) \sim \mu ^{\pi _{\theta }}}[r(s_i, a_i)]\). The goal of the agent is to learn a policy \(\pi _{\theta }(a_i\vert s_i)\) which maximises this objective. The maximum entropy objective

$$\begin{aligned} J_{\pi }(\theta ) = \sum _{i=0}^{T}\mathbb {E}_{(s_i,a_i) \sim \mu ^{\pi _{\theta }}}[r(s_i, a_i)+\alpha \mathcal {H}(\pi _{\theta }(\cdot \vert s_i))] \end{aligned}$$
(1)

introduces a modified goal for the RL agent, where the agent has to maximise the sum of the reward signal and its output entropy \(\mathcal {H}(\pi _{\theta }(\cdot \vert s_i))=\mathbb {E}_{a\sim \pi _{\theta }}[-\log \pi _{\theta }(a\vert s_i)]\) [11]. The parameter \(\alpha \) controls the stochasticity of the optimal policy by determining the relative importance of the entropy term versus the reward.

Soft Actor-Critic (SAC) [15, 17] is an off-policy actor-critic algorithm based on the maximum entropy RL framework. Since we apply SAC in our imitation learning setup, its main objectives are briefly explained. SAC combines off-policy Q-learning with a stable stochastic actor-critic formulation.

The state value function and Q-function are used to estimate how good an agent is in a specific state \(s_i\) or to perform a specific action \(a_i\) in a given state \(s_i\), based on the expected return. The Q-function can be estimated using a function approximator \(Q_{\Psi }(s_i,a_i)\). The soft Q-function parameters \(\Psi \) can be trained to minimize the soft Bellman residual

$$\begin{aligned} J_Q(\Psi ) = \mathbb {E}_{(s_i,a_i) \sim D_{RB}}[\frac{1}{2}(Q_{\Psi }(s_i,a_i) - \hat{Q}_{\hat{\Psi }}(s_i,a_i))^2], \end{aligned}$$
(2)

where state-action pairs are sampled from a replay buffer \(D_{RB}\) which contains state-action pairs from repeated policy-environment interactions. The target Q-function \(\hat{Q}_{\hat{\Psi }}\) can be estimated by

$$\begin{aligned} \hat{Q}_{\hat{\Psi }}(s_i,a_i) = r(s_i,a_i) + \gamma \mathbb {E}_{s_{i+1}}[V_{\hat{\Psi }}(s_{i+1})], \end{aligned}$$
(3)

using the soft state value function \(V_{\hat{\Psi }}(s_i)\), the current reward \(r(s_i,a_i)\) and the discount factor \(\gamma \). There is no need to include a separate function approximator for the soft value function since the state value is related to the Q-function and the policy by

$$\begin{aligned} V_{\hat{\Psi }}(s_i):=\mathbb {E}_{a_i \sim \pi _{\theta }}[Q_{\hat{\Psi }}(s_i,a_i) - \alpha \log \pi _{\theta }(a_i\vert s_i)]. \end{aligned}$$
(4)

To stabilize the training, the update uses a target network with parameters \(\hat{\Psi }\) that are a moving average of the parameters \(\Psi \). Lastly, the policy is optimized by minimizing the following objective:

$$\begin{aligned} J_{\pi }(\theta ) = \mathbb {E}_{s_i \sim D_{RB}}[\mathbb {E}_{a_i \sim \pi _{\theta }}[\alpha \log \pi _{\theta }(a_i\vert s_i) - Q_{\Psi }(s_i,a_i)]], \end{aligned}$$
(5)

where the states are sampled from the replay buffer \(D_{RB}\) and the actions \(a_i\) are sampled using the current policy \(\pi _{\theta }\).
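For concreteness, the following is a minimal PyTorch-style sketch of the SAC objectives (2)-(5). The batch layout, the assumption that `policy.sample` returns a reparameterized action together with its log-probability, and the Q-network interfaces are illustrative assumptions, not the implementation used in this work.

```python
import torch
import torch.nn.functional as F

def sac_losses(batch, policy, q_net, q_target, alpha=1.0, gamma=0.99):
    """Minimal sketch of the SAC objectives (2)-(5); the network interfaces
    and batch layout are illustrative assumptions."""
    s, a, r, s_next = batch  # tensors sampled from the replay buffer D_RB

    # Soft state value of the next state, eq. (4), using the target Q-network.
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)   # a' ~ pi(.|s'), log pi(a'|s')
        v_next = q_target(s_next, a_next) - alpha * logp_next
        q_hat = r + gamma * v_next                  # target Q-value, eq. (3)

    # Soft Bellman residual, eq. (2) (up to a constant factor).
    q_loss = F.mse_loss(q_net(s, a), q_hat)

    # Policy objective, eq. (5): minimize alpha*log pi - Q under the current policy.
    a_new, logp_new = policy.sample(s)              # reparameterized sample
    policy_loss = (alpha * logp_new - q_net(s, a_new)).mean()

    return q_loss, policy_loss
```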

2.2 Imitation learning

In the IL setup, the agent does not have access to the true environment reward function \(r(s_i, a_i)\) and instead has to imitate expert trajectories performed by an expert policy \(\pi _{E}\) collected in a dataset \(\mathcal {D}_E\).

In the typical learning-from-demonstration setup the expert demonstrations \(\mathcal {D}^{LfD}_E:=\{s^k_i,a^k_i, s^k_{i+1}\}^N_{k=1}\) are given by state-action-next-state transitions. Distribution matching has been a popular choice among different LfD approaches. The policy \(\pi _{\theta }\) is learned by minimizing the discrepancy between the stationary state-action distribution induced by the expert \(\mu ^{E}(s,a)\) and the policy \(\mu ^{\pi _{\theta }}(s,a)\). An overview and comparison of different LfD objectives resulting from this discrepancy minimization were made by Ghasemipour et al. [18]. Often the backward KLD is used to measure this discrepancy [19]:

$$\begin{aligned} J_{LfD}(\pi _{\theta }) := \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}(s,a)\vert \vert \mu ^{E}(s,a)). \end{aligned}$$
(6)

Learning-from-observation (LfO) considers a more challenging task where expert actions are not available. Hence, the demonstrations \(\mathcal {D}^{LfO}_E:=\{s^k_i, s^k_{i+1}\}^N_{k=1}\) consist of state-next-state transitions. The policy learns which actions to take based on interactions with the environment and the expert state transitions. Distribution matching based on state-transition distributions is a popular choice for state-only IL [5, 12]:

$$\begin{aligned} J_{LfO}(\pi _{\theta }) := \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}(s',s)\vert \vert \mu ^{E}(s',s)). \end{aligned}$$
(7)

3 Method

In a finite horizon MDP setting, the joint state-only trajectory distributions are defined by the start state distribution \(p(s_0)\) and the product of the conditional state transition distributions \(p(s_{i+1} \vert s_i)\). For the policy distribution \(\mu ^{\pi _{\theta }}\) and the expert distribution \(\mu ^{E}\), this becomes

$$\begin{aligned} \mu ^{\pi _{\theta }}(s_T,\ldots ,s_0)&= p(s_0) \prod _{i = 0}^{T-1} \mu ^{\pi _{\theta }}(s_{i+1}\vert s_i),\\ \mu ^{E}(s_T,\ldots ,s_0)&= p(s_0) \prod _{i = 0}^{T-1} \mu ^{E}(s_{i+1}\vert s_i). \end{aligned}$$

Our goal is to match the state-only trajectory distribution \(\mu ^{\pi _{\theta }}\) induced by the policy with the state-only expert trajectory distribution \(\mu ^{E}\) by minimizing the Kullback-Leibler divergence (KLD) between them. This results in the SOIL-TDM objective

$$\begin{aligned} J_{SOIL-TDM}&= \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}\vert \vert \mu ^{E}) \nonumber \\&= \mathbb {E}_{(s_T,\ldots ,s_0) \sim \mu ^{\pi _{\theta }}}[\log \mu ^{\pi _{\theta }} - \log \mu ^{E}]\nonumber \\&= \sum _{i = 0}^{T-1} \mathbb {E}_{(s_{i+1},s_i) \sim \mu ^{\pi _{\theta }}}[\log \mu ^{\pi _{\theta }}(s_{i+1}\vert s_i) - \log \mu ^{E}(s_{i+1}\vert s_i)]. \end{aligned}$$
(8)

To estimate the policy-induced conditional state transition distribution, we use the following relation based on Bayes' theorem:

$$\begin{aligned} \pi '_{\theta }(a_{i}\vert s_{i+1},s_i) = \frac{p(s_{i+1}\vert a_i,s_i) \pi _{\theta }(a_i\vert s_i)}{\mu ^{\pi _{\theta }}(s_{i+1}\vert s_i)}. \end{aligned}$$
(9)

The posterior distribution is represented by the inverse action distribution density \(\pi '_{\theta }(a_{i}\vert s_{i+1},s_i)\), the likelihood distribution is represented by the environment model \(p(s_{i+1}\vert a_i,s_i)\), the prior is represented by the policy distribution \(\pi _{\theta }(a_i\vert s_i)\), and the marginal likelihood by the policy-induced conditional state transition distribution \(\mu ^{\pi _{\theta }}(s_{i+1}\vert s_i)\). By solving for the marginal likelihood, we can rewrite (9) to

$$\begin{aligned} \mu ^{\pi _{\theta }}(s_{i+1}\vert s_i) = \frac{p(s_{i+1}\vert a_i,s_i) \pi _{\theta }(a_i\vert s_i)}{\pi '_{\theta }(a_{i}\vert s_{i+1},s_i)}. \end{aligned}$$
(10)

Equation (10) holds for any \(a_i\) with \(\pi '_{\theta }(a_{i}\vert s_{i+1},s_i) > 0\). Thus, one can extend the expectation over \((s_{i+1},s_i)\) by the action \(a_i\), and the KLD minimization \(\min \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}\vert \vert \mu ^{E})\) can be rewritten as

$$\begin{aligned} \min \sum _{i = 0}^{T-1} \mathbb {E}_{(s_i,a_i,s_{i+1}) \sim {\pi _{\theta }}}[&\log p(s_{i+1}\vert a_i,s_i) + \log \pi _{\theta }(a_i\vert s_i) \nonumber \\&- \log \pi '_{\theta }(a_{i}\vert s_{i+1},s_i) - \log \mu ^{E}(s_{i+1}\vert s_i)]. \end{aligned}$$
(11)

Now, we define a reward function (also see Appendix B)

$$\begin{aligned} r(a_i, s_i) := \mathbb {E}_{s_{i+1} \sim p(s_{i+1}\vert a_i,s_i)} [&-\log p(s_{i+1}\vert a_i,s_i) \nonumber \\&+ \log \pi '_{\theta }(a_{i}\vert s_{i+1},s_i) + \log \mu ^{E}(s_{i+1}\vert s_i)], \end{aligned}$$
(12)

which depends on the expert state transition likelihood \(\mu ^{E}(s_{i+1}\vert s_i)\), on the environment model \(p(s_{i+1}\vert a_i,s_i)\) and on the inverse action distribution density \(\pi '_{\theta }(a_{i}\vert s_{i+1},s_i)\). Using this reward function, the state-only trajectory distribution matching problem can be transformed into a max-entropy RL task

$$\begin{aligned} \min \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}\vert \vert \mu ^{E})&= \max \sum _{i = 0}^{T-1} \mathbb {E}_{(a_i,s_i) \sim \pi _{\theta }}[ -\log \pi _{\theta }(a_i\vert s_i) + r(a_i, s_i)]\nonumber \\&= \max \sum _{i = 0}^{T-1} \mathbb {E}_{(a_i,s_i) \sim \pi _{\theta }}[r(a_i, s_i) + \mathcal {H}(\pi _{\theta }(\cdot \vert s_i))]. \end{aligned}$$
(13)

In practice, the reward function \(r(a_i,s_i)\) can be computed using Monte Carlo integration with a single sample from \(p(s_{i+1}\vert a_i,s_i)\) using the replay buffer.
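As an illustration, such a one-sample Monte Carlo estimate of the reward in (12) could look as follows, where the \((s_i, a_i, s_{i+1})\) triple is drawn from the replay buffer and each density model is assumed to expose a `log_prob(x, context)` method (an interface assumption; see Section 3.1 for the models themselves):

```python
import torch

def soil_tdm_reward(s, a, s_next, env_model, inv_action_model, expert_model):
    """One-sample Monte Carlo estimate of the reward in eq. (12). The (s, a, s_next)
    triple is drawn from the replay buffer, so s_next is the single sample from
    p(s'|a, s); the log_prob(x, context) interface of the models is an assumption."""
    with torch.no_grad():
        log_fwd = env_model.log_prob(s_next, context=torch.cat([s, a], dim=-1))
        log_inv = inv_action_model.log_prob(a, context=torch.cat([s, s_next], dim=-1))
        log_expert = expert_model.log_prob(s_next, context=s)
    # r(a_i, s_i) = -log p(s'|a,s) + log pi'(a|s',s) + log mu_E(s'|s)
    return -log_fwd + log_inv + log_expert
```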

This max-entropy RL task can be optimized with standard max-entropy RL algorithms. In this work, we applied the SAC algorithm [15] as it is outlined in Section 2.1.

The extension to infinite horizon tasks can be done by introducing a discount factor \(\gamma \) as in the work by Haarnoja et al. [17]. In combination with our reward definition, one obtains the following infinite horizon maximum entropy objective

$$\begin{aligned} J_{ME-iH} = \sum _{i = 0}^{\infty } \mathbb {E}_{(a_i,s_i) \sim \pi _{\theta }}\Big [ \sum _{j = i}^{\infty } \gamma ^{j-i}\, \mathbb {E}_{(a_j,s_j) \sim \pi _{\theta }} \big [r(a_j, s_j) + \mathcal {H}(\pi _{\theta }(\cdot \vert s_j)) \,\big \vert \, s_i, a_i\big ] \Big ]. \end{aligned}$$
(14)

3.1 Algorithm

To evaluate the reward function, the environment model \(p(s_{i+1}\vert a_i,s_i)\) and the inverse action distribution function \(\pi '_{\theta }(a_{i}\vert s_{i+1},s_i)\) have to be estimated. We model both distributions using conditional normalizing flows and train them with maximum likelihood based on expert demonstrations and rollout data from a replay buffer. The environment model \(p(s_{i+1}\vert a_i,s_i)\) is modeled by \(\mu _{\phi }(s_{i+1}\vert a_i,s_i)\) with parameter \(\phi \), while the inverse action distribution function \(\pi '_{\theta }(a_{i}\vert s_{i+1},s_i)\) is modeled by \(\mu _{\eta }(a_{i}\vert s_{i+1},s_i)\) with parameter \(\eta \).
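The concrete flow architecture is not essential to the method; as a hedged sketch, a conditional affine-coupling flow in PyTorch could provide the required `log_prob(x, context)` interface. The coupling design, layer sizes, and all names below are illustrative assumptions rather than the architecture used in this work.

```python
import math
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """One affine coupling layer: scale and shift of the second half of the input
    depend on the first half and on a conditioning vector (e.g. [s_i, a_i])."""
    def __init__(self, dim, context_dim, hidden=128):
        super().__init__()
        self.d = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.d + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)))

    def forward(self, x, context):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.net(torch.cat([x1, context], dim=-1)).chunk(2, dim=-1)
        s = torch.tanh(s)                              # bounded log-scale for stability
        z = torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)
        return z, s.sum(dim=-1)                        # transformed sample, log|det J|

class ConditionalFlow(nn.Module):
    """Minimal conditional normalizing flow with a standard normal base distribution,
    exposing log_prob(x, context) as assumed for mu_phi(s'|a,s) and mu_eta(a|s',s)."""
    def __init__(self, dim, context_dim, n_layers=4):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList(
            [ConditionalAffineCoupling(dim, context_dim) for _ in range(n_layers)])

    def log_prob(self, x, context):
        z, log_det = x, torch.zeros(x.shape[0], device=x.device)
        for layer in self.layers:
            z, ld = layer(z, context)
            log_det = log_det + ld
            z = z.flip(dims=[-1])                      # permute dimensions between couplings
        log_base = -0.5 * (z ** 2).sum(dim=-1) - 0.5 * self.dim * math.log(2 * math.pi)
        return log_base + log_det
```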

The whole training process according to Algorithm 1 is described in the following. The expert state transition model \(\mu ^E(s_{i+1}\vert s_i)\) is trained offline on the expert dataset \(D_E\), which contains K expert state trajectories. We assume the expert distribution is correctly represented by the dataset. If the expert demonstrations are incomplete and important information is missing, the resulting policy might perform sub-optimally; under such circumstances, additional guidance like an auxiliary reward might be required for optimal performance. We show that our method can learn meaningful expert state transition distributions even for incomplete expert trajectories (Appendix G).

Since we use a normalizing flow to estimate the expert state transition distribution, this distribution can be complex and multimodal, so learning behavior from different experts is possible. We performed experiments with experts trained in the same environment using different reinforcement learning algorithms. The results show that our method is able to train well-performing agents also in the case of multimodal demonstrations coming from different expert trajectories.

Density modeling of the expert state transitions can still result in overfitting when only a few expert demonstrations are available (due to the limited number of samples). We therefore improve the expert training process by adding Gaussian noise to the state values. The standard deviation of the noise is reduced during training so that the model obtains a correct estimate of the density without overfitting to the explicit demonstrations. With this approach, we are able to successfully train expert state transition models from as few as one expert trajectory. The influence of this improved routine is studied in Appendix F.
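A minimal sketch of this annealed-noise maximum-likelihood training of the expert model is given below. The linear schedule, its endpoints, and the full-batch updates are assumptions for illustration, and the density model is assumed to expose `log_prob(x, context)` as in the flow sketch above.

```python
import torch

def train_expert_model(expert_model, expert_states, expert_next_states,
                       epochs=200, sigma_start=0.1, sigma_end=0.0, lr=1e-4):
    """Offline maximum-likelihood training of mu_E(s'|s) with annealed Gaussian noise.
    Noise is added to the state values and its standard deviation is decayed to zero;
    the linear schedule and its endpoints are illustrative assumptions."""
    opt = torch.optim.Adam(expert_model.parameters(), lr=lr)
    for epoch in range(epochs):
        sigma = sigma_start + (sigma_end - sigma_start) * epoch / max(epochs - 1, 1)
        s = expert_states + sigma * torch.randn_like(expert_states)
        s_next = expert_next_states + sigma * torch.randn_like(expert_next_states)
        loss = -expert_model.log_prob(s_next, context=s).mean()   # negative log-likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return expert_model
```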

After this initial step, the following process is repeated in each episode until convergence. The policy \(\pi _{\theta }(\hat{a}_i\vert s_i)\) interacts with the environment for T steps by generating an action \(\hat{a}_i\) based on the current state \(s_i\). At each time step, the environment generates a next state \(s_{i+1}\) based on its current state and the received action. The collected state-action-next-state information is saved in the replay buffer \(D_{RB}\). The conditional normalizing flows for the environment model \(\mu _{\phi }(s_{i+1}\vert a_i,s_i)\) (policy independent) and the inverse action distribution model \(\mu _{\eta }(a_{i}\vert s_{i+1},s_i)\) (policy dependent) are optimized for N steps using samples from the replay buffer \(D_{RB}\). In Appendix C we show that this reduces the KLD (11) in each step.

To train the Q-function, we compute a one-sample Monte Carlo approximation of the reward using the learned models together with samples from the replay buffer. The policy \(\pi _{\theta }(a_i\vert s_i)\) is updated based on \(J_{\pi }(\theta )\) using SAC as described in Section 2.1. The entropy weight \(\alpha \) in (1) is kept constant at 1 during the optimization, without the automatic entropy weight tuning proposed in [17].

The SAC-based Q-function training and policy optimization also minimize the KLD in (11) (see (13)) in each step. During the SAC optimization, the reward function is fixed; its subcomponents (\(\mu _{\phi }\) and \(\mu _{\eta }\)) are trained in alternation with the policy optimization. However, since after each SAC policy learning step we use data from new rollouts to adapt the inverse action distribution approximation (Algorithm 1), \(\pi '_{\theta }\) and \(\mu _{\eta }\) do not diverge. In theory, given unlimited samples and well-behaved function approximators, the inverse action distribution \(\pi '_{\theta }\) can be approximated without bias. The SAC-based policy optimization together with the inverse action policy learning leads to a converging algorithm, since all steps reduce the KLD, which is bounded from below by 0 (see Appendix C).

It is worth noting that the overall algorithm is non-adversarial: the inverse action policy optimization and the policy optimization using SAC both reduce the overall objective, the KLD. In contrast, AIL algorithms (like OPOLO) are based on an adversarial nested min-max optimization. Additionally, we can estimate the similarity of the state transitions of our policy to those of the expert at each optimization step, since we model all densities in the rewritten KLD from (11). As a result, we have a reliable performance estimate that enables us to select the best-performing policy based on the lowest KLD between policy and expert state transition trajectories.

3.2 Relation to learning from observations

The LfO objective of previous approaches like OPOLO minimizes the divergence between the joint policy state transition distribution and the joint expert state transition distribution

$$\begin{aligned} J_{LfO}(\pi _{\theta }) = \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}(s',s)\vert \vert \mu ^E(s',s)), \end{aligned}$$
(15)

which can be rewritten as (see Appendix A)

$$\begin{aligned} J_{LfO}(\pi _{\theta })&= \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}(s_T,\ldots ,s_0)\vert \vert \mu ^E(s_T,\ldots ,s_0)) \nonumber \\&\quad + \sum _{i = 1}^{T-1} \mathbb {D}_{KL}(\mu ^{\pi _{\theta }}(s_i)\vert \vert \mu ^E(s_i)). \end{aligned}$$
(16)

Thus, the LfO objective minimizes both the KLD of the joint distributions and the KLDs of the marginal distributions at all time steps. The SOIL-TDM objective, in comparison, minimizes purely the KLD of the joint distributions. In the case of perfect distribution matching - a zero KLD between the joint distributions - the KLDs of the marginals also vanish, so both objectives have the same optimum. Minimizing purely the KLD of the joint distributions can contribute to the robustness of the learning algorithm, as demonstrated by Jaegle et al. [10]. Methods based on conditional state probabilities are less sensitive to erroneously penalizing features that may not be in the demonstrator data but lead to correct transitions. Hence, such methods may be less prone to overfitting to irrelevant differences between the learner and the expert data. This, as well as the relation to the work by Jaegle et al. [10], is discussed further in Section 4.1.

Algorithm 1: State-Only Imitation Learning by Trajectory Distribution Matching (SOIL-TDM).
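Since the listing itself appears as a figure, the following Python-style skeleton reconstructs the training loop from the description in Section 3.1, reusing the helper sketches from above. All component names (`env`, `policy`, `replay_buffer`, `update`, `soft_update`, the hyperparameters) are placeholders for standard RL infrastructure, and the loop structure is an interpretation of the text rather than the authors' exact algorithm.

```python
import torch

# Placeholders assumed to exist: env, policy, q_net, q_target, replay_buffer,
# env_model, inv_action_model, expert_model, update(), soft_update().

expert_model = train_expert_model(expert_model, expert_states, expert_next_states)  # offline step

for episode in range(num_episodes):
    # 1) Roll out the current policy for T steps; no environment reward is stored.
    s = env.reset()
    for t in range(T):
        a = policy.act(s)
        s_next, done = env.step(a)
        replay_buffer.add(s, a, s_next)
        s = env.reset() if done else s_next

    # 2) Maximum-likelihood updates of the forward and inverse dynamics flows (N steps).
    for _ in range(N):
        s_b, a_b, s_next_b = replay_buffer.sample(batch_size)
        fwd_loss = -env_model.log_prob(s_next_b, context=torch.cat([s_b, a_b], -1)).mean()
        inv_loss = -inv_action_model.log_prob(a_b, context=torch.cat([s_b, s_next_b], -1)).mean()
        update(env_model, fwd_loss)
        update(inv_action_model, inv_loss)

    # 3) SAC updates with the one-sample reward of eq. (12) and fixed alpha = 1.
    for _ in range(num_sac_updates):
        s_b, a_b, s_next_b = replay_buffer.sample(batch_size)
        r_b = soil_tdm_reward(s_b, a_b, s_next_b, env_model, inv_action_model, expert_model)
        q_loss, policy_loss = sac_losses((s_b, a_b, r_b, s_next_b),
                                         policy, q_net, q_target, alpha=1.0)
        update(q_net, q_loss)
        update(policy, policy_loss)
        soft_update(q_target, q_net)
```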

4 Related work

Many recent IL approaches are based on inverse RL (IRL) [20]. In IRL, the goal is to learn a reward signal under which the expert policy is optimal. AIL algorithms are popular methods for performing IL in an RL setup [2, 19, 21]. In AIL, a discriminator is trained to distinguish between expert states and states from policy rollouts, while the goal of the policy is to fool the discriminator. Through this two-player game, the policy is optimized to match the state-action distribution of the expert. Based on this idea, more general approaches have been derived from f-divergences. Ni et al. [16] derived an analytic gradient of any f-divergence between the agent and expert state distribution w.r.t. reward parameters. Based on this gradient, they presented the algorithm F-IRL, which recovers a stationary reward function from the expert density by gradient descent. Ghasemipour et al. [18] identified that IRL's state-marginal matching objective contributes most to its superior performance and applied this understanding to teach agents a diverse range of behaviors using simply hand-specified state distributions.

A key problem of AIL for LfD and LfO is optimization instability [6]. Wang et al. [22] avoided the instabilities resulting from adversarial optimization by estimating the support of the expert policy to compute a fixed reward function. Similarly, Brantley et al. [23] used a fixed reward function obtained by estimating the variance of an ensemble of policies. Both methods rely on additional behavioral cloning steps to reach expert-level performance. Liu et al. [24] proposed Energy-Based Imitation Learning (EBIL), which recovers fixed and interpretable reward signals by directly estimating the expert's energy. Neural Density Imitation (NDI) [25] uses density models to perform distribution matching. Deterministic and Discriminative Imitation (D2-Imitation) [26] requires no adversarial training; it partitions samples into two replay buffers and then learns a deterministic policy via off-policy reinforcement learning. Inverse soft-Q learning (IQ-Learn) [27] avoids adversarial training by learning a single Q-function that implicitly represents both reward and policy. The implicitly learned rewards of IQ-Learn show a high positive correlation with the ground-truth rewards.

LfO can be divided into model-free and model-based approaches. GAILfO [5] is a model-free approach that applies the GAIL principle with a state-only discriminator input. Yang et al. [28] analyzed the gap between the LfD and LfO objectives and proved that it lies in the disagreement of inverse dynamics models between the imitator and the expert. Their proposed method, Inverse-Dynamics-Disagreement-Minimization (IDDM), is based on an upper bound of this gap in a model-free way. OPOLO [12] is a sample-efficient LfO approach also based on AIL, which enables off-policy optimization. Its policy update is additionally regularized with an inverse action model that assists distribution matching from a mode-covering perspective.

Other model-based approaches apply either forward dynamics models or inverse action models. Sun et al. [29] proposed a solution based on forward dynamics models to learn time-dependent policies; although provably efficient, it is not suited to infinite-horizon tasks. Alternatively, behavioral cloning from observations (BCO) [13] learns an inverse action model from simulator interactions to infer actions for the expert state demonstrations. GPRIL [30] uses normalizing flows as generative models to learn backward dynamics models that estimate predecessor transitions and augment the expert dataset with further trajectories leading to expert states. Jiang et al. [31] investigated IL using few expert demonstrations and a simulator with misspecified dynamics. A detailed overview of LfO is given by Torabi et al. [4].

4.1 Method discussion and relation to FORM

While our proposed method SOIL-TDM was developed independently, it is most similar to the state-only approach FORM [10]. In FORM, policy training is guided by a conditional density estimate of the expert's observed state transitions. In addition, a state transition model \(\mu ^{\pi _{\theta }}_{\Phi }(s_{i+1}\vert s_{i})\) of the current policy is learned. The policy reward is estimated by \(r_i = \log \mu ^E(s_{i+1}\vert s_{i}) - \log \mu ^{\pi _{\theta }}_{\Phi }(s_{i+1}\vert s_{i})\). The approach matches conditional state transition probabilities of expert and policy, in contrast to matching joint state-action densities (like GAIL) or joint state-next-state densities (like OPOLO or GAILfO). The authors of FORM argue that this conditional state matching contributes to the robustness of their approach: methods based on conditional state probabilities are less sensitive to erroneously penalizing features that may not be in the demonstrator data but lead to correct transitions, and are hence less prone to overfitting to irrelevant differences. Jaegle et al. [10] demonstrate the benefit of such a conditional density matching approach.

In contrast to FORM, we show in (10) that the policy's next-state conditional density \(\mu ^{\pi _{\theta }}(s_{i+1}\vert s_{i})\) can be separated into the policy's action density and the forward and backward dynamics densities. Using this decomposition, we show that the KLD minimization is equivalent to a maximum entropy RL objective (see (13)) with a special reward (see (12)). Here, the entropy of the policy stemming from the decomposition of the conditional state-next-state density leads to the maximum entropy RL objective. Jaegle et al. [10] mention that the second term \(\log \mu ^{\pi _{\theta }}_{\Phi }(s_{i+1}\vert s_{i})\) in their reward objective can be viewed as an entropy-like expression. Hence, if this reward is optimized with an RL algorithm that includes some form of policy entropy regularization, like SAC, the entropy is effectively counted twice. In the experiments, we show that this double counting of the policy entropy negatively affects the sample efficiency of the algorithm in comparison to our method.
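To make the contrast concrete, a FORM-style reward under the same assumed `log_prob(x, context)` interface would look as follows; unlike the SOIL-TDM reward of (12) (sketched earlier as `soil_tdm_reward`), it already contains an entropy-like term:

```python
def form_reward(s, s_next, expert_model, policy_state_model):
    """FORM-style per-step reward: log mu_E(s'|s) - log mu_pi_Phi(s'|s).
    The second term acts like an entropy bonus, so optimizing this reward with an
    entropy-regularized algorithm such as SAC counts the policy entropy twice.
    The SOIL-TDM reward of eq. (12) omits the policy entropy, which is instead
    supplied once by the maximum entropy objective (13)."""
    return (expert_model.log_prob(s_next, context=s)
            - policy_state_model.log_prob(s_next, context=s))
```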

5 Experiments

We evaluate our proposed method described in Section 3 in a variety of different IL tasks and compare it against the baseline methods OPOLO, F-IRL, and FORM. For all methods, we use the complex and high-dimensional continuous control environments AntBulletEnv-v0, HalfCheetahBulletEnv-v0, HopperBulletEnv-v0, Walker2DBulletEnv-v0, HumanoidBulletEnv-v0 of the Pybullet physics simulation [32]. To evaluate the performance of all methods, the cumulative rewards of the trained policies are compared to cumulative rewards from the expert policy. The expert data generation as well as the used baseline implementations are described in Appendix E.

Since we assume no environment reward is available as an early stopping criterion, we use other convergence estimates available during training to select the best policy for each method. In adversarial training, the duality gap [8, 9] is an established method for estimating the convergence of the training process. In the IL setup, the duality gap can be very difficult to estimate since it requires the gradient of the policy and the environment (i.e., the gradient of the generator) for the optimization process it relies on. We therefore use two alternatives for model selection for OPOLO: the first selects the model with the lowest policy loss, and the second selects the model with the highest estimated cumulative reward over ten consecutive epochs. For F-IRL, we selected the model with the lowest estimated Jensen-Shannon divergence over ten epochs. To estimate the convergence of SOIL-TDM, the policy loss based on the KLD from (11) is used; it can be estimated with the same models used for training the policy. Similarly, we used the effect models of FORM to estimate convergence based on the reward.
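For SOIL-TDM, model selection therefore reduces to picking the checkpoint with the lowest estimated KLD-based policy loss. A simple sketch is given below; averaging over a window of consecutive epochs to reduce noise is an assumption (the paper reports ten-epoch windows only for some baseline criteria).

```python
def select_best_checkpoint(checkpoints, kld_estimates, window=10):
    """Select the checkpoint whose estimated KLD-based policy loss (eq. (11)),
    averaged over a window of consecutive epochs, is lowest. The window size is
    an illustrative assumption to smooth the noisy per-epoch estimates."""
    best_idx, best_score = None, float("inf")
    for i in range(window - 1, len(kld_estimates)):
        score = sum(kld_estimates[i - window + 1:i + 1]) / window
        if score < best_score:
            best_idx, best_score = i, score
    return checkpoints[best_idx], best_score
```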

Many IL approaches are evaluated based on their asymptotic performance. Although this is a reasonable comparison, we argue that early stopping based on estimated performance gives valuable insights into how well policy performance can be estimated without relying on external signals like an environment reward. It especially shows that the performance of adversarial methods can be estimated less reliably; estimating convergence asymptotically for unknown hyperparameter setups without any external signal is therefore also less reliable. However, for complete transparency, we also include results based on the best environment reward and add further results in Appendix H, which show the time to convergence as well as the asymptotic performance of all methods.

The evaluation is done by running 3 training runs with ten test episodes each (30 rollouts in total) for each trained policy and calculating the respective mean and confidence interval over all runs. The methods are compared for different amounts of expert trajectories. A limited amount of expert demonstrations can cause a suboptimal learned representation of the expert behavior, which can lead to a deviation from expert performance. We plot the cumulative reward normalized so that 1 corresponds to expert performance; values above 1 mean that the agent achieved a higher cumulative reward than the mean of the expert. The expert in the proposed experiments is a policy trained on the true environment reward using SAC. The expert is not necessarily optimal w.r.t. the environment reward (episode rewards of the expert are reported in Appendix E); hence, achieving the most similar behavior is more desirable than surpassing the reward of the expert. Additionally, smaller confidence intervals are desirable since they indicate higher training stability and more reliable results. Implementation details of our method are described in Appendix D.

Fig. 1: Unknown true environment reward selection criterion: relative cumulative reward for different amounts of expert trajectories on continuous control environments. The best policies were selected based on estimated convergence values. The value 1 corresponds to expert policy performance. Confidence intervals are plotted in lighter colors.

The evaluation results of the discussed methods on a suite of continuous control tasks with unknown true environment reward and the previously described selection criteria are shown in Fig. 1.

Fig. 2: Best true environment reward selection criterion: relative cumulative reward for different amounts of expert trajectories on continuous control environments. The best policies were selected based on the cumulative reward. The value 1 corresponds to expert policy performance. Confidence intervals are plotted in lighter colors.

Although the true environment reward is unknown, the results show that SOIL-TDM matches or surpasses the performance of the baseline methods on all tested environments (except for two and four expert trajectories in the HumanoidBulletEnv-v0 environment and two trajectories in the Walker2DBulletEnv-v0 environment). In general, the adversarial methods OPOLO and F-IRL exhibit a high variance of the achieved rewards under the proposed selection criteria. Although loss and reward are well suited for selecting the best model in standard setups, the results demonstrate that they can be less expressive for estimating convergence in adversarial training due to the min-max game between the discriminator and the policy. The stability of the SOIL-TDM training method is evident from the small confidence band of the results, which gets smaller with more expert demonstrations. Although the model selection for FORM is more stable than for the adversarial methods, FORM generally achieves lower rewards in the sample-efficient regime of one and two expert trajectories.

Figure 2 shows the benchmark results of OPOLO, F-IRL, FORM, and SOIL-TDM when the true environment reward is used as an early stopping criterion. In this setup, our method still achieves performance competitive with or surpassing OPOLO, F-IRL, and FORM. Compared to the results from Fig. 1, the baseline methods often achieve better results using the true environment reward as a selection criterion. In contrast, our proposed method achieves similar results under both selection criteria, which underlines the reliability of our proposed performance estimation.

We argue that the reduced performance of the baseline methods OPOLO and F-IRL is due to the lack of reliable and tractable convergence estimators for adversarial approaches when the best policy is selected based on estimated performance. The results based on the best true environment reward selection criterion are more similar across the different IL approaches. The optimum can only be reached if enough expert demonstrations are available, and given enough trajectories all methods reach expert-level performance. Although the statement that SOIL-TDM and OPOLO have the same optimum under the condition of a zero KLD between the joint distributions (which might be violated with conditional Gaussian policies and limited data) is not contradicted by the experiments, we show that SOIL-TDM is efficient with respect to expert demonstrations. Additional figures comparing training performance and efficiency can be found in Appendix H. An ablation study of our method is in Appendix F.

6 Conclusion

In this work, we propose a non-adversarial state-only imitation learning approach based on minimizing the Kullback-Leibler divergence between the policy and the expert state trajectory distributions. This objective leads to a maximum entropy reinforcement learning problem with a reward function depending on the expert state transition distribution and the forward and backward dynamics of the environment, which can be modeled using conditional normalizing flows. The proposed approach is compared to several state-of-the-art learning-from-observation methods in a scenario with unknown environment rewards and achieves state-of-the-art performance.