Lipschitzness is all you need to tame off-policy generative adversarial imitation learning

Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyper-parameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. We show that forcing the learned reward function to be locally Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. We complement these guarantees with empirical evidence attesting to the strong positive effect that the consistent satisfaction of the Lipschitzness constraint on the reward has on imitation performance. Finally, we tackle a generic pessimistic reward preconditioning add-on spawning a large class of reward shaping methods, which makes the base method it is plugged into provably more robust, as shown in several additional theoretical guarantees. We then discuss these through a fine-grained lens and share our insights. Crucially, the guarantees derived and reported in this work are valid for any reward satisfying the Lipschitzness condition; nothing is specific to imitation. As such, these may be of independent interest.


Introduction
Imitation learning (IL) (Bagnell 2015) sets out to design artificial agents able to adopt a behavior demonstrated via a set of expert-generated trajectories. Also referred to as "teaching by showing" (Schaal 1997), IL can replace tedious tasks such as manual hard-coded agent programming, or hand-crafted reward design, i.e. "reward shaping" (Ng et al. 1999), for the agent to be trained via reinforcement learning (RL) (Sutton and Barto 1998). Besides, in contrast with the latter, imitation learning does not necessarily involve agent-environment interactions. This feature is particularly appealing in real-world domains such as robotics (Atkeson and Schaal 1997; Schaal 1997; Ratliff et al. 2007; Billard et al. 2008), where the artificial agent is physically implemented with expensive hardware, and the environment contains enough external entities (e.g. humans, other artificial agents, other costly devices) to raise safety concerns (Ha et al. 2020; Kahn 2016; Ray et al. 2019; Held et al. 2017). When controls are provided in the demonstrations [or recovered via inverse dynamics from the available kinematics (Hanna and Stone 2017)], we can treat said controls as regression targets, and learn a mimicking policy with a simple, supervised approach. This interaction-free approach (simulated or physical, real-world interactions), called behavioral cloning (BC), has enabled the success of various endeavors in robotic manipulation and locomotion (Ratliff et al. 2007; Wang et al. 2017), in autonomous driving, with the first self-driving vehicle (Pomerleau 1989, 1990) thirty years ago and more recently with (Gu et al. 2020) using Waymo's open dataset (Sun et al. 2019), and also in grand challenges like AlphaGo and AlphaStar (Vinyals et al. 2019). Due to its conceptual simplicity, we expect BC to still be a part of the pipeline for the most ambitious enterprises going forward, especially as open datasets slowly get released.
Despite its practical advantages, BC is extremely data-hungry w.r.t. the amount of expert demonstrations it needs to yield robust, high-fidelity policies. Besides, unless corrective behavior is present in the dataset (e.g. in autonomous driving, how to drive back onto the road), the policy learned via BC will not be able to internalize this behavior. Once the agent is in a situation from which it cannot recover, there will be a permanent covariate shift between its current observations and the demonstrated ones. The controls learned in a supervised manner on the expert dataset are therefore useless, due to the distributional shift. As a result, the agent's errors will compound, a phenomenon coined by Ross and Bagnell (2010) as compounding errors. In Sect. 6.2.3, we stress how the latter echoes the compounding variations phenomenon, exhibited as part of the theoretical contributions of this work. To address the shortcomings of BC, Abbeel and Ng (2004) propose to harness the innate credit assignment (Sutton and Barto 1998) capabilities of RL, by first trying to learn the cost function underlying the demonstrated behavior [inverse RL (Ng et al. 2000)], before using this cost to optimize a policy via RL. The succession of inverse RL and RL is called apprenticeship learning (AL) (Abbeel and Ng 2004), and can, by design, yield policies that can recover from out-of-distribution situations thanks to RL's built-in temporal abstraction mechanisms. Cost learning however is incredibly tedious, and successful approaches end up requiring coarse relaxations to avoid being deterringly computationally expensive (Abbeel and Ng 2004). Ultimately, as noted by Ziebart et al. (2008), setting out to recover the cost signal under which the expert demonstrations are optimal (base assumption of inverse RL) is an ill-posed objective, echoing the reward shaping considerations from Ng et al. (1999).
In line with this statement, generative adversarial imitation learning (GAIL) (Ho and Ermon 2016) departs from the typical AL pipeline, and replaces learning the optimal cost ("optimal" in the inverse RL sense) by learning a surrogate cost function. GAIL does so by leveraging generative adversarial networks (Goodfellow et al. 2014), as the name hints. The method is described in greater detail in Sect. 3. Due to the RL step it involves (like any AL method), GAIL suffers from poor sample-efficiency w.r.t. the amount of interactions it needs to perform with the environment. This caveat has since been addressed, notably by transposition to the off-policy setting, concurrently in SAM (Blondé and Kalousis 2019) and DAC (Kostrikov et al. 2019) (cf. Sect. 4). Both adversarial IL methods leverage actor-critic architectures, consequently suffering from a greater exposure to instabilities. These weaknesses are mitigated with various complementary techniques, and cautious hyper-parameter tuning.
In this work, we set out to first conduct a thorough theoretical and empirical investigation into off-policy generative adversarial imitation learning, to pinpoint which techniques are instrumental in performing well, and shed light on which ones can be discarded or disregarded without a decrease in performance. Ultimately, we would like to exhibit the techniques that are sufficient for the method to achieve peak performance. Virtually every algorithmic design choice made in this work is supported by an ablation study reported in the Appendix. We start by describing the base off-policy adversarial imitation learning method at the core of this work in Sect. 4. We then undertake diagnoses of the various issues that arise from the combination of bilevel optimization problems at the core of the investigated model in Sect. 5. A key contribution of our work consists in showing that enforcing a Lipschitzness constraint on the learned surrogate reward is a necessary condition for the method to even learn anything, in our consumer-grade, computationally affordable hardware setting. We study it closely, providing empirical evidence of the importance of this constraint through detailed ablation results in Sect. 5.5. We follow up on this empirical evidence with theoretical results in Sect. 6.1, characterizing the Lipschitzness of the state-action value function under said reward Lipschitzness condition, and discuss the obtained variation bounds subsequently. Crucially, we show that without variation bounds on the reward, a phenomenon we call compounding variations can cause the variations of the state-action value to explode. As such, the theoretical results reported in Sect. 6.1 (and discussed in Sect. 6.2) corroborate the empirical evidence exhibited in Sect. 5.5. Note that the theoretical results reported in this work are valid for any reward satisfying the condition: they readily transfer to the general RL setting and are not specific to imitation.
The theoretically-grounded Lipschitzness condition, implemented as a gradient penalty, is in practice a local Lipschitzness condition. We therefore investigate where (i.e. on which samples, on which input distribution) the local Lipschitzness regularization should be enforced. We propose a new interpretation of the regularization scheme through an RL perspective, make an intuitively grounded claim on where to enforce the constraint to get the best results, and corroborate our claim empirically (cf. Sect. 6.3). Crucially, we show that the consistent satisfaction of the Lipschitzness constraint on the reward is a strong predictor of how well the mimicking agent performs empirically (cf. Sect. 6.4). Finally, we introduce a generic pessimistic reward preconditioner which makes the base method it is plugged into provably more robust, as attested by its companion guarantees (cf. Sect. 6.5). Again, these guarantees are not specific to imitation and can be of independent interest for the RL community. Among the reported insights, we give an illustrative example of how the simple technique can further increase the robustness of the method it is plugged into.
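To make the notion of a gradient penalty concrete, the following is a minimal sketch, not the implementation used in this work. It assumes the penalty form of Gulrajani et al. (2017), evaluated at a random interpolate between an expert sample and an agent sample; where the constraint should actually be enforced is precisely what Sect. 6.3 investigates. The names `input_gradient` and `gradient_penalty`, the target norm of 1, and the use of finite differences in place of automatic differentiation are all illustrative assumptions.

```python
import math
import random


def input_gradient(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f w.r.t. its input x.

    Assumption: a real implementation would use automatic differentiation instead.
    """
    grad = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad


def gradient_penalty(f, expert_x, agent_x, target=1.0, rng=random):
    """One-sample penalty (||grad_x f(x_hat)|| - target)^2.

    x_hat is a uniformly random interpolate between an expert input and an
    agent input (hypothetical choice of where to enforce the constraint).
    """
    alpha = rng.random()
    x_hat = [alpha * e + (1.0 - alpha) * a for e, a in zip(expert_x, agent_x)]
    norm = math.sqrt(sum(g * g for g in input_gradient(f, x_hat)))
    return (norm - target) ** 2


# Usage: a linear function with gradient (3, 4) has gradient norm 5 everywhere,
# so the penalty is (5 - 1)^2 regardless of the interpolate drawn.
penalty = gradient_penalty(lambda x: 3.0 * x[0] + 4.0 * x[1], [1.0, 0.0], [0.0, 1.0])
```

Because the penalty only constrains the gradient norm at the sampled points, it encourages local rather than global Lipschitzness, which is why the choice of sampling distribution matters.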

Related work
Off-policy generative adversarial imitation learning, which is the object of this work, involves learning a parametric surrogate reward function from expert demonstrations. By design (Blondé and Kalousis 2019; Kostrikov et al. 2019), this signal is learned at the same time as the policy, and is therefore subject to non-stationarities (cf. Sect. 5.2). This reward regime is reminiscent of the reward corruption phenomenon (Everitt et al. 2017; Romoff et al. 2018), which posits that real-world rewards are imperfect (e.g. uncontrolled task specification change, sensor defects, reward hacking) and must therefore be treated as such, i.e. non-stationary at the very least. Despite being learned and therefore liable to non-stationary behavior, our reward is internal (as opposed to outside the agent's and practitioner's scope) and is therefore fully observable, as well as controllable via the practitioner-specified algorithmic design. The reward corruption can consequently be acted upon, and more easily mitigated than if it stemmed from a black-box reward originating from the unknown environment.
The demonstrations, on the other hand, are available from the very beginning, and do not change as the policy learns. In that respect, our approach differs from observational learning (Borsa et al. 2017), where the policy learns to imitate another policy by observing it learn in the environment; the observed policy therefore does not strictly qualify as an expert at the task. Observational learning draws clear parallels with the teacher-student scheme in policy distillation. While our reward is changing since the policy changes and due to the inherent learning dynamics of function approximators, in observational learning, the reward would be changing also due to the expert still learning, causing a distributional drift.
Multi-armed bandits (Robbins 1952) have received a lot of attention in recent years to formalize and model problems of sequential decision making under uncertainty. In the context of this work, the most appropriate variants of bandits are stateful contextual multi-armed bandits. As the name hints, such models formalize decision making specific to given situations (i.e. contexts, states), in which the situations are i.i.d.-sampled. We consider the case of reinforcement learning, where the situations are entangled, along with the decisions themselves, in a Markov decision process (cf. Sect. 3). In particular, non-stationary reward channels in Markov decision processes have been studied extensively (cf. Sect. 5.2). Among these, adversarial bandits (Auer et al. 1995) can be seen as the archetype or worst-case reward corruption scenario, in which an adversary (possibly driven by malevolent intents) decides on the reward given to the agent. In these models, the common way to deal with non-stationary reward processes is to assume the reward variations in time are upper-bounded, either per-decision or over longer time periods. We give a comprehensive account of sequential decision making under uncertainty in non-stationary Markov decision processes in Appendix 2. By contrast, our theoretical guarantees are built on the premise that the reward function's variations are bounded over the input space by assuming that the reward function is locally Lipschitz-continuous over it. We make the same assumption on the dynamics of the multi-stage decision process, as well as on the control policy. While our theoretical results ultimately characterize the value function's robustness in terms of Lipschitz-continuity, Fonteneau et al. (2010, 2013) start from the same assumptions, propose an estimator of the expected return, and derive bounds on its bias and variance. Derived in the offline RL setting, their bounds increase as the "dispersion" of the offline dataset increases. As such, our findings and discussions carried out in Sect. 6.2 echo their work.
Several works have recently attempted to address the overfitting problem GAIL suffers from. This is due to the discriminator being able to trivially distinguish agent-generated samples from expert-generated ones, which occurs when the learning dynamics of the adversarial game are not properly balanced. As such, the gist of said techniques is to either weaken the discriminator directly or make its classification task harder, which unsurprisingly coincides with the typical techniques used to cope with overfitting in (binary) classification. These techniques are, in no particular order: reducing the discriminator's capacity by plugging the classifier on top of an independent perception stack (e.g. random features, state-action value convolutional layers) (Reed et al. 2018), smoothing the positive labels with uniform random noise (Blondé and Kalousis 2019), adopting a positive-unlabeled classification objective (instead of the traditional positive-negative one) (Xu and Denil 2019), using a gradient penalty regularizer [originally from (Gulrajani et al. 2017)] (Blondé and Kalousis 2019; Kostrikov et al. 2019), leveraging an adaptive information bottleneck in the discriminator network (Peng et al. 2018), and enriching the expert dataset via task-specific data augmentation (Zolna et al. 2019). In this work, we do not propose a new regularization technique. Instead, we perform an in-depth analysis of the simplest techniques (in terms of conceptual simplicity, implementation time, number of parameters, and computational cost (Hernandez and Brown 2020)) and ultimately find that the gradient penalty regularizer achieves the best trade-off.
A large-scale empirical study of adversarial imitation learning (Orsini et al. 2021), released very recently, considers a wide range of hyper-parameter settings, reporting results for more than 500k trained agents. The authors conclude that their study adds nuances to ours (this work). In particular, they argue that while the regularization techniques that urge the reward to be Lipschitz-continuous indeed do improve the performance (hence corroborating what we show in the first investigation of our work; cf. Sect. 5.5), more traditional regularizers (e.g. weight decay, dropout) can often perform similarly. In this work, we align the notion of smoothness with the Lipschitz-continuity of a function approximator, and therefore focus, from Sect. 5.5 onward, on gradient penalization because it explicitly enforces the reward to be smooth. More importantly, reward Lipschitzness is among the premises of our theoretical guarantees. In the results reported in (Orsini et al. 2021), the discriminator regularization schemes that can perform on par with schemes enforcing Lipschitz-continuity explicitly [gradient penalization (Gulrajani et al. 2017), and spectral normalization (Miyato et al. 2018)], which are always the top performers, are: dropout (Srivastava et al. 2014), weight decay (Loshchilov and Hutter 2017), and mixup (performing data augmentation). Regularization schemes such as dropout, weight decay, and data augmentation are less often seen through the lens of smoothness regularization than through the lens of generalization, despite generalization being among the beneficial effects of smoothness (Rosca et al. 2020). Used in the last layer, weight decay (Loshchilov and Hutter 2017) punishes spikes in elements of the weight matrix by limiting its norm, hence not allowing the output of the network to change too much. Dropout (Srivastava et al. 2014) applies masks over hidden activations, making the network return similar outputs when inputs only differ slightly.
When using data augmentation (e.g. in mixup), the network is forced to be close-to-invariant to purposely crafted variations of the input. These regularizers do not enforce Lipschitzness over the input space as explicitly as gradient penalties and spectral normalization do; nevertheless, they do encourage Lipschitzness implicitly, making the predictor more robust as a result. Specifically, as noted in Gouk et al. (2021), when a neural function approximator is trained with dropout, the Lipschitz constant of each layer is multiplied by 1 − r, where r is the dropout rate. It is also noted in Cisse et al. (2017) that using weight decay regularization at the last layer controls the Lipschitz constant of the network. All in all, the methods reported by Orsini et al. (2021) as performing the best are the ones enforcing Lipschitz-continuity over the input space explicitly, and these can be matched by regularization schemes that encourage Lipschitzness over the input space implicitly. As such, these results are complementary to the ones we report in our first investigation in Sect. 5.5, where we found that direct, explicit gradient penalization exceeds the performance of other evaluated regularizers. As we report, not constraining the Lipschitzness of the discriminator yields the worst results among the evaluated alternatives. Keeping the Lipschitz constant of the discriminator in check seems essential. Perhaps more importantly, the empirical investigation we conduct in Sect. 5.5, and that is complemented by Orsini et al. (2021), motivates the derivation of our novel theoretical guarantees. Through these, we provide insights as to why keeping the Lipschitz constant of the reward in check seems to play such an important role in the stability of the value in off-policy adversarial IL. The considerable computational budget spent in Orsini et al. (2021) attests to how challenging the tackled problem is.
Hafner et al. (2011) advocate for the use of a smooth reward signal in RL. Lange et al. (2012) present it as one key method to make learning values in offline RL less tedious. Sharp changes in reward value are hard for the action-value neural function approximator to represent and internalize. Using a smooth reward surrogate derived from the original "jumpy" reward signal, such that the trends are preserved but the crispness is attenuated, proved instrumental empirically. Our observation about reward Lipschitz-continuity being a crucial component of our off-policy imitation learning pipeline is in line with the suggestion of Hafner et al. (2011). On top of providing empirical evidence of its benefits, we also provide a number of theoretical results characterizing the effect of reward smoothness on the smoothness of the value function.
Finally, we point out that local Lipschitz-continuity conditions are also found in the adversarial robustness literature. Notably, Finlay et al. (2018) encourage Lipschitzness via gradient regularization, as is done in our work. Similarly, Hardt et al. (2015) derive bounds under a Lipschitz-continuity assumption on the loss.

Background
Setting In this work, we address the problem of an agent whose goal is, in the absence of extrinsic reinforcement signal (Singh et al. 2009), to imitate the behavior demonstrated by an expert (Bagnell 2015), expressed to the agent via a pool of trajectories. The agent is never told how well she performs or what the optimal actions are, and is not allowed to query the expert for feedback.
Preliminaries The intrinsic behavior of the decision maker is represented by the policy $\pi_\theta$, modeled by a neural network with parameter $\theta$, mapping states to probability distributions over actions. Formally, the conditional probability density over actions that the agent concentrates at action $a_t$ in state $s_t$ is denoted by $\pi_\theta(a_t | s_t)$, for every discrete timestep $t \geq 0$. We model the environment the agent interacts with as an infinite-horizon, memoryless, and stationary Markov Decision Process (MDP) (Puterman 1994) formalized as the tuple $\mathbb{M} := (\mathcal{S}, \mathcal{A}, p, \rho_0, u, \gamma)$. $\mathcal{S} \subseteq \mathbb{R}^n$ and $\mathcal{A} \subseteq \mathbb{R}^m$ are respectively the state space and action space. $p$ and $\rho_0$ define the dynamics of the world, where $p(s_{t+1} | s_t, a_t)$ denotes the stationary conditional probability density concentrated at the next state $s_{t+1}$ when stochastically transitioning from state $s_t$ upon executing action $a_t$, and $\rho_0$ denotes the initial state probability density. $u$ denotes a stationary reward process that assigns, to any state-action pair, a real-valued reward $r_t$ distributed as $r_t \sim u(\cdot | s_t, a_t)$. Finally, $\gamma \in [0, 1)$ is the discount factor. We make the MDP episodic by positing the existence of an absorbing state in every trace of interaction and enforcing $\gamma = 0$ to formally trigger episode termination once the absorbing state is reached. Since our agent does not receive rewards from the environment, she is in effect interacting with an MDP lacking a reward process. Our method however encompasses learning a surrogate reward parameterized by a deterministic function approximator such as a neural network with parameter $\varphi$, denoted by $r_\varphi$, and whose learning procedure will be reported subsequently. Consequently, our agent effectively interacts with the augmentation of the previous MDP defined as $\mathbb{M}^* := (\mathcal{S}, \mathcal{A}, p, \rho_0, r_\varphi, \gamma)$. A trajectory is a trace of $\pi_\theta$ in $\mathbb{M}^*$, i.e. a succession of consecutive transitions $(s_t, a_t, r_t, s_{t+1})$, where $r_t := r_\varphi(s_t, a_t)$. A demonstration is the set of state-action pairs $(s_t, a_t)$ extracted from a trajectory collected by the expert policy $\pi_e$ in $\mathbb{M}$. The demonstration dataset $\mathcal{D}$ is a set of demonstrations.
Objective Building on the reward hypothesis at the core of reinforcement learning (any task can be defined as the maximization of a reward), to act optimally, our agents must be able to deal with delayed signals and maximize the long-term cumulative reward. To address credit assignment, we use the concept of return, the discounted sum of rewards from timestep $t$ onwards, defined as $R_t := \sum_{k=0}^{+\infty} \gamma^k r_{t+k} := \sum_{k=0}^{+\infty} \gamma^k r_\varphi(s_{t+k}, a_{t+k})$ in the infinite-horizon regime. By taking the expectation of the return with respect to all the future states and actions in $\mathbb{M}^*$, after selecting $a_t$ in $s_t$ and following $\pi_\theta$ thereafter, we obtain the state-action value (Q-value) of the policy $\pi_\theta$ at $(s_t, a_t)$: $Q^{\pi_\theta}(s_t, a_t) := \mathbb{E}_{\pi_\theta}[R_t \mid s_t, a_t]$. At state $s_t$, a policy that picks $a_t$ verifying $a_t = \arg\max_{a \in \mathcal{A}} Q^{\pi_\theta}(s_t, a)$ therefore acts optimally looking onwards from $s_t$. Ultimately, an agent acting optimally at all times maximizes $V^{\pi_\theta}(s_0) := \mathbb{E}_{a_0 \sim \pi_\theta(\cdot | s_0)}[Q^{\pi_\theta}(s_0, a_0)]$ for any given start state $s_0 \sim \rho_0$. In fine, we can now define the utility function [also called performance objective (Silver et al. 2014)] that our agent's policy must maximize: $\pi_\theta = \arg\max_{\pi \in \Pi} U_0(\pi)$, where $U_0(\pi) := \mathbb{E}_{s_0 \sim \rho_0}[V^{\pi}(s_0)]$ and $\Pi$ is the search space of parametric function approximators, i.e. deep neural networks.
Generative adversarial imitation learning GAIL (Ho and Ermon 2016) trains a binary classifier $D_\varphi$, called discriminator, where samples from $\pi_e$ are positive-labeled, and those from $\pi_\theta$ are negative-labeled. It borrows its name from Generative Adversarial Networks (Goodfellow et al. 2014): the policy $\pi_\theta$ plays the role of generator and is optimized to fool the discriminator $D_\varphi$ into classifying its generated samples (negatives) as positives. As such, the prediction value indicates to what extent $D_\varphi$ believes $\pi_\theta$'s generations are coming from the expert, and therefore constitutes a good measure of mimicking success. GAIL does not try to recover the reward function that underlies the expert's behavior. Rather, it learns a similarity measure between $\pi_e$ and $\pi_\theta$, and uses it as a surrogate reward function. We say that $\pi_\theta$ and $D_\varphi$ are "trained adversarially" to denote the two-player game they are intricately tied in: $D_\varphi$ is trained to assert with confidence whether a sample has been generated by $\pi_\theta$, while $\pi_\theta$ receives increasingly greater rewards as $D_\varphi$'s confidence in said assertion lowers. In fine, the surrogate reward measures the confusion of $D_\varphi$. In this work, the neural network function approximator modeling $D_\varphi$ uses a sigmoid as output layer activation, i.e. $D_\varphi \in [0, 1]$. The exact zero case is bypassed numerically for $\log \circ D_\varphi$ to always exist, by adding an infinitesimal positive value to $D_\varphi$ inside the logarithm. The same numerical stability trick is used for $\log \circ (1 - D_\varphi)$ to avoid the exact one case (cf. reward formulations in Sect. 4).
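The classification objective described above can be sketched as a plain binary cross-entropy, with expert samples as positives and agent samples as negatives. This is an illustrative sketch only: the function name `discriminator_loss`, the value of the stabilizing constant `EPS`, and the equal weighting of the two terms are assumptions, and the regularizers discussed in this work are deliberately left out.

```python
import math

EPS = 1e-8  # assumed small positive constant keeping both logarithms finite


def discriminator_loss(d_expert, d_agent, eps=EPS):
    """Binary cross-entropy over discriminator outputs in [0, 1].

    d_expert: outputs on expert samples (positive-labeled).
    d_agent:  outputs on agent samples (negative-labeled).
    """
    positive_term = -sum(math.log(d + eps) for d in d_expert) / len(d_expert)
    negative_term = -sum(math.log(1.0 - d + eps) for d in d_agent) / len(d_agent)
    return positive_term + negative_term


# A maximally confused discriminator outputs 0.5 everywhere, which gives a
# loss close to 2 * log(2); saturated outputs remain finite thanks to eps.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
saturated = discriminator_loss([0.0], [1.0])
```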

Comprehensive refresher on the sample-efficient adversarial mimic
Building on TRPO (Schulman et al. 2015), GAIL (Ho and Ermon 2016) inherits its policy evaluation subroutine, consisting in learning a parametric estimate of the state-value function $V_\omega \approx V^{\pi_\theta}$ via Monte-Carlo estimation over samples collected by $\pi_\theta$. While it uses function approximation to estimate $V^{\pi_\theta}$, hoping it generalizes better than a straightforward non-parametric Monte-Carlo estimate (discounted sum), we will reserve the term actor-critic for architectures in which the state-value $V_\omega(\cdot)$ or Q-value $Q_\omega(\cdot, \cdot)$ is learned via Temporal-Difference (TD) (Sutton 1988). This terminology choice is adopted from Sutton and Barto (1998) (cf. Chapter 13.5). A critic is used for bootstrapping, as in the TD update rule (whatever the bootstrapping degree is). As such, TRPO is not an actor-critic, while algorithms learning their value via TD, such as DDPG (Silver et al. 2014; Lillicrap et al. 2016), are actor-critic architectures. Albeit hindered by various weaknesses (cf. Sect. 5.1), and forgetting for a moment that it is combined with function approximation (Sutton et al. 1999; Silver et al. 2014), the TD update is able to propagate information quicker as the backups are shorter and therefore do not need to reach episode termination to learn, in contrast with Monte-Carlo estimation. That is without even involving fictitious, memory, or experience replay mechanisms (Lin 1992). By design, TD learning is less data-hungry (w.r.t. interactions in the environment), and involving replay mechanisms (Lin 1992; Lillicrap et al. 2016; Wang et al. 2016) significantly adds to its inherent sample-efficiency. Based on this line of reasoning, SAM (Blondé and Kalousis 2019) and DAC (Kostrikov et al. 2019) addressed the deterring sample-complexity of GAIL by, among other improvements [cf. (Blondé and Kalousis 2019; Kostrikov et al. 2019)], using an actor-critic architecture to replace TRPO for policy evaluation and improvement. SAM (Blondé and Kalousis 2019) uses DDPG, whereas DAC (Kostrikov et al. 2019) uses TD3 (Fujimoto et al. 2018). Both were released concurrently, and both report significant improvements in sample-efficiency (up to two orders of magnitude). Standing as the stripped-down model that brought sample-efficiency to GAIL, we take SAM as base. Although the algorithm is described momentarily in the body of this work, we urge the reader eager to understand every single aspect of it to also refer to the section in which we describe the experimental setting, cf. Sect. 5.5.
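The contrast between Monte-Carlo estimation and TD bootstrapping drawn above can be illustrated with a short sketch. It is not taken from SAM or DAC: the function names are hypothetical, the TD target shown is the plain one-step target, and the `terminal` flag mimics the convention of setting the discount to zero at the absorbing state.

```python
def monte_carlo_return(rewards, gamma):
    """Discounted sum of rewards; needs the rewards of the whole episode suffix."""
    ret = 0.0
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


def td_target(reward, next_q, gamma, terminal=False):
    """One-step bootstrapped target built from a single transition.

    next_q is the critic's current estimate at the next state-action pair;
    on a terminal transition the bootstrap term is dropped (discount of zero).
    """
    return reward + (0.0 if terminal else gamma * next_q)


# The Monte-Carlo estimate waits for episode termination, whereas the TD
# target is available as soon as one transition has been observed.
mc = monte_carlo_return([1.0, 1.0, 1.0], gamma=0.5)
td = td_target(1.0, next_q=2.0, gamma=0.5)
```

Since each TD target depends on a single stored transition, it combines naturally with a replay mechanism, which is the property the off-policy methods exploit.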
We now lay out the constituents of SAM (Blondé and Kalousis 2019), and how their learning procedures are orchestrated. The agent's behavior is dictated by a deterministic policy $\mu_\theta$, the critic $Q_\omega$ assigns Q-values to actions picked by the agent, and the reward $r_\varphi$ assesses to what degree the agent behaves like the expert. As usual, $\theta$, $\omega$, and $\varphi$ denote the respective parameters of these neural function approximators. To explore when carrying out rollouts in the environment, $\mu_\theta$ is perturbed both in parameter space by adaptive noise injection in $\theta$ (Plappert et al. 2018; Fortunato et al. 2017), and in action space by adding the temporally-correlated response of an Ornstein-Uhlenbeck noise process (Uhlenbeck and Ornstein 1930; Lillicrap et al. 2016) to the action returned by $\mu_\theta$. Formally, in state $s_t$, action $a_t$ is sampled from $\pi_\theta(\cdot | s_t) := \mu_{\theta + \epsilon}(s_t) + \eta_t$, where $\epsilon \sim \mathcal{N}(0, \sigma_a^2)$ ($\sigma_a$ adapts conservatively such that $|\mu_{\theta + \epsilon}(s_t) - \mu_\theta(s_t)|$ remains below a certain threshold), and where $\eta_t$ is the response of the Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein 1930) $\mathrm{OU}$ at timestep $t$ in the episode, such that $\eta_t := \mathrm{OU}(t, \sigma_b)$. Note that $\mathrm{OU}$ is reset upon episode termination. As a first minor contribution, we carried out an ablation study on exploration strategies, and report the results in Appendix 9. While the utility of temporally-correlated noise is somewhat limited to dynamical systems, both parameter noise and input noise injections have proved beneficial in generative modeling with GANs [cf. (Zhao et al. 2017) for the former].
As in GAIL (Ho and Ermon 2016) (described earlier in Sect. 3), the discriminator $D_\varphi$ is trained via an adversarial training procedure (Goodfellow et al. 2014) against the policy $\pi_\theta$. The surrogate reward $r_\varphi$ used to augment MDP $\mathbb{M}$ into $\mathbb{M}^*$ is derived from $D_\varphi$ to reflect the incentive that the agent needs to complete the task at hand. In the tasks we consider in this work [simulated robotics environments (Brockman et al. 2016), based on the MuJoCo (Todorov et al. 2012) physics engine, and described in Table 1] an episode terminates either (a) when the agent fails to complete the task according to a task-specific criterion hard-coded in the environment, or (b) when the agent has performed a number of steps in the environment that exceeds a predefined hard-coded timeout, which we left to its default value, with the exception of HalfCheetah, in which (a) does not apply. Due to (a), the agent can decide to truncate its return by triggering its own failure, and decide to "cut its losses" when it is penalized too heavily for not succeeding according to the task criterion. Always-negative rewards [e.g. per-step "−1" reward to urge the agent to complete the task quickly (Kaelbling 1993)] can therefore make the agent give up and trigger termination as early as possible, as this would maximize its return. On the other hand, always-positive rewards can make the agent content with its sub-optimal actions, which would prevent it from pursuing higher rewards, as long as it remains alive. This phenomenon has been dubbed survival bias in (Kostrikov et al. 2019). Notably, this discussion highlights the tedious challenge that reward shaping (Ng et al. 1999) usually represents to practitioners when designing a new task. Stemming from their generator loss counterparts in the GAN literature, the minimax (saturating) reward variant is $r_\varphi := -\log(1 - D_\varphi)$, and the non-saturating reward variant is $r_\varphi := \log(D_\varphi)$. The minimax reward is always positive, the non-saturating reward is always negative, and the sum of the two can take positive and negative values. We found empirically that using the minimax reward, despite being always positive, yielded by far the best results compared to the sum of the two variants. The performance gap is reduced in the HalfCheetah task, which was expected since it is the only task in which the agent cannot trigger an early termination. We report these comparative results in Appendix 6. Crucially, these results show that the base method considered in this work can already successfully mitigate survival bias, without requiring additional reward shaping. In summary, we use the formulation $r_\varphi := -\log(1 - D_\varphi)$, unless stated otherwise explicitly.
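The three reward formulations compared above can be written down directly from the discriminator output. The sketch below is illustrative: the function names and the value of the stabilizing constant `eps` are assumptions, and `d` stands for the sigmoid output of the discriminator on a state-action pair.

```python
import math


def minimax_reward(d, eps=1e-8):
    """Saturating variant -log(1 - d): positive for d in (0, 1)."""
    return -math.log(1.0 - d + eps)


def non_saturating_reward(d, eps=1e-8):
    """Non-saturating variant log(d): negative for d in (0, 1)."""
    return math.log(d + eps)


def summed_reward(d, eps=1e-8):
    """Sum of both variants: its sign depends on which side of 0.5 d falls."""
    return minimax_reward(d, eps) + non_saturating_reward(d, eps)


# The sign pattern is what matters for survival bias: a per-step reward that
# is always positive rewards staying alive, one that is always negative
# rewards early termination.
for d in (0.1, 0.5, 0.9):
    print(d, minimax_reward(d), non_saturating_reward(d), summed_reward(d))
```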
We also adopt the mechanism introduced in Kostrikov et al. (2019) that wraps the absorbing transitions (agent-generated and expert-generated) to enable the discriminator to distinguish between terminations caused by failure and terminations triggered by the artificially hard-coded timeout. The method enables the discriminator to penalize the agent for terminating by failure when the expert would, with the same action and in the same state, terminate by reaching the episode timeout without failing. In such a scenario, without wrapping the absorbing transitions, the agent would appear to imitate the expert perfectly in the eyes of the discriminator, even though it does not. We use the wrapping mechanism in every experiment. Nonetheless, we omit it from the equations and algorithms for legibility. Giving the agent the ability to differentiate between terminations that are due to time limits and those caused by the environment has proved crucial for the decision maker to continue beyond the time limit. The significant role played by the explicit inclusion of the notion of time in RL was established by Harada (1997), yet without much follow-up, until being revived in Pardo et al. (2018), where the authors demonstrate that a careful inclusion of the notion of time in RL can meaningfully impact performance.

Table 1 State and action dimensions, n and m, of the studied environments from the MuJoCo (Todorov et al. 2012) simulated robotics benchmark from OpenAI Gym (Brockman et al. 2016). Abbrv.: IDP for InvertedDoublePendulum, the continuous control counterpart of Acrobot. In the last column, we report both the mean and standard deviation.

By assuming the roles of opponents in a GAN, φ and θ are tied in a bilevel optimization problem (as highlighted in Pfau and Vinyals (2016)). Similarly, by defining an actor-critic architecture, ω and θ are also tied in a bilevel optimization problem.
We notice the dual role of θ, which is intricately tied in both bilevel problems. As such, what SAM (Blondé and Kalousis 2019) sets out to solve can be dubbed a θ-coupled twin bilevel optimization problem. Note, Q_ω uses the parametric reward r_φ as a scalar detached from the computational graph of the (θ, φ) bilevel problem, as having gradients flow back from Q_ω to φ would prevent D_φ from being learned as intended, i.e. adversarially in the (θ, φ) bilevel problem. The information and gradient flows occurring between the components are illustrated in Fig. 1. As we show via numerous ablation studies in this work, training this θ-coupled twin bilevel system to completion is severely prone to instabilities and highly sensitive to hyper-parameters. Ultimately, we show that r_φ's Lipschitzness is a sine qua non condition for the method to perform well, and study the effects of this necessary condition in several theoretical results in Sect. 6.1.

Sample-efficiency is achieved through the use of a replay mechanism (Lin 1992): every component (every neural network, θ, φ, and ω) is trained using samples from the replay buffer R, a "first in, first out" queue of fixed retention window, to which new rollout samples (transitions) are sequentially added, and from which old rollout samples are sequentially removed. Note however that when a transition is sampled from R, its reward component is re-computed using the most recent r_φ update. Blondé and Kalousis (2019) and Kostrikov et al. (2019) were the first to train D_φ with experience replay, in a non-i.i.d. context (Markovian), for increased learning stability. Borrowing the common terminology, the reward is therefore effectively "learned off-policy". Let β be the off-policy distribution that corresponds to uniform sampling over R. β is therefore effectively a mixture of past policy updates [θ_{i−Δ+1}, …, θ_{i−1}, θ_i], where the mixing depends on R's retention window, and the number of collected samples per iteration.
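The replay mechanism described above, a fixed-size "first in, first out" queue whose rewards are re-computed at sampling time, can be sketched as follows. This is a simplified illustration under stated assumptions: the class name, the tuple layout, and the `reward_fn` callable (standing in for the current learned reward) are hypothetical, and sampling is uniform with replacement.

```python
import random
from collections import deque


class ReplayBuffer:
    """FIFO buffer of fixed retention window; rewards are never stored."""

    def __init__(self, capacity):
        # deque(maxlen=...) drops the oldest transition when full
        self.storage = deque(maxlen=capacity)

    def add(self, state, action, next_state):
        self.storage.append((state, action, next_state))

    def sample(self, batch_size, reward_fn, rng=random):
        """Uniformly sample transitions and label them with the CURRENT reward."""
        transitions = list(self.storage)
        batch = [rng.choice(transitions) for _ in range(batch_size)]
        return [(s, a, reward_fn(s, a), s2) for (s, a, s2) in batch]

    def __len__(self):
        return len(self.storage)
```

Because the reward is evaluated when the batch is drawn, a transition collected many iterations ago is still labeled with the most recent reward update rather than with a stale value.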
We introduce ρ^π_{𝕄*}, which denotes the discounted state visitation frequency of an arbitrary policy π in 𝕄*. Formally, ρ^π_{𝕄*}(s) := Σ_{t≥0} γ^t ℙ[S_t = s], where ℙ[S_t = s] is the probability of reaching state s at timestep t when interacting with the MDP 𝕄* by acting according to π. Since Σ_{s∈S} ρ^π_{𝕄*}(s) = 1/(1 − γ), ρ^π_{𝕄*} can be seen as a probability distribution over states up to a constant factor. Due to the presence of the discount factor γ, ρ^π_{𝕄*}(s) has higher value if s is visited earlier than later in the infinite-horizon trajectory. In practice, we relax the definition to its non-discounted counterpart and to the episodic regime case, as is usually done. Plus, since every interaction is done in MDP 𝕄*, we use the shorthand ρ^π. From this point forward, when states s_t are sampled uniformly from the replay buffer R (in effect, following policy β), the expectation over said samples will be denoted as 𝔼_{s_t∼ρ^β}[⋅].
We now go over how each module (θ, φ, and ω) is optimized in this work. We optimize φ with the binary cross-entropy loss, where positive-labeled samples are from π_e, and negative-labeled samples are from β:

ℓ_φ := 𝔼_{s_t∼ρ^{π_e}, a_t∼π_e}[−log D_φ(s_t, a_t)] + 𝔼_{s_t∼ρ^β, a_t∼β}[−log(1 − D_φ(s_t, a_t))]   (1)

In this work, unless stated otherwise, ℓ_φ is regularized with gradient penalization ℜ_φ^ζ(k), subsuming the original formulation proposed in Gulrajani et al. (2017), which was used in SAM (Blondé and Kalousis 2019) and DAC (Kostrikov et al. 2019):

ℓ_φ^GP := ℓ_φ + λ ℜ_φ^ζ(k) := ℓ_φ + λ 𝔼_{s_t∼ρ^ζ, a_t∼ζ}[(‖∇_{s_t,a_t} D_φ(s_t, a_t)‖ − k)²]   (2)

The regularizer will be the object of several downstream analyses and discussions (cf. Sects. 5.4 and 6.3). The meaning of λ, k and ζ will be given in Sect. 5.4.
The critic's parameters ω are updated by gradient descent on the TD loss (Sutton 1988), using the multi-step version (Peng et al. 1996) ("n-step") of the Bellman target (R.H.S. of the expected Bellman equation), which has proven beneficial for policy evaluation (Hessel et al. 2017; Fernando Hernandez-Garcia and Sutton 2019). The loss optimized by the critic is:

ℓ_ω := 𝔼_{s_t∼ρ^β, a_t∼β}[(Q_ω(s_t, a_t) − Q^targ)²]   (3)

where the target Q^targ uses softly-updated target networks, θ′ and ω′, and is defined as:

Q^targ := Σ_{k=0}^{n−1} γ^k r_φ(s_{t+k}, a_{t+k}) + γ^n Q_{ω′}(s_{t+n}, μ_{θ′}(s_{t+n}))

Finally, since μ_θ is deterministic, its utility value at timestep t is

U_t := V^{μ_θ}(s_t) = Q^{μ_θ}(s_t, μ_θ(s_t)) ≈ Q_ω(s_t, μ_θ(s_t))

where the approximation is due to the actor-critic design involving the use of function approximators. To maximize its utility at t, μ_θ must take a gradient step in the ascending direction, derived according to the deterministic policy gradient theorem (Silver et al. 2014):

∇_θ U_t ≈ ∇_θ Q_ω(s_t, μ_θ(s_t)) = ∇_θ μ_θ(s_t) ∇_a Q_ω(s_t, a)|_{a=μ_θ(s_t)}   (8)

This last step [Eq. (8)] emerges from the natural assumption that ∀s, ∇_θ s = 0, since the analytical form of the MDP's dynamics, p, is unknown. To overcome the inherent overestimation bias (Thrun and Schwartz 1993) hindering Q-Learning and actor-critic methods based on greedy action selection (e.g. DDPG), and therefore suffered by our critic Q_ω, we apply the actor-critic counterpart of double-Q learning (van Hasselt 2010) [analogously, Double-DQN (van Hasselt et al. 2015) for DQN] proposed in Twin-Delayed DDPG (abbrv. TD3) (Fujimoto et al. 2018). This add-on method, simply called clipped double-Q learning (abbrv. CD), consists in learning an additional (or "twin") critic, and using the smaller of the two associated Q-values in the Bellman target, used in the temporal-difference error of both critics. For its reported benefits at minimal cost, we also use the other main add-on proposed in TD3 (Fujimoto et al. 2018), called target policy smoothing. The latter adds noise to the target action in order for the deterministic policy not to pick actions with erroneously high Q-values, as such input noise injection effectively smooths out the Q landscape along changes in action.
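To make the combination of the n-step target, clipped double-Q learning and target policy smoothing concrete, here is a minimal sketch for scalar actions. It is an illustration under stated assumptions, not the authors' implementation: the function name, the callables `target_policy`, `target_q1`, `target_q2`, and the default noise hyper-parameters are all hypothetical.

```python
import random


def _clip(x, low, high):
    return max(low, min(high, x))


def smoothed_clipped_target(rewards, next_state, target_policy, target_q1,
                            target_q2, gamma, noise_std=0.2, noise_clip=0.5,
                            act_limit=1.0, rng=random):
    """n-step Bellman target with the two TD3 add-ons.

    rewards:    the n rewards collected from timestep t to t+n-1
    next_state: the state reached at timestep t+n
    """
    n = len(rewards)
    # Discounted sum of the n intermediate rewards
    n_step_return = sum(gamma ** k * r for k, r in enumerate(rewards))
    # Target policy smoothing: perturb the target action with clipped noise
    noise = _clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = _clip(target_policy(next_state) + noise, -act_limit, act_limit)
    # Clipped double-Q: bootstrap from the smaller of the twin target critics
    q_min = min(target_q1(next_state, next_action),
                target_q2(next_state, next_action))
    return n_step_return + gamma ** n * q_min
```

Both critics regress onto this single target, so an over-optimistic estimate from one critic is capped by the other.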
Target policy smoothing (or target smoothing, abbrv. TS) draws strong inspiration from the SARSA (Sutton and Barto 1998) learning update since it uses a perturbation of the greedy next-action in the learning update rule, which makes the method more robust against noisy inputs and therefore potentially safer in a safety-critical scenario. Note, while value overfitting primarily impedes policies that are deterministic by design, stochastic policies that prematurely collapse to their mode (Schulman et al. 2015) are deterministic in effect and as such are impeded too. In particular, fitting the value estimate against an expectation of similar bootstrapped target value estimates forces similar actions to have similar values, which corresponds-by definition-to making the Q-function locally Lipschitz-continuous. As such, the induced smoothness over Q is to be understood in terms of local Lipschitz-continuity (or equivalently, local Lipschitzness), which we define in Definition 1. More generally, the concept of smoothness that is at the core of the analyses laid out in this work is the concept of Lipschitz-continuity. Interestingly, we show later in Sect. 6.2.4, formally and from first principles, that target policy smoothing is equivalent to applying a regularizer on Q that induces Lipschitz-continuity w.r.t. the action input. In addition, we align the notion of robustness of a function approximator with the value of its Lipschitz constant (cf. Definition 1): a k 1 -Lipschitz-continuous function approximator will be characterized as more robust than another k 2 -Lipschitz-continuous function approximator if and only if k 1 ≤ k 2 . As such, in this work, the notions of smoothness and robustness are both aligned with the notion of Lipschitz-continuity.
x ↦ f(x), and C^0 (continuous) over X. We denote the Euclidean norms of X and Y by ‖⋅‖_X and ‖⋅‖_Y respectively, and the Frobenius norm of the ℝ^{m×n} matrix space by ‖⋅‖_F. Lastly, let k be a non-negative real, k ≥ 0.
In either case, if the inequality is verified, k is called the Lipschitz constant of f. The symbol ∇ , historically reserved to denote the gradient operator, is here used to denote the Jacobian operator of the vector function f, to maintain symmetry with the notations and appellations used in previous works.
(c) Let X̃ be a subspace of X, X̃ ⊆ X. f is said to be locally k-Lipschitz-continuous over X̃ iff, for all x ∈ X̃, there exists a neighborhood U_x of x such that f is k-Lipschitz-continuous over U_x.
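For reference, the two inequalities alluded to by "in either case" can be written in their standard textbook form as below. This is a hedged restatement, assuming f maps X ⊆ ℝ^n to Y ⊆ ℝ^m with the norms introduced above; it is not a verbatim reproduction of items (a) and (b) of the definition.

```latex
% Global k-Lipschitz-continuity of f over X (inequality on function values):
\forall x, x' \in X, \qquad \| f(x) - f(x') \|_Y \;\le\; k \, \| x - x' \|_X

% Gradient-based characterization, for f differentiable over a convex X
% (\nabla f denotes the Jacobian of f):
\forall x \in X, \qquad \| \nabla f(x) \|_F \;\le\; k
```

On a convex domain, the bound on the Jacobian norm implies the first inequality; for a scalar-valued f (such as a discriminator or a reward), the Frobenius norm reduces to the Euclidean norm of the gradient and the two statements coincide.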
Based on Definition 1(b), the gradient penalty in Eq. (2) effectively enforces local Lipschitz-continuity over the support of the distribution ζ (described later, cf. Sect. 5.4), a subspace of the joint state-action space.
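The way a gradient penalty pushes the input-gradient norm towards a target constant k can be illustrated numerically. The sketch below is an assumption-laden toy: it uses central finite differences instead of automatic differentiation, a two-sided penalty on a scalar function, and hypothetical names (`grad_norm`, `gradient_penalty`); an actual implementation would differentiate the discriminator with respect to its state-action input.

```python
import math


def grad_norm(f, x, h=1e-5):
    """Euclidean norm of the gradient of scalar f at x, by central differences."""
    grads = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grads.append((f(x_plus) - f(x_minus)) / (2.0 * h))
    return math.sqrt(sum(g * g for g in grads))


def gradient_penalty(f, points, k=1.0):
    """Average of (||grad f(x)|| - k)^2 over the points where it is enforced."""
    return sum((grad_norm(f, x) - k) ** 2 for x in points) / len(points)
```

The penalty is only evaluated at the sampled points, which is why the constraint it induces is local: it holds around the support of the distribution those points are drawn from, not over the whole input space.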
Unless specified otherwise, we use both the clipped double-Q learning and target policy smoothing add-on techniques in all the experiments reported in this work. We ran an ablation study on both techniques to illustrate their respective benefits, and support our algorithmic design choice to use them. We report said ablations in Appendix 4.
We describe the inner workings of SAM in Algorithm 1. Since our agent learns a parametric reward (differentiable by design) along with a deterministic policy, we could, in principle, use the gradient 𝔼_{s_t∼ρ^β}[∇_θ μ_θ(s_t) ∇_a r_φ(s_t, a)|_{a=μ_θ(s_t)}] [constructed by analogy with Eq. (8)] to update the policy. Blondé and Kalousis (2019) raised the question of whether one should use this gradient and answered in the negative: while the gradient in Eq. (8) guides the policy towards behaviors that maximize the long-term return of the agent, effectively trying to address the credit assignment problem, the gradient involving r_φ in place of Q_ω is myopic, and does not encourage the policy to think more than one step ahead. Back-propagating through Q_ω, which is designed precisely to enable the policy to reason across longer time ranges, will be more helpful to the policy towards solving the task. The authors therefore discard the gradient involving r_φ. Nonetheless, we set out to investigate whether the latter can favorably assist the gradient in Eq. (8) in solving the task, when both gradients are used in conjunction. Drawing a parallel with the line of work using unsupervised auxiliary tasks to improve representation learning in visual tasks (Jaderberg et al. 2016; Shelhamer et al. 2016; Mirowski et al. 2016; Doersch et al. 2015), we define the gradient 𝔼_{s_t∼ρ^β}[∇_θ μ_θ(s_t) ∇_a Q_ω(s_t, a)|_{a=μ_θ(s_t)}] as the main gradient, and 𝔼_{s_t∼ρ^β}[∇_θ μ_θ(s_t) ∇_a r_φ(s_t, a)|_{a=μ_θ(s_t)}] as the auxiliary gradient, which we denote by g_m and g_a respectively. Based on our previous argumentation, allowing the myopic g_a to take the upper hand over g_m could have a disastrous impact on solving the task: combining g_m and g_a must be done conservatively. As such, we use the auxiliary gradient only if it amplifies the main gradient. We measure the complementarity of the main and auxiliary tasks by the cosine similarity between their respective gradients, cos(g_m, g_a), as done in Du et al. (2018), and assemble the new composite gradient g_c := g_m + max(0, cos(g_m, g_a)) g_a. By design, g_a is added to g_m only if the cosine similarity between them, cos(g_m, g_a), is positive, and will, in that case, be scaled by said cosine similarity. If the gradients are collinear and point in the same direction, they are summed: g_c = g_m + g_a. If they are orthogonal or if the similarity is negative, g_a is discarded: g_c = g_m. Our experiments comparing the usage of g_c and g_m (cf. Fig. 12 in Appendix 3) show that using the composite gradient g_c does not yield any improvement over using only g_m. By monitoring the values taken by cos(g_m, g_a), we noticed that the cosine similarity was almost always negative, yet close to 0, hence g_c = g_m, which trivially explains why the results are almost identical.
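The composite gradient rule described above can be sketched on flat gradient vectors as follows. This is a minimal illustration, assuming gradients are represented as plain Python lists of floats; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; defined as 0 if either is null."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def composite_gradient(g_m, g_a):
    """g_c = g_m + max(0, cos(g_m, g_a)) * g_a."""
    weight = max(0.0, cosine_similarity(g_m, g_a))
    return [m + weight * a for m, a in zip(g_m, g_a)]
```

When the similarity hovers slightly below zero, as reported above, the weight is clamped to zero at almost every update and the rule reduces to using the main gradient alone.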

Lipschitzness is all you need
This section aims to put the emphasis on what makes off-policy generative adversarial imitation learning challenging. When applicable, we propose solutions to these challenges, supported by intuitive and empirical evidence. In fine, as the section name hints, we found that, in our experimental and computational setting (described at the beginning of Sect. 5.5), forcing the local Lipschitzness of the reward is a sine qua non condition for good performance, while also being sufficient to achieve peak performance.

A deadlier triad
In recent years, several works (Fujimoto et al. 2018; Fu et al. 2019; Achiam et al. 2019) have carried out in-depth diagnoses of the inherent problems of Q-learning (Watkins 1989; Watkins and Dayan 1992), and of bootstrapping-based actor-critic architectures by extension, in the function approximation regime. Note, while the following issues directly apply to DQN, which even introduces additional difficulties (e.g. target networks, replay buffer), we limit the scope of this section to Q-learning, to eventually make our point. Q-learning under function approximation possesses properties that, when used in conjunction, make the algorithm brittle, prone to unstable behavior, as well as tedious to bring to convergence. Without caution, the algorithm is bound to diverge. These properties constitute the deadly triad (Sutton and Barto 1998; van Hasselt et al. 2018): function approximation, bootstrapping, and off-policy learning.
Since the method we consider in this work follows an actor-critic architecture, it possesses all three properties, and is therefore inclined to diverge and suffer from instabilities. Additionally, since the learned reward r_φ is: (a) defined from binary classifier predictions (the discriminator's predicted probabilities of being expert-generated) estimated via function approximation, (b) learned at the same time as the policy, and (c) learned off-policy, with the negative samples coming from the replay distribution β, the method we study consequently introduces an extra layer of complication in the deadly triad. We now go over the three points and explain to what extent they each exacerbate the divergence-inducing properties that form the deadly triad.
To tackle point (a), we introduce explicit residuals to represent the various sources of error involved in temporal-difference learning, and illustrate how these residuals accumulate over the course of an episode. We will use the shorthand 𝔼[⋅] for expectations for the sake of legibility. We take inspiration from Eq. (12) in Fujimoto et al. (2018), where a bias term is introduced in the TD error due to the function approximation of the Q-value, as the Bellman equation is never exactly satisfied in this regime. Borrowing the terminology from the statistical risk minimization literature, while the original bias suffered by the TD error was due to the estimation error caused by bootstrapping, function approximation is responsible for an extra approximation error contribution. The sum of these two errors is represented with a single residual, the Bellman residual. Let us now consider D_φ(s, a), the estimated probability that a sample (s, a) is coming from expert demonstrations. Formally, D_φ(s, a) = ℙ_φ[EXPERT(s, a)], where the event is defined as EXPERT(s, a) := "s ∼ ρ^{π_e} ∧ a ∼ π_e", and where ℙ_φ denotes the probability estimated with the approximator φ. In the same vein, we distinguish the error contributions: the approximation error is caused by the choice of function approximator class (e.g. two-layer neural networks with hyperbolic tangent activations), and the estimation error is due to the gap between the estimations of our classifier and the predictions of the Bayes classifier, i.e. the classifier with the lowest misclassification rate in the chosen class. This gap can be written as |D_φ(s_t, a_t) − BAYES(s_t, a_t)|, where BAYES(s, a) = ℙ_BAYES[EXPERT(s, a)], by analogy with the previous notations. In fine, we introduce a second residual, the reward residual, that represents the contribution of both errors in the learned reward r_φ.

As observed in Fujimoto et al. (2018) when estimating the accumulation of error due to function approximation in the standard RL setting, the variance of the state-action value is proportional to the variance of both the return and the Bellman residual. Crucially, in our setting involving the learned imitation reward r_φ, it is also proportional to the variance of the reward residual, containing contributions of both the approximation error and estimation error of r_φ. As a result, the variance of the estimate also suffers from a critically stronger dependence on the discount factor γ (cf. ablation study in Appendix 7). Intuitively, as we propagate rewards further (higher k value), their induced residual error triggers a greater increase in the variance of the Q-value estimate. In addition to its effect on the variance, the additional residual also clearly impacts the overestimation bias (Thrun and Schwartz 1993) the estimate is afflicted by, which further advocates the use of dedicated techniques such as Double Q-learning (Fujimoto et al. 2018; van Hasselt 2010), as we do in this work (cf. Sect. 4). All in all, by introducing an extra source of approximation and estimation error, we further burden TD-learning.
Moving on to points (b) (the reward is learned at the same time as the policy) and (c) (the reward is learned off-policy using samples from the replay policy β), we see that each statement allows us to qualify the reward r_φ as a non-stationary process. Conceptually, by considering an additive decomposition of the reward r_φ into a stationary contribution r_STAT and a non-stationary contribution r_NON-STAT, an accumulation analysis similar to the previous one shows that the variance of the state-action value is proportional to the variances of each contribution. While the variance of r_STAT can be large and therefore can have a considerable impact on the variance of the Q-value estimate, it can usually be somewhat tamed with online normalization techniques and mitigated with techniques enabling the agent to cope with rewards of vastly different scales [e.g. Pop-Art (van Hasselt et al. 2016)]. We show later that such methods do not help when the underlying reward is non-stationary (cf. Sect. 5.2 for empirical results). Indeed, the variance of the non-stationary contribution r_NON-STAT is, due to its continually-changing nature, untameable with these regular techniques relying on the usual stationarity assumption, unless additional dedicated mechanisms are integrated (e.g. change point detection techniques). Naturally, the non-stationary contribution also has an effect on the bias of the estimation, and a fortiori on its overestimation bias [as with (a)]. We note that the argument made in the context of Q-learning by Fu et al. (2019) naturally transfers to the TD-learning objective optimized in this work: the objective is non-stationary, due to (i) the moving target problem, caused by using bootstrapping to learn an estimate that is updated every iteration, and (ii) the distribution shift problem, caused by learning the Q-value estimate off-policy using β, effectively being a mixture of past policies, which changes every iteration.
Point (i) is a source of non-stationarity since the target of the supervised objective is moving with the prediction as iterations go by, due to using bootstrapping. Fitting the current estimate against the target defined from this very estimate is an ordeal, and (b) makes the task even harder by having the reward move too, given it is also learned, at the same time. The target of the TD objective therefore now has two moving pieces, one from bootstrapping (i), one from reward learning (b). The distribution shift problem (ii), stemming from the Q-value being learned off-policy, is naturally worsened by the reward being estimated off-policy (c). Note, although both the reward and Q-value are learned with samples from β, the actual mini-batches used to perform the gradient update of each estimate might be different in practice. As such, the TD error would be optimized using samples from a mixture of past policies that is different from the mixture under which the reward is learned, and would then use this reward, trained under a different effective distribution, in the Bellman target. All in all, by introducing extra sources of non-stationarity, (b) and (c), we further burden the non-stationarity of TD-learning, (i) and (ii).

Continually changing rewards
In this work, we focus on the MDP 𝕄* whose transition distribution p is stationary, i.e. not changing over time. As discussed in Sect. 5.1, the reward process defined by r_φ is however non-stationary. In particular, r_φ is drifting, i.e. gradually changes at an unknown rate, due to the reward being learned at the same time as the policy, but also due to it being estimated off-policy. While the former reason is true in the on-policy setting as well, the latter is specific to the off-policy setting, on which we focus in this work. Indeed, in on-policy generative adversarial imitation learning, the parameter sets θ and φ are involved in a bilevel optimization problem (cf. Sect. 3) and consequently are intricately tied. φ is trained via an adversarial procedure opposing it to θ in a zero-sum two-player game. At the same time, θ is trained by policy gradients to optimize μ_θ's episodic accumulation of rewards generated by r_φ. The synthetically generated rewards perceived by the agent are, in effect, sampled from a stochastic process that incrementally changes over the course of the policy updates, effectively qualifying r_φ as a drifting non-stationary reward process.
By moving to the off-policy setting (for reasons laid out earlier in Sect. 4), the zero-sum two-player game is not opposing r_φ and μ_θ, but r_φ and β, where β is the off-policy distribution stemming from experience replay. As the parameter set θ goes through gradient updates, the new policies μ_θ are added to the mixture of past policies β. Crucially, to perform its parameter update at a given iteration, the policy μ_θ uses transitions augmented with rewards generated by r_φ, whose latest update was trying to distinguish between samples from π_e and β (as opposed to π_e and μ_θ in the on-policy setting). Since μ_θ is drifting, β is also drifting, based on how experience replay operates. Nevertheless, by being a mixture of previous policy updates, β potentially drifts less than μ_θ, since, in effect, two consecutive distributions β are mixing over a wide overlap of the same past policies. In reality however, β corresponds to uniformly sampling a mini-batch from the replay buffer. Consecutive β can therefore be uncontrollably distant from each other in practice, making the distributional drift of the reward more tedious to deal with than in the on-policy setting. Using large mini-batches and distributed multi-core architectures somewhat levels the playing field though.
The adversarial bilevel optimization problem guiding the adaptive tuning of r_φ for every μ_θ update is reminiscent of the stream of research pioneered by Auer et al. (1995) in which the reward is generated by an omniscient adversary, either arbitrarily or adaptively with potentially malevolent drive (Yu and Mannor 2009a, b; Lim et al. 2013; Gajane et al. 2018; Yu and Sra 2019). Non-stationary environments are almost exclusively tackled from a theoretical perspective in the literature (cf. previous references). Specifically, in the drifting case, the non-stationarities are traditionally dealt with via the use of sliding windows. The accompanying (dynamic) regret analyses all rely on strict assumptions. In the switching case, one needs to know the number of occurring switches beforehand, while in the drifting case, the change variation needs to be upper-bounded. Specifically, Cheung et al. (2019a) assume the total change to be upper-bounded by some preset variation budget, while Cheung et al. (2019b) assume the variations are uniformly bounded in time. Ortner et al. (2019) assume that the incremental variation [as opposed to the total variation in Cheung et al. (2019a)] is upper-bounded by a per-change threshold. Finally, in the same vein, regular evolution has also been posited, by making the assumption that both the transition and reward functions are Lipschitz-continuous w.r.t. time. By contrast, our approach relies on imposing local Lipschitz-continuity of the reward over the input space, which will be described later in Sect. 5.4.
Online return normalization methods, which use statistics computed over the entire return history (reminiscent of sliding window methods) to whiten the current return estimate, are the usual go-to solution to deal with rewards (and a fortiori returns) whose scale can vary a lot, albeit still under the stationarity assumption. We investigate whether online return normalization methods and Pop-Art (van Hasselt et al. 2016) can have a positive impact on learning performance, when the process underlying the reward is learned at the same time as the policy, via experience replay. Given that the reward distribution can drift at an unknown rate (although influenced by the learning rate used to train φ), it is fair to assume that we might benefit from such methods, especially considering how unstable a twin bilevel optimization problem can be. On the other hand, as learning progresses, older rewards become stale (especially in early training), which can potentially pollute the running statistics accumulated by these normalization techniques. The results obtained in this ablation study are reported in Appendix 8.
We observe that neither return normalization nor Pop-Art provides an improvement over the baseline. On the contrary, in Hopper and Walker2d, we see that they even yield significantly poorer performance within the allowed runtime, compared to the base method using neither return normalization nor Pop-Art (cf. Fig. 20). We propose an explanation of this phenomenon based on the stability-plasticity dilemma (Carpenter and Grossberg 1987). In early training, the policy μ_θ changes at a fast rate and with a high amplitude when going through gradient updates, due to being a randomly initialized neural function approximator. The reward r_φ is in a symmetric situation, but is also influenced by the rate of change of μ_θ, being trained in an adversarial game. In order to keep up with this fast pace of change in early training, the critic Q_ω, which uses the reward r_φ in its own learning objective, needs to be sufficiently flexible to accommodate and adapt quickly to these frequent changes. In other words, the critic's plasticity must be high. Since reward estimates from r_φ become stale after a few φ updates, we also want our critic to avoid using stale rewards to prevent the degradation of Q_ω. This property is referred to as stability in Carpenter and Grossberg (1987). In fine, the critic must be plastic and stable. Note, using the current reward update to augment the sampled transitions with their reward, as done in this work, provides the critic with such stability. However, return normalization and Pop-Art use stale running statistics estimates to whiten the state-action values returned by the critic, which both prevents plasticity (values need to change fast with the reward, and normalization slows down this process) and harms stability, due to the staleness of the obsolete rewards that are "baked in" the running statistics. The obtained results corroborate the previous analysis (cf. Appendix 8).
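The staleness argument can be illustrated with a toy running-statistics normalizer. The sketch below is an assumption for illustration only (it is neither Pop-Art nor the normalization code used in the experiments): `RunningStats` accumulates a mean and variance over the whole history with Welford's algorithm, and `whiten` standardizes a value with those statistics.

```python
import math


class RunningStats:
    """Mean and (population) variance accumulated over the entire history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / self.count) if self.count > 0 else 0.0


def whiten(x, stats, eps=1e-8):
    """Standardize x with the running statistics."""
    return (x - stats.mean) / (stats.std + eps)


def drifting_demo():
    """Feed 1000 old values around 1, then 10 new values around 10."""
    stats = RunningStats()
    for i in range(1000):
        stats.update(0.0 if i % 2 == 0 else 2.0)
    for i in range(10):
        stats.update(9.0 if i % 2 == 0 else 11.0)
    return stats
```

After the drift, the statistics are still dominated by the 1000 obsolete values, so a typical new value of 10 is whitened to a large number instead of something close to 0: the history "baked in" the running statistics lags behind the current reward scale.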
We conclude this section by discussing the reward learning dynamics. While in the transient regime, the reward process is effectively non-stationary, it gradually becomes stationary as it reaches a steady-state regime. Nonetheless, the presence of such stabilization does not guarantee that the desired equilibrium has been reached. Indeed, as we will discuss in the next section, adversarial imitation learning has proved to be prone to overfitting. We now address it.

Overfitting cascade
Being based on a binary classifier, the synthetic reward process r_φ is inherently susceptible to overfitting, and it has been shown (cf. subsequent references) that it indeed overfits. As exhibited in Sect. 2, several endeavors have proposed techniques to prevent the learned reward from overfitting, individually building on traditional regularization methods aimed at addressing overfitting in classification. These techniques either make the discriminator model weaker (Reed et al. 2018; Blondé and Kalousis 2019; Kostrikov et al. 2019; Peng et al. 2018), or make the classification task harder (Blondé and Kalousis 2019; Xu and Denil 2019; Zolna et al. 2019), to deter the discriminator from relying on non-salient features to trivially distinguish between samples from π_e and μ_θ (π_e and β in our off-policy setting, cf. Sect. 5.2).
On a more fundamental level, the ability of deep neural networks to generalize (and a fortiori to circumvent overfitting) had been attributed to the flatness of the loss landscape in the neighborhoods of minima of the loss function (Hochreiter and Schmidhuber 1997; Keskar et al. 2017), provided the optimization method is a variant of stochastic gradient descent. While it has more recently been shown that sharp minima can generalize (Dinh et al. 2017), we argue and show both empirically and analytically that, in the off-policy setting tackled in this work, flatness of the reward function around the maxima (corresponding to the positive samples, i.e. the expert data) is paramount for good empirical performance. In other words, we argue that the presence of peaks in the reward function caused by the discriminator overfitting on the expert data (non-salient features in the worst case) is the major source of optimization issues occurring in off-policy GAIL. As such, we focus on methods that address overfitting by inducing flatness in the learned reward function around expert samples, subject to being peaked on the reward landscape. An obvious candidate to enforce this desired flatness property is gradient penalty regularization, inducing Lipschitz-continuity on the reward function r_φ, over its input space S × A, which has been described earlier in Sect. 4, and will be the object of Sects. 5.4 and 6.3.
Simply put, reward overfitting translates to the presence of peaks on the reward landscape. Even in the case where these peaks exactly coincide with the expert data (perfect classification, the discriminator coincides with the Bayes classifier of the function class), peaked reward landscapes (i.e. sparse reward setting) can be tedious to optimize over. Crucially, peaks in r_φ can potentially cause peaks in the state-action value landscape Q_ω. When policy evaluation is done via Monte-Carlo estimation, the length of the rollouts likely attenuates the contribution of individual peaked rewards aggregated during the rollout into a discounted sum. If the peaks were not predominant in the rollout, the associated empirical estimate of the value will not be peaked (relative to its neighboring values). By contrast, the bootstrapping-based TD objective does not attenuate peaks in r_φ, which consequently causes peaks in Q_ω. Note, using multi-step returns (Peng et al. 1996) can help mitigate the phenomenon and benefit from the attenuation effect witnessed in the Monte-Carlo estimation described above, hence our usage of multi-step returns in this work (cf. Sect. 4).
Narrow peaks in the state-action value estimate Q can cause the deterministic policy to itself overfit to these peaks on the Q landscape. As such, overfitting cascades from rewards to the policy, and hampers policy optimization [cf. Eq. (8)]. Furthermore, peaks in Q-values can severely hinder temporal-difference optimization since, by design, these outlying values can appear in either the predicted Q-value or the target Q-value. As such, echoing the observations and analyses made in Sects. 5.1 and 5.2, bootstrapping makes the optimization more tedious, while bringing sample-efficiency to GAIL. These irregularities naturally transfer to the loss landscape, exacerbating the innate irregularity of loss landscapes when using neural networks as function approximators (Li et al. 2018), making it harder to optimize over Eq. (3). In fine, peaks on the reward landscape can cascade and impede both policy improvement and evaluation.
In the next section (Sect. 5.4), we discuss how to enforce Lipschitz-continuity in usual neural architectures, before going over empirical results corroborating our previous analyses (Sect. 5.5). Ultimately, we show that not forcing Lipschitz-continuity on the learned surrogate reward yields poor results, making it a sine qua non condition for success.

Enforcing Lipschitz-continuity in deep neural networks
Designed to address the shortcomings of the original GAN (Goodfellow et al. 2014), whose training effectively minimizes a Jensen-Shannon divergence between generated and real distributions, the Wasserstein GAN (WGAN) leverages the Wasserstein metric. Specifically, its authors use the dual representation of the Wasserstein-1 metric under a 1-Lipschitz-continuity (cf. Definition 1) assumption over the discriminator, which allows them to employ the Kantorovich-Rubinstein duality theorem, to eventually arrive at a tractable loss one can optimize over.
In the Wasserstein GAN, the weights of the discriminator (called critic to emphasize that it is no longer a classifier) are clipped. While not equivalent to enforcing the 1-Lipschitz constraint their model is theoretically built on, clipping the weights does loosely enforce Lipschitz-continuity, with a Lipschitz constant depending on the clipping boundaries. This simple technique however disrupts, by its design, the optimization dynamics. As emphasized in Gulrajani et al. (2017), clipping the weights of the Wasserstein critic can result in a pathological optimization landscape, echoing the analysis carried out in Sect. 5.3.
In an attempt to address this issue, Gulrajani et al. (2017) propose to impose the underlying 1-Lipschitz constraint via another method, fully integrated into the bilevel optimization problem as a gradient penalty regularization. When augmented with this gradient penalization technique, WGAN (dubbed WGAN-GP) is shown to yield consistently better results, enjoy more stable learning dynamics, and display a smoother loss landscape (Gulrajani et al. 2017). Interestingly, the regularization technique has proved to yield better results even in the original GAN (Lucic et al. 2017), despite it not being grounded on the Lipschitzness footing like WGAN. In addition, following in the footsteps of the comprehensive study proposed in Lucic et al. (2017), Kurach et al. (2018) show empirically that the WGAN loss does not outperform the original GAN consistently across various hyper-parameter settings, and advocate for the use of the original GAN loss, along with spectral normalization (Miyato et al. 2018) and gradient penalty regularization (Gulrajani et al. 2017), to achieve the best results (albeit at an increased cost in computation in visual domains). In line with these works (Lucic et al. 2017; Kurach et al. 2018), we therefore commit to the archetype GAN loss formulation (Goodfellow et al. 2014), as has been laid out earlier in Sect. 4 when describing the discriminator objective in Eq. (1). We now recall the objective optimized by the discriminator [cf. Eq. (2)], where the generalized form of the gradient penalty, ℜ(k), subsumes the original penalty (Gulrajani et al. 2017) as well as variants that will be studied later in Sect. 6.3 [Eq. (14)]. In Eq. (14), λ corresponds to the weight attributed to the regularizer in the objective (cf. ablation in Sect. 6.3), and ‖⋅‖ denotes the Euclidean norm in the appropriate vector space.
ζ is the distribution defining where in the input space S × A the Lipschitzness constraint should be enforced. ζ is defined from π_e and π_θ. In the original gradient penalty formulation (Gulrajani et al. 2017), ζ corresponds to sampling points uniformly in segments joining points from the generated data and real data, grounded on the derived theoretical result (cf. Proposition 1 in Gulrajani et al. (2017)) that the optimal discriminator is 1-Lipschitz along these segments. While it does not mean that enforcing such a constraint will make the discriminator optimal, it yields good results in practice. We discuss several formulations of ζ in Sect. 6.3, evaluate them empirically and propose intuitive arguments explaining the obtained results. In particular, we adopt an RL viewpoint and propose an alternate ground as to why the regularizer has enabled successes in control and search tasks, as reported in Blondé and Kalousis (2019) and Kostrikov et al. (2019). Notably, in Gulrajani et al. (2017), 1-Lipschitz-continuity is encouraged by using ℜ(1) as regularizer.
Additionally, in line with the observations made in Gulrajani et al. (2017), we experimented with (a) replacing ℜ(k) with a one-sided alternative that only penalizes gradient norms exceeding k, and (b) ablating online batch normalization of the state input from the discriminator. The alternative regularizer of (a) encourages the norm to be lower than k (formally, ‖∇_{s_t,a_t} D(s_t, a_t)‖ ≤ k), in contrast to the original regularizer that enforces it to be close to k. While the one-sided version describes the notion of k-Lipschitzness more accurately (cf. Definition 1), it yields similar results overall, as shown in Appendix 5.1. Crucially, we conclude from these experiments that it is sufficient to have the norm remain upper-bounded by k, or equivalently, to have D be Lipschitz-continuous. In other words, we do not need to impose a stronger constraint than k-Lipschitz-continuity on the discriminator to achieve peak performance, in the context of this ablation study. As for (b), online batch normalization of the state input mostly hurts performance, as reported in Appendix 5.2. We therefore arrive at the same conclusions as Gulrajani et al. (2017): (a) we use the two-sided formulation of ℜ(k) described in Eq. (14) since using the one-sided variant yields no improvement, and (b) we omit the online batch normalization of the state input in the discriminator since it hurts performance, while still using this normalization scheme in the policy and critic (more details about the technique will be given when we describe our experimental setting in the next section, Sect. 5.5).
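The two regularizers compared above can be sketched as follows, under simplifying assumptions. The discriminator is taken to be a single logistic unit over the concatenated state-action vector so that its input gradient is available in closed form; a practical implementation would rely on automatic differentiation instead. The two-sided penalty is assumed to be the mean of (‖∇D‖ − k)² and the one-sided penalty the mean of max(0, ‖∇D‖ − k)², following the description in the text. All function names are hypothetical.

```python
import math
import random


def discriminator(x, w, b):
    """Toy discriminator: logistic unit over the state-action vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, w, b):
    """Closed-form gradient of the logistic unit with respect to its input."""
    d = discriminator(x, w, b)
    return [d * (1.0 - d) * wi for wi in w]


def interpolate(x_expert, x_agent, rng):
    """Sample a point uniformly on the segment joining an expert and an agent sample."""
    eps = rng.random()
    return [eps * e + (1.0 - eps) * g for e, g in zip(x_expert, x_agent)]


def gradient_penalty(points, w, b, k=1.0, one_sided=False):
    """Mean squared deviation of the input-gradient norm from k.

    Two-sided: penalizes any deviation from k.
    One-sided: penalizes only norms exceeding k.
    """
    total = 0.0
    for x in points:
        norm = math.sqrt(sum(g * g for g in input_gradient(x, w, b)))
        excess = norm - k
        if one_sided:
            excess = max(0.0, excess)
        total += excess ** 2
    return total / len(points)


rng = random.Random(0)
expert = [[rng.gauss(1.0, 0.1) for _ in range(3)] for _ in range(8)]
agent = [[rng.gauss(-1.0, 0.1) for _ in range(3)] for _ in range(8)]
points = [interpolate(e, g, rng) for e, g in zip(expert, agent)]
w, b = [0.5, -0.5, 0.2], 0.0
print(gradient_penalty(points, w, b), gradient_penalty(points, w, b, one_sided=True))
```

When the gradient norm is already below k everywhere, the one-sided penalty vanishes while the two-sided one still pulls the norm toward k; both coincide wherever the norm exceeds k.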

Diagnosing the importance of Lipschitzness empirically in off-policy adversarial imitation learning
Before going over the empirical results reported in this section, we describe our experimental setting. Unless explicitly stated otherwise, every experiment-reported in both this section and Sect. 6.5-is run in the same base setting. In addition, the used hyperparameters are made available in Appendix 1.

Environments
In this work, we consider the simulated robotics, continuous control environments built with the MuJoCo (Todorov et al. 2012) physics engine, and provided to the community through the OpenAI Gym API (Brockman et al. 2016). We use the following versions of the environments: v3 for Hopper, Walker2d, HalfCheetah, Ant, Humanoid, and v2 for InvertedDoublePendulum. For each of these, the dimension n of a given state s ∈ S ⊆ ℝ n and the dimension m of a given action a ∈ A ⊆ ℝ m scale as the degrees of freedom (DoFs) associated with the environment's underlying MuJoCo model. As a rule of thumb, the more complex the articulated physics-bound model is (i.e. more limbs, joints with greater DoFs), the larger both n and m are. The intrinsic difficulty of the simulated robotics task scales super-linearly with n and m, albeit considerably faster with m (policy's output) than with n (policy's input).
Omitting their respective versions, Table 1 reports the state and action dimensions (n and m respectively) for all the environments tackled in this work, ordered from left to right by increasing state and action dimensions, Humanoid-v3 being the most challenging. Since we consider, in our experiments, expert datasets composed of at most 10 demonstrations (10 is the default number; when we use 5, we specify it in the caption), we report return statistics (mean μ and standard deviation σ, formatted as μ(σ) in Table 1) aggregated over the set of 10 deterministically-selected demonstrations (the first 10 in our fixed pool) that every method requesting 10 demonstrations will receive. To reiterate: in this work, every single method and variant will receive exactly the same demonstrations, due to an explicit seeding mechanism in every experiment. The reported statistics therefore identically apply to every method or variant using 10 demonstrations. By design, this reproducibility asset naturally extends to settings requesting fewer demonstrations.

Demonstrations
Following prior work, we subsample every demonstration with a 1/u ratio, an operation called temporal dropout in Duan et al. (2017). For a given demonstration, we sample an index i_0 from the discrete uniform distribution unif{0, u − 1} to determine the first subsampled transition. We then take one transition every u transitions from the initial index i_0. In fine, the subsampled demonstration is extracted from the original one of length l by only preserving the transitions of indices {i_0 + ku | 0 ≤ k < ⌊l/u⌋}. Since the experts achieve very high performance in the MuJoCo benchmark (cf. last column of Table 1), they never fail their task and live until the "timeout" episode termination, triggered by the OpenAI Gym API once the horizon of 1000 timesteps is reached, in every environment considered in this work. As such, most demonstrations have a length l ≈ 1000 transitions (sometimes less but always above 950). Since we use the sub-sampling rate u = 20, as in prior work, the subsampled demonstrations have a length of ⌊l/u⌋ ≈ 50 transitions.
We wrap the absorbing states in both the expert trajectories beforehand and agent-generated trajectories at training time, as introduced in Kostrikov et al. (2019). Note, this assumes knowledge about the nature, organic (e.g. falling down) or triggered (e.g. timeout flag set at a fixed episode horizon), of the episode terminations (if any) occurring in the expert trajectories. Considering the benchmark, it is trivial to individually determine their natures in our work, which makes said assumption of knowledge weak. We trained the experts from which the demonstrations were then extracted using the on-policy state-of-the-art PPO (Schulman et al. 2017) algorithm. We used early stopping to halt the expert training processes when a phenomenon of diminishing returns is observed in the empirical return, typically attained by the 20 million interactions mark. We used our own parallel PPO implementation, written in PyTorch (Paszke et al. 2019), and will share the code upon acceptance. The IL endeavors presented in this work have also been implemented with this framework.
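The subsampling procedure described above can be sketched in a few lines. The function name `temporal_dropout` and the representation of a demonstration as a plain list of transitions are assumptions made for illustration.

```python
import random


def temporal_dropout(demo, u, rng):
    """Keep one transition every u transitions, starting from a random offset.

    The offset i0 is drawn uniformly from {0, ..., u - 1}, and the kept indices
    are {i0 + k * u | 0 <= k < floor(l / u)}, with l the demonstration length.
    """
    i0 = rng.randrange(u)
    n_kept = len(demo) // u
    return [demo[i0 + k * u] for k in range(n_kept)]


rng = random.Random(0)
demo = list(range(1000))  # stand-in for 1000 transitions, identified by index
sub = temporal_dropout(demo, u=20, rng=rng)
print(len(sub), sub[:3])
```

With l = 1000 and u = 20, the subsampled demonstration contains 50 transitions; a shorter demonstration of 950 transitions would yield 47.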

Distributed training
The distributed training scheme employed to obtain every empirical imitation learning result exhibited in this work uses the MPI message-passing standard. Upon launch, an experiment spins up n workers, each assigned a unique identifying rank 0 ≤ r < n. They all have symmetric roles, except the rank 0 worker, which will be referred to as the "zero-rank" worker. The role of each worker is to follow the studied algorithm: SAM (cf. Algorithm 1) in the experiments reported in this section, and the proposed extension PURPLE in the experiments reported later in Sect. 6.5. The zero-rank worker exactly follows the algorithm, while the n − 1 other workers omit the evaluation phase (marked by a dedicated symbol appearing in front of the line number). The random seed of each worker is defined deterministically from its rank and the base random seed given as a hyper-parameter by the practitioner, and is used to (a) determine the behavior of every stochastic entity involved in the worker's training process, and (b) determine the stochasticity of the environment it interacts with.
Before every gradient-based parameter update step (marked in Algorithm 1 by a dedicated symbol appearing in front of the line number), the zero-rank worker gathers the gradients across the n − 1 other workers, aggregates them via an averaging operation, and sends the aggregate to every worker. Upon receipt, every worker of the pool then uses the aggregated gradient in its own learning update. Since the parameters are synced across workers before the learning process kicks off, this synchronous gradient-averaging scheme ensures that the workers all have the same parameters throughout the entire learning process (same initial parameters, then same updates). This distributed training scheme leverages learners seeded differently in their own environments, also seeded differently, to accelerate exploration, and above all provide the model with greater robustness.
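The synchronous scheme described above can be mimicked in a single process, as in the sketch below. No MPI call is made: the workers are simulated by plain lists, and the gather, average and send steps are collapsed into one function. The seed mapping in `worker_seed` is a hypothetical choice, since the text only states that the seed is a deterministic function of the rank and the base seed.

```python
def worker_seed(base_seed, rank, n_workers):
    """Hypothetical deterministic mapping from (base seed, rank) to a worker seed."""
    return base_seed * n_workers + rank


def average_gradients(grads_per_worker):
    """Element-wise average of the gradient vectors computed by each worker."""
    n = len(grads_per_worker)
    return [sum(column) / n for column in zip(*grads_per_worker)]


def synchronous_step(params_per_worker, grads_per_worker, lr):
    """Apply the same averaged gradient to every worker's parameters."""
    avg = average_gradients(grads_per_worker)
    return [[p - lr * g for p, g in zip(params, avg)]
            for params in params_per_worker]


n_workers = 4
params = [[1.0, -2.0] for _ in range(n_workers)]         # synced before training
grads = [[0.1 * r, 1.0 - 0.2 * r] for r in range(n_workers)]  # differ per worker
params = synchronous_step(params, grads, lr=0.5)
print(params[0], [worker_seed(0, r, n_workers) for r in range(n_workers)])
```

Because every worker starts from the same parameters and applies the same averaged gradient, the parameter copies remain identical after each step, while the distinct seeds keep the collected experience diverse.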
Every imitation learning experiment whose results are reported in this work has been run for a fixed wall-clock duration (12 or 48 h, as indicated in their respective captions) due to hardware and computational infrastructure constraints. While the effective running time appears in the caption of every plot, the plots still depict the temporal progression of the methods in terms of timesteps, i.e. the number of interactions carried out with the environment. The reported performance corresponds to the undiscounted empirical return, computed using the reward returned by the environment (available at evaluation time), gathered by the non-perturbed policy (deterministic) of the zero-rank worker. Every experiment uses 16 workers, and can therefore be executed on most desktop consumer-grade computers. Lastly, we monitored every experiment with the Weights & Biases (Biewald 2020) tracking and visualization tool.
Additionally, we run each experiment with 5 different base random seeds (0-4), raising the effective seed count per experiment to 80. Each presented plot depicts the mean across them with a solid line, and the standard deviation envelope (half a standard deviation on either side of the mean) with a shaded area.
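For concreteness, the statistics drawn in each plot can be computed as in the following sketch. Whether the population or the sample standard deviation is used is not specified in the text; the population version is assumed here, and the function name is illustrative.

```python
import statistics


def mean_and_envelope(returns_per_seed, width=0.5):
    """Return the mean and the band of +/- width standard deviations around it."""
    mean = statistics.mean(returns_per_seed)
    std = statistics.pstdev(returns_per_seed)  # assumption: population std
    return mean, mean - width * std, mean + width * std


# Hypothetical returns collected at one timestep across seeds.
print(mean_and_envelope([3100.0, 2950.0, 3300.0, 3050.0, 3200.0]))
```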
Finally, we use an online observation normalization scheme, instrumental in performing well in continuous control tasks. The running mean and standard deviation used to standardize the observations are computed using an online method to represent the statistics of the entire history of observations. These statistics are updated with the mean and standard deviation computed over the concatenation of the latest rollouts collected by each parallel worker, making it effectively an online distributed batch normalization (Ioffe and Szegedy 2015) variant.
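A minimal sketch of such a running normalizer is given below, for scalar observations. It uses the standard parallel update of mean and variance to merge the statistics of a new batch into the running ones; the class name `RunningMoments` and the small constant `eps` are illustrative assumptions, and the concatenation of rollouts across workers is represented by a plain list.

```python
import math


class RunningMoments:
    """Running mean and standard deviation over the full history of observations."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, batch):
        """Merge the statistics of a new batch into the running statistics."""
        n_b = len(batch)
        if n_b == 0:
            return
        mean_b = sum(batch) / n_b
        m2_b = sum((x - mean_b) ** 2 for x in batch)
        n_a = self.count
        n = n_a + n_b
        delta = mean_b - self.mean
        self.mean += delta * n_b / n
        self.m2 += m2_b + delta ** 2 * n_a * n_b / n
        self.count = n

    def std(self):
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 1.0

    def normalize(self, x, eps=1e-8):
        return (x - self.mean) / (self.std() + eps)


moments = RunningMoments()
moments.update([1.0, 2.0, 3.0])       # e.g. observations from a first round of rollouts
moments.update([4.0, 5.0, 6.0, 7.0])  # observations from the next round
print(moments.mean, moments.std(), moments.normalize(6.0))
```

The merged statistics match those computed over the whole history at once, without storing past observations.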

Empirical results
We now go over our first set of empirical results, whose goal is to show to what extent gradient penalty regularization is needed. The compared methods all use SAM (cf. Sect. 4) as base.
First, Fig. 2 compares several modular configurations, which are described using the following handles in the legend. GP means that gradient penalization (GP) (cf. Sect. 5.4) is used. NoGP means that GP is not used (the unregularized discriminator objective is used instead of the gradient-penalized one). Note, NoGP is the only negative handle that we use, since it is central to our analyses. When any other technique is not in use, it is simply absent from the handle in the legend. SN means that spectral normalization (SN) (Miyato et al. 2018) is used. SN normalizes the discriminator's weights to have a norm close to 1, drawing a direct parallel with GP. In line with what the large-scale ablation studies on GAN add-ons advocate (Lucic et al. 2017; Kurach et al. 2018), SN is used in most modern GAN architectures for its simplicity. We here investigate whether SN is enough to keep the gradient in check, or whether GP is necessary. LS denotes one-sided uniform label smoothing, consisting in replacing the positive labels only (hence one-sided), which are normally equal to 1 (expert, real), by a soft label u, distributed as u ∼ unif(0.7, 1.2). We do not consider Variational Discriminator Bottleneck (VDB) (Peng et al. 2018) in our comparisons since (a) we prefer to focus on stripped-down canonical methods, and (b) the information bottleneck forced on the discriminator's hidden representation boils down to smoothing the labels anyway, as shown recently in Müller et al. (2019).
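One-sided uniform label smoothing, as described above, can be sketched as follows. The function names and the use of a binary cross-entropy on discriminator probabilities are assumptions made for illustration; only the soft-label distribution unif(0.7, 1.2) and the fact that agent labels are left untouched come from the text.

```python
import math
import random


def smoothed_labels(n_expert, n_agent, rng, low=0.7, high=1.2):
    """Soft labels for expert samples, hard zeros for agent samples (one-sided)."""
    expert = [rng.uniform(low, high) for _ in range(n_expert)]
    agent = [0.0] * n_agent
    return expert, agent


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy between a predicted probability and a (possibly soft) label."""
    return -(label * math.log(prob + eps)
             + (1.0 - label) * math.log(1.0 - prob + eps))


rng = random.Random(0)
expert_labels, agent_labels = smoothed_labels(4, 4, rng)
# Hypothetical discriminator outputs on expert and agent samples.
expert_probs, agent_probs = [0.9, 0.8, 0.95, 0.7], [0.2, 0.1, 0.3, 0.05]
loss = (sum(binary_cross_entropy(p, y) for p, y in zip(expert_probs, expert_labels))
        + sum(binary_cross_entropy(p, y) for p, y in zip(agent_probs, agent_labels))) / 8
print(expert_labels, loss)
```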
In Fig. 2, we see that not using GP (NoGP) prevents the agent from learning anything valuable: the agent barely collects any reward at all. While using SN can improve performance slightly (NoGP-SN), the addition of LS (NoGP-SN-LS) considerably improves performance over the two previous candidates. Nonetheless, despite the sizable runtime, all three perform poorly and are a far cry from achieving the same empirical return as the expert (cf. Table 1). In contrast with Fig. 2, Figs. 3 and 4 show to what extent introducing GP in the off-policy imitation learning algorithm considered in this work impacts performance positively. The performance gap is substantial in every environment except the easiest one considered, InvertedDoublePendulum-v2, as described in Table 1. As soon as GP is in use, the agent achieves near-expert performance (cf. Table 1). Figure 2 shows that without GP, neither SN nor LS is enough to enable the agent to mimic the expert with high fidelity, while Figs. 3 and 4 show that with GP, extra methods such as LS barely improve performance. These results support our claim: gradient penalty is (empirically) necessary and sufficient to ensure near-expert performance in off-policy generative adversarial imitation learning, in our computational setting.
Fig. 5 Ablation study on GP in on-policy GAIL. We see that the agent is still able to learn policies achieving peak performance even without GP, in contrast to the off-policy version of the algorithm. In the most difficult environment of the MuJoCo suite (cf. Table 1), Humanoid, GP achieves best performance. Runtime is 12 hours
We also conducted an ablation of GP in the on-policy setting, reported in Fig. 5. We see that across the range of environments, GP does not assume the same decisive role as in the off-policy setting. In fact, the agent reaches peak performance earlier without GP in two challenging environments, Ant and HalfCheetah, out of the five considered.
Nevertheless, it still allows the agent to attain peak empirical return faster in Hopper, Walker2d, and perhaps most strikingly, in the extremely complex Humanoid environment. All in all, while GP can help in the on-policy setting, it is not necessary as in the off-policy setting studied in this work. In line with the analyses led in Sects. 5.1-5.3, the results of Fig. 5 somewhat corroborate our claim that the presence of bootstrapping in the policy evaluation objective creates a bottleneck that can be addressed by enforcing a Lipschitz-continuity constraint (GP) on the reward learned for imitation.
Figure 6 compares SAM, with and without GP, against several alternate versions of the objective used to train the surrogate reward for imitation. We introduce the following new handles to denote these methods. "RED" means that the random expert distillation (RED) method is used to learn the imitation reward, replacing the adversarial one in SAM. RED is based on random network distillation (RND) (Burda et al. 2018), an exploration method that uses the prediction error of a learned network against a random fixed target as a measure of novelty, and uses it to craft a reward bonus. Instead of updating the network while training to keep the novelty estimate tuned to the current exploration level of the agent, RED trains the RND predictor network to predict the random fixed target on the expert dataset before training the policy. RED then uses the prediction error to assemble a reward signal for the imitation agent, which is rewarded more if the actions it picks are deemed not novel, as that means the agent's occupancy measure matches the occupancy of what has been seen before, i.e. the expert dataset. As such, RED is a technique that rewards the agent for matching the distribution support of the expert policy π_e. Note, as opposed to adversarial imitation, the RED reward is not updated during training, which technically protects it from overfitting.
"PU" means that we learn the reward via adversarial imitation, but using the discriminator objective recently proposed in positive-unlabeled (PU) GAIL (Xu and Denil 2019). Briefly, the method considers that while the expert-generated samples are positive-labeled, the agent-generated ones are unlabeled (as opposed to negative-labeled). Intuitively, it should prevent the discriminator from overfitting on irrelevant features when it becomes difficult for the discriminator to tell agent and expert apart.
The wrapping mechanism (consisting in wrapping the absorbing transitions, which we described in Sect. 4) is used in every experiment reported in Fig. 6, including RED. In addition, note, we only use GP in the adversarial context we introduced it in. We do not use GP with RED. Each technique is re-implemented based on the associated paper, with the same hyper-parameters, with the exception of RED: instead of using the per-environment scale for the prediction loss on which the RED reward is built, we keep a running estimate of the standard deviation of this prediction loss and rescale said prediction loss with its running standard deviation. This modification is consistent with the rescaling done in the paper RED is based on (RND). By contrast, the per-environment scales in RED's official implementation span several orders of magnitude (four). We here opt for environment-agnostic methods.
The results in Fig. 6 show that the wrapping technique introduced in Kostrikov et al. (2019) and described in Sect. 4 increases performance overall. As we have shown before in Figs. 2, 3, and 4, not using GP causes a considerable drop in performance. PU prevents the agent from learning an expert-like policy, in every environment. Note, while the comparison is fair, PU was introduced in visual tasks. In particular, we see that, in Hopper, PU's empirical return hits a plateau at about 1000 reward units (abbrv. r.u.). We observe the exact same phenomenon with RED, for which it occurs in every environment. This is caused by the agent being stuck performing the same sub-optimal actions, accumulating sub-optimal outcomes until the episode termination artificially triggered by timeout. The agent exploits the fact that it has a lifetime upper-bounded by said timeout and is therefore biased by its survival (survival bias, cf. Sect. 4). The RED agents are in effect staying alive until termination, and therefore avoid falling down (organic trigger) until the timeout (artificial trigger) is reached. While the reward used in RED is not negative, the agent quickly reaches a performance level at which all the rewards are almost identical, since the RED reward is trained beforehand, with no chance of adaptive tuning like training the reward at the same time allows in this work, and since RED's score is based on how the agent and expert distributions match. Once the agent is similar enough to the expert, it always gets the same rewards and has therefore no incentive to resemble the expert with higher fidelity. Instead, it is content and just tries to live through the episode. This propensity to survival bias explains why such care was taken to hand-tune its scale. Finally, even though wrapping absorbing transitions generally improves performance, Fig. 6 shows that survival bias is avoided even without it (occurrence in Hopper has been overcome).
The results in Fig. 3 provide empirical evidence that enforcing Lipschitz-continuity on D over the input space via the gradient regularization [cf. Eq. (14)] is necessary and sufficient for the agent to achieve expert performance in the considered off-policy setting. We therefore ask the question: is the positive impact that GP has on training imitation policies via bootstrapping explained (a) by its direct effect on the reward smoothness, or (b) by its indirect effect on the state-action value smoothness? We argue that both contribute to the stability and performance of the studied method. While point (a) is intuitive from the analyses laid out in Sects. 5.1-5.3, we believe that point (b) deserves further analysis and discussion. As such, we derive theoretical results to qualify, both qualitatively and quantitatively, the Lipschitz-continuity that is potentially implicitly enforced on the state-action value when assuming the Lipschitz-continuity of the reward. These results are reported in Sect. 6.1, and will hopefully help us answer the previous question. A discussion of the indirect effect and how it compares to the direct effect implemented by target smoothing is carried out in Sect. 6.2.4.
6 Pushing the analysis further: robustness guarantees and provably more robust extension

Robustness guarantees: state-action value Lipschitzness
In this section, we ultimately show that enforcing a Lipschitzness constraint on the reward r_φ has the effect of enforcing a Lipschitzness constraint on the associated state-action value Q_φ. Note, Q_φ is the real Q-value derived from r_φ, while Q_ω is a function approximation of it. We discuss this point in more detail in Sect. 6.2. We characterize and discuss the conditions under which such a result is satisfied, as well as how the exhibited Lipschitz constant for Q_φ relates to the one enforced on r_φ. We work in the episodic setting, i.e. with a finite horizon T, which is achieved by assuming that γ = 0 once an absorbing state is reached. Note, since we optimize over mini-batches in practice, nothing guarantees that the Lipschitz constraint is satisfied by the learned function approximation globally across the whole joint space S × A, at every training iteration.
In such a setting, we are therefore reduced to local Lipschitzness, defined as Lipschitzness in neighborhoods around samples at which the constraint is applied. The provenance of these samples is not the focus of this theoretical section, and we assume they are agent-generated. We study the effect of enforcing Lipschitzness constraints on other data distributions in Sect. 6.3.
Notations Given a function f ∶ ℝ^n × ℝ^m → ℝ^d, taking the pair of vectors (x, y) as inputs, we denote by ∇_{x,y} f the pair of Jacobians associated with x and y, ∇_x f and ∇_y f respectively, which are rectangular matrices in ℝ^{d×n} and ℝ^{d×m} respectively. Now that the stable concepts and notations have been laid out, we introduce the variables x_i and y_i, indexed by i ∈ I ⊆ ℕ. Note, the indices i do not depict different occurrences of the x variable: the x_i's and y_i's are distinct variables. These families of variables will enable us to formalize the Jacobian of f with respect to (x_i, y_i), evaluated at (x_i′, y_i′). To lighten the notations, we overload the symbol ∇ and introduce shorthands of the form ∇^i_{x,y}[f]_{i′}, which carry both indices. In this work, the difference between the index of derivation i and the index of evaluation i′ (with i ≤ i′) will be referred to as the gap. We use ‖⋅‖_F to denote the Frobenius norm, which (a) is naturally defined over rectangular matrices in ℝ^{m×n} and (b) is sub-multiplicative: ‖UV‖_F ≤ ‖U‖_F ‖V‖_F, for U and V rectangular with compatible sizes (provable via the Cauchy-Schwarz inequality). In proofs, we use "⊗" for matrix multiplication, to avoid collisions with the scalar product.
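The sub-multiplicativity of the Frobenius norm, which the proofs below rely on, can be checked numerically on rectangular matrices with compatible sizes. The sketch below is a sanity check on randomly drawn matrices, not a proof; the helper names are illustrative.

```python
import math
import random


def frobenius(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in M for x in row))


def matmul(U, V):
    """Product of U (p x q) and V (q x r), both given as lists of rows."""
    return [[sum(U[i][k] * V[k][j] for k in range(len(V)))
             for j in range(len(V[0]))]
            for i in range(len(U))]


rng = random.Random(0)
U = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(3)]  # 3 x 4
V = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(4)]  # 4 x 2
print(frobenius(matmul(U, V)), frobenius(U) * frobenius(V))
```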
Lemma 1 (Recursive inequality, induction step) Let the MDP with which the agent interacts be deterministic, with the dynamics of the environment determined by the function f ∶ S × A → S. The agent follows a deterministic policy π ∶ S → A to map states to actions, and receives rewards from r_φ ∶ S × A → ℝ upon interaction. The functions f, π and r_φ need be C^0 and differentiable over their respective input spaces. This property is satisfied by the usual neural network function approximators. The "almost-everywhere" case can be derived from this lemma without major changes (relevant when at least one activation function is only differentiable almost-everywhere, e.g. ReLU). (a) Under the previous assumptions, for k ∈ [0, T − t − 1] ∩ ℕ the following recursive inequality is verified: where C∶=A^2 max(1, B^2) is the time-independent counterpart of C_t.

Proof of Lemma 1(a)
First, we take the derivative with respect to each variable separately: By assembling the norm with respect to both input variables, we get: Let A t , B t and C t be time-dependent quantities defined as: Finally, by substitution, we obtain: which concludes the proof of Lemma 1(a). ◻

Proof of Lemma 1(b) By introducing time-independent upper bounds A and B such that
Lemma 1 tells us how the norm of the Jacobian associated with a gap between derivation and evaluation indices equal to t + 1 relates to the norm of the Jacobian associated with a gap equal to t. We will use this recursive property to prove our first theorem, Theorem 1. Additionally, from this point forward, we will use the time-independent upper-bounds exclusively, i.e. Lemma 1(b).
Theorem 1 (Gap-dependent reward Lipschitzness) In addition to the assumptions laid out in Lemma 1, we assume that the function r_φ is δ-Lipschitz over S × A. Since r_φ is C^0 and differentiable over S × A, this assumption can be written as where k ∈ [0, T] ∩ ℕ and C is defined as in Lemma 1(b).

Proof of Theorem 1(a)
We will prove Theorem 1(a) by induction.
Let us introduce the dummy variable v, along with the induction hypothesis for v: where v represents the gap between the derivation timestep and the evaluation timestep.
Step 2: induction Let us assume that Eq. (41) is verified for v fixed, and show that Eq. (41) is satisfied when the gap is equal to v + 1.
Equation (41) is therefore satisfied for v + 1 when assumed at v, which proves the induction step.
Step 3: conclusion Since Eq. (41) has been verified for both the initialization and induction steps, the hypothesis is valid ∀v ∈ [0, T] ∩ ℕ , which concludes the proof of Theorem 1(a). ◻

Proof of Theorem 1(b)
We will prove Theorem 1(b) by induction. Let us introduce the dummy variable v, along with the induction hypothesis for v: where v represents the gap between the derivation timestep and the evaluation timestep.
Step 2: induction Let us assume that Eq. (46) is verified for v fixed, and show that Eq. (46) is satisfied when the gap is equal to v + 1.
Equation (46) is therefore satisfied for v + 1 when assumed at v, which proves the induction step.
Step 3: conclusion Since Eq. (46) has been verified for both the initialization and induction steps, the hypothesis is valid ∀v ∈ [0, T] ∩ ℕ, which concludes the proof of Theorem 1(b). ◻
This result shows that when there is a gap k between the derivation and evaluation indices, the norm of the Jacobian of r_φ is upper-bounded by a gap-dependent quantity equal to δ√(C^k), over the entire input space. Crucially, the δ-Lipschitzness assumption itself applies if and only if the gap between the timestep of the derivation variable and the timestep of the evaluation variable is equal to 0, hence the use of the same letter u in the assumption formulation.

Proof of Theorem 2
With finite horizon T, we have Q_φ(s_t, a_t) ∶= ∑_{k=0}^{T−t−1} γ^k r_φ(s_{t+k}, a_{t+k}), ∀t ∈ [0, T] ∩ ℕ, since f, π, and r_φ are all deterministic (no expectation). Additionally, since r_φ is assumed to be C^0 and differentiable over S × A, Q_φ is by construction also C^0 and differentiable over S × A. Consequently, ∇^u_{s,a}[Q_φ]_u exists, ∀u ∈ [0, T] ∩ ℕ. Since both r_φ and Q_φ are scalar-valued (their output space is ℝ), their Jacobians are the same as their gradients. We can therefore use the linearity of the gradient operator, and treat the case γ^2 C = 1 separately. On the other hand, when γ^2 C ≠ 1: By applying √⋅ (monotonically increasing) to the inequality, we obtain the claimed result. ◻
Finally, we derive a corollary from Theorem 2 corresponding to the infinite-horizon regime.
Corollary 1 (Infinite-horizon regime) Under the assumptions of Theorem 2, including that $r_\varphi$ is $\delta$-Lipschitz over $\mathcal{S} \times \mathcal{A}$, and assuming that $\gamma^2 C < 1$, we have, in the infinite-horizon regime: $\|\nabla^t_{s,a}[Q_\varphi]_t\|_F \le \delta / \sqrt{1 - \gamma^2 C}$.

Proof of Corollary 1 We now have $Q_\varphi(s_t, a_t) := \sum_{k=0}^{+\infty} \gamma^k r_\varphi(s_{t+k}, a_{t+k})$, $\forall t \in [0, T] \cap \mathbb{N}$, since the dynamics $f$, the policy, and the reward $r_\varphi$ are all deterministic and we are now working under the infinite-horizon regime. Considering the changes in $Q_\varphi$'s definition, the first part of the proof can be done by analogy with the proof of Theorem 2, until Eq. (54), which is our starting point. In this regime, $\gamma^2 C \ge 1$ yields a divergent sum in Eq. (54), which results in an uninformative (because infinite) upper-bound on $\|\nabla^t_{s,a}[Q_\varphi]_t\|_F$. On the other hand, when $\gamma^2 C < 1$ (note, we always have $\gamma^2 C \ge 0$ by definition), the infinite sum in Eq. (54) is defined. Since we have shown that $\gamma^2 C < 1$ is the only setting in which the sum is defined, we continue from the infinite-horizon version of Eq. (54) with $\gamma^2 C < 1$ onwards. Hence, $\|\nabla^t_{s,a}[Q_\varphi]_t\|_F^2 \le \delta^2 \sum_{k=0}^{+\infty} (\gamma^2 C)^k = \delta^2 / (1 - \gamma^2 C)$. Applying $\sqrt{\cdot}$ (monotonically increasing) to both sides concludes the proof of Corollary 1. ◻

To conclude the section, we now interpret the derived theoretical results, discuss their implications, and examine to what extent they transfer to the practical setting.

Function approximation bias
Theorem 2 exhibits the Lipschitz constant of $Q_\varphi$ when $r_\varphi$ is $\delta$-Lipschitz. In practice however, the state-action value (or value function) is usually modeled by a neural network, and learned via gradient descent either by using a Monte-Carlo estimate of the collected return as regression target, or by bootstrapping using a subsequent model estimate (Sutton 1988). We therefore have access to a learned estimate $Q_\omega$, as opposed to the real state-action value $Q_\varphi$. As such, the results derived in Theorem 2 will transfer favorably into the function approximation setting as $Q_\omega$ becomes a better parametric estimate of $Q_\varphi$. Note, the reward is denoted by $r_\varphi$ for the reader to easily distinguish it from the black-box reward traditionally returned by the environment. Albeit arbitrary, the notation $r_\varphi$ allows for the reward to be modeled by a neural network parameterized by the weights $\varphi$, and learned via gradient descent, as is indeed the case in this work. Crucially, having control over $r_\varphi$ in practice allows for the enforcement of constraints, making the $\delta$-Lipschitzness assumption in Theorems 1, 2 and Corollary 1 practically satisfiable via gradient penalization (cf. Sect. 5.4). It is crucial to note that, while function approximation creates a gap between theory and practice for the Q-value (worse when bootstrapping), the gap is meaningfully smaller for the reward, as the $\delta$-Lipschitzness constraint is directly enforced on the parametric reward $r_\varphi$.

Value Lipschitzness
In Corollary 1 we showed that $\|\nabla^t_{s,a}[Q_\varphi]_t\|_F \le \delta / \sqrt{1 - \gamma^2 C}$, in the infinite-horizon regime, when $r_\varphi$ is assumed $\delta$-Lipschitz over $\mathcal{S} \times \mathcal{A}$, and assuming $\gamma^2 C < 1$. In other words, in this setting, enforcing $r_\varphi$ to be $\delta$-Lipschitz causes $Q_\varphi$ to be $\Delta_\infty$-Lipschitz, with $\Delta_\infty := \delta / \sqrt{1 - \gamma^2 C}$. Starting from the assumption that $\gamma^2 C < 1$, we arrive at $\sqrt{1 - \gamma^2 C} < 1$, then $1 / \sqrt{1 - \gamma^2 C} > 1$, and since $\delta \ge 0$ by definition (cf. Sect. 5.4), we finally get $\Delta_\infty > \delta$ (strictly so whenever $\delta > 0$). Without loss of generality, consider the case in which $r_\varphi$ is not a contraction, i.e. $r_\varphi$ is $\delta$-Lipschitz $C^0$ over $\mathcal{S} \times \mathcal{A}$, with $\delta \ge 1$. As a result, $\Delta_\infty > \delta \ge 1$, i.e. $\Delta_\infty > 1$, which means that, under the considered conditions, $Q_\varphi$ is not a contraction over $\mathcal{S} \times \mathcal{A}$ either. The latter naturally extends to any $u \in \mathbb{R}_+$ that lower-bounds $\delta$: if $\delta > u$, then $\Delta_\infty > u$, $\forall u \in \mathbb{R}_+$. Lipschitz functions and especially contractions are at the core of many fundamental results in dynamic programming, hence also in reinforcement learning. Crucially, the Bellman operator being a contraction causes a fixed point iterative process, such as value iteration (Sutton and Barto 1998), to converge to a unique fixed point whatever the starting iterate of $Q$. Since we learn $Q_\omega$ with temporal-difference learning (Sutton 1988) via a bootstrapped objective, the convergence of our method is a direct consequence of the contractive nature of the Bellman operator. As such, the Lipschitzness-centric analysis laid out in this section is complementary to the latter. It provides a characterization of $Q_\varphi$'s Lipschitzness over the input space $\mathcal{S} \times \mathcal{A}$ as opposed to over iterates, i.e. time. Our analysis therefore does not give convergence guarantees of an iterative process, which are already carried over from temporal-difference learning at the core of our algorithm. Rather, we provide variation upper-bounds for $Q_\varphi$ when $r_\varphi$ has upper-bounded variations: if $r_\varphi$ is $\delta$-Lipschitz, then $Q_\varphi$ is $\Delta_\infty$-Lipschitz.
In fine, this result has an immediate corollary, derived previously in this block: if the variations of $r_\varphi$ are lower-bounded by $\delta$, then the variations of $Q_\varphi$ are lower-bounded by $\Delta_\infty > \delta$.
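To make the relation between $\delta$ and $\Delta_\infty$ concrete, the following minimal sketch evaluates $\Delta_\infty = \delta / \sqrt{1 - \gamma^2 C}$ and checks that the truncated geometric series approaches it. The function names and the numerical values ($\delta = 1$, $\gamma = 0.99$, $C = 0.9$) are illustrative assumptions and are not settings used in this work.

```python
import math


def delta_inf(delta: float, gamma: float, C: float) -> float:
    """Infinite-horizon bound delta / sqrt(1 - gamma^2 C); needs 0 <= gamma^2 C < 1."""
    ratio = gamma ** 2 * C
    if not 0.0 <= ratio < 1.0:
        raise ValueError("the geometric series only converges when 0 <= gamma^2 C < 1")
    return delta / math.sqrt(1.0 - ratio)


def delta_partial(delta: float, gamma: float, C: float, n_terms: int) -> float:
    """Same bound, with the geometric series truncated after n_terms terms."""
    ratio = gamma ** 2 * C
    return delta * math.sqrt(sum(ratio ** k for k in range(n_terms)))


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the experiments).
    delta, gamma, C = 1.0, 0.99, 0.9
    print("Delta_inf:", delta_inf(delta, gamma, C))
    for n in (1, 10, 100, 1000):
        print(n, "terms:", delta_partial(delta, gamma, C, n))
```

With a single term the truncated bound equals $\delta$ itself, and every additional term pushes it further above $\delta$, which is the amplification discussed above.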

Compounding variations
The relative position of $\gamma^2 C$ with respect to 1 is instrumental in the behavior of the exhibited variation bounds, in both the finite- and infinite-horizon settings. In the latter, we see that the upper-bound goes to infinity when $\gamma^2 C$ (non-negative by definition, and lower than 1 as necessary condition for the infinite sum to exist) gets closer to 1 from below. In the former, we focus on the $\gamma^2 C \neq 1$ case, as in the other case, the bound does not even depend on $\gamma^2 C$. As such, we study the value of $\|\nabla^t_{s,a}[Q_\varphi]_t\|_F$'s upper-bound in the finite-horizon setting when $\gamma^2 C \neq 1$, dubbed $\Delta_t := \delta \sqrt{\big(1 - (\gamma^2 C)^{T-t}\big) / (1 - \gamma^2 C)}$. Beforehand, we remind the reader how the bounded quantity should behave throughout an episode. Since $Q_\varphi$ is defined as the expected sum of future rewards $r_\varphi$, predicting such a value should get increasingly tainted with uncertainty as it tries to predict across long time ranges. As such, predicting $Q_\varphi$ at time $t = 0$ is the most challenging, as it corresponds to the value of an entire trajectory, whereas predicting $Q_\varphi$ at time $t = T$ is the easiest (equal to last reward $r_\varphi$). Longer horizons $T$ consequently make the prediction task more difficult, as do discount factors $\gamma$ closer to 1. We now discuss $\Delta_t$. As long as $\gamma^2 C \neq 1$, $\Delta_t$ goes to 0 as $t$ goes to $T$. This is consistent with the previous reminder: as $t$ goes to $T$, the $Q_\varphi$ estimation task becomes easier, hence the variation bound ($\Delta_t$) due to prediction uncertainty should decrease to 0. As $t$ goes to 0 however, the behavior of $\Delta_t$ depends on the value of $\gamma^2 C$: if $\gamma^2 C \gg 1$, $\Delta_t$ explodes to infinity, whereas for reasonable values of $\gamma^2 C$, $\Delta_t$ does not. Recall that $C := A^2 \max(1, B^2)$.

Let us assume that $A$ ($B$) not only upper-bounds every $A_t$ ($B_t$) but is also the tightest time-independent bound:
Note, the "or" is inclusive. In other words, if the variations (in space) of the policy or the dynamics are large in the early stage of an episode ($0 \le t \ll T$), then $\Delta_t$ (variation bound on $Q_\varphi$) explodes. The exhibited phenomenon is somewhat reminiscent of the compounding of errors isolated in Ross and Bagnell (2010).
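The behavior of $\Delta_t$ along an episode can be tabulated with a few lines of code. The sketch below evaluates $\Delta_t$ for $t = 0, \dots, T$, and assumes that the $\gamma^2 C = 1$ case evaluates to $\delta \sqrt{T - t}$ (the value of the finite geometric sum when its ratio is 1). The function name `delta_t` and all numerical values are hypothetical and chosen for illustration only.

```python
import math


def delta_t(delta: float, gamma: float, C: float, T: int, t: int) -> float:
    """Finite-horizon variation bound on Q at timestep t (0 <= t <= T)."""
    ratio = gamma ** 2 * C
    if math.isclose(ratio, 1.0):
        # Assumed form of the degenerate case: the sum has T - t unit terms.
        return delta * math.sqrt(T - t)
    return delta * math.sqrt((1.0 - ratio ** (T - t)) / (1.0 - ratio))


if __name__ == "__main__":
    T = 50
    # Three regimes: gamma^2 C below 1, slightly above 1, and well above 1.
    for gamma, C in ((0.99, 0.9), (0.99, 1.1), (0.99, 4.0)):
        profile = [delta_t(1.0, gamma, C, T, t) for t in (0, 25, 45, 50)]
        print(f"gamma^2 C = {gamma ** 2 * C:.3f}:", [f"{v:.3g}" for v in profile])
```

In every regime the bound vanishes at $t = T$, while its value at $t = 0$ grows rapidly once $\gamma^2 C$ exceeds 1, in line with the discussion above.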

Is value Lipschitzness enough?
We showed that under mild conditions, and in finite- and infinite-horizon regimes, $r_\varphi$ Lipschitzness implies $Q_\varphi$ Lipschitzness, i.e. that if similar state-action pairs are mapped to similar rewards by $r_\varphi$, then $Q_\varphi$ also maps them to similar state-action values. This regularization desideratum is evocative of the target policy smoothing add-on introduced in Fujimoto et al. (2018), already presented earlier in Sect. 4. In short, target policy smoothing perturbs the target action slightly. In effect, the temporal-difference optimization now fits the value estimate against an expectation of similar bootstrapped target value estimates. Forcing similar actions to have similar values naturally smooths out the value estimate, which by definition emulates the enforcement of a Lipschitzness constraint on the value, and as such mitigates the value overfitting which deterministic policies are prone to. While its smoothing effect on the value function is somewhat intuitive, we set out to investigate formally how target policy smoothing affects the optimization dynamics, and particularly to what extent it smooths out the state-action value landscape. Since the function approximator $Q_\omega$ is optimized as a supervised learning problem using the traditional squared loss criterion, we first study how perturbing the inputs with additive random noise, denoted by $\epsilon$, impacts the optimized criterion, and what kind of behavior it encourages in the predictive function. As such, to lighten the expressions, we consider the supervised criterion $C(x) := (y - f(x))^2$, where $f(x)$ is the predicted vector at the input vector $x$, and $y$ is the supervised target vector. We also consider, in line with Fujimoto et al. (2018), that the noise is sampled from a spherical zero-centered Gaussian distribution, omitting here that the noise is truncated for legibility, hence $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. The criterion injected with input noise is $C_\epsilon(x) := C(x + \epsilon) = (y - f(x + \epsilon))^2$.
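Assuming the noise has small amplitude (further supporting the original truncation), we can write the second-order Taylor series expansion of the perturbed criterion near $\epsilon = 0$, as a polynomial of $\epsilon$:

$$C_\epsilon(x) = C(x) + \sum_i \epsilon_i \frac{\partial C}{\partial x_i} + \frac{1}{2} \sum_i \sum_j \epsilon_i \epsilon_j \frac{\partial^2 C}{\partial x_i \partial x_j} + O(\|\epsilon\|^3)$$

where $\|\cdot\|$ denotes the Euclidean norm in the appropriate vector space. From this point forward, we assume the noise has a small enough norm to allow the third term, $O(\|\epsilon\|^3)$, to be neglected. By integrating over the noise distribution, we obtain:

$$\int C_\epsilon(x)\, p(\epsilon)\, d\epsilon \approx C(x) + \sum_i \frac{\partial C}{\partial x_i} \int \epsilon_i\, p(\epsilon)\, d\epsilon + \frac{1}{2} \sum_i \sum_j \frac{\partial^2 C}{\partial x_i \partial x_j} \int \epsilon_i \epsilon_j\, p(\epsilon)\, d\epsilon$$

Since the noise is sampled from the zero-centered and spherical distribution $\mathcal{N}(0, \sigma^2 I)$, we have respectively that $\int \epsilon_i\, p(\epsilon)\, d\epsilon = 0$ and $\int \epsilon_i \epsilon_j\, p(\epsilon)\, d\epsilon = \sigma^2 \delta_{ij}$, where $\delta_{ij}$ is the Kronecker symbol. By injecting these expressions in Eq. (60), we get:

$$\int C_\epsilon(x)\, p(\epsilon)\, d\epsilon \approx C(x) + \frac{\sigma^2}{2} \operatorname{Tr}(H_x C)$$

where $\operatorname{Tr}(H_x C)$ is the trace of the Hessian of the criterion $C$, w.r.t. the input variable $x$. We now want to express the exhibited regularizer $\operatorname{Tr}(H_x C)$ as a function of the derivatives of the prediction function $f$, and therefore calculate the consecutive derivative sums:

$$\sum_i \frac{\partial C}{\partial x_i} = -2\,(y - f(x)) \sum_i \frac{\partial f}{\partial x_i}, \qquad \sum_i \frac{\partial^2 C}{\partial x_i^2} = 2 \sum_i \Big(\frac{\partial f}{\partial x_i}\Big)^2 - 2\,(y - f(x)) \sum_i \frac{\partial^2 f}{\partial x_i^2}$$

hence, $\operatorname{Tr}(H_x C) = 2\,\|\nabla_x f\|^2 - 2\,(y - f(x)) \operatorname{Tr}(H_x f)$. In fine, we can write, in a more condensed form:

$$\int C_\epsilon(x)\, p(\epsilon)\, d\epsilon \approx C(x) + \sigma^2 \Big[ \|\nabla_x f\|^2 - (y - f(x)) \operatorname{Tr}(H_x f) \Big]$$

The previous derivations, derived somewhat similarly in Webb (1994) and Bishop (1995), show that minimizing the criterion with noise injected in the input is equivalent to minimizing the criterion without any noise and a regularizer containing both the Jacobian and Hessian of the prediction function $f$. As raised in Bishop (1995), the second term of the regularizer is unsuitable for the design of a practically viable learning algorithm, since (a) it involves prohibitively costly second-order derivatives, and (b) it is not positive definite, and consequently not lower-bounded, which overall makes the regularizer a bad candidate for an optimization problem loss. Nevertheless, Bishop (1995) further shows that this regularization is equivalent to the use of a standard Tikhonov-like positive-definite regularization scheme involving only first-order derivatives, provided the noise has small amplitude, ensured here with a small $\sigma$ and noise clipping.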
As such, the regularizer induced by the input noise is equivalent to $\sigma^2 \|\nabla_x f\|^2$, and by direct analogy, we can say that target policy smoothing induces an implicit regularizer on the TD objective, of the form $\sigma^2 \|\nabla_a Q_{\omega'}\|^2$. Note, $\omega'$ are the target critic parameters, given that target policy smoothing adds noise to the target action, an input of the target critic value $Q_{\omega'}$. By construction, the target parameters $\omega'$ slowly follow the online parameters $\omega$ (cf. Sect. 4). In addition, temporal-difference learning urges $Q_\omega$ to move closer to $Q_{\omega'}$ by design [cf. Eq. (3)]. Consequently, properties enforced on one set of parameters should eventually be transferred to the other, such that in fine both $\omega$ and $\omega'$ possess the given property only explicitly enforced on one (albeit delayed). Based on this line of reasoning, the temporal-difference learning dynamics and soft target updates should make the theoretically equivalent $\sigma^2 \|\nabla_a Q_{\omega'}\|^2$ regularizer enforce smoothness on the online parameters $\omega$ too, even if it explicitly only constrains the target weights $\omega'$. All in all, we have shown that target smoothing is equivalent to adding a regularizer to the temporal-difference error minimized when learning $Q_\omega$, where said regularizer is reminiscent of the gradient penalty regularizer, presented earlier in Eq. (14). As such, target smoothing does implement a gradient penalty regularization, but on $Q_\omega$. Crucially, the gradient in the penalty is only taken w.r.t. the action dimension, but not w.r.t. the state dimension. In spite of the use of target policy smoothing in our method, it was not enough to yield stable learning behaviors, as shown in Sect. 5.5. Gradient penalization was an absolute necessity. Even though both methods encourage $Q_\omega$ to be smoother (directly in Fujimoto et al. (2018), and indirectly via reward Lipschitzness in this work), on its own, learning a smooth $Q_\omega$ estimate seems not to be sufficient for our method to work: learning a smooth $r_\varphi$ estimate to serve as basis for $Q_\omega$ seems to be a necessary condition.
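The equivalence between input noise and a first-order penalty can be checked numerically in the simplest possible case. The sketch below assumes a linear predictor $f(x) = w^\top x$, for which the Hessian of $f$ vanishes and the noisy criterion equals $C(x) + \sigma^2 \|\nabla_x f\|^2$ exactly. The predictor, the function names and all numerical values are illustrative assumptions, and the noise is left untruncated.

```python
import numpy as np


def noisy_criterion(w, x, y, sigma, n_samples=200_000, seed=0):
    """Monte-Carlo estimate of E_eps[(y - f(x + eps))^2] for f(x) = w . x."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(loc=0.0, scale=sigma, size=(n_samples, x.shape[0]))
    return float(np.mean((y - (x + eps) @ w) ** 2))


def regularized_criterion(w, x, y, sigma):
    """Noise-free criterion plus the penalty sigma^2 * ||grad_x f||^2."""
    grad_f = w  # input-gradient of the linear predictor
    return float((y - x @ w) ** 2 + sigma ** 2 * float(grad_f @ grad_f))


if __name__ == "__main__":
    # Illustrative values only (assumptions).
    w = np.array([0.5, -1.0, 2.0])
    x = np.array([1.0, 0.3, -0.2])
    y, sigma = 0.7, 0.1
    print("noisy criterion (Monte-Carlo):", noisy_criterion(w, x, y, sigma))
    print("criterion + gradient penalty :", regularized_criterion(w, x, y, sigma))
```

For a non-linear predictor the two quantities only agree up to the Hessian term and the neglected higher-order terms, which is where the small-noise assumption matters.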

Indirect reward regularization
The theoretical guarantees we have derived (cf. Theorems 1, 2 and Corollary 1) all build on the premise that the reward $r_\varphi$ is $\delta$-Lipschitz over the joint input space $\mathcal{S} \times \mathcal{A}$, i.e. that $\|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$. Crucially, we do not enforce this regularity property directly in practice, but instead urge the discriminator $D_\varphi$ to be $k$-Lipschitz by restricting the norm of the Jacobian of the latter via regularization [cf. Eq. (2)]. We here set out to figure out to what extent the $k$-Lipschitzness enforced onto $D_\varphi$ propagates and transfers to $r_\varphi$; in particular, whether it results in the indirectly-urged $\delta$-Lipschitzness of $r_\varphi$, with $\delta \neq k$ outside of edge cases. While $k$ is fixed throughout the lifetime of the agent, $\delta$ need not be. As such, discussing the behavior of this evolving Lipschitz constant w.r.t. the learning dynamics is crucial to better understand when the guarantees we have just derived (whose main premise is $\|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$) apply in practice. As laid out earlier in Sect. 4, in this work, we consider two forms of reward, crafted purely from the scores returned by $D_\varphi$: the minimax (saturating) one $r^{\mathrm{MM}}_\varphi := -\log(1 - D_\varphi)$ and the non-saturating one $r^{\mathrm{NS}}_\varphi := \log(D_\varphi)$ (names purposely chosen to echo their counterpart GAN generator loss). Although we opted for the minimax form (based on the ablation study we carried out on the matter, cf. Appendix 6), we here tackle and discuss both forms, as we suspect there could be more to it than just zero-order numerics. Analyzing first-order behavior is the crux of most GAN design breakthroughs, which is far from surprising, considering how intertwined the inner networks are (generator $G$, and discriminator $D$). Yet, in adversarial IL, the policy (playing the role of $G$) does not receive gradients flowing back from $D$ like in GANs. Instead, it gets a reward signal crafted from $D$'s returned scalar value, detached from the computational graph, and tries to maximize it over time via policy-gradient optimization.
The discussion in adversarial IL has thus always been limited to the numerics of the reward signal and how to shape it in a way that facilitates the resolution of the task at hand (similarly to how we discuss the impact of its shape when reporting our last empirical findings of Sect. 5.5).
By contrast, we here are interested in the gradients of these rewards ($r^{\mathrm{MM}}_\varphi$ and $r^{\mathrm{NS}}_\varphi$) in this studied adversarial IL context, with the end-goal of characterizing their Lipschitz-continuity (or absence thereof). Their respective Jacobians' norms, under the setting laid out earlier in Sect. 6.1, are $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F = \|\nabla^t_{s,a}[D_\varphi]_t\|_F / (1 - D_\varphi(s_t, a_t))$ and $\|\nabla^t_{s,a}[r^{\mathrm{NS}}_\varphi]_t\|_F = \|\nabla^t_{s,a}[D_\varphi]_t\|_F / D_\varphi(s_t, a_t)$, with $D_\varphi(s_t, a_t) \in (0, 1)$ ($D_\varphi$'s score is wrapped with a sigmoid). As laid out above, we here posit that $D_\varphi$ is $k$-Lipschitz-continuous as founding assumption: $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$. We can now upper-bound the Jacobians' norms unpacked above with the Lipschitz constant of $D_\varphi$: $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F \le k / (1 - D_\varphi(s_t, a_t))$ and $\|\nabla^t_{s,a}[r^{\mathrm{NS}}_\varphi]_t\|_F \le k / D_\varphi(s_t, a_t)$. Since $D_\varphi(s_t, a_t) \in (0, 1)$, both denominators (for either reward form) are in $(0, 1)$, which makes the Jacobian's norm of either reward form unbounded over its domain (due to $D_\varphi \to 0$ from above for $r^{\mathrm{NS}}_\varphi$; due to $D_\varphi \to 1$ from below for $r^{\mathrm{MM}}_\varphi$), despite $D_\varphi$'s $k$-Lipschitzness. Since treating the entire range of values that can be taken by $D_\varphi(s_t, a_t)$, $(0, 1)$, leads us to a dead end, leaving us unable to upper-bound either $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F$ or $\|\nabla^t_{s,a}[r^{\mathrm{NS}}_\varphi]_t\|_F$, we now adopt a more granular approach and proceed by dichotomy. As such, $\exists\, \ell \in (0, 1)$ verifying $0 < \ell \ll 1$ such that $1 / D_\varphi(s_t, a_t)$, and as a result also $\|\nabla^t_{s,a}[r^{\mathrm{NS}}_\varphi]_t\|_F$, is unbounded when $D_\varphi(s_t, a_t) \in (0, \ell]$ and bounded when $D_\varphi(s_t, a_t) \in (\ell, 1)$. Similarly, $\exists\, L \in (0, 1)$ verifying $0 \ll L < 1$ such that $1 / (1 - D_\varphi(s_t, a_t))$, and as a result also $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F$, is bounded when $D_\varphi(s_t, a_t) \in (0, L]$ and unbounded when $D_\varphi(s_t, a_t) \in (L, 1)$. If we were to figure out the effective range covered by $D_\varphi$'s values throughout the learning process, we might be able to exploit the dichotomy.
In practice, the untrained agent initially performs poorly at the imitation task, and is therefore assigned low scores by $D_\varphi$ (near 0, as "0" is the label assigned to samples from the agent in the classification update $D_\varphi$ goes through every iteration). As learning progresses, the agent's scores gradually shift towards 1 (the label used for expert samples in $D_\varphi$'s update), and optimally converge to the central value of 0.5 in the $(0, 1)$ range that $D_\varphi$ can describe. Indeed, the perfect discriminator consistently predicts scores equal to 0.5 for the agent's actions (Goodfellow 2017): the agent has managed to perfectly confuse $D_\varphi$ as to where the data it is fed comes from (both sources, expert and agent, are perceived as equiprobable). What matters for $\|\nabla^t_{s,a}[r_\varphi]_t\|_F$ (either form) to be bounded in practice is for it to be bounded for values of $D_\varphi$ in $(0, M]$, where $0.5 \le M < 1$ (the values realistically taken by $D_\varphi$ throughout the learning process). Since $M < L$ in effect (for $L$, cf. dichotomy above), we can conclude that $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F$ is effectively bounded: $\exists\, \delta$, $0 \le \delta < +\infty$, such that $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F \le \delta$. We however cannot conclude as such for $\|\nabla^t_{s,a}[r^{\mathrm{NS}}_\varphi]_t\|_F$, however close to zero $\ell$ might be (for $\ell$, cf. dichotomy above). It is not rare for $D_\varphi$ to take 0 as value early in training, which makes $\|\nabla^t_{s,a}[r^{\mathrm{NS}}_\varphi]_t\|_F$ unbounded in the interval described by the values taken by $D_\varphi$ in practice: $(0, M]$. Interestingly, when $D_\varphi$ is near 0 early in training, $k / (1 - D_\varphi) \approx k$. The lowest upper-bound for $\|\nabla^t_{s,a}[r^{\mathrm{MM}}_\varphi]_t\|_F$ is therefore $\delta \approx k$, and can only happen early in the training process, when $D_\varphi$ correctly classifies the agent's actions as coming from the agent. In other words, the Lipschitz constant of $r^{\mathrm{MM}}_\varphi$ is at its lowest early in training. Besides, as the agent becomes more proficient at mimicking the expert and therefore collects higher scores from $D_\varphi$, $\delta$ increases monotonically and grows away from its initial value $k$.
Compared to the alternative (highest Lipschitz constant early in training and then monotonically decreasing as the scores increase when the agent gets better at the task, nearing the lowest value of $k$ when $D_\varphi \to 1$), which as it turns out is exactly the behavior adopted by $r^{\mathrm{NS}}_\varphi$, the behavior of $r^{\mathrm{MM}}_\varphi$ is far more desirable.
Crucially, to sum up, $r^{\mathrm{NS}}_\varphi$ is not Lipschitz early in training, when the agent would benefit most from regularity in the reward landscape. $r^{\mathrm{MM}}_\varphi$ however is Lipschitz-continuous early in training, with the lowest Lipschitz constant of its lifetime, which aligns with the Lipschitz constant enforced on $D_\varphi$ ($\delta \approx k$). As such, $r^{\mathrm{MM}}_\varphi$ is at its most regular when the agent needs it most (early, when it knows nothing), and then becomes less and less restrictive (the Lipschitz constant increases) as the agent collects higher similarity scores with the expert from $D_\varphi$. One could therefore see $r^{\mathrm{MM}}_\varphi$ as having built-in "training wheels", which gradually phase out as the agent becomes better, providing less safety as the agent becomes more proficient at the imitation task. To conclude this discussion point, with the minimax reward form $r_\varphi := r^{\mathrm{MM}}_\varphi$, we have $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k \implies \|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$. This means that the premise of our theoretical guarantees, which posits that the reward is $\delta$-Lipschitz-continuous, can be satisfied in practice by enforcing $k$-Lipschitz-continuity on $D_\varphi$ via gradient penalty regularization [cf. Eq. (14)]. This is not the case when $r_\varphi := r^{\mathrm{NS}}_\varphi$. We propose this analytical observation as an explanation as to why using $r^{\mathrm{NS}}_\varphi$ yields such poor results in our reported ablation, cf. Appendix 6. Our discussion detaches itself from the one adopting a zero-order numerics scope, laid out in Sect. 5.5, by discussing first-order numerics instead, which blends into our Lipschitzness narrative.
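The contrast between the two reward forms can be made tangible by evaluating the chain-rule factors $1 / (1 - D_\varphi)$ and $1 / D_\varphi$ over a range of discriminator scores. The following sketch does so and verifies each factor against a finite-difference slope of the corresponding reward. The function names and the choice $k = 1$ are illustrative assumptions.

```python
import math


def reward_mm(d: float) -> float:
    """Minimax (saturating) reward, -log(1 - D)."""
    return -math.log(1.0 - d)


def reward_ns(d: float) -> float:
    """Non-saturating reward, log(D)."""
    return math.log(d)


def bound_mm(d: float, k: float = 1.0) -> float:
    """Upper bound k / (1 - D) on the Jacobian norm of the minimax reward."""
    return k / (1.0 - d)


def bound_ns(d: float, k: float = 1.0) -> float:
    """Upper bound k / D on the Jacobian norm of the non-saturating reward."""
    return k / d


def numerical_slope(reward, d: float, h: float = 1e-6) -> float:
    """Central finite difference of a reward w.r.t. the discriminator score."""
    return (reward(d + h) - reward(d - h)) / (2.0 * h)


if __name__ == "__main__":
    for d in (0.01, 0.1, 0.5, 0.9):
        print(f"D={d:.2f}  bound_mm={bound_mm(d):8.3f}  bound_ns={bound_ns(d):8.3f}")
```

For scores close to 0 the minimax bound stays near $k$ while the non-saturating one blows up, and the two bounds coincide at $D_\varphi = 0.5$.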

Local smoothness
The local Lipschitzness assumption is reminiscent of many theoretical results in the study of robustness to adversarial examples. Notably, Yang et al. (2020) show that local Lipschitzness is correlated with empirical robustness and accuracy in various benchmark datasets. As mentioned when we justified the local nature of the Lipschitz-continuity notion tackled in this work (cf. Definition 1), we optimize the different modules over mini-batches of samples. While forcing the constraint to be satisfied globally might be feasible in some low-dimensional supervised or unsupervised learning problems, the notion of fixed dataset does not exist a priori in reinforcement learning. Section 6.3 describes, compares and discusses the effect of where the local Lipschitzness constraint is enforced (e.g. expert demonstration manifold, fictitious replay experiences). Wherever the regularizer is applied, the constraint is local nonetheless. One can therefore not guarantee that the $\delta$-Lipschitz-continuity of $r_\varphi$, formalized as $\|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$, and urged by enforcing $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$ via gradient penalization (cf. our previous discussion on indirect reward regularization in Sect. 6.2.5), will be satisfied everywhere in $\mathcal{S} \times \mathcal{A}$. Moreover, considering that Theorem 2 and Corollary 1 rely on the satisfaction of the constraint on $r_\varphi$ along every trajectory, which is likely not to be verified in practice, we can say with high confidence that the constraint on $Q_\varphi$, $\|\nabla^t_{s,a}[Q_\varphi]_t\|_F \le \Delta_\infty$, will not be satisfied over the whole joint input space either. Still, we can hope to enhance the coverage of the subspace on which the constraint $\|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$ is satisfied, dubbed ℭ, by doing more $r_\varphi$ learning updates with the regularizer (technically, $D_\varphi$ learning updates encouraging $D_\varphi$ to satisfy $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$ via gradient penalization, cf. Eq. (14)).
From this point onward, we will qualify a state-action pair $(s_t, a_t)$ (equivalently, an action $a_t$ in a given state $s_t$) as "ℭ-valid" if it belongs to ℭ, i.e. if $r_\varphi$ is $\delta$-Lipschitz at this point, verifying $\|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$. Note, the notion of ℭ-validity is inherently local, since we have defined the notion for a single given input pair $(s_t, a_t)$. As such, future statements about ℭ-validity will all be local ones in essence. In addition, despite having $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k \implies \|\nabla^t_{s,a}[r_\varphi]_t\|_F \le \delta$ in practice for the minimax reward form (cf. our previous discussion on indirect reward regularization in Sect. 6.2.5), there is not an exact equivalence between $r_\varphi$ being $\delta$-Lipschitz and $D_\varphi$ being $k$-Lipschitz in theory. Therefore, we will qualify a state-action pair $(s_t, a_t)$ as "approximately ℭ-valid" if $D_\varphi$ is $k$-Lipschitz at this point, verifying $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$. As has been made clear by now, $D_\varphi$'s $k$-Lipschitzness is encouraged by plugging a gradient penalty regularizer $\mathfrak{R}(k)$ into $D_\varphi$'s loss [cf. Eq. (14)]. Despite being encouraged, $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$ can nonetheless not be guaranteed solely from the application of the regularizer at $(s_t, a_t)$. As such, to cover all bases, we will qualify a state-action pair $(s_t, a_t)$ as "probably approximately ℭ-valid" if $(s_t, a_t)$ is in the support of the distribution $\zeta$ that determines where the gradient penalty regularizer $\mathfrak{R}(k)$ is applied in $\mathcal{S} \times \mathcal{A}$, i.e. if $\operatorname{supp} \zeta \ni (s_t, a_t)$. A probably approximately ℭ-valid point is supported by the distribution that describes where $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$ is enforced, and as such, $\mathfrak{R}(k)$ may be applied at this point.
Importantly, the policy might, due to its exploratory motivations, pick an action $a_t$ in state $s_t$ that is not ℭ-valid. Depending on where the constraint is then enforced, the sample might become ℭ-valid after $r_\varphi$'s update (technically, indirectly via $D_\varphi$'s update; cf. Sect. 6.3). This observation motivates the investigation we carry out in Sect. 6.4, in which we define a soft ℭ-validity pseudo-indicator [cf. Eq. (67)] that enables us to assess whether the agent consistently performs approximately ℭ-valid actions when it interacts with the MDP following its policy.

A new reinforcement learning perspective on gradient penalty
We begin by considering a few variants of the original gradient penalty regularizer (Gulrajani et al. 2017) introduced in Sect. 5.4. Each variant corresponds to a particular case of the generalized version of the regularizer, described in Eq. (14). Subsuming all versions, Eq. (14) is parameterized by $\zeta$, the distribution that describes where the regularizer is applied, i.e. where the Lipschitz-continuity constraint is enforced in the input space $\mathcal{S} \times \mathcal{A}$. In Gulrajani et al. (2017), $\zeta$ corresponds to sampling points uniformly along segments joining samples generated by the agent following its policy and samples generated by the expert policy, i.e. samples from the expert demonstrations $\mathcal{D}$. Formally, focusing on the action only for legibility (the counterpart formalism for the state is derived easily by using the visitation distribution instead of the policy), $a \sim \zeta$ means $a = u\, a' + (1 - u)\, a''$, where $a'$ is sampled from the agent's policy, $a'' \sim \pi_e$, and $u \sim \mathrm{unif}(0, 1)$. The distribution $\zeta$ we have just described corresponds to the transposition of the GAN formulation to the GAIL setting, which is an on-policy setting. Therefore, in this work, we amend the $\zeta$ previously described, and replace it with its off-policy counterpart, where $a'$ is drawn from the off-policy sampling distribution (cf. Sect. 4). As for the penalty target, Gulrajani et al. (2017) use $k = 1$, in line with the theoretical result derived by the authors. By contrast, DRAGAN (Kodali et al. 2017) uses a $\zeta$ such that $a \sim \zeta$ means $a = a'' + \epsilon$, where $a'' \sim \pi_e$, and $\epsilon \sim \mathcal{N}(0, 10)$. Like WGAN-GP (Gulrajani et al. 2017), DRAGAN uses the penalty target $k = 1$. Finally, for the sake of symmetry, we introduce a reversed version of DRAGAN, dubbed NAGARD (name reversed). To the best of our knowledge, the method has not been explored in the literature. NAGARD also uses $k = 1$ as penalty target, but perturbs the policy-generated samples as opposed to the expert ones: $a \sim \zeta$ means $a = a' + \epsilon$, where $a'$ is drawn as in the off-policy setting above, and $\epsilon \sim \mathcal{N}(0, 10)$.
We use $\lambda = 10$ in all the variants, in line with the original hyper-parameter settings in Gulrajani et al. (2017) and Kodali et al. (2017). Figure 7 depicts in green the subspace of the input space $\mathcal{S} \times \mathcal{A}$ where the $k$-Lipschitz-continuity constraint, formalized as $\|\nabla^t_{s,a}[D_\varphi]_t\|_F \le k$, and encouraged in GP by $\mathfrak{R}(k)$, is applied. In other words, Fig. 7 highlights the support of the distribution $\zeta$ for each variant described above. As such, the green areas in Fig. 7a, b, and c are schematic depictions of where the state-action pairs are probably approximately ℭ-valid.
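The three choices of $\zeta$ differ only in how the points at which the penalty is evaluated are drawn, which the sketch below illustrates on toy data. It is not the implementation used in this work: the discriminator is a hypothetical logistic model whose input-gradient is available in closed form, the penalty takes the two-sided form $\lambda\, \mathbb{E}[(\|\nabla D\| - k)^2]$ of Gulrajani et al. (2017), and the "10" in $\mathcal{N}(0, 10)$ is read as a variance. All three are assumptions, as are the function names.

```python
import numpy as np


def sample_zeta(agent, expert, variant, rng, noise_std=np.sqrt(10.0)):
    """Draw the points where the penalty is applied, one per (agent, expert) pair."""
    if variant == "wgan-gp":  # uniform interpolation between agent and expert samples
        u = rng.uniform(0.0, 1.0, size=(agent.shape[0], 1))
        return u * agent + (1.0 - u) * expert
    if variant == "dragan":   # Gaussian perturbation of the expert samples
        return expert + rng.normal(0.0, noise_std, size=expert.shape)
    if variant == "nagard":   # Gaussian perturbation of the agent samples
        return agent + rng.normal(0.0, noise_std, size=agent.shape)
    raise ValueError(f"unknown variant: {variant}")


def grad_norm_logistic(points, w, b):
    """Input-gradient norm of D(x) = sigmoid(w . x + b), in closed form."""
    d = 1.0 / (1.0 + np.exp(-(points @ w + b)))
    return d * (1.0 - d) * np.linalg.norm(w)


def gradient_penalty(points, w, b, k=1.0, lam=10.0):
    """Two-sided penalty lam * mean((||grad D|| - k)^2) over the sampled points."""
    return float(lam * np.mean((grad_norm_logistic(points, w, b) - k) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    agent = rng.normal(-1.0, 0.5, size=(64, 3))   # toy agent-generated samples
    expert = rng.normal(1.0, 0.5, size=(64, 3))   # toy expert samples
    w, b = np.array([2.0, -1.0, 0.5]), 0.0        # toy discriminator parameters
    for variant in ("wgan-gp", "dragan", "nagard"):
        pts = sample_zeta(agent, expert, variant, rng)
        print(variant, "penalty:", gradient_penalty(pts, w, b))
```

In a neural-network setting the closed-form gradient would be replaced by automatic differentiation of the discriminator w.r.t. its inputs, while the sampling step would stay the same.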
One conceptual difference between the DRAGAN penalty and the two others is that the support of the distribution $\zeta$ does not change throughout the entire training process for the former, while it does for the latter two. Borrowing the intuitive terminology used in Kodali et al. (2017), WGAN-GP proposes a coupled penalty, while DRAGAN (like NAGARD) proposes a local penalty. In Kodali et al. (2017), the authors perform a comprehensive empirical study of mode collapse, and diagnose that the generator collapsing to single modes is often coupled with the discriminator displaying sharp gradients around the samples from the real distribution. In model-free generative adversarial imitation learning, the generator does not have access to the gradient of the discriminator with respect to its actions in the backward pass, although it could be somewhat accessed using a model-based approach (Baram et al. 2017). In spite of not being accessible per se, the sharpness of the discriminator's gradients near real samples observed in Kodali et al. (2017) translates, in the setting considered in this work, to sharp rewards, which we referred to as reward overfitting and discussed thoroughly in Sect. 5.3. As such, mode collapse mitigation in the GAN setting translates to a problem of credit assignment in our setting, caused by the peaked reward landscape (cf. Appendix 7 to witness the sensitivity w.r.t. the discount factor $\gamma$, controlling how far ahead in the episode the agent looks). The stability issues the methods incur in either setting are on par. Both gradient penalty regularizers aim to address these stability weaknesses, and do so by enforcing a Lipschitz-continuity constraint, albeit on a different support $\operatorname{supp} \zeta$ (cf. Fig. 7). As mentioned earlier in Sect. 5.4, the distribution $\zeta$ used in WGAN-GP (Gulrajani et al. 2017) is motivated by the fact that, as they show in their work, the optimal discriminator is 1-Lipschitz along lines joining real and fake samples.
The authors of Kodali et al. (2017) deem the assumptions underlying this result to be unrealistic, which naturally weakens the ensuing method derived from this line of reasoning. They instead propose DRAGAN, whose justification is straightforward: since they witness sharp discriminator gradients around real samples, they introduce a local penalty that aims to smooth out the gradients of the discriminator around the real data points. Formally, as described above when defining the distribution $\zeta$ associated with the approach, it tries to ensure Lipschitz-continuity of the discriminator in the neighborhoods (additive Gaussian noise perturbations) of the real samples. The generator or policy is more likely to escape the narrow peaks of the optimization landscape, corresponding to the real data points, with this extra stochasticity. In fine, in our setting, DRAGAN can dial down the sharpness of the reward landscape at expert samples the discriminator overfits on. This technique should therefore fully address the shortcomings raised and discussed in Sect. 5.4. While the method seems to yield better results than WGAN-GP in generative modeling with generative adversarial nets, the empirical results we report in Fig. 8 show otherwise. All the considered penalties help close the significant performance gap reported in Fig. 3, in almost every environment, but the penalty from WGAN-GP generally pulls ahead. Additionally, not only does it display higher empirical return, it also crucially exhibits more stable and less jittery behavior.
Despite the apparent disadvantage of local penalties (DRAGAN (Kodali et al. 2017) and NAGARD) compared to WGAN-GP in terms of their schematically-depicted supp sizes (cf. Fig. 7), it is important to remember that the additive Gaussian perturbation is distributed as N(0, 10) . For these local methods, is therefore covering a large 3 area around the central sample, including with high probability samples that are, according to the discriminator, from both categories-fake samples (predicted as from ), and real samples (predicted as from e ). As such, the perceived diameter of the green disks in the schematic representations in Fig. 7b and c maybe smaller than it would be in reality. It is crucial to consider the coverage of the different distributions as they determine how strongly the Lipschitz-continuity property is potentially enforced at a given state-action pair, for a fixed number of discriminator updates. Consequently, for a given optimization step, while the local penalties are-somewhat ironically-applying the Lipschitz-continuity constraint on data points scattered around the agent-(NAGARD) or expert-generated (DRAGAN) samples, the supp for WGAN-GP is less diffuse. Local penalties ensure the Lipschitzness is somewhat satisfied all around the selected samples, which for DRA-GAN is motivated by the fact that there are narrow peaks on the reward landscape located at the expert samples, where it us prone to overfit (cf. Sect. 5.3). The distribution used in WGAN-GP also supports data points near expert samples, but these are not scattered all around for the sole purpose of making the whole area smooth and escape bad basins of attraction like in DRA-GAN. In other terms, the Lipschitz-continuity constraint is applied isotropically, from the original expert sample outwards. By contrast, WGAN-GP's only supports a few discrete directions from a given expert sample, the lines joining said sample to all the agent-generated samples (of the mini-batch). 
Intuitively, while DRAGAN smooths out the reward landscape starting from expert data points and going in every direction from there, WGAN-GP smooths out the reward landscape starting from expert data points and going only in the directions that point toward agent-generated data points. As such, one could qualify DRAGAN as an isotropic regularizer, and WGAN-GP as a directed regularizer.
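To make the isotropic versus directed distinction concrete, the following minimal sketch draws the points at which each penalty would be applied. It is an illustration under stated assumptions, not the implementation used in this work: the function names `wgan_gp_points` and `local_points`, the toy two-dimensional samples, and the reading of N(0, 10) as a variance of 10 (hence a standard deviation of sqrt(10)) are all hypothetical choices made here.

```python
import math
import random


def wgan_gp_points(expert, agent, rng):
    """Directed scheme: one point per (expert, agent) pair, drawn uniformly
    on the segment joining the two samples."""
    points = []
    for e, a in zip(expert, agent):
        u = rng.random()  # interpolation coefficient in [0, 1)
        points.append([u * ei + (1.0 - u) * ai for ei, ai in zip(e, a)])
    return points


def local_points(centers, noise_std, rng):
    """Isotropic scheme: additive Gaussian perturbation of each center.
    Centers are expert samples for DRAGAN, agent samples for NAGARD."""
    return [[c + rng.gauss(0.0, noise_std) for c in center] for center in centers]


rng = random.Random(0)
expert = [[1.0, 2.0], [0.0, 0.0]]  # toy state-action pairs, not real data
agent = [[3.0, -1.0], [4.0, 4.0]]  # toy state-action pairs, not real data

noise_std = math.sqrt(10.0)  # assumption: the 10 in N(0, 10) is a variance
directed = wgan_gp_points(expert, agent, rng)
dragan = local_points(expert, noise_std, rng)
nagard = local_points(agent, noise_std, rng)
```

The directed points always fall between an expert sample and an agent sample, so their spread shrinks as the agent approaches the expert, whereas the spread of the local points is set by `noise_std` alone.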
We believe that WGAN-GP outperforms DRAGAN in the setting and environments considered in this work (cf. Fig. 8) because the agent benefits from having smooth reward pathways in the reward landscape in-between agent samples and expert samples. Along these pathways, going from the agent sample end to the expert sample end, the reward progressively increases. For the agent trying to maximize its return, this series of gradually increasing rewards joining agent to expert data points is akin to an automatic curriculum (Karpathy and Van De Panne 2012; OpenAI 2019) assisting the reward-driven agent and leading it towards the expert. Figure 8 shows that WGAN-GP indeed achieves consistently better results across every environment but the least challenging one, the IDP environment (cf. Table 1). In the four considerably more challenging environments, the directed method allows the agent to attain overall significantly higher empirical return than its competitors. Besides, it displays greater stability when approaching the asymptotic regime, whereas the local regularizers clearly suffer from instabilities, especially DRAGAN in the Walker2d and HalfCheetah environments, depicted in Fig. 8. While the interpretation laid out previously corroborates the results reported in Fig. 8, it does not explain the instability issues hindering the local penalties. We believe the jittery behavior observed in the Walker2d and HalfCheetah environments (cf. Fig. 8) once the peak performance is attained is caused by the support of the penalty distribution (green areas in Fig. 7) not changing in size as the agent learns to imitate and gets closer to the expert in S × A.
Indeed, in DRAGAN, the penalty distribution is stationary: it applies the regularizer on perturbations of the expert samples, where the additive noise's underlying sufficient statistics are constant throughout the learning process, and where the expert data points are distributed according to the stationary expert policy and its associated state visitation distribution. For NAGARD, the perturbations follow the same distribution, and remain constant across the updates. However, unlike DRAGAN, the penalty distribution is defined by adding the stationary noise to samples from the current agent, every update, distributed according to the off-policy sampling distribution of our setting. Since this sampling distribution is by construction non-stationary across the updates, as a mixture of past updates, the penalty distribution is non-stationary in NAGARD. Despite the two penalty distributions having these different support and stationarity traits, the results of both local penalties are surprisingly similar. This is due to the variance of the additive noise used in both methods being large relative to the distance between the expert and agent samples, at all times, in the considered environments. As such, their supports are virtually overlapping, which makes the two local penalties virtually equivalent, and explains the observed similarities between them.
Fig. 8 Evaluation of gradient penalty variants. Explanation in text. Runtime is 48 h
Coming back to the main point ("why do local penalties suffer from instabilities at the end of training?"): even though the agent samples are close to the expert ones, the local methods both apply the same large perturbation before applying the Lipschitz-continuity penalty. The probability mass assigned by the penalty distribution is therefore still spread similarly over the input space, and is therefore severely decreased in-between agent and expert samples since these are getting closer in the space. The local methods are therefore often applying the constraint on data points that the policy will never visit again (since it wants to move towards the expert) and, equivalently, rarely enforce the constraint between the agent and the expert, which is where the agent should be encouraged to go. With this depiction, it is clearer why WGAN-GP pulls ahead. Compared to the fixed size of the support in the local penalties, WGAN-GP's distribution adapts to the current needs of the agent (hence qualifying as non-stationary). As the agent gets closer to the expert, Lipschitz-continuity is always enforced on data points between them, which is where it potentially benefits the agent most. The support of WGAN-GP's distribution is therefore decreasing in size as the iterations go by, focusing its probability mass where enforcing a smooth reward landscape matters most: where the agent should go, i.e. in the direction of the expert data points.
Besides, considering the inherent sample selection bias (Heckman 1979) the control agent is subjected to, where the latter ends up in S × A depends on its actions, in every interaction with the dynamical system represented by its environment. This aspect dramatically differs from the traditional non-Markovian GAN setting (in which these penalties were introduced), where the generator's input noise is i.i.d.-sampled. Indeed, suffering from said sample selection bias, an imitation agent straying from the expert demonstrations is likely to keep on doing so until the episode is reset (cf. discussion in Sect. 5.4). Distributions whose definition involves samples generated by the learning agent, and which adapt to the agent's current relative position w.r.t. the expert data points, therefore provide valuable extra guidance in Markovian settings. Additionally, assuming the input also contained the phase ("how far the agent/expert is in the current episode", 0 ≤ t ≤ T) [like in Peng et al. (2018)], not only would the imitation task be easier, but the benefits of the WGAN-GP penalty would be further enhanced, as it would allow the models to exploit the temporal structure of the considered Markovian setting.
Finally, in reaction to the recent interest towards "zero-centered" gradient penalties (Roth et al. 2017; Mescheder et al. 2018), due to the theoretical convergence guarantees they allow for, we have conducted a grid search on the values of the Lipschitz constant k and the regularizer importance coefficient, as described in Sect. 6.3. The results are reported in Appendix 5.3. In short, the method performs poorly when k = 0, unless a very small value is used for the coefficient. Enforcing 0-Lipschitzness is far too restrictive for the agent to learn anything, unless this constraint is only loosely imposed. Conversely, a smaller coefficient value yields worse results when k = 1, revealing the interaction between the two gradient penalty hyper-parameters. In particular, we will momentarily provide comprehensive evidence along with a greater characterization of how the choice of scaling factor not only impacts the agent's performance (which is already depicted in Appendix 5.3), but how it correlates quantitatively with the approximate ℭ-validity displayed by the agent (cf. Sect. 6.4). Unless explicitly stated otherwise, we use the WGAN-GP penalty variant, with Lipschitz constant target k = 1, and a scaling coefficient equal to 10 throughout the empirical results exhibited in both the body and appendix.

Diagnosing ℭ-validity: Is the Lipschitzness premise of the theoretical guarantees satisfied in practice?
To put things in perspective, we first compare, side by side, what we set out to tackle here with what we have just tackled in Sect. 6.3. In the previous section, we showed how (a) the choice of penalty distribution (where do we want to encourage approximately ℭ-valid behavior), and (b) the choice of scaling coefficient (to what degree do we want to encourage approximately ℭ-valid behavior) both independently impact the agent's performance in terms of empirical episodic return. In this section on the other hand, we will show how (a) the choice of penalty distribution, and (b) the choice of scaling coefficient both independently impact the agent's consistency at effectively selecting approximately ℭ-valid actions with its learned policy. If we were to find a strong positive correlation between the agent's asymptotic return and its effectively measured approximate ℭ-validity rate (high when high, low when low, for all tested distributions and for all tested coefficients), then we would have further quantitative evidence to support our work's main claim: reward Lipschitzness is necessary to achieve high return, and higher Lipschitzness uptime correlates strongly with higher return. Perhaps most crucially, we would be able to correlate high empirical episodic return with high chance of satisfying the premise of our theoretical guarantees (the Lipschitzness of r). As such, these would consequently apply in practice too. This would attest to the practical relevance of Sect. 6.1. We have shown that enforcing a Lipschitz-continuity constraint on the learned reward r (albeit indirectly via D) is instrumental in achieving expert-level performance in off-policy generative adversarial imitation learning (cf. Sect. 5.5). We have also shown that directed regularization techniques yield better results, seemingly due to the better guidance they provide to the mimicking agent, in the form of an automatic curriculum of rewards towards the expert data points (cf. Sect. 6.3).
Such a curriculum only exists where the Lipschitz-continuity constraint is satisfied. Said differently, it could not exist if the constraint were not satisfied along the pathways supported by the penalty distribution, which would then involve non-smooth hurdles. It is therefore crucially important for said constraint to be satisfied in effect for the state-action pairs $(s_t, a_t)$ in the support of the policy the agent uses in its learning update. Still, the deterministic policy likely performs only approximately ℭ-valid actions as it is trained with the sole objective to maximize cumulative rewards that represent its similarity w.r.t. the expert. The imitation rewards corresponding to a greater degree of similarity are, by design of the generative adversarial imitation learning framework, situated between the agent's current position and the expert's position on the current reward landscape. Since this is where we apply the Lipschitzness constraint (with WGAN-GP, our baseline, as said above), or equivalently, since these regions are approximately ℭ-valid, the policy is likely to never select ℭ-invalid actions as it optimizes for its utility function (cf. Sect. 3). Conversely, in the considered setting, picking ℭ-invalid actions could in theory hinder the optimization process the policy is subject to, as the policy would a priori venture in regions of the state-action space that do not increase its similarity with the expert policy, or, at the very least, for which the non-satisfaction of the reward's Lipschitz-continuity premise $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta$ might lead to instabilities due to $\|\nabla^{t}_{s,a}[Q]_{t}\|_{F} > \Delta_{\infty}$ as a direct consequence of our theoretical guarantees (cf. Sect. 6.2).
Since we do not have such a tight control over where and to what degree the Lipschitzness constraint over the reward r is satisfied (hence our introduction of the notions of approximately ℭ-valid samples and probably approximately ℭ-valid samples), we instead turn to the closest surrogate over which we do have a tighter control: where and to what degree the constraint on D is enforced. The "where" is controlled by the choice of penalty distribution (determined by the gradient penalty regularization method in use), and the "to what degree" by the choice of scaling coefficient. Still, even when the constraint on D is enforced by adding the regularizer ℜ(k) as in GP [cf. Eq. (14)] at the point $(s_t, a_t)$, the most we could say is that $(s_t, a_t)$ is probably approximately ℭ-valid, since $(s_t, a_t)$ then belongs to the support of the penalty distribution; otherwise, the gradient penalty regularizer ℜ(k) could never have been applied at that point in the landscape S × A. In effect, enforcing the constraint at the point is not enough to guarantee that $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$, and we therefore do not know whether $(s_t, a_t)$ is approximately ℭ-valid, or not. As a direct consequence, we can a fortiori not guarantee that $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta$; we do not know whether $(s_t, a_t)$ is ℭ-valid, or not (cf. Sect. 6.2.5 for our discussion on indirect reward regularization, in which we establish that the k-Lipschitzness of D causes r to be δ-Lipschitz in practice). On the flip side, based on the latter result about indirect Lipschitz-continuity inducement, we can state that ensuring empirically that $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ is enough to ensure that $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta$ is verified in practice. In other words, showing that $(s_t, a_t)$ is approximately ℭ-valid can be used as a proxy for showing that $(s_t, a_t)$ is ℭ-valid, empirically. As such, in order to assess whether the premise of the theoretical guarantees we derived in Sect. 6.1 is satisfied in practice (the δ-Lipschitz-continuity of r), it is sufficient to assess whether the actions $a_t$ selected by the agent's policy at $s_t$ are approximately ℭ-valid. In particular, we want to know the relative impacts the choices of penalty distribution and scaling coefficient in GP have on the propensity for an action from the policy to be approximately ℭ-valid.
So as to estimate how often the actions selected by the agent via its policy are approximately ℭ-valid, we build an estimator that softly approximates $\mathbb{1}_{ℭ} : S \times A \to \{0, 1\}$, the indicator of the ℭ-validity subspace over S × A, where $\mathbb{1}_{ℭ}(s_t, a_t) = 1$ when $(s_t, a_t) \in ℭ$, and $\mathbb{1}_{ℭ}(s_t, a_t) = 0$ when $(s_t, a_t) \notin ℭ$. Accordingly, we call our estimator the soft approximate ℭ-validity pseudo-indicator, implementing a soft, $C^0$ mapping $\tilde{\mathbb{1}}_{ℭ} : S \times A \to (0, 1]$, and formally defined as, ∀t ∈ [0, T] ∩ ℕ, ∀$(s_t, a_t)$ ∈ S × A:
$$\tilde{\mathbb{1}}_{ℭ}(s_t, a_t) := \exp\Big(-\max\big(0, \|\nabla_{s_t, a_t} D(s_t, a_t)\| - k\big)^{2}\Big) \qquad (67)$$
Thus, for a given pair $(s_t, a_t)$, the pseudo-indicator equals 1 when the gradient norm of D is at most k, and decays towards 0 as that norm grows beyond k.
Figures 9 and 10 depict respectively the evolution of the values taken by the soft approximate ℭ-validity pseudo-indicator [cf. Eq. (67)] for different choices of penalty distribution (different gradient penalty variants) and of scaling coefficient (sweep over the scaling factor of ℜ(k)). In Figs. 9 and 10, we also share the return accumulated by the agents throughout their respective training periods (cf. Figs. 9a and 10a, respectively). In particular, what we report in Figs. 9a and 10a echoes what we have already reported in Figs. 8 and 16, but the settings in which the agents were trained differ (ever so) slightly. We indicate the specificities of the setting tackled in this section below, in this very paragraph. Still, since their settings do not match perfectly, we report their return along their soft approximate ℭ-validity pseudo-indicator values. We monitor and record these values during the evaluation trials the agent periodically goes through, in which the agent uses its policy to decide what to do in a given state. To best align with the definition of Lipschitz-continuity (cf. Definition 1), which is also how we designed our soft approximate ℭ-validity pseudo-indicator, we use one-sided gradient penalties ℜ(k) in the sweep, $\max(0, \|\nabla_{s_t, a_t} D(s_t, a_t)\| - k)^{2}$, which purely encourages $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ to be satisfied (nothing more, nothing less), although we have shown the variant presents very little empirical difference with the base two-sided one (cf. ablation in Appendix 5.1). It is worth noting that the experiments whose results are reported in Figs. 9 and 10 carry out fewer iterations during the fixed allowed runtime, due to the substantial cost entailed by computing the soft approximate ℭ-validity pseudo-indicator at every single evaluation step, in every evaluation trial. One could cut down that cost simply by evaluating the pseudo-indicator less frequently, but we decided otherwise, as we gave priority to having a finer tracking of its values. Besides, despite this slight apparent hindrance, the values of the proposed pseudo-indicator reported in either figure seem to have reached maturity, nearing their asymptotic regime, in the allowed runtime. We now go over and interpret the results reported in both figures.
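As a reading aid, the sketch below evaluates the one-sided and two-sided penalty terms and the soft pseudo-indicator for a given gradient norm. It assumes the gradient norm of D at $(s_t, a_t)$ has already been obtained (for instance by automatic differentiation, which is not shown), and it uses the form of the pseudo-indicator stated later in the text, namely the Boltzmann kernel with unit inverse temperature and energy max(0, ‖∇D‖ − k)². The function names are hypothetical.

```python
import math


def one_sided_penalty(grad_norm, k=1.0):
    """max(0, ||grad D|| - k)^2: zero whenever the norm is at most k."""
    return max(0.0, grad_norm - k) ** 2


def two_sided_penalty(grad_norm, k=1.0):
    """(||grad D|| - k)^2: also penalizes norms below k."""
    return (grad_norm - k) ** 2


def soft_validity(grad_norm, k=1.0):
    """Soft pseudo-indicator: exp(-one-sided penalty), with values in (0, 1]."""
    return math.exp(-one_sided_penalty(grad_norm, k))


for g in (0.5, 1.0, 1.5, 3.0):
    print(g, one_sided_penalty(g), two_sided_penalty(g), soft_validity(g))
```

With k = 1, a gradient norm of 0.5 leaves the one-sided term at zero and the pseudo-indicator at 1, while the two-sided term is 0.25. This is why the one-sided variant matches the definition of Lipschitz-continuity more closely.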
In Fig. 9, we observe that the monitored soft approximate ℭ-validity pseudo-indicator [cf. Eq. (67)] consistently takes values close to 1 when using the distribution advocated in WGAN-GP to assemble the regularizer ℜ(k). Conversely, not using any gradient penalty regularizer causes the approximate ℭ-validity rate to be in the vicinity of 0. Albeit a priori not surprising, it is still substantially valuable to notice that the k-Lipschitz-continuity of D (and therefore the δ-Lipschitz-continuity of r; cf. Sect. 6.2) never happens by accident (or rather, by chance). As for DRAGAN and NAGARD (both being non-directed gradient penalty schemes, unlike WGAN-GP; cf. Sect. 6.3), both perform similarly across the board in terms of collected pseudo-indicator values. Their recorded soft pseudo-indicator values stay around a fixed value per environment, different for every one of them. These are within the [0.1, 0.7] range, and as such, are definitely encouraging $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ in practice, yet are falling short of achieving the same (a) effective approximate ℭ-validity value, and (b) effective approximate ℭ-validity consistency as WGAN-GP. These phenomena occur consistently across the spectrum of tackled environments.
In Fig. 10, we observe the unsurprising fact that the higher the scaling coefficient is (equivalently, the more we encourage the regularity property $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ to be satisfied), the more $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ is satisfied in effect. Besides confirming that gradient penalization indeed urges Lipschitzness (which we were not doubting), the figure helps us gauge to what degree the value of the scaling coefficient of ℜ(k) in GP [cf. Eq. (14)] affects quantitatively the satisfaction of $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ monitored via the soft pseudo-indicator. We considered powers of 10 for the sweep over the coefficient, tackling the values $10^{i}$, for i ∈ {−3, −2, −1, 0, 1}. The gap between the pseudo-indicator values associated with each of these coefficients differs per environment, but their ranking remains the same (higher pseudo-indicator values for higher i's). At its lowest (i.e. for minimum i: i = −3), the soft pseudo-indicator values lie more often than not near 0. For i = 1, the pseudo-indicator perfectly aligns on the value 1, meaning that the coefficient value we used so far (10, which corresponds to i = 1) is enough for the policy to achieve a 100% satisfaction rate of $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$. The case i = 0 is right on the edge: in some environments, the approximate ℭ-validity exactly equals 1, while for other environments, it nears it, yet does not quite reach it.
Since we use WGAN-GP's distribution in the experiments reported in Fig. 10, we can first conclude that picking WGAN-GP's variant and a scaling coefficient of 10 not only yields the best empirical return (as reported and discussed in Sect. 6.3), but also guarantees that the constraint $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ (and therefore $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta$; cf. Sect. 6.2) is satisfied for 100% of the actions performed by the agent in practice. As such, we can conclude that, in practice, the main premise of the theoretical guarantees we have derived in Sect. 6.1 (the δ-Lipschitz-continuity of the reward, $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta$) is satisfied, hence making our theoretical guarantees practically relevant and insightful. In addition, since we showed that the learning agent's policy (or rather, its companion Q-value) is trained on a reward surrogate r that verifies $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta$ almost 100% of the time, we have empirically shown that the agent effectively sees virtually uninterrupted sequences of smooth rewards. This new observation somewhat corroborates our RL-grounded interpretation of directed gradient penalization as the automated and adaptive creation of reward curricula (cf. Sect. 6.3, and particularly our schematic depiction of the support of WGAN-GP's distribution in Fig. 7a).
Fig. 9 Evaluation of several GP methods differing by their distribution. In line with how we defined it in Eq. (14), the distribution controls "where" the GP constraint is enforced. Also, we report what happens without any GP regularization (NoGP). Explanation in text. Runtime is 48 h
Despite having answered the question we asked in the title of the section (in the block right above), interpreting the findings laid out both in this section and in the previous one side-by-side allows us to draw another critical conclusion, substantially more meaningful than if we were to interpret either in a vacuum. In Sect. 6.3, we studied the impact the penalty distribution and the scaling coefficient both have on the agent's performance, in terms of the empirical return in the MDP. We refer here to the latter via the shorthand Return. In this section, on the other hand, we have studied the impact the penalty distribution and the scaling coefficient both have on the effective approximate ℭ-validity rate of the agent. We refer here to the latter via the shorthand Validity. What emerges from comparing these two sets of results is that, for every given pair of penalty distribution and scaling coefficient (where to apply the gradient penalty, and to what degree, respectively) in GP [cf. Eq. (14)]: low Return co-occurs with low Validity; intermediate Return co-occurs with intermediate Validity; high Return co-occurs with high Validity. Said differently, Return and Validity behave similarly under the various pairings that we have considered. Through these observations, we therefore witness a strong correlation between Return and Validity. Ultimately, by combining our two previous empirical analyses, we have shown that Validity is a good predictor of Return, and vice versa.
In fine, compared to Sect. 5.5, Sect. 6.4 (this section) gives a far more fine-grained diagnostic of how reward Lipschitzness relates to empirical return, along with insights related to the practicality of our theoretical guarantees.

We call this factor a reward preconditioner since it functionally echoes the numerical transformation that conditions the tackled problem into a form that is more amenable to being solved via first-order optimization methods. Since our preconditioner is a scalar, we denote it with a lowercase shorthand to contrast with the usual preconditioning matrices, denoted with capitalization. We have the following ranking of values, depending on the sign of the original learned synthetic reward r: ∀t ∈ [0, T] ∩ ℕ and ∀$(s_t, a_t)$ ∈ S × A, the preconditioned reward at $(s_t, a_t)$ is at most $r(s_t, a_t)$ whenever $r(s_t, a_t) > 0$, and conversely, it is greater than $r(s_t, a_t)$ whenever $r(s_t, a_t) < 0$.
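We posit that the preconditioner does not depend on (i.e., is constant w.r.t.) the current state $s_t$ and action $a_t$: its derivative with respect to $s_t$ is zero, and similarly for its derivative with respect to $a_t$ [Eq. (69)]. Applying such a preconditioner to r therefore squashes the absolute value of r and in effect shrinks the Lipschitz constant of r (assuming here that r is δ-Lipschitz, with $\|\nabla^{t}_{s,a}[r]_{t}\|_{F} \le \delta < +\infty$) without regard to the sign of the signal. Formally, since the preconditioner is posited constant in $s_t$ and $a_t$, we have, ∀t ∈ [0, T] ∩ ℕ and ∀$(s_t, a_t)$ ∈ S × A, that the gradient norm of the preconditioned reward equals the preconditioner times the gradient norm of r, and is therefore at most the preconditioner times δ [Eq. (70)]. That is, if r is δ-Lipschitz-continuous at t, then the preconditioned reward is Lipschitz-continuous at t with a constant equal to the preconditioner's value at t times δ. Importantly, Eq. (70) will be instrumental in proving the first stages of our next theoretical guarantees, in which we deal with the action-value counterpart of the preconditioned reward.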
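The scaling property above can be checked numerically on a toy example. The sketch below is only an illustration: the one-dimensional `reward` function, the constant `alpha` standing in for the preconditioner's value, and the finite-difference estimator `lipschitz_estimate` are all hypothetical stand-ins introduced here, not objects from this work.

```python
import math


def lipschitz_estimate(f, lo, hi, n=2000):
    """Largest finite-difference slope of f over a uniform grid on [lo, hi]."""
    step = (hi - lo) / n
    xs = [lo + i * step for i in range(n + 1)]
    return max(abs(f(b) - f(a)) / step for a, b in zip(xs, xs[1:]))


def reward(x):
    """Hypothetical 1-D stand-in for a learned reward; its slope is at most 3."""
    return math.sin(3.0 * x)


alpha = 0.25  # hypothetical preconditioner value, constant in the input


def preconditioned_reward(x):
    """Reward multiplied by an input-independent factor."""
    return alpha * reward(x)


base = lipschitz_estimate(reward, -2.0, 2.0)
squashed = lipschitz_estimate(preconditioned_reward, -2.0, 2.0)
print(base, squashed, squashed / base)
```

The estimated constant of the preconditioned reward is `alpha` times that of the original one. The argument relies on `alpha` not varying with the input: a factor that depended on `x` would contribute its own derivative term.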
Because of its "reward-squashing" effect, we name the method corresponding to the substitution of r with the preconditioned reward "Pessimistic" Reward Preconditioning Enforcing Lipschitzness. We dub the plug-in technique "PURPLE" (it is an acronym, with minor vowel filling and letter shuffle for legibility and ease of pronunciation). From this point onward, we study the effect of plugging PURPLE into SAM. The pseudo-code of the resulting algorithm can be obtained by replacing the learned reward r in SAM's pseudo-code laid out in Algorithm 1 with the preconditioned reward.
We now study how the injection of PURPLE in SAM impacts the theoretical guarantees we have previously derived in Sect. 6.1. Concretely, we derive the PURPLE counterparts of Lemma 1, Theorems 1, 2, and Corollary 1. In order for us to characterize the Lipschitzness of the preconditioned action-value, we also posit that the introduced preconditioner does not depend on (i.e., is constant w.r.t.) the previously visited (past) states and actions [Eq. (71)]. All in all, to develop the counterpart guarantees that will follow, the preconditioner must possess a handful of properties, the last two of which, Eqs. (69) and (71), can be condensed into a single one: the derivatives of the preconditioner at time t + k with respect to $s_t$ and to $a_t$ are both zero, for every k ≥ 0. In plain English, to get our guarantees, we need the preconditioner to depend on neither the current nor the past states visited and actions taken by the agent. Note, the property that the preconditioner is at most 1 is only ever used in Sect. 6.6.1, and will not be leveraged anywhere else. The developed theory will still hold if there exists t ∈ [0, T] ∩ ℕ at which the preconditioner exceeds 1.
PURPLE in the broader algorithmic landscape
Setting aside the fact that the preconditioner depends on a schedule indexed by the timestep t, PURPLE has the effect of reducing the (policy) gradients received by the GAIL or SAM policy, since it squashes the reward received by the agent. This scales down the gradients traditionally designed for the policy. The most direct adaptation of PURPLE to the GAN world would consist in scaling down the output of the discriminator (from which the reward is directly crafted in GAIL and SAM). The generator in a GAN is updated with gradients of the output of the discriminator w.r.t. its own parameters, similarly to how the actor is updated with gradients of the critic in an actor-critic. Consequently, squashing the output of the discriminator squashes the gradients used by the generator, which is equivalent to reducing the learning rate for the optimization of the generator (assuming no exotic optimizer or regularizer is in use).
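The equivalence with a reduced learning rate holds for plain gradient steps, and can be verified on a toy objective. The sketch below is an assumption-laden illustration: the quadratic objective, the helper names `grad_of_scaled_objective` and `sgd_step`, and the numerical values are hypothetical, and the optimizer is vanilla gradient descent with no momentum, adaptive scaling, or regularizer.

```python
def grad_of_scaled_objective(theta, scale):
    """Gradient of scale * sum(theta_i ** 2), i.e. 2 * scale * theta_i."""
    return [2.0 * scale * t for t in theta]


def sgd_step(theta, grad, lr):
    """One plain gradient step (no momentum, no adaptive scaling)."""
    return [t - lr * g for t, g in zip(theta, grad)]


theta = [1.0, -2.0, 0.5]
alpha, lr = 0.25, 0.1  # hypothetical squashing factor and learning rate

# Squash the signal by alpha, keep the learning rate.
squashed_signal = sgd_step(theta, grad_of_scaled_objective(theta, alpha), lr)
# Keep the signal, shrink the learning rate by alpha.
smaller_step = sgd_step(theta, grad_of_scaled_objective(theta, 1.0), alpha * lr)

print(squashed_signal, smaller_step)
```

Both updates coincide up to floating-point rounding. With an adaptive optimizer that normalizes gradient magnitudes, the two would generally differ, which is the caveat stated in the text.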
Lemma 2 Let the MDP with which the agent interacts be deterministic, with the dynamics of the environment determined by the function f : S × A → S. The agent follows a deterministic policy mapping states in S to actions in A, and receives rewards from r : S × A → ℝ upon interaction. The functions f, the policy, and r need be $C^0$ and differentiable over their respective input spaces. This property is satisfied by the usual neural network function approximators. The "almost-everywhere" case can be derived from this lemma without major changes (relevant when at least one activation function is only differentiable almost-everywhere, e.g. ReLU). (a) Under the previous assumptions, for k ∈ [0, T − t − 1] ∩ ℕ the following (non-recursive) inequality is verified: where $C := A^2 \max(1, B^2)$ is the time-independent counterpart of $C_t$.
Proof of Lemma 2(a) (a) First, we take the derivative with respect to each variable separately: By assembling the norm with respect to both input variables, we get: As in Lemma 1, let A t , B t and C t be time-dependent quantities defined as: Finally, by injecting Eq. (35), we directly obtain: which concludes the proof of Lemma 2(a). ◻

This shows that Eq. (94) is verified when
Conclusion We have shown that Eq. (94) is valid ∀v ∈ [0, T] ∩ ℕ , which concludes the proof of Theorem 3(a). ◻

This shows that Eq. (99) is verified when
Conclusion. We have shown that Eq. (99) is valid ∀v ∈ [0, T] ∩ ℕ, which concludes the proof of Theorem 3(b). ◻
, r, and r [cf. Eq. (68)] are all deterministic (no expectation). Additionally, since r is assumed to be $C^0$ and differentiable over S × A, Q is by construction also $C^0$ and differentiable over S × A. Consequently, $\nabla^{u}_{s,a}[Q]_{u}$ exists, ∀u ∈ [0, T] ∩ ℕ. Since both r and Q are scalar-valued (their output space is ℝ), their Jacobians are the same as their gradients. We can therefore use the linearity of the gradient operator:

Proof of Theorem 4 With finite horizon
. On the other hand, when $\gamma^2 C \neq 1$: By applying √⋅ (monotonically increasing) to the inequality, we obtain the claimed result. ◻
Finally, we derive a corollary from Theorem 4 corresponding to the infinite-horizon regime.
Corollary 2 (Infinite-horizon regime) Under the assumptions of Theorem 4, including that r is δ-Lipschitz and that the preconditioned reward is defined as in Eq. (68) over S × A, and assuming that $\gamma^2 C < 1$, we have, in the infinite-horizon regime:
Proof of Corollary 2 By following the proof of Corollary 1, using Theorem 3 instead of Theorem 1, we arrive directly at the claimed result. ◻
Remark 1 Say we were to write a proof analogous to the one laid out right above for Theorem 4, but using the time-dependent version of Theorem 3 instead of the time-independent version that we used in Eq. (106) (version 3(a) instead of 3(b)). Despite not being identifiable as a finite or infinite sum of geometric series, the expression we would get instead of Eq. (106) not only is a tighter bound by construction, but it also has an interesting form: Going through the first operands of the sum, and looking solely at the preconditioner and "C" factors, we have the following: This observation tells us that, in the derived Lipschitz constant of the preconditioned action-value, the reward preconditioner at time t can compensate for all the past values $\{C_v \mid v < t\}$. Intuitively, the longer we wait to reduce the preconditioner, the more its subsequent values will need to compensate for the "negligence" of their predecessors. Note, the product of $\{C_v \mid v < t\}$ compounds quickly.

Provably more robust
Given that, in this work, we aligned the notion of robustness of a function approximator with the value of its Lipschitz constant (more robust means lower Lipschitz constant, cf. Sect. 4), and given that the preconditioner's upper bound is at most 1 (cf. Lemma 2), we can write, from the result of Corollary 2, that the upper bound derived for the Lipschitz constant of the preconditioned action-value is at most $\Delta_\infty$, where $\Delta_\infty := \delta / \sqrt{1 - \gamma^2 C}$ is the upper bound of the Lipschitz constant of Q that we derived in Corollary 1. Note, all of what is written in this remark concerns the infinite-horizon regime, but one can derive the finite-horizon counterpart trivially (using Theorem 2 instead of Corollary 1, and Theorem 4 instead of Corollary 2) to arrive at the same conclusion: the preconditioned action-value has a lower derived Lipschitz constant upper bound than Q, by a factor of at most 1, and is therefore provably more robust than Q. In other words, applying the simple PURPLE reward preconditioning to SAM has the effect of making the learned Q-value provably more robust.
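The comparison between the two infinite-horizon bounds can be sketched numerically. This is a minimal illustration under explicit assumptions: it takes the bound of Corollary 1 to read δ/√(1 − γ²C), with δ the Lipschitz constant of the reward and γ the discount factor, and it takes the PURPLE counterpart to be that bound multiplied by the preconditioner's upper bound. The function name, argument names, and numerical values are hypothetical.

```python
import math


def q_lipschitz_bound(reward_lipschitz, discount, c, preconditioner_max=1.0):
    """Infinite-horizon upper bound on the Lipschitz constant of the Q-value.

    Assumed form: preconditioner_max * delta / sqrt(1 - gamma^2 * C),
    which is only defined when gamma^2 * C < 1.
    """
    ratio = discount ** 2 * c
    if ratio >= 1.0:
        raise ValueError("the bound requires discount**2 * c < 1")
    return preconditioner_max * reward_lipschitz / math.sqrt(1.0 - ratio)


plain = q_lipschitz_bound(reward_lipschitz=1.0, discount=0.99, c=1.0)
purple = q_lipschitz_bound(1.0, 0.99, 1.0, preconditioner_max=0.5)
print(plain, purple)
```

For any preconditioner bounded above by 1, the second value never exceeds the first, which is the sense in which the preconditioned action-value is called provably more robust. The guard on `discount ** 2 * c` mirrors the condition under which the infinite-horizon bound is stated.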

Detached guide
Consider the following particular form for the preconditioner [Eq. (114)], ∀t ∈ [0, T] ∩ ℕ, ∀$(s_t, a_t)$ ∈ S × A: the preconditioner at time t is the kernel of the Boltzmann (or Gibbs) probability distribution, i.e. the exponential of the negated product of an inverse temperature hyper-parameter and an energy evaluated at time t (hence the preconditioner lies in (0, 1]), where the energy, for now, is an arbitrary non-negative function. This kernel is non-normalized, and as such, it is not a probability per se. Nonetheless, it still echoes the propensity or tendency of the state-action pair $(s_t, a_t)$ to possess the property described by the non-negative energy, which we define momentarily. Low values of the energy will push the preconditioner towards its upper limit, 1, while high energy values will make it tend towards its lower limit, 0, while remaining positive. Equivalently, the preconditioned reward will be approximately equal to $r(s_t, a_t)$ whenever the energy approaches zero (from above), and approximately equal to 0 whenever the energy grows towards higher levels. Under this orchestration, we need the derivatives of the preconditioner at time t + k with respect to $s_t$ and to $a_t$ to both be zero, ∀t ∈ [0, T] ∩ ℕ, ∀k ∈ [0, T − t] ∩ ℕ, ∀$(s_t, a_t)$ ∈ S × A, for the derived robustness guarantees to be readily applicable (we laid out the properties the preconditioner must possess in Sect. 6.5, right before exposing Lemma 2). In particular, the soft approximate ℭ-validity pseudo-indicator [cf. Eq. (67)] is an instantiation of the form laid out in Eq. (114), where the inverse temperature equals 1, and the energy is $\max(0, \|\nabla_{s_t, a_t} D(s_t, a_t)\| - k)^{2}$. In such an instance, the preconditioned reward is approximately equal to $r(s_t, a_t)$ whenever the pair $(s_t, a_t)$ is approximately ℭ-valid, formally, $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$. Conversely, in the extreme scenario where $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \gg k$, the energy grows large, the preconditioner is approximately equal to 0, and so is the preconditioned reward. As such, in effect, the agent's policy is punished for selecting actions that do not satisfy the approximate ℭ-validity condition above.
Besides, it is punished in accordance with how far outside the allowed range, [0, k], the norm of the Jacobian of D gets. Nonetheless, in this particular instance, the empirical observations we have made in Sect. 6.4 attest to the fact that, provided the right choice of scaling factor and distribution (both characterizing the gradient penalization), the approximate ℭ-validity constraint $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$ can easily be satisfied 100% of the time by only regularizing D. For the k-Lipschitzness of D to be ensured, there is therefore no need to further alter the rewards provided to the agent's policy through PURPLE's pessimistic reward preconditioning. Note, however, that under such a formulation, the preconditioner clearly depends on the current state and action, since its energy involves the gradient of D at $(s_t, a_t)$. While this does not mean that the studied entities are not robust, it prevents us from applying our derived results to guarantee such robustness. Generally speaking, we will probably make the same observation whenever the preconditioner is defined from a constraint we want to enforce on a learned function approximator, for regularization purposes. Indeed, verifying said desideratum on the function approximator directly via the application of a regularizer seems to always be the easiest (since most direct) solution to encourage the satisfaction of a constraint on a differentiable function (e.g. D). Constraints involving the Jacobian of a (a fortiori differentiable) function of the learned system (e.g. $\|\nabla^{t}_{s,a}[D]_{t}\|_{F} \le k$) are a particular case of the general class of constraints for which direct regularization is a priori preferable to an analogous reward shaping as dictated by Eq. (114). On the flip side, due to the fact that the reward, albeit learned as a parametric function, is treated as an input in our computational graph, it is not differentiated through and can consequently be augmented with non-differentiable nodes through the design of the preconditioner.
In other words, even if it is preferable to apply regularization directly to the objective of the regularized function approximator for it to satisfy some constraint, it might not always be possible to do so directly. In that case, guiding the policy towards areas of the state-action landscape that satisfy said constraint could be a surrogate solution, albeit far less preferable than acting on the targeted approximator directly.
As such, by aligning t with said constraint, Eq. (114) offers a way for the policy to act in view of the satisfaction of said constraint while enjoying the considerable advantage of being able to treat t as a black box. We will leverage this universality in the next discussion point.
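The Boltzmann-kernel form discussed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the preconditioner is assumed to multiply the reward (consistent with the two limiting cases described above), and the gradient norm of $D$ is taken as a precomputed input.

```python
import numpy as np


def one_sided_penalty_energy(grad_norm, k=1.0):
    """Energy max(0, ||grad D|| - k)^2: zero whenever the norm lies in [0, k]."""
    grad_norm = np.asarray(grad_norm, dtype=np.float64)
    return np.maximum(0.0, grad_norm - k) ** 2


def boltzmann_preconditioner(energy, inv_temperature=1.0):
    """Unnormalized Boltzmann kernel exp(-inv_temperature * energy), in (0, 1]."""
    energy = np.asarray(energy, dtype=np.float64)
    if np.any(energy < 0.0):
        raise ValueError("energy must be non-negative")
    return np.exp(-inv_temperature * energy)


def precondition_reward(reward, grad_norm, k=1.0, inv_temperature=1.0):
    """Assumed multiplicative preconditioning: reward scaled by the kernel value."""
    energy = one_sided_penalty_energy(grad_norm, k)
    return boltzmann_preconditioner(energy, inv_temperature) * np.asarray(
        reward, dtype=np.float64
    )
```

With `k = 1`, a pair whose gradient norm is 0.5 keeps its reward untouched, while a pair whose gradient norm is 4 sees its reward scaled by `exp(-9)`, i.e. almost entirely squashed.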

Partial compensation of compounding variations
In reaction to the theoretical robustness guarantees derived in Theorem 2 and Corollary 1, we have discussed earlier in Sect. 6.2.3 that, if the variations in space of the policy or the dynamics are large in the early stage of an episode (i.e. when $0 \le t \ll T$), then $\Delta_t$ (the variation bound on $Q$) might explode. As a result, $\|\nabla^t_{s,a}[Q]_t\|_F$ would then be unbounded, leaving us unable to guarantee the robustness of the learned Q-value. The earlier large variations in either or both the policy and dynamics manifest, the more likely these variations are to compound to unreasonably high levels. Concretely, the degree of such compounding variations in space is entirely determined by the operand $\gamma^2 C$ that appears in the variation bounds derived in both Theorem 2 and Corollary 1. The exact same line of reasoning holds for the variation bounds laid out later in Sect. 6.5, in Theorem 4 and Corollary 2 respectively. These guarantees unanimously agree on the critical role that $C$ plays in the robustness bounds, which we also refer to as variation bounds. Loosely, high values of $C$ prevent $Q$ from enjoying the Lipschitzness guarantees laid out in Sects. 6.1 and 6.5. As such, it is paramount to devise a way to keep $C$ in check by controlling its magnitude, thereby preventing it from voiding our theoretical guarantees and preventing the method from adopting a brittle behavior. We defined $C$ in Lemma 1(b) as $C := A^2 \max(1, B^2)$. As such, to limit the magnitude of $C$, we seek ways to limit the respective magnitudes of the $A$ and $B$ majorants. Similarly to the learned surrogate reward core $D$, the policy followed by the agent (for which the policy symbol used in the guarantees is a placeholder) is learned as a parametric function approximator, enabling us to tame $B$ by applying a gradient penalty regularizer directly on the policy (exactly like we already do to ensure that $D$ remains $k$-Lipschitz-continuous).
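To get a feel for how quickly the bound degrades with $C$, the short sketch below evaluates $C := A^2 \max(1, B^2)$ and a geometric-sum variation bound. It assumes that the time-independent bound is the reward's Lipschitz constant (called `delta` here) times $\sqrt{\sum_k (\gamma^2 C)^k}$, which is the form suggested by the infinite-horizon constant with denominator $\sqrt{1 - \gamma^2 C}$. The function names and the symbol `delta` are illustrative assumptions.

```python
import math


def compounding_factor(A, B):
    """C := A^2 * max(1, B^2), from the dynamics (A) and policy (B) majorants."""
    return A ** 2 * max(1.0, B ** 2)


def lipschitz_bound(delta, gamma, C, horizon=None):
    """Assumed bound delta * sqrt(sum_{k=0}^{horizon} (gamma^2 * C)^k).

    With horizon=None the infinite-horizon limit delta / sqrt(1 - gamma^2 * C)
    is returned, or +inf when gamma^2 * C >= 1 (the series diverges).
    """
    x = gamma ** 2 * C
    if horizon is None:
        if x >= 1.0:
            return math.inf
        return delta / math.sqrt(1.0 - x)
    return delta * math.sqrt(sum(x ** k for k in range(horizon + 1)))
```

For instance, with `gamma = 0.99`, moving from `C = 1.0` to `C = 1.1` is enough to take the infinite-horizon bound from a finite value to `inf`, which illustrates why even moderate values of $A$ or $B$ can void the guarantee.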
By contrast, we cannot tame $A$ the same way (via direct regularization applied onto $f$), due to the transition function $f$ of the world (whether real or simulated) being a black box that we cannot even query at will. Not only is $f$ non-differentiable (the real world never is; non-trivial simulated worlds virtually never are), but we also cannot evaluate it at any state-action pair whenever we want. Our desideratum then ultimately boils down to finding a way to keep $A$ in check, since the usual candidate to enforce Lipschitzness (applying a regularizer on the Jacobian directly), which is the preferable option by far for $D$ and the policy, is out of the question for $f$, as we have established. Despite the fact that, by nature, we cannot change $f$ in the MDP, we can change the transition function $f'$ that effectively takes the place of $f$ in practice and underlies the effectively observed MDP, by urging the agent's policy to avoid areas of the state-action landscape $S \times A$ that display high $\|\nabla^t_{s,a}[f']_t\|_\infty$ values. In fact, $f'$ changes continually ($f'$ is non-stationary) throughout the learning process as the preferences of the agent evolve across learning episodes. It is therefore fair to posit that we can devise a way to skew the policy towards areas of $S \times A$ where the norm of the Jacobian of the dynamics stays below a chosen threshold (where $f$ is Lipschitz-continuous with that threshold as constant, thereby also satisfying the premise of the guarantees) by defining the energy function in the model-based preconditioner as a one-sided gradient penalty [Eqs. (115) and (116)], $\forall t \in [0, T] \cap \mathbb{N}$, $\forall (s_t, a_t) \in S \times A$, where the ON subscript marks an online, running estimate of the standard deviation of the one-sided penalty, i.e. of the squared amount by which $\|\nabla_{s_t, a_t} f(s_t, a_t)\|_F$ exceeds the threshold (zero when it does not). For completeness, we remind here that we used the same online normalization technique in our RED experiments (cf. Sect. 5.5), inspired by the discussion laid out in Burda et al. (2018) on the importance of such a normalization technique when the reward is grounded on a prediction loss.
Considering the edge cases, and omitting here the clipping to the minimum value, when the energy is close to zero, the preconditioner is approximately equal to 1, i.e. the preconditioned reward approximately equals the original reward at $(s_t, a_t)$ [cf. Eqs. (115), (116)]. Conversely, in the extreme scenario where the energy is very large (i.e. $\|\nabla^t_{s,a}[f]_t\|_F$ far exceeds the threshold), the preconditioner is approximately equal to 0, and so is the preconditioned reward.
Looking at the model-based instantiation of PURPLE laid out in Eq. (115), and specifically of the form exhibited in Eq. (113), we see that the energy depends on the current state $s_t$ and action $a_t$. Indeed, from the definitions of the preconditioner and of the energy, we immediately see that the preconditioner inherits this dependence. As such, the crafted preconditioner does not satisfy the eligibility conditions for the derived theoretical guarantees to be applicable, which were presented in condensed form in Sect. 6.5, right before exposing Lemma 2. If we had used the supremum Frobenius norm $\|\nabla^t_{s,a}[f]_t\|_\infty$ to formulate the energy instead of relaxing it to $\|\nabla^t_{s,a}[f]_t\|_F$, its non-supremum counterpart, the energy (and hence the preconditioner) would not depend on $s_t$ and $a_t$ (or any visited state or picked action), and our robustness guarantees would be readily applicable. Still, such a supremum Frobenius norm is intractable in practice. In order for us to be able to evaluate the developed prototype empirically, we resorted to the obvious tractable relaxation consisting in simply dropping the supremum altogether for this diagnostics-oriented case.
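The model-based energy and its clipped preconditioner can be sketched as follows. This is a hedged illustration rather than the exact form of Eqs. (115) and (116): it assumes that the one-sided penalty is divided by the running standard deviation, that the clipping is a lower bound on the preconditioner, and that the Jacobian norms of the forward model are computed elsewhere. The class and function names (`RunningStd`, `model_based_preconditioner`) and the argument names are hypothetical.

```python
import numpy as np


class RunningStd:
    """Online estimate of a standard deviation (Welford's algorithm)."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.eps = eps  # keeps the divisor strictly positive

    def update(self, values):
        for v in np.ravel(np.asarray(values, dtype=np.float64)):
            self.count += 1
            d = v - self.mean
            self.mean += d / self.count
            self.m2 += d * (v - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # no meaningful estimate yet: leave values unscaled
        return float(np.sqrt(self.m2 / self.count) + self.eps)


def model_based_preconditioner(jac_norms, threshold, running_std,
                               inv_temperature=1.0, xi_min=0.7):
    """Clipped Boltzmann kernel of a normalized one-sided gradient penalty."""
    raw = np.maximum(0.0, np.asarray(jac_norms, dtype=np.float64) - threshold) ** 2
    running_std.update(raw)             # assumed: statistics updated before use
    energy = raw / running_std.std      # assumed: penalty divided by running std
    return np.maximum(xi_min, np.exp(-inv_temperature * energy))
```

The defaults mirror the values reported for the experiments (a minimum of 0.7 and an inverse temperature of 1); the threshold would be 6 or 7 in that setting.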
SAM-PURPLE-7 and SAM-PURPLE-6 are two instantiations of SAM (cf. Algorithm 1), augmented with the model-based instantiation of PURPLE whose template is laid out in Eqs. (115) and (116), with a threshold of 7 and 6 respectively. We indicate how to read the plots (whether lower or higher is better) in the caption of each column. Despite displaying overlapping return curves, note how much tighter the standard deviation envelope is for PURPLE runs. Runtime is 96 h

Now that we have laid out how the pessimistic model-based preconditioner impacts the reward artificially received by the agent upon interaction, we consider how this preconditioning affects the Lipschitz constant of $Q$ in the infinite-horizon setting, denoted by $\Delta_\infty$ [cf. Eq. (113)]. High values of $A$ or $B$ translate into high values of $C := A^2 \max(1, B^2)$, which in turn push the denominator $\sqrt{1 - \gamma^2 C}$ of the Lipschitz constant $\Delta_\infty$ towards 0 from above, exposing $\Delta_\infty$ to diverge to $+\infty$. Without preconditioning (i.e. with a preconditioner equal to 1), the task of compensating for such a low-valued denominator would be left to the Lipschitz constant of the reward alone, and picking it close to 0 would be the only way to keep the robustness bound from diverging. With preconditioning however, we can also try to prevent it from diverging with the preconditioner, whose value can be set far more finely (per timestep). Specifically, with the formulation laid out in Eqs. (115) and (116), the hyper-parameters of the preconditioner [cf. Eq. (116)] can be tuned extensively in practice to achieve the desired level of compensation. We used a minimum clipping value of 0.7, an inverse temperature of 1, and a threshold in $\{6, 7\}$ in the experiments we conducted to showcase how the proposed model-based reward preconditioning laid out above can help us achieve our robustness desideratum.
Since we aim to showcase its potential benefits, as opposed to convincing the reader to plug this preconditioning method into every future architecture, we conducted illustrative experiments only in the Hopper environment (neither the easiest, nor the hardest among the ones considered, cf. Table 1). Note, when it comes to $D$'s gradient penalty regularization, we use the default distribution and scaling factor (cf. Sect. 6.3): the directed distribution of WGAN-GP, with 10 as scaling factor. Since the evaluated policy is penalized for navigating areas of $S \times A$ where $\|\nabla^t_{s,a}[f]_t\|_F$ exceeds the threshold, we monitor $G := \|\nabla^t_{s,a}[f]_t\|_F$. We expect to observe lower values of $G$ when using the studied preconditioning. In order to grasp the extent to which variations can compound in the system, and therefore highlight the need for mechanisms allowing the main method to contain such compounding of variations (like the proposed one), we also monitor a relaxed approximation of $\gamma^2 C$, denoted by $H$. We expect to see the same ranking of methods in the plots depicting $G$ and $H$ respectively. These are all reported in Fig. 11.
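As an illustration of what monitoring $G$ involves, the sketch below estimates the Frobenius norm of the Jacobian of a forward model with respect to the concatenated state-action input. It swaps in central finite differences for the automatic differentiation that the text relies on, purely to stay dependency-free; `jacobian_frobenius_norm` and the callable `f(s, a)` are hypothetical names, and `s` and `a` are assumed to be 1-D arrays.

```python
import numpy as np


def jacobian_frobenius_norm(f, s, a, h=1e-5):
    """Frobenius norm of d f(s, a) / d (s, a), via central finite differences."""
    s = np.asarray(s, dtype=np.float64)
    a = np.asarray(a, dtype=np.float64)
    x = np.concatenate([s, a])
    n_s = s.size
    columns = []
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        upper = np.asarray(f((x + e)[:n_s], (x + e)[n_s:]), dtype=np.float64)
        lower = np.asarray(f((x - e)[:n_s], (x - e)[n_s:]), dtype=np.float64)
        columns.append((upper - lower) / (2.0 * h))
    jac = np.stack(columns, axis=1)  # shape: (output_dim, state_dim + action_dim)
    return float(np.linalg.norm(jac))  # default matrix norm is Frobenius
```

Each call costs two model evaluations per input dimension, which hints at why evaluating such a metric on every sampled mini-batch becomes expensive.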
Note, the steep surge in overall computational cost caused by the evaluation of the monitored metrics ($G$ and $H$) and especially of the preconditioner lowered the number of iterations our agent could do in the allowed runtime. As such, we increased said runtime from the usual 0.5-day or 2-day duration to a 4-day duration (or 96 h). Such runs are more costly to orchestrate, hence the sparser array of experiments to offset the steeper cost in compute. In Fig. 11, we observe that, at evaluation time, the model-based PURPLE instantiation in Eqs. (115) and (116) indeed enables the agent to achieve lower values of $G$ and $H$, with the same episodic return. Said differently, it seems that the agent with preconditioning, compared to the one without, achieves the same proficiency, with the same convergence speed, while making decisions that are safer in terms of incurred variations of the approximate dynamics $f$. So, even if the preconditioner is not needed to reach a higher return (or reach it faster) per se, we have showcased that the studied model-based reward preconditioning can increase the robustness of the main method by augmenting it with the means to tame a priori untamable entities in the system (here, the dynamics). Still, the studied model-based instantiation of PURPLE is set back by several drawbacks. (a) We need to maintain a forward model $f$ that approximates the effective transition function $f'$. (b) To be estimated, the preconditioner requires explicit calls to an automatic differentiation library, making its frequent computation (every time a mini-batch is sampled from the replay buffer) extremely expensive overall. (c) The threshold (to be enforced as Lipschitz constant for $f$) must be set such that not every decision made by the agent is penalized, while making sure it is still strict enough in that respect. Besides, we observed in practice that the range of values taken by $\|\nabla^t_{s,a}[f]_t\|_F$ varies greatly across environments. As such, the threshold must be tuned carefully per environment, making the overall process tedious and computationally expensive. In effect, this brings us back to the original issues of reward shaping (Ng et al. 1999) that adversarial IL (Ho and Ermon 2016) circumvented.

Total compensation of compounding variations
Inspired by the insight laid out in Remark 1, we derive theoretical guarantees that characterize the robustness of $Q$ when using the preconditioner defined in Eq. (117), $\forall t \in [0, T] \cap \mathbb{N}$ and $\forall k \in [0, T - t] \cap \mathbb{N}$. Since the norms involved in $C_v$ are supremum ones, the derivatives of the preconditioner at $t + k$ with respect to both $s_t$ and $a_t$ are zero, $\forall t \in [0, T] \cap \mathbb{N}$, $\forall k \in [0, T - t] \cap \mathbb{N}$, $\forall (s_t, a_t) \in S \times A$. The reward preconditioner therefore verifies the properties one must satisfy for the derived robustness guarantees to be applicable (cf. Sect. 6.5). Again, note, the property that the preconditioner is at most 1 is only ever used in Sect. 6.6.1, and has not been leveraged anywhere else. Given that the developed theory still holds if the preconditioner exceeds 1 at some $t \in [0, T] \cap \mathbb{N}$, the fact that the preconditioner defined in Eq. (117) does not necessarily lie in the $(0, 1]$ interval is not an issue a priori. Still, in practice, it will virtually always be below 1.
We now derive the associated counterparts of Theorem 4 and Corollary 2. $\forall t \in [0, T] \cap \mathbb{N}$. Note, the bound now only depends on the Lipschitz constant of the reward, the discount factor $\gamma$, and $T - t$, the "remaining time in the episode".
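The cancellation claimed above can be checked numerically. The sketch below is a reconstruction under stated assumptions, not a transcription of Eq. (117): it assumes that the preconditioner at $t + k$ is the inverse square root of the product of the $C_v$'s from $t$ to $t + k - 1$, and that the squared time-dependent bound is $\sum_k \gamma^{2k} (\prod_{v<k} C_v)(\delta\,\xi_k)^2$, with `delta` the reward's Lipschitz constant and `xi` the preconditioner. All names are hypothetical.

```python
import math


def total_compensation_preconditioners(C_values):
    """xi[k] = prod_{v < k} C_values[v] ** -0.5, with xi[0] = 1 (empty product)."""
    xi = [1.0]
    for c in C_values:
        xi.append(xi[-1] / math.sqrt(c))
    return xi  # one more entry than C_values


def squared_bound(delta, gamma, C_values, xi):
    """Assumed bound: sum_k gamma^(2k) * (prod_{v < k} C_v) * (delta * xi[k])^2."""
    total, prod_c = 0.0, 1.0
    for k, x in enumerate(xi):
        total += gamma ** (2 * k) * prod_c * (delta * x) ** 2
        if k < len(C_values):
            prod_c *= C_values[k]
    return total
```

With `C_values = [4.0, 9.0, 2.5]`, the preconditioned bound collapses to `delta**2 * sum(gamma**(2*k) for k in range(4))`, whatever the $C_v$'s are, whereas setting every preconditioner to 1 leaves the products of $C_v$'s in place and yields a strictly larger value.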

Proof of Theorem 5
The reward preconditioner used to assemble the preconditioned reward from the original reward is defined according to Eq. (117). As carried out in Remark 1, we start the proof of Theorem 5 analogously to the one laid out for Theorem 4, but using the time-dependent version of Theorem 3 instead of the time-independent version that we used in Eq. (106) (Theorem 3 (a) instead of Theorem 3 (b)). Our starting point then aligns with the crux of Remark 1. As such: Since we defined $\gamma$ to be within the interval $[0, 1)$ in Sect. 3, we trivially have $\gamma^2 < 1$, hence $\gamma^2 \neq 1$ and: By applying $\sqrt{\cdot}$ (monotonically increasing) to the inequality, we obtain the claimed result. ◻

Finally, we derive a corollary from Theorem 5 corresponding to the infinite-horizon regime.
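Corollary 3 (Infinite-horizon regime) Under the assumptions of Theorem 5, including that $r$ is $\delta$-Lipschitz and that the preconditioned reward is defined as in Eq. (68) over $S \times A$, we have, in the infinite-horizon regime: $\|\nabla^t_{s,a}[Q]_t\|_F \le \delta / \sqrt{1 - \gamma^2}$, which translates into $Q$ being $\frac{\delta}{\sqrt{1 - \gamma^2}}$-Lipschitz over $S \times A$.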

Proof of Corollary 3
As we adapt the proof of Theorem 5 to the infinite-horizon regime, the bound in Eq. (121) becomes $\delta^2 \sum_{k=0}^{+\infty} (\gamma^2)^k = \frac{\delta^2}{1 - \gamma^2}$ (infinite sum of geometric series), since we defined $\gamma$ to be within the interval $[0, 1)$ in Sect. 3, i.e. $\gamma^2 < 1$. We then apply $\sqrt{\cdot}$ to the inequality. ◻

In these theoretical guarantees, we have shown that by carefully crafting PURPLE's reward preconditioner according to Eq. (117), we obtain upper-bounds $\Delta_\infty$ on the Lipschitz constant of the resulting action-value $Q$ that are independent of the $C_v$'s [cf. Eq. (117)]. In other words, we have shown that such a preconditioner design allows us to totally compensate for the compounding variations (a) first tackled in the discussion led in Sect. 6.2.3, and (b) then addressed only partially by the model-based reward preconditioning discussed at length in Sect. 6.6.3 (whose applicability in practice we showcased). Echoing what motivated the emergence of Remark 1 in the first place, the form adopted by the reward preconditioning [cf. Eq. (117)] that allowed us to derive the robustness guarantees of Theorem 5 and Corollary 3 enjoys an insightful and intuitive interpretation. Going through the elements of the series described by the preconditioner of Eq. (117), $\forall t \in [0, T] \cap \mathbb{N}$ and $\forall k \in [0, T - t] \cap \mathbb{N}$, we can read off the sequence of consecutive preconditioning values. We observe that, when purposely defined as such, the reward preconditioner at a given stage $t + k$ compensates for the $C_v$'s of all the previous timesteps, backwards from $t + k - 1$ to $t$, where $Q$'s Lipschitz constant is characterized. In order to prevent the upper-bound on $\|\nabla^t_{s,a}[Q]_t\|_F$ from being burdened by incipient variations that are prone to compound, the preconditioner can actively anticipate the further compounding of said variations within the time remaining in the episode by preemptively squashing the current surrogate reward at $t + k$ based on how much variation the $C_v$'s have accumulated from $t$ to $t + k - 1$.
The proposed interpretation of the studied preconditioner aligns with our intuitive desideratum: "if you want to fend off the compounding of variations that threatens the stability of your action-value, make the latter more robust as soon as you see, from past metrics (here, monitored $C_v$ values), that said variations might actually compound soon".
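For readers who want the cancellation spelled out, the derivation below reconstructs the argument behind Theorem 5 and Corollary 3 under assumed notation: $\xi_{t+k}$ for the preconditioner, $\delta$ for the Lipschitz constant of the reward, and $\gamma$ for the discount factor. The closed form given for $\xi_{t+k}$ is a plausible reading of Eq. (117) inferred from the interpretation above (each preconditioner compensates for the $C_v$'s from $t$ to $t + k - 1$), not a verbatim copy of it.

```latex
% Assumed form of the preconditioner (reconstruction of Eq. (117)):
%   \xi_{t+k} := \prod_{v=t}^{t+k-1} C_v^{-1/2}, with \xi_t = 1 (empty product),
% so that \xi_t = 1,\ \xi_{t+1} = C_t^{-1/2},\ \xi_{t+2} = (C_t C_{t+1})^{-1/2},\ \dots
\begin{align}
\big\| \nabla^t_{s,a} [Q]_t \big\|_F^2
  &\le \sum_{k=0}^{T-t} \gamma^{2k}
       \Big( \prod_{v=t}^{t+k-1} C_v \Big) \, \xi_{t+k}^2 \, \delta^2
       && \text{(assumed time-dependent bound)} \\
  &= \delta^2 \sum_{k=0}^{T-t} (\gamma^2)^k
       && \text{(the products of } C_v \text{ cancel)} \\
  &= \delta^2 \, \frac{1 - \gamma^{2(T-t+1)}}{1 - \gamma^2}
       && (\gamma^2 \neq 1) \\
  &\xrightarrow[T \to +\infty]{} \frac{\delta^2}{1 - \gamma^2}
       && \text{(infinite sum of geometric series)}
\end{align}
```

Taking square roots recovers a finite-horizon constant that depends only on $\delta$, $\gamma$ and $T - t$, and the infinite-horizon constant $\delta / \sqrt{1 - \gamma^2}$.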
Although the proposed preconditioner is appealing in principle thanks to its salient interpretation, and justified by theoretical guarantees, we did not experiment with it in practice. Indeed, considering how we have shown in Sect. 6.6.3 that lowering the monitored variation metrics does not seem to affect the agent's return in practice, we do not expect the interpretable preconditioner tackled in this discussion to bring any practical benefit in the considered environments. Using a gradient penalty constraint to induce local Lipschitz-continuity of the function at the core of the reward function is, in a sense, all you need to achieve peak expert performance in the considered off-policy generative adversarial imitation learning setting. Still, we believe the design and study of methods able to actively tune their level of robustness (aligned in this work with the concept of spatial, local Lipschitz-continuity) depending on the choices (or more pessimistically, on the mistakes) made by the agent to be an interesting avenue of future work. Besides, by augmenting the reward-less MDP (from which we first stripped the environmental reward) with our adversarially learned reward, preconditioned in line with Eq. (117), the resulting MDP has a memory, since the reward depends on entities (the $C_v$'s) from previous timesteps in the episode. In effect, due to such a reward preconditioning formulation, the Markov property is not satisfied anymore as, given the present, the future now does depend on the past. We believe the observations made and results derived in this work could pave the way to further investigations aiming to decipher known methods and ultimately pinpoint the minimal setup under which they still do well.

Conclusion
In this work, we conducted an in-depth study of the stability problems incurred by off-policy generative adversarial imitation learning. Our contributions closely follow the line of reasoning, and are as follows.
(1) We characterized the various inherent hindrances the approach suffers from, in particular how learned parametric rewards affect the learned parametric state-action value.
(2) We showed that enforcing a local Lipschitz-continuity constraint on the discriminator network used to formulate the imitation surrogate reward is a sine qua non condition for the approach to empirically achieve expert performance in challenging continuous control problems, within a number of timesteps that still enables us to call the method sample-efficient.
(3) In line with the first and second steps, we derived theoretical guarantees that characterize the Lipschitzness of the Q-function when the reward is assumed Lipschitz-continuous. Note, the reported theoretical results are valid for any reward satisfying the condition, nothing is specific to imitation.
(4) We proposed a new RL-grounded interpretation of the usual GAN gradient penalty regularizers (differing by where they induce Lipschitzness) along with an explanation as to (a) why they all have such a positive impact on stability, but also (b) how to make sense of the empirical gap between them.
(5) We showed that, in effect, the consistent satisfaction of the Lipschitzness constraint on the reward is a strong predictor of how well the mimicking agent performs empirically.
(6) Finally, we introduced a pessimistic reward preconditioning technique which (a) makes the base method it is plugged into provably more robust, and (b) is accordingly backed by several theoretical guarantees. As in (3), these guarantees are not specific to imitation and have a wide range of applicability. We gave an illustrative example of how the technique can help further increase the robustness of the method it is plugged into empirically.

2013, 2015). As for the activation functions used in the neural networks, we used ReLU non-linearities in both the actor and critic, and used Leaky-ReLU (Maas et al. 2013) non-linearities with a leak of 0.1 in the discriminator. We used an online version of batch normalization (described earlier in Sect. 5.5) to standardize the observations before they are fed to the actor and critic.
We do not use any learning rate scheduler, for any module.

Appendix 2: Sequential decision making under uncertainty in non-stationary Markov decision processes
In Sect. 3, we have defined the MDP as stationary, in line with a vast majority of works in RL. Note, a stochastic process or a distribution is commonly said to be stationary if it remains unchanged when shifted in time. While the stationarity assumption allows for the derivation of various theoretical guarantees and is overall easier to deal with analytically, it fails to explain the inner workings of complex realistic simulations, and a fortiori the real world. One critical challenge incurred when modeling the world as a non-stationary MDP is the

Table 2 Hyper-parameters used in this work. Unless explicitly stated otherwise, every method uses these. The "effective" batch size corresponds to the size of the mini-batch aggregated across parallel workers of the distributed architecture. In our case, every worker (of the grand total of n = 16 workers) samples a mini-batch of size 64 from its (individual) replay buffer, resulting in an effective batch size of 64 × 16 = 1024
(Fujimoto et al. 2018): 0.2
Target smoothing, noise clip (Fujimoto et al. 2018): 0.5
Actor update delay (Fujimoto et al. 2018): 2
Reward training steps per iteration: 1
Agent training steps per iteration: 1
Discriminator learning rate: 5.0 × 10^−4
Entropy regularization scale: 0.001
Positive label-smoothing: Real labels ∼ unif(0.7, 1.2)
Positive-Unlabeled (Xu and Denil 2019), coeff.: 0.25

Fig. 12 Comparison of the gradient used to update the policy in this work, involving the gradient of the state-action value, against an adaptive hybrid method involving also the gradient of the discriminator, and combining both gradients based on their cosine similarity. Runtime is 12 h

Online batch normalization in discriminator
See Appendix Fig. 15.

Evaluation of the considered method under several exploration strategies. "Action" corresponds to defining the exploration policy by directly applying additive Gaussian noise to the action returned by the deterministic policy, where the noise is drawn from a zero-mean Gaussian whose spread parameter is set to 0.2. "Param" denotes the application of additive noise in the network parameters directly, and "Param + OU" corresponds to the additional application of temporally correlated noise, generated sequentially by an Ornstein-Uhlenbeck process, on the action (cf. Sect. 4 for a description of these two last approaches, and Table 2 for the associated hyper-parameters). Despite the absence of a clear winner, we use the combination of parameter noise and temporally correlated action noise in every experiment reported in this work, as it seems to yield the best results. Runtime is 12 h

Funding
Open access funding provided by University of Geneva. This work was supported by the Swiss National Science Foundation grant number CSSII5_177179 "Modeling pathological gait resulting from motor impairment".

Availability of data and materials
The simulated robotics, continuous control environments considered in this work are built with the MuJoCo (Todorov et al. 2012) physics engine, and provided to the community through the OpenAI Gym API (Brockman et al. 2016). Note, to use these environments, one needs a MuJoCo license, which can be obtained from https://www.roboti.us/license.html.

Conflict of interest
The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.