1 Introduction

Manually designing control policies for every possible situation a robot could encounter is impractical. Reinforcement learning (RL) provides a promising alternative to hand-coding skills. Recent applications of RL to high dimensional control tasks have seen impressive successes within simulation (Schulman et al., 2015b; Lillicrap et al., 2015). Unfortunately, a large gap exists between what is possible in simulation and the reality of learning on a physical system. State-of-the-art learning methods require thousands of episodes of experience, which is impractical for a physical robot. Aside from the time it would take, collecting the required training data may lead to substantial wear on the robot. Furthermore, as the robot explores different policies it may execute unsafe actions which could damage the robot.

An alternative to learning directly on the robot is learning in simulation (Cutler & How, 2015; Koos et al., 2010). Simulation is a valuable tool for robotics research as executing a robotic skill in simulation is much easier than executing it in the real world. Robots in simulation can be run unsupervised without fear of them breaking or wearing down. Simulation can often be run faster than real time or parallelized to increase the speed at which data for RL can be collected. However, the value of simulation learning is limited by the inherent inaccuracy of simulators in modeling the dynamics of the physical world (Kober et al., 2013). As a result, learning that takes place in a simulator is unlikely to improve real world performance.

Grounded Simulation Learning (gsl) is a framework for learning with a simulator in which the simulator is modified with data from the physical robot, learning takes place in simulation, the new policy is evaluated on the robot, and data from the new policy is used to further modify the simulator (Farchy et al., 2013). The work introducing gsl demonstrates the effectiveness of the method in a single, limited experiment, by increasing the forward walking velocity of a slow, stable bipedal walk by 26.7%. This article introduces a new algorithm—Grounded Action Transformation (gat)—for simulator grounding within the gsl framework. gat grounds the simulator by modifying the robot’s actions as they are passed to the simulator to, in effect, create a simulator with different dynamics. The grounding function is learned with a small amount of real world and simulated data, allowing the simulator to be modified with less reliance on manual system identification. Additionally, by modifying the simulated robot’s actions we can treat the simulator as a black-box and do not require access to change internal parameters of the simulator.

As a first step, in order to facilitate extensive evaluations, we fully implement and evaluate gat on two tasks using a high-fidelity simulator as a surrogate for the real world. The results of this controlled study contribute to a deeper understanding of transfer from simulation methods and the effectiveness of gat. We then present two examples of using gat for sim-to-real transfer of bipedal locomotion policies learned in simulation to a real humanoid robot. In contrast to prior work (Farchy et al., 2013), one task in our real-world evaluation starts from a state-of-the-art walking controller as the initial policy, and nonetheless is able to improve the walk velocity by over 43%, leading to what may be the fastest known stable walk on the SoftBank nao robot.

Furthermore, to better understand the situations in which gat may be successful, we consider real world environments that have a high degree of stochasticity. We show in simulated environments that gat may fail to find high performing policies when environment state transitions are noisy. To address this limitation we generalize gat to the stochastic gat (sgat) algorithm and show in simulated, stochastic environments that sgat finds higher performing policies than gat. We implement sgat on the nao robot and show that we can learn a fast and stable walking policy over a rough surface while gat fails to find a stable policy.

2 Preliminaries

In this section we formalize the reinforcement learning setting and the problem of sim-to-real learning.

2.1 Notation

We assume the environment is an episodic Markov decision process with state set \(\mathcal {S}\), action set \(\mathcal {A}\), transition function, \(P: \mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow [0,1]\), reward function \(r: \mathcal {S} \times \mathcal {A} \rightarrow \mathbb {R}\), discount factor \(\gamma\), and initial state distribution \(d_0\) (Puterman, 2014). We assume that \(\mathcal {S} = \mathbb {R}^k\) and \(\mathcal {A} = \mathbb {R}^m\) for some \(k,m \in \mathbb {N}_+\). We assume that the transition function, P, is unknown and the reward function, r, is known. We use \(P(s^\prime | s,a) :=P(s, a, s^\prime )\) to denote the conditional probability of state \(s^\prime\) given state s and action a. P is also sometimes called the environment’s dynamics. A policy, \(\pi : \mathcal {S} \rightarrow \mathcal {A}\), is a function mapping states to actions.

The agent interacts with the environment mdp as follows: The agent begins in initial state \(S_0 \sim d_0\). At discrete time-step t the agent takes action \(A_t = \pi (S_t)\). The environment responds with \(R_t :=r(S_t,A_t)\) and \(S_{t+1} \sim P(\cdot | S_t, A_t)\) according to the reward function and transition function. After interacting with the environment for at most \(l\) steps, the agent returns to a new initial state and the process repeats. For notational convenience, we will write that all interactions last \(l\) steps, though in fact they may end earlier. In the MDP definition, we also include a terminal state, \({s_\infty }\), that allows the possibility of episodes ending before time-step \(l\). If at any time-step, t, \(S_t = {s_\infty }\), then for all \(t^\prime \ge t\), \(S_{t^\prime } = {s_\infty }\) and \(R_{t^\prime } = 0\).

Let \(h :=(s_0,a_0,r_0,s_1, \dotsc , s_{l- 1},a_{l- 1},r_{l- 1})\) be a trajectory. Any policy, \(\pi\), and MDP, \({\mathcal {M}}\), induce a distribution over trajectories, \(\Pr (H=h | \pi , {\mathcal {M}})\), where H is a random variable representing a trajectory. Let \(R(h) :=\sum _{t=0}^{l- 1} \gamma ^t r_t\) be the discounted return of h. We define the value of a policy, \(v(\pi , {\mathcal {M}}) :=\mathbf {E}[R(H) | H \sim (\pi , {\mathcal {M}})]\), as the expected discounted return when sampling a trajectory with policy \(\pi\) in MDP \({\mathcal {M}}\). We are interested in learning a policy, \(\pi\), for an mdp, \({\mathcal {M}}\), such that \(v(\pi , {\mathcal {M}})\) is maximized. We wish to minimize the number of actions that must be taken in \({\mathcal {M}}\) before a good policy is learned, i.e., we desire low sample complexity for learning.
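As a concrete illustration of the return and value definitions above, the following sketch (ours, not from the article) computes the discounted return \(R(h)\) of a sampled trajectory and a Monte Carlo estimate of \(v(\pi , {\mathcal {M}})\); the env object and its reset/step interface are assumed stand-ins for the environment MDP.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """R(h) = sum_t gamma^t * r_t for a single trajectory."""
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

def estimate_value(policy, env, gamma, num_episodes=100, max_steps=200):
    """Monte Carlo estimate of v(pi, M): average discounted return of trajectories
    sampled by running the policy. env.step is assumed to return (state, reward, done)."""
    returns = []
    for _ in range(num_episodes):
        state, rewards = env.reset(), []
        for _ in range(max_steps):
            state, reward, done = env.step(policy(state))
            rewards.append(reward)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```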

2.2 Learning in simulation

In this article we study reinforcement learning in a simulated environment with the objective that learned policies will perform well in the real world. We formalize this setting as learning a policy, \(\pi\), in one MDP, \({\mathcal {M}_\mathtt {sim}}\), with the objective of maximizing \(v(\pi , {\mathcal {M}})\). The MDP \({\mathcal {M}_\mathtt {sim}}\) is the simulator and \({\mathcal {M}}\) is the real world. Formally, \({\mathcal {M}}\) and \({\mathcal {M}_\mathtt {sim}}\) are identical MDPs except for the transition function, P. We use P to denote the transition function of the real world and \(P_\mathtt {sim}\) to denote the transition function of the simulator. We make the assumption that the reward function, r, is user-defined and thus is identical for \({\mathcal {M}}\) and \({\mathcal {M}_\mathtt {sim}}\). However, the different dynamics mean that, in general, \(v(\pi , {\mathcal {M}}) \ne v(\pi , {\mathcal {M}_\mathtt {sim}})\), since a policy \(\pi\) induces a different trajectory distribution in \({\mathcal {M}}\) than in \({\mathcal {M}_\mathtt {sim}}\). Thus, for any \(\pi ^\prime\) with \(v(\pi ^\prime , {\mathcal {M}_\mathtt {sim}}) > v(\pi , {\mathcal {M}_\mathtt {sim}})\), it does not follow that \(v(\pi ^\prime , {\mathcal {M}}) > v(\pi , {\mathcal {M}})\)—in fact \(v(\pi ^\prime , {\mathcal {M}})\) could be much worse than \(v(\pi , {\mathcal {M}})\). In practice and in the literature, learning in simulation often fails to improve expected performance (Farchy et al., 2013; Christiano et al., 2016; Rusu et al., 2016b; Tobin et al., 2017).

3 Related work

The challenge of transferring learned policies from simulation to reality has received much research attention of late. This section surveys this recent work as well as older research in simulation-transfer methods. We note that our work also relates to model-based reinforcement learning (Sutton & Barto, 1998). However, much of model-based reinforcement learning focuses on learning a simulator for the task mdp (often from scratch) while we focus on settings where an inaccurate simulator is available a priori.

We divide the sim-to-real literature into four categories: simulator modification, simulator randomization or simulator ensembles, simulators as prior knowledge, and sim-to-real perception learning.

3.1 Simulator modification

We classify sim-to-real works that attempt to use real world experience to change the simulator as simulator modification approaches. This category is the most closely related to our own work.

Abbeel et al. (2006) use real-world experience to modify an inaccurate model of a deterministic mdp. The real-world experience is used to modify \(P_\mathtt {sim}\) so that the policy gradient in simulation is the same as the policy gradient in the real world. Cutler et al. (2014) use lower fidelity simulators to narrow the action search space for faster learning in higher fidelity simulators or the real world. This work also uses experience in higher fidelity simulators to make lower fidelity simulators more realistic. Both of these methods assume random access modification—the ability to arbitrarily and locally modify the simulated dynamics of any state-action pair. This assumption is restrictive in that it may be false for many simulators, especially those with real-valued states and actions.

Other work has used real world data to modify simulator parameters (e.g., coefficients of friction) (Zhu et al., 2018) or combined simulation with Gaussian processes to model where real world data has not been observed (Lee et al., 2017). Such approaches may extrapolate well to new parts of the state-space; however, they may fail if no setting of the physics parameters can capture the complexity of the real world. Golemo et al. (2018) train recurrent neural networks to predict differences between simulation and reality. Then, following actions in simulation, the resulting next state is corrected to be closer to what it would be in the real world. This approach requires the ability to directly set the state of the simulator, a requirement we avoid in this work.

Manual parameter tuning is another form of simulator modification that can be done prior to applying reinforcement learning. Lowrey et al. (2018) manually identify simulation parameters before applying policy gradient reinforcement learning to learn to push an object to target positions. Tan et al. (2018) perform similar system identification (including disassembling the robot and making measurements of each part) and add action latency modeling before using deep reinforcement learning to learn quadrupedal walking. In contrast to these approaches, the algorithms we introduce take a data-driven approach to modifying the simulator without the need for expert system identification.

Finally, while most approaches to simulator modification involve correcting the simulator dynamics, other approaches attempt to directly correct \(v(\pi , {\mathcal {M}_\mathtt {sim}})\). Assuming \(v(\pi , {\mathcal {M}}) = v(\pi , {\mathcal {M}_\mathtt {sim}}) + \epsilon (\pi )\), Iocchi et al. (2007) attempt to learn \(\epsilon (\pi )\) for any \(\pi\). Then policy search can be done directly on \(v(\pi , {\mathcal {M}_\mathtt {sim}}) + \epsilon (\pi )\) without needing to evaluate \(v(\pi , {\mathcal {M}})\). Rodriguez et al. (2019) introduce a similar approach except they take into account uncertainty in extrapolating the estimate of \(\epsilon (\pi )\) and use Bayesian optimization for policy learning. Like this work, both of these works apply their techniques to bipedal locomotion. Koos et al. (2010) use multi-objective optimization to find policies that trade off between optimizing \(v(\pi , {\mathcal {M}_\mathtt {sim}})\) and a measure of how likely \(\pi\) is to transfer to the real world.

3.2 Robustness through simulator variance

Another class of sim-to-real approaches is methods that attempt to cross the reality gap by learning robust policies that can work in different variants of the simulated environment. The key idea is that if a learned policy can work in different simulations then it is more likely to be able to perform well in the real world. The simplest instantiation of this idea is to inject noise into the robot’s actions or sensors (Jakobi et al., 1995; Miglino et al., 1996) or to randomize the simulator parameters (Peng et al., 2017; Molchanov et al., 2019; OpenAI et al., 2018). Unlike data-driven approaches, such domain randomization approaches learn policies that are robust enough to cross the reality gap but may give up some ability to exploit the target real world environment. This problem may be more acute when learning with simple policy representations, as simpler policies may lack the capacity to perform well under a wide range of environment conditions (Mozifian et al., 2019).

A number of works have attempted to combine domain randomization and real world data to adapt the simulator. Chebotar et al. (2019) randomize simulation parameters and use real world data to update the distribution over simulation parameters while simultaneously learning robotic manipulation tasks. Ramos et al. (2019) take a similar approach. Muratore et al. (2018) attempt to use real world data to predict transferability of policies learned in a randomized simulation. Mozifian et al. (2019) attempt to maintain a wide distribution over simulator parameters while ensuring the distribution is narrow enough to allow reinforcement learning to exploit instances that are most similar to the real world.

Domain randomization relies on randomness to produce policies that are robust enough to transfer to the real world. An alternative approach that does not involve randomness is to learn policies that perform well under an ensemble of different simulators (Boeing & Bräunl, 2012; Rajeswaran et al., 2017; Lowrey et al., 2018). Pinto et al. (2017b) simultaneously learn an adversary that can perturb the learning agent’s actions while it learns in simulation. The learner must find a policy that is robust to these disturbances and consequently performs better when transferred to the real world.

3.3 Simulator as prior knowledge

Another approach to sim-to-real learning is to use experience in simulation to reduce learning time on the physical robot. Cully et al. (2015) use a simulator to estimate fitness values for low-dimensional robot behaviors which gives the robot prior knowledge of how to adapt its behavior if it becomes damaged during real world operation. Cutler and How (2015) use experience in simulation to estimate a prior for a Gaussian process model to be used with the pilco (Deisenroth & Rasmussen, 2011) learning algorithm. Rusu et al. (2016a, b) introduce progressive neural network policies which are initially trained in simulation before a final period of learning in the true environment. Christiano et al. (2016) turn simulation policies into real world policies by transforming policy actions so that they produce the same effect that they did in simulation. Marco et al. (2017) use simulation to reduce the number of policy evaluations needed for Bayesian optimization of task performance. In principle, our work could be used with any of these approaches to correct the simulator dynamics which would lead to a more accurate prior.

3.4 Reality gap in the observation space

Finally, while we focus on the reality gap due to differences in simulated and real world dynamics, much recent work has focused on transfer from simulation to reality when the policy maps images to actions. In this setting, even if P and \(P_\mathtt {sim}\) are identical, policies may fail when transferred to the real world due to the differences between real and rendered images. Domain randomization is a popular technique for handling this problem. Unlike the dynamics randomization techniques discussed above, in this setting domain randomization means randomizing features of the simulator’s rendered images (Sadeghi & Levine, 2017; Tobin et al., 2017, 2018; Pinto et al., 2017a). This approach is useful in that it forces deep reinforcement learning algorithms to learn representations that focus on higher level properties of a task and not low-level details of image appearance. Computer vision domain adaptation methods can also be used to overcome the problem of differing observation spaces (Fang et al., 2018; Tzeng et al., 2016; Bousmalis et al., 2018; James et al., 2019). A final approach is to learn perception and control separately so that the real world perception system is only trained with real world images (Zhang et al., 2016; Devin et al., 2017). The problem of overcoming a reality gap in the agent’s observations of the world is orthogonal to the problem of differing dynamics that we study.

4 Grounded simulation learning

In this section we introduce the grounded simulation learning (gsl) framework as presented by Farchy et al. (2013). Our main contribution is a novel algorithm that instantiates this general framework. gsl allows reinforcement learning in simulation to succeed by using trajectories from \({\mathcal {M}}\) to first modify \({\mathcal {M}_\mathtt {sim}}\) such that the modified \({\mathcal {M}_\mathtt {sim}}\) is a higher fidelity model of \({\mathcal {M}}\). The process of making the simulator more like the real world is referred to as grounding.

The gsl framework assumes the following:

1. There is an imperfect simulator mdp, \({\mathcal {M}_\mathtt {sim}}\), that models the mdp environment of interest, \({\mathcal {M}}\). Furthermore, \({\mathcal {M}_\mathtt {sim}}\) must be modifiable. In this article, we formalize modifiable as meaning that the simulator has parameterized transition probabilities \(P_{\varvec{\phi }}(\cdot | s,a) :=P_\mathtt {sim}(\cdot | s,a; {\varvec{\phi }})\) where the vector \({\varvec{\phi }}\) can be changed to produce, in effect, a different simulator.

2. There is a policy improvement algorithm, \(\mathtt {optimize}\), that searches for policies \(\pi\) which increase \(v(\pi , {\mathcal {M}_\mathtt {sim}})\). The \(\mathtt {optimize}\) routine returns a set of candidate policies, \(\varPi\), to evaluate in \({\mathcal {M}}\).

We formalize the notion of grounding as maximizing a measure of how well the simulator's trajectory distribution matches the real world trajectories. Let \(d(h, \Pr _\mathtt {sim}(\cdot |\pi ; {\varvec{\phi }}))\) be a score for the likelihood of a given trajectory in the simulator parameterized by \({\varvec{\phi }}\). Given a dataset of trajectories, \({\mathcal {D}_\mathtt {real}}:=\{h_i\}_{i=1}^m\), collected by running a policy, \(\pi\), in \({\mathcal {M}}\), simulator grounding of \({\mathcal {M}_\mathtt {sim}}\) amounts to finding \({\varvec{\phi }}^\star\) such that:

$${\varvec{\phi }}^\star = \mathop {\arg \max }\limits _{{\varvec{\phi }}} \sum _{h \in {\mathcal {D}_\mathtt {real}}} d\left( h, \Pr \nolimits _\mathtt {sim}(\cdot |\pi ; {\varvec{\phi }}) \right) .$$
(1)

For instance, if \(d(h, \Pr _\mathtt {sim}(\cdot |\pi ; {\varvec{\phi }})) :=\log \Pr _\mathtt {sim}(h |\pi ; {\varvec{\phi }})\) then \({\varvec{\phi }}^\star\) maximizes the log-likelihood of the real world trajectories, or equivalently, minimizes the empirical Kullback-Leibler divergence between \(\Pr (\cdot | \pi , {\mathcal {M}})\) and \(\Pr _\mathtt {sim}(\cdot | \pi ; {\varvec{\phi }}^\star )\).
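Spelling out this equivalence (a standard identity, not a result taken from the article): since \(h_1, \dotsc , h_m\) are drawn by running \(\pi\) in \({\mathcal {M}}\),

$$\begin{aligned} {\varvec{\phi }}^\star = \mathop {\arg \max }\limits _{{\varvec{\phi }}} \frac{1}{m} \sum _{i=1}^{m} \log \Pr \nolimits _\mathtt {sim}(h_i |\pi ; {\varvec{\phi }}) = \mathop {\arg \min }\limits _{{\varvec{\phi }}} \frac{1}{m} \sum _{i=1}^{m} \log \frac{\Pr (h_i | \pi , {\mathcal {M}})}{\Pr \nolimits _\mathtt {sim}(h_i |\pi ; {\varvec{\phi }})}, \end{aligned}$$

because the real world log-probabilities \(\log \Pr (h_i | \pi , {\mathcal {M}})\) do not depend on \({\varvec{\phi }}\); the right-hand side is the empirical estimate, computed from \({\mathcal {D}_\mathtt {real}}\), of the Kullback-Leibler divergence between \(\Pr (\cdot | \pi , {\mathcal {M}})\) and \(\Pr _\mathtt {sim}(\cdot | \pi ; {\varvec{\phi }})\).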

Intuitively, Eq. (1) is solved by making the real world trajectories under \(\pi\) more likely when running \(\pi\) in the simulator. Though exactly solving Eq. (1) may be intractable, if we can make real world trajectories more likely in the simulator then the simulator will be better for policy optimization. Assuming a mechanism for optimizing (1), the gsl framework is as follows:

1. Execute an initial policy, \(\pi _{0}\), in the real world to collect a data set of trajectories, \({\mathcal {D}_\mathtt {real}}= \{h_j\}_{j=1}^m\).

2. Optimize (1) to find \({\varvec{\phi }}^\star\) that makes \(\Pr (H=h | \pi _{0}, {\mathcal {M}_\mathtt {sim}})\) closer to \(\Pr (H=h | \pi _{0}, {\mathcal {M}})\) for all \(h \in {\mathcal {D}_\mathtt {real}}\).

3. Use \(\mathtt {optimize}\) to find a set of candidate policies \(\varPi\) that improve \(v(\cdot , {\mathcal {M}_\mathtt {sim}})\) in the modified simulation.

4. Evaluate each proposed \(\pi _c \in \varPi\) in \({\mathcal {M}}\) and return the policy:

$$\pi _1 :=\mathop {\arg \max }\limits _{\pi _c \in \varPi } v(\pi _c, {\mathcal {M}}).$$

gsl can be applied iteratively with \(\pi _1\) being used to collect more trajectories to ground the simulator again before learning \(\pi _2\). The re-grounding step is necessary since changes to \(\pi\) result in changes to the distribution of trajectories that the agent observes. When the distribution changes, a simulator that has been modified with data from the trajectory distribution of \(\pi _0\) may be a poor model under the trajectory distribution of \(\pi _1\). The entire gsl framework is illustrated in Fig. 1.
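The loop structure of gsl can be summarized schematically in code. The sketch below is our rendering of Steps 1-4 above, not an implementation from the article; rollout, ground, optimize, and evaluate_on_robot are assumed placeholder functions for real world data collection, simulator grounding, policy improvement in simulation, and real world policy evaluation.

```python
def grounded_simulation_learning(pi0, real_env, sim, rollout, ground, optimize,
                                 evaluate_on_robot, num_iterations, num_real_trajectories):
    """Schematic of the gsl loop; all arguments after pi0, real_env, and sim are
    placeholder callables supplied by the user."""
    pi = pi0
    for _ in range(num_iterations):
        # Step 1: execute the current policy in the real world to collect trajectories.
        D_real = [rollout(real_env, pi) for _ in range(num_real_trajectories)]
        # Step 2: ground the simulator, i.e., find phi that makes D_real likely in simulation.
        phi = ground(sim, pi, D_real)
        # Step 3: policy improvement in the grounded simulator returns a set of candidates.
        candidates = optimize(sim, phi, pi)
        # Step 4: evaluate each candidate on the physical system and keep the best one.
        pi = max(candidates, key=evaluate_on_robot)
    return pi
```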

Fig. 1 Diagram of the grounded simulation learning framework

5 The grounded action transformation algorithm

We now introduce the main contribution of this article—a novel gsl algorithm called the grounded action transformation (gat) algorithm. gat instantiates the gsl framework by introducing a specific implementation of the grounding step (Step 2). The main idea behind gat is to augment the simulator with a differentiable action transformation function, g, which transforms the agent’s simulated action into an action which—when taken in simulation—produces the same transition that would have occurred in the physical system. The function, g, is represented with a parameterized function approximator whose parameters serve as \({\varvec{\phi }}\) for the augmented simulator in the gsl framework. We leave open the gat instantiation of the other gsl steps (data collection, policy optimization, and final policy evaluation). The main contribution of gat is a novel method to ground the simulator.

The gat algorithm learns two functions: f which predicts the effects of actions in \({\mathcal {M}}\) and \(f_\mathtt {sim}^{-1}\), which predicts the action needed in simulation to reproduce the desired effects. Let \({\mathbf {x}}\) be a subset of the components of state \({\mathbf {s}}\) and let \(\mathcal {X}\) be the set of all possible values for \({\mathbf {x}}\). We refer to the components of \({\mathbf {x}}\) as the state variables of interest. We define gat as grounding a subset of the state components to allow users to inject domain knowledge into the grounding process if they know what components are most important to model correctly; a user can always opt to include all components of the state as state variables of interest if they lack such domain knowledge. Formally, the function \(f: {\mathcal {S}}\times {\mathcal {A}}\rightarrow \mathcal {X}\) is a forward model that predicts the effect on the state variables of interest given an action chosen in a particular state in \({\mathcal {M}}\). The function \(f_\mathtt {sim}^{-1}: {\mathcal {S}}\times \mathcal {X} \rightarrow {\mathcal {A}}\) is an inverse model that predicts the action that causes a particular effect on the state variables of interest given the current state in simulation. The overall action transformation function \(g: {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathcal {A}}\) is specified as \(g({\mathbf {s}}, {\mathbf {a}}):=f^{-1}_\mathtt {sim}({\mathbf {s}}, f({\mathbf {s}}, {\mathbf {a}}))\). When the agent is in state \({\mathbf {s}}_t\) in the simulator and takes action \({\mathbf {a}}_t\), the augmented simulator replaces \({\mathbf {a}}_t\) with \(g({\mathbf {s}}_t, {\mathbf {a}}_t)\) and the simulator returns \({\mathbf {s}}_{t+1}\) where the \({\mathbf {x}}_{t+1}\) components of \({\mathbf {s}}_{t+1}\) are closer in value to what would be observed in \({\mathcal {M}}\) had \({\mathbf {a}}_t\) been taken there. Figure 2 illustrates the augmented simulator.

gat learns the functions f and \(f^{-1}_\mathtt {sim}\) with supervised learning. The function f is learned by collecting a small number of real world trajectories and then constructing a supervised learning dataset \(\{({\mathbf {s}}_i, {\mathbf {a}}_i)\} \rightarrow \{{\mathbf {x}}_i^\prime \}\). Similarly, the function \(f^{-1}_\mathtt {sim}\) is learned by collecting simulated trajectories and then constructing a supervised learning dataset \(\{({\mathbf {s}}_i, {\mathbf {x}}_i^\prime )\} \rightarrow \{{\mathbf {a}}_i\}\). This pair of supervised learning problems can be solved by a variety of techniques. In our experiments we use either neural networks or linear models trained with gradient descent on a squared error loss. Pseudocode for the full gat algorithm is given in Algorithm 1.

Fig. 2 The augmented simulator which can be grounded to the real world with supervised learning. The policy computes an action that is then passed to the action grounding module. This module first predicts the values for the state variables of interest if the action had been taken in the real world. The module then uses an inverse dynamics model, \(f^{-1}_\mathtt {sim}\), to compute the action that produces the same effect in simulation. Finally, the policy’s action is replaced with the predicted action and this modified action is passed to the simulator

Algorithm 1 The grounded action transformation (gat) algorithm
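A minimal sketch of the gat grounding step is given below. It is our reconstruction from the description above, not a transcription of Algorithm 1; it assumes the collected trajectories have already been split into transition tuples \(({\mathbf {s}}, {\mathbf {a}}, {\mathbf {x}}^\prime )\) and uses linear regression as a stand-in for whichever function approximator is chosen for f and \(f^{-1}_\mathtt {sim}\).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_forward_model(real_transitions):
    """f: (s, a) -> x', fit by supervised regression on real world transitions."""
    X = np.array([np.concatenate([s, a]) for (s, a, x_next) in real_transitions])
    Y = np.array([x_next for (_, _, x_next) in real_transitions])
    return LinearRegression().fit(X, Y)

def fit_inverse_model(sim_transitions):
    """f_sim^{-1}: (s, x') -> a, fit by supervised regression on simulated transitions."""
    X = np.array([np.concatenate([s, x_next]) for (s, a, x_next) in sim_transitions])
    Y = np.array([a for (_, a, _) in sim_transitions])
    return LinearRegression().fit(X, Y)

def make_action_transformer(f, f_inv_sim):
    """g(s, a) = f_sim^{-1}(s, f(s, a)): predict the real world effect of the policy's
    action, then return the simulator action that reproduces that effect."""
    def g(s, a):
        x_pred = f.predict(np.concatenate([s, a])[None])[0]
        return f_inv_sim.predict(np.concatenate([s, x_pred])[None])[0]
    return g
```

The returned g is then inserted between the policy and the simulator, as depicted in Fig. 2.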

Because we take a data-driven approach to simulator modification, the result is not necessarily a globally more accurate simulator for the real world. Our only goal is that the simulator is more realistic for trajectories sampled with the grounding policy. If we can achieve this goal, then we can locally improve the policy without any additional real world data. A simulator that is more accurate globally may provide a better starting point for gat; however, by focusing on simulator modification local to the grounding policy we can still obtain policy improvement in low fidelity simulators.

We also note that gat minimizes the error between the immediate state transitions of \({\mathcal {M}_\mathtt {sim}}\) and those of \({\mathcal {M}}\). Another possible objective would be to observe the difference between trajectories in \({\mathcal {M}}\) and \({\mathcal {M}_\mathtt {sim}}\) and ground the simulator to minimize the total error over a trajectory. Such an objective could lead to an action modification function g that accepts short-term error if it reduces the error over the entire trajectory; however, it would require the simulator dynamics to be differentiable. As it is unclear how to select the modified actions that minimize multi-step error, we accept minimizing the one-step error as a good proxy for our ultimate objective, which is that the current policy \(\pi\) produces similar trajectories in both \({\mathcal {M}}\) and \({\mathcal {M}_\mathtt {sim}}\). The specific choice of g used by gat allows it to learn the actions that minimize the one-step error between simulated and real world transitions.

5.1 Modifying actions vs. modifying parameters

Before presenting an empirical evaluation of gat, we discuss the motivation for modifying actions instead of internal simulator parameters. Our main motivation for modifying the agent’s simulated action is that we can then treat the simulator as a black box. While physics-based simulators typically have a large number of parameters determining the physics of the simulated environment (e.g., friction coefficients, gravitational values), these parameters are not necessarily amenable to numerical optimization of Eq. (1). First, just because a simulator has such parameters does not mean that they are exposed to the user or can be modified without additional software engineering. On the other hand, when applying RL, it is reasonable to assume that a user has access to the actions output by the policy and could thus include an action transformation to ground the simulator. Second, even if changing physics parameters is straightforward, it may be computationally or manually intensive to determine how to change a parameter to make the simulator produce trajectories closer to the ones we observe in the real world. In contrast, action modification with gat allows us to transform simulator modification into a supervised learning problem.

In this article we focus on the black-box setting where we are unable to change the simulator’s internal parameters. However, if these parameters are exposed to the user then there may be settings where correctly identifying the real world parameters provides more reliable transfer than action modification. A characterization of the settings where one approach is preferable to the other is an interesting direction for future research.

6 GAT empirical study

We now present an empirical study of applying the gat algorithm for reinforcement learning with simulated data. Our experiments are designed to answer the following questions:

1. Does grounding a simulation with gat allow skills learned in simulation to transfer to the real world?

2. Does gat make the simulated robot’s actions have effects similar to those they would have in the real world?

To answer these questions we apply gat to three tasks with the simulated and physical nao robot. Though our focus is on sim-to-real transfer, we include two experiments in a sim-to-sim setting where we use one simulator as a surrogate for the real world. These experiments allow us to run a larger number of experimental trials than would be practical in the tasks using a physical robot. We first give a general description of the empirical set-up. We then proceed to describe each task and the empirical results observed.

6.1 General NAO task description

All empirical tasks use either a simulated or physical Softbank nao robot. The nao is a humanoid robot with 25 degrees of freedom (see Fig. 3a). Though the nao has 25 degrees of freedom, we restrict ourselves to observing and controlling 15 of them (we ignore joints that are less important for our experimental tasks—joints in the head, hands, and elbows). We will refer to the degrees of freedom as the joints of the robot. Figure 4 shows a diagram of the nao and its different joints.

We define the state variables of interest to be the angular position of each of the robot’s joints. In addition to angular position, the robot’s state consists of joint angular velocities and other task-dependent variables. The robot’s actions are desired joint angular positions which are implemented at a lower software level using pid control. There is a one-to-one correspondence between components of the robot’s action and the state variables of interest.

Fig. 3 The three robotic environments used here. The Softbank nao is our target physical robot. The nao is simulated in the Gazebo and SimSpark simulators. Gazebo is a higher fidelity simulator which we also use as a surrogate for the real world in an empirical comparison of grounded action transformation (gat) to baseline methods

Fig. 4 Diagram of the Softbank nao robot with joints (degrees of freedom) labeled. Each joint has a sensor that reads the current angular position of the joint and can be controlled by providing a desired angular position for the joint. In this work, we ignore the HeadYaw, HeadPitch, left and right ElbowRoll, left and right ElbowYaw, left and right WristYaw, and left and right Hand joints. There is also no need to control the right HipYawPitch joint as, in reality, this degree of freedom is controlled by the movement of the left HipYawPitch joint. This image was downloaded from: http://doc.aldebaran.com/2-8/family/nao_technical/lola/actuator_sensor_names.html

In all tasks our implementation of gat uses a history of the joint positions and desired joint positions as an estimate of the nao’s state to input into the forward and inverse models. Instead of directly predicting \({\mathbf {x}}_{t+1}\), the forward model, f, is trained to predict the change in \({\mathbf {x}}_t\) after taking \({\mathbf {a}}_t\). The inverse model \(f^{-1}_\mathtt {sim}\) takes the current \({\mathbf {x}}_t\) and the desired change in \({\mathbf {x}}\) at the next time-step and outputs the action needed to cause this change. Since both the state variables of interest and actions have angular units, we train both f and \(f^{-1}_\mathtt {sim}\) to output the sine and cosine of each output angle. From these values we can recover the predicted output with the \(\arctan\) function. Because \(f^{-1}_\mathtt {sim}\) and f are trained with supervised learning, they may make small errors when used to change the agent’s actions (Ross et al., 2011). Since small errors may make the output of g not smooth from timestep to timestep, we sometimes find it useful to use a smoothing parameter, \(\alpha\), to ensure stable motions. The action transformation function (Algorithm 1, line 7) is then defined as:

$$\begin{aligned} g({\mathbf {s}}, {\mathbf {a}}) :=\alpha f^{-1}_\mathtt {sim}({\mathbf {s}},f({\mathbf {s}},{\mathbf {a}})) + (1 - \alpha ) {\mathbf {a}}. \end{aligned}$$

In our experiments involving bipedal walking, we set \(\alpha\) as high as possible subject to the robot remaining stable in simulation when executing \(\pi _0\). In all other experiments, we use \(\alpha =1.0\).
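The sketch below (ours, not code from the article) shows one way the smoothed transformation above and the sine/cosine angle recovery could be implemented; f and f_inv_sim are assumed to be callables wrapping the trained forward and inverse models, and angles are assumed to be in degrees.

```python
import numpy as np

def recover_angles(sin_out, cos_out):
    """Recover predicted joint angles (degrees) from a model's sine/cosine outputs."""
    return np.degrees(np.arctan2(sin_out, cos_out))

def smoothed_action_transform(f, f_inv_sim, s, a, alpha=1.0):
    """g(s, a) = alpha * f_sim^{-1}(s, f(s, a)) + (1 - alpha) * a.
    alpha = 1 uses the grounded action directly; smaller values blend in the policy's
    original action to keep the commanded joint targets smooth over time."""
    x_pred = f(s, a)                    # predicted real world effect on the joints of interest
    a_grounded = f_inv_sim(s, x_pred)   # simulator action predicted to reproduce that effect
    return alpha * a_grounded + (1.0 - alpha) * a
```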

We consider two simulators in this work: the SimSpark Soccer Simulator used in the annual RoboCup 3D Simulated Soccer competition and the Gazebo simulator from the Open Source Robotics Foundation. SimSpark enables fast simulation but is a lower fidelity model of the real world. Gazebo enables relatively high fidelity simulation with an additional computational cost. The nao model in each of these simulators is shown in Fig. 3.

Across all tasks we use the covariance matrix adaptation evolutionary strategies (cma-es) algorithm (Hansen et al., 2003) for the policy optimization routine. cma-es is a stochastic search algorithm that updates a population of candidate policies over a set number of generations. At each generation, cma-es samples a population of policy parameter values from a Gaussian distribution. It then uses the evaluation of each candidate policy in simulation to update the sampling distribution for the population at the next generation. cma-es has been found to be very effective at optimizing robot skills in simulation (Urieli et al., 2011). In all experiments we use a population size of 150 candidate policies at each generation as we were able to submit up to 150 parallel policy evaluations at a time on the University of Texas Computer Science distributed computing cluster.
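For illustration, the snippet below runs the generational loop described above using the open source cma package; the article does not specify which CMA-ES implementation was used, and the initial step size sigma0 here is an arbitrary assumed value.

```python
import numpy as np
import cma  # pip install cma; a stand-in for whichever CMA-ES implementation was actually used

def optimize_in_sim(theta0, evaluate_in_sim, generations=10, popsize=150, sigma0=0.1):
    """Generational CMA-ES loop. evaluate_in_sim(theta) should return the average return
    of the policy parameterized by theta in the (grounded) simulator; CMA-ES minimizes
    its objective, so the return is negated."""
    es = cma.CMAEvolutionStrategy(theta0, sigma0, {'popsize': popsize})
    best_per_generation = []
    for _ in range(generations):
        candidates = es.ask()                        # sample a population of parameter vectors
        fitness = [-evaluate_in_sim(np.asarray(theta)) for theta in candidates]
        es.tell(candidates, fitness)                 # update the sampling distribution
        best_per_generation.append(candidates[int(np.argmin(fitness))])
    return best_per_generation
```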

With the exception of the final experiment in this section, we run a single iteration of gat per experimental setting. A single iteration allows us to keep the initial policy fixed so that we have a more controlled measure of the efficacy of simulator grounding. In all cases we select the architectures of the forward and inverse dynamics models by evaluating a least-squares loss on a held-out set of transitions. These models are trained with stochastic gradient descent using the Adam optimizer (Kingma & Ba, 2014).

6.2 Learning arm control

Our first task requires the nao to learn to raise its arms from its sides to a goal position, \(\mathbf {p}^\star\), which is defined to be halfway to horizontal (a lift of 45 degrees). We call this task the “Arm Control” task. In this task, the robot’s policy only controls the two shoulder joints responsible for raising and lowering the arms. The angular positions of these joints are the state variables of interest, \({\mathbf {x}}\). The policy is a linear mapping from \({\mathbf {x}}_t\) and \({\mathbf {x}}_{t-1}\) to the action \({\mathbf {a}}_t\):

$$\begin{aligned} \pi ({\mathbf {x}}_t, {\mathbf {x}}_{t-1}) = \mathbf {w} \cdot ({\mathbf {x}}_t, {\mathbf {x}}_{t-1}) + \mathbf {b} \end{aligned}$$

where \(\mathbf {w}\) and \(\mathbf {b}\) are learnable parameters. At time t, the agent receives reward:

$$\begin{aligned} r({\mathbf {x}}_t) = \frac{1}{|{\mathbf {x}}_t - \mathbf {p}^\star |_2^2} \end{aligned}$$

and the episode terminates after 200 steps or when either of the robot’s arms rises above 45 degrees. The optimal policy is to move as close as possible to 45 degrees without lifting higher.

We apply gat for sim-to-sim transfer from SimSpark (\({\mathcal {M}_\mathtt {sim}}\)) to Gazebo (\({\mathcal {M}}\) – effectively treating Gazebo as the real world). We represent f and \(f_\mathtt {sim}^{-1}\) with linear functions. To train f, we collect 50 trajectories in \({\mathcal {M}}\); to train \(f_\mathtt {sim}^{-1}\), we collect 50 trajectories from \({\mathcal {M}_\mathtt {sim}}\).

On this task our baseline is learning without simulator modification. For each method (gat and “No Modification”), we run 10 experimental trials where each trial consists of running 50 generations of cma-es and taking the best performing candidate policy from each generation and evaluating it in \({\mathcal {M}}\). Our main point of comparison is which method finds a policy that allows the robot to move its arms closer to the target position (higher \(v(\pi , {\mathcal {M}})\)).

Figure 5 shows the mean distance from the target position for the final policy learned in simulation either with gat or with “No Modification.” Results show that gat is able to overcome the reality gap and results in policies that reduce error in final arm position.

Fig. 5 Mean performance of best policies found on the Arm Control task. We run 10 experimental trials using gat and 10 experimental trials directly transferring from \({\mathcal {M}_\mathtt {sim}}\) to \({\mathcal {M}}\) (“No Modification”). The vertical axis gives the average distance to the target position during a trajectory (lower is better). Error bars are for a 95% confidence interval

We also visualize the effect of the action modification function, g, in the simulator. Figure 6 shows how the robot’s LeftShoulderPitch joint moves in \({\mathcal {M}}\), \({\mathcal {M}_\mathtt {sim}}\), and the grounded \({\mathcal {M}_\mathtt {sim}}\) when a constant action of \(-15\) degrees is applied. In \({\mathcal {M}_\mathtt {sim}}\) the position of the LeftShoulderPitch responds immediately to the command while in \({\mathcal {M}}\) the position changes much more slowly. In SimSpark, the shoulder joints are more responsive to commands and thus the robot needs to learn that it must take weaker actions to prevent overshooting the target. In Gazebo, the joints are less responsive to the actions and the same policy fails to get the arms close to the target. After applying gat, the position changes much more slowly in simulation because the action modification function reduces the magnitude of the desired change. This visualization helps answer our second empirical question as to whether or not action modification makes the simulator behave more like reality.

Fig. 6 Visualization of the robot’s LeftShoulderPitch joint position in \({\mathcal {M}}\), \({\mathcal {M}_\mathtt {sim}}\), and \({\mathcal {M}_\mathtt {sim}}\) after applying gat. The horizontal axis is time in frames (50 frames per second). The vertical axis is in angular units, which are the units of both the plotted actions and states. Trajectories were generated in each environment with a policy that sets a constant desired position of \(-15\) degrees (“Action”). “Real State” shows the LeftShoulderPitch position in \({\mathcal {M}}\), “No Grounding State” shows the position in \({\mathcal {M}_\mathtt {sim}}\), and “Grounded State” shows the position in the grounded \({\mathcal {M}_\mathtt {sim}}\). “Grounded Action” shows the action that the gat action modification function takes in place of “Action”

6.3 Linear walk policy optimization

Our second task is walking forward with a linear control policy on the physical robot. The state variables of interest are 10 joints in the robot’s legs (ignoring the left HipYawPitch joint) and the 4 joints controlling its shoulders. The actions are desired angular positions for all 15 of these joints.

The policy inputs are the reading of the gyroscope that measures forward-backward angular velocity, y, and the reading of the gyroscope that measures side-to-side angular velocity, x. We also provide as input an open-loop sine wave. The sine wave encodes prior knowledge that a successful walking policy will repeat actions periodically. The final form of the policy is:

$$\begin{aligned} \pi (\langle x, y, \sin (c \cdot t) \rangle ) = \mathbf {w} \cdot \langle x, y, \sin (c \cdot t) \rangle + \mathbf {b} \end{aligned}$$

where c is a learnable scalar that controls the walking step frequency. The policy outputs only commands for the left side of the robot’s body and the commands for the right side are obtained by reflecting these commands around a learned value. That is, for each joint, j, on the left side of the robot’s body we learn a parameter \(\psi _j\) and obtain the action for the right side of the robot’s body by reflecting the policy’s output for j across \(\psi _j\). This representation is equivalent to expressing the policy for the right side of the robot’s body as:

$$\begin{aligned} \pi _r(\langle x, y, \sin (c \cdot t) \rangle ) = \mathbf {\psi } - (\mathbf {w} \cdot \langle x, y, \sin (c \cdot t) \rangle + \mathbf {b} - \mathbf {\psi }). \end{aligned}$$

In our experiments, instead of optimizing a separate \({{\varvec{\psi }}}\) vector, we clamp \({\varvec{\psi }}\) to be equal to the bias, \(\mathbf {b}\).
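A small sketch of this mirroring scheme, based on the description above with \({\varvec{\psi }}\) clamped to the bias (an illustration, not the authors’ code):

```python
import numpy as np

def walk_policy_actions(w, b, x, y, c, t):
    """Left-side commands from the linear policy; right-side commands are the left-side
    commands reflected across psi, with psi clamped to the bias b."""
    features = np.array([x, y, np.sin(c * t)])
    left = w @ features + b          # pi(<x, y, sin(c*t)>) for the left-side joints
    psi = b                          # psi clamped to the bias, as described above
    right = psi - (left - psi)       # reflection across psi: psi - (pi - psi)
    return left, right
```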

We define the reward as a function of the distance the robot has travelled at the final time-step. Let \(\varDelta (s_t, s_0)\) be the robot’s forward change in position between state \(s_t\) and state \(s_0\) and let \(\mathbb {I}(s_t)\) take value 1 if the robot has fallen over in state \(s_t\) and 0 otherwise. In simulation:

$$\begin{aligned} r(s_t,a_t) :={\left\{ \begin{array}{ll} 0 &{} t < l - 1 \\ \varDelta (s_t, s_0) - 25 \cdot \mathbb {I}(s_t) &{} t = l - 1 \end{array}\right. }. \end{aligned}$$

where the penalty of \(-25\) discourages cma-es from proposing policies that obtain high forward displacement through potentially unsafe actions for the physical robot. For example, cma-es might find a policy that throws itself forward, obtaining high reward but risking damage on the physical robot. The penalty does not guarantee that the best simulation policies will be stable in the real world but it at least encourages them to be stable in simulation. On the physical robot we only measure forward distance travelled; if the robot falls we count the distance travelled as zero:

$$\begin{aligned} r(s_t,a_t) :={\left\{ \begin{array}{ll} 0 &{} t < l - 1 \\ \varDelta (s_t, s_0) \cdot (1 - \mathbb {I}(s_t)) &{} t = l - 1 \end{array}\right. }. \end{aligned}$$

We apply gat for sim-to-real transfer from SimSpark to the physical nao. We learn f and \(f_\mathtt {sim}^{-1}\) with linear regression. To train f we collect 10 trajectories in \({\mathcal {M}}\); to train \(f_\mathtt {sim}^{-1}\) we collect 50 trajectories from \({\mathcal {M}_\mathtt {sim}}\). We chose 10 trajectories for \({\mathcal {M}}\) because after 10 the robot’s motors may begin to heat up, which changes the dynamics of the joints.

In the Linear Policy Walking task we measure performance based on how far forward the robot walks. The initial policy fails to move the robot forward at all—though it is executing a walking controller, its feet never break the friction of the carpet and so it remains at the starting position. We run five trials of learning with simulator modification and five trials without. On average, learning in simulation with gat resulted in the robot moving 4.95 cm forward, while without simulator modification the robot moved only 1.3 cm.

Across the five trials without modification, two trials fail to find any improvement. The remaining three only find improvement in the first generation of cma-es—before cma-es has been able to begin exploiting inaccuracies in the simulation. In contrast, all trials with simulator modification find improving policies and improvement comes in later learning generations (on average generation 3 is the best).

We also plot example trajectories to see how the modified and unmodified simulations compare to reality. Instead of plotting all state and action variables, we only plot the state variable representing the robot’s right AnklePitch joint and the action that specifies a desired position for this joint. This joint was chosen because the main failure of policies learned without simulator modification is that the robot’s feet never break the friction of the carpet. We hypothesize that learning to properly move the ankles may be important for a policy to cross the reality gap and succeed in the real world.

Figure 7a shows the prediction of joint position for the learned forward model, f, as well as the joint position in the real world and simulation. The “Predicted State” curve is generated by using f as a simulator of how the joint position changes in response to the actions. The figure shows that in the real world the right AnklePitch joint oscillates around the desired angular position as given by the robot’s action. The forward model f predicts this oscillation while the simulator models the joint position as static.

Figure 7b shows the actual real world and simulated trajectories, both for the modified and unmodified simulators. Though the modified simulator still fails to capture all of the real world oscillation, it captures more of it than the unmodified simulator. Learning in a simulator that more accurately models this motion leads to policies that are able to lift the robot’s legs enough to walk. This qualitative result also shows how action modification can be an effective strategy for simulator grounding.

Fig. 7 Visualization of the robot’s right AnklePitch joint during the Linear Policy Walking task. Both sub-figures show the position trajectory for \({\mathcal {M}}\) (denoted “Real State”) and \({\mathcal {M}_\mathtt {sim}}\) (“No Grounding State”). They also both show the action, though it is covered by the “No Grounding State” curve. Sub-figure (a) shows the gat forward model’s prediction of position given the same action sequence. Sub-figure (b) shows the actual position when acting in the modified simulation

6.4 Sim-to-sim walk engine policy optimization

In this section, we evaluate gat on the task of bipedal robot walking with a state-of-the-art walk controller for the nao robot. The initial policy is the open source University of New South Wales (unsw) walk engine developed for RoboCup Standard Platform League (spl) competitions (Ashar et al., 2015; Hall et al., 2016). This walk engine is a software module designed for the nao robot that takes in the robot’s proprioceptive and inertial sensor readings and outputs desired positions for the robot’s joints; we refer the reader to Ashar et al. (2015) for full details of the initial policy’s implementation. This walk controller has been used by at least one team in the 2014, 2015, 2016, 2017, 2018, and 2019 RoboCup spl championship games in which teams of five naos compete in soccer matches. To the best of our knowledge, it is the fastest open source walk available for the nao. We first present a sim-to-sim evaluation of gat using Gazebo as a surrogate for the real world. Performing a sim-to-sim evaluation allows us to evaluate gat and baselines with more trials than would be possible to run on the physical robot. In the next section, we apply gat to optimize the unsw walk engine on the physical robot.

The unsw walk engine has 15 parameters that determine features of the walk (see Table 1 for a full list of these parameters). The values of the parameters from the open source release constitute the parameterization of the initial policy \(\pi _0\). Hengst (2014) describes the unsw walk controller in more detail. For this task, \(v(\pi , {\mathcal {M}})\) is the average forward walk velocity while executing \(\pi\). In simulation a trajectory terminates after a fixed time interval (7.5 seconds in SimSpark and 10 seconds in Gazebo) or when the robot falls. For policy improvement in simulation, we apply cma-es for 10 generations with a population size of 150 candidate policies evaluated in each generation.

Table 1 The initial parameter values found in the open source release of the unsw walk engine

We implement gat with two two-hidden-layer neural networks—one for f and one for \(f^{-1}_\mathtt {sim}\). Each function is a neural network with 200 hidden units in the first layer and 180 hidden units in the second.
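A sketch of such a network in PyTorch follows; the ReLU activations, the full-batch training loop, and the exact input features are assumptions, since the article does not specify them. As described in Sect. 6.1, the network outputs the sine and cosine of each predicted angle and is trained on a squared-error loss with the Adam optimizer.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Two-hidden-layer network (200 and 180 units) usable for either f or f_sim^{-1}.
    Outputs the sine and cosine of each predicted joint angle."""
    def __init__(self, input_dim, num_joints):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 200), nn.ReLU(),
            nn.Linear(200, 180), nn.ReLU(),
            nn.Linear(180, 2 * num_joints),
        )

    def forward(self, x):
        return self.net(x)

def train(model, inputs, targets, epochs=100, lr=1e-3):
    """Fit the model to (input, target) pairs with a squared-error loss and Adam."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    return model
```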

As baselines, we evaluate the effectiveness of gat compared to learning with no grounding and to grounding \({\mathcal {M}_\mathtt {sim}}\) by adding Gaussian noise to the robot’s actions. Adding an “envelope” of noise has been used before to minimize simulation bias by preventing the policy improvement algorithm from overfitting to the simulator’s dynamics (Jakobi et al., 1995). We refer to this baseline as ane for Action Noise Envelope. We hypothesize that gat modifies the simulation in a more effective way than simply forcing learning to be robust to perturbation and will thus obtain a higher level of performance.

For gat we collect 50 trajectories of robot experience to train f and 50 trajectories of simulated experience to train \(f^{-1}_\mathtt {sim}\). For each method, we run 10 generations of the cma-es algorithm with a population size of 150, and each member of the population is evaluated in simulation with 20 trajectories. Overall, the cma-es optimization requires 30,000 simulated trajectories for each experimental trial. We run 10 total experimental trials for each method.

Table 2 gives the average improvement in stable walk policies for each method and the number of trials in which a method failed to produce a stable improvement. Results show that gat achieves the largest policy improvement and the fewest transfer failures when transferring from a low-fidelity to a high-fidelity simulator. ane improves upon no grounding both in the size of the improvement and in the number of trials without improvement. Adding noise to the simulator encourages cma-es to propose robust policies which are more likely to be stable. However, gat further improves over ane—demonstrating that action transformations ground the simulator in a more effective way than simply injecting noise.

Table 2 also shows that, on average, gat finds an improved policy within the first few generations after grounding. The grounding done by gat is inherently local to the trajectory distribution of \(\pi _{{\varvec{\theta }}_0}\). Thus, as \(\pi _{\varvec{\theta }}\) changes, the action transformation function fails to produce a more realistic simulator. As policy improvement progresses, the best policies in each cma-es generation begin to over-fit to the dynamics of \({\mathcal {M}_\mathtt {sim}}\). Without grounding, over-fitting happens almost immediately, and so when learning with no grounding finds an improvement it is usually in an early generation of cma-es. ane can mitigate over-fitting by emphasizing robust policies, although it is limited in the improvement it finds compared to gat.

Table 2 This table compares the grounded action transformation algorithm (gat) with baseline approaches for transferring learning between SimSpark and Gazebo

6.5 Sim-to-real walk engine policy optimization

We now present our main empirical result—an application of gat to optimizing a state-of-the-art walking controller for the nao robot. All experimental details are the same as those used in the sim-to-sim evaluation except for the following changes. On the physical robot, a trajectory terminates once the robot has walked four meters (\(\approx 20.5\)s with the initial policy) or falls. The data set \(\mathcal {D}\) consists of 15 trajectories collected with \(\pi _0\) on the physical nao. To ensure the robot’s motors stayed cool, we waited five minutes after collecting every five trajectories. For each iteration of gat, we run 10 generations of the cma-es algorithm with a population size of 150. For each generation of cma-es, we select the candidate policy that maximizes \(v(\pi ,{\mathcal {M}_\mathtt {sim}})\) within that generation and evaluate it on the physical robot (resulting in 10 policies being evaluated on the physical robot). We evaluate each policy on the physical robot with five trajectories. If the robot falls in any trajectory the policy is considered unstable.

Table 3 gives the physical world walk velocity of policies learned in simulation with gat. The physical robot walks at a velocity of 19.52 cm/s with \(\pi _0\). gat with SimSpark and gat with Gazebo both improved walk velocity by over 30% in a single iteration. Policy improvement with cma-es required 30,000 trajectories per gsl iteration to find the 10 policies that were evaluated on the robot. In contrast, the total number of trajectories executed on the physical robot is 65 (15 trajectories in \(\mathcal {D}\) and 5 evaluations per \(\pi _c \in \varPi\)). This result demonstrates that gat can use sample-intensive simulation learning to optimize real world skills with a low number of trajectories on the physical robot.

Farchy et al. (2013) demonstrated the benefits of re-grounding (i.e., re-running the gsl framework from the best policy found) and further optimizing \(\pi\). We reground the simulator with 15 trajectories collected with the best policy found by gat with SimSpark and optimize for a further 10 generations of cma-es in the SimSpark simulation. The second iteration of gat results in a walk, \({\varvec{\theta }}_2\), which averages 27.97 cm/s for a total improvement of 43.27% over \({\varvec{\theta }}_0\). Overall, improving the unsw walk by over 40% shows that gat can learn walk policies that outperform the fastest known stable walk for the nao robot.

Table 3 This table gives the maximum learned velocity and percent improvement for each method starting from \(\pi _0\) (top row)

7 Stochastic GAT (SGAT)

The experiments described in Sect. 6 established that gat can lead to successful sim-to-real transfer on a challenging task. This success naturally raises the question of under what conditions gat will succeed, and, on the other hand, when it might fail. Towards answering this question, we observe that because gat learns a deterministic forward model of the world, it may be limited when the real world state transitions are stochastic. We then introduce a generalization of gat and demonstrate how it overcomes this limitation.

When the real world has stochastic transitions, gat may be unable to ground the simulator in a way that leads to a good policy. To see this limitation, consider the toy example shown in Fig. 8, in which the optimal action in the simulator is \(a_3\) and in the real world it is \(a_2\); however, in the gat grounded simulator, the optimal action becomes \(a_1\). Since gat’s forward model is deterministic, it predicts only the most likely next state, but other, less likely transitions are also important when computing an action’s value.

Fig. 8 A toy example where gat may fail to ground the simulator for learning. The gray box depicts the grounding step with blue arrows representing the forward model and red arrows representing the inverse dynamics model. When the real world has stochastic transitions, the gat forward model only captures the most likely next state. gat may fail here, since the optimal action in the grounded simulator (\(a_3\)) is sub-optimal in the real environment

To address real world stochasticity, we introduce a generalization of gat—Stochastic Grounded Action Transformation (sgat)—which learns a stochastic model of the forward dynamics. In other words, the learned forward model, f, predicts a distribution over next states; a potential next state is sampled from this distribution, and then the sampled state is used with \(f^{-1}_\mathtt {sim}\) instead of always taking the most likely next state. The grounding function learned by sgat is given by:

$$\begin{aligned} g(s,a) = f^{-1}_\mathtt {sim}(s, S'), \qquad S' \sim f(s,a) \end{aligned}$$

where \(f(s,a)\) now gives a distribution over next states instead of the single most likely next state. The sampling operation within the action transformer makes the overall action transformation stochastic. Figure 9 illustrates the simulator from the example in Fig. 8, now grounded using sgat. Since the forward model accounts for stochasticity in the real world, the actions in the grounded simulator have the same effect as in the real world.
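
Concretely, the sgat grounding step can be written as a thin wrapper around the simulator. The following is a minimal Python sketch rather than the exact implementation used in our experiments; forward_model, inverse_model_sim, and sim_env are hypothetical stand-ins for the learned models and the simulator interface, and forward_model is assumed to return a distribution object exposing .sample() and .mean.

def sgat_action_transform(s, a, forward_model, inverse_model_sim, stochastic=True):
    """Ground an action: predict a real-world next state, then find the
    simulator action that reproduces it from state s."""
    next_state_dist = forward_model(s, a)
    if stochastic:
        s_next = next_state_dist.sample()   # sgat: sample a plausible real-world next state
    else:
        s_next = next_state_dist.mean       # gat: always use the most likely next state
    # inverse_model_sim(s, s_next) returns the simulator action predicted to
    # move the simulator from s to s_next.
    return inverse_model_sim(s, s_next)

def grounded_step(sim_env, s, a, forward_model, inverse_model_sim):
    """One step in the grounded simulator: the policy's action is replaced
    by the transformed action before being executed in simulation."""
    a_hat = sgat_action_transform(s, a, forward_model, inverse_model_sim)
    return sim_env.step(a_hat)

Setting stochastic=False in this sketch recovers gat as the special case that always uses the most likely predicted next state, as discussed below.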

Fig. 9 The sgat algorithm applied to the toy example in Fig. 8. In the sgat grounded simulator, the transitions match the real environment (Fig. 8b)

An implementation of gat can be extended to an implementation of sgat by replacing the predicted next state output of f with predicted parameters of the next state distribution. Let \(p(s_{t+1}|s_t,a_t)\) denote the probability of \(s_{t+1}\) under the distribution given by \(f(s_t, a_t)\). We fit the stochastic forward model by minimizing the negative log likelihood loss \(\mathcal {L} = -\log p(s_{t+1}|s_t,a_t)\) over observed real world transitions \((s_t, a_t, s_{t+1})\). For example, in continuous state and action domains, f could output the mean and covariance of a Gaussian distribution.
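
As one possible realization of this loss, the sketch below fits a diagonal-Gaussian forward model with PyTorch. The class name and the specific architecture shown (two hidden layers of 64 units, matching the models described in Sect. 8.2) are illustrative assumptions rather than a prescription.

import torch
import torch.nn as nn

class GaussianForwardModel(nn.Module):
    """Stochastic forward model f(s, a) -> Normal(mu, diag(sigma^2)) over next states."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu_head = nn.Linear(hidden, state_dim)        # mean of next state
        self.log_std_head = nn.Linear(hidden, state_dim)   # log standard deviation

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        mu = self.mu_head(h)
        std = self.log_std_head(h).clamp(-5.0, 2.0).exp()  # keep sigma in a sane range
        return torch.distributions.Normal(mu, std)

def nll_loss(model, s, a, s_next):
    """Negative log likelihood of observed real world transitions (s, a, s_next)."""
    dist = model(s, a)
    return -dist.log_prob(s_next).sum(dim=-1).mean()

The loss can be minimized with any gradient-based optimizer over the real world dataset collected during grounding.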

sgat generalizes gat: gat can be viewed as the special case of sgat that always selects the most likely real world next state given the current state and action. We next present an empirical study showing that this generalization is crucial for real world domains with high stochasticity.

8 Stochastic GAT empirical study

This section reports on an empirical study of transfer from simulation with sgat compared to gat. We begin with a toy RL domain and progress to sim-to-real transfer of a bipedal walking controller for a nao robot on bumpy carpet. This additional empirical study is designed to answer the questions:

  1. Does gat perform worse when real world stochasticity is increased?

  2. Can sgat successfully ground the simulator even when the real world is stochastic?

Our empirical results show the benefit of modelling stochasticity when grounding a simulator for transfer to a stochastic real world environment.

8.1 Cliff walking

We first verify the benefit of sgat using a classical reinforcement learning domain, the Cliff Walking grid world (Sutton & Barto, 1998) shown in Fig. 10. In this domain, an agent must navigate around a cliff to reach a goal. The agent can move up, down, left, or right. If it tries to move into a wall, the action has no effect. The episode terminates when the agent either reaches the goal (reward of \(+100\)) or falls off the cliff (reward of \(-10\)). There is also a small time penalty (\(-0.1\) per time step), so the agent is incentivized to find the shortest path. There is no discounting, so the agent’s objective is to maximize the sum of rewards over an episode. We use policy iteration (Sutton & Barto, 1998) for the \(\mathtt {optimize}\) routine in simulation.
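
For concreteness, a generic tabular policy iteration routine of the kind used for the \(\mathtt {optimize}\) step might look like the sketch below. The array layout for P and R and the function name are illustrative assumptions, not the exact implementation used in our experiments.

import numpy as np

def policy_iteration(P, R, gamma=1.0, terminal=None, tol=1e-8, max_sweeps=10_000):
    """Tabular policy iteration.
    P[s, a, s'] -- transition probabilities of the (grounded) simulator.
    R[s, a]     -- expected immediate reward.
    terminal    -- boolean mask of absorbing states (goal and cliff cells)."""
    n_states, n_actions, _ = P.shape
    terminal = np.zeros(n_states, dtype=bool) if terminal is None else terminal
    pi = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: sweep the Bellman equation for the current policy.
        V = np.zeros(n_states)
        for _ in range(max_sweeps):  # cap guards against non-terminating policies when gamma = 1
            V_new = R[np.arange(n_states), pi] + gamma * (P[np.arange(n_states), pi] @ V)
            V_new[terminal] = 0.0
            delta = np.max(np.abs(V_new - V))
            V = V_new
            if delta < tol:
                break
        # Policy improvement: act greedily with respect to the one-step lookahead values.
        Q = R + gamma * (P @ V)        # shape [n_states, n_actions]
        Q[terminal] = 0.0
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V
        pi = pi_new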

Fig. 10 The agent starts in the bottom left and must reach the goal in the bottom right. Stepping into the red region penalizes the agent and ends the episode. The purple path is the most direct, but the blue path is safer when the transitions are stochastic (Color figure online)

We make Cliff Walking a sim-to-sim transfer problem by treating a variant of the domain with deterministic transitions as the simulator and a variant of the domain with stochastic transitions as a surrogate for the real world. In the stochastic “real” environment, there is a small chance at every time step that the agent moves in a random direction instead of the direction it chose. As in Sect. 6, sim-to-sim experiments allow us to run more experiments than would be possible on a physical robot.
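
The two variants differ only in their transition models. A minimal sketch of that construction, assuming the deterministic grid dynamics are available through a hypothetical helper next_state(s, a), is shown below; the resulting P can be passed directly to the policy_iteration sketch above.

import numpy as np

def make_transitions(next_state, n_states, n_actions, slip_prob=0.0):
    """Build P[s, a, s']. With slip_prob = 0 this is the deterministic
    'simulator'; with slip_prob > 0 the agent moves in a uniformly random
    direction with that probability, giving the stochastic 'real' variant."""
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            # Intended move succeeds with probability 1 - slip_prob.
            P[s, a, next_state(s, a)] += 1.0 - slip_prob
            # Otherwise a uniformly random action is executed instead.
            for a_rand in range(n_actions):
                P[s, a, next_state(s, a_rand)] += slip_prob / n_actions
    return P

# Deterministic "sim":  P_sim  = make_transitions(next_state, n_states, n_actions, slip_prob=0.0)
# Stochastic "real":    P_real = make_transitions(next_state, n_states, n_actions, slip_prob=0.3)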

Figure 11 shows gat and sgat evaluated for different values of the environment noise parameter. Both the grounding steps and policy improvement steps are repeated until convergence for both algorithms. To evaluate each resulting policy, we estimate its expected return over 10,000 episodes in the “real” environment. At a noise value of zero, the “real” environment is completely deterministic; at a value of one, every transition is random. At these two endpoints there is therefore no difference between the expected returns achieved by the two algorithms.

Fig. 11 The y-axis is the average performance of a policy evaluated on the “real” domain. The x-axis is the chance at each time step for the transition to be random. sgat outperforms gat for every intermediate noise value. Error bars are not shown since the standard error is smaller than one pixel

For every intermediate value, sgat outperforms gat. The policy trained using gat is unaware of the stochastic transitions, so it always takes the shortest, most dangerous path. Meanwhile, the sgat agent learns as if it were training directly in the real environment, in the presence of stochasticity. Though Cliff Walking is a relatively simple domain, this experiment demonstrates the importance of modelling the stochasticity in \({\mathcal {M}}\).

8.2 MuJoCo domains

Having shown the efficacy of sgat in a tabular domain, we now evaluate its performance in continuous control domains that are closer to real world robotics settings. We perform experiments on the OpenAI Gym MuJoCo environments to compare the effectiveness of sgat and gat when there is added noise in the target domain. We consider the case with just added noise and the case with both noise and domain mismatch between the source and target environments. We call the former Sim-to-NoisySim and the latter Sim-to-NoisyReal. We use the InvertedPendulum and HalfCheetah domains to test sgat in environments with both low and high dimensional state and action spaces. For policy improvement, we use an implementation of Trust Region Policy Optimization (trpo) (Schulman et al., 2015a), from the stable-baselines repository (Hill et al., 2018) with the default hyperparameters for the respective domains.

For gat, we use a neural network function approximator with two fully connected hidden layers of 64 neurons to represent the forward and inverse models. For sgat, the forward model outputs the parameters of a Gaussian distribution from which we sample the predicted next state. In our implementation, the final dense layer outputs the mean, \(\mu\), and the log standard deviation, \(\log \sigma\), for each element of the state vector. We include all state variables as state variables of interest.

We also compare against the ane approach from Sect. 6. This baseline is useful in showing that sgat accomplishes more than simply adding noise to the actions from the policy. We note that the comparison is not perfectly fair in the sense that robustness approaches such as ane are sensitive to user-defined hyperparameters that anticipate the variation in the environment—in this case, the magnitude of the added noise. sgat automatically learns the right amount of stochasticity from real world data. In these experiments, we chose the ane hyperparameters (e.g., the noise magnitude) with a coarse grid search.

We simulate stochasticity in the target domains by adding Gaussian noise with different standard deviations to the actions input to the environment. We omit the results of the Sim-to-NoisySim experiments for InvertedPendulum because both algorithms performed well on that transfer task. Figure 12 shows the performance on the “real” environment of policies trained four ways—naively on the ungrounded simulator, with sgat, with gat, and with ane. In this Sim-to-NoisyReal experiment, sgat performs much better than gat as the stochasticity in the target domain increases. Figure 13 shows the same experiment on HalfCheetah, both with and without domain mismatch. Both environments have an action space of \([-1, 1]\).
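
Concretely, the noisy target environments can be built with a small Gym action wrapper like the sketch below; the class name and the example noise magnitude are illustrative assumptions. A wrapper of this same form, applied to the training simulator with a hand-picked noise magnitude, is essentially what the ane baseline described above does.

import gym
import numpy as np

class GaussianActionNoiseWrapper(gym.ActionWrapper):
    """Adds zero-mean Gaussian noise to every action before it reaches the
    underlying environment, then clips back to the action space bounds."""
    def __init__(self, env, sigma):
        super().__init__(env)
        self.sigma = sigma

    def action(self, action):
        noisy = action + np.random.normal(0.0, self.sigma, size=np.shape(action))
        return np.clip(noisy, self.action_space.low, self.action_space.high)

# e.g., an illustrative "NoisyReal" HalfCheetah target:
# noisy_env = GaussianActionNoiseWrapper(gym.make("HalfCheetah-v2"), sigma=0.5)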

Fig. 12 Sim-to-NoisyReal experiment on InvertedPendulum. The “real” pendulum is 10 times heavier than the sim pendulum and has added Gaussian noise of different values. Error bars show standard error over ten independent training runs. Algorithms with striped bars used no real world data during training. sgat performs comparatively better in noisier target environments

Fig. 13 Sim-to-NoisySim and Sim-to-NoisyReal experiments on HalfCheetah. In the NoisyReal environment, the “real” HalfCheetah’s mass is 43% greater than the sim HalfCheetah’s. Error bars show standard error over ten independent training runs. Algorithms with striped bars used no real world data during training. When the “real” environment is highly stochastic, sgat performs better than gat. Meanwhile, ane does poorly in less noisy scenarios

The red dashed lines show the performance of a policy trained directly on the “real” environment until convergence, approximately the best possible performance. The axes are scaled relative to this line. The error bars show the standard error across 10 trials with different initialization weights. As the stochasticity increases, sgat policies perform better than those learned using gat. Meanwhile, ane does well only for particular noise values, depending on its training hyperparameters.

8.3 Nao robot experiments

Until this point in our analysis of sgat, we have used a modified version of the simulator in place of the “real” world so as to isolate the effect of stochasticity (as opposed to domain mismatch). However, the true objective of this research is to enable transfer to real robots, which may exhibit very different noise profiles than the simulated environments. Thus, in this section, we validate sgat on a real humanoid robot learning to walk on uneven terrain.

As before, we use the nao robot. We compare gat and sgat by independently learning control policies with each algorithm to walk on uneven terrain, as shown in Fig. 14. To create an uneven surface, we placed foam packing material under the turf of a robot soccer field. On this uneven ground, the walking dynamics become more random, since the forces acting on the foot are slightly different every time the robot takes a step. We use the same initial policy as in Sect. 6.5. This initial, unoptimized policy achieves a speed of \(14.66 \pm 1.65\) cm/s on the uneven terrain. Aside from these details, the empirical set-up for this task is the same as in Sect. 6.5.

Fig. 14 Experiment setup showing the robot walking on the uneven ground. The nao begins walking 40 cm behind the center of the circle and walks 300 cm. The image shows a successful walk, learned using the proposed sgat algorithm, captured at 2 s intervals

On flat ground, both methods produced very similar policies, but on the uneven ground, the policy learned using sgat was more successful than the policy learned using gat. We evaluated the best policy found by each algorithm after each grounding step by generating 10 trajectories on the physical robot. The average speed of the robot on the uneven terrain is shown in Table 4. Qualitatively, the policy learned using sgat took shorter steps and stayed upright, maintaining its balance on the uneven terrain, whereas the policy produced using gat leaned forward and walked faster, but fell down more often. Both algorithms produced policies that improved walking speed across grounding steps. The gat policy after the second grounding step always fell over, whereas the sgat policy was more stable and finished the course 9 out of 10 times. Overall, this experiment demonstrates that sgat enables sim-to-real transfer when the real world is stochastic: although gat also improved the initial policy’s walking speed, the resulting policy was less stable because gat ignores stochasticity in the real world.

Table 4 Speed and stability of nao robot walking on uneven ground. The initial policy \(\theta _0\) walks at \(14.66 \pm 1.65\) cm/s and always falls down. Both sgat and gat find policies that are faster, but sgat policies are more stable than policies learned using gat

9 Discussion of limitations

In this section, we discuss limitations of the gat and sgat algorithms and of our empirical evaluation. gat requires that there exist an action that can be taken in simulation to cause the simulator to behave as the real world would. Formally, for a state s and action a, \(\exists \hat{a} \in \mathcal {A}\) such that \(f(s,a) = f_\mathtt {sim}(s,\hat{a})\). At a minimum, this condition should hold for the states and actions encountered during policy optimization. The requirement is also problematic for domains whose transition function P has high variance in the next state variables, where maximum likelihood prediction may be insufficient; however, sgat provides an alternative algorithm for such cases. Furthermore, both algorithms perform similarly in deterministic environments, suggesting that sgat should be the default option.

We evaluated gat on several robot reinforcement learning tasks in both simulation and the real world. In these experiments, we varied the task, the policy representation, and the simulator and target MDP (either the real world or another simulator). However, a large number of experimental knobs remain whose importance we have not yet studied, including the reward function definitions, the RL algorithm used, and how the state variables of interest were defined. Further studies of these settings would broaden the conclusions we can draw about the general applicability of the gat algorithm.

In this work, we have only considered deterministic simulators, but simulators may have stochastic transitions as well, especially if the simulator was designed to anticipate process noise. When using an action transformer grounding approach, however, stochastic simulators make the learning problem more difficult: because no single simulator action reliably reproduces a sampled target state, we can no longer simply sample from the distribution provided by the forward model and invert it. Instead, the inverse model must take in a distribution over states and output a distribution over actions.

10 Future work

This article introduced an algorithm, grounded action transformation (gat), that allows a reinforcement learning agent to learn with simulated data. In this section, we propose directions for future research on and application of our new algorithm.

10.1 Sim-to-real in non-robotics domains

We evaluated gat on a physical nao robot. gat is not specific to the nao and could be applied to other robotics tasks or even non-robotics tasks where a simulator is available a priori. The latter is of particular interest as the sim-to-real problem has been studied to a much lesser extent in non-robotics domains. gat is most applicable in tasks where the dynamics have a basis in physics and actions have a direct effect on some state variables. For example, if a robot increases the force with which it lifts its arm, then its arm will rise higher or faster. In such settings, it is reasonable to assume that an effective action grounding function can be learned. gat may be less applicable where the dynamics are derived from other factors, such as human behavior.

10.2 Identifying state variables of interest

In our empirical evaluation, we manually chose the state variables of interest and modified actions to make the transitions of these variables more realistic. For instance, in Sect. 6.5, we knew it was important to model the effect of actions on the joint positions of the physical robot. Thus, we set the joint positions of the physical robot as the variables of interest. Automatically identifying these variables is an interesting direction for future work.

The state variables of interest should be variables that affect the task reward, and the goal is to identify them from the data collected for grounding. It may be difficult to identify these variables by simply running an initial policy; more exploratory actions may need to be taken during data collection. Another consideration is that the set of target variables should be of minimal size while still being large enough for the simulator to be sufficiently grounded for learning to progress. Clearly, setting all state variables to be target variables accomplishes the latter, but the grounding problem may then become more difficult. Thus, methods for automatically identifying the variables should attempt to find the minimal set that still allows learning in the grounded simulation to transfer to the real world.

10.3 Grounded action transformation for deep reinforcement learning

Finally, our empirical evaluation considered relatively low dimensional policy representations: neural networks with a couple of hidden layers, linear functions, or existing parameterized controllers. Some of the most impressive recent RL success stories have been achieved with high dimensional neural network policies that take pixels as input. Applying gat and sgat to learn pixel-to-control policies is an interesting and challenging direction for future work. With more complex policy representations there is a greater chance that the RL algorithm will overfit to the simulator, making high fidelity grounding essential; more complex policy representations and deep RL algorithms are therefore an important test of gat’s ability to ground a simulator.

11 Conclusion

We have introduced an algorithm that allows a robot to learn a policy in a simulated environment such that the resulting policy transfers to the physical robot. This algorithm, the grounded action transformation (gat) algorithm, is a contribution towards allowing reinforcement learning agents to leverage simulated data to learn policies that are effective in the real world. We empirically evaluated gat on three robot learning tasks using the simulated or physical nao robot. In all cases, gat led to higher task performance compared to no grounding. We also compared gat to a simulator randomization baseline and found that using real world data to modify the simulation was more effective than simply adding noise to the robot’s actions during learning. We applied gat to optimizing the parameters of an existing walk controller and learned the fastest stable walk that we know of for the nao robot. Finally, we developed a generalization of gat, sgat, which improves upon gat when the real world is highly stochastic.