Abstract
Mastering robotic manipulation skills through reinforcement learning (RL) typically requires the design of shaped reward functions. Recent developments in this area have demonstrated that using sparse rewards, i.e. rewarding the agent only when the task has been successfully completed, can lead to better policies. However, state-action space exploration is more difficult in this case. Recent RL approaches to learning with sparse rewards have leveraged high-quality human demonstrations for the task, but these can be costly, time consuming or even impossible to obtain. In this paper, we propose a novel and effective approach that does not require human demonstrations. We observe that every robotic manipulation task could be seen as involving a locomotion task from the perspective of the object being manipulated, i.e. the object could learn how to reach a target state on its own. In order to exploit this idea, we introduce a framework whereby an object locomotion policy is initially obtained using a realistic physics simulator. This policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn the robot manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity, and can achieve higher success rates and faster learning rates compared to alternative algorithms. SLDRs are especially beneficial for tasks like multi-object stacking and non-rigid object manipulation.
1 Introduction
Reinforcement learning (RL) solves sequential decision-making problems by learning a policy that maximises expected rewards. Recently, with the aid of deep artificial neural networks as function approximators, RL-trained agents have been able to autonomously master a number of complex tasks, most notably playing video games (Mnih et al., 2015) and board games (Silver et al., 2016). Robot manipulation has been extensively studied in RL, but is particularly challenging to master because it often involves multiple stages (e.g. stacking multiple blocks), high-dimensional state spaces (e.g. dexterous hand manipulation; Zhu et al., 2018; Andrychowicz et al., 2018) and complex dynamics (e.g. manipulating non-rigid objects). Although promising performance has been reported on a wide range of tasks like grasping (Levine et al., 2016; Popov et al., 2017), stacking (Nair et al., 2018a) and dexterous hand manipulation (Zhu et al., 2018; Andrychowicz et al., 2018), the learning algorithms usually require carefully-designed reward signals to learn good policies. For example, Popov et al. (2017) propose a thoroughly weighted five-term reward formula for learning to stack Lego blocks, and Gu et al. (2017) use a three-term shaped reward to perform door-opening tasks with a robot arm. The requirement of hand-engineered, dense reward functions limits the applicability of RL in real-world robot manipulation to cases where task-specific knowledge can be captured.
The alternative to designing shaped rewards consists of learning with only sparse feedback signals, i.e. a non-zero reward indicating the completion of a task. Using sparse rewards is more desirable in practice as it generalises to many tasks without the need for hand-engineering (Silver et al., 2016; Andrychowicz et al., 2017; Burda et al., 2018a). On the other hand, learning with only sparse rewards is significantly more challenging, since it becomes harder to associate sequences of actions with non-zero rewards that are received only upon successful task completion. A number of approaches addressing this problem have been proposed lately (Andrychowicz et al., 2017; Riedmiller et al., 2018; Houthooft et al., 2016; Burda et al., 2018b; Pathak et al., 2017; Burda et al., 2018a; Savinov et al., 2018; Zhang et al., 2019); some of them report success in completing manipulation tasks like object pushing (Andrychowicz et al., 2017; Zhang et al., 2019), pick-and-place (Andrychowicz et al., 2017), stacking two blocks (Riedmiller et al., 2018; Zhang et al., 2019), and target finding in a scene (Pathak et al., 2017; Savinov et al., 2018). Nevertheless, for more complex tasks such as stacking multiple blocks and manipulating non-rigid objects, there is scope for further improvement.
A particularly promising approach to facilitate learning has been to leverage human expertise through a number of manually generated examples demonstrating the robot actions required to complete a given task. When these demonstrations are available, they can be used by an agent in various ways, e.g. by attempting to generate a policy that mimics them (Ross et al., 2011; Bojarski et al., 2016; Xu et al., 2017), pre-learning a policy from them for further RL (Silver et al., 2016; Hester et al., 2018), as a mechanism to guide exploration (Nair et al., 2018a), as data from which to infer a reward function (Ng & Russell, 2000; Abbeel & Ng, 2004; Finn et al., 2016; Ho & Ermon, 2016), and in combination with trajectories generated during RL (Vecerik et al., 2017; Sasaki et al., 2019; Reddy et al., 2019). Practically, however, human demonstrations are expensive to obtain, and their effectiveness ultimately depends on the competence of the demonstrators. Demonstrators with insufficient task-specific expertise could generate low-quality demonstrations resulting in sub-optimal policies. Although there is an existing body of work focusing on learning with imperfect demonstrations (Grollman & Billard, 2011; Zheng et al., 2014; Shiarlis et al., 2016; Gao et al., 2018; Brown et al., 2019; Choi et al., 2019), these methods usually assume either that qualitative evaluation metrics are available (Grollman & Billard, 2011; Shiarlis et al., 2016; Brown et al., 2019) or that a substantial volume of demonstrations can be collected (Zheng et al., 2014; Gao et al., 2018; Choi et al., 2019).
In this paper, we propose a novel approach that allows complex robot manipulation tasks to be learnt with only sparse rewards. In the tasks we consider, an object is manipulated by a robot so that, starting from a (random) initial position, it eventually reaches a goal position through a sequence of states in which its location and pose vary. For example, Fig. 1 (top row) represents a pick-and-place task in which the object is being picked up by the two-finger gripper and moved from its initial state to a predefined target location (red sphere). Our key observation is that every robot manipulation implies an underlying object locomotion task that can be explicitly modelled as an independent task for the object itself to learn. Figure 1 (middle row) illustrates this idea for the pick-and-place task: the object, on its own, must learn to navigate from any given initial position until it reaches its target position. More complex manipulation tasks involving non-rigid objects can also be thought of as inducing such object locomotion tasks; for instance, in Fig. 1 (bottom row), a 5-tuple non-rigid object moves itself to the given target location and pose (see Fig. 3 for a description of the non-rigid object).
Although in the real world it is impossible for objects to move on their own, learning such object locomotion policies can be achieved in a virtual environment through a realistic physics engine such as MuJoCo (Todorov et al., 2012), Gazebo (Koenig & Howard, 2004) or Pybullet (Coumans & Bai, 2017). In our experience, such policies are relatively straightforward to learn using only sparse rewards, since the objects usually operate in simple state/action spaces and/or have simple dynamics. Once a locomotion policy has been learnt, we utilise it to produce a form of auxiliary reward guiding the main manipulation policy. We name these auxiliary rewards “Simulated Locomotion Demonstration Rewards” (SLDRs). During the process of learning the robot manipulation policy, the proposed SLDRs encourage the robot to execute policies implying object trajectories similar to those obtained by the object locomotion policy. Although the SLDRs can only be learnt through a realistic simulator, this requirement does not restrict their applicability to real-world problems, and the resulting manipulation policies can still be transferred to physical systems. To the best of our knowledge, this is the first time that object-level policies are trained in a physics simulator to enable robot manipulation learning driven by only sparse rewards.
In our implementation, all the policies are learnt using deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), which has been chosen due to its widely reported effectiveness in continuous control; however, most RL algorithms compatible with continuous actions could have been used within the proposed SLD framework. Our experimental results involve 13 continuous control environments using the MuJoCo physics engine (Todorov et al., 2012) within the OpenAI Gym framework (Brockman et al., 2016). These environments cover a variety of robot manipulation tasks with increasing levels of complexity, e.g. pushing, sliding and pick-and-place tasks with a Fetch robotic arm, in-hand object manipulation with a Shadow’s Dexterous Hand, multi-object stacking, and non-rigid object manipulation. Overall, across all environments, we have found that our approach can achieve faster learning rates and higher success rates compared to baseline methods, especially in more challenging tasks such as stacking objects and manipulating non-rigid objects. Baselines are provided to represent existing approaches that use reward shaping, curiosity-based auxiliary rewarding and auxiliary goal generation techniques.
The remainder of the paper is organised as follows. In Sect. 2 we review the most related work, and in Sect. 3 we provide some introductory background material regarding the RL modelling framework and algorithms we use. In Sect. 4, we develop the proposed methodology. In Sect. 5 we describe all the environments used for our experiments, and the experimental results are reported in Sect. 6. Finally, we conclude with a discussion and suggestions for further extensions in Sect. 7.
2 Related work
2.1 Robotic manipulation
Robotics requires sequential decision making under uncertainty, and is therefore a common application domain for machine learning approaches, including RL (Kroemer et al., 2019). Recent advances in RL have focused on locomotion (Lillicrap et al., 2015; Schulman et al., 2015; Mnih et al., 2016) and manipulation tasks (Fu et al., 2016; Gu et al., 2017), which include grasping (Levine et al., 2016; Popov et al., 2017), stacking (Nair et al., 2018a) and dexterous hand manipulation (Zhu et al., 2018; Andrychowicz et al., 2018). These tasks are particularly challenging as they require continuous control over actions, and the expected behaviours are hard to formulate through rewards. Due to the sample inefficiency of RL, most state-of-the-art approaches rely on simulated environments such as MuJoCo (Todorov et al., 2012), as training on physical systems would be significantly slower and more costly. Predicting how objects behave under manipulation has also been well studied. For example, Kopicki et al. (2009, 2011) and Belter et al. (2014) propose approaches to predict the motion of rigid objects under pushing actions, with the aim of using these models to plan the robotic manipulation. Most recently, Li et al. (2018) have proposed to learn a particle-based dynamics model from data to handle complex interactions between rigid bodies, deformable objects and fluids. The focus of these studies has been to develop learnable simulators to replace traditional physics engines, whereas in this paper our aim is to learn object policies using the simulators. Although we employ a traditional physics engine in this paper, it could be replaced with a learnable simulator in future work.
2.2 Learning from demonstrations
A substantial body of work exists on how to leverage human demonstrations, when available, for reinforcement learning. Behaviour cloning (BC) methods approach sequential decision-making as a supervised learning problem (Pomerleau, 1988; Duan et al., 2017; Bojarski et al., 2016; Xu et al., 2017). Some BC methods include an expert demonstrator in the training loop to handle the mismatch between the demonstration data and the data encountered during training (Ross et al., 2011; Ratliff et al., 2007). Recent BC methods have also considered adversarial frameworks to improve the policy learning (Ho & Ermon, 2016; Wang et al., 2017). A different approach consists of inverse reinforcement learning, which seeks to infer a reward/cost function to guide the policy learning (Ng & Russell, 2000; Abbeel & Ng, 2004; Finn et al., 2016). Several methods have been developed to leverage demonstrations for robotic manipulation tasks with sparse rewards. For instance, Vecerik et al. (2017) and Nair et al. (2018a) jointly use demonstrations with trajectories collected during the RL process to guide the exploration, and Hester et al. (2018) use the demonstrations to pre-learn a policy, which is further fine-tuned in a subsequent RL stage. Obtaining the training data requires specialised data capture setups such as teleoperation interfaces. In general, obtaining good quality demonstrations is an expensive process in terms of both human effort and equipment requirements. In contrast, the proposed method generates object-level demonstrations autonomously, and could potentially be used jointly with human-generated demonstrations when these are available.
2.3 Goal conditioned policies and auxiliary goal generation
Goal-conditioned policies (Schaul et al., 2015) that can generalise over multiple goals have been shown to be promising for robotic problems. For manipulation tasks with sparse rewards, several approaches have recently been proposed to automatically generate auxiliary goals. For instance, Sukhbaatar et al. (2018) used a self-play approach on reversible or resettable environments, Florensa et al. (2018) employed adversarial training for robotic locomotion tasks, Nair et al. (2018b) proposed variational autoencoders for visual robotics tasks, and Andrychowicz et al. (2017) introduced Hindsight Experience Replay (HER), which randomly draws synthetic goals from previously encountered states. HER in particular has proved highly effective, although the automatic goal generation can still be problematic on complex tasks involving multiple stages, e.g. stacking multiple objects, when used without demonstrations (Nair et al., 2018b). Some attempts have been made to form an explicit curriculum for such complex tasks; e.g. Riedmiller et al. (2018) manually define several semantically grounded sub-tasks, each having its own individual reward. Methods such as this require significant human effort and hence cannot be readily applied across different tasks. The proposed method in this paper uses goal-conditioned policies and adopts HER for auxiliary goal generation due to its effectiveness in robotic manipulation; however, it can be integrated with other goal generation techniques in the literature.
2.4 Auxiliary rewards in RL
Lately, increasing efforts have been made to design general auxiliary reward functions aimed at facilitating learning in environments with only sparse rewards. Many of these strategies involve a notion of curiosity (Schmidhuber, 1991), which encourages agents to visit novel states that have not been seen in previous experience; for instance, Pathak et al. (2017) formulate the auxiliary reward using the error in predicting the RL agent’s actions by an inverse dynamics model, Houthooft et al. (2016) encourage the agent to visit the states that result in the largest information gain in system dynamics, Burda et al. (2018a) construct the auxiliary reward based on the error in predicting the output of a fixed, randomly initialised neural network, and Savinov et al. (2018) introduce the notion of state reachability. Despite the benefits introduced by these approaches, visiting unseen states may be less beneficial in robot manipulation tasks, as exploring complex state spaces to find rewards is rather impractical (Andrychowicz et al., 2017). The proposed approach, on the other hand, produces auxiliary rewards based on the underlying object locomotion; as such, it motivates the robot to mimic the optimal object locomotion rather than curiously exploring the continuous state space.
3 Background
3.1 Multi-goal RL for robotic manipulation
We are concerned with solving a manipulation task: an object is presented to the robot, and has to be manipulated so as to reach a target position. In the tasks we consider, the target goal is specified by the object location and orientation, and the robot is rewarded only when it reaches its goal. We model the robot’s sequential decision process as a Markov Decision Process (MDP) defined by a tuple \(M=\langle {\mathcal {S}}, {\mathcal {G}}, {\mathcal {A}}, {\mathcal {T}}, {\mathcal {R}}, \gamma \rangle\), where \({\mathcal {S}}\) is the set of states, \({\mathcal {G}}\) is the set of goals, \({\mathcal {A}}\) is the set of actions, \({\mathcal {T}}\) is the state transition function, \({\mathcal {R}}\) is the reward function and \(\gamma \in \left[ 0, 1\right)\) is the discount factor. At the beginning of an episode, the environment samples a goal \(g \in {\mathcal {G}}\). The state of the environment at time t is denoted by \(s_t \in {\mathcal {S}}\) and includes both robot-related and object-related features. In a real system, these features are typically continuous variables obtained through the robot’s sensors. The position of the object \(o_t\) is one of the object-related features included in \(s_t\) and can be obtained through a known mapping, i.e. \(o_t=m_{{\mathcal {S}}\rightarrow {\mathcal {O}}}(s_t)\). A robot’s action is controlled by a deterministic policy, i.e. \(a_t=\mu _\theta (s_t,g): {\mathcal {S}}\times {\mathcal {G}}\rightarrow {\mathcal {A}}\), parameterised by \(\theta\). The environment moves to its next state through its state transition function, i.e. \(s_{t+1}={\mathcal {T}}(s_t,a_t): {\mathcal {S}}\times {\mathcal {A}}\rightarrow {\mathcal {S}}\), and provides an immediate and sparse reward \(r_t\), defined as

\(r_t = {\mathcal {R}}(o_{t+1}, g) = \begin{cases} 0 & \text {if } \Vert o_{t+1} - g\Vert \le \epsilon \\ -1 & \text {otherwise,} \end{cases}\)
where \(\epsilon\) is a predefined threshold. Following its policy, the robot interacts with the environment until the episode terminates after T steps. The interaction between the robot and the environment generates a trajectory, \(\tau =(g,s_1,a_1,r_1,\ldots ,s_T,a_T,r_T,s_{T+1})\). The ultimate learning objective is to find the optimal policy that maximises the expected sum of the discounted rewards over the time horizon T, i.e.

\(\max _\theta \; {\mathbb {E}}\big [\textstyle \sum _{t=1}^{T} \gamma ^{t-1} r_t\big ],\)
where \(\gamma\) is the discount factor.
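For illustration, the sparse reward above can be computed directly from the achieved object position and the goal. A minimal sketch in Python follows (the function name is ours, and the 5 cm default threshold is only illustrative; \(\epsilon\) is task-specific):

```python
import numpy as np

def sparse_reward(object_pos, goal, eps=0.05):
    """Sparse goal-conditioned reward: 0 if the achieved object position
    o_{t+1} lies within eps of the goal g, and -1 otherwise."""
    return 0.0 if np.linalg.norm(object_pos - goal) <= eps else -1.0
```

Note that the reward depends only on the object-related part of the state, which is what later allows the same reward function to be reused for the object locomotion task.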
3.2 Deep deterministic policy gradient algorithm
Policy gradient (PG) algorithms update the policy parameters \(\theta\) in the direction of \(\nabla _\theta J(\mu _\theta )\) to maximise the expected return \(J(\mu _\theta ) = \mathbb {E}_{\tau \sim {\mathbb {P}}(\tau \mid \mu _\theta )}[R(\tau )]\). Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) integrates nonlinear function approximators such as neural networks with Deterministic Policy Gradient (DPG) (Silver et al., 2014), which uses deterministic policy functions. DDPG maintains a policy (actor) network \(\mu _\theta (s_t,g)\) and an action-value (critic) network \(Q^\mu (s_t,a_t,g)\).
The actor \(\mu _\theta (s_t,g)\) deterministically maps states to actions. The critic \(Q^\mu (s_t,a_t,g)\) estimates the expected return when starting from \(s_t\), taking \(a_t\), and then following \(\mu _\theta\) in the future states until the termination of the episode, i.e. \(Q^\mu (s_t,a_t,g) = {\mathbb {E}}\big [\sum _{i=t}^T{\gamma ^{i-t}r_i} \,\big \vert \, s_t,a_t,g,\mu _\theta \big ]\). When interacting with the environment, DDPG ensures exploration by adding noise to the deterministic policy output, i.e. \(a_t = \mu _\theta (s_t,g) + {\mathcal {N}}\). Transitions experienced during these interactions, i.e. \(\langle g,s_t,a_t,r_t,s_{t+1}\rangle\), are stored in a replay buffer \({\mathcal {D}}\). The actor and critic networks are updated using transitions sampled from \({\mathcal {D}}\). The critic parameters are learnt by minimising the following loss to satisfy the Bellman equation, similarly to Q-learning (Watkins & Dayan, 1992):

\(L = {\mathbb {E}}_{\langle g,s_t,a_t,r_t,s_{t+1}\rangle \sim {\mathcal {D}}}\big [\big (Q^\mu (s_t,a_t,g) - y\big )^2\big ]\)
where \(y = r_t + \gamma Q^{\mu }(s_{t+1},\mu (s_{t+1}),g)\). The actor parameters \(\theta\) are updated using the following policy gradient:

\(\nabla _\theta J(\mu _\theta ) = {\mathbb {E}}_{g,s_t \sim {\mathcal {D}}}\big [\nabla _a Q^\mu (s_t,a,g)\big \vert _{a=\mu _\theta (s_t,g)}\, \nabla _\theta \mu _\theta (s_t,g)\big ]\)
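To make the two update rules concrete, the following sketch performs one DDPG step using tiny linear stand-ins for the actor and critic networks; the dimensions and learning rate are illustrative, and the goal g is assumed to be concatenated into the state for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
s_dim, a_dim, gamma, lr = 4, 2, 0.98, 1e-3

# Tiny linear stand-ins for the actor and critic networks (illustrative only).
W_actor = rng.normal(scale=0.1, size=(a_dim, s_dim))     # mu(s) = W s
w_critic = rng.normal(scale=0.1, size=(s_dim + a_dim,))  # Q(s, a) = w . [s; a]

def actor(s):
    return W_actor @ s

def critic(s, a):
    return w_critic @ np.concatenate([s, a])

def ddpg_step(s, a, r, s_next):
    """One DDPG update from a sampled transition (s, a, r, s_next)."""
    global W_actor, w_critic
    # Critic: regress Q(s, a) toward the TD target y = r + gamma * Q(s', mu(s')).
    y = r + gamma * critic(s_next, actor(s_next))
    td_err = critic(s, a) - y
    w_critic = w_critic - lr * td_err * np.concatenate([s, a])
    # Actor: ascend grad_a Q(s, a)|_{a = mu(s)} chained through grad_theta mu(s).
    dq_da = w_critic[s_dim:]            # for a linear critic, dQ/da is constant
    W_actor = W_actor + lr * np.outer(dq_da, s)
    return td_err
```

A full implementation would also use target networks and mini-batches sampled from the replay buffer, which are omitted here.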
We adopt DDPG as the main training algorithm; however, the proposed idea can also be used with other off-policy approaches that work with continuous action domains.
3.3 Hindsight experience replay
Hindsight Experience Replay (HER) (Andrychowicz et al., 2017) has been introduced to learn policies from sparse rewards, especially for robot manipulation tasks. The idea is to view the states achieved in an episode as pseudo goals (i.e. achieved goals) to facilitate learning even when the desired goal has not been achieved during the episode. Suppose we are given an observed trajectory, \(\tau =(g,s_1,a_1,r_1,\ldots ,s_T,a_T,r_T,s_{T+1})\). Since \(o_t\) can be obtained from \(s_t\) using a fixed and known mapping, the path that was followed by the object during the trajectory, i.e. \(o_1,\ldots ,o_{T+1}\), can be easily extracted. HER samples a new goal from this path, i.e. \(\tilde{g} \sim \{o_1,\ldots ,o_T\}\), and the rewards are recomputed with respect to \(\tilde{g}\), i.e. \(\tilde{r}_t = {\mathcal {R}}(o_{t+1}, \tilde{g})\). Using these rewards and \(\tilde{g}\), a new trajectory is created implicitly, i.e. \(\tilde{\tau }=(\tilde{g},s_1,a_1,\tilde{r}_1,\ldots ,s_T,a_T,\tilde{r}_T,s_{T+1})\). These HER trajectories \(\tilde{\tau }\) are used to train the policy parameters together with the original trajectories.
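The relabelling step can be sketched as follows. This is a simplified version that samples a single hindsight goal per episode; `m_state_to_obj` stands for the known mapping \(m_{{\mathcal {S}}\rightarrow {\mathcal {O}}}\), and the episode is assumed to be a list of transitions:

```python
import random

def her_relabel(episode, reward_fn, m_state_to_obj):
    """Relabel one episode with a hindsight goal drawn from the achieved
    object path. episode: list of (s, a, s_next) tuples. Returns a list of
    (g_new, s, a, r_new, s_next) tuples with recomputed sparse rewards."""
    path = [m_state_to_obj(s_next) for (_, _, s_next) in episode]
    g_new = random.choice(path)          # hindsight goal from achieved states
    relabelled = []
    for (s, a, s_next) in episode:
        r_new = reward_fn(m_state_to_obj(s_next), g_new)
        relabelled.append((g_new, s, a, r_new, s_next))
    return relabelled
```

Because the hindsight goal was actually achieved, at least one relabelled transition receives a non-negative reward, which is what makes learning possible even when the original goal was never reached.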
4 Methodology
Given a manipulation task, we initially introduce a corresponding auxiliary locomotion task for the object being manipulated, i.e. the object is assumed to be the decision-making agent. This auxiliary problem is usually significantly easier to learn than the original task. After learning the object locomotion policy, we use it as a reward-generating mechanism for the robot when learning the original manipulation task. In this section, we explain the steps involved in the proposed procedure, i.e. (a) how the object locomotion policies are learned, (b) how the proposed reward function is defined, and (c) how these auxiliary rewards are leveraged for robotic manipulation.
4.1 Object locomotion policies
The object involved in the manipulation task is initially modelled as an agent with independent decision-making abilities, and its decision process is modelled by a separate MDP defined by a tuple \(L=\langle {\mathcal {Z}}, {\mathcal {G}}, {\mathcal {U}}, {\mathcal {Y}}, {\mathcal {R}}, \gamma \rangle\). Here, \({\mathcal {Z}}\) is the set of states, \({\mathcal {G}}\) is the set of goals, \({\mathcal {U}}\) is the set of actions, \({\mathcal {Y}}\) is the state transition function, \({\mathcal {R}}\) is the reward function and \(\gamma \in \left[ 0, 1\right)\) is the discount factor. The same goal space \({\mathcal {G}}\) is used as in M, and \(z_t \in {\mathcal {Z}}\) is a reduced version of \(s_t\) that only involves object-related features, including the position of the object, i.e. \(o_t \subset z_t\). The object’s action space explicitly controls the pose of the object, and these actions are selected by a deterministic policy, i.e. \(u_t=\nu _\theta (z_t,g): {\mathcal {Z}}\times {\mathcal {G}}\rightarrow {\mathcal {U}}\). The state transition is defined on a different space, i.e. \({\mathcal {Y}}: {\mathcal {Z}}\times {\mathcal {U}}\rightarrow {\mathcal {Z}}\); however, the same sparse reward function is used as before. Figure 2a illustrates the training procedure used in this context, based on DDPG with HER. The optimal object policy \(\nu _\theta\) maximises the expected return \(J\left( \nu _\theta \right) = \mathbb {E}_{g,z_t \sim {\mathcal {D}}_L}[\sum _{i=1}^T{\gamma ^{i-1}r_i}]\), where \({\mathcal {D}}_L\) denotes the replay buffer containing the trajectories, indicated by \(\eta\), obtained by \(\nu _\theta\) throughout training.
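The reduction from \(s_t\) to \(z_t\) amounts to projecting the full state onto its object-related features. A minimal sketch (the feature names are hypothetical; the real layout is environment-specific):

```python
def reduce_state(s):
    """Project the full manipulation state s_t onto the object-only
    locomotion state z_t by keeping object-related features. The key
    names below are illustrative assumptions, not the actual layout."""
    keys = ("obj_pos", "obj_rot", "obj_vel", "obj_ang_vel")
    return {k: s[k] for k in keys if k in s}
```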
4.2 Robotic manipulation with simulated locomotion demonstration rewards (SLDR)
On the original manipulation task M, the robot receives the current environmental state and the desired goal, and then decides how to act according to its policy \(\mu _\theta\). Whenever the object is moved from one position to another, the observed object locomotion is a consequence of the robot’s actions. More concretely, the observed object action on M (hereafter denoted by \(w_t\)) is a function of the robot policy \(\mu _\theta\). The relation between \(w_t\) and \(\mu _\theta\) depends on the environmental dynamics, whose closed-form model is unknown. We use \(f:{\mathcal {A}}\rightarrow {\mathcal {U}}\) to denote this unknown relation, i.e. \(w_t=f(\mu _\theta (s_t,g))\).
The key steps of the proposed approach are as follows: having initially learnt an object locomotion policy on L, we first use it to query the optimal object action for the current state and goal, i.e. \(u_t=\nu _\theta (z_t,g)\). Then, we update \(\mu _\theta\) in order to make \(w_t\) closer to \(u_t\). This learning objective can be written as follows:

\(\min _\theta \; {\mathbb {E}}\big [\big \Vert f\big (\mu _\theta (s_t,g)\big ) - \nu _\theta (z_t,g)\big \Vert ^2\big ]\)
Given that the environment dynamics are unknown, we replace f in Eq. (5) with a parameterised model to approximate \(w_t\). Estimating \(w_t\) from robot actions is not straightforward, as it requires keeping track of all previous actions, i.e. \(a_{1:t}\), and the initial state. Instead, we propose to estimate \(w_t\) by evaluating the transition from the current state to the next. Specifically, we substitute f with a parameterised inverse dynamics model, i.e. \({\mathcal {I}}_\phi :{\mathcal {Z}}\times {\mathcal {Z}}\rightarrow {\mathcal {U}}\), which we train to estimate the output of \(\nu _\theta (z_t,g)\) from \(z_t\) and \(z_{t+1}\), i.e. \(\nu _\theta (z_t,g) \approx {\mathcal {I}}_\phi (z_t, z_{t+1})\). We learn the parameters of \({\mathcal {I}}_\phi\) on the object locomotion task L (see Sect. 4.3 and Algorithm 1 for training details), and then employ the trained model on the manipulation task M to approximate \(w_t\). Substituting \({\mathcal {I}}_\phi\) into Eq. (5) leads to the following optimisation problem:

\(\min _\theta \; {\mathbb {E}}\big [\big \Vert {\mathcal {I}}_\phi (z_t, z_{t+1}) - \nu _\theta (z_t,g)\big \Vert ^2\big ]\)
On M, \(z_{t+1}\) is a function of \({\mathcal {T}}(s_t, \mu _\theta (s_t,g))\). In our setting, the closed-form expression of the state transition function \({\mathcal {T}}\) is unknown; instead, \({\mathcal {T}}\) can only be sampled. Also, pursuing a model-free approach, we do not aim to learn a model for \({\mathcal {T}}\). Therefore, minimising Eq. (6) through gradient-based methods is not an option in our setting, as this would require differentiation through \({\mathcal {T}}\). Instead, we propose to formalise this objective as a reward to be maximised through a standard model-free RL approach. The first obvious candidate for this reward notion can be written as follows:

\(q_t = -\big \Vert {\mathcal {I}}_\phi (z_t, z_{t+1}) - \nu _\theta (z_t,g)\big \Vert ^2\)
Practically, however, the above reward is sensitive to the scales of \({\mathcal {I}}_\phi (z_t, z_{t+1})\) and \(\nu _\theta (z_t,g)\), and therefore it may require an additional normalisation term. Even with a normalisation term, the scale of the rewards would shift throughout the training depending on the exploration and the sampling. In order to deal with this issue, we propose another reward notion adopting \(Q^\nu\), i.e. the action-value function that had been learnt for \(\nu _\theta\) on the object locomotion task L (see Sect. 4.3 and Algorithm 1 for training details). The proposed reward is written as follows:

\(q_t^{\text {SLDR}} = Q^\nu \big (z_t, {\mathcal {I}}_\phi (z_t,z_{t+1}), g\big ) - Q^\nu \big (z_t, \nu _\theta (z_t,g), g\big )\)
We refer to Eq. (8) as the Simulated Locomotion Demonstration Reward (SLDR). Rather than comparing \(w_t\) and \(u_t\) directly with each other as in Eq. (7), the SLDR compares their action-values using \(Q^\nu\). Being learnt on L using sparse rewards, \(Q^\nu\) is well-bounded (Sutton & Barto, 2018), and \(q_t^{\text {SLDR}}\) computed using \(Q^\nu\) does not require a normalisation term.
Note that, by definition, \(Q^\nu (z_t,u,g)\) gives the expected return for any object locomotion action \(u \in {\mathcal {U}}\) taken at the current state \(z_t\), assuming \(\nu _\theta\) is followed thereafter. Since \(\nu _\theta\) has been learnt through standard RL to maximise the sparse rewards, it is the optimal object locomotion policy, and therefore \(Q^\nu (z_t, \nu _\theta (z_t,g), g)\) gives the maximum expected return. Accordingly, \(q_t^{\text {SLDR}}\) can be viewed as the advantage of \(w_t\) with respect to \(\nu _\theta (z_t,g)\) in terms of action-values, and is expected to be non-positive. Maximising this term encourages the robot to induce object actions similar to the optimal ones prescribed by \(\nu _\theta\).
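Putting Eq. (8) into code, the per-step SLDR computation might look as follows. This is a sketch under the assumption that the locomotion critic, the inverse dynamics model and the locomotion policy have all been trained on L beforehand and are passed in as callables:

```python
def sldr_reward(q_nu, inv_dyn, nu, z_t, z_next, g):
    """Simulated Locomotion Demonstration Reward: the advantage, under the
    locomotion critic Q^nu, of the object action the robot actually induced
    (recovered by the inverse dynamics model) over the optimal locomotion
    action nu(z_t, g). Non-positive when nu is the optimal policy."""
    w_t = inv_dyn(z_t, z_next)   # object action implied by the observed transition
    u_t = nu(z_t, g)             # optimal object action from the locomotion policy
    return q_nu(z_t, w_t, g) - q_nu(z_t, u_t, g)
```

Because both terms are action-values of the same critic at the same state, the reward is automatically on the critic's (bounded) scale, which is the point made in the text above.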
4.3 Learning algorithms
In this subsection, we detail the learning algorithms for the object locomotion and the robotic manipulation policies. Figure 2 shows the block diagrams of the learning procedures.
4.3.1 Object locomotion policy
We learn the object locomotion policy using only the environmental sparse rewards, as described in Algorithm 1. We adopt DDPG (Sect. 3.2) as the training framework, together with HER (Sect. 3.3) to generate auxiliary transition samples and deal with the exploration difficulty caused by the sparse rewards. \(Q^\nu\) is updated to minimise Eq. (3), and \(\nu _\theta\) is optimised using the gradient in Eq. (4). Concurrently, we learn \({\mathcal {I}}_\phi\) using the trajectories generated during the policy learning process, by minimising the following objective function:

\({\mathcal {L}}(\phi ) = {\mathbb {E}}_{(z_t,u_t,z_{t+1}) \sim {\mathcal {D}}_L}\big [\big \Vert {\mathcal {I}}_\phi (z_t,z_{t+1}) - u_t\big \Vert ^2\big ]\)
where \({\mathcal {D}}_L\) is an experience replay buffer.
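A minimal sketch of this regression, with a linear model standing in for the inverse dynamics network (the dimensions and learning rate are illustrative assumptions):

```python
import numpy as np

z_dim, u_dim, lr = 3, 3, 0.05

# Linear inverse-dynamics stand-in: I_phi(z_t, z_{t+1}) = Phi @ [z_t; z_{t+1}].
Phi = np.zeros((u_dim, 2 * z_dim))

def inv_dyn(z, z_next):
    return Phi @ np.concatenate([z, z_next])

def train_inv_dyn(batch):
    """One SGD pass minimising E ||I_phi(z_t, z_{t+1}) - u_t||^2 over
    locomotion transitions (z_t, u_t, z_{t+1}) drawn from D_L."""
    global Phi
    for z, u, z_next in batch:
        x = np.concatenate([z, z_next])
        err = Phi @ x - u            # prediction error for this transition
        Phi -= lr * np.outer(err, x) # gradient step on the squared error
```

In the actual framework the transitions come from the locomotion replay buffer \({\mathcal {D}}_L\), so \({\mathcal {I}}_\phi\) is trained on exactly the state distribution that \(\nu _\theta\) induces.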
4.3.2 Robotic manipulation policy
Similarly, we learn the robotic manipulation policy by adopting DDPG with HER, as described in Algorithm 2. Using the optimisation objective given in Eq. (3), we learn two action-value functions: \(Q^\mu _r\) for the environmental sparse rewards \(r_t\), and \(Q^\mu _q\) for the proposed SLDR \(q_t^{\text {SLDR}}\). Accordingly, \(\mu _\theta\) is updated following the gradient below, which uses both action-value functions:

\(\nabla _\theta J(\mu _\theta ) = {\mathbb {E}}_{g,s_t \sim {\mathcal {D}}_M}\big [\nabla _a \big (Q^\mu _r(s_t,a,g) + Q^\mu _q(s_t,a,g)\big )\big \vert _{a=\mu _\theta (s_t,g)}\, \nabla _\theta \mu _\theta (s_t,g)\big ]\)
where \({\mathcal {D}}_M\) is an experience replay buffer. Some tasks may include \(N>1\) objects, e.g. stacking. The proposed method is able to handle these tasks by using an individual SLDR for each object and learning an individual \(Q^\mu _{q_i}\) for each one of them. Then, the gradient required to update \(\mu _\theta\) is:

\(\nabla _\theta J(\mu _\theta ) = {\mathbb {E}}_{g,s_t \sim {\mathcal {D}}_M}\big [\nabla _a \big (Q^\mu _r(s_t,a,g) + \textstyle \sum _{i=1}^{N} Q^\mu _{q_i}(s_t,a,g)\big )\big \vert _{a=\mu _\theta (s_t,g)}\, \nabla _\theta \mu _\theta (s_t,g)\big ]\)
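Combining critics only changes the action-gradient term of the DDPG actor update: the per-critic gradients with respect to the action are summed before chaining through the actor Jacobian. A sketch (the shapes are illustrative; `dmu_dtheta` stands for the Jacobian of the actor output with respect to its parameters, with one row per action dimension):

```python
import numpy as np

def combined_actor_grad(dq_r_da, dq_sldr_das, dmu_dtheta):
    """Actor gradient combining the sparse-reward critic with one SLDR
    critic per object: sum the action-gradients, then apply the chain
    rule through the actor Jacobian d mu / d theta."""
    dq_da = dq_r_da + sum(dq_sldr_das)   # grad_a of the summed critics
    return dmu_dtheta.T @ dq_da          # shape: (n_params,)
```

In practice each critic is a separate network trained with its own reward stream, so the summation happens only at actor-update time and the critics never interfere with one another's targets.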
5 Environments
We have evaluated the SLD method on 13 simulated MuJoCo (Todorov et al., 2012) environments using two different robot configurations: a 7-DoF Fetch robotic arm with a two-finger parallel gripper and a 24-DoF Shadow’s Dexterous Hand. The tasks chosen for evaluation include single rigid object manipulation, multiple rigid object stacking and non-rigid object manipulation. Overall, we have used 9 MuJoCo environments (3 with the Fetch robot arm and 6 with the Shadow’s hand) for single rigid object tasks. Furthermore, we have included additional environments for multiple object stacking and non-rigid object manipulation using the Fetch robot arm. In all environments the rewards are sparse.
5.1 Fetch arm single object environments
These are the same Push, Slide and PickAndPlace tasks introduced in Plappert et al. (2018). In each episode, a desired 3D position (i.e. the target) of the object is randomly generated. The reward is zero if the object is within 5 cm of the target, otherwise \(-1\). The robot actions are 4-dimensional: 3D for the desired arm movement in Cartesian coordinates and 1D to control the opening of the gripper. In pushing and sliding, the gripper is locked to prevent grasping. The observations include the positions and linear velocities of the robot arm and the gripper; the object’s position, rotation and angular velocity; the object’s relative position and linear velocity with respect to the gripper; and the target coordinates. An episode terminates after 50 timesteps.
5.2 Shadow’s hand single object environments
These include the tasks first introduced in Plappert et al. (2018), i.e. Egg, Block and Pen manipulation. In these tasks, the object (a block, an egg-shaped object, or a pen) is placed on the palm of the robot hand; the hand is required to manipulate the object to reach a target pose. The target pose is 7-D, describing the 3D position together with the 4D quaternion orientation, and is randomly generated in each episode. The reward is 0 if the object is within some task-specific range of the target, and \(-1\) otherwise. As in Plappert et al. (2018), each task has two variants: Full and Rotate. In the Full variant, the object's whole 7-D pose is required to meet the given target pose. In the Rotate variant, the 3D object position is ignored and only the 4D object rotation is expected to satisfy the desired target. Robot actions are 20-dimensional, controlling the absolute positions of all non-coupled joints of the hand. The observations include the positions and velocities of all 24 joints of the robot hand, the object's position and rotation, the object's linear and angular velocities, and the target pose. An episode terminates after 100 timesteps.
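For the Rotate variants, success depends on the angular distance between the object's quaternion orientation and the target orientation. A standard way to compute this distance is sketched below; the quaternion convention (w, x, y, z), the helper names and the 0.1 rad threshold are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def rotation_angle(q1, q2):
    """Smallest rotation angle (radians) between two unit quaternions,
    using theta = 2 * arccos(|<q1, q2>|)."""
    q1 = np.asarray(q1, dtype=float) / np.linalg.norm(q1)
    q2 = np.asarray(q2, dtype=float) / np.linalg.norm(q2)
    dot = abs(float(np.dot(q1, q2)))
    return 2.0 * np.arccos(np.clip(dot, -1.0, 1.0))

def rotate_success(q_obj, q_target, angle_threshold=0.1):
    """Task-specific success check for a Rotate variant (threshold assumed)."""
    return rotation_angle(q_obj, q_target) < angle_threshold
```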
5.3 Fetch arm multiple object stacking environments
The stacking task is built upon the PickAndPlace task. We consider 2- and 3-object stacking tasks. For the N-object stacking task, the target has 3N dimensions describing the desired 3D positions of all N objects. Following Nair et al. (2018a), we start these tasks with the first object placed at its desired target. The robot needs to perform \(N-1\) pick-and-place actions without displacing the first object. The reward is zero if all objects are within 5 cm of their designated targets, and \(-1\) otherwise. The robot actions and observations are similar to those in the PickAndPlace task. The episode length is 50 timesteps for 2-object stacking and 100 for 3-object stacking.
5.4 Fetch arm non-rigid object environments
We build non-rigid object manipulation tasks based on the PickAndPlace task. Instead of using the original rigid block, we have created a non-rigid object by hinging several blocks side by side along their edges, as shown in Fig. 3. A hinge joint is placed between two neighbouring blocks, allowing one rotational degree of freedom (DoF) along their coincident edges up to \(180^\circ\). We introduce two variants: 3-tuple and 5-tuple. For the N-tuple task, N cubical blocks are connected with \(N-1\) hinge joints, creating \(N-1\) internal DoF. The target pose has 3N dimensions describing the desired 3D positions of all N blocks, and is selected uniformly in each episode from a set of predefined target poses (see Fig. 3). The robot is required to manipulate the object to match the target pose. The reward is zero when all N blocks are within 2 cm of their corresponding targets, and \(-1\) otherwise. Robot actions and observations are similar to those in the PickAndPlace task, except that the observations include the position, rotation, angular velocity, and relative position and linear velocity with respect to the gripper for each block. The episode length is 50 timesteps for both the 3-tuple and 5-tuple variants.
5.5 Object locomotion environments
For each robotic manipulation task described above, we use an object locomotion task where we first learn \(\nu _\theta\), \(Q^\nu\) and \({\mathcal {I}}_\phi\). Here, we detail the observation and action space differences between object locomotion and robotic manipulation tasks.
For any task, the object's observation is a subset of the robot's observation, i.e. \(z_t \subset s_t\): it only includes object-related features and excludes those related to the robot. More concretely, for the environments with the Fetch arm, the object's observations include the object's position, rotation and angular velocity, the object's relative position and linear velocity with respect to the target, and the target location. For the environments with the Shadow's hand, the object observations include the object's position and rotation, the object's linear and angular velocities, and the target pose. We define the object action as the desired relative change in the 7-D object pose (3D position and 4D quaternion orientation) between two consecutive timesteps, which leads to a 7-D action space. Specifically for non-rigid objects, we define the object action as the desired relative change in the poses of the two end blocks, which leads to a 14-D action space. The rewards are the same as those in each robot manipulation task.
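The relative 7-D pose change between two consecutive timesteps can be computed as a position delta plus the quaternion taking the orientation at \(t\) to the orientation at \(t+1\). The sketch below is illustrative: the (w, x, y, z) quaternion convention and the helper names are our assumptions, not the paper's implementation.

```python
import numpy as np

def quat_conj(q):
    """Conjugate of a quaternion in (w, x, y, z) convention."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_mul(a, b):
    """Hamilton product a * b, (w, x, y, z) convention."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw])

def relative_pose_action(pose_t, pose_t1):
    """7-D object action: the position delta plus the relative rotation
    between the poses at two consecutive timesteps (3D position followed
    by a 4D quaternion in each pose vector)."""
    dp = pose_t1[:3] - pose_t[:3]
    dq = quat_mul(pose_t1[3:], quat_conj(pose_t[3:]))
    return np.concatenate([dp, dq])
```

For the non-rigid objects, applying this to the poses of the two end blocks and concatenating the results would yield the 14-D action described above.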
It is worth noting that, in the Full variants of the Shadow's hand environments, we treat object translation and rotation as two individual locomotion tasks, and we learn separate locomotion policies and Q-functions for each. We find that this strategy encourages the manipulation policy to perform translation and rotation simultaneously. Although object translation and rotation could be executed within a single task, we have empirically found that the resulting manipulation policies tend to prioritise one behaviour over the other (e.g. they tend to rotate the object first, then translate it) and generally achieve lower performance.
6 Experiments
6.1 Implementation and training process
Three-layer neural networks with ReLU activations were used to approximate all policies, action-value functions and inverse dynamics models. The Adam optimiser (Kingma and Ba, 2014) was employed to train all the neural networks. During the training of the locomotion policies, the robot was treated as a non-learning component in the scene, and its actions were not restricted to prevent any potential collision with the objects. Different choices are possible for the robot's actions: for example, the robot could move randomly or perform an arbitrary fixed action (e.g. a robot arm moving upwards with constant velocity until it reaches the maximum height and then staying there). In preliminary experiments, we assessed whether this choice bears any effect on final performance, and concluded that no particular setting had clear advantages. For learning the locomotion and manipulation policies, most of the hyperparameters suggested in the original HER implementation (Plappert et al., 2018) were retained, with only a couple of exceptions applying to the locomotion policies: to facilitate exploration, with probability 0.2 (0.3 in Plappert et al., 2018) a random action was drawn from a uniform distribution; otherwise we retained the current action and added Gaussian noise with zero mean and standard deviation 0.05 (0.2 in Plappert et al., 2018). For the locomotion policies, in all Shadow's hand environments and the 5-tuple environment, we train the objects over 50 epochs; in the remaining environments, we stop the training after 20 epochs. When training the main manipulation policies, the number of epochs varies across tasks. For both locomotion and manipulation policies, each epoch includes 50 cycles, and each cycle includes 38 rollouts generated in parallel through 38 MPI workers using CPU cores. This leads to \(38 \times 50 = 1900\) full episodes per epoch. For each epoch, the parameters are updated 40 times using a batch size of 4864 on a GPU core.
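The exploration scheme for the locomotion policies can be sketched as follows; the function name and argument layout are our own, while the probability 0.2 and noise standard deviation 0.05 are the values stated above.

```python
import numpy as np

def explore_action(policy_action, act_low, act_high, rng,
                   random_eps=0.2, noise_std=0.05):
    """Exploration for locomotion policies: with probability `random_eps`
    draw a uniform random action; otherwise keep the policy action and add
    zero-mean Gaussian noise with standard deviation `noise_std`."""
    if rng.random() < random_eps:
        return rng.uniform(act_low, act_high)
    noisy = policy_action + rng.normal(0.0, noise_std, size=policy_action.shape)
    return np.clip(noisy, act_low, act_high)  # keep action within bounds
```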
We normalise the observations to have zero mean and unit standard deviation before feeding them to the neural networks. The means and standard deviations of the observations are updated with running estimates computed from the data in each rollout. We clip \(q_t^{\text{SLDR}}\) to the same range as the environmental sparse rewards, i.e. \([-1, 0]\).
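A standard way to maintain such running estimates uses a streaming (Chan-style) mean/variance update; the class and function names below are hypothetical, and this is a sketch of the common pattern rather than the authors' exact implementation.

```python
import numpy as np

class RunningNormalizer:
    """Normalises observations with running mean/variance estimates
    updated incrementally from each rollout."""
    def __init__(self, dim, eps=1e-8):
        self.mean = np.zeros(dim)
        self.var = np.ones(dim)
        self.count = eps  # avoids division by zero before the first update

    def update(self, batch):
        batch = np.asarray(batch, dtype=float)
        b_mean, b_var, b_n = batch.mean(0), batch.var(0), batch.shape[0]
        delta = b_mean - self.mean
        tot = self.count + b_n
        # parallel-variance (Chan et al.) combination of the two estimates
        self.mean = self.mean + delta * b_n / tot
        m_a = self.var * self.count
        m_b = b_var * b_n
        self.var = (m_a + m_b + delta**2 * self.count * b_n / tot) / tot
        self.count = tot

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

def clip_sldr(q):
    """Clip the SLDR to the range of the environmental sparse reward."""
    return float(np.clip(q, -1.0, 0.0))
```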
Our algorithm has been implemented in PyTorch. All the environments are based on OpenAI Gym. The corresponding source code, the environments, and illustrative videos for selected tasks have been made publicly available (see Notes).
6.2 Comparison and performance evaluation
We include the following methods for comparisons:

DDPG-Sparse Refers to DDPG (Lillicrap et al., 2015) using sparse rewards.

HER-Sparse Refers to DDPG with HER (Andrychowicz et al., 2017) using sparse rewards.

CHER-Sparse Refers to DDPG with Curriculum-guided Hindsight Experience Replay (Fang et al., 2019) using sparse rewards.

HER-Dense Refers to DDPG with HER, using dense distance-based rewards.

DDPG-Sparse+SLDR Refers to DDPG using sparse environmental rewards and the SLDR proposed in this paper.

HER-Sparse+RNDR Refers to DDPG with HER, using sparse environmental rewards and random network distillation-based auxiliary rewards (RNDR) (Burda et al., 2018a).

HER-Sparse+SLDR Refers to DDPG with HER, using sparse environmental rewards and SLDR.
We use DDPG-Sparse, HER-Sparse and HER-Dense as baselines. HER-Sparse+RNDR is a representative method that constructs auxiliary rewards to facilitate policy learning. CHER-Sparse replaces the random selection mechanism of HER with an adaptive one that considers proximity to the true goals. DDPG-Sparse+SLDR and HER-Sparse+SLDR represent the proposed approach using SLDR with different policy-learning methods.
Following Plappert et al. (2018), we evaluate the performance after each training epoch by performing 10 deterministic test rollouts for each one of the 38 MPI workers. Then we compute the test success rate by averaging across the 380 test rollouts. For all comparison methods, we evaluate the performance with 5 different random seeds and report the median test success rate with the interquartile range. In all environments, we also keep the models with the highest test success rate for different methods and compare their performance.
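The aggregation across seeds described above (median with interquartile range) can be computed as in the sketch below; the function name is our own.

```python
import numpy as np

def summarize_success(success_rates):
    """Median test success rate and interquartile range across runs with
    different random seeds, as used for reporting the learning curves."""
    rates = np.asarray(success_rates, dtype=float)
    median = np.median(rates)
    q1, q3 = np.percentile(rates, [25, 75])
    return median, (q1, q3)
```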
6.3 Single rigid object environments
The learning curves for the Fetch environments and for the Rotate and Full variants of the Shadow's hand environments are reported in Figs. 4a and 5a, b, respectively. We find that HER-Sparse+SLDR features a faster learning rate and the best performance on all tasks. This demonstrates that SLDR, coupled with DDPG and HER, can facilitate policy learning with sparse rewards. The benefits introduced by HER-Sparse+SLDR are particularly evident in the hand manipulation tasks (Fig. 5a, b), which are notoriously more complex to solve than the Fetch robot tasks (Fig. 4a). Additionally, we find that HER-Sparse+SLDR outperforms HER-Sparse+RNDR in most tasks. A possible reason for this result is that most methods using auxiliary rewards are based on the notion of curiosity, whereby reaching unseen states is a preferable strategy; this is less suitable for manipulation tasks (Andrychowicz et al., 2017). In contrast, the proposed method exploits a notion of desired object locomotion to guide the main policy during training. We also observe that DDPG-Sparse+SLDR fails for most tasks. A possible reason is that, despite its effectiveness, the proposed approach still requires a suitable RL algorithm to learn from SLDR together with sparse environmental rewards; DDPG on its own is less effective. We find that HER-Dense performs worse than HER-Sparse. This result supports previous observations that sparse rewards may be more beneficial for complex robot manipulation tasks than dense rewards (Andrychowicz et al., 2017; Plappert et al., 2018). Finally, we observe that CHER-Sparse fails in most tasks and cannot facilitate successful learning. This is somewhat expected given our particular setup, and a possible explanation is in order.
Sampling the replay buffer based on proximity to the true goals may work well for locomotion tasks because the distance between the robot gripper and the target is taken into account, and this distance is under the direct control of the robot from the very first episode. In the manipulation tasks, on the other hand, the distance between the object and the target stays roughly constant in the early training episodes, as the robot has not yet learned to interact with the object. A sampling technique that prioritises replays based on proximity may therefore produce biased batches that can disrupt the learning process. For example, a random robot action causing the object to move away from the target would favour trajectories characterised by a lack of interaction between the robot and the object. Although we report some success on EggRotate, BlockRotate and PenRotate using CHER, this is much lower than the success observed when using HER-Sparse+SLDR and HER-Sparse.
6.4 Fetch arm multiple object environments
For environments with N objects, we reuse the locomotion policies trained on the single-object PickAndPlace task, and obtain an individual SLDR for each of the N objects. We train \(N+1\) action-value functions in total, i.e. one for each SLDR and one for the environmental sparse rewards. The manipulation policy is trained using the gradient in Eq. (11).
Inspired by Plappert et al. (2018), we randomly select between two initialisation settings for training: (1) the targets are distributed on the table (i.e. an auxiliary task) and (2) the targets are stacked on top of each other (i.e. the original stacking task). Each initialisation setting is selected with probability 0.5. We have observed that this initialisation strategy helps HER-based methods complete the stacking tasks. From Fig. 4b, we find that HER-Sparse+SLDR achieves better performance than HER-Sparse, HER-Sparse+RNDR and HER-Dense in the 2-object stacking task (Stack2), while the other methods fail. On the more complex 3-object stacking task (Stack3), HER-Sparse+SLDR is the only algorithm to succeed. HER-Sparse+RNDR occasionally solves the Stack3 task with fixed random seeds, but the performance is unstable across different random seeds and multiple runs.
6.5 Fetch arm non-rigid object environments
The learning curves for the 3-tuple and 5-tuple non-rigid object tasks are reported in Fig. 4c. Similarly to the multiple object environments, HER-Sparse+SLDR achieves better performance in the 3-tuple task than HER-Sparse and HER-Sparse+RNDR, while the other methods fail to complete the task. For the more complex 5-tuple task, only HER-Sparse+SLDR is able to succeed. Among the 4 predefined targets depicted in Fig. 3, HER-Sparse+SLDR achieves 3 targets on average, and accomplishes all 4 targets in one instance out of 5 runs with different random seeds.
6.6 Comparison across the best models
Figure 6 summarises the performance of the models with the best test success rates for each of the competing methods. We can see that the proposed HER-Sparse+SLDR achieves top performance compared to all other methods. Specifically, HER-Sparse+SLDR is the only algorithm able to steadily solve 3-object stacking (Stack3) and 5-tuple non-rigid object manipulation (5-tuple). Remarkably, these two tasks have the highest complexity among all 13 tasks: the Stack3 task includes multiple stages that require the robot to pick and place multiple objects with different source and target locations in a fixed order, and the 5-tuple object has the most complex dynamics. For these complex tasks, the proposed SLDR seems particularly beneficial. A possible reason is that, although the tasks are very complex, the objects are still able to learn good locomotion policies (see Fig. 7a), and the rewards learnt from these policies provide critical feedback on how the object should be manipulated to complete the task. This type of object-based feedback is not exploited by other methods such as HER-Sparse and HER-Sparse+RNDR. Our approach outperforms the runner-up by a large margin in the Full variants of the Shadow's hand manipulation tasks (EggFull, BlockFull and PenFull), which feature complex state/action spaces and system dynamics. Finally, the proposed method consistently achieves better or similar performance than the runner-up in the other, simpler tasks.
7 Conclusion and discussion
In this paper, we address the problem of mastering robot manipulation through deep reinforcement learning using only sparse rewards. The rationale for the proposed methodology is that robot manipulation tasks can be seen as inducing object locomotion. Based on this observation, we propose to first model the objects as independent entities that learn an optimal locomotion policy through interactions with a realistically simulated environment; these policies are then leveraged to improve the manipulation learning phase.
We believe that using SLDRs introduces significant advantages. First, SLDRs are generated artificially through an RL policy, and hence require no human effort. Producing human demonstrations for complex tasks may be difficult and/or costly without significant investments in human resources. For instance, it may be particularly difficult for a human to generate good demonstrations for tasks such as manipulating non-rigid objects with a single hand or with a robotic gripper. On the other hand, we have demonstrated that the locomotion policies can be easily learnt, even for complex tasks, purely in a virtual environment; in our studies, these policies achieved a 100% success rate on all tasks (e.g. see Fig. 7a, b). Furthermore, since the locomotion policy is learnt through RL, our approach does not require task-specific domain knowledge and can be designed using only sparse rewards. Training the locomotion policies only requires the same sparse rewards provided by the environment, and the SLDRs produced through RL lead to high-quality manipulation policies; this point is supported by the empirical evidence obtained from the experiments involving all 13 environments presented in this paper. As commonly observed in deep RL approaches, the use of neural networks as function approximators for policies and inverse dynamics functions may introduce convergence issues and lead to non-optimal policies; despite these limitations, the proposed methodology has proved sufficiently reliable and competitive. The proposed approach is orthogonal to existing methods that use expert demonstrations, and combining the two would be an interesting direction to explore in the future.
The performance of the proposed framework has been thoroughly examined on 13 robot manipulation environments of increasing complexity. These studies demonstrate that faster learning and higher success rates can be achieved through SLDRs compared to existing methods. In our experiments, SLDRs have enabled the robots to solve complex tasks, such as stacking 3 objects and manipulating a 5-tuple non-rigid object, where competing methods have failed. Remarkably, we have been able to outperform the runner-up methods by a significant margin on the complex Shadow's hand manipulation tasks. Although SLDRs are obtained using a physics engine, this requirement does not restrict the applicability of the proposed approach to situations where the manipulation is learnt using a real robot, as long as the locomotion policy can be pre-learnt realistically.
Several aspects will be investigated in follow-up work. We have noticed that when the interaction between the manipulating robot and the objects is very complex, the manipulation policy may be difficult to learn despite the fact that the locomotion policy is successfully learnt. For instance, in the case of the 5-tuple task with the Fetch arm, although the locomotion policy achieves a 100% success rate (as shown in Fig. 7a), the manipulation policy does not always complete the task (as shown in Figs. 4c and 6). In such cases, when the ideal object locomotion depends heavily on the robot, the benefit of the SLDRs is reduced. Another limitation is given by our Assumption 2 (Sect. 4.2), which may not hold for some tasks. For example, in the pen manipulation tasks with the Shadow's hand, although the pen can rotate and translate itself to complete locomotion tasks (as shown in Fig. 7a), it is difficult for the robot to reproduce the same locomotion without dropping the pen. This issue can degrade the performance of the manipulation policy despite having obtained an optimal locomotion policy (see Figs. 5a, b and 6). A possible solution would be to train the manipulation policy and locomotion policy jointly, checking whether the robot can reproduce the object locomotion suggested by the locomotion policy; a notion of "reachability" of object locomotion could be used to regularise the locomotion policy and enforce \(P(z_t \mid \mu_\theta) \overset{d}{=} P(z_t \mid \nu_\theta)\).
An important aspect to bear in mind is that our methodology requires the availability of a simulated environment for the application at hand. Nowadays, due to the well-documented sample inefficiency of most state-of-the-art, model-free DRL algorithms, such simulators are commonly used for training RL policies before deployment in the real world. Moreover, creating physically realistic environments from existing 3D models using modern tools has become almost effortless. In this sense, the approach proposed in this work requires only a marginal amount of additional engineering once a simulator has been developed. For instance, using MuJoCo, setting up the object locomotion policies would only entail removing the robot from the environment and including the objects as "mocap" entities. In comparison with other approaches, such as those relying on human demonstrations, the additional effort required to enable SLDR is minimal.
In this paper we have adopted DDPG as the main training algorithm due to its widely reported effectiveness in continuous control tasks. However, our framework is sufficiently general, and other algorithms may be suitable, such as trust region policy optimisation (TRPO) (Schulman et al., 2015), proximal policy optimisation (PPO) (Schulman et al., 2017) and soft actor-critic (Haarnoja et al., 2018); analogously, model-based methods (Chua et al., 2019; Zhang et al., 2019) could also provide feasible alternatives to be explored in future work.
Availability of data and material
Data used in the experiments has been generated through publicly available simulators. Related simulation files have been shared through the links given in the paper in order to fully reproduce the presented results.
Code availability
Code has been shared through the links given in the paper in order to fully reproduce the presented results.
Notes
Source code: https://github.com/WMGDataScience/sldr.git.
Environments: https://github.com/WMGDataScience/gym_wmgds.git.
Supplementary videos: https://youtu.be/jubZ0dPVl2M.
References
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning via inverse reinforcement learning. In International conference on machine learning.
Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
Andrychowicz, M., Crow, D., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., & Zaremba, W. (2017). Hindsight experience replay. In Advances in neural information processing systems (pp. 5055–5065)
Belter, D., Kopicki, M., Zurek, S., & Wyatt, J. (2014). Kinematically optimised predictions of object motion. In 2014 IEEE/RSJ international conference on intelligent robots and systems (pp. 4422–4427). IEEE.
Bojarski, M., Del Testa, D., Dworakowski, D., Firner, B., Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller, U., Zhang, J., Zhang, X., & Zhao, J. (2016) End to end learning for selfdriving cars. arXiv preprint arXiv:1604.07316.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). Openai gym. arXiv preprint arXiv:1606.01540.
Brown, D. S., Goo, W., Nagarajan, P., & Niekum, S. (2019). Extrapolating beyond suboptimal demonstrations via inverse reinforcement learning from observations. In International conference on learning representations .
Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018a). Exploration by random network distillation. arXiv preprint arXiv:1810.12894.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., & Efros, A. A. (2018b). Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355.
Choi, S., Lee, K., & Songhwai, O. (2019). Robust learning from demonstrations with mixed qualities using leveraged gaussian processes. IEEE Transactions on Robotics, 35, 564–576.
Chua, K., Calandra, R., McAllister, R., & Levine, S. (2019). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In International conference on machine learning.
Coumans, E., & Bai, Y. (2017). Pybullet, a python module for physics simulation in robotics, games and machine learning. https://pybullet.org.
Duan, Y., Andrychowicz, M., Stadie, B. C., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., & Zaremba, W. (2017). One-shot imitation learning. In Advances in neural information processing systems (pp. 1087–1098).
Fang, M., Zhou, T., Du, Y., Han, L., & Zhang, Z. (2019). Curriculum-guided hindsight experience replay.
Finn, C., Levine, S., & Abbeel, P. (2016). Guided cost learning: Deep inverse optimal control via policy optimization. In International conference on machine learning (pp. 49–58).
Florensa, C., Held, D., Geng, X., & Abbeel, P. (2018). Automatic goal generation for reinforcement learning agents. In International conference on machine learning (pp. 1514–1523).
Fu, J., Levine, S., & Abbeel, P. (2016). One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 4019–4026). IEEE.
Gao, Y., Xu, H., Lin, J., Yu, F., Levine, S., & Darrell, T. (2018). Reinforcement learning from imperfect demonstrations. In International conference on learning representations.
Grollman, D. H., & Billard, A. (2011). Donut as I Do: Learning from failed demonstrations. In IEEE international conference on robotics and automation.
Gu, S., Holly, E., Lillicrap, T. P., & Levine, S. (2017). Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In International conference on robotics and automation (pp. 3389–3396).
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning.
Hester, T., et al. (2018). Deep Q-learning from demonstrations. In AAAI conference on artificial intelligence (pp. 3223–3230).
Ho, J., & Ermon, S. (2016). Generative adversarial imitation learning. In Advances in neural information processing systems (pp. 4565–4573).
Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., & Abbeel, P. (2016). VIME: Variational information maximizing exploration. In Advances in neural information processing systems.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR. arXiv:1412.6980.
Koenig, N., & Howard, A. (2004). Design and use paradigms for gazebo, an open-source multi-robot simulator. In IEEE/RSJ international conference on intelligent robots and systems (IROS).
Kopicki, M., Wyatt, J., & Stolkin, R. (2009). Prediction learning in robotic pushing manipulation. In 2009 International conference on advanced robotics (pp. 1–6). IEEE.
Kopicki, M., Zurek, S., Stolkin, R., Mörwald, T., & Wyatt, J. (2011). Learning to predict how rigid objects behave under simple manipulation. In 2011 IEEE international conference on robotics and automation (pp. 5722–5729). IEEE.
Kroemer, O., Niekum, S., & Konidaris, G. (2019). A review of robot learning for manipulation: Challenges, representations, and algorithms. arXiv preprint arXiv:1907.03146.
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1), 1334–1373.
Li, Y., Wu, J., Tedrake, R., Tenenbaum, J. B., & Torralba, A. (2018). Learning particle dynamics for manipulating rigid bodies, deformable objects, and fluids. arXiv preprint arXiv:1810.01566.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. CoRR. arXiv:1509.02971.
Mnih, V., et al. (2015). Humanlevel control through deep reinforcement learning. Nature, 518(7540), 529–533.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International conference on machine learning (pp. 1928–1937).
Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., & Abbeel, P. (2018a) Overcoming exploration in reinforcement learning with demonstrations. In International conference on robotics and automation (pp. 6292–6299).
Nair, A., Pong, V., Dalal, M., Bahl, S., Lin, S., & Levine, S. (2018b). Visual reinforcement learning with imagined goals. In Advances in neural information processing systems (pp. 9209–9220).
Ng, A. Y., & Russell, S. J. (2000). Algorithms for inverse reinforcement learning. In International conference on machine learning (pp. 663–670).
Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In IEEE conference on computer vision and pattern recognition workshops (pp. 16–17).
Plappert, M., et al. (2018). Multi-goal reinforcement learning: Challenging robotics environments and request for research. CoRR. arXiv:1802.09464.
Pomerleau, D. (1988). ALVINN: An autonomous land vehicle in a neural network. In Advances in neural information processing systems (pp. 305–313).
Popov, I., Heess, N., Lillicrap, T. P., Hafner, R., Barth-Maron, G., Vecerik, M., Lampe, T., Tassa, Y., Erez, T., & Riedmiller, M. A. (2017). Data-efficient deep reinforcement learning for dexterous manipulation. CoRR. arXiv:1704.03073.
Ratliff, N. D., Bagnell, J. A., & Srinivasa, S. S. (2007). Imitation learning for locomotion and manipulation. In International conference on humanoid robots (pp. 392–397).
Reddy, S., Dragan, A. D., & Levine, S. (2019). SQIL: Imitation learning via regularized behavioral cloning. arXiv preprint arXiv:1905.11108.
Riedmiller, M. A., et al. (2018). Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning (pp. 4341–4350).
Ross, S., Gordon, G. J., & Bagnell, D. (2011). A reduction of imitation learning and structured prediction to no-regret online learning. In International conference on artificial intelligence and statistics (pp. 627–635).
Sasaki, F., Yohira, T., & Kawaguchi, A. (2019). Sample efficient imitation learning for continuous control. In International conference on learning representations.
Savinov, N., Raichuk, A., Marinier, R., Vincent, D., Pollefeys, M., Lillicrap, T., & Gelly, S. (2018). Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274.
Schaul, T., Horgan, D., Gregor, K., & Silver, D. (2015). Universal value function approximators. In Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July 2015 (pp. 1312–1320).
Schmidhuber, J. (1991). Curious modelbuilding control systems. In IEEE international joint conference on neural networks (pp. 1458–1463). IEEE.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning (pp. 1889–1897).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shiarlis, K., Messias, J., & Whiteson, S. (2016). Inverse reinforcement learning from failure. In International conference on autonomous agents & multiagent systems.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. A. (2014). Deterministic policy gradient algorithms. In International conference on machine learning (pp. 387–395).
Silver, D., et al. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., & Fergus, R. (2018). Intrinsic motivation and automatic curricula via asymmetric self-play. In International conference on learning representations.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press.
Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems (pp. 5026–5033). IEEE.
Vecerik, M., et al. (2017). Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. CoRR. arXiv:1707.08817.
Wang, Z., Merel, J., Reed, S., Wayne, G., de Freitas, N., & Heess, N. (2017). Robust imitation of diverse behaviors. arXiv preprint arXiv:1707.02747.
Watkins, C. J. C. H., & Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8, 279–292.
Xu, H., Gao, Y., Yu, F., & Darrell, T. (2017). End-to-end learning of driving models from large-scale video datasets. In IEEE conference on computer vision and pattern recognition.
Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M., & Levine, S. (2019). SOLAR: Deep structured representations for model-based reinforcement learning. In International conference on machine learning.
Zheng, J., Liu, S., & Ni, L. M. (2014). Robust Bayesian inverse reinforcement learning with sparse behavior noise. In AAAI conference on artificial intelligence.
Zhu, H., Gupta, A., Rajeswaran, A., Levine, S., & Kumar, V. (2018). Dexterous manipulation with deep reinforcement learning: Efficient, general, and low-cost. arXiv preprint arXiv:1810.06045.
Author information
Contributions
OK proposed the idea, implemented it in Python, modified the Gym environments for the MuJoCo simulations, ran the experiments, obtained the results, and wrote the first draft of the manuscript. GM managed the project, provided the computational resources, and revised the manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Editors: Annalisa Appice, Sergio Escalera, Jose A. Gamez, Heike Trautmann
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Kilinc, O., Montana, G. Reinforcement learning for robotic manipulation using simulated locomotion demonstrations. Mach Learn 111, 465–486 (2022). https://doi.org/10.1007/s10994-021-06116-1
Keywords
 Sparse rewards
 Reinforcement learning
 Robotic manipulation
 Auxiliary rewards