Reinforcement Learning for Robotic Manipulation using Simulated Locomotion Demonstrations

Mastering robotic manipulation skills through reinforcement learning (RL) typically requires the design of shaped reward functions. Recent developments in this area have demonstrated that using sparse rewards, i.e. rewarding the agent only when the task has been successfully completed, can lead to better policies. However, state-action space exploration is more difficult in this case. Recent RL approaches to learning with sparse rewards have leveraged high-quality human demonstrations for the task, but these can be costly, time consuming or even impossible to obtain. In this paper, we propose a novel and effective approach that does not require human demonstrations. We observe that every robotic manipulation task could be seen as involving a locomotion task from the perspective of the object being manipulated, i.e. the object could learn how to reach a target state on its own. In order to exploit this idea, we introduce a framework whereby an object locomotion policy is initially obtained using a realistic physics simulator. This policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn the robot manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity, and can achieve higher success rate and faster learning rates compared to alternative algorithms. SLDRs are especially beneficial for tasks like multi-object stacking and non-rigid object manipulation.


Introduction
Reinforcement Learning (RL) solves sequential decision-making problems by learning a policy that maximises expected rewards. Recently, with the aid of deep artificial neural networks as function approximators, RL-trained agents have been able to autonomously master a number of complex tasks, most notably playing video games [1] and board games [2]. Robot manipulation has been extensively studied in RL, but is particularly challenging to master because it often involves multiple stages (e.g. stacking multiple blocks), high-dimensional state spaces (e.g. dexterous hand manipulation [3,4]) and complex dynamics (e.g. manipulating non-rigid objects). Although promising performance has been reported on a wide range of tasks like grasping [5,6], stacking [7] and dexterous hand manipulation [3,4], the learning algorithms usually require carefully designed reward signals to learn good policies. For example, [6] propose a carefully weighted 5-term reward formula for learning to stack Lego blocks and [8] use a 3-term shaped reward to perform door-opening tasks with a robot arm. The requirement of hand-engineered, dense reward functions limits the applicability of RL in real-world robot manipulation to cases where task-specific knowledge can be captured.
The alternative to designing shaped rewards consists of learning with only sparse feedback signals, i.e. a non-zero reward indicating the completion of a task. Using sparse rewards is more desirable in practice as it generalises to many tasks without the need for hand-engineering [2,9,10]. On the other hand, learning with only sparse rewards is significantly more challenging, since it becomes harder to associate sequences of actions with non-zero rewards that are received only when a task has been successfully completed. A number of approaches that address this problem have been proposed lately [9,11,12,13,14,10,15,16]; some of them report some success in completing manipulation tasks like object pushing [9,16], pick-and-place [9], stacking two blocks [11,16], and target finding in a scene [14,15]. Nevertheless, for more complex tasks such as stacking multiple blocks and manipulating non-rigid objects, there is scope for further improvement.
Figure 1: An illustration of the proposed approach. Top row: a general robot manipulation task of pick-and-place, which requires the robot to pick up an object (green cube) and place it at a specified location (red sphere). Middle row (Rigid Object Locomotion): the corresponding auxiliary locomotion task requires the object to move to the target location. Bottom row (Non-rigid Object Locomotion): the auxiliary locomotion task corresponding to a pick-and-place task with a non-rigid object (not shown). Note that the auxiliary locomotion tasks usually have significantly simpler dynamics compared to the corresponding robot manipulation task, and hence can be learnt efficiently through standard RL, even for very complex tasks. The learnt locomotion policy is used to inform the robot manipulation policy.
A particularly promising approach to facilitate learning has been to leverage human expertise through a number of manually generated examples demonstrating the robot actions required to complete a given task. When these demonstrations are available, they can be used by an agent in various ways, e.g. by attempting to generate a policy that mimics them [17,18,19], pre-learning a policy from them for further RL [2,20], as a mechanism to guide exploration [7], as data from which to infer a reward function [21,22,23,24], and in combination with trajectories generated during RL [25,26,27]. Practically, however, human demonstrations are expensive to obtain, and their effectiveness ultimately depends on the competence of the demonstrators. Demonstrators with insufficient task-specific expertise could generate low-quality demonstrations resulting in sub-optimal policies. Although there is an existing body of work focusing on learning with imperfect demonstrations [28,29,30,31,32,33], these methods usually assume that either qualitative evaluation metrics are available [28,30,32] or that a substantial volume of demonstrations can be collected [29,31,33].
In this paper, we propose a novel approach that allows complex robot manipulation tasks to be learnt with only sparse rewards. In the tasks we consider, an object is manipulated by a robot so that, starting from a (random) initial position, it eventually reaches a goal position through a sequence of states in which its location and pose vary. For example, Figure 1 (top row) represents a pick-and-place task in which the object is picked up by the two-finger gripper and moved from its initial state to a pre-defined target location (red sphere). Our key observation is that every robot manipulation task implies an underlying object locomotion task that can be explicitly modelled as an independent task for the object itself to learn. Figure 1 (middle row) illustrates this idea for the pick-and-place task: the object, on its own, must learn to navigate from any given initial position until it reaches its target position. More complex manipulation tasks involving non-rigid objects can also be thought of as inducing such object locomotion tasks; for instance, in Figure 1 (bottom row), a 5-tuple non-rigid object moves itself to the given target location and pose (see Figure 3 for the description of the non-rigid object).
Although in the real world it is impossible for objects to move on their own, learning such object locomotion policies can be achieved in a virtual environment through a realistic physics engine such as MuJoCo [34], Gazebo [35] or Pybullet [36]. In our experience, such policies are relatively straightforward to learn using only sparse rewards since the objects usually operate in simple state/action spaces and/or have simple dynamics. Once a locomotion policy has been learnt, we utilise it to produce a form of auxiliary rewards guiding the main manipulation policy. We name these auxiliary rewards "Simulated Locomotion Demonstration Rewards" (SLDRs). During the process of learning the robot manipulation policy, the proposed SLDRs encourage the robot to execute policies implying object trajectories that are similar to those obtained by the object locomotion policy.
Although the SLDRs can only be learnt through a realistic simulator, this requirement does not restrict their applicability to real world problems, and the resulting manipulation policies can still be transferred to physical systems. To the best of our knowledge, this is the first time that object-level policies are trained in a physics simulator to enable robot manipulation learning driven by only sparse rewards.
In our implementation, all the policies are learnt using deep deterministic policy gradient (DDPG) [37], which has been chosen due to its widely reported effectiveness in continuous control; however, most RL algorithms compatible with continuous actions could have been used within the proposed SLD framework. Our experimental results involve 13 continuous control environments using the MuJoCo physics engine [34] within the OpenAI Gym framework [38]. These environments cover a variety of robot manipulation tasks with increasing levels of complexity, e.g. pushing, sliding and pick-and-place tasks with a Fetch robotic arm, in-hand object manipulation with a Shadow's dexterous hand, multi-object stacking, and non-rigid object manipulation. Overall, across all environments, we have found that our approach achieves faster learning and a higher success rate than the baseline methods, especially in more challenging tasks such as stacking objects and manipulating non-rigid objects. The baselines represent existing approaches that use reward shaping, curiosity-based auxiliary rewards and auxiliary goal generation techniques.
The remainder of the paper is organised as follows. In Section 2 we review the most related work, and in Section 3 we provide some introductory background material regarding the RL modelling framework and algorithms we use. In Section 4, we develop the proposed methodology. In Section 5 we describe all the environments used for our experiments, and the experimental results are reported in Section 6. Finally, we conclude with a discussion and suggestions for further extensions in Section 7.

Related Work
Robotic Manipulation. Robotics requires sequential decision making under uncertainty, and is therefore a common application domain for machine learning approaches including RL [39]. Recent advances in RL have focused on locomotion [37,40,41] and manipulation tasks [42,8], which include grasping [5,6], stacking [7] and dexterous hand manipulation [3,4]. These tasks are particularly challenging as they require continuous control over actions and the expected behaviours are hard to formulate through rewards. Due to the sample inefficiency of RL, most state-of-the-art approaches rely on simulated environments such as MuJoCo [34], as training on physical systems would be significantly slower and more costly. Predicting how objects behave under manipulation has also been well studied. For example, [43,44,45] propose approaches to predict the motions of rigid objects under pushing actions with the aim of using these models to plan the robotic manipulation. Most recently, [46] proposed learning particle-based dynamics from data to handle complex interactions between rigid bodies, deformable objects and fluids. The focus of these studies has been to develop learnable simulators to replace traditional physics engines, whereas in this paper our aim is to learn object policies using the simulators. Although we employ a traditional physics engine for this paper, this could be replaced with learnable simulators in future work.
Learning from Demonstrations. A substantial body of work exists on how to leverage demonstrations, when available, for reinforcement learning. Behaviour cloning (BC) methods approach sequential decision-making as a supervised learning problem [47,48,18,19]. Some BC methods include an expert demonstrator in the training loop to handle the mismatch between the demonstration data and the data encountered in the training procedure [17,49]. Recent BC methods have also considered adversarial frameworks to improve the policy learning [24,50]. A different approach consists of inverse reinforcement learning, which seeks to infer a reward/cost function to guide the policy learning [21,22,23]. Several methods have been developed to leverage demonstrations for robotic manipulation tasks with sparse rewards. For instance, [25,7] jointly use demonstrations with trajectories collected during the RL process to guide the exploration, and [20] use the demonstrations to pre-learn a policy, which is further fine-tuned in a following RL stage. Obtaining the training data requires specialised data capture setups such as teleoperation interfaces. In general, obtaining good quality demonstrations is an expensive process in terms of both human effort and equipment requirements. In contrast, the proposed method generates object-level demonstrations autonomously, and could potentially be used jointly with human-generated demonstrations when these are available.
Goal Conditioned Policies and Auxiliary Goal Generation. Goal-conditioned policies [51] that can generalise over multiple goals have been shown to be promising for robotic problems. For manipulation tasks with sparse rewards, several approaches have recently been proposed to automatically generate the auxiliary goals. For instance, [52] used a self-play approach on reversible or resettable environments, [53] employed adversarial training for robotic locomotion tasks, [54] proposed variational autoencoders for visual robotics tasks, and [9] introduced Hindsight Experience Replay (HER), which randomly draws synthetic goals from previously encountered states. HER in particular has proved effective, although the automatic goal generation can still be problematic on complex tasks involving multiple stages, e.g. stacking multiple objects, when used without demonstrations [54]. Some attempts have been made to form an explicit curriculum for such complex tasks; e.g. [11] manually define several semantically grounded sub-tasks, each having its own individual reward. Methods such as this one require significant human effort and hence cannot be readily applied across different tasks. The method proposed in this paper uses goal-conditioned policies and adopts HER for auxiliary goal generation due to its effectiveness in robotic manipulation. However, it can be integrated with the other goal generation techniques in the literature.
Auxiliary Rewards in RL. Lately, increasing efforts have been made to design general auxiliary reward functions aimed at facilitating learning in environments with only sparse rewards. Many of these strategies involve a notion of curiosity [55], which encourages agents to visit novel states that have not been seen in previous experience; for instance, [14] formulate the auxiliary reward using the error in predicting the RL agent's actions by an inverse dynamics model, [12] encourage the agent to visit the states that result in the largest information gain in system dynamics, [10] construct the auxiliary reward based on the error in predicting the output of a fixed randomly initialised neural network, and [15] introduce the notion of state reachability. Despite the benefits introduced by these approaches, visiting unseen states may be less beneficial in robot manipulation tasks as exploring complex state spaces to find rewards is rather impractical [9]. The proposed approach, on the other hand, produces the auxiliary rewards based on the underlying object locomotion; as such, it motivates the robot to mimic the optimal object locomotion rather than curiously exploring the continuous state space.

Multi-goal RL for Robotic Manipulation
We are concerned with solving a manipulation task: an object is presented to the robot, and has to be manipulated so as to reach a target position. In the tasks we consider, the target goal is specified by the object location and orientation, and the robot is rewarded only when the object reaches this goal. We model the robot's sequential decision process as a Markov Decision Process (MDP) defined by a tuple, $M = \langle \mathcal{S}, \mathcal{G}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{G}$ is the set of goals, $\mathcal{A}$ is the set of actions, $\mathcal{T}$ is the state transition function, $\mathcal{R}$ is the reward function and $\gamma \in [0, 1)$ is the discounting factor. At the beginning of an episode, the environment samples a goal $g \in \mathcal{G}$. The state of the environment at time $t$ is denoted by $s_t \in \mathcal{S}$ and includes both robot-related and object-related features. In a real system, these features are typically continuous variables obtained through sensors of the robot. The position of the object $o_t$ is one of the object-related features included in $s_t$ and can be obtained through a known mapping, i.e. $o_t = m_{\mathcal{S} \to \mathcal{O}}(s_t)$. A robot's action is controlled by a deterministic policy, i.e. $a_t = \mu_\theta(s_t, g) : \mathcal{S} \times \mathcal{G} \to \mathcal{A}$, parameterised by $\theta$. The environment moves to its next state through its state transition function, i.e. $s_{t+1} = \mathcal{T}(s_t, a_t) : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, and provides an immediate and sparse reward $r_t$, defined as
$$ r_t = \mathcal{R}(o_{t+1}, g) = \begin{cases} 0, & \text{if } \lVert o_{t+1} - g \rVert \le \epsilon, \\ -1, & \text{otherwise,} \end{cases} \tag{1} $$
where $\epsilon$ is a pre-defined threshold. Following its policy, the robot interacts with the environment until the episode terminates after $T$ steps. The interaction between the robot and the environment generates a trajectory, $\tau = (g, s_1, a_1, r_1, \ldots, s_T, a_T, r_T, s_{T+1})$. The ultimate learning objective is to find the optimal policy that maximises the expected sum of the discounted rewards over time,
$$ J(\mu_\theta) = \mathbb{E}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right], \tag{2} $$
where $\gamma$ is the discount factor.
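The sparse reward and the discounted return described above can be illustrated with a minimal sketch. The function names, the Euclidean distance and the value `gamma=0.98` are assumptions made for illustration only; the 5 cm default threshold is borrowed from the Fetch environments described in Section 5.

```python
import math


def sparse_reward(object_position, goal, threshold=0.05):
    """Return 0.0 if the object is within `threshold` of the goal, else -1.0.

    Assumes a Euclidean distance; positions are sequences of floats.
    """
    return 0.0 if math.dist(object_position, goal) <= threshold else -1.0


def discounted_return(rewards, gamma=0.98):
    """Sum of gamma**(t-1) * r_t for t = 1..T (index k = t - 1 below)."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


# Toy episode: the object reaches the goal only at the last step.
rewards = [sparse_reward(p, (0.3, 0.0, 0.0))
           for p in [(0.1, 0.0, 0.0), (0.2, 0.0, 0.0), (0.3, 0.0, 0.0)]]
print(rewards, discounted_return(rewards, gamma=0.5))
```

Because every unsuccessful step costs −1, the return is larger the sooner the object reaches the goal, which is what makes the sparse signal informative despite carrying no shaping term.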

Deep Deterministic Policy Gradient Algorithm
Policy Gradient (PG) algorithms update the policy parameters $\theta$ in the direction of $\nabla_\theta J(\mu_\theta)$ to maximise the expected return $J(\mu_\theta)$. Deep Deterministic Policy Gradient (DDPG) [37] integrates non-linear function approximators such as neural networks with Deterministic Policy Gradient (DPG) [56], which uses deterministic policy functions. DDPG maintains a policy (actor) network $\mu_\theta(s_t, g)$ and an action-value (critic) network $Q^\mu(s_t, a_t, g)$.
The actor $\mu_\theta(s_t, g)$ deterministically maps states to actions. The critic $Q^\mu(s_t, a_t, g)$ estimates the expected return when starting from $s_t$ by taking $a_t$, and then following $\mu_\theta$ in the future states until the termination of the episode, i.e. $Q^\mu(s_t, a_t, g) = \mathbb{E}\left[ \sum_{i=t}^{T} \gamma^{i-t} r_i \,\middle|\, s_t, a_t, g, \mu_\theta \right]$. When interacting with the environment, DDPG ensures exploration by adding noise to the deterministic policy output, i.e. $a_t = \mu_\theta(s_t, g) + \mathcal{N}$. Transitions experienced during these interactions, i.e. $(g, s_t, a_t, r_t, s_{t+1})$, are stored in a replay buffer $D$. The actor and critic networks are updated using the transitions sampled from $D$. The critic parameters are learnt by minimising the following loss to satisfy the Bellman equation, similarly to Q-learning [57]:
$$ L_Q = \mathbb{E}_{(g, s_t, a_t, r_t, s_{t+1}) \sim D}\left[ \left( Q^\mu(s_t, a_t, g) - y \right)^2 \right], \tag{3} $$
where $y = r_t + \gamma Q^\mu(s_{t+1}, \mu_\theta(s_{t+1}, g), g)$. The actor parameters $\theta$ are updated using the following policy gradient:
$$ \nabla_\theta J(\mu_\theta) = \mathbb{E}_{(g, s_t) \sim D}\left[ \nabla_a Q^\mu(s_t, a, g)\big|_{a = \mu_\theta(s_t, g)} \, \nabla_\theta \mu_\theta(s_t, g) \right]. \tag{4} $$
We adopt DDPG as the main training algorithm; however, the proposed idea can also be used with other off-policy approaches that work with continuous action domains.
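As an illustration of the critic update, the sketch below computes the bootstrapped target $y$ and the mean squared Bellman error for a batch of transitions. It is a toy, framework-free version: `actor` and `critic` are hypothetical callables standing in for the neural networks, states and actions are scalars, and the separate target networks used in DDPG are omitted for brevity.

```python
def td_target(reward, next_state, goal, actor, critic, gamma=0.98):
    """Bootstrapped target: r_t + gamma * Q(s_{t+1}, mu(s_{t+1}, g), g)."""
    next_action = actor(next_state, goal)
    return reward + gamma * critic(next_state, next_action, goal)


def critic_loss(batch, actor, critic, gamma=0.98):
    """Mean squared Bellman error over transitions (g, s, a, r, s_next)."""
    squared_errors = []
    for goal, state, action, reward, next_state in batch:
        y = td_target(reward, next_state, goal, actor, critic, gamma)
        squared_errors.append((critic(state, action, goal) - y) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical 1-D stand-ins: the actor steps straight to the goal and the
# critic scores an action by the remaining distance after taking it.
def toy_actor(state, goal):
    return goal - state


def toy_critic(state, action, goal):
    return -abs(goal - (state + action))


batch = [(1.0, 0.0, 0.0, -1.0, 0.0)]  # (g, s_t, a_t, r_t, s_{t+1})
print(critic_loss(batch, toy_actor, toy_critic, gamma=0.5))
```

In a real implementation the loss would be minimised with respect to the critic parameters by a gradient-based optimiser, while the target $y$ is treated as a constant.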

Hindsight Experience Replay
Hindsight Experience Replay (HER) [9] has been introduced to learn policies from sparse rewards, especially for robot manipulation tasks. The idea is to view the states achieved in an episode as pseudo goals (i.e. achieved goals) to facilitate learning even when the desired goal has not been achieved during the episode. Suppose we are given an observed trajectory, $\tau = (g, s_1, a_1, r_1, \ldots, s_T, a_T, r_T, s_{T+1})$. Since $o_t$ can be obtained from $s_t$ using a fixed and known mapping, the path that was followed by the object during the trajectory, i.e. $o_1, \ldots, o_{T+1}$, can be easily extracted. HER samples a new goal from this path, i.e. $\tilde{g} \sim \{o_1, \ldots, o_T\}$, and the rewards are recomputed with respect to $\tilde{g}$, i.e. $\tilde{r}_t = R(o_{t+1}, \tilde{g})$. Using these rewards and $\tilde{g}$, a new trajectory is created implicitly, i.e. $\tilde{\tau} = (\tilde{g}, s_1, a_1, \tilde{r}_1, \ldots, s_T, a_T, \tilde{r}_T, s_{T+1})$. These HER trajectories $\tilde{\tau}$ are used to train the policy parameters together with the original trajectories.
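The relabelling step can be sketched as follows. This is a simplified illustration that samples a single hindsight goal per trajectory, as in the description above; the function and argument names are hypothetical, and practical HER implementations typically sample several goals per transition.

```python
import random


def her_relabel(transitions, object_path, reward_fn, rng=random):
    """Relabel one trajectory with a goal sampled from the object's own path.

    transitions: list of (s_t, a_t, r_t, s_next) for t = 1..T
    object_path: object positions o_1..o_{T+1} (length T + 1)
    reward_fn:   callable (achieved_position, goal) -> reward
    """
    horizon = len(transitions)
    new_goal = rng.choice(object_path[:horizon])  # sampled from {o_1..o_T}
    relabelled = []
    for t, (state, action, _old_reward, next_state) in enumerate(transitions):
        # The reward at step t is recomputed from o_{t+1} and the new goal.
        new_reward = reward_fn(object_path[t + 1], new_goal)
        relabelled.append((new_goal, state, action, new_reward, next_state))
    return relabelled


# Toy 1-D example in which the original episode never reached its goal.
path = [0.0, 0.1, 0.2, 0.3]
episode = [(path[t], 0.1, -1.0, path[t + 1]) for t in range(3)]


def toy_reward(achieved, goal):
    return 0.0 if abs(achieved - goal) <= 0.01 else -1.0


for transition in her_relabel(episode, path, toy_reward):
    print(transition)
```

The relabelled transitions are stored alongside the original ones, so the policy also sees trajectories in which a goal was actually achieved.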

Methodology
Given a manipulation task, we first introduce a corresponding auxiliary locomotion task for the object that is being manipulated, i.e. the object is assumed to be the decision-making agent. This auxiliary problem is usually significantly easier to learn than the original task. After learning the object locomotion policy, we use it in a reward-generating mechanism for the robot when learning the original manipulation task. In this section, we explain the steps involved in our proposed procedure, i.e. (a) how the object locomotion policies are learnt, (b) how the proposed reward function is defined, and (c) how these auxiliary rewards are leveraged for robotic manipulation.

Object Locomotion Policies
The object involved in the manipulation task is initially modelled as an agent capable of independent decision making, and its decision process is modelled by a separate MDP defined by a tuple $L = \langle \mathcal{Z}, \mathcal{G}, \mathcal{U}, \mathcal{Y}, \mathcal{R}, \gamma \rangle$. Here, $\mathcal{Z}$ is the set of states, $\mathcal{G}$ is the set of goals, $\mathcal{U}$ is the set of actions, $\mathcal{Y}$ is the state transition function, $\mathcal{R}$ is the reward function and $\gamma \in [0, 1)$ is the discounting factor. The same goal space, $\mathcal{G}$, is used as in $M$, and $z_t \in \mathcal{Z}$ is a reduced version of $s_t$ that only involves object-related features including the position of the object, i.e. $o_t \subset z_t$. The object's action space explicitly controls the pose of the object, and these actions are controlled by a deterministic policy, i.e. $u_t = \nu_\theta(z_t, g) : \mathcal{Z} \times \mathcal{G} \to \mathcal{U}$. The state transition is defined on a different space, i.e. $\mathcal{Y} : \mathcal{Z} \times \mathcal{U} \to \mathcal{Z}$; however, the same sparse reward function is used here as before. Figure 2a illustrates the training procedure used in this context, which is based on DDPG with HER. The optimal object policy $\nu_\theta$ maximises the expected return
$$ J(\nu_\theta) = \mathbb{E}_{\eta \sim D_L}\left[ \sum_{t=1}^{T} \gamma^{t-1} r_t \right], $$
where $D_L$ denotes the replay buffer containing the trajectories, indicated by $\eta$, obtained by $\nu_\theta$ throughout training.

Robotic Manipulation with Simulated Locomotion Demonstration Rewards (SLDR)
On the original manipulation task $M$, the robot receives the current environmental state and the desired goal and then decides how to act according to its policy $\mu_\theta$. Whenever the object is moved from one position to another, the observed object locomotion is a consequence of the robot's actions. More concretely, the observed object action on $M$ (hereafter denoted by $w_t$) is a function of the robot policy $\mu_\theta$. The relation between $w_t$ and $\mu_\theta$ depends on the environmental dynamics, whose closed-form model is unknown. We use $f : \mathcal{A} \to \mathcal{U}$ to denote this unknown relation, i.e. $w_t = f(\mu_\theta(s_t, g))$.
The key steps of the proposed approach are as follows. Since an object locomotion policy has already been learnt on $L$, we first use it to query the optimal object action for the current state and goal, i.e. $u_t = \nu_\theta(z_t, g)$. Then, we update $\mu_\theta$ so that $w_t$ gets closer to $u_t$. This learning objective can be written as follows:
$$ \min_{\mu_\theta} \; \lVert f(\mu_\theta(s_t, g)) - \nu_\theta(z_t, g) \rVert^2 . \tag{5} $$
Given that the environment dynamics are unknown, we replace $f$ in Eq. 5 with a parameterised model to approximate $w_t$. Estimating $w_t$ from robot actions is not straightforward as it requires keeping track of all previous actions, i.e. $a_{1:t}$, and the initial state. Instead, we propose to estimate $w_t$ by evaluating the transition from the current state to the next. Specifically, we substitute $f$ with a parameterised inverse dynamics model, i.e. $I_\phi : \mathcal{Z} \times \mathcal{Z} \to \mathcal{U}$, that we train to estimate the output of $\nu_\theta(z_t, g)$ from $z_t$ and $z_{t+1}$, i.e. $\nu_\theta(z_t, g) \approx I_\phi(z_t, z_{t+1})$. We learn the parameters of $I_\phi$ on the object locomotion task $L$ (see Section 4.3 and Algorithm 1 for training details), and then employ the trained model on the manipulation task $M$ to approximate $w_t$. Substituting $I_\phi$ into Eq. 5 leads to the following optimisation problem:
$$ \min_{\mu_\theta} \; \lVert I_\phi(z_t, z_{t+1}) - \nu_\theta(z_t, g) \rVert^2 . \tag{6} $$
On $M$, $z_{t+1}$ is a function of $\mathcal{T}(s_t, \mu_\theta(s_t, g))$. In our setting, the closed form of the state transition function $\mathcal{T}$ is unknown; $\mathcal{T}$ can only be sampled. Also, since we pursue a model-free approach, we do not aim to learn a model for $\mathcal{T}$. Therefore, minimising Eq. 6 through gradient-based methods is not an option in our setting, as this would require differentiation through $\mathcal{T}$. Instead, we propose to formalise this objective as a reward to be maximised through a standard model-free RL approach. The first obvious candidate for this reward can be written as follows:
$$ q_t = - \lVert I_\phi(z_t, z_{t+1}) - \nu_\theta(z_t, g) \rVert^2 . \tag{7} $$
In practice, however, the above reward is sensitive to the scales of $I_\phi(z_t, z_{t+1})$ and $\nu_\theta(z_t, g)$, and therefore it may require an additional normalisation term. Even with a normalisation term, the scale of the rewards would shift throughout the training depending on the exploration and the sampling. In order to deal with this issue, we propose another reward that adopts $Q_\nu$, i.e. the action-value function that has been learnt for $\nu_\theta$ on the object locomotion task $L$ (see Section 4.3 and Algorithm 1 for training details). The proposed reward is written as follows:
$$ q_t^{SLDR} = Q_\nu(z_t, I_\phi(z_t, z_{t+1}), g) - Q_\nu(z_t, \nu_\theta(z_t, g), g) . \tag{8} $$

Algorithm 1: Learning locomotion policy and inverse dynamics
Given:
    Locomotion MDP $L = \langle \mathcal{Z}, \mathcal{G}, \mathcal{U}, \mathcal{Y}, \mathcal{R}, \gamma \rangle$
    Neural networks $\nu_\theta$, $Q_\nu$ and $I_\phi$
    A random process $\mathcal{N}_L$ for exploration
    Fixed and known mapping function $m_{\mathcal{Z} \to \mathcal{O}} : \mathcal{Z} \to \mathcal{O}$
Initialise:
    Parameters of $\nu_\theta$, $Q_\nu$ and $I_\phi$
    Experience replay buffer $D_L$
for $i_{episode} = 1$ to $N_{episode}$ do
    for $i_{rollout} = 1$ to $N_{rollout}$ do
        Receive initial state $z_1$ and $g$, $o_1 = m_{\mathcal{Z} \to \mathcal{O}}(z_1)$
        for $t = 1$ to $T$ do
            Sample an object action: $u_t = \nu_\theta(z_t, g) + \mathcal{N}_L$
            Execute the action: $z_{t+1} = \mathcal{Y}(z_t, u_t)$, $r_t = \mathcal{R}(o_{t+1}, g)$
            Store $(g, z_t, u_t, r_t, z_{t+1})$ in $D_L$
        Generate HER samples and store in $D_L$
    for $i_{update} = 1$ to $N_{update}$ do
        Get a random mini-batch of samples from $D_L$
        Update $Q_\nu$ minimising the loss in Eq. (3)
        Update $\nu_\theta$ using the gradient in Eq. (4)
        Update $I_\phi$ minimising the loss in Eq. (9)
Return: $\nu_\theta$, $Q_\nu$ and $I_\phi$

We refer to Eq. 8 as the Simulated Locomotion Demonstration Rewards (SLDR). Rather than comparing $w_t$ and $u_t$ directly with each other as in Eq. 7, the SLDR compares their action-values using $Q_\nu$. Since it is learnt on $L$ using sparse rewards, $Q_\nu$ is well-bounded [58], and $q_t^{SLDR}$, which is computed from $Q_\nu$, does not require a normalisation term.
Note that, by definition, $Q_\nu(z_t, u, g)$ gives the expected return for any object locomotion action $u \in \mathcal{U}$, when it is taken at the current state $z_t$ and then $\nu_\theta$ is followed for the future states. Since $\nu_\theta$ has been learnt through standard RL to maximise the sparse rewards, it is the optimal object locomotion policy, and therefore $Q_\nu(z_t, \nu_\theta(z_t, g), g)$ gives the maximum expected return. Accordingly, $q_t^{SLDR}$ can be viewed as the advantage of $w_t$ with respect to $\nu_\theta(z_t, g)$ in terms of the action-values, and is expected to be non-positive. Maximising this term encourages the robot to induce object actions that are similar to the optimal ones according to $\nu_\theta$.
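To make the advantage interpretation concrete, the following sketch computes the SLDR as the difference between the action-value of the induced object action (estimated by the inverse dynamics model) and the action-value of the action preferred by the locomotion policy. All three callables are hypothetical 1-D stand-ins for the trained networks, chosen only so that the arithmetic can be followed by hand.

```python
def sldr(z_t, z_next, goal, q_nu, nu, inverse_dynamics):
    """SLDR: Q(z_t, I(z_t, z_next), g) - Q(z_t, nu(z_t, g), g)."""
    induced_action = inverse_dynamics(z_t, z_next)  # estimate of w_t
    preferred_action = nu(z_t, goal)                # u_t from the locomotion policy
    return q_nu(z_t, induced_action, goal) - q_nu(z_t, preferred_action, goal)


# Hypothetical 1-D stand-ins (assumptions, not the trained models):
def toy_nu(z, goal):
    return goal - z                      # move straight to the goal


def toy_inverse_dynamics(z, z_next):
    return z_next - z                    # displacement explains the transition


def toy_q(z, u, goal):
    return -abs(goal - (z + u))          # remaining distance after acting


# The robot moved the object only a quarter of the way to the goal.
print(sldr(0.0, 0.25, 1.0, toy_q, toy_nu, toy_inverse_dynamics))
# The robot moved the object exactly as the locomotion policy would have.
print(sldr(0.0, 1.0, 1.0, toy_q, toy_nu, toy_inverse_dynamics))
```

In this toy setting the reward is zero when the induced object motion matches the locomotion policy and negative otherwise, mirroring the non-positivity argument above.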

Learning Algorithms
In this subsection, we detail the learning algorithms for the object locomotion and the robotic manipulation policies. Figure 2 shows the block diagrams of the learning procedures.
Object locomotion policy. We learn the object locomotion policy using only the environmental sparse rewards, as described in Algorithm 1. We adopt DDPG (Section 3.2) as the training framework together with HER (Section 3.3) to generate auxiliary transition samples to deal with the exploration difficulty caused by the sparse rewards. $Q_\nu$ is updated to minimise Eq. 3, and $\nu_\theta$ is optimised using the gradient in Eq. 4. Concurrently, we learn $I_\phi$ using the trajectories generated during the policy learning process by minimising the following objective function:
$$ L_\phi = \mathbb{E}_{(g, z_t, z_{t+1}) \sim D_L}\left[ \lVert \nu_\theta(z_t, g) - I_\phi(z_t, z_{t+1}) \rVert^2 \right], \tag{9} $$
where $D_L$ is an experience replay buffer.
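A minimal sketch of the inverse dynamics objective is given below. It assumes, following the description in Section 4.2, that the regression target is the output of the locomotion policy for the stored state and goal; the function names and the tuple-based vector representation are illustrative only.

```python
def inverse_dynamics_loss(batch, nu, inverse_dynamics):
    """Mean squared error between nu(z_t, g) and I(z_t, z_next).

    batch: list of (goal, z_t, z_next), each a tuple of floats.
    """
    total = 0.0
    for goal, z_t, z_next in batch:
        target = nu(z_t, goal)
        prediction = inverse_dynamics(z_t, z_next)
        total += sum((p - u) ** 2 for p, u in zip(prediction, target))
    return total / len(batch)


# Hypothetical 2-D stand-ins for the trained models.
def toy_nu(z, goal):
    return tuple(g_i - z_i for g_i, z_i in zip(goal, z))


def toy_inverse_dynamics(z, z_next):
    return tuple(n_i - z_i for n_i, z_i in zip(z_next, z))


batch = [((1.0, 0.0), (0.0, 0.0), (1.0, 0.0)),   # transition matches the policy
         ((1.0, 0.0), (0.0, 0.0), (0.5, 0.0))]   # object moved only half way
print(inverse_dynamics_loss(batch, toy_nu, toy_inverse_dynamics))
```

In practice this loss would be minimised over the parameters of the inverse dynamics network on mini-batches drawn from the locomotion replay buffer.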
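Robotic manipulation policy. Similarly, we learn the robotic manipulation policy by adopting DDPG with HER, as described in Algorithm 2. Using the optimisation objective given in Eq. 3, we learn two action-value functions: $Q^\mu_r$ for the environmental sparse rewards $r_t$, and $Q^\mu_q$ for the proposed SLDR $q_t^{SLDR}$. Accordingly, $\mu_\theta$ is updated following the gradient below, which uses both action-value functions:
$$ \nabla_\theta J(\mu_\theta) = \mathbb{E}_{(g, s_t) \sim D_M}\left[ \nabla_a \left( Q^\mu_r(s_t, a, g) + Q^\mu_q(s_t, a, g) \right)\Big|_{a = \mu_\theta(s_t, g)} \, \nabla_\theta \mu_\theta(s_t, g) \right], \tag{10} $$
where $D_M$ is an experience replay buffer. Some tasks may include $N > 1$ objects, e.g. stacking. The proposed method is able to handle these tasks by using an individual SLDR for each object and learning an individual $Q^\mu_{q_i}$ for each one of them. Then, the gradient required to update $\mu_\theta$ is:
$$ \nabla_\theta J(\mu_\theta) = \mathbb{E}_{(g, s_t) \sim D_M}\left[ \nabla_a \left( Q^\mu_r(s_t, a, g) + \sum_{i=1}^{N} Q^\mu_{q_i}(s_t, a, g) \right)\Big|_{a = \mu_\theta(s_t, g)} \, \nabla_\theta \mu_\theta(s_t, g) \right]. \tag{11} $$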

Environments
We have evaluated the SLD method on 13 simulated MuJoCo [34] environments using two different robot configurations: 7-DoF Fetch robotic arm with a two-finger parallel gripper and 24-DoF Shadow's Dexterous Hand. The tasks we have chosen to evaluate include single rigid object manipulation, multiple rigid object stacking and non-rigid object manipulation. Overall, we have used 9 MuJoCo environments (3 with Fetch robot arm and 6 with Shadow's hand) for single rigid object tasks. Furthermore, we have included additional environments for multiple object stacking and non-rigid object manipulation using the Fetch robot arm. In all environments the rewards are sparse.
Fetch Arm Single Object Environments. These are the same Push, Slide and PickAndPlace tasks introduced in [59]. In each episode, a desired 3D position (i.e. the target) of the object is randomly generated. The reward is zero if the object is within 5 cm of the target, otherwise −1. The robot [...]
Egg, Block, Pen manipulation. In these tasks, the object (a block, an egg-shaped object, or a pen) is placed on the palm of the robot hand; the robot hand is required to manipulate the object to reach a target pose. The target pose is 7D, describing the 3D position together with the 4D quaternion orientation, and is randomly generated in each episode. The reward is 0 if the object is within some task-specific range of the target, otherwise −1. As in [59], each task has two variants: Full and Rotate. In the Full variant, the object's whole 7D pose is required to meet the given target pose. In the Rotate variant, the 3D object position is ignored and only the 4D object rotation is expected to satisfy the desired target. Robot actions are 20-dimensional, controlling the absolute positions of all non-coupled joints of the hand. The observations include the positions and velocities of all 24 joints of the robot hand, the object's position and rotation, the object's linear and angular velocities, and the target pose. An episode terminates after 100 time-steps.
Fetch Arm Multiple Object Stacking Environments. The stacking task is built upon the PickAndPlace task. We consider 2- and 3-object stacking tasks. For the $N$-object stacking task, the target has $3N$ dimensions describing the desired positions of all $N$ objects in 3D. Following [7], we start these tasks with the first object placed at its desired target. The robot needs to perform $N − 1$ pick-and-place actions without displacing the first object. The reward is zero if all objects are within 5 cm of their designated targets, otherwise the reward is −1. The robot actions and observations are similar to those in the PickAndPlace task. The episode length is 50 time-steps for 2-object stacking and 100 for 3-object stacking.
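The stacking reward differs from the single-object case in that the flat $3N$-dimensional target has to be split into per-object positions and every object must satisfy the threshold at once. A small sketch follows, under the assumptions that the target is stored object by object and that the distance is Euclidean; the helper names are hypothetical.

```python
import math


def split_goal(flat_goal, dim=3):
    """Split a flat 3N-dimensional goal into N per-object 3D targets."""
    return [tuple(flat_goal[i:i + dim]) for i in range(0, len(flat_goal), dim)]


def stacking_reward(object_positions, flat_goal, threshold=0.05):
    """Return 0.0 only if every object is within `threshold` of its target."""
    targets = split_goal(flat_goal)
    all_close = all(math.dist(position, target) <= threshold
                    for position, target in zip(object_positions, targets))
    return 0.0 if all_close else -1.0


# Two-object stacking: the second block must sit 5 cm above the first.
goal = [0.0, 0.0, 0.0, 0.0, 0.0, 0.05]
print(stacking_reward([(0.0, 0.0, 0.0), (0.0, 0.0, 0.05)], goal))
print(stacking_reward([(0.0, 0.0, 0.0), (0.2, 0.0, 0.05)], goal))
```

Because the reward is a conjunction over all objects, displacing an already placed block turns the reward back to −1, which is why these tasks are harder to explore than single-object pick-and-place.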
Fetch Arm Non-rigid Object Environments. We build non-rigid object manipulation tasks based on the PickAndPlace task. Instead of using the original rigid block, we have created a non-rigid object by hinging some blocks side-by-side along their edges as shown in Figure 3. A hinge joint is placed between two neighbouring blocks, allowing one rotational degree of freedom (DoF) along their coincident edges up to 180°. We introduce two different variants: 3-tuple and 5-tuple. For the $N$-tuple task, $N$ cubical blocks are connected with $N − 1$ hinge joints, creating $N − 1$ internal DoF.
The target pose is $3N$-dimensional, describing the desired 3D positions of all $N$ blocks, and is selected uniformly in each episode from a set of predefined target poses (see Figure 3). The robot is required to manipulate the object to match the target pose. The reward is zero when all $N$ blocks are within 2 cm of their corresponding targets, otherwise −1. Robot actions and observations are similar to those in the PickAndPlace tasks, except that the observations include the position, rotation, angular velocity, and relative position and linear velocity to the gripper for each block. The episode length is 50 time-steps for both 3-tuple and 5-tuple.
Object Locomotion Environments. For each robotic manipulation task described above, we use an object locomotion task where we first learn $\nu_\theta$, $Q_\nu$ and $I_\phi$. Here, we detail the observation and action space differences between object locomotion and robotic manipulation tasks.
For any task, the object's observation is a subset of the robot's observation, i.e. $z_t \subset s_t$, and only includes object-related features while excluding those related to the robot. More concretely, for the environments with the Fetch arm, the object's observations include the object's position, rotation, angular velocity, the object's relative position and linear velocity to the target, and the target location. For the environments with the Shadow's hand, the object observations include the object's position and rotation, the object's linear and angular velocities, and the target pose. We define the object action as the desired relative change in the 7D object pose (3D position and 4D quaternion orientation) between two consecutive time-steps. This leads to 7D action spaces. Specifically for non-rigid objects, we define the object action as the desired relative change in the poses of the blocks at the two ends. This leads to 14D action spaces. The rewards are the same as those in each robot manipulation task.
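To make the 7D object action concrete, the sketch below applies a relative pose change to a current object pose. It is only one plausible reading of "relative change": we assume the position part is additive, the rotation part is a quaternion in (w, x, y, z) order composed by left multiplication, and the result is re-normalised. None of these conventions, nor the function names, are confirmed by the text.

```python
import math


def quat_mul(q, r):
    """Hamilton product of two quaternions given in (w, x, y, z) order."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def normalise(q):
    """Scale a quaternion to unit length; fall back to the identity if it is zero."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        return (1.0, 0.0, 0.0, 0.0)
    return tuple(c / norm for c in q)


def apply_object_action(pose, action):
    """Apply a 7D relative action (dx, dy, dz, qw, qx, qy, qz) to a 7D pose.

    Assumption: translation is added, rotation is composed as delta * current.
    """
    position = tuple(p + d for p, d in zip(pose[:3], action[:3]))
    delta_q = normalise(tuple(action[3:]))
    rotation = normalise(quat_mul(delta_q, tuple(pose[3:])))
    return position + rotation


# Move 1 cm along x and rotate a quarter turn about z, starting from the identity pose.
quarter_turn = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
pose = (0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
print(apply_object_action(pose, (0.01, 0.0, 0.0) + quarter_turn))
```

Under the same assumptions, the 14D action for non-rigid objects would amount to two such 7D actions, one for each end block.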
It is worth noting that, in the Full variants of Shadow's hand environments, we consider the object translation and rotation as two individual locomotion tasks, and we learn separate locomotion policies and Q-functions for each task. We find that this strategy encourages the manipulation policy to perform translation and rotation simultaneously. Although object translation and rotation could be executed within a single task, we have empirically found that the resulting manipulation policies tend to prioritise one behaviour over the other (e.g. they tend to rotate the object first, then translate it) and generally achieve lower performance.

Implementation and Training Process
Three-layer neural networks with ReLU activations were used to approximate all policies, action-value functions and inverse dynamics models. The Adam optimiser [60] was employed to train all the neural networks. During the training of locomotion policies, the robot was considered as a non-learning component in the scene and its actions were not restricted to prevent any potential collision with the objects. Several choices are possible for the actions of the robot. For example, we could let the robot move randomly or perform any arbitrary fixed action (e.g. a robot arm moving upwards with constant velocity until it reaches the maximum height and then staying there). In preliminary experiments, we assessed whether this choice has any effect on final performance, and concluded that no particular setting had clear advantages. For learning locomotion and manipulation policies, most of the hyperparameters suggested in the original HER implementation [59] were retained, with only a couple of exceptions for locomotion policies only: to facilitate exploration, with probability 0.2 (0.3 in [59]) a random action was drawn from a uniform distribution, otherwise we retained the current action, and added Gaussian noise with zero mean and 0.05 (0. Our algorithm has been implemented in PyTorch. All the environments are based on OpenAI Gym. The corresponding source code, the environments, and illustrative videos for selected tasks have been made publicly available.
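The exploration rule used for the locomotion policies can be sketched as follows. This is an illustrative example rather than the released code: we assume that the value 0.05 is the standard deviation of the Gaussian noise expressed as a fraction of the maximum action magnitude, that actions are bounded in [−max_u, max_u], and that noisy actions are clipped to that range. The function and parameter names are hypothetical.

```python
import random


def explore_action(policy_action, random_eps=0.2, noise_std=0.05, max_u=1.0, rng=random):
    """Return an exploratory action for one time-step.

    With probability `random_eps` a uniformly random action is drawn; otherwise
    zero-mean Gaussian noise is added to the policy's action.
    Assumptions: `noise_std` is relative to `max_u`, and the result is clipped.
    """
    if rng.random() < random_eps:
        return [rng.uniform(-max_u, max_u) for _ in policy_action]
    noisy = [a + rng.gauss(0.0, noise_std * max_u) for a in policy_action]
    return [min(max(a, -max_u), max_u) for a in noisy]


# Example with a fixed seed so that repeated runs give the same exploratory action.
rng = random.Random(0)
print(explore_action([0.5, -0.5, 0.0, 0.2], rng=rng))
```

Passing a seeded `random.Random` instance as `rng` keeps rollouts reproducible, which is convenient when comparing runs across random seeds.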

Comparison and Performance Evaluation
We include the following methods for comparison:
• DDPG-Sparse: Refers to DDPG [37] using sparse rewards.
• HER-Sparse: Refers to DDPG with HER [9] using sparse rewards.
We use DDPG-Sparse, HER-Sparse and HER-Dense as baselines. HER-Sparse+RNDR is a representative method constructing auxiliary rewards to facilitate policy learning. CHER-Sparse replaces the random selection mechanism of HER with an adaptive one that considers the proximity to true goals. DDPG-Sparse+SLDR and HER-Sparse+SLDR represent the proposed approach using SLDR with different methods for policy learning.
Following [59], we evaluate the performance after each training epoch by performing 10 deterministic test rollouts for each one of the 38 MPI workers. Then we compute the test success rate by averaging across the 380 test rollouts. For all comparison methods, we evaluate the performance with 5 different random seeds and report the median test success rate with the interquartile range. In all environments, we also keep the models with the highest test success rate for different methods and compare their performance.
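The evaluation protocol above reduces to two aggregation steps: averaging the 380 rollout outcomes within an epoch, and summarising the resulting success rates across the 5 random seeds. The sketch below illustrates both using only the Python standard library; the function names are hypothetical, and the quartile interpolation method ("inclusive") is an assumption, since the text does not state how the interquartile range is computed.

```python
import statistics


def epoch_success_rate(worker_results):
    """Average binary rollout outcomes over all workers.

    `worker_results` holds one list per worker, each with one 0/1 entry per rollout.
    """
    outcomes = [outcome for worker in worker_results for outcome in worker]
    return sum(outcomes) / len(outcomes)


def summarise_across_seeds(success_rates):
    """Return the median and the (Q1, Q3) interquartile bounds across seeds.

    Assumption: quartiles use the 'inclusive' interpolation method.
    """
    q1, median, q3 = statistics.quantiles(success_rates, n=4, method="inclusive")
    return median, (q1, q3)


# 38 workers with 10 test rollouts each; here every worker succeeds on half of them.
results = [[1] * 5 + [0] * 5 for _ in range(38)]
print(epoch_success_rate(results))  # prints 0.5

# Made-up success rates for 5 seeds at one epoch.
print(summarise_across_seeds([0.1, 0.5, 0.3, 0.2, 0.4]))
```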

Single Rigid Object Environments
The learning curves for the Fetch environments and for the Rotate and Full variants of the Shadow's hand environments are reported in Figure 4a, Figure 5a and Figure 5b, respectively. We find that HER-Sparse+SLDR learns faster and achieves the best performance on all the tasks. This evidence demonstrates that SLDR, coupled with DDPG and HER, can facilitate policy learning with sparse rewards. The benefits introduced by HER-Sparse+SLDR are particularly evident in the hand manipulation tasks (Figure 5a and Figure 5b), which are notoriously more complex to solve than the Fetch robot tasks (Figure 4a). Additionally, we find that HER-Sparse+SLDR outperforms HER-Sparse+RNDR in most tasks. A possible reason for this result is that most methods using auxiliary rewards are based on the notion of curiosity, whereby reaching unseen states is a preferable strategy, which is less suitable for manipulation tasks [9]. In contrast, the proposed method exploits a notion of desired object locomotion to guide the main policy during training. We also observe that DDPG-Sparse+SLDR fails for most tasks. A possible reason for this is that, despite its effectiveness, the proposed approach still requires a suitable RL algorithm to learn from SLDR together with sparse environmental rewards. DDPG on its own is less effective for this task. We find that HER-Dense performs worse than HER-Sparse. This result supports previous observations that sparse rewards may be more beneficial for complex robot manipulation tasks compared to dense rewards [9,59]. Finally, we observe that CHER-Sparse fails in most tasks and cannot facilitate successful learning. This is somewhat expected given our particular set-up, and a possible explanation is in order. Sampling the replay buffer based on the proximity to true goals may work well for locomotion tasks because the distance between the robot gripper and the target is taken into account, and this distance is under direct control of the robot from the very first episode.
On the other hand, in the manipulation tasks, the distance between the object and the target stays roughly constant in the early training episodes as the robot has not yet learned to interact with the object. Such a sampling technique prioritising the replays depending on proximity may produce biased batches that can potentially disrupt the learning process. For example, a random robot action causing the object to move away from the target would favour trajectories characterised by a lack of interaction between the robot and the object. Although we report some success on EggRotate, BlockRotate and PenRotate using CHER, this is much lower than the success observed when using HER-Sparse+SLDR and HER-Sparse.

Fetch Arm Multiple Object Environments
For environments with N objects, we reuse the locomotion policies trained on the PickAndPlace task with single objects, and obtain an individual SLDR for each of the N objects. We train N + 1 action-value functions in total, i.e. one for each SLDR and one for the environmental sparse rewards. The manipulation policy is trained using the gradient in Eq. 11.
Inspired by [59], we randomly select between two initialisation settings for the training: (1) the targets are distributed on the table (i.e. an auxiliary task) and (2) the targets are stacked on top of each other (i.e. the original stacking task). Each initialisation setting is randomly selected with a probability of 0.5. We have observed that this initialisation strategy helps HER-based methods complete the stacking tasks. From Figure 4b, we find that HER-Sparse+SLDR achieves better performance compared to HER-Sparse, HER-Sparse+RND and HER-Dense in the 2-object stacking task (Stack2), while other methods fail. On the more complex 3-object stacking task (Stack3), HER-Sparse+SLDR is the only algorithm to succeed. HER-Sparse+RND occasionally solves the Stack3 task with fixed random seeds but the performance is unstable across different random seeds and multiple runs.
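The random choice between the two initialisation settings can be sketched as below. This is a hedged illustration only: the block height, the sampling range on the table, the coordinates and all names are hypothetical, and the sketch makes no attempt to keep table targets from overlapping.

```python
import random


def sample_stack_targets(n_objects, base_xy, table_z, block_height=0.05,
                         xy_range=0.15, stack_prob=0.5, rng=random):
    """Return (mode, targets), where targets is a list of n (x, y, z) tuples.

    With probability `stack_prob` the targets form a vertical stack (the original
    task); otherwise they are spread over the table (the auxiliary task).
    """
    if rng.random() < stack_prob:
        # Original stacking task: same (x, y), each target one block higher.
        targets = [(base_xy[0], base_xy[1], table_z + i * block_height)
                   for i in range(n_objects)]
        return "stacked", targets
    # Auxiliary task: targets scattered at table height around the base location.
    targets = [(base_xy[0] + rng.uniform(-xy_range, xy_range),
                base_xy[1] + rng.uniform(-xy_range, xy_range),
                table_z)
               for _ in range(n_objects)]
    return "table", targets


# Example for 3-object stacking with made-up table coordinates.
mode, targets = sample_stack_targets(3, base_xy=(1.3, 0.75), table_z=0.425,
                                     rng=random.Random(0))
print(mode, targets)
```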

Fetch Arm Non-Rigid Object Environments
The learning curves for 3-tuple and 5-tuple non-rigid object tasks are reported in Figure 4c. Similarly to the multiple object environment, HER-Sparse+SLDR achieves better performance for the 3-tuple task compared to HER-Sparse and HER-Sparse+RND, while the other methods fail to complete the task. For the more complex 5-tuple task, only HER-Sparse+SLDR is able to succeed. Among the 4 pre-defined targets depicted in Figure 3, HER-Sparse+SLDR can achieve 3 targets on average, and can accomplish all 4 targets in one instance, out of 5 runs with different random seeds.

Figure 6: Comparison of models with the best test success rate for all methods on all the environments.

Comparison Across the Best Models
Figure 6 summarises the performance of the models with the best test success rates for each one of the competing methods. We can see that the proposed HER-Sparse+SLDR achieves top performance compared to all other methods. Specifically, HER-Sparse+SLDR is the only algorithm that is able to steadily solve 3-object stacking (Stack3) and 5-tuple non-rigid object manipulation (5-tuple). Remarkably, these two tasks have the highest complexity among all the 13 tasks. The Stack3 task includes multiple stages that require the robot to pick and place multiple objects with different source and target locations in a fixed order; in the 5-tuple task the object has the most complex dynamics. For these complex tasks, the proposed SLDR seems to be particularly beneficial. A possible reason is that, although the task is very complex, the objects are still able to learn good locomotion policies (see Figure 7a) and the rewards learnt from locomotion policies provide critical feedback on how the object should be manipulated to complete the task. This type of object-based feedback is not utilised by other methods like HER and HER+RND. Our approach outperforms the runner-up by a large margin in the Full variants of Shadow's hand manipulation tasks (EggFull, BlockFull and PenFull), which feature complex state/action spaces and system dynamics. Finally, the proposed method consistently achieves performance better than or similar to that of the runner-up in the other, simpler tasks.

Conclusion and Discussion
In this paper, we address the problem of mastering robot manipulation through deep reinforcement learning using only sparse rewards. The rationale for the proposed methodology is that robot manipulation tasks can be seen as inducing object locomotion. Based on this observation, we propose to first model the objects as independent entities that need to learn an optimal locomotion policy through interactions with a realistically simulated environment, and then to leverage these policies to improve the manipulation learning phase.
We believe that using SLDRs introduces significant advantages. First, SLDRs are generated artificially through an RL policy, and hence require no human effort. Producing human demonstrations for complex tasks may prove difficult and/or costly without significant investments in human resources. For instance, it may be particularly difficult for a human to generate good demonstrations for tasks such as manipulating non-rigid objects with a single hand or with a robotic gripper. On the other hand, we have demonstrated that the locomotion policies can be easily learnt, even for complex tasks, purely in a virtual environment; e.g., in our studies, these policies have achieved a 100% success rate on all tasks (see Figure 7a and Figure 7b). Furthermore, since the locomotion policy is learnt through RL, our proposed approach does not require task-specific domain knowledge and can be designed using only sparse rewards. Training the locomotion policies requires only the same sparse rewards provided by the environment; hence the SLDRs produced through RL lead to high-quality manipulation policies. This point has been supported by the empirical evidence obtained through experiments involving all 13 environments presented in this paper. As commonly observed in deep RL approaches, the use of neural networks as function approximators for policies and inverse dynamics functions may introduce convergence issues and lead to non-optimal policies, but despite these limitations the proposed methodology has proved to be sufficiently reliable and competitive. The proposed approach is orthogonal to existing methods that use expert demonstrations, and combining them would be an interesting direction to explore in the future.
The performance of the proposed framework has been thoroughly examined on 13 robot manipulation environments of increasing complexity. These studies demonstrate that faster learning and higher success rates can be achieved through SLDRs compared to existing methods. In our experiments, SLDRs have enabled the robots to solve complex tasks, such as stacking 3 objects and manipulating a 5-tuple non-rigid object, whereas competing methods have failed. Remarkably, we have been able to outperform runner-up methods by a significant margin for complex Shadow's hand manipulation tasks. Although SLDRs are obtained using a physics engine, this requirement does not restrict the applicability of the proposed approach to situations where the manipulation is learnt using a real robot, as long as the locomotion policy can be pre-learnt realistically.
Several aspects will be investigated in follow-up work. We have noticed that when the interaction between the manipulating robot and the objects is very complex, the manipulation policy may be difficult to learn despite the fact that the locomotion policy is successfully learnt. For instance, in the case of the 5-tuple task with the Fetch arm, although the locomotion policy achieves a 100% success rate (as shown in Figure 7a), the manipulation policy does not always complete the task (as shown in Figure 4c and Figure 6). In such cases, when the ideal object locomotion depends heavily on the robot, the benefit of the SLDRs is reduced. Another limitation is given by our Assumption 2 (Section 4.2), which may not hold for some tasks. For example, for pen manipulation tasks with Shadow's hand, although the pen can rotate and translate itself to complete locomotion tasks (as shown in Figure 7b), it is difficult for the robot to reproduce the same locomotion without dropping the pen. This issue can degrade the performance of the manipulation policy despite having obtained an optimal locomotion policy (see Figure 5a, Figure 5b and Figure 6). A possible solution would be to train the manipulation policy and locomotion policy jointly, and check whether the robot can reproduce the object locomotion suggested by the locomotion policy; a notion of "reachability" of object locomotion could be used to regularise the locomotion policy and enforce $P(z_t \mid \mu_\theta) \overset{d}{=} P(z_t \mid \nu_\theta)$.
An important aspect to bear in mind is that our methodology requires the availability of a simulated environment for the application at hand. Nowadays, due to the well-documented sample inefficiency of most state-of-the-art, model-free DRL algorithms, such simulators are commonly used for training RL policies before deployment in the real world. Besides, creating physically realistic environments from existing 3D models using modern tools has become almost effortless. In this sense, the approach proposed in this work requires only a marginal amount of additional engineering once a simulator has been developed. For instance, using MuJoCo, setting up the object locomotion policies would only entail the removal of the robot from the environment and inclusion of the objects as "mocap" entities. In comparison with other approaches, such as those relying on human demonstrations, the additional effort required to enable SLDR is only minimal.
In this paper we have adopted DDPG as the main training algorithm due to its widely reported effectiveness in continuous control tasks. However, our framework is sufficiently general, and other algorithms, such as trust region policy optimisation (TRPO) [40], proximal policy optimisation (PPO) [62] and soft actor-critic [63], may also be suitable; analogously, model-based methods [64,16] could provide feasible alternatives to be explored in future work.