Introduction

Humans master increasingly complex manipulation behaviors, gradually progressing from lengthy explorations with primitive movements to exploiting high-level actions. For example, in infancy, we typically go from only being able to fiddle with a toy to learning to grasp it directly. Then, as children, by combining high-level and low-level actions such as push, pull, and grasp, we can accomplish more practical, efficient, and goal-oriented manipulation tasks, such as sorting a toy box. Humans naturally acquire such high-level manipulation skills by constantly observing, learning, and reproducing behaviors from interactions with the real world. It would be exciting for robots to learn to interact with objects as humans do, particularly to incorporate high-level actions when manipulating objects with limited prior knowledge. However, learning robot manipulation skills in a real-world environment, particularly for both low-level and high-level actions, is challenging.

Recently, visual foresight [12] has been widely demonstrated as a promising tool for learning vision-based robot manipulation in unknown environments from the standpoint of sensory prediction. More concretely, this line of work [7, 9, 39] is mainly built on a deep visual predictive model trained on high-dimensional visual streams to learn real-world dynamics. The learning of visual predictive models is typically task-independent [20, 31] and can therefore be generalized over different tasks. Even though promising results have been achieved, whether in the original visual foresight [12] or its follow-up works [7, 9, 39], robot actions in this paradigm are usually prescribed to be low level, such as small differential displacements of the robot end-effector. Take the example in Fig. 1: even to relocate a sponge over quite a short distance, a robot using visual foresight typically has to apply a fairly long sequence of low-level displacements, while a human expert can come up with more efficient solutions using higher level manipulation actions, such as simply picking (grasping) the sponge up and placing it at the target location directly. Moreover, such low-level actions are babbling-like, not only in the training data, e.g., RoboNet [3], but also in the short horizon of actions planned through the learned predictive model using model predictive control (MPC). Such a planning framework is confronted with rapidly increasing complexity for tasks requiring long sequences of low-level displacements of the robot end-effector.

Fig. 1

Manipulation planning with visual inputs and high-level actions. Top: during training, a visual predictive model is trained on a dataset of executed pick-and-place actions (e.g., grasping a toy and moving it to a specific location). Bottom: the visual predictive model is used to optimize appropriate actions for specific manipulation tasks (e.g., relocating a sponge in the workspace)

To address these limitations, we go deeper into visual foresight by improving the model’s understanding of high-level actions in robot manipulation. The underlying intuition is to train a deep visual predictive model that can learn world dynamics under high-level robot actions and ultimately use it to determine appropriate high-level actions in task planning. However, training such a predictive model remains a challenge. High-level actions usually contain rich semantic information and cues that are not present in low-level actions, which poses two questions. First, how can the robot learn a visual predictive model by leveraging the semantic information in high-level actions? Second, how can the robot learn high-level actions while still retaining its understanding of low-level actions? Learning semantic information from low-dimensional task representations, such as object bounding boxes and semantic segmentation [34], usually relies on ground-truth annotations, which can hardly be obtained when learning from a large number of raw visual observations. Instead, we incorporate semantic information into a sequence of visual frames obtained under high-level robot actions and build a recurrent neural network to learn such information implicitly. This allows our model to learn both high-level and low-level robot actions from the intervals between consecutive frames. The main contributions of this paper are summarized as follows:

  • We propose a novel visual predictive model for high-level robot actions, which combines an action decomposer and a video prediction model.

  • We present a sampling-based optimization method that utilizes this visual predictive model to plan high-level pick-and-place actions in real robot tasks.

  • We contribute a novel vision dataset that contains a rich set of real robot pick-and-place actions.

We evaluate our method in terms of the accuracy of the predicted outcomes of high-level actions and the overall performance of using the predictive model in real robot downstream tasks. The results demonstrate that our approach effectively learns to understand high-level robot actions and can be used for planning in real robot manipulation tasks. A video summary of this paper and more experimental results can be found at https://youtu.be/JOgjovETlVg.

Related work

Model-based reinforcement learning

The main difference between model-based reinforcement learning (RL) and model-free RL is that model-based RL employs world models that learn transition dynamics. Model-based methods are usually more data-efficient than model-free methods [6] and require fewer reward signals during training. This can significantly reduce the robot–environment interaction needed for learning, which is often expensive and dangerous for robots. Model-based RL in robotics [5, 26] has attracted many studies in the last decade and has shown great success in low-dimensional environments [15, 17]. Recently, a line of literature called visual foresight [7, 9, 12] has proposed leveraging raw visual inputs directly in the model-based context. In visual foresight [12], a predictive model is trained to learn the concepts of robot actions by accurately predicting visual outcomes from both current visual observations and robot actions. Furthermore, the predictive model is task-agnostic, allowing it to generalize over various tasks. Such an approach has proven robust in processing real-world visual inputs and has demonstrated promising performance in real robot tasks. However, the focus of vanilla visual foresight [12] and its follow-up works [7, 9] is to leverage only low-level robot actions in prediction and planning. In contrast, our method learns a visual predictive model conditional on higher level robot actions, which can be used for more complex and efficient action planning. Hafner et al. [20] proposed a model-based approach that learns dynamics directly from pixels but plans actions in a latent space. Their approach has shown great success on longer horizon tasks at a larger scale in simulated environments. However, it still requires some labeled data for training a reward function. In comparison, we focus on learning robot actions from only raw visual streams rich in real-world visual complexity.

Video prediction model

Recently, deep neural networks have made great progress in representing high-dimensional states and observations. Video prediction models have become a powerful tool for learning world dynamics in various domains, including autonomous driving [16, 29], human posture estimation [8, 23], and robotic manipulation [10]. These models learn from a large amount of unlabeled data in a self-supervised manner by using vision as a supervisory signal. Beyond earlier deterministic models [2, 11, 35], VAE-based [24] models [1, 7, 25, 40] employ a latent space to capture the stochasticity of the real environment. To model time-varying stochasticity, Denton et al. [7] proposed using a learned prior in the stochastic model. Action-conditional video prediction models have been used to learn robot actions, making them well suited to the robot context.

Robot manipulations

Pick-and-place is a widespread robot manipulation action in various robotics applications, including industrial [41] and domestic [37] settings. Traditionally, this problem has been studied through analytical estimation of object poses [32] and dynamic motion planning [13]; both require object models and are unsuitable for unstructured environments. In recent years, data-driven methods for learning pick-and-place actions have gained significant attention in robotics, with both model-free [18] and model-based [14] techniques. Several works use learned geometric models [38, 44] to estimate object poses and infer actions; however, these methods still require object models during training. End-to-end models [33, 42] have the advantage of being agnostic to the objects’ physics: they can directly infer pick-and-place actions from pixels. An instance is the Transporter network [42], which utilizes a simple model architecture that exploits spatial symmetries to effectively learn to plan pick-and-place actions from visual inputs. However, this line of methods tends to be task-specific and relies on task-specific demonstrations, which prevents zero-shot generalization to new tasks.

Another sub-field of pick-and-place is predicting the probability of a successful pick. For instance, Dex-Net [28] uses a grasp quality convolutional neural network (GQ-CNN) to estimate optimal picking poses from a depth image. However, it does not consider task-related objectives, such as which object should be picked and where it should be placed.

Problem formulation

Fig. 2

Left: each pick-and-place action is formulated in SE(2) on the xy plane of the robot workspace, and the coordinate z is determined by the depth w.r.t. the plane. Right: the visual observations are acquired from an RGB camera above the workspace

Fig. 3

The sequential motion primitives of a pick-and-place action: the robot moves the gripper to a point above the pick position, lowers it to the height of the pick position, grasps the object with the gripper, lifts the gripper back up, moves it above the place location, lowers it to the height of the place location, and finally releases the gripper

In this work, our objective is to learn high-level robot actions through a visual predictive model and ultimately enable their use in robot manipulation planning. To this end, we formulate the completion of a manipulation task as one or a sequence of high-level actions. We define each high-level action as a pick-and-place action that grasps an object from above at a pick position and releases the gripper at a place position. The pose of the robot gripper at both the pick and the place is composed of a 3-D coordinate \(xyz\) and a yaw rotation \(\theta \) in the robot base frame. As shown in Fig. 2 (left), we parameterize the high-level action as \(\textbf{a}^{(high)} = (\mathcal {P}_\text {pick}, \mathcal {P}_\text {place}) \in \mathcal {A}\), where \(\mathcal {P}_\text {pick}\) and \(\mathcal {P}_\text {place}\) are the picking and placing poses defined in SE(2) on the xy plane of the robot base coordinate system; z is determined as the vertical depth w.r.t. the horizontal plane. An RGB camera is used to acquire visual observations of the workspace [Fig. 2 (right)].
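For concreteness, the following is a minimal sketch of this action parameterization in Python; the type and field names are illustrative assumptions, not taken from our implementation.

```python
from dataclasses import dataclass

@dataclass
class Pose2D:
    """A gripper pose on the workspace plane: SE(2) position plus yaw."""
    x: float      # position along the robot base x-axis (m)
    y: float      # position along the robot base y-axis (m)
    theta: float  # yaw rotation about the vertical axis (rad)

@dataclass
class PickPlaceAction:
    """High-level action a^(high) = (P_pick, P_place). The z coordinate
    of each pose is not stored here; it is recovered at execution time
    from the depth map of the workspace."""
    pick: Pose2D
    place: Pose2D
```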

Method

This section describes our approach to learning high-level robot actions by integrating a decomposer with a video prediction model and then applying the learned model to desired tasks via sampling-based optimization. An illustration of our method is shown in Fig. 5.

Visual predictive model of high-level actions

We use the notation \(\mathcal {M}: \{\textbf{I}_{\text {init}}, \textbf{a}^{(high)}\} \rightarrow \hat{\textbf{I}}\) to refer to a visual predictive model, where \(\textbf{I}_{\text {init}}\) is the initial visual observation and \(\hat{\textbf{I}}\) is the predicted visual outcome of a pick-and-place action \(\textbf{a}^{(high)} = (\mathcal {P}_\text {pick}, \mathcal {P}_\text {place})\). Model \(\mathcal {M}\) learns to understand high-level robot actions by being trained to predict visual outcomes. It is worth emphasizing that high-level actions like pick-and-place contain semantic information. For example, as shown in Fig. 3, the robot executes a pick-and-place action through several semantic steps: it moves the gripper to a point above the pick location, lowers it to the height of the pick, closes the gripper to grasp, lifts the gripper back up, moves it above the place location, lowers it to the height of the place, and releases the gripper. However, such semantic information cannot be captured by the action formulation \(\mathcal {P}_\text {pick}\) and \(\mathcal {P}_\text {place}\) alone.

To leverage such semantic information in the prediction, we propose combining a video prediction model with a high-level action decomposer. The decomposer converts a high-level action into a sequence of intermediate low-level actions. We thus incorporate the semantic information through the resulting intermediate visual frames and low-level actions. In the literature on robot manipulation [9], video prediction models typically predict visual frames autoregressively conditional on a sequence of low-level actions, i.e., the displacements of the robot end-effector. The advantages of using these two components together are twofold: (1) the semantic information is still retained in the decomposed low-level action sequence; (2) the model can learn both low-level and high-level actions concurrently.

Specifically, we decompose the high-level action into a sequence of low-level actions with \(\textbf{a}^{(high)} \rightarrow \{\textbf{s}_0, \textbf{a}^{(low)}_0, \textbf{a}^{(low)}_1, \ldots , \textbf{a}^{(low)}_{T-1}\}\), where \(\textbf{s}_0\) denotes the robot’s initial state, which includes the end-effector’s pose \((x, y, z, \theta )\) and a binary scalar (open vs. closed) for the gripper. \(\textbf{a}^{(low)}_t\) is the intermediate low-level action between two successive frames at time t, consisting of the end-effector’s displacement \((\Delta x, \Delta y, \Delta z, \Delta \theta )\) and the binary gripper scalar. \(\textbf{I}_0\) is the initial frame and \(\textbf{I}_t\) is the resulting frame of action \(\textbf{a}^{(low)}_{t-1}\); T is the length of the low-level action sequence.
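To illustrate, the sketch below (continuing the PickPlaceAction sketch from the problem formulation) shows how such a decomposer might expand a pick-and-place action into the motion primitives of Fig. 3; the waypoint heights and the number of interpolation steps per phase are assumptions for illustration, not our exact implementation.

```python
import numpy as np

def decompose(action, z_pick, z_place, z_up=0.30, k=3):
    """Expand a pick-and-place action into low-level displacements
    (dx, dy, dz, dtheta, g), with gripper command g in {0: open,
    1: closed}, by linearly interpolating between the waypoints of
    the motion primitives in Fig. 3 (heights and k are assumptions).
    """
    p, q = action.pick, action.place
    # Each waypoint: (x, y, z, theta, gripper command while moving there).
    waypoints = [
        (p.x, p.y, z_up,    p.theta, 0),  # above the pick position
        (p.x, p.y, z_pick,  p.theta, 0),  # descend to the pick height
        (p.x, p.y, z_pick,  p.theta, 1),  # close the gripper (grasp)
        (p.x, p.y, z_up,    p.theta, 1),  # lift the object back up
        (q.x, q.y, z_up,    q.theta, 1),  # move above the place position
        (q.x, q.y, z_place, q.theta, 1),  # descend to the place height
        (q.x, q.y, z_place, q.theta, 0),  # open the gripper (release)
    ]
    s0 = waypoints[0]                     # initial robot state s_0
    poses = [np.array(w[:4], dtype=float) for w in waypoints]
    low_level = []
    for i in range(1, len(poses)):
        delta = (poses[i] - poses[i - 1]) / k
        for _ in range(k):                # k small displacements per phase
            low_level.append((*delta, waypoints[i][4]))
    return s0, low_level                  # T = k * (len(waypoints) - 1)
```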

Fig. 4

A probabilistic video prediction model conditional on robot actions. The high-level action is decomposed into a sequence of low-level displacements \(\textbf{a}^{low}_{0}, \textbf{a}^{low}_{1}, \ldots , \textbf{a}^{low}_{T-1}\) and an initial state \(\textbf{s}_{0}\) of the robot’s end-effector

Figure 4 shows a schematic of the video prediction model. At each step t, the model takes the observation \(\textbf{I}_{t}\) and action \(\textbf{a}^{(low)}_t\) as input and generates the next predicted frame \(\hat{\textbf{I}}_{t+1}\). By performing this prediction procedure autoregressively, i.e., using the predicted frame \(\hat{\textbf{I}}_{t+1}\) as the input for the next time step, we can predict the last frame of the low-level action sequence conditional on an initial frame.
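A minimal sketch of this autoregressive loop, assuming a hypothetical `model.reset`/`model.step` interface for the recurrent prediction network:

```python
def rollout(model, I0, s0, low_level_actions):
    """Autoregressive prediction: each predicted frame is fed back as
    the observation for the next step. `model.reset` and `model.step`
    are assumed interfaces, not part of a specific library."""
    model.reset(I0, s0)               # initialize the recurrent state
    frame, frames = I0, [I0]
    for a in low_level_actions:
        frame = model.step(frame, a)  # predict I_{t+1} from (I_t, a_t)
        frames.append(frame)
    return frames                     # [I_0, I_1_hat, ..., I_T_hat]
```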

To train the model, we gather a dataset \(\mathcal {D} = \{\xi _i\}_{i=1}^N\) of N high-level robot actions, where each example \(\xi _i\) consists of a pick-and-place action \((\mathcal {P}_\text {pick}, \mathcal {P}_\text {place})\), its low-level decomposition \(\{\textbf{s}_0, \textbf{a}^{(low)}_0, \textbf{a}^{(low)}_1, \ldots , \textbf{a}^{(low)}_{T-1}\}\), and the corresponding visual frames \(\{\textbf{I}_0, \textbf{I}_1, \ldots , \textbf{I}_T\}\).

Fig. 5

An overview of our method, including training a visual predictive model for high-level pick-and-place actions and using it with sample-based optimization for action planning

Variational video prediction model

Variational auto-encoders (VAE) [24] have been widely used in video prediction models. Following the action-conditional video prediction paradigm, the prediction models typically take c initial frames \(\{\textbf{I}_0, \textbf{I}_1, \ldots , \textbf{I}_{c-1}\}\) and a sequence of actions \(\{\textbf{a}_0, \textbf{a}_1, \ldots , \textbf{a}_{T-1}\}\) as inputs and predict the subsequent future frames \(\{\textbf{I}_{c}, \textbf{I}_{c+1}, \ldots , \textbf{I}_T\}\). VAEs introduce latent variables \(\textbf{z}\sim p(\textbf{z})\) to capture the stochastic nature of the real world. Thus, we can build a probabilistic model \(p_\theta (\textbf{I}_{t} \vert \textbf{I}_{0:t-1}, \textbf{a}_{0:t-1}, \textbf{z}_{1:t})\) that predicts the frame \(\hat{\textbf{I}}_t\) conditioned on the previous frames \(\textbf{I}_{0:t-1}\), the actions \(\textbf{a}_{0:t-1}\), and the latent variables \(\textbf{z}_{1:t}\). Since estimating the marginalized distribution over the latent space \(\textbf{z}\) is intractable, it is not possible to directly maximize \(p_\theta (\textbf{I}_{t})\). To overcome this challenge, VAEs employ an inference network \(q_\phi (\textbf{z}_t \vert \textbf{I}_{0:t}, \textbf{a}_{0:t-1})\) to approximate the true posterior distribution of the latent variables \(\textbf{z}\). This posterior inference network is typically parameterized as a conditional Gaussian distribution \(\mathcal {N}(\mu _\phi (\textbf{I}_{0:t}, \textbf{a}_{0:t-1}), \sigma _\phi (\textbf{I}_{0:t}, \textbf{a}_{0:t-1}))\).

By utilizing the reparameterization strategy [24]

$$\begin{aligned} \textbf{z}= \mu _\phi (\textbf{I}_{0:t}, \textbf{a}_{0:t - 1}) + \sigma _\phi (\textbf{I}_{0:t}, \textbf{a}_{0:t - 1}) \times \epsilon , \quad \epsilon \sim \mathcal {N}(\textbf{0}, \textbf{1}), \end{aligned}$$
(1)

the model can be trained by optimizing the variational lower bound of the log-likelihood

$$\begin{aligned} \mathcal {L}_{\theta ,\phi }(\textbf{I}_{c:T}) = \sum _{t=c}^{T} \Big [ \mathbb {E}_{q_{\phi }(\textbf{z}_{t} \vert \textbf{I}_{0:t}, \textbf{a}_{0:t - 1})} \log p_\theta (\textbf{I}_{t} \vert \textbf{I}_{0:t - 1}, \textbf{a}_{0:t - 1}, \textbf{z}_{0:t}) - \beta D_{KL} \big ( q_{\phi }(\textbf{z}_{t} \vert \textbf{I}_{0:t}, \textbf{a}_{0:t - 1}) \, \Vert \, p(\textbf{z}_{t}) \big ) \Big ]. \end{aligned}$$
(2)

To better capture time-varying stochastic information, Denton et al. [7] propose a learned prior \(p_\psi (\textbf{z}_{t} \vert \textbf{I}_{0:t-1}, \textbf{a}_{0:t-1})\). This prior can also be parameterized as a conditional Gaussian distribution \(\mathcal {N}(\mu _\psi (\textbf{I}_{0:t-1}, \textbf{a}_{0:t-1}), \sigma _\psi (\textbf{I}_{0:t-1}, \textbf{a}_{0:t-1}))\).

The complete model is trained by maximizing

$$\begin{aligned} \mathcal {L}_{\theta ,\phi ,\psi }(\textbf{I}_{c:T}) = \sum _{t=c}^{T} \Big [ \mathbb {E}_{q_{\phi }(\textbf{z}_{t} \vert \textbf{I}_{0:t}, \textbf{a}_{0:t - 1})} \log p_\theta (\textbf{I}_{t} \vert \textbf{I}_{0:t - 1}, \textbf{a}_{0:t - 1}, \textbf{z}_{0:t}) - \beta D_{KL} \big ( q_{\phi }(\textbf{z}_{t} \vert \textbf{I}_{0:t}, \textbf{a}_{0:t - 1}) \, \Vert \, p_\psi (\textbf{z}_{t} \vert \textbf{I}_{0:t-1}, \textbf{a}_{0:t - 1}) \big ) \Big ], \end{aligned}$$

(3)

where \(\theta \), \(\phi \), and \(\psi \) are the parameters of the generative network, posterior network, and prior network, respectively. \(D_{KL}\) is the Kullback–Leibler divergence between the approximated posterior and the learned prior. \(\beta \) is a hyper-parameter representing the trade-off between minimizing the prediction error and fitting the prior. During training, the latent variables \(\textbf{z}_{t}\) are sampled from the posterior \(q_{\phi }(\textbf{z}_t)\). During testing, we directly sample \(\textbf{z}_t\) from the learned-prior \(p_{\psi }(\textbf{z}_t)\).

The implementation of this model comprises an encoder, a decoder, a prior network, a prediction network, and a posterior network. The encoder and decoder are deep convolutional neural networks that map pixels to the latent space and back, respectively. The prior, prediction, and posterior networks are convolutional LSTM networks for learning long-term dependencies.
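The sketch below illustrates one training step of this objective in PyTorch; all module names (`encoder`, `posterior`, `prior`, `predictor`, `decoder`) are illustrative stand-ins, the history conditioning is folded into a recurrent feature, and the MSE term stands in for the Gaussian log-likelihood, so this is a simplified sketch rather than our exact architecture.

```python
import torch
import torch.nn.functional as F

def kl_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ) for
    diagonal Gaussians, summed over the latent dimension."""
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)

def training_step(frames, actions, encoder, posterior, prior,
                  predictor, decoder, beta, c=2):
    """One step of Eq. (3): reconstruction plus beta-weighted KL between
    the approximate posterior and the learned prior. `frames` is
    (B, T+1, C, H, W) and `actions` is (B, T, action_dim)."""
    loss = 0.0
    h = encoder(frames[:, c - 1])           # features carrying the history
    for t in range(c, frames.size(1)):
        target = encoder(frames[:, t])      # features of the true frame I_t
        mu_q, logvar_q = posterior(target, actions[:, t - 1])
        mu_p, logvar_p = prior(h, actions[:, t - 1])
        z = mu_q + (0.5 * logvar_q).exp() * torch.randn_like(mu_q)  # Eq. (1)
        h = predictor(h, z, actions[:, t - 1])
        loss = loss + F.mse_loss(decoder(h), frames[:, t]) \
                    + beta * kl_gaussians(mu_q, logvar_q,
                                          mu_p, logvar_p).mean()
    return loss
```

At test time, `z` would instead be sampled from the learned prior, matching the description above.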

Action planner with visual predictive model

The objective of a robot manipulation planner is to find one or a sequence of pick-and-place action(s) that maximizes the probability of achieving the given goal frame \(\textbf{I}_{\text {goal}}\). We evaluate a pick-and-place action by computing the \(\ell _{1}\) cost between its predicted frame and the goal frame. We then optimize the pick-and-place actions using a sample-based algorithm, the cross-entropy method (CEM).

Algorithm 1

The procedure is shown in Algorithm 1. Concretely, for an initial frame \(\textbf{I}_{\text {init}}\) and a goal frame \(\textbf{I}_{\text {goal}}\), we sample M pick-and-place actions \(\{\mathcal {P}_\text {pick}^{(m)}, \mathcal {P}_\text {place}^{(m)}\}^{M}\) from a multivariate Gaussian distribution. We predict the visual outcomes \(\hat{\textbf{I}}_{1:T}^{(m)}\) for each pick-and-place action and then evaluate the cost of each action using \(c^{(m)} = \ell _{1}(\hat{\textbf{I}}_{T}^{(m)}, \textbf{I}_{\text {goal}})\). We then select the K actions with the lowest costs, fit a new multivariate Gaussian distribution to these K pick-and-place actions, and resample a new set of M actions from this distribution. We repeat the prediction and refitting procedures for n iterations. After the final iteration, we execute the pick-and-place action \(\{\mathcal {P}_\text {pick}^{*}, \mathcal {P}_\text {place}^{*}\}\) with the lowest cost, i.e., the one whose predicted visual outcome is closest to the given goal.
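A minimal sketch of this CEM loop, where `predict_fn` is an assumed wrapper around the action decomposer and the autoregressive rollout that returns the predicted last frame for a candidate action:

```python
import numpy as np

def cem_plan(predict_fn, I_init, I_goal, M=200, K=20, n_iters=3, dim=6):
    """CEM over flattened pick-and-place parameters, here assumed to be
    (x, y, theta) for the pick pose followed by (x, y, theta) for the
    place pose; M, K, and n_iters are illustrative values."""
    mu, sigma = np.zeros(dim), np.ones(dim)          # initial sampling dist.
    for _ in range(n_iters):
        samples = mu + sigma * np.random.randn(M, dim)   # M candidate actions
        costs = np.array([np.abs(predict_fn(I_init, s) - I_goal).mean()
                          for s in samples])             # l1 cost to the goal
        elites = samples[np.argsort(costs)[:K]]      # keep the K lowest costs
        mu = elites.mean(axis=0)                     # refit the distribution
        sigma = elites.std(axis=0) + 1e-6
    return mu                                        # best action parameters
```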

Since some complex tasks may require more than one pick-and-place action, we adopt a greedy strategy that selects the lowest-cost action at each step and re-optimizes from the current frame until the task succeeds or the maximum number of steps is reached. In contrast to previous approaches that use CEM and MPC to optimize low-level actions, our method optimizes high-level actions, resulting in greater planning efficiency. By avoiding repeated optimization at every low-level action step, our approach is more closely aligned with human planning strategies, which select actions at a higher level.
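The greedy strategy might look as follows, building on the `cem_plan` sketch above; the `robot` interface and the success threshold are assumptions for illustration.

```python
import numpy as np

def greedy_execute(predict_fn, robot, I_goal, max_steps=3, tol=0.02):
    """Re-plan one pick-and-place action from the current observation
    until the goal frame is matched or the step budget is exhausted.
    `robot.observe` and `robot.execute_pick_and_place` are assumed."""
    for _ in range(max_steps):
        I_curr = robot.observe()
        params = cem_plan(predict_fn, I_curr, I_goal)
        robot.execute_pick_and_place(params)
        if np.abs(robot.observe() - I_goal).mean() < tol:
            return True                   # goal configuration reached
    return False                          # step budget exhausted
```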

Experiments and results

Fig. 6

Environmental setup. It includes (a) a horizontal workspace, (b) a 7-DoF Franka Emika Panda robot, (c) an RGB camera observing the workspace, (d) a 2-finger Franka Emika gripper, and (e) a depth camera used to obtain the height position of each action

Fig. 7

An example in the PandaGrasp-Pick&Place dataset. Left: the pick-and-place action. Right: the observed sequence of visual frames

Our experiments aim to evaluate whether the robot can learn high-level actions through the proposed visual predictive model and ultimately leverage them in real-world robot manipulation tasks. The question is twofold: (1) can the proposed method predict accurate visual outcomes of robot pick-and-place actions? and (2) can the visual predictive model that learns pick-and-place actions be used to plan real robot tasks, and does this lead to a greater success rate and planning efficiency?

We conduct both quantitative and qualitative experiments to answer the above questions. To answer question (1), we compare the predicted visual outcomes of pick-and-place actions between our method and baseline methods that use either a conditional variational autoencoder (CVAE) or models trained only on low-level action data. For question (2), we compare our approach with the vanilla visual MPC, which uses only low-level robot actions, on a variety of real robot tasks. In addition, we conduct experimental comparisons with other pick-and-place methods, including the Transporter Network [30, 43] and Dex-Net [28], which rely on custom-designed algorithms and do not consider world dynamics models. More visualizations and videos can be found at https://youtu.be/JOgjovETlVg.

Experimental setup

We train and evaluate our proposed method in a real robot environment. For both data collection and evaluation, we use a 7-DoF Franka Emika Panda robot equipped with a two-finger Franka Emika gripper, as shown in Fig. 6. To obtain visual observations, we place an RGB camera on the side and resize the images to a 64\(\times \)64 pixel resolution for the predictive model. Since we parameterize pick-and-place actions on a horizontal 2D plane in SE(2), a depth camera mounted at the robot’s end-effector provides a depth map of the workspace, from which we obtain the height (z-axis) of each action. In our experiments, all models are trained with 4\(\times \) NVIDIA Tesla V100 (32 GB) graphics cards, while inference is done with a single consumer graphics card (NVIDIA GeForce RTX 3090).

Dataset

We collect a real robot dataset, named PandaGrasp-Pick&Place, to train and evaluate the proposed method. While RoboNet [4] introduced an autonomous data collection strategy to obtain data on interactions between the robot and objects in open-world environments and released a promising dataset, actions in RoboNet are mostly low-level displacements of the robot’s end-effector. To address this limitation, our dataset introduces high-level pick-and-place actions. The two datasets are described below.

RoboNet

[4] provides a large open dataset containing 150K trajectories of robot manipulation from several robots. Each trajectory in RoboNet records a sequence of visual observations and low-level actions defined as displacements of the robot end-effector. Although RoboNet provides a large number of examples, its babbling-like exploration strategy results in a scarcity of high-level actions. In our experiments, we use RoboNet to pre-train the visual predictive model of our method and to establish baselines for evaluating visual prediction performance.

PandaGrasp-Pick&Place

As its name indicates, this is a dataset containing pick-and-place actions of a Franka Panda robot. Concretely, as shown in Fig. 7, each example in the dataset records the robot performing a random pick-and-place action. The robot executes the actions according to the primitives defined in Fig. 3, which involve approaching the picking position, grasping the object, and moving to the placing position. We record the visual observation of each action as a sequence of 21 frames, following the high-level action decomposer proposed in the section “Method”.

Visual prediction conditional on high-level actions

To study whether the visual predictive model understands robot pick-and-place actions correctly, we evaluate the accuracy of the predicted frames of such actions against the ground-truth frames. Our quantitative evaluations use three metrics: the structural similarity index measure (SSIM) [36], peak signal-to-noise ratio (PSNR) [22], and Learned Perceptual Image Patch Similarity (LPIPS) [45].
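For reference, SSIM and PSNR can be computed with scikit-image as in the sketch below; the array shapes and value ranges are assumptions, and LPIPS, which requires a pretrained perceptual network (e.g., the `lpips` package), is omitted.

```python
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_metrics(pred, truth):
    """SSIM and PSNR between a predicted and a ground-truth frame,
    given as (H, W, 3) float arrays in [0, 1]."""
    ssim = structural_similarity(pred, truth, channel_axis=-1,
                                 data_range=1.0)
    psnr = peak_signal_noise_ratio(truth, pred, data_range=1.0)
    return ssim, psnr
```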

Fig. 8

Qualitative comparison between the predictions of a CVAE network and our proposed method: (a) the initial frame, (b) the ground truth of the last frame, (c) the prediction of the CVAE network, and (d) the prediction of our proposed method. The green boxes highlight the sponge being manipulated in this example

Table 1 Quantitative comparisons of the predicted last frames between the CVAE network and our proposed method (mean ± standard error)
Table 2 Quantitative comparison among different visual predictive models trained on high-level or low-level actions on the average of the sequence of frames (mean ± standard error)
Table 3 Quantitative comparison among different visual predictive models trained on high-level or low-level actions on different stages of pick-and-place actions (mean ± standard error)

We first compare our method with a CVAE network that directly predicts the last frame of a pick-and-place action given an initial frame. For a fair comparison, the CVAE network shares the same encoder and decoder structure as our method, and both methods are trained on the same data. The results (Table 1) show that, compared with the CVAE network, our proposed method achieves better performance on all metrics when predicting the visual outcomes of pick-and-place actions.

Figure 8 presents a qualitative comparison between the predictions of the CVAE network and our proposed method. The initial frame used as input to both methods is shown in Fig. 8a, and Fig. 8b shows the ground truth of the last frame. Figure 8c, d shows the predictions of the CVAE network and of our method, respectively. The green boxes highlight the sponge being manipulated in this example. Comparing the predictions to the initial frame, both methods predict that the sponge has been moved from its initial position. However, when we compare the predicted frames to the ground truth, only our method accurately predicts that the sponge has been moved to the intended place position. In contrast, the CVAE network fails to generate a reasonable object in its prediction. This is because the CVAE network learns pick-and-place actions only by mapping pixels between the initial and last frames. Our method, in contrast, learns pick-and-place actions through a sequence of intermediate frames, which carry more semantic information about the actions. This enables it to correctly predict the sequence of frames up to the last frame we are interested in.

We also compare against two baselines to evaluate whether the visual predictive model can still learn high-level actions when trained in their absence. Specifically, the baseline models are trained using only low-level actions (i.e., the end-effector’s displacements), similar to those in RoboNet. To mitigate bias from the specific robot and environment in our dataset, we acquire a dataset with our setup but adopt the babbling-like methodology of RoboNet; we refer to this dataset as Panda-Babbling. The compared baselines and our model are as follows:

  1. A visual predictive model trained on RoboNet.

  2. A visual predictive model pre-trained on RoboNet and fine-tuned on Panda-Babbling.

  3. A visual predictive model pre-trained on RoboNet and fine-tuned on PandaGrasp-Pick&Place.

Table 2 shows the average quantitative results over the prediction of the entire sequence of pick-and-place actions. The model trained on pick-and-place action data outperforms the models trained only on low-level actions. We then evaluate the models on different stages of pick-and-place actions (Table 3). Although the model trained on Panda-Babbling performs better over short horizons, such as the first two stages, it fails to predict over the long horizon, which is crucial for pick-and-place actions. In contrast, our method trained on PandaGrasp-Pick&Place predicts better over the long horizon.

Figure 9 shows a qualitative comparison of prediction results across different models. Although the model trained on Panda-Babbling successfully learns the gripper movements related to the low-level displacement actions, it fails to learn object movements. In contrast, the model trained on PandaGrasp-Pick&Place achieves more accurate predictions of both gripper and object movements.

Fig. 9

Qualitative comparison among the predictions of visual predictive models trained with or without data of pick-and-place actions. We show some keyframes of the sequence. The red boxes highlight whether the sponge is correctly predicted to be picked up, and the green boxes highlight whether it is correctly predicted to be placed

Evaluation on real tasks with high-level actions

Fig. 10

Three real robot tasks in our experiments. Left: relocating an object without obstacles. Middle: relocating an object with obstacles. Right: placing an object into a bowl

This section evaluates whether using high-level actions in prediction and planning leads to a greater success rate in real robot tasks and more efficient planning, especially for tasks involving pick-and-place actions. As shown in Fig. 10, we compare our method with the vanilla visual MPC [9, 39] on three manipulation tasks:

  1. Relocating an object without obstacles, where the goal is to relocate an object to a desired position with no obstacles in the workspace.

  2. Relocating an object with obstacles, where the goal is to move the object to a new location without hitting obstacles in the workspace.

  3. Placing an object into a bowl, where the robot is required to place an object into a particular target area, such as a bowl.

Given an initial frame and the goal frame, our method performs visual predictions for a set of pick-and-place actions. We then select the action whose predicted frame yields the lowest \(\ell _{1}\) loss to the goal frame. In contrast, for the vanilla visual MPC, we adopt the greedy planner described in [39], where the predictive model predicts frames for a set of short-horizon sequences of low-level actions and iteratively selects the first action of the sequence that leads to the lowest \(\ell _{1}\) loss to the provided goal frame. For each task, we repeat the experiments with ten configurations by randomly placing objects in the workspace and designating a goal visual frame according to the corresponding task specification. Table 4 lists the various objects used in our experiments to diversify the configurations.

Table 4 Various objects used in our experiments
Table 5 Comparisons between success rate and efficiency of different planning methods
Fig. 11

Qualitative visualization for using the prediction of pick-and-place actions in three real robot tasks. Left: the given initial and goal frames of a specific task. Middle: the prediction of the planned action. Right: the real execution of the robot

We annotate an experiment as a success if the target object is relocated or placed into the goal configuration within the maximum number of steps and as a failure otherwise. In Fig. 11, we present a set of qualitative experiments showing that the learned model of high-level actions can complete all three manipulation tasks related to pick-and-place actions. Table 5 shows that using high-level action prediction and planning leads to higher success rates than vanilla visual MPC. This is particularly apparent in tasks that involve more complex robot and/or object interactions, e.g., with obstacles or other objects in the scene.

Furthermore, Table 5 shows that the average number of CEM iterations when planning over pick-and-place actions is much lower (7.7\(\times \)) than when planning over low-level actions. This highlights that planning over high-level pick-and-place actions leads to greater efficiency in downstream tasks. In both our method and vanilla visual MPC, each CEM iteration denotes an action re-planning process. In vanilla visual MPC, re-planning occurs after every low-level action, whereas our method re-plans only after a high-level action, resulting in greater planning efficiency. Although autoregressive generation makes our method predict over a longer horizon than vanilla visual MPC (20 vs. 10 frames), planning over pick-and-place actions reduces the number of CEM iterations, still resulting in greater planning efficiency.

Comparison with other planning frameworks

In this section, we provide experimental analysis comparing our approach with other state-of-the-art planning frameworks for pick-and-place actions, namely the Transporter Network [30, 43] and Dex-Net [28].

Transporter network

The Transporter network leverages visual cues to determine the task’s goal and ultimately uses them to estimate the robot’s pick-and-place actions. To compare with the Transporter network, we replicate it to perform robot rearranging tasks in our local environment. Following [43], we implement a similar user interface to acquire human demonstrations, which we use to train the Transporter network’s model.

Concretely, we obtain 500 human demonstrations of placing an object into a round bowl, as shown in the left part of Fig. 12. The results in Table 6 demonstrate that the trained model performs very well (a 100% success rate) on the task of “placing an object into a round bowl”. However, we also observe a limitation in using visual cues to generalize between tasks: Table 6 shows that this model performs poorly on the new task of “placing an object into a rectangular bowl”. This is because the learning of the Transporter network is task-specific, and the demonstrations used for training limit the model to manipulating only the round bowl. In contrast, our approach uses a visual predictive model as a world model to learn the interactions between the robot and objects without making the model dependent on any specific task. The results in Table 6 show that our method still achieves a 50% success rate on placing an object into a rectangular bowl, even though during training the model never saw a demonstration of placing an object into a rectangular bowl, or even the rectangular bowl itself. The intention of this experiment is not to declare an absolute winner but rather to foster an open dialog on whether to generalize via task-specific inductive biases or via world dynamics.

Fig. 12

Left: we obtain 500 human demonstrations of placing an object into a round bowl. Right: the trained model performs poorly on the new task of “placing an object into a rectangular bowl”

Table 6 Comparison of the success rate of generalization from the demonstrated task to a new task

Dex-Net

Dex-Net [28] is a state-of-the-art picking method that estimates optimal picking poses from a depth image. However, it does not take task-related objectives into account, such as which object to pick up and where to place it for a specific task goal. Nevertheless, we are interested in whether Dex-Net can be introduced into our approach, e.g., to select the pick. We thus evaluate the performance of Dex-Net in task-specific situations. We conduct this evaluation on two tasks: one is to relocate an object, with only the target object visible to Dex-Net; the other is to place an object into a bowl, with both the object and the bowl present in the field of view.

Table 7 Comparison of picking success rates in tasks with a single object vs. multiple objects

The results in Table 7 indicate that when only the target object is visible, Dex-Net performs well (100%) in selecting a suitable pick on that object. However, when multiple objects are present in the scene, Dex-Net does not consistently identify a suitable pick on the relevant object. For example, in the task of placing an object into the bowl, Dex-Net sometimes selects a pick on the edge of the bowl, which leads to task failure. In contrast, our method selects picks on the task-relevant object more consistently (60% vs. 80%). In a nutshell, methods like Dex-Net that select the most suitable pick based only on learned geometry can help select picks in pick-and-place tasks, but they need to be adapted to the task goal.

Conclusion and discussion

We propose a visual predictive model that learns high-level pick-and-place actions in a real robot manipulation environment. The predictive model combines a high-level action decomposer and a video prediction network to learn the intrinsic semantic information of high-level actions. We also expand our previous work [27] and contribute a new dataset: PandaGrasp-Pick&Place contains 5K examples of a Franka Panda robot executing pick-and-place actions. In our experiments, we find that our method outperforms a CVAE network in predicting the target frame conditional on the initial frame and the pick-and-place action. By comparing visual predictive models trained with and without high-level actions, we find that our proposed method can effectively learn pick-and-place actions. We then evaluate our method with sample-based optimization on several real robot tasks; it finds appropriate pick-and-place actions, especially in scenarios where this kind of high-level action is most suitable. We also found some limitations in our work. Introducing the high-level action primitive reduces generality compared to vanilla visual foresight, resulting in a lower success rate in the task of relocating without obstacles, which may be accomplished more easily by pushing. We believe that the generalization of our method could be improved through a more general primitive or by combining various primitives. Also, although learning world dynamics is task-agnostic and more general, it may be more challenging than learning a model for each specific task, as the Transporter network does. As state-of-the-art video prediction and generative models [19, 21] advance, their capability to learn world dynamics will become more powerful, eventually leading to better performance in downstream tasks.