Introduction

Deformable objects are widely seen in many automation tasks such as food handling, assistive dressing, and the manufacturing, assembly and sorting of garments [1]. Vision-based deformable object rearrangement (e.g. rope straightening and cloth folding) is one of the most investigated and fundamental deformable manipulation tasks, where the robot must infer a sequence of manipulation actions (e.g. pick and place) solely from visual observations (e.g. point clouds [2] and RGB images [3]) to rearrange a deformable object into a prescribed goal configuration. Different from rigid manipulation [4,5,6], deformable rearrangement poses two new challenges. The first lies in the high dimensionality of the deformable configuration space [7]. In contrast to rigid objects, whose configurations are frequently represented as 6-D poses w.r.t. a common reference frame, how to represent the configuration of a deformable object efficiently and accurately, particularly from visual observations alone, remains unresolved. The second challenge comes from the highly complex and non-linear dynamics of deformable materials [8], which make the object's behavior under robot actions (e.g. pull and push) hard to model and predict during manipulation inference and planning.

To tackle the high-dimensional configuration space, an efficient representation strategy for visual observations of deformable objects is necessary. Convolutional Neural Networks (CNNs) pave the way for extracting hidden dense features of deformable objects from visual observations. Another widely explored representation strategy is based on keypoint detection from visual observations [9, 10]. Representing the states of a deformable object as keypoints can significantly decrease the dimensionality of its configuration space and therefore lead to more data-efficient policy learning for deformable rearrangement compared with using convolutional features [11]. However, these methods do not consider the global interactions among different visual parts, which may convey important clues about deformable configurations. Recently, handcrafted graph structures have provided a solution to represent the interactions among keypoints. Concretely, by viewing keypoints as nodes, the keypoint interactions can be represented as edges of a graph [12, 13]. In our previous work [14], we used a handcrafted graph structure to represent the deformable object and modified a CNN-based manipulation policy learning architecture. However, handcrafted rules limit the expressiveness of graph structures in representing deformable configurations. To better handle complex deformable configurations during robot rearrangement, as an extension of our previous work [15], we propose a learned dynamic graph representation strategy, where the interactions among keypoints are learned by the model during rearrangement policy learning rather than predefined with handcrafted rules. Since the graph is constructed to improve the performance of rearrangement tasks, the learned keypoint interactions go beyond expert knowledge and can therefore be more suitable and efficient for representing deformable object configurations.

As for manipulation policy learning under deformable dynamics, end-to-end approaches, where the robot learns a deformable rearrangement policy directly from visual observations, have recently become a research focus. However, most pioneering works tend to provide task-specific models. Research on establishing a general framework applicable to different rearrangement tasks has made some progress recently. Seita et al. [3] proposed the goal-conditioned transporter network, which performs well on several deformable rearrangement tasks. Lim et al. [16] presented a more systematic classification of rearrangement tasks and achieved better performance in multi-task learning. However, end-to-end policy learning from visual observations depends heavily on the image style and can therefore suffer from severe sim-to-real gaps. To narrow the sim-to-real gap, we propose a two-stage learning framework (Fig. 1). Specifically, we first design a keypoint detector that extracts keypoints from visual observations effectively. We then propose local-GNN, a local graph neural network that learns the manipulation policy from the keypoints of the current and goal visual observations. In this way, the image style is isolated from manipulation policy learning. The current and goal keypoints are transformed into two dynamic graphs; the proposed local-GNN first updates both graphs locally to obtain accurate representations of object states during robot rearrangement, and then exchanges messages globally across the updated local graphs to find their best-matched node pairs, i.e. to search for the optimal manipulation action pair that best narrows the gap between the current and goal states.

Fig. 1

Illustration of the proposed local-GNN, a light and suitable local graph neural network that learns to manipulate deformable objects with an efficient and accurate graph representation of the deformable rearrangement dynamics

We present both simulated and real-world experiments on a variety of deformable rearrangement tasks to evaluate the proposed method. The results demonstrate that our method is a more efficient and general framework for vision-based goal-conditioned deformable rearrangement compared with state-of-the-art frameworks. Moreover, leveraging the keypoint and graph representations, our model is more expressive in modeling deformable rearrangement dynamics, yet much lighter in model size and complexity. Besides single-task learning, our method achieves comparable multi-task learning performance. Real-world experiments also reveal the enhanced sim-to-real transferability of our framework. The contributions of this work are summarized as follows:

  1. We propose a novel graph representation strategy, where keypoints are detected and encoded into a dynamic graph as an efficient and accurate structural replacement of the high-dimensional visual observations for vision-based deformable rearrangement;

  2. We propose local-GNN, a light and effective graph neural network that utilizes local and global attention between two dynamic graphs to jointly learn the deformable rearrangement dynamics and infer the optimal rearrangement policy;

  3. We propose a two-stage learning framework, which greatly narrows the sim-to-real gap of learning vision-based goal-conditioned deformable rearrangement.

The rest of this paper is organized as follows. Related work is reviewed in “Related work”. We introduce our learning framework briefly in “Problem formulation”. “Learning for deformable object rearrangement” presents details of the learning algorithms in our proposed framework. The experimental setup and results are provided in “Experiment results”. “Conclusion” concludes this paper and discusses future work.

Table 1 Research gaps and contributions of previous studies

Related work

This section reviews related work on the configuration representation of deformable objects, manipulation planning of deformable rearrangement, and manipulation policy learning from keypoints (Table 1).

Configuration representation of deformable objects

Considering the high dimensionality of the configuration space of deformable objects, an effective representation method is necessary. Adopting Convolutional Neural Networks (CNNs) to extract hidden dense features of deformable objects from visual observations has been widely investigated. However, CNNs focus solely on local features in visual images, which limits their expressiveness and efficiency in representing deformable object configurations, since the global relationships (or interactions) among different parts of a deformable object are important. Besides, convolutional features depend heavily on the image style and can therefore suffer from severe sim-to-real gaps. Keypoint representation strategies have the advantage of a lower-dimensional configuration space compared with CNN features. Considering that keypoint representations also ignore interactions between keypoints, Miller et al. [22] introduced predefined geometric constraints to incorporate keypoint interactions, which however are a strong prior and can hardly be obtained accurately. Recent advances in graphs provide another potential solution to represent the interactions among keypoints without geometric priors. However, keypoint interactions in these graphs are usually handcrafted, i.e. each interaction simply encodes whether there exists a physical connection between two keypoints, and each keypoint is connected to its neighbors within a predefined distance. In our model design, keypoint interactions are learned during rearrangement policy learning instead of being defined by handcrafted rules, which exploits the potential of graphs in rearrangement more fully.

Manipulation planning of deformable objects

There are two main approaches to manipulation planning of deformable objects. Model-based approaches rely on an accurate forward dynamics model, which predicts the configurations of deformable objects under certain manipulation actions. Forward dynamics modeling methods can be mainly divided into physics-based methods (e.g. mass-spring systems and continuum mechanics) and data-driven methods. The accuracy of physics-based methods highly depends on tuning the involved physical parameters properly, and therefore an accurate physics-based model is usually too complex and expensive to obtain [1]. Data-driven methods learn the forward dynamics model from large quantities of interaction data between robots and deformable objects [19], which is however data-consuming and generalizes poorly across objects and tasks.

Policy-based approaches aim to obtain optimal manipulation policies directly from observations without establishing a forward dynamics model. This line of work can be divided into two categories according to the source of the training data: imitation learning and reinforcement learning. In imitation learning, the manipulation task is often formulated as a supervised learning problem where the robot imitates observed behaviors [18]. Reinforcement learning obtains rearrangement skills through exploratory robot interactions [17, 23]. However, most previous policy-based methods are limited to single-task learning, which is inefficient in real-world applications. In contrast, our model learns manipulation policies from dynamic graphs with an efficient local-GNN architecture, and our proposed method proves to be a general framework for multiple deformable object rearrangement tasks that performs well in the multi-task learning scenario.

Manipulation policy learning from keypoints

Learning manipulation policies from keypoints has become a research focus recently because keypoints can be effective alternatives for high-dimensional visual observations and lead to more data-efficient manipulation policy learning. Within the context of deformable object manipulation, early works focus on tracking the keypoints of deformable objects [24]. To achieve efficient manipulation policy learning, Lin et al. [25] have used the positions of keypoints on the rope as the reduced states. To bridge the sim-to-real gap, Wang et al. [20] treated keypoints as nodes in a graph and designed an offline-online learning framework based on graph neural networks. Ma et al. [21] designed a graph neural network to learn the forward dynamic model of the deformable objects and achieved precise visual manipulation. However, most previous graph neural network-based methods rely on a model predictive controller to compensate for the prediction error of the forward dynamic model, which brings a heavy computational burden. In addition, the generalization of the pre-trained dynamic model on different objects and tasks is usually limited. To provide a general learning framework, we design a local-GNN to learn manipulation policy directly from keypoints (represented as a graph) without the establishment of a forward dynamic model.

Problem formulation

This section presents a detailed problem definition and a brief introduction of the proposed learning framework. Table 2 lists key notations used in this work.

Problem formulation

As shown in Fig. 2, given a visual observation \({\textbf {{I}}}_0\in {\mathbb {R}}^{ \text {h} \times \text {w}\times \text {c}}\) of the initial state of the deformable object and a prescribed goal state \({\textbf {{I}}}_{\text {g}}\in {\mathbb {R}}^{ \text {h} \times \text {w}\times \text {c}}\) also specified as a visual observation, where \(\text {h}\), \(\text {w}\) and \(\text {c}\) denote the height, width and channels of the visual observations respectively, we formulate the problem of goal-conditioned deformable rearrangement as finding a policy \(\pi \) that generates a sequence of robot pick and place actions \(\{\varvec{\alpha }_{t} \} (t=0, 1,2,\ldots ,k)\) in a closed-loop manner:

$$\begin{aligned} \varvec{\alpha }_{t} \leftarrow \pi ({\textbf {{I}}}_t, {\textbf {{I}}}_{\text {g}}) \quad \text {and}\quad {\textbf {{I}}}_{t+1} \leftarrow {\mathcal {T}}({\textbf {{I}}}_t, \varvec{\alpha }_{t}) \end{aligned}$$
(1)

such that:

$$\begin{aligned} \left\| {\textbf {{I}}}_{k+1}-{\textbf {{I}}}_{\text {g}}\right\| _{\text {latent}} \le \gamma \end{aligned}$$
(2)

where \({\mathcal {T}}\) denotes the state transition describing the deformable rearrangement dynamics to be learned by the policy, and \(\gamma \) denotes the similarity threshold defined in the latent space, which determines if the object state is close enough to the goal state during robot manipulation.

Table 2 Nomenclature

We define each action \(\varvec{\alpha }\in {\mathcal {A}}\) as a pick action followed by a place action:

$$\begin{aligned} \varvec{\alpha }_t = \{\varvec{\alpha }_{t}^{\text {pick}}, \quad \varvec{\alpha }_{t}^{\text {place}}\} \end{aligned}$$
(3)

where \(\varvec{\alpha }_{t}^{\text {pick}}\) and \(\varvec{\alpha }_{t}^{\text {place}}\) denote the poses of robot end-effector while grasping and releasing part of the object respectively. More concretely, we consider tabletop manipulation tasks and hence both poses \(\varvec{\alpha }_{t}^{\text {pick}}\) and \(\varvec{\alpha }_{t}^{\text {place}}\) are defined in \(\text {SE}(2)\), where positions are sampled from a fixed-height planar surface \((x{-}y)\) and rotations are defined around the z-axis. Similar to previous work [3, 26], our method obtains the pick or place height (z) w.r.t. a heightmap generated from visual observations.
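To make the closed-loop formulation concrete, the following is a minimal sketch of the rollout defined by Eqs. (1)–(3), assuming Python/NumPy; `policy`, `transition` and `latent_distance` are hypothetical placeholders for the learned policy \(\pi \), the state transition \({\mathcal {T}}\) and the latent similarity metric of Eq. (2).

```python
# Minimal sketch of the closed-loop rollout in Eqs. (1)-(3).
# `policy`, `transition` and `latent_distance` are hypothetical
# placeholders for the learned policy, the state transition and
# the latent-space similarity metric respectively.
from typing import Callable, Tuple

import numpy as np

Action = Tuple[np.ndarray, np.ndarray]  # (pick pose, place pose), both in SE(2)


def rearrange(
    I_0: np.ndarray,  # initial visual observation, shape (h, w, c)
    I_g: np.ndarray,  # goal visual observation, shape (h, w, c)
    policy: Callable[[np.ndarray, np.ndarray], Action],
    transition: Callable[[np.ndarray, Action], np.ndarray],
    latent_distance: Callable[[np.ndarray, np.ndarray], float],
    gamma: float,  # similarity threshold of Eq. (2)
    max_steps: int = 20,
) -> bool:
    """Run the policy closed-loop until the goal is reached or steps run out."""
    I_t = I_0
    for _ in range(max_steps):
        if latent_distance(I_t, I_g) <= gamma:  # Eq. (2): close enough to the goal
            return True
        alpha_t = policy(I_t, I_g)              # Eq. (1): infer a pick-and-place action
        I_t = transition(I_t, alpha_t)          # apply the action, observe the new state
    return latent_distance(I_t, I_g) <= gamma
```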

Fig. 2

Problem overview. We formulate the goal-conditioned deformable rearrangement problem as finding a sequence of robot pick and place actions to rearrange the object from an initial state \({\textbf {{I}}}_0\) to a prescribed goal state \({\textbf {{I}}}_{\text {g}}\). a The input contains only the visual observations \({\textbf {{I}}}_0, {\textbf {{I}}}_{\text {g}}\) of the current and goal object states. b The method generates a sequence of pick and place actions \(\{\varvec{\alpha }_{t}\} (t=0, 1,2,\ldots ,k)\) rearranging the deformable object to the goal state. c Each action is defined as a pair of a pick \(\varvec{\alpha }_{t}^{\text {pick}}\) and a place \(\varvec{\alpha }_{t}^{\text {place}}\) action

Method overview

Our policy \(\pi \) consists of two main components: a keypoint detector \(\pi _{\text {keypoint}}\) that extracts two sets of keypoints, denoted by \(\mathcal {P}_t\) and \(\mathcal {P}_{\text {g}}\), from the current and goal visual observations \( {\textbf {{I}}}_t, {\textbf {{I}}}_{\text {g}}\) of the deformable object respectively, and a manipulation planner \(\pi _{\text {plan}}\) that determines the optimal pick and place actions on the keypoints from \(\mathcal {P}_t\) and \(\mathcal {P}_{\text {g}}\) to manipulate the object to the goal state \({\textbf {{I}}}_{\text {g}}\) in a closed-loop manner.

Concretely, the keypoint detector \(\pi _{\text {keypoint}}\) first maps the high-dimensional visual observations \({\textbf {{I}}}_t\), \({\textbf {{I}}}_{\text {g}}\) of the deformable object into two sets of 2-D keypoints \(\mathcal {P}_t, \mathcal {P}_{\text {g}}\),

$$\begin{aligned} \mathcal {P}_t \leftarrow \pi _{\text {keypoint}}({\textbf {{I}}}_t), \quad \mathcal {P}_{\text {g}} \leftarrow \pi _{\text {keypoint}}({\textbf {{I}}}_{\text {g}}) \end{aligned}$$
(4)

where each keypoint \({\varvec{{p}}}\) in \(\mathcal {P}_t\) or \(\mathcal {P}_{\text {g}}\) locates a candidate position for robot pick or place on the object. We aim to train the keypoint detector such that the output keypoints \(\mathcal {P}\) from a visual observation \({\textbf {{I}}}\) form an efficient, accurate and compact structural representation of the deformable object, facilitating the subsequent search in planning (“Learning for keypoint detection”).

The planner \(\pi _{\text {plan}}\) then reasons from the keypoints to find the optimal pair of pick and place actions

$$\begin{aligned} \varvec{\alpha }_{t}=\pi _{\text {plan}}( \mathcal {P}_t, \mathcal {P}_{\text {g}}). \end{aligned}$$
(5)

We formulate the manipulation planning of deformable rearrangement as a sequence-to-sequence (S2S) problem, and propose the local-GNN to learn the deformable rearrangement dynamics and reason about the optimal pick and place actions until Eq. (2) is satisfied (“Learning for deformable rearrangement”).

Learning for deformable object rearrangement

This section presents details of our learning framework for goal-conditioned deformable object rearrangement. Briefly, it consists of extracting keypoint and graph representations and then finding the optimal manipulation actions from visual observations.

Learning for keypoint detection

Our method starts by extracting a set of keypoints \(\mathcal {P}=\{{\varvec{{p}}}_i\}_{i=1}^m\) from the visual observation \({\textbf {{I}}}\), as an effective, accurate and compact structural representation of the high-dimensional configuration of the deformable object. We borrow ideas from previous work [27, 28] and train a deep convolutional neural network as our keypoint detector \(\pi _{\text {keypoint}}\). As shown in Fig. 3, given a visual observation \({\textbf {{I}}}\in {\mathbb {R}}^{\text {h} \times \text {w} \times \text {c}}\) of the deformable object, the detector first produces a group of m feature maps \(\{{\textbf {{H}}}_i\}_{i=1}^{m}\) from the visual cues extracted in the latent space. Each feature map \({\textbf {{H}}}_i\in {\mathbb {R}}^{\text {h}^{\prime } \times \text {w}^{\prime }}\) is then regarded as a probability map and processed to find the exact location \({\varvec{{p}}}_i\) of the i-th corresponding keypoint in the visual observation.

We apply a spatial softmax strategy to extract the keypoint location from each feature map. Specifically, using \(\Omega :\text {w}\times \text {h}\) and \(\Omega ^{\prime }:\text {w}^{\prime } \times \text {h}^{\prime }\) to denote the pixel domains of the visual observation \({\textbf {{I}}}\) and the feature map \({\textbf {{H}}}_i\) respectively, the location of the i-th keypoint on the feature map domain \(\Omega ^{\prime }\) can be obtained as

$$\begin{aligned} {\varvec{{p}}}_i^{\Omega ^{\prime }}=(x_i^{\Omega ^{\prime }},y_i^{\Omega ^{\prime }}) = \frac{\sum _{{\varvec{{c}}}\in {\Omega ^{\prime }}}{\varvec{{c}}}\times \exp ({\textbf {{H}}}_i({\varvec{{c}}}))}{\sum _{{\varvec{{c}}}\in \Omega ^{\prime }}\exp ({\textbf {{H}}}_i({\varvec{{c}}}))} \end{aligned}$$
(6)

where \({\varvec{{c}}}\in {\Omega ^{\prime }}\) denotes a pixel location on the feature map. The superscript indicates the reference domain and is omitted for the visual observation domain \(\Omega \) for simplicity. The spatial softmax strategy (Eq. (6)) condenses each feature map into a keypoint, which is fully differentiable and therefore makes the keypoint detector trainable.

The corresponding keypoint \({\varvec{{p}}}_i\) on the visual observation domain \(\Omega \) can then be obtained via a linear scaling from the keypoint \({\varvec{{p}}}_i^{\Omega ^{\prime }}\) on the feature map domain

$$\begin{aligned} {\varvec{{p}}}_i =(x_i,y_i)=\left( \frac{\text {h}}{\text {h}^{\prime }} x_i^{\Omega ^{\prime }},\frac{\text {w}}{\text {w}^{\prime }}y_i^{\Omega ^{\prime }}\right) . \end{aligned}$$
(7)
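As an illustration, the following is a minimal PyTorch sketch of the spatial softmax (Eq. (6)) and the linear rescaling (Eq. (7)); the function name and the treatment of the first coordinate as the row index are our assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the spatial softmax (Eq. (6)) and the linear
# rescaling to the observation domain (Eq. (7)). Feature maps are
# assumed to have shape (batch, m, h', w'); following Eq. (7), the
# first keypoint coordinate is scaled by h/h' and hence treated as
# the row index (an assumption about the pixel convention).
import torch


def spatial_softmax_keypoints(H: torch.Tensor, out_h: int, out_w: int) -> torch.Tensor:
    """Condense each of the m feature maps into one 2-D keypoint."""
    b, m, hp, wp = H.shape
    # Eq. (6): softmax over all pixels of each feature map.
    probs = torch.softmax(H.reshape(b, m, hp * wp), dim=-1).reshape(b, m, hp, wp)
    rows = torch.arange(hp, dtype=H.dtype, device=H.device).view(1, 1, hp, 1)
    cols = torch.arange(wp, dtype=H.dtype, device=H.device).view(1, 1, 1, wp)
    x = (probs * rows).sum(dim=(2, 3))  # expected row coordinate per feature map
    y = (probs * cols).sum(dim=(2, 3))  # expected column coordinate per feature map
    # Eq. (7): scale from the feature-map domain back to the observation domain.
    return torch.stack([x * (out_h / hp), y * (out_w / wp)], dim=-1)  # (b, m, 2)
```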
Fig. 3

Keypoint detector. We design a keypoint detector to extract keypoints from the RGB observations. The keypoint detector outputs a Gaussian heatmap centered at the coordinates of the keypoints detected from the RGB observations

We train the keypoint detector \(\pi _{\text {keypoint}}\) by optimizing over a Gaussian heatmap in a supervised fashion. Specifically, rather than optimizing the detector directly on the keypoints, our method generates two Gaussian heatmaps centered at the estimated keypoints \(\mathcal {P}=\{{\varvec{{p}}}_i\}_{i=1}^m\) and at their corresponding ground truth locations \(\mathcal {P}^*=\{{\varvec{{p}}}^*_i\}_{i=1}^m\) on the visual observation respectively

$$\begin{aligned}&{\textbf {{G}}}(\mathcal {P})=\sum _{i=1}^m \exp \left( -\frac{1}{2\varvec{\sigma }^2}\left\| {\varvec{{p}}}-{\varvec{{p}}}_i\right\| ^2\right) , \quad {\varvec{{p}}}\in \Omega \end{aligned}$$
(8)
$$\begin{aligned}&{\textbf {{G}}}^*(\mathcal {P}^*)=\sum _{i=1}^m \exp \left( -\frac{1}{2\varvec{\sigma }^2}\left\| {\varvec{{p}}}-{\varvec{{p}}}^*_i\right\| ^2\right) ,\quad {\varvec{{p}}}\in \Omega \end{aligned}$$
(9)

where \(\varvec{\sigma }\) is a constant standard deviation. The keypoint detector is then trained by minimizing a pixel-wise \(\text {L2}\) loss between the Gaussian heatmaps \({\textbf {{G}}}\) and \({\textbf {{G}}}^*\). The Gaussian representation of keypoints provides additional information on pixels that are less likely to be keypoints, and therefore makes the training more efficient.
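For concreteness, the following is a minimal PyTorch sketch of the Gaussian heatmap targets (Eqs. (8)–(9)) and the pixel-wise L2 loss; the function names and the default value of \(\sigma \) are illustrative assumptions.

```python
# Minimal sketch of the Gaussian heatmap targets (Eqs. (8)-(9)) and
# the pixel-wise L2 training loss; `sigma` is the constant standard
# deviation and its default value here is illustrative.
import torch


def gaussian_heatmap(keypoints: torch.Tensor, h: int, w: int, sigma: float) -> torch.Tensor:
    """Sum of Gaussians centered at the m keypoints over the pixel domain."""
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    grid = torch.stack([xs, ys], dim=-1)  # (h, w, 2) pixel coordinates as (x, y)
    # Squared distance of every pixel to every keypoint: (m, h, w).
    d2 = ((grid[None] - keypoints[:, None, None, :]) ** 2).sum(-1)
    return torch.exp(-d2 / (2 * sigma ** 2)).sum(0)  # (h, w) heatmap


def heatmap_loss(pred_kps: torch.Tensor, gt_kps: torch.Tensor,
                 h: int, w: int, sigma: float = 2.0) -> torch.Tensor:
    G = gaussian_heatmap(pred_kps, h, w, sigma)      # Eq. (8), differentiable w.r.t. pred_kps
    G_star = gaussian_heatmap(gt_kps, h, w, sigma)   # Eq. (9), fixed target
    return ((G - G_star) ** 2).mean()                # pixel-wise L2 loss
```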

Figure 4 shows example Gaussian heatmaps generated by our keypoint detector. Leveraging the keypoint detector, our method transforms each visual observation into a set of keypoints, which are then fed into the manipulation planner as candidate locations for robot pick and place. Rather than searching in the whole pixel domain of visual observations, where most locations are either invalid or redundant for deformable rearrangement, our method reduces the action exploration into a limited number of keypoints, making the subsequent planning more efficient.

Fig. 4

Gaussian heatmaps generated by our keypoint detector. The left column represents the original visual observation, and the right column represents the Gaussian heatmap centered at keypoints detected from the visual observation

Fig. 5

Model architecture. Our model first detects keypoints from the current and goal visual observations and establishes representation vectors for the keypoints. Two representation graphs are built on the keypoints. A local-GNN with self-attention and cross-attention layers then updates the two graphs. Finally, the probability distributions of the pick and place actions are generated from the updated graph nodes by a softmax layer

Learning for deformable rearrangement

Our method then reasons from the keypoints of the current and goal states \(\mathcal {P}_t\), \(\mathcal {P}_{\text {g}}\) at each time step to find the optimal pick and place actions respectively, i.e. \(\varvec{\alpha }_{t}^{\text {pick}} \in \mathcal {P}_t\), \(\varvec{\alpha }_{t}^{\text {place}} \in \mathcal {P}_{\text {g}} \; (t=0, 1,2,\ldots ,k)\), which is formulated as a sequence-to-sequence problem

$$\begin{aligned} {{\textbf {Q}}}_{\text {pick}},{{\textbf {Q}}}_{\text {place}} =\psi ( \mathcal {P}_t, \mathcal {P}_{\text {g}}) \end{aligned}$$
(10)

where \({{\textbf {Q}}}_{\text {pick}}\), \({{\textbf {Q}}}_{\text {place}}\) correspond to the probability distributions of pick and place success on the keypoints in \(\mathcal {P}_t\) and \(\mathcal {P}_{\text {g}}\) respectively. The optimal pick and place actions can therefore be determined as

$$\begin{aligned} \varvec{\alpha }_{t}^{\text {pick}}&=\underset{{\varvec{{p}}}\in \mathcal {P}_t }{\text {argmax}}\,{{\textbf {Q}}}_{\text {pick}}({\varvec{{p}}})\\ \varvec{\alpha }_{t}^{\text {place}}&= \underset{{\varvec{{p}}}\in \mathcal {P}_{\text {g}}}{\text {argmax}}\,{{\textbf {Q}}}_{\text {place}}({\varvec{{p}}}). \end{aligned}$$
(11)

We build local-GNN with an attention-based updating strategy to learn the above functions (Eqs. (10) and (11)). Briefly, leveraging the two sets of keypoints, our method first constructs two dynamic graphs to represent the current and goal states of the deformable object. The two graphs are then further updated by local-GNN (1) to extract the hidden features that effectively characterize the object states, (2) to learn the deformable rearrangement dynamics, and (3) to infer the optimal pick and place actions that drive the object from the current state as close as possible to the goal state, from the intrinsic structures and underlying interactions among the keypoints/nodes of the two dynamic graphs.

Model architecture: Concretely, as shown in Fig. 5, at each time step, the keypoints \(\{\mathcal {P}_t, \mathcal {P}_{\text {g}}\}\) detected from the current and goal visual observations \(\{{\textbf {{I}}}_t,{\textbf {{I}}}_{\text {g}}\}\) are first embedded into a latent space with a multilayer perceptron (MLP),

$$\begin{aligned} {\varvec{{x}}}^{0} = {\textrm{MLP}}^0({\varvec{{p}}}). \end{aligned}$$
(12)

The obtained hidden features \(\{\mathcal {X}_t^0, \mathcal {X}_{\text {g}}^0\}\) are then used as the initial nodes of two dynamic graphs \(\{\mathcal {G}_t, \mathcal {G}_{\text {g}}\}\) representing the current and goal states of the deformable object respectively. Particularly, we define two types of graph edges (Fig. 5-Right): self-edges \(\epsilon _{\text {s}}\) connecting every two nodes within the same graph, and cross-edges \(\epsilon _{\text {c}}\) connecting every two nodes across the two graphs.

The two graphs are then passed through and updated progressively by local-GNN, which consists of \(\ell _{\text {s}}\) self-attention layers for local graph updating and \(\ell _{\text {c}}\) cross-attention layers for keypoint matching between the graphs. Specifically, at each updating layer, all nodes in \(\{\mathcal {X}_t, \mathcal {X}_{\text {g}}\}\) are updated by aggregating messages through the self-edges \({\mathcal {E}}_{\text {s}}\) or cross-edges \({\mathcal {E}}_{\text {c}}\) of the two graphs respectively:

$$\begin{aligned} {\varvec{{x}}}_i^{\ell +1} = {\varvec{{x}}}_i^{\ell }+{\text {MLP}}\left( \left[ {\varvec{{x}}}_i^{\ell } \Vert \varvec{m}_{{\mathcal {E}}\rightarrow i}\right] \right) \end{aligned}$$
(13)

where the right superscript indexes the updating layer, and \(\left[ \cdot \Vert \cdot \right] \) denotes the concatenation operation. The updating message \(\varvec{m}_{{\mathcal {E}}\rightarrow i}\) for the i-th node represents the resultant messages aggregated from nodes \(\{j: \epsilon _{j\rightarrow i}\in {\mathcal {E}}\}\) connected to the i-th node, where the edge set \({\mathcal {E}}= {\mathcal {E}}_{\text {s}}\) and \({\mathcal {E}}= {\mathcal {E}}_{\text {c}}\) during the local and cross updating stage respectively.

Our method aggregates the message \(\varvec{m}_{{\mathcal {E}}\rightarrow i}\) for the i-th node by using its attention values with all neighboring nodes, as in the Transformer [29], which has been widely applied to sequence-to-sequence processing [30, 31]. Specifically, at each self-attention layer (\(\ell \le \ell _{\text {s}}\), \({\mathcal {E}}={\mathcal {E}}_{\text {s}}\)), the network aggregates messages through self-edges to obtain a more accurate and efficient graph representation of the object state in rearrangement. At each cross-attention layer \((\ell _{\text {s}} < \ell \le \ell _{\text {s}}+\ell _{\text {c}},\) \({\mathcal {E}}={\mathcal {E}}_{\text {c}}),\) the network aggregates messages through cross-edges to further update both graphs \(\mathcal {G}_t\) and \(\mathcal {G}_{\text {g}}\), which essentially compares and matches nodes between \(\mathcal {X}_t\) and \(\mathcal {X}_{\text {g}}\) to search for the optimal pair of pick and place actions.

Finally, the updated graph nodes are passed through a softmax layer to output the probability distributions of pick and place actions \({{\textbf {Q}}}_{\text {pick}},{{\textbf {Q}}}_{\text {place}}\) over their corresponding keypoints \(\mathcal {P}_t, \mathcal {P}_{\text {g}}\) respectively, from which the robot selects the optimal pick and place actions to rearrange the deformable object as given by Eq. (11). The above procedure runs in a closed-loop manner until the object state is close enough to the prescribed goal state.
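The following is a condensed PyTorch sketch of the local-GNN forward pass described above (Eqs. (10)–(13)); the layer sizes, head counts and the weight sharing between the two graphs are illustrative assumptions rather than the paper's exact hyperparameters.

```python
# Condensed sketch of the local-GNN forward pass (Eqs. (10)-(13)).
# Layer sizes, head counts and the weight sharing between the two
# graphs are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionLayer(nn.Module):
    """One residual node update (Eq. (13)) with attention-aggregated messages."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        m, _ = self.attn(x, source, source)  # messages aggregated by attention values
        return x + self.mlp(torch.cat([x, m], dim=-1))


class LocalGNN(nn.Module):
    def __init__(self, dim: int = 128, n_self: int = 2, n_cross: int = 2):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(2, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.self_layers = nn.ModuleList([AttentionLayer(dim) for _ in range(n_self)])
        self.cross_layers = nn.ModuleList([AttentionLayer(dim) for _ in range(n_cross)])
        self.head = nn.Linear(dim, 1)

    def forward(self, P_t: torch.Tensor, P_g: torch.Tensor):
        # Eq. (12): embed the 2-D keypoints as initial graph nodes.
        x_t, x_g = self.embed(P_t), self.embed(P_g)
        for layer in self.self_layers:   # self-edges: update each graph locally
            x_t, x_g = layer(x_t, x_t), layer(x_g, x_g)
        for layer in self.cross_layers:  # cross-edges: match nodes across the graphs
            x_t, x_g = layer(x_t, x_g), layer(x_g, x_t)
        # Eq. (10): per-keypoint pick and place probability distributions.
        Q_pick = torch.softmax(self.head(x_t).squeeze(-1), dim=-1)
        Q_place = torch.softmax(self.head(x_g).squeeze(-1), dim=-1)
        return Q_pick, Q_place


# Eq. (11): the optimal actions are the argmax keypoints,
# e.g. pick_idx = Q_pick.argmax(dim=-1), place_idx = Q_place.argmax(dim=-1).
```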

Training and implementation

To train the network, we adopt the paradigm of imitation learning and generate a dataset of N stochastic expert demonstrations \({\mathcal {D}}=\{ \tau _i \}_{i=1}^{N}\), where each demonstration \(\tau _i\) contains a sequence of visual observation and action pairs:

$$\begin{aligned} \tau _i=\{({\textbf {{I}}}_1, \varvec{\alpha }_1), ({\textbf {{I}}}_2, \varvec{\alpha }_2),\ldots , ({\textbf {{I}}}_{T_i}, \varvec{\alpha }_{T_i})\}. \end{aligned}$$
(14)
Fig. 6

Deformable rearrangement tasks in our dataset. Each task shows two instances, where the left images are the current states and the right images are the goal states

Details of the demonstration dataset are presented in “Dataset construction”. Leveraging the dataset, we formulate the training of local-GNN as a supervised classification problem. We employ the cross-entropy error as the loss function, which has been widely proved effective for classification learning:

$$\begin{aligned} \mathcal {L}= w_{\text {pick}}\,\mathcal {L}_{\text {pick}} + w_{\text {place}}\,\mathcal {L}_{\text {place}} \end{aligned}$$
(15)

and

$$\begin{aligned} \mathcal {L}_{\text {pick}}&= -\sum _{{\varvec{{p}}}\in \mathcal {P}_t}y_{\text {pick}}({\varvec{{p}}})\log ({{\textbf {Q}}}_{\text {pick}}({\varvec{{p}}}))\\ \mathcal {L}_{\text {place}}&= -\sum _{{\varvec{{p}}}\in \mathcal {P}_{\text {g}}}y_{\text {place}}({\varvec{{p}}})\log ({{\textbf {Q}}}_{\text {place}}({\varvec{{p}}})) \end{aligned}$$
(16)

where \(\mathcal {L}_{\text {pick}}\) and \(\mathcal {L}_{\text {place}}\) denote the pick and place losses, and \(w_{\text {pick}}\) and \(w_{\text {place}}\) are their scalar loss weights respectively. If \({\varvec{{p}}}\) is a ground truth pick/place position, \(y_{\text {pick/place}}({\varvec{{p}}})=1\); otherwise \(y_{\text {pick/place}}({\varvec{{p}}})=0\).
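The following is a minimal PyTorch sketch of this weighted cross-entropy objective (Eqs. (15)–(16)); the small `eps` term for numerical stability is our addition.

```python
# Minimal sketch of the weighted pick/place cross-entropy objective
# (Eqs. (15)-(16)); `eps` is our addition for numerical stability.
import torch


def rearrangement_loss(Q_pick: torch.Tensor, Q_place: torch.Tensor,
                       y_pick: torch.Tensor, y_place: torch.Tensor,
                       w_pick: float = 1.0, w_place: float = 1.0,
                       eps: float = 1e-8) -> torch.Tensor:
    """Q_*: per-keypoint probabilities; y_*: one-hot ground-truth labels."""
    L_pick = -(y_pick * torch.log(Q_pick + eps)).sum(dim=-1).mean()     # Eq. (16), pick
    L_place = -(y_place * torch.log(Q_place + eps)).sum(dim=-1).mean()  # Eq. (16), place
    return w_pick * L_pick + w_place * L_place                          # Eq. (15)
```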

Dataset construction

Learning an end-to-end robot manipulation policy is particularly data-hungry. However, large-scale datasets of real-robot demonstrations are usually costly and inaccessible. To this end, we modify the simulation environment of [3] and construct our own dataset of deformable rearrangement tasks in PyBullet [32], which provides highly realistic visual rendering and simulation of deformable dynamics.

Particularly, we set up a fixed UR5 manipulator with a suction cup as its end-effector for deformable rearrangement. We use an RGB-D camera just above the robot workspace to collect visual observations of deformable objects from a top-down perspective. We build the dataset with three types of deformable objects: ropes, rope rings and cloth (Fig. 6). Each type of deformable object is paired with a large number of highly randomized rearrangement tasks, such that the task diversity in our dataset is sufficient to guarantee the generalization and effectiveness of the learned framework. It is worth noting that both the initial and goal configurations of our rearrangement tasks are completely randomized and provided as visual observations, which poses a higher generalization requirement on our framework. Details of the deformable rearrangement tasks in our dataset are presented in Table 3 and in the supplementary materials.

Table 3 Deformable rearrangement tasks involved in our dataset

Experiment results

Table 4 Results of success rates

This section presents both simulated and real-world experiments to evaluate the performance of our framework. Particularly, we aim to answer the following questions: (1) How well does our framework compare with the baseline methods on deformable rearrangement tasks? (2) How well does our framework perform on real-world deformable rearrangement tasks? and (3) How well does our framework perform on multi-task policy learning?

Simulation experiments

We first evaluate the performance of our method on a variety of goal-conditioned deformable rearrangement tasks in simulation. Specifically, in each experiment of rearranging a certain deformable object, the robot is provided with a random goal state of the object in the form of a visual image, and is supposed to manipulate the object to the goal state without any intermediate sub-goal inputs.

Baseline methods

We compare our method with five baseline methods widely applied to goal-conditioned deformable rearrangement:

  1. Conv-MLP represents a convolution-based neural network architecture, which consists of a group of convolutional layers followed by a multilayer perceptron (MLP) to regress the robot pick and place actions directly from visual observations.

  2. GCTN [3] represents the typical goal-conditioned transporter network for deformable rearrangement. It relies not only on convolutional operations to extract hidden dense features from visual observations, but also leverages the cross-correlation between the dense convolutional features (namely the transporter operation) of the current and goal states to infer the optimal pick and place actions.

  3. Graph Transporter [14] represents an optimized GCTN, where a handcrafted graph structure is used to represent deformable objects and supplement global interaction information to the CNN features during manipulation policy learning.

  4. MLP represents a multilayer perceptron, which learns the manipulation policy directly from the positions of keypoints.

  5. GCN represents a Graph Convolutional Network architecture, which learns the manipulation policy directly from the positions of keypoints. In GCN, each node is updated by aggregating messages from edges in the graph. The node connection relationships (edges) are handcrafted: we define a complete graph, where each node is connected to all other nodes.

All of the above methods are provided with a visual observation of the current object state and a goal visual observation as inputs, and output the SE(2) poses of the robot pick and place actions at each timestep. Note that all of the above methods are trained in the single-task learning scenario.

Fig. 7

We evaluate our framework on eight types of deformable rearrangement tasks in simulation. Each example task shows four frames in the sequence. Experimental results show that our framework captures the dynamics of deformable rearrangement accurately and efficiently

Success rate

We compare the above methods on their success rates in completing a common set of deformable rearrangement tasks. Specifically, we define a task as a success if the robot completes it within twenty pick and place actions, and otherwise as a failure. We evaluate the success rate for each single type of deformable rearrangement task on forty random unseen task instances, and evaluate all methods trained with 10, 100 and 1000 demonstrations separately.
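The following is a short sketch of this evaluation protocol, reusing the hypothetical `rearrange` helper sketched in “Problem formulation”.

```python
# Sketch of the evaluation protocol: a task counts as a success if the
# goal is reached within twenty pick-and-place actions, and the success
# rate is averaged over the unseen task instances. `rearrange` is the
# hypothetical closed-loop helper sketched in "Problem formulation".
def success_rate(task_instances, policy, transition, latent_distance, gamma):
    successes = sum(
        rearrange(I_0, I_g, policy, transition, latent_distance, gamma, max_steps=20)
        for I_0, I_g in task_instances
    )
    return successes / len(task_instances)
```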

The results are shown in Table 4. Overall, GCTN performs better than Conv-MLP, especially for models trained with more demonstrations (e.g. the last column of each type of task), which demonstrates the effectiveness of the transporter architecture for deformable rearrangement. By introducing a handcrafted graph structure to provide global interaction information, Graph Transporter outperforms GCTN, which illustrates the necessity of global interactions in deformable object rearrangement.

Leveraging the dynamic graph representation and the local-GNN-based policy learning model, our method outperforms all baseline methods with the highest success rates in all task cases. Particularly on tasks with more intertwined goal configurations (e.g. the N-shape and Square-shape scenarios of rope manipulation) or tasks with more complex deformable dynamics (e.g. all cloth manipulation tasks), our model shows more significant advantages, achieving higher success rates than the baseline models. It should also be mentioned that our proposed local-GNN outperforms the other keypoint-based models (MLP, GCN). In our local-GNN, we adopt an attention mechanism to learn the interaction relationships (whether there is an edge between every two nodes and the weight of that edge), which leads our model to outperform MLP and GCN (based on handcrafted complete graphs).

Our method also achieves orders of magnitude higher sample efficiency than the baseline models. Particularly, it can be observed in Table 4 that the success rates of our method trained on 10 and 100 demonstrations are much higher than those of the baseline models trained on 10 and 100 demonstrations, and even higher than those of the baselines trained on 1000 demonstrations on almost all involved tasks.

These results demonstrate that, compared with other architectures, our local-GNN architecture captures the dynamics of deformable rearrangement more efficiently and accurately, and therefore serves as a more general and suitable framework for deformable rearrangement tasks. In other words, our model is more expressive than the baseline methods in modeling the deformable rearrangement dynamics. Figure 7 shows several example solutions to goal-conditioned deformable rearrangement tasks generated by our method trained on 1000 demonstrations. Note that Fig. 7 shows only key action frames for simplicity of presentation; all tasks are completed within twenty actions.

Model capacity

Besides having superior expressiveness on the deformable rearrangement problem, our model is much lighter and simpler than the baseline methods. For a detailed comparison, we calculate the FLOPs (floating point operations), model parameters and inference times of our model and GCTN. As explained in “Learning for deformable object rearrangement”, our method consists of a keypoint detector and a local-GNN running in a closed-loop manner. For a fair comparison, the inference time of our framework is therefore calculated as the total time of detecting keypoints and inferring manipulation actions from the keypoints. We run the analysis with an NVIDIA Tesla T4 GPU and an Intel Xeon Platinum 8255C CPU. The results are shown in Table 5. Overall, our framework is dramatically (more than two orders of magnitude) lighter than GCTN in terms of model FLOPs and parameters, which also reduces the inference time of our model to less than half that of GCTN.

The combined strengths of expressiveness, efficiency and simplicity of our model mainly come from the fact that, compared with the convolutional visual features central to many previous methods [3, 16], the graph features highlighted in our model excel at handling the sparse information of deformable dynamics in rearrangement and capture the characteristics of deformable rearrangement more accurately and efficiently. In addition, the numerical calculations on two keypoint sets (each essentially a much smaller subset of the whole pixel set of a visual observation) are much cheaper than convolution operations on two whole visual observations, which further benefits the efficiency of our proposed framework.
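As a reference for how such statistics can be gathered, the following is a minimal PyTorch sketch of parameter counting and inference timing; FLOPs are typically obtained with a separate profiling tool and are omitted here.

```python
# Minimal sketch of how the parameter count and inference time can be
# measured, assuming PyTorch models; FLOPs are typically obtained with
# a separate profiling tool and are omitted here.
import time

import torch


def count_parameters(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


@torch.no_grad()
def mean_inference_time(model: torch.nn.Module, inputs: tuple, n_runs: int = 100) -> float:
    model.eval()
    for _ in range(10):  # warm-up iterations before timing
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure pending GPU work is done
    start = time.perf_counter()
    for _ in range(n_runs):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```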

Table 5 Analysis on the model capacity of different models

Sim-to-real experiments

As described previously, the processing of visual observations (keypoint detection) in our framework is naturally separated from the subsequent planning of manipulation actions (local-GNN updating and action generation). Such a hierarchical design makes the transfer of our framework from simulation to reality much simpler and more robust. Specifically, our framework first learns to rearrange deformable objects from a large quantity of demonstrations in simulation. Since the planning module with local-GNN takes only keypoints as inputs, our framework can then transfer the learned skills from simulation to reality by fine-tuning only the keypoint detector, i.e. to accommodate the sim-to-real gap in the visual observations. In addition, since our planning module is a GNN-based architecture, the length of each input sequence (i.e. the number of keypoints in each dynamic graph) is adjustable without modifying the model architecture, which further simplifies the sim-to-real transfer of our framework.

We evaluate the sim-to-real performance of our framework with a UR5 robot manipulator mounted with a suction cup (Fig. 8). The deformable object (either a rope or a cloth) is placed on the platform in front of the robot. Note that we use a cloth with a plaid pattern for ease of fine-tuning the keypoint detector on real images: the plaid pattern helps us label keypoints in the images manually, which increases the efficiency of fine-tuning. We use a Realsense RGB-D camera fixed on top of the platform, just above the deformable object, to collect visual observations. We apply a model instance of our framework initially trained on 1000 demonstrations in simulation. We then collect 500 real visual observations of the rope and of the cloth each, and use them to fine-tune the sim-trained keypoint detector to reduce the sim-to-real gap, e.g. the errors caused by differences in image style. We set the empirical keypoint numbers for the rope and cloth rearrangement tasks to 5 and 8 respectively, which however could be chosen differently, e.g. proportionally to the complexity of the task or object dynamics.

Fig. 8

Our real experimental setup includes a a UR5 robot, b a Realsense camera, c a suction cup and d a deformable object (rope or cloth)

Fig. 9

The robot rearranges a rope to a U-shape with six pick and place actions

Fig. 10

The robot rearranges a rope to an L-shape with six pick and place actions

We apply the fine-tuned model to twenty random instances each of the rope and cloth rearrangement tasks. The obtained average success rates are 100% and 95% for rope and cloth rearrangement respectively, which are almost comparable to the model performance in simulation (Table 4). Figures 9, 10, 11, 12, 13 and 14 show six example deformable rearrangement tasks from our real experiments. Each task is provided with an initial and a goal visual observation (the leftmost sub-figure of each task). The final visual state of the deformable object is illustrated and compared with the provided goal state (the rightmost sub-figure of each task). Clearly, our model completes all involved real tasks with a final object state very close to the provided goal. At each timestep, we show the selected optimal pick and place actions and their corresponding probability distributions over the object keypoints. A full recording of these experiments can be found in the supplementary video.

Multi-task policy learning

Most previous learning frameworks for deformable rearrangement operate in the single-task learning scenario [3, 26], i.e. one model instance is trained separately to learn the specific manipulation skill for each single type of deformable object. Such task isolation in fact results from their limited efficiency and generalization in handling deformable rearrangement tasks. In consideration of the superior expressiveness and simplicity of our method, we also train our model in the multi-task learning scenario, i.e. we train one single model with demonstrations from all the different types of deformable rearrangement tasks in our dataset together. We expect our model to learn rearrangement skills that generalize over the different types of deformable objects embedded in the multi-task demonstrations.

We train our multi-task model and then evaluate its performance on each type of deformable rearrangement task separately. The results are summarized in Table 6. Compared with the single-task results in Table 4, the multi-task model performs comparably to the models trained in the single-task learning scenario. That is, in addition to single-task deformable rearrangement skills, our model can directly learn multi-task deformable rearrangement skills from demonstrations. This indicates that, on the one hand, the rearrangement skills for different types of deformable tasks do share similarities that can be effectively captured by our model, so that our multi-task model still performs well beyond the single-task learning paradigm. On the other hand, our graph representation strategy is sufficiently expressive for modeling more general deformable rearrangement dynamics, and thus learning deformable rearrangement skills with the graph representation has a strong generalization capability.

Fig. 11

The robot straightens a bent rope with three pick and place actions

Fig. 12

The robot folds the cloth in half with three pick and place actions

Fig. 13

The robot flattens the cloth with three pick and place actions

Fig. 14

The robot folds the cloth diagonally with three pick and place actions

Table 6 The average success rates (%) of multi-task learning in simulation by our proposed framework

Conclusion

We have proposed local-GNN, a light and efficient learning framework for vision-based goal-conditioned deformable object rearrangement tasks. Different from many previous studies which rely mainly on convolutional features from visual observations [3, 17, 18], our method leverages keypoints and dynamic graphs to model the deformable rearrangement dynamics. Extensive experiments demonstrate the performance of the proposed dynamic graph structures in deformable object representation, and show that our local-GNN can be more general and suitable for learning goal-conditioned deformable rearrangement policies. One limitation is that our model learns the manipulation policy directly from keypoints; when the deformable object exhibits self-occlusion, it is hard to detect accurate keypoints from RGB images. In future work, we aim to investigate detecting keypoints from point clouds rather than RGB images, which could broaden the application scenarios of the proposed method.