TransSC: Transformer-based Shape Completion for Grasp Evaluation

Currently, robotic grasping methods based on sparse partial point clouds have attained good grasping performance on various objects, yet they often generate wrong grasp candidates due to the lack of geometric information about the object. In this work, we propose a novel and robust shape completion model (TransSC). This model has a transformer-based encoder to explore more point-wise features and a manifold-based decoder to exploit more object details using a partial point cloud as input. Quantitative experiments verify the effectiveness of the proposed shape completion network and demonstrate that it outperforms existing methods. In addition, TransSC is integrated into a grasp evaluation network to generate a set of grasp candidates. The simulation experiment shows that TransSC improves the grasp generation result compared to existing shape completion baselines. Furthermore, our robotic experiment shows that with TransSC the robot is more successful in grasping objects that are randomly placed on a support surface.


I. INTRODUCTION
Robotic grasp evaluation is a challenging task due to incomplete geometric information from single-view visual sensor data [1]. Many probabilistic grasp planning models have been proposed to address this problem, such as Monte Carlo, Gaussian Process and uncertainty analysis [2]-[4]. However, these analytic methods are always computationally expensive. With the development of deep learning techniques, data-driven grasp detection methods have shown great potential to solve this problem [5]-[8]. They generate many grasp candidates and estimate the corresponding grasp quality, resulting in better grasp performance and generalization. However, most of these methods still rely on the original sensor input, such as 2D (image) and 2.5D (depth map) data. Because of these incomplete pixel-wise and point-wise representations, physical grasping defects arise when the gripper interacts with real object surfaces or edges.
To cope with this limitation, the missing geometric and semantic information of the object needs to be restored or repaired to generate a better grasping interaction. Additional sensor input, such as a tactile sensor, has been introduced to supplement the original vision sensing [9]. However, object uncertainty still exists, and the extra sensor's interference with the object directly affects the final grasping result. Another strategy is to use shape completion to infer the original object shape. However, traditional grasping-based shape completion methods use a high-resolution voxelized grid as the object representation [2], [10], [11], which causes a high memory cost and information loss given the sparsity of the sensory input. To avoid extra sensor cost and obtain complete object information, a novel transformer-based shape completion module is proposed in this work based on an original sparse point cloud. Compared with the traditional convolutional network layer, the transformer has achieved state-of-the-art results in visual recognition and segmentation [12], [13], which helps our shape completion module achieve better performance.
Fig. 1. Overview of our shape completion based grasp pipeline. The upper line is the shape completion module. In this module, a partial point cloud ζp with n points is first input into a transformer-based encoder to extract point-wise and self-attention features, which outputs a latent vector with m dimensions. Then, the latent vector is concatenated with another latent feature from a flat/spatial point seed generator to predict multiple spatial surfaces in the manifold-based decoder. Finally, these surfaces are assembled into a complete point cloud ζc. The bottom line is the grasp evaluation module: the complete point cloud ζc is the input of our grasp detection pipeline PointNetGPD, which computes the grasp quality $Q_i$. The grasp with the highest score, $G_{best}$, is used to compute a collision-free trajectory and is executed in a real robot experiment.
As illustrated in Fig. 1, we present a novel grasping pipeline that uses a sparse point cloud to execute the grasp directly, without converting it into voxel grids during shape completion or transforming it into a mesh during grasp planning. The pipeline consists of two submodules: the transformer-based shape completion module and the grasp evaluation module. In the first module, a non-synthetic partial point cloud dataset based on YCB objects is constructed. Because the data are not produced by randomly cropping objects or by viewing them in a physics simulator, our dataset contains a large amount of real camera and environmental noise, which supports an improved grasping interaction in a real robot environment. Based on this dataset, we propose a novel encoder-decoder point cloud completion network architecture (TransSC), which outperforms several representative baselines on different evaluation metrics. The second module builds on our previous work [8]. We use PointNet to obtain a feature representation of the repaired point cloud and build a grasp detection network to generate and evaluate a set of grasp candidates. The grasp with the highest score is executed in the real robot experiment. The proposed pipeline is validated in a simulation experiment and a robotic experiment, both of which demonstrate that our shape completion pipeline can improve grasping performance significantly.
The main contributions of this paper can be summarized as:
• A large-scale non-synthetic partial point cloud dataset is constructed based on the YCB-Video dataset. As the dataset is based on 3D point cloud data captured by a real RGB-D camera, the noise that comes with it facilitates the generalization of our work.
• ... evaluation and the MoveIt! Task Constructor for motion planning, we demonstrate a robust grasp planning pipeline in which using the shape completion result as input yields a better grasp planning result than using a single view without shape completion.

II. RELATED WORK
Deep Visual Robotic Grasping
With the development of deep learning, many methods for deep visual grasping have been proposed. Borrowing ideas from 2D object recognition, monocular camera images were first used to predict the probability that the input grasps were successful [14]. In [15] and [3], a single RGB-D image of the target object was used to generate a 6D-pose grasp and effective end-effector trajectories. However, these works are not suited to sparse 3D object information and spatial grasps. Compared with the 2D feature representations from images, 3D voxel or point cloud data can provide robotic grasping with more semantic and spatial information. Given a synthetic grasp dataset, [5] transformed scanned 3D object information into Truncated Signed Distance Function (TSDF) representations and passed them into a Volumetric Grasping Network (VGN) to directly output grasp quality, gripper orientation and gripper width at each voxel. [6] designed a special grasp proposal module that defines anchors of grasp centers and related 3D grid corners to predict a set of 6D grasps from a partial point cloud. [7] used handcrafted projection features on a normalized point cloud to construct a CNN-based grasp quality evaluation model. In our previous work [8], we used PointNet [16] to extract raw point cloud features and built a grasp evaluation network, which achieves good performance in robotic grasping experiments.
Grasp-based Shape Completion
For robotic grasping, the key challenge is to recognize objects in 3D space and avoid potential perception uncertainty. When the RGB-D camera captures an object from a particular viewpoint, the 3D information of the object is incomplete, which means that a lot of semantic and spatial information is missing. This degrades the quality of the grasps generated later and leads to wrong grasping poses.
Recently, some researchers have proposed using shape completion to enable robotic grasping. In [10], the object observed by 2.5D range sensors was first converted to occupancy voxel grid data. The voxelized data were then input into a CNN to produce a high-resolution voxel output. Furthermore, the completion result was transformed into a mesh and then loaded into GraspIt! [17] to generate a grasp. [2] used dropout layers to modify the network, which enabled the prediction of shape samples at runtime. Meanwhile, Monte Carlo sampling and probabilistic grasp planning were used to generate grasp candidates. As traditional analytic grasping methods are computationally expensive, [11] combined the shape completion of a voxel grid with a data-driven grasp planning strategy (GQCNN) [18] to propose a structure called FC-GQCNN, where synthetic object shapes were obtained from a top-down physics simulator and grasps were generated from depth images.
In conclusion, traditional grasp shape completion methods mainly voxelize the 2.5D data into occupancy grids or distance fields to train a convolutional network. However, these high-resolution voxel grids entail a high memory cost. Moreover, detailed semantic information is often lost as an artifact of discretization, which prevents the neural network from learning meaningful geometric features of objects. To obtain more complete geometric features and retain the original object information, a transformer-based shape completion module is introduced in our proposed method. Without converting the observed partial point cloud into a voxel grid or mesh, our completion method outputs a repaired point cloud at arbitrary resolution and outperforms existing methods. Furthermore, PointNet [16] is introduced for the representation learning of the repaired point cloud, and a grasp evaluation network is constructed to generate grasp candidates. As a result, our grasp evaluation framework also achieves better grasping performance than the original framework without shape completion.

III. PROBLEM FORMULATION
We consider a setup consisting of a robotic arm with a parallel-jaw gripper, an RGB-D camera and an object to be grasped. We also assume that the RGB-D camera can capture the depth map of an object and convert it to a 2.5D partial point cloud $P \in \mathbb{R}^{N \times 3}$. For simplicity, all spatial quantities are in camera coordinates.
Given a gripper configuration $C$ and camera observation $P$, our goal is first to learn an encoder-decoder point cloud completion network that repairs an observed 2.5D partial point cloud $P \in \mathbb{R}^{N \times 3}$, turning it into a complete 3D point cloud $P_c \in \mathbb{R}^{N \times 3}$. After that, a grasp evaluation network based on $P_c$ is trained to predict a set of grasp candidates $G_i$ and compute the corresponding grasp quality $Q_i$. The grasp with the highest score, $G_{best}$, is executed in the real robot experiment.

IV. ROBOTIC GRASPING EVALUATION VIA SHAPE COMPLETION AND GRASP DETECTION
A. Dataset Construction
Traditional shape completion methods use synthetic CAD models from the ShapeNet [19] or ModelNet [20] datasets to generate partial and corresponding complete point cloud data, but these synthetic data contain little noise from the camera and the robotic environment. In order to simulate the real point cloud data distribution, we build a shape completion dataset from the YCB-Video Dataset [21]. The non-synthetic RGB-D video frames (∼133,827 frames) in the YCB-Video Dataset are chosen first. Since most consecutive frames vary insignificantly, a pre-processed image dataset is obtained by subsampling the frames at an interval of 5. Meanwhile, to cover distinguishable shapes with different levels of detail, 18 objects are chosen from the YCB-Video dataset. The ground-truth point cloud of each of the 18 objects is created by farthest point sampling (FPS) of 2048 points on the object model. Instead of randomly sampling or cropping complete point clouds on the unit sphere to get partial point clouds, we load the RGB-D images and the related object label images in the pre-processed dataset and compute the matching partial point clouds using the camera intrinsic parameters. To approximate the distribution of point cloud data of real objects and retain the semantic information, a large amount of camera and environmental noise is retained, though a radius threshold is used to remove some outliers. For the convenience of network training, the partial point clouds are also unified to a size of 2048 points by FPS or by replicating points. To enable an accurate comparison with existing baselines, the center of the partial point cloud of each object is transformed to the canonical center of the ground-truth point cloud using pose information.
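The step from a depth image and a label image to a partial point cloud can be sketched with the standard pinhole camera model. This is a minimal illustration, not the dataset-generation code used in this work: the function names, the `depth_scale` parameter and the median-centered radius filter are assumptions made for the example.

```python
import numpy as np

def backproject_object(depth, label, obj_id, fx, fy, cx, cy, depth_scale=1.0):
    """Return an (M, 3) point cloud in camera coordinates for one labelled object.

    depth: (H, W) depth image, label: (H, W) integer object-label image.
    fx, fy, cx, cy: pinhole intrinsics. All names are illustrative.
    """
    # Pixel rows (v) and columns (u) that belong to the object and have valid depth.
    v, u = np.nonzero((label == obj_id) & (depth > 0))
    z = depth[v, u].astype(np.float64) / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def remove_far_points(points, radius):
    """Assumed outlier rule: drop points farther than `radius` from the median point."""
    center = np.median(points, axis=0)
    keep = np.linalg.norm(points - center, axis=1) <= radius
    return points[keep]
```

Only gross outliers are removed in this sketch, so most of the sensor noise stays in the cloud, in line with the goal of keeping the real data distribution.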
Finally, more than 70,000 partial point clouds are collected in our dataset. Compared to synthetic point cloud datasets, our dataset also does well at preserving the real point cloud distribution of occluded objects.
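The resampling to a fixed size of 2048 points described above can be sketched as follows. This is a plain NumPy version of greedy farthest point sampling. The choice of the first point and the use of random replication for clouds that are too small are assumptions, since the text only states "FPS or replicating points".

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: return the indices of k points that are mutually far apart."""
    selected = np.empty(k, dtype=np.int64)
    selected[0] = start
    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        selected[i] = int(np.argmax(min_dist))
        dist = np.linalg.norm(points - points[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, dist)
    return selected

def resize_cloud(points, target=2048, seed=0):
    """Bring a cloud to exactly `target` points: FPS if too large, replication if too small."""
    n = points.shape[0]
    if n >= target:
        return points[farthest_point_sampling(points, target)]
    rng = np.random.default_rng(seed)
    extra = rng.integers(0, n, size=target - n)  # indices of points to duplicate
    return np.concatenate([points, points[extra]], axis=0)
```

The loop costs O(N·k) distance evaluations, which is acceptable for clouds of a few thousand points.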

B. Transformer-based Encoder Module
As shown in Fig. 2, we compare our proposed encoder module with several common competitive methods. The Multilayer Perceptron (MLP) is a simple baseline architecture for extracting point features. This method maps each point into different dimensions and extracts the maximum value from the final K dimensions to form a latent vector. A simple generalization of the MLP is to combine semantic features from a low-level dimension with those of a high-level dimension. The MSF (Multi-scale Fusion) [22] module inflates the dimension of the latent vector from 1024 to 1408 to obtain semantic features from different dimensions. To improve the performance of the feature extractor, L-GAN [23] proposed an appropriate use of a max-pooling layer. Concatenated Multiple Layer Perception (CMLP) [24] max-pools the output of the last k layers so that multi-scale feature vectors are concatenated directly. An overview of our proposed Transformer-based multi-layer perception (TMLP) module is shown in Fig. 2(d). Without an extra skip connection structure or a max-pooling operation over different layers, the Multi-head Self-attention (MHSA) [25] module is introduced to replace the traditional convolutional layer [128 × 256 × 1].
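The difference between a single max-pooled latent vector and a multi-scale latent vector can be made concrete with a small NumPy sketch. The layer widths (64, 128, 256, 1024) and the random weights are assumptions for illustration. They are chosen so that pooling the last three layers gives 128 + 256 + 1024 = 1408 dimensions, matching the sizes quoted above.

```python
import numpy as np

def shared_mlp_features(points, widths=(64, 128, 256, 1024), seed=0):
    """Apply the same linear map + ReLU to every point, layer by layer.

    Returns the per-point feature maps of all layers, each of shape (N, width).
    """
    rng = np.random.default_rng(seed)
    feats, x = [], points
    for w in widths:
        weight = rng.normal(0.0, 0.1, size=(x.shape[1], w))
        x = np.maximum(x @ weight, 0.0)  # shared weights across points, then ReLU
        feats.append(x)
    return feats

def global_max_latent(feats):
    """MLP-style latent vector: max over points of the last layer only."""
    return feats[-1].max(axis=0)

def multi_scale_latent(feats, last_k=3):
    """Multi-scale latent vector: max-pool the last k layers and concatenate them."""
    return np.concatenate([f.max(axis=0) for f in feats[-last_k:]])
```

Because the pooling is a maximum over the point axis, both latent vectors are unchanged when the input points are reordered.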
MHSA aims to transform (encode) the input point features into a new feature space, which contains point-wise and self-attention features. Fig. 2(e) shows the simple MHSA architecture used in TMLP, which includes two sub-layers. In the first sub-layer, the number of heads is set to 8 and the input feature dimension for each point is 128. Unlike in natural language processing (NLP) problems, the 128-dimensional feature vector $A_{in} \in \mathbb{R}^{2048 \times 128}$ enters the multi-head attention module directly without positional encoding, because each point in the point cloud already has unique x-y-z coordinates. The output feature $Z$ is formed by concatenating the outputs of the attention heads. A residual structure is also used to add and normalize the output feature $Z$ with $A_{in}$. This process can be formulated as follows:
$$Z = \mathrm{Concat}\big(SA_1(A_{in}), \ldots, SA_8(A_{in})\big)\, W_0, \qquad A_{out} = \mathrm{Norm}(A_{in} + Z),$$
where $SA_i$ represents the i-th self-attention layer, each of which has the same output dimension as the input feature vector $A_{in}$, $W_0$ is the weight of the linear layer, and $A_{out}$ represents the output point-wise features of the first sub-layer.
The second sub-layer is the feed-forward module, which is a fully connected network. The point-wise features $A_{out}$ are processed through two linear transformations and one ReLU activation. Furthermore, a residual connection is also used to fuse and normalize the output features. Finally, we obtain the MHSA module output $FF_{out} \in \mathbb{R}^{2048 \times 128}$ as:
$$FF_{out} = \mathrm{Norm}\big(A_{out} + \mathrm{ReLU}(A_{out} W_1 + b_1)\, W_2 + b_2\big),$$
where $W_1$, $W_2$ and $b_1$, $b_2$ represent the weights and biases of the corresponding linear transformations, respectively.
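The two sub-layers can be sketched in NumPy as below. This is an illustrative forward pass under several assumptions that the text does not fix: scaled dot-product attention inside each head, layer normalization without learnable parameters as the "normalize" step, a hidden width of 256 in the feed-forward module, and no dropout. The helper `init_params` and all weight names other than W_0, W_1, W_2, b_1, b_2 are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def init_params(d=128, num_heads=8, d_ff=256, seed=0):
    """Random weights; each head keeps the input width d, as described in the text."""
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.05, size=shape)
    return {"wq": w(num_heads, d, d), "wk": w(num_heads, d, d), "wv": w(num_heads, d, d),
            "w0": w(num_heads * d, d),
            "w1": w(d, d_ff), "b1": np.zeros(d_ff),
            "w2": w(d_ff, d), "b2": np.zeros(d)}

def mhsa_block(a_in, p):
    """First sub-layer (multi-head self-attention) followed by the feed-forward sub-layer."""
    d = a_in.shape[1]
    heads = []
    for wq, wk, wv in zip(p["wq"], p["wk"], p["wv"]):
        q, k, v = a_in @ wq, a_in @ wk, a_in @ wv
        attn = softmax(q @ k.T / np.sqrt(d))       # (N, N) point-to-point attention
        heads.append(attn @ v)                     # SA_i(A_in), shape (N, d)
    z = np.concatenate(heads, axis=1) @ p["w0"]    # Z = Concat(...) W_0
    a_out = layer_norm(a_in + z)                   # residual + normalize
    hidden = np.maximum(a_out @ p["w1"] + p["b1"], 0.0)
    return layer_norm(a_out + hidden @ p["w2"] + p["b2"])
```

Since no positional encoding is added, permuting the input points simply permutes the output rows in the same way, which is the property a point cloud encoder needs.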

C. Manifold-based Decoder Module
Inspired by AtlasNet [26], a manifold-based decoder module is designed to predict a complete point cloud from partial point cloud features. As shown in Fig. 3, a complete point cloud can be assumed to consist of multiple sub-surfaces. Therefore, we concentrate on obtaining each sub-surface and then gather and assemble the sub-surfaces to form the final complete point cloud. To obtain each sub-surface, the output of a point seed generator is concatenated with the global feature vector $P_g \in \mathbb{R}^{2048 \times 1024}$ from the encoder, where the point initialization values are computed by a flat ($f$) or spatial ($g$) sampler. As the coordinate values of the ground-truth point cloud are limited to the range [-1, 1], the point initialization values are also limited to this range. After that, the concatenated feature vector $P_{concat} \in \mathbb{R}^{2048 \times M}$ ($M$ = 1026 or 1027) is input into K convolutional layers, where all sampled 2D or 3D points are mapped to 3D points on each sub-surface. In our decoder, the number of sub-surfaces is set to 16. Unlike voxel-based shape completion methods, our decoder module can produce the final completion result at an arbitrary resolution.
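The construction of the decoder input can be sketched as follows. The sketch assumes that the 2048 output points are split evenly over the 16 sub-surfaces and that seeds are drawn from Uniform(0, 1). The single linear layer with tanh in `decode_surfaces` is only a stand-in for the K convolutional layers, and all function names are hypothetical.

```python
import numpy as np

def build_decoder_inputs(global_feature, num_surfaces=16, num_points=2048,
                         seed_dim=2, rng=None):
    """Concatenate point seeds with the (tiled) global feature, one array per sub-surface.

    seed_dim=2 corresponds to the flat sampler f, seed_dim=3 to the spatial sampler g.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    per_surface = num_points // num_surfaces
    inputs = []
    for _ in range(num_surfaces):
        seeds = rng.uniform(0.0, 1.0, size=(per_surface, seed_dim))
        tiled = np.broadcast_to(global_feature, (per_surface, global_feature.shape[0]))
        inputs.append(np.concatenate([seeds, tiled], axis=1))  # (per_surface, seed_dim + 1024)
    return inputs

def decode_surfaces(inputs, weights):
    """Toy decoder: one linear map + tanh per sub-surface, then merge all surfaces."""
    surfaces = [np.tanh(x @ w) for x, w in zip(inputs, weights)]  # coordinates in [-1, 1]
    return np.concatenate(surfaces, axis=0)
```

Changing `num_points` changes only the number of sampled seeds, not the network weights, which is how a decoder of this kind can output a cloud at any resolution.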
Evaluation Metrics
To evaluate our shape completion results, we use two permutation-invariant metrics, Chamfer Distance (CD) and Earth Mover's Distance (EMD) [27]. Given two arbitrary point clouds $S_1$ and $S_2$, CD measures the average distance between each point in one point cloud and its nearest point in the other point cloud:
$$CD(S_1, S_2) = \frac{1}{|S_1|}\sum_{x \in S_1}\min_{y \in S_2}\|x - y\|_2 + \frac{1}{|S_2|}\sum_{y \in S_2}\min_{x \in S_1}\|x - y\|_2. \tag{6}$$
Earth Mover's Distance considers two point sets $S_1$ and $S_2$ of equal size and is defined as:
$$EMD(S_1, S_2) = \min_{\phi: S_1 \to S_2}\frac{1}{|S_1|}\sum_{x \in S_1}\|x - \phi(x)\|_2. \tag{7}$$
CD has been widely used in most shape completion tasks because it is efficient to compute. However, EMD is chosen as our completion loss because CD is blind to some visual inferiority and easily ignores details [23]. With $\phi: S_1 \to S_2$ being a bijection, EMD solves the assignment problem in which one point cloud is mapped onto the other.
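A small numerical example helps to see why CD can miss differences that EMD penalizes. The sketch below assumes the common averaged, non-squared form of CD (the sum of the two directed mean nearest-neighbor distances) and computes EMD exactly by brute force over all bijections. This is feasible only for tiny point sets and is not how the loss would be computed during training.

```python
import numpy as np
from itertools import permutations

def pairwise_distances(s1, s2):
    """Euclidean distance matrix of shape (len(s1), len(s2))."""
    return np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=2)

def chamfer_distance(s1, s2):
    """Mean nearest-neighbor distance from s1 to s2 plus the same from s2 to s1."""
    d = pairwise_distances(s1, s2)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def earth_movers_distance_bruteforce(s1, s2):
    """Exact EMD for equal-size sets: minimum mean distance over all bijections."""
    n = len(s1)
    if n != len(s2) or n > 8:
        raise ValueError("brute force needs equal-size sets with at most 8 points")
    d = pairwise_distances(s1, s2)
    best = min(sum(d[i, perm[i]] for i in range(n)) for perm in permutations(range(n)))
    return best / n

# Two sets that cover the same locations with different densities.
s1 = np.array([[0.0, 0, 0], [0.0, 0, 0], [4.0, 0, 0]])
s2 = np.array([[0.0, 0, 0], [4.0, 0, 0], [4.0, 0, 0]])
cd = chamfer_distance(s1, s2)                   # every point has an exact neighbor
emd = earth_movers_distance_bruteforce(s1, s2)  # one point must travel 4 units
```

Here CD is zero because every point has an exact nearest neighbor in the other set, while the bijection in EMD forces one point to move by 4 units, giving 4/3.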

D. PointNetGPD: Grasping Detection Module
Given the complete point cloud from the previous steps, we feed it into a geometry-based grasp pose generation algorithm (GPG) [29], which outputs a set of grasp proposals $G_i$. We then transform $G_i$ into the gripper coordinate system and use the points inside the gripper as the input of PointNetGPD. The output grasp is then sent to the MoveIt! Task Constructor [30] to plan a feasible trajectory for the pick-and-place task.
PointNetGPD is trained on a grasp dataset generated using reconstructed YCB object meshes, and it evaluates the quality of the input grasp. The grasp candidates in the grasp dataset are all collision-free with respect to the target object. As a result, the grasp evaluation network assumes that none of the input grasp candidates collides with the object. If the object is partly occluded due to the camera viewpoint, the current geometry-based grasp proposal algorithm can generate grasp candidates that collide with the object. Using a complete point cloud therefore helps the grasp candidate generation algorithm produce grasp sets that do not collide with the graspable object. Fig. 4 compares the grasp generation results of GPG [29] with and without point cloud completion, where Fig. 4(b) shows a candidate generated using the partial point cloud and Fig. 4(c) shows a grasp candidate generated using the complete point cloud. We can see that the grasp in Fig. 4(b) collides with the real object, while Fig. 4(c) avoids generating such a grasp.
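The effect of occlusion on collision checking can be illustrated with a toy test that asks whether any point of the cloud lies inside the finger volumes of a parallel gripper. This is not the GPG or PointNetGPD implementation. The axis convention (x = approach, y = closing direction), the box-shaped fingers and all dimensions are assumptions chosen for the example.

```python
import numpy as np

def to_gripper_frame(points, rotation, translation):
    """Express camera-frame points in the gripper frame.

    rotation: (3, 3) matrix whose columns are the gripper axes in the camera frame,
    translation: (3,) gripper origin. Computes R^T (p - t) for every row.
    """
    return (points - translation) @ rotation

def collides(points, rotation, translation, width=0.08,
             finger_thickness=0.01, finger_length=0.04, finger_height=0.02):
    """True if any point falls inside one of the two box-shaped fingers."""
    local = to_gripper_frame(points, rotation, translation)
    x, y, z = local[:, 0], local[:, 1], local[:, 2]
    in_length = (x >= 0.0) & (x <= finger_length)
    in_height = np.abs(z) <= finger_height / 2
    in_finger = (np.abs(y) >= width / 2) & (np.abs(y) <= width / 2 + finger_thickness)
    return bool(np.any(in_length & in_height & in_finger))

# The visible surface lies between the fingers; the occluded part sits under one finger.
partial = np.array([[0.02, 0.0, 0.0]])
complete = np.vstack([partial, [[0.02, 0.045, 0.0]]])
pose_r, pose_t = np.eye(3), np.zeros(3)
ok_on_partial = not collides(partial, pose_r, pose_t)
ok_on_complete = not collides(complete, pose_r, pose_t)
```

The same candidate passes the check on the partial cloud and fails on the complete cloud, which mirrors the situation shown in Fig. 4(b) and Fig. 4(c).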

A. Quantitative Evaluation of Proposed Shape Completion Network
Training and Implementation Details
To evaluate model performance and reduce training time, 8 categories of different objects in our dataset are chosen to train the shape completion model. The training set and validation set are split in a ratio of 0.8:0.2. We implement our network in PyTorch. All the building modules are trained using the Adam optimizer with an initial learning rate of 0.0001 and a batch size of 16. All the parameters of the network are initialized using a Gaussian sampler. Batch Normalization (BN) and ReLU activation units are employed throughout the encoder and decoder modules, except for the final tanh layer that produces the point coordinates, and dropout is used in the MHSA module to suppress overfitting.
1) Comparison with Existing Methods: In this subsection, we compare our method against several representative baselines that are also used for point cloud completion, including AtlasNet [26] and MSN [28]. The Oracle method means that we randomly resample 2048 points from the original surface of the different YCB objects. The corresponding EMD and CD between the resampled point cloud and the ground-truth point cloud provide an upper bound for the performance. The comparison results are shown in Table I and Table II. Our method is developed into two models based on the different point seed generators (f/g) in the decoder module. It can be seen that our method outperforms the other methods on most objects in terms of both EMD and CD. For the same completion loss, our (flat) model achieves an average improvement of about 15% in EMD with respect to the latest MSN (vanilla) model. Since our dataset contains much noise from the camera and the environment, we found that fusing the output completion result with the original point cloud makes the performance significantly worse, which can be seen from the comparison of MSN (fusion) and MSN (vanilla).
It also implies that our model is robust enough, which is conducive to rapid deployment in real robot experiments.Furthermore, compared with ideal results from the Oracle method, it demonstrates that point cloud completion remains an arduous task to solve.
2) Ablation Studies: To comprehensively evaluate our proposed shape completion model, in this section we provide a series of ablation studies on our YCB-based dataset. The effectiveness of each special module in our model is analysed as follows. We first compare our transformer-based encoder module with other representative encoder modules under the same number of convolutional/transformer layers and the same object inputs. As shown in Table III, our encoder has a better result overall, though CMLP achieves a good result on the completion of the mug. For the case in which the point seed in the decoder is flat, we further analyze the influence of different point seed distributions and surface numbers in Table IV and Table V. We can see that both the Uniform and Gaussian sampling methods achieve better results with parameters (0, 1). We choose Uniform(0, 1) in our model, as it achieves the best results. As with the weight parameters of a neural network, the initial point values should not be close to zero, since such an initialization gives the worst result. As illustrated in Table V, the overall model performance improves as the number of sub-surfaces increases. However, the improvement of the completion results is limited when the number is above 16.
3) Visualization Analysis: Fig. 5 shows the visualized shape completion results of TransSC. To facilitate visual analysis, the input partial point cloud of each object is first preprocessed to remove noisy data from the camera and the environment. It can be seen that the geometric loss of the input point cloud in our dataset comes from the change of the camera viewpoint and the occlusion by other objects, which poses a big challenge for our model. The output results in the canonical pose show that our model works well on both simple and complex objects. Moreover, our model can generate realistic structures and details such as the mug handle, bowl edge and bottle mouth. To enable robotic grasping, another shape completion model based on the arbitrary ground-truth pose is retrained by transforming the ground-truth pose to the original pose of the input partial point cloud, and its completion results are also shown in Fig. 5. The arbitrary-pose output is not as good as the canonical output, but it still restores the overall shape of each object well. This also demonstrates that completing objects in arbitrary poses in the real environment remains a formidable task.

B. Simulation Grasp Experiments with complete shapes
We use GraspIt! [17] to evaluate the quality of shape completion, similar to [10]. First, the Alpha shapes algorithm [31] is used for surface reconstruction of the completed object. The output 3D mesh is then imported into the GraspIt! simulator to calculate grasps. For a fair comparison, we also use the Barrett Hand to generate grasps. After the grasp generation is finished, we remove the completed object and import the ground-truth object at the same place. The Barrett Hand is then moved back by 20 cm along the approach direction and approaches the object until the gripper detects a collision or reaches the calculated grasp pose. Furthermore, we adjust the gripper to the calculated grasp joint angles and perform the autograsp function in GraspIt! to ensure that the gripper contacts the object surface or reaches the joint limit. The joint angle difference and position difference are then recorded. We use four objects (bleach cleanser, cracker box, pitcher base and power drill) from the YCB object set and calculate 100 grasps for each object in our experiment. We compare the average difference of joint angle and grasp pose from our shape completion model to that of Laplacian smoothing in Meshlab (Partial), mirroring completion [32] (Mirror), a RANSAC-based approach [33] and voxel-based completion [10]. Note that we use two different models, canonical and arbitrary. The canonical model means that all the training data are transformed into the same object coordinate system, and the arbitrary model means that all the training data are transformed into the camera's coordinate system. Although Fig. 5 shows that the canonical model has a better shape completion result, it requires a 6D pose of the target object if we want to map the complete point cloud into the real environment. To avoid the complication of adding a 6D pose estimation module and to enable real robot experiments, the arbitrary model is also trained. The simulation result is shown in Table VI. It can be seen that Ours (canonical pose) achieves the best simulation grasping performance, outperforming the other completion types. Ours (arbitrary pose) also obtains a good simulation result, though its average joint error is slightly larger than that of the RANSAC-based and voxel-based methods. Moreover, the average grasp pose errors of both models are significantly smaller than those of the other methods. The larger joint error and lower pose error of Ours (arbitrary pose) indicate that the corresponding completed object is slightly larger than the ground-truth object. The average difference between the two models also demonstrates that a perfect shape completion in an arbitrary pose is much harder than in a canonical pose.
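The reason the canonical model needs a 6D pose at deployment time can be seen from the rigid transform that maps a completed cloud from the object frame into the camera frame. The sketch below assumes the pose is given as a rotation matrix R and a translation vector t with p_cam = R p_obj + t. The function names are hypothetical.

```python
import numpy as np

def canonical_to_camera(points_obj, rotation, translation):
    """Map an (N, 3) cloud from the canonical object frame to the camera frame."""
    return points_obj @ rotation.T + translation

def camera_to_canonical(points_cam, rotation, translation):
    """Inverse mapping, as used to bring observed partial clouds into the canonical frame."""
    return (points_cam - translation) @ rotation
```

A model trained in the camera frame (the arbitrary model) skips this step entirely, at the price of a harder completion problem.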

C. Robotic Experiments
To evaluate the performance improvement from using a complete point cloud for robotic grasping, we choose six YCB objects to test the grasping success rate. The robot used for evaluation is a UR5 robot arm equipped with a Robotiq 3-finger gripper. The vision sensor is an industrial 3D camera from Mechmind, which acquires a high-quality partial point cloud. The six selected objects are listed in Table VII. We select these objects because they are typical objects for which good grasp candidates may not be generated without shape completion. Other objects, such as the banana or the marker, are quite simple and small, so shape completion brings only a minor improvement to the grasping result.
For the six selected objects, we perform grasp evaluation with two different methods: PointNetGPD with and without shape completion. We run the robot experiment by randomly placing the object on the table and grasping it ten times, and then calculate the success rate. The experiment result is shown in Table VII. We can see that, for all six objects, the grasp success rate of PointNetGPD with TransSC is higher than or equal to that of the original method. The low success rate on the power drill for both methods occurs because the contact area is too slippery when the robot tries to grasp the head of the power drill. The failures of PointNetGPD with the observed point cloud as input mainly stem from the limited camera viewpoint, which leads GPG to generate grasp candidates that sink into the object. An example of this situation is shown in Fig. 4. This is strong evidence that our shape completion model can improve the grasp success rate for some particular objects.

VI. CONCLUSION AND FUTURE WORK
We present a novel transformer-based shape completion network (TransSC), which is robust to sparse and noisy point cloud input. A transformer-based encoder and a manifold-based decoder are designed in our network, which allows our model to achieve a good completion result and outperform other representative methods. In addition, TransSC can be easily embedded into a grasp evaluation pipeline and improves grasping performance significantly.
The lack of geometric information on the object in our dataset is not only due to the change of the camera viewpoint but also the occlusion of different objects.Thus, TransSC could also achieve shape completion for occluded objects.In future work, our goal is to integrate semantic segmentation into our shape completion pipeline to make the robot grasp objects better in a cluttered environment.

Fig. 2. Illustration of various encoder structures for point cloud completion. (a) is a simple multilayer perceptron (MLP) structure. (b) is a multi-scale fusion (MSF) module, which can fuse features from different layers directly. (c) is concatenated multiple layer perception (CMLP), which also concatenates multi-dimensional latent features, while the max pooling operation is used to extract latent features further. (d) shows our Transformer-based multiple layer perception (TMLP) module, which integrates the Multi-head Self-attention (MHSA) module into the MLP structure. (e) depicts the architecture of the MHSA module.

Fig. 3. Illustration of the decoder structure for point cloud completion. The feature vector with m dimensions from the encoder is first concatenated with the latent feature from a special point seed generator f or g. Then three convolutional layers are used as the backbone to extract features and form the different manifold-based surfaces. Finally, these surfaces are gathered and assembled into a complete point cloud.

Fig. 4. Comparison of grasp candidates generated using GPG [29]. (a) RGB image showing the example environment, (b) grasp generated with the partial point cloud, (c) grasp generated with the complete point cloud.

Fig. 5. Shape completion result using TransSC. The canonical pose result is trained under a fixed point cloud coordinate system, while the arbitrary pose result is trained under the camera perspective. In the robot experiment, the arbitrary pose training result is used to generate grasps.

TABLE I: COMPARISON OF EARTH MOVER'S DISTANCE IN DIFFERENT POINT CLOUD COMPLETION MODELS

TABLE II: COMPARISON OF CHAMFER DISTANCE IN DIFFERENT POINT CLOUD COMPLETION MODELS

TABLE III: COMPARISON OF EMD AND CD FROM DIFFERENT ENCODER STRUCTURES

TABLE IV: COMPARISON OF AVERAGE EMD AND CD FROM DIFFERENT POINT