On robot grasp learning using equivariant models

Real-world grasp detection is challenging due to the stochasticity in grasp dynamics and the noise in hardware. Ideally, the system would adapt to the real world by training directly on physical systems. However, this is generally difficult due to the large amount of training data required by most grasp learning models. In this paper, we note that the planar grasp function is SE(2)-equivariant and demonstrate that this structure can be used to constrain the neural network used during learning. This creates an inductive bias that can significantly improve the sample efficiency of grasp learning and enable end-to-end training from scratch on a physical robot with as few as 600 grasp attempts. We call this method Symmetric Grasp learning (SymGrasp) and show that it can learn to grasp "from scratch" in less than 1.5 h of physical robot time. This paper represents an expanded and revised version of the conference paper Zhu et al. (2022).


Introduction
Grasp detection localizes good grasp poses in a scene directly from raw visual input (e.g., RGB or depth images) using machine learning. The learning-based approach generalizes to novel objects. This is in contrast to classical model-based methods that attempt to reconstruct the geometry and the pose of objects in a scene and then reason geometrically about how to grasp those objects.
Most current grasp detection models are data-driven, i.e., they must be trained using large offline datasets. For example, [30] trains on a dataset consisting of over 7M simulated grasps, [2] trains on over 2M simulated grasps, [26] trains on grasp data drawn from over 6.7M simulated point clouds, and [41] trains on over 700k simulated grasps. Some models are trained using datasets obtained via physical robotic grasp interactions. For example, [33] trains on a dataset created by performing 50k grasp attempts over 700 hours, [17] trains on over 580k grasp attempts collected over the course of 800 robot hours, and [1] trains on a dataset obtained by performing 27k grasps over 120 hours.
Such reliance on large datasets necessitates either learning in simulation or using significant amounts of robot time to generate data, motivating the desire for a more sample efficient grasp detection model, i.e., a model that can achieve good performance with a smaller dataset. In this paper, we propose a novel grasp detection strategy that improves sample efficiency significantly by incorporating equivariant structure into the model. We term our strategy Symmetric Grasp learning (SymGrasp). Our key observation is that the target grasp function (from images onto grasp poses) is SE(2)-equivariant. That is, rotations and translations of the input image should correspond to the same rotations and translations of the detected grasp poses at the output of the function. In order to encode SE(2)-equivariance in the target function, we constrain the layers of our model to respect this symmetry. Compared with conventional grasp detection models that must be trained using tens of thousands of grasp experiences, the equivariant structure we encode into the model enables us to achieve good grasp performance after only a few hundred grasp attempts.
This paper makes several key contributions. First, we recognize that the grasp detection function from images to grasp poses is an SE(2)-equivariant function. Then, we propose a neural network model using equivariant layers to encode this property. Finally, we introduce several algorithmic optimizations that enable us to learn to grasp online using a contextual bandit framework. Ultimately, our model is able to learn to grasp opaque objects (using depth images) and transparent objects (using RGB images) with a good success rate after only approximately 600 grasp trials, i.e., about 1.5 hours of robot time. Although the model we propose here is only for 2D grasping (i.e., we only detect top-down grasps rather than all six dimensions as in 6-DOF grasp detection), the sample efficiency is still impressive, and we believe the concepts could be extended to higher-DOF grasp detection models in the future.
These improvements in sample efficiency are important for several reasons. First, since our model can learn to grasp in only a few hundred grasp trials, it can be trained easily on a physical robotic system. This greatly reduces the need to train on large datasets created in simulation, and it therefore reduces our exposure to the risks associated with bridging the sim2real domain gap: we can simply do all our training on physical robotic systems. Second, since we are training on a small dataset, it is much easier to learn on-policy rather than off-policy, i.e., we can train using data generated by the policy being learned rather than with a fixed dataset. This focuses learning on areas of state space explored by the policy and makes the resulting policies more robust in those areas. Finally, since we can learn efficiently from a small number of experiences, our policy has the potential to adapt relatively quickly at run time to physical changes in the robot sensors and actuators.
2 Related Work

Equivariant convolutional layers
Equivariant convolutional layers incorporate symmetries into the structure of convolutional layers, allowing them to generalize across a symmetry group automatically. This idea was first introduced as G-Convolution [4] and Steerable CNN [6]. E2CNN is a generic framework for implementing E(2) Steerable CNN layers [49]. In applications such as dynamics [44,48] and reinforcement learning [28,43,46,47], equivariant models demonstrate improvements over traditional approaches.

Sample efficient reinforcement learning
Recent work has shown that data augmentation using random crops and/or shifts can improve the sample efficiency of standard reinforcement learning algorithms [21,23]. It is possible to improve sample efficiency even further by incorporating contrastive learning [31], e.g., CURL [24]. The contrastive loss enables the model to learn an internal latent representation that is invariant to the type of data augmentation used. The FERM framework [52] applies this idea to robotic manipulation and is able to learn to perform simple manipulation tasks directly on physical robotic hardware. The equivariant models used in this paper are similar to data augmentation in that the goal is to leverage problem symmetries to accelerate learning. However, whereas data augmentation and contrastive approaches require the model to learn an invariant or equivariant encoding, the equivariant model layers used in this paper enforce equivariance as a prior encoded in the model. This simplifies the learning task and enables our model to learn faster (see Section 6).

Grasp detection
In grasp detection, the robot finds grasp configurations directly from visual or depth data. This is in contrast to classical methods which attempt to reconstruct object or scene geometry and then do grasp planning. See [34] for a review of this topic. 2D Grasping: Several methods are designed to detect grasps in 2D, i.e., to detect the planar position and orientation of grasps in a scene based on top-down images. A key early example of this was DexNet 2.0, which infers the quality of a grasp centered and aligned with an oriented image patch [26]. Subsequent work proposed fully convolutional architectures, thereby enabling the model to quickly infer the pose of all grasps in a (planar) scene [9,22,29,37,53] (some of these models infer the z coordinate of the grasp as well). 3D Grasping: There is much work in 3D grasp detection, i.e., detecting the full 6-DOF position and orientation of grasps based on truncated signed distance function (TSDF) or point cloud input. A key early example of this was GPD [42], which inferred grasp pose based on point cloud input. Subsequent work has focused on improving grasp candidate generation in order to improve efficiency, accuracy, and coverage [1,2,10,13,16,30,40].
On-robot Grasp Learning: Another important trend has been learning to grasp directly from physical robotic grasp experiences. Early examples of this include [33], who learn to grasp from 50k grasp experiences collected over 700 hours of robot time, and [25], who learn a grasp policy from 800k grasp experiences collected over two months. QT-Opt [17] learns a grasp policy from 580k grasp experiences collected over 800 hours, and [15] extends this work by learning from an additional 28k grasp experiences. [39] learns a grasp detection model from 8k grasp demonstrations collected via demonstration, and [51] learns an online pushing/grasping policy from just 2.5k grasps.
Transparent object grasping: Commercial depth sensors that are based on structured-light or time-of-flight techniques often fail to sense transparent objects accurately [50]. Pixels in the depth image are often dropped due to specularities, or the object is simply invisible to the sensor because it is transparent [36]. To avoid this type of failure, RGB or RGB-D sensors are commonly used. [50] infers grasp pose from RGBD images. They collect paired RGB and D images of opaque objects and utilize transfer learning from a trained D-modality model to an RGBD-modality model. [36] reconstructs depth images from RGBD images using a CNN and then performs grasp detection. Likewise, [14] and [18] infer grasp pose from reconstructed depth images, but use a neural radiance field (NeRF) [27] for reconstruction. These methods either rely on collecting paired images for training or require training a NeRF model per grasp during evaluation. In contrast, our method is trained directly on RGB images without requiring paired images or NeRF models.
Equivariance through canonicalization in grasping: An alternative to modeling rotational symmetry using equivariant neural network layers is an approach known as canonicalization, where we learn a model over the non-equivariant variables assuming a single "canonical" group element [11,20,48]. Equivariance based on canonicalization is common in robotic grasping, where it is not unusual to translate and rotate the input image so that it is expressed in the reference frame of the hand, e.g., [26,30,42]. This way, the neural network model need only infer the quality of a grasp for a single canonical grasp pose rather than over arbitrary translations and orientations. In this paper, we compare our model-based approach to equivariance with VPG, a method that obtains rotational equivariance via canonicalization [39]. Our results in Section 6.2 suggest that the model-based approach has a significant advantage.

Equivariant Neural Network Models
In this paper, we use equivariant neural network layers defined with respect to a finite group (e.g., a finite group of rotations). An equivariant neural network [4-6,49] encodes a function f : X → Y that satisfies the equivariance constraint gf (x) = f (gx), where g ∈ G is an element of a finite group. Here gx is shorthand for the action of g on x, e.g., rotation of an image x.
Similarly, g(f (x)) describes the action of g on f (x).
Below, we make these ideas more precise and summarize how the equivariance constraint is encoded into a neural network layer.

The cyclic group
We are primarily interested in equivariance with respect to the group of planar rotations, SO(2). However, in practice, in order to make our models computationally tractable, we use the cyclic subgroup C n ⊂ SO(2), the group of discrete rotations by multiples of 2π/n radians.

Representation of a group
The way a group element g ∈ G acts on x depends on how x is represented. If x is a point in the plane, then g acts on x via the standard representation, ρ 1 (g)x, where ρ 1 (g) is the standard 2 × 2 rotation matrix corresponding to g. In the hidden layers of an equivariant neural network model, it is common to encode a separate feature map for each group element. For example, suppose G is the order-n cyclic group and suppose x = (x 1 , x 2 , . . . , x n ) ∈ R n×λ is a set of features that associates the kth group element with a feature x k ∈ R λ . The regular representation of g acts on x by permuting its elements: ρ reg (g)x = (x n−m+1 , . . . , x n , x 1 , x 2 , . . . , x n−m ), where g is the mth element of C n . Finally, it is sometimes the case that x is invariant to the action of the group elements. In this case, we have the trivial representation, ρ 0 (g)x = x.
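As a concrete illustration (a minimal sketch of our own, not code from the paper; the names `rho1`, `rho_reg`, and `rho0` are ours), the three representations can be written for C 4 in a few lines of numpy:

```python
import numpy as np

n = 4  # the cyclic group C4: rotations by multiples of 2*pi/4

def rho1(m):
    """Standard representation: 2x2 rotation matrix of the m-th group element."""
    t = 2 * np.pi * m / n
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def rho_reg(m, x):
    """Regular representation: cyclically permute the n feature entries,
    sending the k-th entry to position k + m (mod n)."""
    return np.roll(x, m, axis=0)

def rho0(m, x):
    """Trivial representation: the feature is left unchanged."""
    return x
```

Composing two rotations corresponds to adding their indices modulo n, so, e.g., applying `rho_reg` twice with m = 1 agrees with applying it once with m = 2.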

Feature maps of equivariant convolutional layers
An equivariant convolutional layer maps between feature maps which transform by specified representations ρ of the group. In the hidden layers of an equivariant model, an extra dimension is generally added to the feature maps to encode group elements via the regular representation. So, whereas the feature map used by a standard convolutional layer is a tensor F ∈ R m×h×w , an equivariant convolutional layer adds an extra dimension: F ∈ R k×m×h×w , where k denotes the dimension of the group representation. This tensor associates each pixel (u, v) with a matrix F(u, v) ∈ R k×m .

Action of the group operator on the feature map
Given a feature map F ∈ R k×m×h×w associated with group G and representation ρ, a group element g ∈ G acts on F via

(gF)(x) = ρ(g)F(ρ 1 (g) −1 x), (1)

where x ∈ R 2 denotes pixel position. The RHS of this equation applies the group operator in two ways. First, ρ 1 (g) −1 rotates the pixel position x using the standard representation. Second, ρ applies the rotation to the feature representation. If the feature is invariant to the rotation, then we use the trivial representation ρ 0 (g). However, if the feature vector changes with rotation (e.g., the feature denotes grasp orientation), then it must be transformed as well. This is accomplished by setting ρ in Equation 1 to be the regular representation, which transforms the feature vector by a circular shift.
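For C 4 and 90° rotations, this action can be sketched exactly in numpy (a toy illustration under our own naming; libraries such as E2CNN handle general C n with interpolation):

```python
import numpy as np

def act_regular(F, m):
    """Action of the m-th rotation of C4 on a feature map F of shape
    (k, c, h, w) carrying the regular representation (k = 4): rotate the
    pixel grid, then cyclically shift the group axis."""
    F = np.rot90(F, m, axes=(2, 3))  # rho1(g)^-1 applied to pixel positions
    return np.roll(F, m, axis=0)     # rho_reg(g) applied to the group axis
```

Applying the action twice with m = 1 agrees with applying it once with m = 2, and four quarter-turns return the original feature map, as required of a group action.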

The equivariant convolutional layer
An equivariant convolutional layer is a function h from F in to F out that is constrained to represent only equivariant functions with respect to a chosen group G. The feature maps F in and F out are associated with representations ρ in and ρ out acting on feature spaces R kin and R kout , respectively. The equivariance constraint for h is then [5]

h(ρ in (g)F) = ρ out (g)h(F). (2)

This constraint can be implemented by tying the kernel weights K(y) ∈ R kout×kin in such a way as to satisfy the following constraint [5]:

K(ρ 1 (g)y) = ρ out (g)K(y)ρ in (g) −1 . (3)

Please see Appendix B for an example of an equivariant convolutional layer. When all hidden layers h in a neural network satisfy Equation 2, then by induction the entire neural network is equivariant [5].
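A minimal numpy check of the constraint in Equation 2 (our own sketch, not the paper's implementation): a C 4 "lifting" correlation, which correlates the input with all four rotated copies of one kernel, maps a trivial-representation input to a regular-representation output. Rotating the input by 90° then rotates each output map and circularly shifts the group channel:

```python
import numpy as np

def correlate_valid(f, psi):
    """Plain 2-D cross-correlation, 'valid' mode."""
    kh, kw = psi.shape
    out = np.empty((f.shape[0] - kh + 1, f.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(f[i:i + kh, j:j + kw] * psi)
    return out

def lift_c4(f, psi):
    """C4 lifting layer: one output feature map per rotated copy of the kernel.
    The stacked output carries the regular representation of C4."""
    return np.stack([correlate_valid(f, np.rot90(psi, k)) for k in range(4)])
```

Rotating the input and then applying the layer gives the same result as applying the layer and then acting on its output (rotate each map spatially and shift the group axis), which is exactly h(ρ in (g)F) = ρ out (g)h(F) with ρ in trivial and ρ out regular.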

Augmented State Representation (ASR)
We will formulate SE(2) robotic grasping as the problem of learning a function from an m-channel image, s ∈ S = R m×h×w , to a gripper pose a ∈ A = SE(2) from which an object may be grasped. Since we will use the contextual bandit framework, we need to be able to represent the Q-function, Q : R m×h×w × SE(2) → R. However, this is difficult to do using a single neural network due to GPU memory limitations. To combat this, we use the Augmented State Representation (ASR) [38,45] to model Q as a pair of functions, Q 1 and Q 2 . Another advantage of using ASR is that we can use a different group order in Q 1 and Q 2 , as explained in Section 5.1.3. The first function is a mapping Q 1 : R m×h×w × X → R which maps from the image s and the translational component of the action x ∈ X onto value. This function is defined to be Q 1 (s, x) = max θ Q(s, (x, θ)). The second function is a mapping Q 2 : R m×h ′ ×w ′ × Θ → R with h ′ ≤ h and w ′ ≤ w which maps from an image patch and an orientation onto value. This function takes as input a cropped version of s centered on a position x, crop(s, x), and an orientation, θ, and outputs the corresponding value Q 2 (crop(s, x), θ) = Q(s, (x, θ)). Inference is performed on the model by evaluating x * = arg max x∈X Q 1 (s, x) first and then evaluating θ * = arg max θ Q 2 (crop(s, x * ), θ). Since each of these two models, Q 1 and Q 2 , is significantly smaller than Q would be, inference is much faster. Figure 1 shows an illustration of this process. The top of the figure shows the action of Q 1 while the bottom shows Q 2 . Notice that the semantics of Q 2 imply that θ depends only on crop(s, x), a local neighborhood of x, rather than on the entire scene. This assumption is generally true for grasping because grasp orientation typically depends only on the object geometry near the target grasp point.
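The two-stage inference just described can be sketched as follows (a simplified single-channel version with hypothetical shapes and names; the actual networks are the equivariant models described in Section 5):

```python
import numpy as np

def asr_inference(s, q1, q2, crop_size=32, n_theta=8):
    """Two-stage ASR inference.
    q1: maps an (h, w) image to an (h, w) Q-map over grasp positions.
    q2: maps a (crop_size, crop_size) patch to n_theta Q-values over angles."""
    q_map = q1(s)
    x = np.unravel_index(np.argmax(q_map), q_map.shape)  # x* = argmax_x Q1(s, x)
    half = crop_size // 2
    padded = np.pad(s, half)                             # so border grasps crop cleanly
    patch = padded[x[0]:x[0] + crop_size, x[1]:x[1] + crop_size]
    k = int(np.argmax(q2(patch)))                        # theta* = argmax_theta Q2(crop, theta)
    return x, 2 * np.pi * k / n_theta
```

The argmax over positions touches only the q1 Q-map, and the argmax over orientations touches only one small patch, which is what makes this factorization cheaper than a single network over all of ŜE(2).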

Planar grasp detection
The planar grasp detection function Γ : R m×h×w → SE(2) maps from a top-down image of a scene containing graspable objects, s ∈ S = R m×h×w , to a planar gripper pose, a ∈ A = SE(2), from which an object can be grasped. This is similar to the formulations used by [26,29].

Formulation as a Contextual Bandit
We formulate grasp learning as a contextual bandit problem where the state is an image s ∈ S = R m×h×w and the action a ∈ A = SE(2) is a grasp pose to which the robot hand will be moved and a grasp will be attempted, expressed in the reference frame of the image. After each grasp attempt, the agent receives a binary reward R drawn from a Bernoulli distribution with unknown probability r(s, a). The true Q-function denotes the expected reward of taking action a in state s. Since R is binary, we have that Q(s, a) = r(s, a). This formulation of grasp learning as a bandit problem is similar to that used by, e.g., [8,17,51].

Invariance Assumption
We assume that the (unknown) reward function r(s, a) that denotes the probability of a successful grasp is invariant to translations and rotations g ∈ SE(2). Let gs denote the image s translated and rotated by g. Similarly, let ga denote the action translated and rotated by g. Our assumption is therefore

r(gs, ga) = r(s, a). (4)

Intuitively, when the image of a scene transforms, the grasp poses (located with respect to the image) transform correspondingly.

Equivariant Learning
We use equivariant neural networks [49] to model the Q-function so as to enforce the invariance assumption of Section 4.3. Therefore, once the Q-function is fit to a data sample r(s, a), it generalizes to any transformed data sample r(gs, ga), g ∈ SE(2). This generalization can lead to a significant sample efficiency improvement during training.

Invariance properties of Q 1 and Q 2
The assumption that the reward function r is invariant to transformations g ∈ SE(2) implies that the optimal Q-function is also invariant to g, i.e., Q(s, a) = Q(gs, ga). In the context of the augmented state representation (ASR, see Section 3.2), this implies separate invariance properties for Q 1 and Q 2 :

Q 1 (gs, gx) = Q 1 (s, x), (5)

Q 2 (g θ (crop(s, x)), g θ θ) = Q 2 (crop(s, x), θ), (6)

where g θ ∈ SO(2) denotes the rotational component of g ∈ SE(2), gx denotes the rotated and translated vector x ∈ R 2 , and g θ (crop(s, x)) denotes the cropped image rotated by g θ .

Discrete Approximation of SE(2)
We implement the invariance constraints of Equations 5 and 6 using a discrete approximation to SE(2). We constrain the positional component of the action to be a discrete pair of positive integers x ∈ {1 . . . h} × {1 . . . w} ⊂ Z 2 , corresponding to a pixel in s, and constrain the rotational component of the action to be an element of the finite cyclic group C n = {2πk/n : 0 ≤ k < n, k ∈ Z}. This discretized action space will be written ŜE(2) = Z 2 × C n .

Equivariant Q-Learning with ASR
In Q-Learning with ASR, we model Q 1 and Q 2 as neural networks. We model Q 1 as a fully convolutional UNet [35] q 1 : R m×h×w → R 1×h×w that generates a Q value for each discretized translational action from the input state image. We model Q 2 as a standard convolutional network q 2 : R m×h ′ ×w ′ → R n that evaluates the Q value for the n discretized rotational actions based on the image patch. The networks q 1 and q 2 thus model the functions Q 1 and Q 2 by partially evaluating at the first argument and returning a function of the second. As a result, the invariance properties of Q 1 and Q 2 (Equations 5 and 6) imply the equivariance of q 1 and q 2 :

q 1 (gs) = g(q 1 (s)), (7)

q 2 (g θ (crop(s, x))) = ρ reg (g θ )q 2 (crop(s, x)), (8)

where g ∈ ŜE(2) acts on the output of q 1 by rotating and translating the Q-map, and g θ ∈ C n acts on the output of q 2 by performing a circular shift of the output Q values via the regular representation ρ reg . This is illustrated in Figure 2. In Figure 2a, we take an example of a depth image s in the upper left corner. If we rotate and translate this image by g (lower left of Figure 2a) and then evaluate q 1 , we arrive at q 1 (gs). This corresponds to the LHS of Equation 7. However, because q 1 is an equivariant function, we can calculate the same result by first evaluating q 1 (s) and then applying the transformation g (RHS of Equation 7). Figure 2b illustrates the same concept for Equation 8. Here, the network takes the image patch crop(s, x) as input. If we rotate the image patch by g θ and then evaluate q 2 , we obtain the LHS of Equation 8, q 2 (g θ crop(s, x)). However, because q 2 is equivariant, we can obtain the same result by evaluating q 2 (crop(s, x)) and circularly shifting the resulting vector to account for the change in orientation.

Model Architecture of Equivariant q 1
As a fully convolutional network, q 1 inherits the translational equivariance property of standard convolutional layers. The challenge is to encode rotational equivariance so as to satisfy Equation 7. We accomplish this using equivariant convolutional layers that satisfy the equivariance constraint of Equation 2, where we assign F in = s ∈ R 1×m×h×w to encode the input state s and F out ∈ R 1×1×h×w to encode the output Q-map. Both feature maps are associated with the trivial representation ρ 0 , such that the rotation g operates on these feature maps by rotating pixels without changing their values. We use the regular representation ρ reg for the hidden layers of the network to encode more comprehensive information in the intermediate layers. We found we achieved the best results when we defined q 1 using the dihedral group D 4 , which expresses the group generated by rotations of multiples of π/2 in combination with vertical reflections.

Model Architecture of Equivariant q 2
Whereas the equivariance constraint in Equation 7 is over ŜE(2), the constraint in Equation 8 is over C n only. We implement Equation 8 using Equation 2 with an input of F in = crop(s, x) ∈ R 1×m×h ′ ×w ′ associated with the trivial representation, and an output of F out ∈ R n×1×1×1 associated with the regular representation. q 2 is defined in terms of the group C n , assuming the rotations in the action space are defined to be multiples of 2π/n.

q 2 Symmetry Expressed as a Quotient Group
It turns out that additional symmetries exist when the gripper has a bilateral symmetry. In particular, it is often the case that rotating a grasp pose by π radians about its forward axis does not affect the probability of grasp success, i.e., r is invariant to rotations of the action by π radians. When this symmetry is present, we can model it using the quotient group C n /C 2 ∼ = {2πk/n : 0 ≤ k < n/2, k ∈ Z}, which pairs orientations separated by π radians into the same equivalence class.
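Concretely (our own illustration, not code from the paper), with n discrete orientations the quotient C n /C 2 identifies index k with k + n/2, leaving n/2 equivalence classes:

```python
import numpy as np

def quotient_class(n):
    """C_n / C_2: orientation indices k and k + n/2 (i.e., theta and
    theta + pi) fall into the same equivalence class."""
    return np.arange(n) % (n // 2)
```

For n = 8 this yields classes [0, 1, 2, 3, 0, 1, 2, 3], so a bilaterally symmetric gripper needs only n/2 = 4 distinct orientation Q-values.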

Other Optimizations
While our use of equivariant models to encode the Q-function is responsible for most of our gains in sample efficiency (Section 6.3), there are several additional algorithmic details that, taken together, have a meaningful impact on performance.

Loss Function
In the standard ASR loss function, given a one-step reward r(s, a), where a = (x, θ), Q 1 and Q 2 have targets [45]:

L 1 = (Q 1 (s, x) − r(s, a)) 2 ,

L 2 = (Q 2 (crop(s, x), θ) − r(s, a)) 2 .

In the L 1 term, however, since the reward r(s, a) is the ground-truth return of Q 2 (crop(s, x), θ), we correct Q 2 with r(s, a). Denote by Q̃ 2 the corrected output, equal to r(s, a) at the executed orientation θ and equal to Q 2 elsewhere. We then modify L 1 to learn from Q̃ 2 :

L ′ 1 = (Q 1 (s, x) − max θ̂ Q̃ 2 (crop(s, x), θ̂)) 2 . (13)

In addition to the above, we add an off-policy loss term L ′′ 1 that is evaluated with respect to an additional k grasp positions X̂ ⊂ X sampled using a Boltzmann distribution from Q 1 (s):

L ′′ 1 = Σ x̂∈X̂ (Q 1 (s, x̂) − max θ̂ Q 2 (crop(s, x̂), θ̂)) 2 ,

where Q 2 provides targets to train Q 1 . This off-policy loss minimizes the gap between Q 1 and Q 2 . Our combined loss function is therefore L = L ′ 1 + L ′′ 1 + L 2 .

Prioritizing failure experiences in minibatch sampling
In the contextual bandit setting, we want to avoid the situation where the agent repeats the same incorrect action multiple times in a row. This can happen because some failed grasps leave the scene unchanged, so the image and the resulting Q-map are unchanged as well. We address this problem by prioritizing failure experiences. When a grasp fails, the experience is included in the sampled minibatch on the next SGD step [51], thereby updating the Q-function prior to reevaluating it on the next time step. The update to Q reduces the chance of selecting the same (bad) action again.
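A sketch of this sampling rule (names and interface are ours, not from the paper):

```python
import random

def sample_minibatch(buffer, batch_size, last_failure_idx=None):
    """Uniform minibatch that force-includes the most recent failed grasp,
    so the Q-function is updated on it before the next action selection."""
    idx = random.sample(range(len(buffer)), batch_size)
    if last_failure_idx is not None and last_failure_idx not in idx:
        idx[0] = last_failure_idx
    return [buffer[i] for i in idx]
```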

Boltzmann exploration
We find empirically that Boltzmann exploration performs better than ϵ-greedy exploration in our grasp setting. We use a temperature of τ train during training and a lower temperature of τ test during testing. Using a non-zero temperature at test time helps reduce the chance of repeatedly sampling a bad action.
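Boltzmann exploration samples actions with probability proportional to exp(Q/τ); a small sketch (our own helper, with τ as a free parameter):

```python
import numpy as np

def boltzmann_sample(q_values, tau, rng=None):
    """Sample an action index with probability proportional to exp(q / tau).
    Lower tau approaches greedy action selection."""
    rng = rng or np.random.default_rng()
    z = q_values / tau
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(q_values), p=p))
```

At a very low temperature the best action is chosen almost surely, while a high temperature approaches uniform sampling, which is why a small non-zero test temperature still occasionally escapes a repeated bad action.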

Data augmentation
Even though we are using equivariant neural networks to encode the Q-function, it can still be helpful to perform data augmentation as well. This is because the granularity of the rotation group encoded in q 1 (D 4 ) is coarser than that of the action space (C n /C 2 ). We address this problem by augmenting the data with translations and rotations sampled from ŜE(2). For each experienced transition, we add eight additional ŜE(2)-transformed images to the replay buffer.
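For the special case of 90° rotations (where np.rot90 is exact and no interpolation is needed), rotating an (image, grasp-pixel) pair consistently looks like this sketch (our own names; the method itself samples finer ŜE(2) transformations via image interpolation):

```python
import numpy as np

def rot_pixel(x, shape, k):
    """Track a pixel coordinate through k successive np.rot90 rotations."""
    (u, v), (h, w) = x, shape
    for _ in range(k):
        u, v = w - 1 - v, u   # where element A[u, v] lands in np.rot90(A)
        h, w = w, h           # the array shape transposes each quarter turn
    return u, v

def augment(s, x, k):
    """Rotate an image and its grasp pixel by k * 90 degrees together."""
    return np.rot90(s, k), rot_pixel(x, s.shape, k)
```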
Softmax at the output of q 1 and q 2

Since we are using a contextual bandit with binary rewards and the reward function r(s, a) denotes the parameter of a Bernoulli distribution at (s, a), we know that Q 1 and Q 2 must each take values between zero and one. We encode this prior using an element-wise softmax layer at the output of each of the q 1 and q 2 networks.

Selection of the z coordinate
In order to execute a grasp, we must calculate a full (x, y, θ, z) goal pose for the gripper. Since our model only infers a planar grasp pose, we must calculate a depth along the axis orthogonal to this plane (the z axis) using other means. In this paper, we calculate z by taking the average depth over a 5 × 5 pixel region centered on the grasp point in the input depth image. The commanded gripper height is set to an offset from this calculated height. While executing the motion to this height, we monitor force feedback from the arm and halt the motion prematurely if a threshold is exceeded.
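The depth lookup can be sketched as follows (the 5 × 5 window is from the text; the gripper offset value is a placeholder of ours, not the paper's):

```python
import numpy as np

def grasp_z(depth, x, window=5, offset=0.02):
    """Average depth over a window-by-window patch centered on the grasp
    pixel; the commanded gripper height is an offset from this value."""
    u, v = x
    r = window // 2
    patch = depth[max(u - r, 0):u + r + 1, max(v - r, 0):v + r + 1]
    return float(patch.mean()) - offset
```

Averaging over a small patch rather than reading a single pixel makes the height estimate robust to isolated dropped or noisy depth pixels.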

Optimizations for transparent object grasping using RGB input
In order to grasp transparent objects using our model, we found it helpful to make a few small modifications to our model and setup. Most importantly, we found it essential to use RGB rather than depth-only image input to the model. Bin color: We found that for transparent objects, our system performed much better using a black bin rather than a white or transparent bin. Figure 6 illustrates this difference in setup. We believe that the performance difference is due to the larger contrast between the transparent objects and the bin in the RGB spectrum.
Dihedral group in q 2 : Another optimization we used in our transparent object experiments was to implement Equation 8 using the dihedral group D n , which expresses the group of rotations by multiples of 2π/n together with reflections, rather than C n . As in Section 5.1.6, we use a quotient group that encodes gripper symmetries. Here, this quotient group becomes D n /D 2 , which pairs orientations separated by π and reflected orientations into the same equivalence class. Collision penalty: In our experiments with transparent objects, we found that collision was a more significant problem than it was with opaque objects. We believe this was a result of the fact that, since the transparent objects did not completely register in the depth image, standard collision checking between the object point cloud and the gripper did not suffice to prevent collisions. Therefore, in our transparent object experiments, we penalized successful grasps that produced collisions during grasping by awarding those grasps only 0.8 reward instead of the full 1.0 reward.
6 Experiments in Simulation

6.1 Setup

Object Set
All simulation experiments are performed using objects drawn from the GraspNet-1Billion dataset [10]. This includes 32 objects from the YCB dataset [3], 13 adversarial objects used in DexNet 2.0 [26], and 43 additional objects unique to GraspNet-1Billion [10] (a total of 88 objects). Out of these 88 objects, we exclude two bowls because they can be stably placed in non-graspable orientations, i.e., they can be placed upside down and cannot be grasped in that orientation using standard grippers. Also, we scale these objects so that they are graspable from any stable object configuration. Lastly, objects are assigned random RGB values drawn from the uniform distribution U ((0.6, 0.6, 0.6), (1, 1, 1)). We refer to these 86 mesh models as our simulation "object set", shown in Figure 3a.

Simulation Details
Our experiments are performed in Pybullet [7]. The environment includes a Kuka robot arm and a 0.3 m × 0.3 m tray with inclined walls (Figure 3b). At the beginning of each episode, the environment is initialized with 15 objects drawn uniformly at random from our object set and dropped into the tray from a height of 40 cm so that they fall into a random configuration. The state is a depth image (depth modality) or RGB image (RGB modality) captured from a top-down camera (Figure 3c and d). On each time step, the agent perceives a state and selects an action to execute, which specifies the planar pose to which to move the gripper. A grasp is considered successful if the robot is able to lift the object more than 0.1 m above the table. The environment is reinitialized when all objects have been removed from the tray or 30 grasp attempts have been made.

Baseline Model Architectures
We compare our method against two different model architectures from the literature: VPG [51] and FC-GQ-CNN [37]. Each model is evaluated alone and then with two different data augmentation strategies (soft equ and RAD). In all cases, we use the contextual bandit formulation described in Section 4.2.
The baseline model architectures are: VPG: Architecture used for grasping in [51]. This model is a fully convolutional network (FCN) with a single-channel output. The Q value of different gripper orientations is evaluated by rotating the input image. We ignore the pushing functionality of VPG. FC-GQ-CNN: Model architecture used in [37]. This is an FCN with 8-channel output that associates each grasp rotation with a channel of the output. During training, our model uses Boltzmann exploration with a temperature of τ = 0.01, while the baselines use ϵ-greedy exploration starting with ϵ = 50% and ending with ϵ = 10% over 500 grasps (this follows the original implementation in [51]).

Data Augmentation Strategies
The data augmentation strategies are: n× RAD: The method from [23] that augments each sample in the mini-batch with respect to Equation 4. Specifically, for each SGD step, we first draw bs (where bs is the batch size) samples from the replay buffer. Then for each sample, we augment both the observation and the action using a random SE(2) transformation, while the reward is unchanged. We perform n SGD steps on the RAD-augmented mini-batch after each grasp sample. n× soft equ: Similar to n× RAD except that we produce a mini-batch by drawing bs/n samples and randomly augmenting those samples n times with respect to Equation 4, then perform a single SGD step. Details can be found in Appendix C.

Results and Discussion
The learning curves of Figure 4 show the grasp success rate versus the number of grasp attempts in the depth modality. Figure 4a shows online learning performance. Each data point is the average success rate over the last 150 grasps (therefore, the first data point occurs at 150). Figure 4b shows testing performance obtained by stopping training every 150 grasp attempts, performing 1000 test grasps, and reporting average performance over these 1000 test grasps. Our method tests at a lower temperature of τ = 0.002 while the baselines test with pure greedy behavior. Generally, our proposed equivariant model convincingly outperforms the baseline methods and data augmentation strategies in the depth modality. In particular, Figure 4b shows that the testing success rate of the equivariant model after 150 grasp attempts is the same as or better than that of all the baseline methods after 1500 grasp attempts. Notice that each of the two data augmentation methods we consider (RAD and soft equ) has a positive effect on the baseline methods. However, after training for the full 1500 grasp attempts, our equivariant model converges to the highest grasp success rate (93.9 ± 0.4%). Please see Appendix D for a comparison over a longer training horizon.

Ablation Study
There are three main parts of SymGrasp as described in this paper: 1) the use of equivariant convolutional layers instead of standard convolutional layers; 2) the use of the augmented state representation (ASR) instead of a single network; 3) the various optimizations described in Section 5.2. Here, we evaluate the performance of the method when ablating each of these three parts in the depth modality. For additional ablations, see Appendix D.

Baselines
In no equ, we replace all equivariant layers with standard convolutional layers. In no ASR, we replace the equivariant q1 and q2 models described in Section 3.2 with a single equivariant network. In no opt, we remove the optimizations described in Section 5.2. In addition to the above, we also evaluate rot equ, which is the same as no ASR except that we replace ASR with a U-Net [35] and apply 4× RAD [23] augmentation. Detailed network architectures can be found in Appendix A.

Results and Discussion
Figure 5a and b show the results, reported exactly as in Section 6.2. no equ does worst, suggesting that our equivariant model is critical. We can improve on this somewhat by adding data augmentation (rot equ), but it still underperforms significantly. The other ablations, no ASR and no opt, demonstrate that those parts of the method are also important.

Effect of background color
This experiment evaluates the effect of background color in the RGB modality. We compare training with different tray colors, each offset from the mean RGB value of the objects (0.8, 0.8, 0.8). For example, in mean−0.8, the tray color is black (0, 0, 0). We test four colors ranging from black (mean−0.8) to white (mean+0.2).

Results and Discussion
Figure 5c and d show that the contrast between the background and object colors has a large impact on grasp learning. In particular, the model performs worst when the background color equals the mean of the object color (mean) and best when the background color contrasts most strongly with the mean of the object color (mean−0.8).
Experiments in Hardware

Setup

Robot Environment
Our experimental platform comprises a Universal Robots UR5 manipulator equipped with a Robotiq 2F-85 parallel-jaw gripper. For the depth modality, we use an Occipital Structure Sensor; for the RGB modality, we use a Kinect Azure sensor paired with two black trays. The dual-tray grasping environment is shown in Figure 6. The workstation is equipped with an Intel Core i7-7800X CPU and an NVIDIA GeForce GTX 1080 GPU.

Self-Supervised Training
Training begins with the 15 training objects (Figure 9a for opaque object grasping and Figure 10a for transparent object grasping) being dropped into one of the two trays by the human operator. The robot then attempts to pick objects from one tray and drop them into the other. The drop location is sampled from a Gaussian distribution centered in the middle of the receiving tray. All grasp attempts are generated by the contextual bandit. When all 15 objects have been transported in this way, training switches to grasping from the other tray and dropping into the first. Whether all objects have been transported is decided by a heuristic: for opaque objects, we threshold the depth sensor reading; for transparent objects, we compare the RGB image of the current scene with an RGB image of an empty tray, similar to [17]. During training, the robot performs 600 grasp attempts in this way (that is, 600 grasp attempts, not 600 successful grasps).
The reward r is set to 1 if the gripper was blocked by the grasped object, and r = 0 otherwise.
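The tray-switching heuristic above can be sketched as follows. The thresholds and pixel counts here are illustrative assumptions, not values from the paper.

```python
import numpy as np

def tray_empty_depth(depth_img, table_depth, tol=0.005, min_pixels=20):
    """Opaque objects: the tray is empty when few pixels are closer to the
    camera than the empty-table depth (tolerances are hypothetical)."""
    return np.count_nonzero(depth_img < table_depth - tol) < min_pixels

def tray_empty_rgb(rgb_img, empty_tray_img, tol=0.05, min_pixels=20):
    """Transparent objects: compare against a stored image of the empty
    tray, similar to [17]; thresholds are hypothetical."""
    diff = np.abs(rgb_img.astype(float) - empty_tray_img.astype(float)).mean(axis=-1)
    return np.count_nonzero(diff > tol) < min_pixels
```

Requiring a minimum number of differing pixels makes the check robust to isolated sensor-noise pixels.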

In-Motion Computation
We nearly double the speed of robot training by doing all image processing and model learning while the robotic arm is in motion. This is implemented in Python as a producer-consumer process using mutexes. As a result, our robot is constantly in motion during training, and the training speed of our equivariant algorithm is limited only by the velocity of the robot's motion. This improvement enabled us to increase training speed from approximately 230 grasps per hour to roughly 400 grasps per hour.
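The producer-consumer pattern can be sketched as below: the robot thread produces grasp transitions while a background learner thread consumes them for SGD updates. The function and callback names are hypothetical; a thread-safe `queue.Queue` supplies the locking.

```python
import queue
import threading

def run_in_motion_training(grasp_outcomes, train_step):
    """Minimal sketch: consume transitions for learning while the
    producer (standing in for robot motion) keeps generating them."""
    q = queue.Queue()
    done = threading.Event()

    def learner():
        # Keep training until the producer finishes AND the queue is drained.
        while not done.is_set() or not q.empty():
            try:
                train_step(q.get(timeout=0.05))   # hypothetical SGD update
            except queue.Empty:
                continue

    t = threading.Thread(target=learner)
    t.start()
    for outcome in grasp_outcomes:                # produced during arm motion
        q.put(outcome)
    done.set()
    t.join()
```

Because the learner never blocks the producer, image processing and SGD cost are hidden behind arm travel time, which is why throughput becomes motion-limited.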

Evaluation procedure
Evaluation begins with the human operator dropping objects into one tray; after this, no human interference is allowed. We evaluate the robot's grasp success rate. A key failure mode during testing is repeated failed grasps. To combat this, we use the procedure of [51] to reduce the chance of repeated grasp failures: after a grasp failure, we perform multiple SGD steps using that experience to "discourage" the model from selecting the same action, and then use the updated model for the subsequent grasp. After a successful grasp, we discard these updates and reload the original network. All runs are evaluated by freezing the corresponding model and executing 100 greedy (or near-greedy) test grasps for each object set in the easy-object test set (Figure 9b) and the hard-object test set (Figure 9c).
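The snapshot-and-restore logic of this procedure can be sketched as follows. The agent/environment APIs and the number of discouragement steps are hypothetical, not the paper's implementation.

```python
import copy

def evaluate(agent, env, n_grasps=100, n_discourage_steps=8):
    """Sketch of the test-time procedure from [51]: after a failure, take
    a few SGD steps on that experience to discourage repeating it; after
    a success, reload the saved weights."""
    saved = copy.deepcopy(agent.state_dict())
    successes = 0
    for _ in range(n_grasps):
        obs = env.observe()
        action = agent.select_action(obs, greedy=True)
        if env.grasp(action):
            successes += 1
            agent.load_state_dict(copy.deepcopy(saved))   # discard temporary updates
        else:
            for _ in range(n_discourage_steps):           # push Q(obs, action) toward 0
                agent.sgd_step(obs, action, reward=0.0)
    return successes / n_grasps
```

Restoring the snapshot after each success keeps the reported success rate tied to the frozen model rather than to an evaluation-time drift of the weights.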

Model Details
For all methods, prior to training on the robot, model weights are initialized randomly using an independent seed. No experiences from simulation are used, i.e., we train from scratch. For our depth algorithm, the q1 network is defined using D4-equivariant layers and the q2 network is defined using C16/C2-equivariant layers. For our RGB algorithm, the q2 network is defined using D16/D2-equivariant layers. During training, we use Boltzmann exploration with a temperature of 0.01. During testing, the temperature is reduced to 0.002 (near-greedy). For more details, see Appendix G.

Opaque Object Grasping

Objects
For opaque object grasping, all training happens using the 15 objects shown in Figure 9a. After training, we evaluate grasp performance on both the "easy" test objects (Figure 9b) and the "hard" test objects (Figure 9c). Note that both test sets are novel with respect to the training set.

Baselines
In our robot experiments for opaque object grasping, we compare our method against 8× RAD VPG [51, 23] and 8× RAD FC-GQ-CNN [37, 23], the two baselines we found to perform best in simulation. As before, 8× RAD VPG uses a fully convolutional network (FCN) with a single output channel; the Q-map for each gripper orientation is calculated by rotating the input image. After each grasp, we perform 8× RAD data augmentation (8 optimization steps with a mini-batch containing randomly translated and rotated image data). 8× RAD FC-GQ-CNN also has an FCN backbone, but with eight output channels corresponding to the gripper orientations; it uses 8× RAD data augmentation as well. All exploration is the same as in simulation except that the ϵ-greedy schedule anneals from 50% to 10% over 200 steps rather than 500.

Results and Discussion
Figure 7a shows the learning curves for the three methods on opaque object grasping. An important observation is that the results from training on the physical robot (Figure 7a) match the simulation training results (Figure 4a). Figure 7a shows that for opaque object grasping, our method achieves a success rate of > 90% after 600 grasp attempts while the baselines are near 70%. Table 1 shows the testing performance and demonstrates that our method also significantly outperforms the baselines during testing. However, performance is lower on the "hard" test set; we hypothesize this is due to a lack of sufficient diversity in the training set. Finally, since each of these 600-grasp training runs takes approximately 1.5 hours, these methods could efficiently learn directly from real-world data, thus avoiding the simulation-to-real-world gap and adapting to physical changes in the robot.

Transparent Object Grasping

Objects
For transparent object grasping, all training happens using the 15 objects shown in Figure 10a. After training, we evaluate grasp performance on the in-distribution training set and the out-of-distribution testing objects (Figure 10a, b). Note that the test set is novel with respect to the training set.

RGB Modality Details
The top-down RGB images are orthographic projections of noisy RGB point clouds. RGB values are normalized to fit a Gaussian distribution N(0, 0.01).
During training, each mini-batch of images is augmented in brightness within a (0.9, 1.1) range.
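This preprocessing can be sketched as below. The constants follow the text; the per-image standardization and the order of the two steps are our assumptions.

```python
import numpy as np

def preprocess_rgb(img, rng, brightness_range=(0.9, 1.1)):
    """Standardize an RGB image to roughly N(0, 0.01), then apply a random
    brightness scale during training (a sketch, not the exact pipeline)."""
    x = img.astype(np.float64)
    x = (x - x.mean()) / (x.std() + 1e-8) * 0.1   # std 0.1 -> variance 0.01
    scale = rng.uniform(*brightness_range)        # brightness augmentation
    return x * scale
```

Standardizing to a small variance keeps the RGB inputs on a scale comparable to depth inputs, so the same network hyperparameters can be reused across modalities.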

Baselines
For transparent object grasping, we compare Ours RGB (black), Ours RGB (white), and Ours D.
Ours RGB (black) is our proposed method with black trays; Ours RGB (white) replaces the black trays with white trays (see Figure D8); Ours D is the best-performing method from the physical opaque object grasping experiments, here trained and tested on transparent objects.

Results and Discussion
Figure 7b shows the learning curves for transparent object grasping and Table 2 shows the grasping performance of Ours RGB (black) on the transparent object set. Figure 8 shows a transparent object grasping process during evaluation. Figure 7b shows that for transparent object grasping, Ours RGB (black) produces a significant (> 40%) improvement in success rate compared to Ours D, suggesting that RGB information is needed to perform well in this setting. The background color is also important: changing the trays from black to white leads to a > 30% decrease in success rate. Table 2 shows that Ours RGB achieves a > 80% success rate on both the training and testing sets, indicating that the method learns a good grasp function for the in-distribution object set and generalizes to novel transparent objects. Also, from a computational perspective, our method is significantly cheaper (1.8 seconds per grasp) than NeRF-based approaches to transparent object grasping such as Evo-NeRF [18] and Dex-NeRF [14], which take 9.5 and 16 seconds, respectively.

Conclusions and Limitations
Our main contribution is to recognize that planar grasp detection is SE(2)-equivariant and to leverage this structure using an SO(2)-equivariant model architecture. This model is significantly more sample efficient than other grasp-learning methods and can learn a good grasp function with fewer than 600 grasp samples. This increase in sample efficiency enables us to learn to grasp on a physical robotic system in a practical amount of time. A key limitation in both simulation and the real world is that despite the fast learning rate, grasp success rates (after training) still seem to be limited to the low-to-mid 90% range for opaque objects and 80% for transparent objects. This matches the success rates of other grasp detection methods [26, 30, 42], but it is disappointing here because one might expect faster adaptation to ultimately lead to better grasp performance. This could simply be an indication of the complexity of the grasp function to be learned, or it could be a result of stochasticity in the simulator and on the real robot. Another limitation of our work is that it is restricted to an open-loop control setting where the model infers only a goal pose. Nevertheless, we expect that the equivariant model structure leveraged in this paper would also be useful in the closed-loop setting, as suggested by the results of [47].

We train with a longer horizon to illustrate sample efficiency. As shown in Figure D4, our method not only learns faster but also converges to a better success rate than the other two baselines.
We then ablate each component described in Section 5.2. The baselines are: In ASR loss, both q1 and q2 minimize an l2 loss between the prediction and the reward r.
In No prioritizing, the mini-batch is uniformly sampled from the replay buffer. In ϵ-greedy, we replace Boltzmann exploration with ϵ-greedy exploration that linearly anneals from 0.5 to 0.1 over 500 grasps. In No data aug, no data augmentation is performed. In No softmax, the pixel-wise softmax in the last layer of q1 and q2 is removed. We then ablate each component described in Section 5.3. The baselines are: In Cyclic group, we replace the D16 group with the C16 group in q2. In No collision penalty, we replace the collision penalty with a binary grasp success reward.
Figure D6a, b shows the learning results. Cyclic group shows that the cyclic group in q2 leads to slower learning before 300 grasps compared to the dihedral group in q2; No collision penalty shows that the collision penalty encourages quicker and better convergence.

Fig. 1
Fig. 1 Illustration of the ASR representation. Q1 selects the translational component of an action; Q2 selects the rotational component.

Fig. 3
Fig. 3 (a) The 86 objects used in our simulation experiments are drawn from the GraspNet-1Billion dataset [10]. (b) The PyBullet simulation. (c) and (d) are top-down depth and RGB images of the grasping scene.

Fig. 4
Fig. 4 Comparison with baselines. All lines are an average of four runs. Shading denotes standard error. (a) shows learning curves as a running average over the last 150 training grasps. (b) shows the average near-greedy performance of 1000 validation grasps performed every 150 training steps.

Fig. 5
Fig. 5 Ablation study. Lines are an average over 4 runs. Shading denotes standard error. The left column shows learning curves as a running average over the last 150 training grasps; the right column shows the average near-greedy performance of 1000 validation grasps performed every 150 training steps. The first row is in the depth modality, the second row in the RGB modality.

Fig. 7
Fig. 7 Learning curves for (a) the opaque object grasping and (b) the transparent object grasping hardware experiments; the parentheses indicate the color of the trays. All curves are averaged over 4 runs with different random seeds and random object placements. Each data point is the average grasp success over the last 60 grasp attempts. Shading denotes standard error.

Fig. 8
Fig. 8 Illustrations of transparent object grasping. (a) shows the depth observation; notice that most transparent objects are not sensed. (b) shows the same scene in the RGB modality; notice that objects are sensed up to orthographic projection error. (c) shows the Q1 map from the observation in (b) and the selected action as a red dot. (d) shows the executed grasp.


Fig. 9
Fig. 9 Object sets used for training and testing. Both the training set and the easy test set include 15 objects, while the hard test set has 20 objects. Objects were curated so that they are graspable by the Robotiq 2F-85 parallel-jaw gripper from any configuration and visible to the Occipital Structure Sensor.
Fig. C2 The neural network architectures for our method and the ablations. R denotes the regular representation, T the trivial representation, and Q the quotient representation.
Fig. C3 Action space constraint for action selection. (a) The easy test set cluttered scene. (b) The state s. (c) The action space x_positive, overlaying the binary mask x_positive on the state s for visualization. (d) The Q-values within the action space. (e) Selecting an action. (f) Executing a grasp.

Fig. D4
Fig. D4 Additional comparison with baselines for Section 6.2 with the depth modality. Lines are an average over 4 runs. Shading denotes standard error. (a) learning curves as a running average over the last 150 training grasps. (b) average near-greedy performance of 1000 validation grasps performed every 150 training steps.
Figure D5a, b shows the learning results. Figure D5b shows that ASR loss affects performance the most, indicating the importance of the loss function described in Section 5.2.1. The other components, No data aug and ϵ-greedy, slightly reduce performance, indicating the marginal benefits of data augmentation and Boltzmann exploration. Lastly, No softmax performs badly at 600 grasps and No prioritizing performs badly at 1500 grasps, suggesting that the softmax and prioritizing failed grasps stabilize learning.

Fig. D5
Fig. D5 Additional ablation study for Section 5.2 with the depth modality. Lines are an average over 4 runs. Shading denotes standard error. (a) learning curves as a running average over the last 150 training grasps. (b) average near-greedy performance of 1000 validation grasps performed every 150 training steps.

Fig. D6
Fig. D6 Additional ablation study for Section 5.3 with the depth modality. Lines are an average over 4 runs. Shading denotes standard error. (a) learning curves as a running average over the last 150 training grasps. (b) average near-greedy performance of 1000 validation grasps performed every 150 training steps.

Fig. D7
Fig. D7 The definition of θ: x is the x-axis of the workspace and n⃗ is the normal of the gripper.

Fig. D8
Fig. D8 Transparent object grasping scenes. First row: black trays. Second row: white trays.

Table 1
Evaluation success rate (%), standard error, and training time per grasp tSGD (in seconds) in the hardware experiments.

Table 2
Evaluation success rate (%), standard error, and capture time per grasp (in seconds) in the hardware experiments on transparent objects. Results are an average of 100 grasps per training run, averaged over four runs, performed on the training objects and the held-out test objects shown in Figure 10a and b.