1 Introduction

Grasp detection uses machine learning to identify good grasp poses in a scene directly from raw visual input (e.g., RGB or depth images). Because they learn from data, such methods generalize to novel objects. This is in contrast to classical model-based methods that attempt to reconstruct the geometry and pose of objects in a scene and then reason geometrically about how to grasp those objects.

Most current grasp detection models are data-driven, i.e., they must be trained on large offline datasets. For example, Mousavian et al. (2019) train on a dataset of over 7M simulated grasps, Breyer et al. (2021) train on over 2M simulated grasps, Mahler et al. (2017) train on grasp data drawn from over 6.7M simulated point clouds, and ten Pas et al. (2017a) train on over 700k simulated grasps. Other models are trained on datasets obtained via physical robotic grasp interactions. For example, Pinto and Gupta (2015) train on a dataset created by performing 50k grasp attempts over 700 h, Kalashnikov et al. (2018) train on over 580k grasp attempts collected over the course of 800 robot hours, and Berscheid et al. (2021) train on a dataset of 27k grasps collected over 120 h.

Such reliance on large datasets necessitates either learning in simulation or using significant amounts of robot time to generate data, motivating the desire for a more sample-efficient grasp detection model, i.e., a model that can achieve good performance with a smaller dataset. In this paper, we propose a novel grasp detection strategy that improves sample efficiency significantly by incorporating equivariant structure into the model. We term our strategy Symmetric Grasp learning (SymGrasp). Our key observation is that the target grasp function (from images onto grasp poses) is \(\textrm{SE}(2)\)-equivariant. That is, rotations and translations of the input image should correspond to the same rotations and translations of the detected grasp poses at the output of the function. In order to encode \(\textrm{SE}(2)\)-equivariance in the target function, we constrain the layers of our model to respect this symmetry. Compared with conventional grasp detection models that must be trained on tens of thousands of grasp experiences, the equivariant structure we encode into the model enables us to achieve good grasp performance after only a few hundred grasp attempts.

This paper makes several key contributions. First, we recognize that the grasp detection function from images to grasp poses is an \(\textrm{SE}(2)\)-equivariant function. Then, we propose a neural network model that uses equivariant layers to encode this property. Finally, we introduce several algorithmic optimizations that enable us to learn to grasp online in a contextual bandit framework. Ultimately, our model learns to grasp opaque objects (using depth images) and transparent objects (using RGB images) with a good success rate after only approximately 600 grasp trials, or 1.5 h of robot time. Although the model we propose here is limited to 2D grasping (i.e., we detect only top-down grasps rather than all six dimensions as in 6-DOF grasp detection), the sample efficiency is still impressive, and we believe the concepts could be extended to higher-DOF grasp detection models in the future.

These improvements in sample efficiency are important for several reasons. First, since our model can learn to grasp in only a few hundred grasp trials, it can be trained easily on a physical robotic system. This greatly reduces the need to train on large datasets created in simulation, and it therefore reduces our exposure to the risks associated with bridging the sim2real domain gap—we can simply do all our training on physical robotic systems. Second, since we are training on a small dataset, it is much easier to learn on-policy rather than off-policy, i.e., we can train using data generated by the policy being learned rather than with a fixed dataset. This focuses learning on areas of state space explored by the policy and makes the resulting policies more robust in those areas. Finally, since we can learn efficiently from a small number of experiences, our policy has the potential to adapt relatively quickly at run time to physical changes in the robot sensors and actuators.

2 Related work

2.1 Equivariant convolutional layers

Equivariant convolutional layers incorporate symmetries into the structure of convolutional layers, allowing them to generalize across a symmetry group automatically. This idea was first introduced as G-Convolution (Cohen & Welling, 2016a) and Steerable CNN (Cohen & Welling, 2016b). E2CNN is a generic framework for implementing \(\textrm{E}(2)\) Steerable CNN layers (Weiler & Cesa, 2019). In applications such as dynamics  (Walters et al., 2020; Wang et al., 2020b) and reinforcement learning (van der Pol et al., 2020; Mondal et al., 2020; Wang et al., 2021, 2022) equivariant models demonstrate improvements over traditional approaches.

2.2 Sample efficient reinforcement learning

Recent work has shown that data augmentation using random crops and/or shifts can improve the sample efficiency of standard reinforcement learning algorithms (Laskin et al., 2020a; Kostrikov et al., 2020). It is possible to improve sample efficiency even further by incorporating contrastive learning (van den Oord et al., 2018), e.g. CURL (Laskin et al., 2020b). The contrastive loss enables the model to learn an internal latent representation that is invariant to the type of data augmentation used. The FERM framework (Zhan et al., 2020) applies this idea to robotic manipulation and is able to learn to perform simple manipulation tasks directly on physical robotic hardware. The equivariant models used in this paper are similar to data augmentation in that the goal is to leverage problem symmetries to accelerate learning. However, whereas data augmentation and contrastive approaches require the model to learn an invariant or equivariant encoding, the equivariant model layers used in this paper enforce equivariance as a prior encoded in the model. This simplifies the learning task and enables our model to learn faster (see Sect. 6).

2.3 Grasp detection

In grasp detection, the robot finds grasp configurations directly from visual or depth data. This is in contrast to classical methods which attempt to reconstruct object or scene geometry and then do grasp planning. See Platt (2023) for a review on this topic.

2D Grasping: Several methods are designed to detect grasps in 2D, i.e., to detect the planar position and orientation of grasps in a scene based on top-down images. A key early example of this was DexNet 2.0, which infers the quality of a grasp centered and aligned with an oriented image patch (Mahler et al., 2017). Subsequent work proposed fully convolutional architectures, thereby enabling the model to quickly infer the pose of all grasps in a (planar) scene (Morrison et al., 2018; Satish et al., 2019; Depierre et al., 2018; Kumra et al., 2019; Zhou et al., 2018) (some of these models infer the z coordinate of the grasp as well).

3D Grasping: There is much work in 3D grasp detection, i.e., detecting the full 6-DOF position and orientation of grasps based on truncated signed distance function (TSDF) or point cloud input. A key early example of this was GPD (ten Pas et al., 2017b) which inferred grasp pose based on point cloud input. Subsequent work has focused on improving grasp candidate generation in order to improve efficiency, accuracy, and coverage (Huang et al., 2022; Mousavian et al., 2019; Sundermeyer et al., 2021; Jiang et al., 2021; Fang et al., 2020; Breyer et al., 2021; Berscheid et al., 2021).

On-robot Grasp Learning: Another important trend has been learning to grasp directly from physical robotic grasp experiences. Early examples of this include Pinto and Gupta (2015), who learn to grasp from 50k grasp experiences collected over 700 h of robot time, and Levine et al. (2018), who learn a grasp policy from 800k grasp experiences collected over two months. QT-Opt (Kalashnikov et al., 2018) learns a grasp policy from 580k grasp experiences collected over 800 h, and James et al. (2019) extend this work by learning from an additional 28k grasp experiences. Song et al. (2020) learn a grasp detection model from 8k grasps collected via human demonstration and Zeng et al. (2018) learn an online pushing/grasping policy from just 2.5k grasps.

Transparent object grasping: Commercial depth sensors based on structured-light or time-of-flight techniques often fail to sense transparent objects accurately (Weng et al., 2020). Pixels in the depth image are often dropped due to specularities, or the object is simply invisible to the sensor because it is transparent (Sajjan et al., 2020). To avoid this type of failure, RGB or RGB-D sensors are commonly used. Weng et al. (2020) infer grasp poses from RGBD images. They collect paired RGB and depth images of opaque objects and use transfer learning from a trained depth-modality model to an RGBD-modality model. Sajjan et al. (2020) reconstruct depth images from RGBD images using a CNN and then perform grasp detection. Likewise, Ichnowski et al. (2021) and Kerr et al. (2022) infer grasp poses from reconstructed depth images, but use a neural radiance field (NeRF) (Mildenhall et al., 2021) for reconstruction. These methods either rely on collecting paired images for training or require training a NeRF model per grasp during evaluation. In contrast, our method is trained directly on RGB images without requiring paired images or NeRF models.

\({\text {Equivariance through canonicalization in grasping:}}\) An alternative to modeling rotational symmetry using equivariant neural network layers is an approach known as canonicalization where we learn a model over the non-equivariant variables assuming a single “canonical” group element (Wang et al., 2020b; Kofinas et al., 2021; Gao et al., 2020). Equivariance based on canonicalization is common in robotic grasping where it is not unusual to translate and rotate the input image so that it is expressed in the reference frame of the hand, e.g. Mahler et al. (2017), ten Pas et al. (2017b) and Mousavian et al. (2019). This way, the neural network model need only infer the quality of a grasp for a single canonical grasp pose rather than over arbitrary translations and orientations. In this paper, we compare our model-based approach to equivariance with VPG, a method that obtains rotational equivariance via canonicalization (Song et al., 2020). Our results in Sect. 6.2 suggest that the model-based approach has a significant advantage.

3 Background

3.1 Equivariant neural network models

In this paper, we use equivariant neural network layers defined with respect to a finite group (e.g. a finite group of rotations). An equivariant neural network (Weiler & Cesa, 2019; Cohen et al., 2018; Cohen & Welling, 2016b, a) encodes a function \(f: X \rightarrow Y\) that satisfies the equivariance constraint \(gf(x) = f(gx)\), where \(g \in G\) is an element of a finite group. Here, gx is shorthand for the action of g on x, e.g. the rotation of an image x; similarly, gf(x) denotes the action of g on the output f(x). Below, we make these ideas more precise and summarize how the equivariance constraint is encoded into a neural network layer.

3.1.1 The cyclic group \(C_n \le \textrm{SO}(2)\)

We are primarily interested in equivariance with respect to the group of planar rotations, \(\textrm{SO}(2)\). However, in practice, in order to make our models computationally tractable, we will use the cyclic subgroup \(C_n\) of \(\textrm{SO}(2)\), \(C_n=\{ 2 \pi k / n: 0\le k < n\}\). \(C_n\) is the group of discrete rotations by multiples of \(2\pi /n\) radians.

3.1.2 Representation of a group

The way a group element \(g \in G\) acts on x depends on how x is represented. If x is a point in the plane, then g acts on x via the standard representation, \(\rho _1(g)x\), where \(\rho _1(g)\) is the standard \(2 \times 2\) rotation matrix corresponding to g. In the hidden layers of an equivariant neural network model, it is common to encode a separate feature map for each group element. For example, suppose G is the order n cyclic group and suppose x is a set of features \(x = (x_1, x_2, \dots , x_n) \in \mathbb {R}^{n \times \lambda }\) that maps the kth group element to a feature \(x_k\). The regular representation of g acts on x by permuting its elements: \(\rho _{reg}(g)x = (x_{n-m+1}, \dots , x_n, x_1, x_2, \dots , x_{n-m})\) where g is the mth element in \(C_n\). Finally, it is sometimes the case that x is invariant to the action of the group elements. In this case, we have the trivial representation, \(\rho _0(g)x = x\).
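As a concrete sketch of this action, the circular shift performed by the regular representation can be written in a few lines of plain Python (the helper name `rho_reg` is ours, not from any library):

```python
def rho_reg(m, x):
    """Regular representation of the m-th rotation in C_n acting on
    x = (x_1, ..., x_n), one feature per group element: a circular
    shift by m positions."""
    n = len(x)
    m = m % n
    return x[n - m:] + x[:n - m]

rho_reg(1, [1, 2, 3, 4])  # → [4, 1, 2, 3]
```

Note that composing two shifts gives the shift of the composed rotations, which is exactly the homomorphism property a representation must satisfy.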

3.1.3 Feature maps of equivariant convolutional layers

An equivariant convolutional layer maps between feature maps which transform by specified representations \(\rho \) of the group. In the hidden layers of an equivariant model, generally, an extra dimension is added to the feature maps to encode group elements via the regular representation. So, whereas the feature map used by a standard convolutional layer is a tensor \(\mathcal {F} \in \mathbb {R}^{m \times h \times w}\), an equivariant convolutional layer adds an extra dimension: \(\mathcal {F} \in \mathbb {R}^{k \times m \times h \times w}\), where k denotes the dimension of the group representation. This tensor associates each pixel \((u, v) \in \mathbb {R}^{h \times w}\) with a matrix \(\mathcal {F}(u, v) \in \mathbb {R}^{k \times m}\).

3.1.4 Action of the group operator on the feature map

Given a feature map \(\mathcal {F} \in \mathbb {R}^{k \times m \times h \times w}\) associated with group G and representation \(\rho \), a group element \(g \in G\) acts on \(\mathcal {F}\) via:

$$\begin{aligned} (g \mathcal {F})(x) = \rho (g) \mathcal {F} (\rho _1(g)^{-1} x), \end{aligned}$$

where \(x \in \mathbb {R}^2\) denotes pixel position. The RHS of this equation applies the group operator in two ways. First, \(\rho _1(g)^{-1}\) rotates the pixel position x using the standard representation. Second, \(\rho \) applies the rotation to the feature representation. If the feature is invariant to the rotation, then we use the trivial representation \(\rho _0(g)\). However, if the feature vector changes according to rotation (e.g. the feature denotes grasp orientation), then it must be transformed as well. This is accomplished by setting \(\rho \) in Eq. 1 to be the regular representation that transforms the feature vector by a circular shift.
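A minimal sketch of this two-part action for \(C_4\) with the regular representation, over nested Python lists (helper names are ours, and we assume k = 4, one channel per group element): rotating the pixel grid implements \(\rho _1(g)^{-1}\), and the circular shift of the k channel slots implements \(\rho _{reg}(g)\).

```python
def rot90ccw(plane):
    """Rotate a square 2D grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*plane)][::-1]

def act_on_feature_map(m, F):
    """Apply the m-th rotation of C_4 to a feature map F with shape
    [k][h][w]: rotate every spatial plane (the rho_1(g)^{-1} part),
    then circularly shift the k group-element channels (rho_reg(g))."""
    for _ in range(m % 4):
        F = [rot90ccw(plane) for plane in F]
    k = len(F)
    m = m % k
    return F[k - m:] + F[:k - m]
```

Applying the generator twice gives the same result as applying the rotation by \(\pi\) once, as the group structure requires.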

3.1.5 The equivariant convolutional layer

An equivariant convolutional layer is a function h from \(\mathcal {F}_{in}\) to \(\mathcal {F}_{out}\) that is constrained to represent only equivariant functions with respect to a chosen group G. The feature maps \(\mathcal {F}_{in}\) and \(\mathcal {F}_{out}\) are associated with representations \(\rho _{in}\) and \(\rho _{out}\) acting on feature spaces \(\mathbb {R}^{k_{in}}\) and \(\mathbb {R}^{k_{out}}\) respectively. Then the equivariant constraint for h is Cohen et al. (2018):

$$\begin{aligned} h(g \mathcal {F}_{in}) = g h(\mathcal {F}_{in}) = g \mathcal {F}_{out}. \end{aligned}$$

This constraint can be implemented by tying kernel weights \(K(y) \in \mathbb {R}^{k_{out} \times k_{in}}\) in such a way as to satisfy the following constraint (Cohen et al., 2018):

$$\begin{aligned} K(gy) = \rho _{out}(g) K(y)\rho _{in}(g)^{-1}. \end{aligned}$$

Please see “Appendix B” for an example of an equivariant convolutional layer.

When all hidden layers h in a neural network satisfy Eq. 2, the entire neural network is, by induction, equivariant (Cohen et al., 2018).
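For the special case \(\rho _{in} = \rho _0\) (trivial) and \(\rho _{out} = \rho _{reg}\) over \(C_4\), the kernel-tying constraint above reduces to using the four rotated copies of a single base kernel, one per output channel. The following plain-Python sketch (our own helper names, not the paper's implementation) makes the resulting equivariance checkable numerically:

```python
def rot90ccw(plane):
    """Rotate a square 2D grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*plane)][::-1]

def correlate(img, ker):
    """'Valid' 2D cross-correlation of img with ker."""
    kh, kw = len(ker), len(ker[0])
    return [[sum(img[i + u][j + v] * ker[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(len(img[0]) - kw + 1)]
            for i in range(len(img) - kh + 1)]

def equivariant_lift(img, ker):
    """C_4-equivariant layer (trivial input type, regular output type):
    weight tying amounts to stacking the four rotated copies of one
    base kernel, one output channel per group element."""
    out = []
    for _ in range(4):
        out.append(correlate(img, ker))
        ker = rot90ccw(ker)
    return out
```

Rotating the input image rotates each output plane and circularly shifts the channel axis, i.e., \(h(g \mathcal {F}_{in}) = g h(\mathcal {F}_{in})\) with the regular representation acting on the output.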

3.2 Augmented state representation (ASR)

We will formulate \(\textrm{SE}(2)\) robotic grasping as the problem of learning a function from an m-channel image, \(s \in S = \mathbb {R}^{m \times h \times w}\), to a gripper pose \(a \in A = \textrm{SE}(2)\) from which an object may be grasped. Since we will use the contextual bandit framework, we need to be able to represent the Q-function, \(Q: \mathbb {R}^{m \times h \times w} \times \textrm{SE}(2) \rightarrow \mathbb {R}\). However, this is difficult to do with a single neural network due to GPU memory limitations. To address this, we use the Augmented State Representation (ASR) (Sharma et al., 2017; Wang et al., 2020a) to model Q as a pair of functions, \(Q_1\) and \(Q_2\). A further advantage of ASR is that \(Q_1\) and \(Q_2\) can use different group orders, as explained in Sect. 5.1.3.

We follow the ASR framework that factors \(\textrm{SE}(2) = \mathbb {R}^2 \times \textrm{SO}(2)\) into a translational component \(X \subseteq \mathbb {R}^2\) and a rotational component \(\Theta \subseteq \textrm{SO}(2)\). The first function is a mapping \(Q_1: \mathbb {R}^{m \times h \times w} \times X \rightarrow \mathbb {R}\) which maps from the image s and the translational component of action X onto value. This function is defined to be: \(Q_1(s,x) = \max _{\theta \in \Theta } Q(s,(x,\theta ))\). The second function is a mapping \(Q_2: \mathbb {R}^{m \times h' \times w'} \times \Theta \rightarrow \mathbb {R}\) with \(h' \le h\) and \(w' \le w\) which maps from an image patch and an orientation onto value. This function takes as input a cropped version of s centered on a position x, \(\textrm{crop}(s,x)\), and an orientation, \(\theta \), and outputs the corresponding Q value: \(Q_2(\textrm{crop}(s,x),\theta ) = Q(s,(x,\theta ))\).

Inference is performed on the model by first evaluating \(x^* = \mathop {\text {arg}\,\text {max}}\nolimits _{x \in X} Q_1(s,x)\) and then evaluating \(\theta ^* = \mathop {\text {arg}\,\text {max}}\limits _{\theta }Q_2(\textrm{crop}(s,x^*),\theta )\). Since each of the two models, \(Q_1\) and \(Q_2\), is significantly smaller than Q would be, inference is much faster. Figure 1 shows an illustration of this process. The top of the figure shows the action of \(Q_1\) while the bottom shows \(Q_2\). Notice that the semantics of \(Q_2\) imply that \(\theta \) depends only on \(\textrm{crop}(s,x)\), a local neighborhood of x, rather than on the entire scene. This assumption is generally true for grasping because grasp orientation typically depends only on the object geometry near the target grasp point.
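The two-stage argmax can be sketched as follows (plain Python; `q1_map` and `q2_of` are hypothetical stand-ins for the outputs of the trained \(Q_1\) and \(Q_2\) networks):

```python
def crop(s, center, half):
    """Patch of s around center, clipped at the image border."""
    i0, j0 = center
    return [row[max(j0 - half, 0): j0 + half + 1]
            for row in s[max(i0 - half, 0): i0 + half + 1]]

def asr_inference(s, q1_map, q2_of, half=2):
    """Stage 1: best translational action from the dense Q1 map.
    Stage 2: best rotational action from Q2 evaluated on the crop only."""
    h, w = len(q1_map), len(q1_map[0])
    x_star = max(((i, j) for i in range(h) for j in range(w)),
                 key=lambda p: q1_map[p[0]][p[1]])
    q2_values = q2_of(crop(s, x_star, half))
    theta_star = max(range(len(q2_values)), key=q2_values.__getitem__)
    return x_star, theta_star
```

Because \(Q_2\) is only ever evaluated on one small crop per inference, the second stage is cheap regardless of scene size.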

Fig. 1
figure 1

Illustration of the ASR representation. \(Q_1\) selects the translational component of an action, \(Q_2\) selects the rotational component

4 Problem statement

4.1 Planar grasp detection

The planar grasp detection function \(\Gamma : \mathbb {R}^{m \times h \times w} \rightarrow \textrm{SE}(2)\) maps from a top down image of a scene containing graspable objects, \(s \in S = \mathbb {R}^{m \times h \times w}\), to a planar gripper pose, \(a \in A = \textrm{SE}(2)\), from which an object can be grasped. This is similar to the formulations used by Mahler et al. (2017) and Morrison et al. (2018).

4.2 Formulation as a contextual bandit

We formulate grasp learning as a contextual bandit problem where the state is an image \(s \in S = \mathbb {R}^{m \times h \times w}\) and the action \(a \in A = \textrm{SE}(2)\) is a grasp pose to which the robot hand will be moved and a grasp will be attempted, expressed in the reference frame of the image. After each grasp attempt, the agent receives a binary reward R drawn from a Bernoulli distribution with unknown probability r(s, a). The true Q-function denotes the expected reward of taking action a from s. Since R is binary, we have that \(Q(s, a) = r(s, a)\). This formulation of grasp learning as a bandit problem is similar to that used by, e.g. Danielczuk et al. (2020), Kalashnikov et al. (2018) and Zeng et al. (2018).
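The interaction loop implied by this formulation can be sketched as follows; the environment and learner callbacks (`env_observe`, `env_grasp`, `select_action`, `update_q`) are hypothetical placeholders for components described elsewhere in the paper.

```python
def bandit_grasp_loop(num_trials, q, env_observe, env_grasp,
                      select_action, update_q):
    """Contextual bandit: observe a state, attempt one grasp, receive a
    binary Bernoulli reward, and regress Q(s, a) toward that reward."""
    successes = 0
    for _ in range(num_trials):
        s = env_observe()            # top-down image of the scene
        a = select_action(q, s)      # grasp pose (x, theta)
        r = env_grasp(a)             # binary reward with E[R] = r(s, a)
        update_q(q, s, a, r)         # one-step target: Q(s, a) -> r
        successes += r
    return successes / num_trials
```

Unlike full reinforcement learning, there is no bootstrapped next-state value: each grasp attempt is a one-step episode.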

4.3 Invariance assumption

We assume that the (unknown) reward function r(s, a), which denotes the probability of a successful grasp, is invariant to translations and rotations \(g \in \textrm{SE}(2)\). Let gs denote the image s translated and rotated by g. Similarly, let ga denote the action translated and rotated by g. Our assumption is therefore:

$$\begin{aligned} r(s,a) = r(gs,ga). \end{aligned}$$

Intuitively, when the image of a scene transforms, the grasp poses (located with respect to the image) transform correspondingly.

5 SymGrasp: symmetric grasp learning

5.1 Equivariant learning

We use equivariant neural networks (Weiler & Cesa, 2019) to model the Q-function so that it enforces the invariance assumption of Sect. 4.3. Therefore, once the Q-function is fit to a data sample r(s, a), it generalizes to any \(g\in \textrm{SE}(2)\)-transformed data sample r(gs, ga). This generalization can lead to a significant improvement in sample efficiency during training.

5.1.1 Invariance properties of \(Q_1\) and \(Q_2\)

The assumption that the reward function r is invariant to transformations \(g \in \textrm{SE}(2)\) implies that the optimal Q-function is also invariant to g, i.e., \(Q(s, a) = Q(gs,ga)\). In the context of the augmented state representation (ASR, see Sect. 3.2), this implies separate invariance properties for \(Q_1\) and \(Q_2\):

$$\begin{aligned} Q_1(gs, gx)&= Q_1(s, x) \end{aligned}$$
$$\begin{aligned} Q_2(g_\theta (\textrm{crop}(s, x)), g_\theta +\theta )&= Q_2(\textrm{crop}(s, x), \theta ), \end{aligned}$$

where \(g_\theta \in \textrm{SO}(2)\) denotes the rotational component of \(g\in \textrm{SE}(2)\), gx denotes the rotated and translated vector \(x \in \mathbb {R}^2\), and \(g_\theta (\textrm{crop}(s, x))\) denotes the cropped image rotated by \(g_\theta \).

5.1.2 Discrete approximation of \(\textrm{SE}(2)\)

We implement the invariance constraints of Eqs. 5 and 6 using a discrete approximation to \(\textrm{SE}(2)\). We constrain the positional component of the action to be a discrete pair of positive integers \(x \in \{1\dots h\}\times \{1\dots w\}\subset \mathbb {Z}^2\), corresponding to a pixel in s, and constrain the rotational component of the action to be an element of the finite cyclic group \(C_n=\{2\pi k/n: 0\le k < n, k\in \mathbb {Z}\}\). This discretized action space will be written \(\hat{\textrm{SE}}(2) = \mathbb {Z}^2 \times C_n\).

5.1.3 Equivariant Q-learning with ASR

In Q-Learning with ASR, we model \(Q_1\) and \(Q_2\) as neural networks. We model \(Q_1\) as a fully convolutional UNet (Ronneberger et al., 2015) \(q_1: \mathbb {R}^{m \times h\times w}\rightarrow \mathbb {R}^{1\times h\times w}\) that generates a Q value for each discretized translational action from the input state image. We model \(Q_2\) as a standard convolutional network \(q_2: \mathbb {R}^{m \times h'\times w'}\rightarrow \mathbb {R}^n\) that evaluates Q values for the n discretized rotational actions based on the image patch. The networks \(q_1\) and \(q_2\) thus model the functions \(Q_1\) and \(Q_2\) by partially evaluating at the first argument and returning a function of the second. As a result, the invariance properties of \(Q_1\) and \(Q_2\) (Eqs. 5, 6) imply the equivariance of \(q_1\) and \(q_2\):

$$\begin{aligned} q_1(gs)&= g q_1(s) \end{aligned}$$
$$\begin{aligned} q_2(g_\theta \textrm{crop}(s,x))&= \rho _{reg}(g_\theta ) q_2(\textrm{crop}(s,x)) \end{aligned}$$

where \(g\in \hat{\textrm{SE}}(2)\) acts on the output of \(q_1\) through rotating and translating the Q-map, and \(g_\theta \in C_n\) acts on the output of \(q_2\) by performing a circular shift of the output Q values via the regular representation \(\rho _{reg}\).

This is illustrated in Fig. 2. In Fig. 2a we take an example of a depth image s in the upper left corner. If we rotate and translate this image by g (lower left of Fig. 2a) and then evaluate \(q_1\), we arrive at \(q_1(gs)\). This corresponds to the LHS of Eq. 7. However, because \(q_1\) is an equivariant function, we can calculate the same result by first evaluating \(q_1(s)\) and then applying the transformation g (RHS of Eq. 7). Figure 2b illustrates the same concept for Eq. 8. Here, the network takes the image patch \(\textrm{crop}(s,x)\) as input. If we rotate the image patch by \(g_\theta \) and then evaluate \(q_2\), we obtain the LHS of Eq. 8, \(q_2(g_\theta \textrm{crop}(s,x))\). However, because \(q_2\) is equivariant, we can obtain the same result by evaluating \(q_2(\textrm{crop}(s,x))\) and circular shifting the resulting vector to denote the change in orientation by one group element.

Fig. 2
figure 2

Equivariance relations expressed by Eqs. 7 and 8

5.1.4 Model architecture of equivariant \(q_1\)

As a fully convolutional network, \(q_1\) inherits the translational equivariance property of standard convolutional layers. The challenge is to encode rotational equivariance so as to satisfy Eq. 7. We accomplish this using equivariant convolutional layers that satisfy the equivariance constraint of Eq. 2 where we assign \(\mathcal {F}_{in}=s\in {\mathbb {R}^{1\times m\times h\times w}}\) to encode the input state s and \(\mathcal {F}_{out} \in \mathbb {R}^{1\times 1\times h\times w}\) to encode the output Q-map. Both feature maps are associated with the trivial representation \(\rho _0\) such that the rotation g operates on these feature maps by rotating pixels without changing their values. We use the regular representation \(\rho _{reg}\) for the hidden layers of the network to encode more comprehensive information in the intermediate layers. We found we achieved the best results when we defined \(q_1\) using the dihedral group \(D_4\) which expresses the group generated by rotations of multiples of \(\pi /2\) in combination with vertical reflections.

5.1.5 Model architecture of equivariant \(q_2\)

Whereas the equivariance constraint in Eq. 7 is over \(\hat{\textrm{SE}}(2)\), the constraint in Eq. 8 is over \(C_n\) only. We implement Eq. 8 using Eq. 2 with an input of \(\mathcal {F}_{in} = \textrm{crop}(s, x) \in \mathbb {R}^{1\times m\times h'\times w'}\) as a trivial representation, and an output of \(\mathcal {F}_{out} \in \mathbb {R}^{n\times 1\times 1\times 1}\) as a regular representation. \(q_2\) is defined in terms of the group \(C_n\), assuming the rotations in the action space are defined to be multiples of \(2\pi /n\).

5.1.6 \(q_2\) symmetry expressed as a quotient group

It turns out that additional symmetries exist when the gripper has a bilateral symmetry. In particular, it is often the case that rotating a grasp pose by \(\pi \) radians about its forward axis does not affect the probability of grasp success, i.e., r is invariant to rotations of the action by \(\pi \) radians. When this symmetry is present, we can model it using the quotient group \(C_n / C_2 \cong \{ 2 \pi k /n: 0\le k < n/2, k\in \mathbb {Z}\}\), which places orientations separated by \(\pi \) radians into the same equivalence class.
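A small sketch of this identification (the helper name is ours): mapping a \(C_n\) rotation onto its \(C_n / C_2\) equivalence class simply reduces the rotation index modulo n/2, so \(\theta \) and \(\theta + \pi \) land in the same class.

```python
import math

def quotient_class(theta, n):
    """Index of the C_n/C_2 equivalence class of a C_n rotation theta,
    identifying grasp orientations that differ by pi (bilateral gripper)."""
    k = round(theta * n / (2 * math.pi)) % n
    return k % (n // 2)
```

For example, with n = 8 the orientations \(3\pi /4\) and \(3\pi /4 + \pi \) map to the same class, halving the number of distinct rotational actions the network must score.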

5.2 Other optimizations

While our use of equivariant models to encode the Q-function is responsible for most of our gains in sample efficiency (Sect. 6.3), there are several additional algorithmic details that, taken together, have a meaningful impact on performance.

5.2.1 Loss function

In the standard ASR loss function, given a one-step reward r(s, a), where \(a=(x, \theta )\), \(Q_1\) and \(Q_2\) have the following targets (Wang et al., 2020a):

$$\begin{aligned} \mathcal {L}&= \mathcal {L}_{1} + \mathcal {L}_{2} \end{aligned}$$
$$\begin{aligned} \mathcal {L}_{1}&= \tfrac{1}{2}(Q_1(s, x) - \max _{\theta '} Q_2(\textrm{crop}(s, x), \theta '))^2 \end{aligned}$$
$$\begin{aligned} \mathcal {L}_{2}&= \tfrac{1}{2}(Q_2(\textrm{crop}(s, x), \theta ) - r(s, a))^2. \end{aligned}$$

In the \(\mathcal {L}_1\) term, however, since the reward r(s, a) is the ground-truth return for \(Q_2(\textrm{crop}(s, x), \theta )\), we can correct \(Q_2\) using r(s, a). Denote by \(\hat{Q}_2\) the corrected \(Q_2\):

$$\begin{aligned} \hat{Q}_2(\textrm{crop}(s, x), {\theta '}) = {\left\{ \begin{array}{ll} Q_2(\textrm{crop}(s, x), {\theta '}), &{} \text {if } {\theta '}\ne \theta \\ r(s, a), &{} \text {if } {\theta '}=\theta \end{array}\right. } \end{aligned}$$

We then modify \(\mathcal {L}_1\) to learn from \(\hat{Q}_2\):

$$\begin{aligned} \mathcal {L}_1' = \frac{1}{2}\Big (Q_1(s, x) - \max _{\theta '} \left[ \hat{Q}_2(\textrm{crop}(s, x), {\theta '}) \right] \Big )^2. \end{aligned}$$

In addition to the above, we add an off-policy loss term \(\mathcal {L}_1''\) that is evaluated with respect to an additional k grasp positions \(\bar{X} \subset X\) sampled using a Boltzmann distribution from \(Q_1(s)\):

$$\begin{aligned} \begin{aligned} \mathcal {L}_1'' =&\frac{1}{2k}\sum _{x_i\in \bar{X}} \Big (Q_1(s, x_i) \\&-\max _{\theta ^{'}} \left[ Q_2(\textrm{crop}(s, x_i), \theta ^{'}) \right] \Big )^2, \end{aligned} \end{aligned}$$

where \(Q_2\) provides targets for training \(Q_1\). This off-policy loss reduces the gap between \(Q_1\) and \(Q_2\). Our combined loss function is therefore \(\mathcal {L} = \mathcal {L}_{1}' + \mathcal {L}_{1}'' + \mathcal {L}_{2}\).
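Putting the three terms together, the combined loss can be sketched with plain floats standing in for network outputs (the function and argument names are ours, not the paper's code):

```python
def combined_loss(q1_x, q2_vals, theta, r, offpolicy_pairs):
    """L = L1' + L1'' + L2 for one transition.
    q1_x:     Q1(s, x) at the executed position
    q2_vals:  [Q2(crop(s, x), theta') for each discrete theta']
    theta:    index of the executed orientation; r: binary reward
    offpolicy_pairs: [(Q1(s, x_i), Q2 values at x_i)] for k sampled positions
    """
    # L2: regress Q2 at the executed orientation toward the observed reward
    l2 = 0.5 * (q2_vals[theta] - r) ** 2
    # L1': replace Q2 at the executed orientation with the true reward
    q2_hat = list(q2_vals)
    q2_hat[theta] = r
    l1 = 0.5 * (q1_x - max(q2_hat)) ** 2
    # L1'': off-policy term tying Q1 to Q2 at k extra positions
    k = max(len(offpolicy_pairs), 1)
    l1pp = sum(0.5 * (q1 - max(q2)) ** 2 for q1, q2 in offpolicy_pairs) / k
    return l1 + l1pp + l2
```

In an actual training loop these scalars would be network outputs and the loss would be backpropagated; the sketch only illustrates how the three terms combine.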

5.2.2 Prioritizing failure experiences in minibatch sampling

In the contextual bandit setting, we want to avoid the agent repeating the same incorrect action. This can happen because a failed grasp often leaves the scene unchanged, so the image, and therefore the Q-map, is also unchanged. We address this problem by prioritizing failure experiences: when a grasp fails, the experience is included in the sampled minibatch on the next SGD step (Zeng et al., 2018), updating the Q-function before it is reevaluated on the next time step. The update to Q reduces the chance of selecting the same (bad) action again.

5.2.3 Boltzmann exploration

We find empirically that Boltzmann exploration performs better than \(\epsilon \)-greedy exploration in our grasping setting. We use a temperature of \(\tau _\text {training}\) during training and a lower temperature of \(\tau _\text {test}\) during testing. Using a non-zero temperature at test time helps reduce the chance of repeatedly sampling a bad action.
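Boltzmann (softmax) exploration with temperature \(\tau \) can be sketched as follows (a hypothetical helper; subtracting the maximum Q value avoids numerical overflow):

```python
import math
import random

def boltzmann_sample(q_values, tau):
    """Sample an action index with probability proportional to exp(Q/tau);
    a lower tau concentrates the distribution on the greedy action."""
    m = max(q_values)
    weights = [math.exp((q - m) / tau) for q in q_values]
    return random.choices(range(len(q_values)), weights=weights)[0]
```

As \(\tau \rightarrow 0\) this recovers greedy action selection, while larger \(\tau \) spreads probability mass over suboptimal actions.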

5.2.4 Data augmentation

Even though we are using equivariant neural networks to encode the Q-function, it can still be helpful to perform data augmentation as well. This is because the granularity of the rotation group encoded in \(q_1\) (\(D_4\)) is coarser than that of the action space (\(C_n/C_2\)). We address this problem by augmenting the data with translations and rotations sampled from \(\hat{\textrm{SE}}(2)\). For each experienced transition, we add eight additional \(\hat{\textrm{SE}}(2)\)-transformed images to the replay buffer.

5.2.5 Softmax at the output of \(q_1\) and \(q_2\)

Since we are using a contextual bandit with binary rewards and the reward function r(s, a) denotes the parameter of a Bernoulli distribution at (s, a), we know that \(Q_1\) and \(Q_2\) must each take values between zero and one. We encode this prior using an element-wise softmax layer at the output of each of the \(q_1\) and \(q_2\) networks.

5.2.6 Selection of the z coordinate

In order to execute a grasp, we must calculate a full \(x,y,\theta , z\) goal position for the gripper. Since our model only infers a planar grasp pose, we must calculate a depth along the axis orthogonal to this plane (the z axis) using other means. In this paper, we calculate z by taking the average depth over a \(5 \times 5\) pixel region centered on the grasp point in the input depth image. The commanded gripper height is set to an offset value from this calculated height. While executing the motion to this height, we monitor force feedback from the arm and halt the motion prematurely if a threshold is exceeded.
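This depth heuristic can be sketched as follows (a plain-Python stand-in; the window size and offset handling mirror the description above, but the constants are illustrative):

```python
def grasp_z(depth, i, j, half=2, offset=0.0):
    """Average depth over a (2*half+1)^2 window centered on the grasp
    pixel (clipped at the image border), minus a fixed gripper offset."""
    window = [depth[u][v]
              for u in range(max(i - half, 0), min(i + half + 1, len(depth)))
              for v in range(max(j - half, 0), min(j + half + 1, len(depth[0])))]
    return sum(window) / len(window) - offset
```

Averaging over a small window rather than reading a single pixel makes the height estimate robust to isolated depth-sensor noise at the grasp point.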

5.3 Optimizations for transparent object grasping using RGB input

In order to grasp transparent objects using our model, we found it was helpful to make a few small modifications to our model and setup. Most importantly, we found it was essential to use RGB rather than depth-only image input to the model.

\({\text {Bin color:}}\) We found that for transparent objects, our system performed much better using a black bin color rather than a white or transparent bin color. Figure 6 illustrates this difference in setup. We believe that the performance difference is due to the larger contrast between the transparent objects and the bin in the RGB spectrum.

\({\text {Dihedral group in}}\,q_2\): Another optimization we used in our transparent object experiments was to implement Eq. 8 using the dihedral group \(D_n\), which contains the rotations by multiples of \(2\pi /n\) together with reflections, rather than \(C_n\). As in Sect. 5.1.6, we use a quotient group that encodes gripper symmetries. Here, this quotient group becomes \(D_n/D_2\), which places orientations separated by \(\pi \) and reflected orientations into the same equivalence class.
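The two quotient structures can be illustrated by canonicalizing gripper orientations; the snippet below is a sketch under one concrete reading of the reflection axis (our assumption):

```python
import numpy as np

def canonical_angle_cn(theta):
    """C_n / C_2: orientations theta and theta + pi are equivalent,
    so angles fold into [0, pi)."""
    return theta % np.pi

def canonical_angle_dn(theta):
    """D_n / D_2 (sketch): additionally identify theta with its
    reflection -theta, folding the range to [0, pi / 2]. The choice
    of reflection axis here is illustrative."""
    t = theta % np.pi
    return min(t, np.pi - t)
```

Under \(C_n/C_2\), a grasp at \(\pi - 0.3\) and one at \(0.3\) are distinct; under \(D_n/D_2\) they fall into the same equivalence class.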

\({\text {Collision penalty:}}\) In our experiments with transparent objects, we found that collision was a more significant problem than it was with opaque objects. We believe this was a result of the fact that since the transparent objects did not completely register in the depth image, standard collision checking between the object point cloud and the gripper did not suffice to prevent collisions. Therefore, in our transparent object experiments, we penalized successful grasps that produced collisions during grasping by awarding those grasps only 0.8 reward instead of the full 1.0 reward.

6 Experiments in simulation

6.1 Setup

Fig. 3

a The 86 objects used in our simulation experiments, drawn from the GraspNet-1Billion dataset (Fang et al., 2020). b PyBullet simulation. c and d Top-down depth and RGB images of the grasping scene

6.1.1 Object set

All simulation experiments are performed using objects drawn from the GraspNet-1Billion dataset (Fang et al., 2020). This includes 32 objects from the YCB dataset (Calli et al., 2017), 13 adversarial objects used in DexNet 2.0 (Mahler et al., 2017), and 43 additional objects unique to GraspNet-1Billion (Fang et al., 2020), for a total of 88 objects. Of these 88 objects, we exclude two bowls because they can be stably placed in non-graspable orientations, i.e., they can land upside down and cannot be grasped in that orientation using standard grippers. We also scale the objects so that they are graspable from any stable configuration. Lastly, each object is assigned a random RGB color drawn from the uniform distribution U((0.6, 0.6, 0.6), (1, 1, 1)). We refer to these 86 mesh models as our simulation “object set”, shown in Fig. 3a.

6.1.2 Simulation details

Our experiments are performed in PyBullet (Coumans & Bai, 2016). The environment includes a Kuka robot arm and a \(0.3~\text {m}\times 0.3~\text {m}\) tray with inclined walls (Fig. 3b). At the beginning of each episode, the environment is initialized with 15 objects drawn uniformly at random from our object set and dropped into the tray from a height of 40 cm so that they settle into a random configuration. The state is a depth image (depth modality) or an RGB image (RGB modality) captured from a top-down camera (Fig. 3c, d). On each time step, the agent perceives the state and selects an action to execute, which specifies the planar pose to which to move the gripper. A grasp is considered successful if the robot lifts the object more than 0.1 m above the table. The environment is reinitialized when all objects have been removed from the tray or 30 grasp attempts have been made.

6.2 Comparison against baselines on depth modality

6.2.1 Baseline model architectures

We compare our method against two model architectures from the literature: VPG (Zeng et al., 2018) and FC-GQ-CNN (Satish et al., 2019). Each model is evaluated alone and then with two different data augmentation strategies (soft equ and RAD). In all cases, we use the contextual bandit formulation described in Sect. 4.2. The baseline model architectures are:

\({\text {VPG:}}\) The architecture used for grasping in Zeng et al. (2018). This model is a fully convolutional network (FCN) with a single-channel output; the Q values of different gripper orientations are evaluated by rotating the input image. We ignore the pushing functionality of VPG.

\({\text {FC-GQ-CNN:}}\) The model architecture used in Satish et al. (2019). This is an FCN with an 8-channel output that associates each grasp rotation with one channel of the output.

During training, our model uses Boltzmann exploration with a temperature of \(\tau = 0.01\), while the baselines use \(\epsilon \)-greedy exploration annealed from \(\epsilon = 50\%\) to \(\epsilon = 10\%\) over 500 grasps (following the original implementation in Zeng et al. (2018)).
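The two exploration schemes can be sketched as follows; the Q-map shapes and schedule details are illustrative, not taken from the implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(q_map, tau=0.01):
    """Sample a pixel action with probability proportional to exp(Q / tau)."""
    logits = (q_map.ravel() - q_map.max()) / tau   # shift for stability
    p = np.exp(logits)
    p /= p.sum()
    idx = rng.choice(p.size, p=p)
    return np.unravel_index(idx, q_map.shape)

def eps_greedy_action(q_map, step, start=0.5, end=0.1, horizon=500):
    """Epsilon-greedy with epsilon annealed linearly from `start` to `end`."""
    eps = max(end, start - (start - end) * step / horizon)
    if rng.random() < eps:
        idx = rng.integers(q_map.size)             # random exploration
    else:
        idx = q_map.argmax()                       # greedy exploitation
    return np.unravel_index(idx, q_map.shape)
```

At \(\tau = 0.01\) the Boltzmann policy is already close to greedy; lowering \(\tau\) to 0.002 at test time makes it near-deterministic.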

6.2.2 Data augmentation strategies

The data augmentation strategies are:

\({n\times \,\text {RAD:}}\) The method from Laskin et al. (2020a), which augments each sample in the mini-batch with respect to Eq. 4. Specifically, for each SGD step, we first draw bs samples (where bs is the batch size) from the replay buffer. Then, for each sample, we augment both the observation and the action using a random \(\textrm{SE}(2)\) transformation while leaving the reward unchanged. We perform n SGD steps on RAD-augmented mini-batches after each grasp sample.

\({n\times \,\text {soft equ:}}\) Similar to \(n \times \) RAD except that we produce a mini-batch by drawing bs/n samples, randomly augmenting those samples n times with respect to Eq. 4, and then performing a single SGD step. Details can be found in “Appendix C”.
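The difference between the two strategies lies purely in how the mini-batch is assembled, which can be sketched as follows (hypothetical helper names; `augment` stands in for a random \(\textrm{SE}(2)\) transform of a transition):

```python
import numpy as np

def rad_batch(buffer, bs, augment, rng):
    """n x RAD: draw bs samples and augment each once; the caller
    performs n such SGD steps per grasp."""
    idx = rng.integers(len(buffer), size=bs)
    return [augment(buffer[i]) for i in idx]

def soft_equ_batch(buffer, bs, n, augment, rng):
    """n x soft equ: draw bs // n samples and augment each n times,
    then the caller performs a single SGD step on the result."""
    idx = rng.integers(len(buffer), size=bs // n)
    return [augment(buffer[i]) for i in idx for _ in range(n)]

# Example with placeholder transitions and an identity "augmentation":
rng = np.random.default_rng(0)
buffer = list(range(100))
identity = lambda sample: sample
batch = soft_equ_batch(buffer, bs=8, n=4, augment=identity, rng=rng)
```

Both schemes see bs augmented samples per SGD step; soft equ reuses each drawn transition n times under different transformations, while RAD draws bs distinct transitions.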

6.2.3 Results and discussion

The learning curves of Fig. 4 show the grasp success rate versus the number of grasp attempts in the depth modality. Figure 4a shows online learning performance; each data point is the average success rate over the last 150 grasps (therefore, the first data point occurs at 150). Figure 4b shows test performance, obtained by pausing training every 150 grasp attempts, performing 1000 test grasps, and reporting the average performance over those 1000 test grasps. Our method is tested at a lower temperature of \(\tau =0.002\), while the baselines are tested with purely greedy behavior.

Generally, our proposed equivariant model convincingly outperforms the baseline methods and data augmentation strategies in the depth modality. In particular, Fig. 4b shows that the test success rate of the equivariant model after only 150 grasp attempts matches or exceeds that of all baseline methods after 1500 grasp attempts. Notice that each of the two data augmentation methods we consider (RAD and soft equ) has a positive effect on the baselines. However, after training for the full 1500 grasp attempts, our equivariant model converges to the highest grasp success rate (\(93.9\pm 0.4\%\)). Please see “Appendix D” for a comparison over a longer training horizon.

Fig. 4

Comparison with baselines. All lines are an average of four runs. Shading denotes standard error. a Shows learning curves as a running average over the last 150 training grasps. b Shows the average near-greedy performance of 1000 validation grasps performed every 150 training steps

6.3 Ablation study

There are three main parts of SymGrasp as described in this paper: (1) the use of equivariant convolutional layers instead of standard convolutional layers; (2) the use of the augmented state representation (ASR) instead of a single network; (3) the various optimizations described in Sect. 5.2. Here, we evaluate the performance of the method when ablating each of these three parts in the depth modality. For additional ablations, see “Appendix D”.

6.3.1 Baselines

In \({\text {no equ}}\), we replace all equivariant layers with standard convolutional layers. In \({\text {no ASR}}\), we replace the equivariant \(q_1\) and \(q_2\) models described in Sect. 3.2 with a single equivariant network. In \({\text {no opt}}\), we remove the optimizations described in Sect. 5.2. In addition, we evaluate \({\text {rot equ}}\), which is the same as \({\text {no ASR}}\) except that we replace ASR with a U-net (Ronneberger et al., 2015) and apply \(4\times \) RAD (Laskin et al., 2020a) augmentation. Detailed network architectures can be found in “Appendix A”.

6.3.2 Results and discussion

Figure 5a and b show the results, reported in exactly the same manner as in Sect. 6.2. \({\text {no equ}}\) performs worst, suggesting that our equivariant model is critical. Adding data augmentation (\({\text {rot equ}}\)) improves on this somewhat, but it still underperforms significantly. The other ablations, \({\text {no ASR}}\) and \({\text {no opt}}\), demonstrate that those parts of the method are also important.

6.4 Effect of background color

This experiment evaluates the effect of background color in the RGB modality. We compare training with tray colors offset from the mean RGB value of the objects, which is (0.8, 0.8, 0.8). For example, in \({\text {mean}-0.8}\), the tray color is black (0, 0, 0). We test four colors ranging from black (mean\(-0.8\)) to white (mean\(+0.2\)).

6.4.1 Results and discussion

Figure 5c and d show that the contrast between the background and object colors has a significant impact on grasp learning. In particular, the model performs worst when the background color equals the mean object color (\({\text {mean}}\)) and best when the background color contrasts most strongly with the mean object color (\({\text {mean}\,-0.8}\)).

Fig. 5

Ablation study results. Lines are an average over 4 runs. Shading denotes standard error. The left column shows learning curves as a running average over the last 150 training grasps; the right column shows the average near-greedy performance of 1000 validation grasps performed every 150 training steps. The first row is in the depth modality and the second row is in the RGB modality

Fig. 6

Setup for self-supervised training on the robot

7 Experiments in hardware

7.1 Setup

7.1.1 Robot environment

Our experimental platform comprises a Universal Robots UR5 manipulator equipped with a Robotiq 2F-85 parallel-jaw gripper. For the depth modality, we use an Occipital Structure Sensor; for the RGB modality, we use a Kinect Azure Sensor paired with two black trays. The dual-tray grasping environment is shown in Fig. 6. The workstation is equipped with an Intel Core i7-7800X CPU and an NVIDIA GeForce GTX 1080 GPU.

7.1.2 Self-supervised training

Training begins with the 15 training objects (Fig. 9a for opaque object grasping and Fig. 10a for transparent object grasping) being dropped into one of the two trays by the human operator. The robot then attempts to pick objects from one tray and drop them into the other. The drop location is sampled from a Gaussian distribution centered in the middle of the receiving tray. All grasp attempts are generated by the contextual bandit. When all 15 objects have been transported in this way, training switches to grasping from the other tray and dropping into the first. Whether all objects have been transported is decided by a heuristic: for opaque objects, we threshold the depth sensor reading; for transparent objects, we compare an RGB image of the current scene with an RGB image of an empty tray, similar to Kalashnikov et al. (2018). During training, the robot performs 600 grasp attempts in this way (600 attempts, not 600 successful grasps). A reward of \(r=1\) is assigned if the gripper was blocked by the grasped object; otherwise, \(r=0\).
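The two tray-empty heuristics can be sketched as follows; the thresholds and the fraction cutoff are illustrative assumptions, not values from the paper:

```python
import numpy as np

def tray_empty_depth(depth, empty_depth, tol=0.005):
    """Opaque objects: the tray is empty when no pixel rises more than
    `tol` meters above the empty-tray depth map (smaller depth values
    are closer to the top-down camera)."""
    return bool(np.all(empty_depth - depth < tol))

def tray_empty_rgb(rgb, empty_rgb, tol=0.05, frac=0.01):
    """Transparent objects: compare against an image of the empty tray
    and declare it empty if fewer than `frac` of pixels differ by more
    than `tol` in any channel."""
    diff = np.abs(rgb.astype(float) - empty_rgb.astype(float)).max(axis=-1)
    return bool((diff > tol).mean() < frac)
```

In practice, a small tolerance is needed in both cases to absorb sensor noise and lighting variation.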

7.1.3 In-motion computation

We nearly double the speed of robot training by performing all image processing and model learning while the robotic arm is in motion. This was implemented in Python as a producer-consumer process using mutexes. As a result, our robot is constantly in motion during training, and the training speed of our equivariant algorithm is limited only by the velocity of robot motion. This improvement increased the robot training speed from approximately 230 grasps per hour to roughly 400 grasps per hour.
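A minimal sketch of this producer-consumer pattern using Python's thread-safe queue (the paper's exact implementation is not shown; the summation below is a stand-in for one SGD step):

```python
import queue
import threading

def trainer(batches, stop, losses):
    """Consumer thread: run learning updates on queued batches while
    the main thread commands the arm."""
    while not stop.is_set() or not batches.empty():
        try:
            batch = batches.get(timeout=0.05)
        except queue.Empty:
            continue                     # nothing to train on yet
        losses.append(sum(batch))        # stand-in for an SGD step
        batches.task_done()

batches, stop, losses = queue.Queue(), threading.Event(), []
worker = threading.Thread(target=trainer, args=(batches, stop, losses))
worker.start()

# Producer: the grasp loop enqueues processed transitions while the
# arm is moving; training proceeds concurrently.
for grasp in range(3):
    batches.put([grasp, grasp + 1])

batches.join()                           # wait for all batches to be consumed
stop.set()
worker.join()
```

The internal lock of `queue.Queue` plays the role of the mutex, so the grasp loop never blocks on learning and vice versa.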

7.1.4 Evaluation procedure

Evaluation begins with the human operator dropping objects into one tray; after this, no human interference is allowed. We evaluate the robot's grasp success rate. A key failure mode during testing is repeated grasp failures. To combat this, we use the procedure of Zeng et al. (2018): after a grasp failure, we perform multiple SGD steps using that experience to “discourage” the model from selecting the same action, and then use the updated model for the subsequent grasp. After a successful grasp, we discard these updates and reload the original network.
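A toy tabular sketch of this discourage-and-restore procedure (the paper applies it to network weights via SGD; the step count and learning rate below are illustrative):

```python
import numpy as np

def discourage_failure(q_map, action, steps=5, lr=0.5):
    """After a failed grasp, repeatedly push Q at the failed action
    toward the observed reward of 0 on a copy of the model. Returns
    the adjusted model and a snapshot to restore after the next
    successful grasp."""
    snapshot = q_map.copy()
    adjusted = q_map.copy()
    for _ in range(steps):
        # gradient step on (Q(s, a) - 0)^2 at the failed action only
        adjusted[action] -= lr * (adjusted[action] - 0.0)
    return adjusted, snapshot
```

The adjusted model makes the failed action much less likely to be re-selected greedily, while the snapshot guarantees the temporary updates do not accumulate across the evaluation run.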

All runs are evaluated by freezing the corresponding model and executing 100 greedy (or near-greedy) test grasps on each of the easy-object test set (Fig. 9b) and the hard-object test set (Fig. 9c).

7.1.5 Model details

For all methods, model weights are initialized randomly with an independent seed prior to training on the robot. No experiences from simulation are used, i.e., we train from scratch. For our depth algorithm, the \(q_1\) network is defined using \(D_4\)-equivariant layers and the \(q_2\) network using \(C_{16}/C_2\)-equivariant layers. For our RGB algorithm, the \(q_2\) network is defined using \(D_{16}/D_2\)-equivariant layers. During training, we use Boltzmann exploration with a temperature of 0.01; during testing, the temperature is reduced to 0.002 (near-greedy). For more details, see “Appendix G”.

Table 1 Evaluation success rate (\(\%\)), standard error, and training time per grasp \(t_\textit{SGD}\) (in seconds) in the hardware experiments
Table 2 Evaluation success rate (\(\%\)), standard error, and capture time per grasp (in seconds) in the hardware experiments of transparent objects

7.2 Opaque object grasping

7.2.1 Objects

For opaque object grasping, all training happens using the 15 objects shown in Fig. 9a. After training, we evaluate grasp performance on both the “easy” test objects (Fig. 9b) and the “hard” test objects (Fig. 9c). Note that both test sets are novel with respect to the training set.

7.2.2 Baselines

In our robot experiments for opaque object grasping, we compare our method against \({8\times \,\text {RAD VPG}}\) (Zeng et al., 2018; Laskin et al., 2020a) and \({8\times \,\text {RAD FC-GQ-CNN}}\) (Satish et al., 2019; Laskin et al., 2020a), the two baselines we found to perform best in simulation. As before, \({8\times \,\text {RAD VPG}}\) uses a fully convolutional network (FCN) with a single output channel; the Q-map for each gripper orientation is calculated by rotating the input image. After each grasp, we perform \(8\times \) RAD data augmentation (8 optimization steps, each with a mini-batch containing randomly translated and rotated image data). \({8\times \, \text {RAD FC-GQ-CNN}}\) also has an FCN backbone, but with eight output channels corresponding to the gripper orientations; it uses \(8\times \) RAD data augmentation as well. All exploration is the same as in simulation except that the \(\epsilon \)-greedy schedule anneals from 50 to \(10\%\) over 200 steps rather than 500.

Fig. 7

Learning curves for a the opaque object grasping and b the transparent object grasping hardware experiments; the parentheses indicate the tray color. All curves are averaged over 4 runs with different random seeds and random object placements. Each data point is the average grasp success over the last 60 grasp attempts. Shading denotes standard error

Fig. 8

Illustrations of transparent object grasping. a Shows the depth observation; notice that most transparent objects are not sensed. b Shows the same scene in the RGB modality; notice that objects are sensed up to orthographic projection error. c Shows the \(Q_1\) map for the observation in (b), with the selected action marked by the red dot. d Shows the executed grasp

7.2.3 Results and discussion

Figure 7a shows the learning curves for the three methods on opaque object grasping. An important observation is that the results from training on the physical robot (Fig. 7a) match the simulation training results (Fig. 4a). Figure 7a shows that for opaque object grasping, our method achieves a success rate of \(>90\%\) after 600 grasp attempts, while the baselines remain near \(70\%\). Table 1 shows the test performance and demonstrates that our method also significantly outperforms the baselines during testing. However, performance is lower on the “hard” test set; we hypothesize that this is due to a lack of sufficient diversity in the training set. Finally, since each of these 600-grasp training runs takes approximately 1.5 h, these methods could feasibly learn directly from real-world data, thus avoiding the simulation-to-real gap and adapting to physical changes in the robot.

7.3 Transparent object grasping

7.3.1 Objects

For transparent object grasping, all training happens using the 15 objects shown in Fig. 10a. After training, we evaluate grasp performance on the in-distribution training objects and on out-of-distribution test objects (Fig. 10a, b). Note that the test set is novel with respect to the training set.

7.3.2 RGB modality details

The top-down RGB images are orthographic projections of noisy RGB point clouds. RGB values are normalized to fit a Gaussian distribution \(\mathcal {N}(0, 0.01)\). During training, the brightness of each mini-batch of images is augmented by a random factor in the range (0.9, 1.1).
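These two preprocessing steps can be sketched as follows (the exact normalization procedure is our reading of the description above):

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize_rgb(img, std=0.1):
    """Standardize an RGB image to zero mean, then scale to standard
    deviation `std` = 0.1, so that values fit N(0, 0.01)."""
    img = img - img.mean()
    return img / (img.std() + 1e-8) * std

def jitter_brightness(img, low=0.9, high=1.1):
    """Scale the entire image by a single random brightness factor."""
    return img * rng.uniform(low, high)
```

Brightness jitter multiplies the whole image by one shared factor, so relative contrast between the transparent objects and the black tray is preserved.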

7.4 Baselines

For transparent object grasping, we compare \({\text {Ours RGB (black)}}\), \({\text {Ours RGB (white)}}\), and \({\text {Ours D}}\). \({\text {Ours RGB (black)}}\) is our proposed method; \({\text {Ours RGB (white)}}\) replaces the black trays with white trays (see Fig. 18). \({\text {Ours D}}\) is our depth-based model, the best performer in the physical opaque object grasping experiments; here we train and test it on transparent objects.

7.5 Results and discussion

Figure 7b shows the learning curves for transparent object grasping and Table 2 shows the grasping performance of \({\text {ours RGB}}\) (black) on the transparent object set. Figure 8 shows a transparent object grasping process during evaluation.

Figure 7b shows that for transparent object grasping, \({\text {ours RGB}}\) (black) produces a significant (\(>40\%\)) improvement in success rate over \({\text {ours depth}}\), suggesting that RGB information is needed to perform well in this setting. The background color is also important: changing the tray from black to white leads to a \(>30\%\) decrease in success rate. Table 2 shows that \({\text {ours RGB}}\) achieves a \(> 80\%\) success rate on both the training and test sets, indicating that the method learns a good grasp function on the in-distribution object set and generalizes to novel transparent objects. Notice also that, from a computational perspective, our method is significantly cheaper (1.8 s per grasp) than NeRF-based approaches to transparent object grasping such as EVO-NeRF (Kerr et al., 2022) and Dex-NeRF (Ichnowski et al., 2021), which take 9.5 s and 16 s, respectively (Figs. 9, 10).

Fig. 9

Object sets used for training and testing. Both the training set and the easy test set include 15 objects, while the hard test set has 20 objects. Objects were curated so that they are graspable by the Robotiq 2F-85 parallel-jaw gripper from any configuration and visible to the Occipital Structure Sensor

Fig. 10

Transparent object sets used for training and testing. The training set includes 15 objects and the test set includes 10 objects. Objects were curated so that they are graspable by the Robotiq 2F-85 parallel-jaw gripper from any configuration

8 Conclusions and limitations

Our main contribution is to recognize that planar grasp detection is \(\textrm{SE}(2)\)-equivariant and to leverage this structure using an \(\textrm{SO}(2)\)-equivariant model architecture. This model is significantly more sample efficient than other grasp-learning methods and can learn a good grasp function with fewer than 600 grasp samples. This increase in sample efficiency enables us to learn to grasp on a physical robotic system in a practical amount of time (approximately an hour and a half) without relying on any sort of pretraining or semi-supervised learning. Training completely on physical systems is important for two main reasons. First, it eliminates the need to train in simulation, thereby avoiding the sim2real gap. Second, it enables our system to adapt to changes or idiosyncrasies of the robot hardware or the physical environment that would be hard to simulate.

A key limitation in both simulation and the real world is that, despite the fast learning rate, grasp success rates after training still seem to be limited to the low-to-mid \(90\%\) range for opaque objects and \(80\%\) for transparent objects. This matches the success rates of other grasp detection methods (Mahler et al., 2017; ten Pas et al., 2017b; Mousavian et al., 2019), but it is disappointing here because one might expect faster adaptation to ultimately lead to better grasp performance. This could simply indicate the complexity of the grasp function to be learned, or it could be a result of stochasticity in the simulator and on the real robot. Another limitation of our work is that it is restricted to an open-loop control setting in which the model infers only a goal pose. Nevertheless, we expect that the equivariant model structure leveraged in this paper would also be useful in the closed-loop setting, as suggested by the results of Wang et al. (2022).