Humans can look at a novel object and almost immediately determine how to pick it up. The capabilities of robots lag far behind. Robotic grasping and manipulation is a critical challenge [1]. Researchers have worked for decades on creating cognitive robots that operate at the same level of dexterity as humans. Despite the interest in research and industry, it remains an unsolved problem [2] [3].

Shorter product lifecycles and the steadily rising demand for customization require more flexible and changeable production systems, leading to the need for an automatic configuration (Plug & Produce) of robot systems [4]. Developing robots that can operate in dynamic and unstructured environments (e.g., bin-picking, household or everyday environments, professional services) is of great interest. Many approaches to robotic grasping use learning-based methods to configure themselves automatically for the given task without human intervention, which significantly reduces programming effort [5]. Machine learning in particular is a promising approach to robotic grasping because of its ability to generalize to novel objects.

This article aims to provide a comprehensive overview of different approaches to robotic grasping. It proposes a categorization of the methods and introduces various techniques for grasping and for sim-to-real transfer, the latter motivated by the lack of real-world data.

Categorization of Methods

Approaches to vision-based robotic grasping can be categorized along multiple different criteria. Generally speaking, approaches can be divided into analytic or data-driven methods [6, 7]. Analytic (sometimes called geometric) approaches typically analyze the shape of a target object to identify a suitable grasp pose. Data-driven (sometimes called empirical) approaches are based on machine learning and have gained popularity in recent years. They have made significant progress due to increased data availability, better computational resources, and algorithmic improvements. This review article focuses on learning-based approaches to robotic grasping and manipulation. For analytic grasping approaches, we refer the readers to [7,8,9].

Furthermore, approaches can be categorized as model-based or model-free, depending on whether or not specific knowledge about the object (e.g., CAD model or previously scanned model [10]) is used to solve the considered task. They can further be differentiated on whether they are focused on grasping and manipulating rigid, articulated, or flexible/deformable objects and whether the method is able to handle known, familiar, or unknown objects [6]. Figure 1 gives an overview of typical pipelines to robotic grasping. Model-based approaches for known rigid objects typically include a pose estimation step and allow a precise placement of the object. Model-free approaches directly propose grasp candidates and typically aim for a generalization to novel objects.

Fig. 1

Typical pipelines to robotic grasping: Model-based approaches (top row) typically estimate the object pose, determine a suitable grasp pose on the object, plan a path, and finally execute the grasp. Model-free approaches (bottom row) directly determine grasp poses based on the observations given from the sensor. When being trained in simulation, sim-to-real techniques are needed for a robust transfer. This review article discusses the green elements of the figure

An additional criterion is the type of machine learning, i.e., whether the system is trained using supervised learning (SL) or reinforcement learning (RL) [11]. Annotations can be provided by humans or obtained in a self-supervised manner, i.e., the labels are generated automatically. Approaches typically either sample grasp candidates and rank them using a neural network (discriminative approaches) [12, 13] or directly generate suitable grasp poses (generative approaches) [14, 15]. Furthermore, approaches differ on whether they are trained in a simulation environment, in the real world, or both and utilize various kinds of sensor data (RGB image, depth image, RGB-D image, point cloud, potentially multiple sensors, …). Moreover, methods either operate in an open-loop (i.e., without any feedback) or closed-loop fashion [3, 16, 17]. Using continuous feedback based on visual features is commonly referred to as visual servoing [17]. Besides the robot hardware, the gripper type (two-finger gripper, suction gripper, …) and gripper freedom (4D, 6D, …) also differentiate approaches. Moreover, some approaches focus on grasping of single separated objects only, while others target grasping in dense clutter. Furthermore, some methods are able to perform pre-grasp manipulations in order to move the object into a better configuration for grasping. Table 1 provides an overview of the discussed approaches and shows a small and exemplary selection from the variety of methods available in the literature. In addition to the above-mentioned criteria, the reported grasp success rate is indicated; note that these rates were determined on different benchmarks and are therefore not directly comparable.

Table 1 Overview of the discussed approaches to robotic grasping and manipulation (selection)

Object Pose Estimation for Robotic Grasping

Model-based robotic grasping can be considered as a three-stage process where first object poses are estimated, then a grasp pose is determined, and finally a collision-free and kinematically feasible path is planned towards the object to pick it [34, 35]. This section focuses on the first part, which has the goal to estimate the translation and rotation relative to a given reference frame (usually the camera) of potentially multiple objects in the scene. This task is challenging because of sensor noise, varying lighting conditions, clutter and occlusions, and the variety of objects in the real world. Furthermore, object symmetries result in pose ambiguities which have to be addressed, because a symmetric object yields identical observations for several different pose annotations [36,37,38,39]. For learning-based approaches on the second part, we refer the readers to [40].
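As a minimal sketch of why symmetries matter, the following Python snippet compares a naive rotation error with one that takes a known symmetry group into account. The helper names, the two-fold symmetry about the z-axis, and the example angles are illustrative assumptions and are not taken from the cited works.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(angle):
    """Rotation about the z-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rotation_angle(r_a, r_b):
    """Geodesic angle (radians) between two rotation matrices."""
    rel = matmul(transpose(r_a), r_b)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)

def symmetry_aware_error(r_est, r_gt, symmetries):
    """Smallest error over all ground-truth poses equivalent under symmetry."""
    return min(rotation_angle(r_est, matmul(r_gt, s)) for s in symmetries)

# Hypothetical object that looks identical after a 180 degree turn about z.
symmetries = [rot_z(0.0), rot_z(math.pi)]
r_gt = rot_z(0.1)
r_est = rot_z(0.1 + math.pi)

naive_error = rotation_angle(r_est, r_gt)          # close to pi
aware_error = symmetry_aware_error(r_est, r_gt, symmetries)  # close to 0
```

The estimate is indistinguishable from the annotation for this object, yet the naive error is maximal; a symmetry-aware error avoids penalizing such predictions.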

When utilizing object-specific knowledge, approaches typically require an object-specific configuration (high amount of manual tuning) until a satisfactory system performance is reached which limits the scalability to novel objects [5]. More specifically, parameters for the template or feature matching method for pose estimation [41, 42] or the definition of robust grasp poses together with (static) priorities are required [35] and have to be tuned in real-world experiments. Therefore, model-based approaches aim for an automatic configuration with minimal user input and without any tuning that has to be done by experts to allow a fast and easy transfer to novel objects.

Utilizing the strength of supervised learning for 6D object pose estimation requires large amounts of labeled data for training. Creating and annotating datasets with 6D poses is very tedious, time-consuming, and does not scale [43]. Thus, it is a trend to train models on synthetic data because simulations are an abundant source of data and flawless ground truth annotations are automatically available (see also “Simulations” section). Transfer techniques are used for deployment to the real world (see also “Techniques for Sim-to-Real Transfer” section) [18, 20•].

In recent years, research in 6D object pose estimation has been dominated by approaches based on convolutional neural networks (CNNs). Approaches typically either discretize the pose space in bins and predict a class [44, 45] or solve pose estimation in terms of a regression task [19, 20•, 46]. DOPE [18] uses a deep neural network to process an RGB image, outputs the 2D image coordinates of the 3D bounding box of the objects, and uses a PnP algorithm [47] to estimate the 6D pose of each instance. The model is trained entirely on synthetic data while for the transfer from simulation to the real world, DOPE employs a combination of domain randomization [48••] and photorealistic rendering. The authors further demonstrate that the pose estimator trained on synthetic data can operate in real-world grasping systems with sufficient accuracy.
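The difference between the classification and the regression formulation can be illustrated for a single rotation angle. The sketch below is a simplification under stated assumptions: real methods discretize the full pose space, and the function names and the choice of 36 bins are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi

def angle_to_bin(angle, num_bins):
    """Class label for an angle: index of the bin that contains it."""
    width = TWO_PI / num_bins
    return int((angle % TWO_PI) / width) % num_bins

def bin_to_angle(index, num_bins):
    """Angle represented by a class label: the center of its bin."""
    width = TWO_PI / num_bins
    return (index + 0.5) * width

num_bins = 36                      # assumed resolution: 10 degrees per bin
true_angle = 0.5                   # radians
label = angle_to_bin(true_angle, num_bins)
decoded = bin_to_angle(label, num_bins)
quantization_error = abs(decoded - true_angle)
```

A classifier can at best recover the bin center, so its error is bounded by half the bin width, whereas a regression output is continuous and has no such quantization floor.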

Pose estimation challenges [49, 50] and standard benchmarking systems [51] for pose estimation allow advancing the state of the art and enable a transparent and fair comparison of different approaches. In particular, the robust pose estimation of multiple objects in bulk is a great challenge and of major importance. Such scenes, which are common in industrial bin-picking, are challenging due to a high amount of clutter and occlusion as visualized in Fig. 2. A challenge focusing on 6D object pose estimation for bin-picking [49] was organized at IROS 2019 and utilized a large-scale dataset [43] comprising fully 6D pose-annotated synthetic and real-world scenes. For evaluation, the metric from Brégier et al. [36, 37] was used, which properly accounts for object symmetries and considers objects with visibility of more than 50%.

Fig. 2

Cluttered scene for bin-picking

In general, learning-based approaches have proven to be robust to occlusions due to learning plausible object pose configurations [49]. PPR-Net [19], the winning method of the aforementioned challenge, operates on point clouds and utilizes PointNet++ [52] to estimate a 6D pose for each point of the point cloud. It then applies clustering in 6D space and computes the final pose hypotheses by averaging each identified cluster. The approach is outperformed by OP-Net [20•] in terms of average precision on the noisy Siléane dataset [36]. Furthermore, OP-Net is much faster than PPR-Net because it provides a much more compact parameterization of the output and does not require post-processing. OP-Net discretizes the 3D space of the scene and regresses a pose and confidence for each resulting volume element.
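The idea of aggregating per-point predictions into a few pose hypotheses can be sketched as follows. This is a simplified illustration and not the PPR-Net implementation: it clusters only 3D position votes instead of full 6D poses, and the greedy radius-based clustering, the function names, and the example values are assumptions.

```python
import math

def mean_point(points):
    """Component-wise mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def cluster_votes(votes, radius):
    """Greedily group votes that lie within `radius` of a cluster centroid
    and return one averaged hypothesis per cluster."""
    clusters = []
    for vote in votes:
        for members in clusters:
            if math.dist(vote, mean_point(members)) <= radius:
                members.append(vote)
                break
        else:
            clusters.append([vote])
    return [mean_point(members) for members in clusters]

# Hypothetical per-point predictions of object centers (in meters).
votes = [(0.00, 0.0, 0.0), (0.01, 0.0, 0.0),
         (1.00, 1.0, 1.0), (1.02, 1.0, 1.0)]
hypotheses = cluster_votes(votes, radius=0.1)
```

Each point of an object votes for roughly the same pose, so averaging within a cluster suppresses the noise of individual predictions.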

A major advantage of learning-based object pose estimators is that they do not require a manual parameter tuning for the configuration of new objects [41, 42]. Furthermore, they can be entirely trained on synthetic data, which can easily be obtained using a physics simulation by dropping objects in a random position and orientation above a bin in the case of bin-picking [43] or by placing (household) objects in virtual scenes [18].
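Generating such scenes requires sampling a random start pose for each dropped object. The sketch below draws a position above the bin and a uniformly distributed orientation as a unit quaternion; the bin size, the drop height, and all names are hypothetical, and the physics simulation that lets the object fall is not shown.

```python
import math
import random

def random_unit_quaternion(rng):
    """Uniformly distributed random rotation as a unit quaternion (x, y, z, w)."""
    u1, u2, u3 = rng.random(), rng.random(), rng.random()
    a, b = math.sqrt(1.0 - u1), math.sqrt(u1)
    return (a * math.sin(2.0 * math.pi * u2),
            a * math.cos(2.0 * math.pi * u2),
            b * math.sin(2.0 * math.pi * u3),
            b * math.cos(2.0 * math.pi * u3))

def sample_drop_pose(rng, bin_half_extent=0.2, drop_height=0.5):
    """Random pose above the bin from which an object is dropped."""
    position = (rng.uniform(-bin_half_extent, bin_half_extent),
                rng.uniform(-bin_half_extent, bin_half_extent),
                drop_height)
    return position, random_unit_quaternion(rng)

rng = random.Random(0)              # fixed seed for reproducible scenes
position, orientation = sample_drop_pose(rng)
```

Because the simulator knows the final resting pose of every object, the 6D annotations come for free once the scene has settled.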

Model-Free Robotic Grasping

Model-free approaches are attractive due to their ability to generalize to unseen objects [53] and represent a dominant direction in robotic grasping research. They do not use prior knowledge about the objects and therefore work without a pose estimation step, which is in contrast to the approaches discussed in the “Object Pose Estimation for Robotic Grasping” section. Approaches often show promising results in terms of generalization ability to novel objects, and models are usually trained in an end-to-end fashion. Placing the objects after picking is usually not considered, and the type of object being picked is unknown.

Supervised Learning for Robotic Grasping

Supervised learning is concerned with learning a (non-linear) mapping based on labeled training data. In this section, we categorize the approaches as discriminative or generative depending on whether the grasp configuration is the input or the output of the model.

Discriminative Approaches

Discriminative approaches sample grasp candidates (e.g., using CEM [54]) and rank them using a neural network. For grasp execution, the robot chooses the grasp with the highest score. These approaches typically have a high runtime because they require multiple forward passes of the neural network to get high-quality grasps. Nonetheless, these approaches come with the advantage that arbitrarily many grasp poses can be evaluated, and they are not limited by a discretization of the grasping primitives or output space. Furthermore, a gradient-based refinement process can be applied to improve the grasp success rate [32•].
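The sample-and-rank loop can be sketched with the cross-entropy method. In the snippet below, `toy_grasp_score` is a hypothetical stand-in for the learned network, a grasp is reduced to a 2D position, and all hyperparameters are assumptions chosen for illustration.

```python
import random
import statistics

def toy_grasp_score(x, y):
    """Hypothetical stand-in for a learned grasp quality network."""
    return -((x - 0.3) ** 2 + (y + 0.2) ** 2)

def cem_best_grasp(score_fn, iterations=5, samples=64, elites=8, seed=0):
    """Sample candidates, rank them by score, refit the sampling
    distribution to the best ones, and return the best candidate seen."""
    rng = random.Random(seed)
    mean, std = [0.0, 0.0], [0.5, 0.5]
    best = None
    for _ in range(iterations):
        candidates = [(rng.gauss(mean[0], std[0]), rng.gauss(mean[1], std[1]))
                      for _ in range(samples)]
        ranked = sorted(candidates, key=lambda c: score_fn(*c), reverse=True)
        elite = ranked[:elites]
        if best is None or score_fn(*elite[0]) > score_fn(*best):
            best = elite[0]
        mean = [statistics.fmean(c[i] for c in elite) for i in range(2)]
        std = [max(statistics.pstdev([c[i] for c in elite]), 1e-3)
               for i in range(2)]
    return best

best_grasp = cem_best_grasp(toy_grasp_score)
```

Every candidate costs one evaluation of the scoring function, which is why the runtime grows with the number of samples and iterations.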

Levine et al. [24] proposed a learning-based approach to hand-eye coordination for robotic grasping based on RGB images. In their work, they used up to 14 robots to collect success labels for 800,000 grasps in 2 months. The trained convolutional neural network can predict the grasp success for a given candidate based on an RGB image of the bin and is used to servo the gripper towards successful grasps. While this approach demonstrates the potential of learning-based approaches to robotic grasping, changes in the hardware setup require the collection of new data for retraining the system.

Dex-Net [12, 26] uses a physics simulation to grasp objects in randomized poses on a plane. The outcome of each grasp is logged together with an aligned crop of the depth image at the grasp location, which together form one sample of the dataset. Their Grasp Quality Convolutional Neural Network (GQ-CNN) is trained on that dataset. The trained model can predict the grasp success for given grasp candidates and depth images and generalizes to different rigid, articulated, or flexible objects unseen during training. The Dex-Net framework has been extended to suction grippers [13] and a dual-arm robot [27] where the policy infers whether to use a parallel jaw or suction gripper for emptying a cluttered bin. Furthermore, a fully convolutional network architecture generating grasps has been proposed to avoid an expensive sampling and ranking of grasp candidates [28].

Generative Approaches

Generative approaches output a grasp configuration. One approach to this—called robotic grasp detection—is to detect oriented rectangles [55] in the image plane, which represent promising grasp candidates for parallel jaw grippers. This parameterization comprises the position, orientation, and opening width of the gripper as visualized in Fig. 3. The problem of robotic grasp detection is analogous to object detection [56,57,58] in computer vision with the only difference being an added term for the gripper orientation.

Fig. 3

Parameterization for robotic grasp detection: Two values for the position, two for the size, and one for the orientation of the oriented rectangle. Red sides indicate the jaws of the gripper and blue the opening width
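The five-value parameterization can be converted into the four corner points of the oriented rectangle. In this sketch, `width` is assumed to be the gripper opening and `height` the jaw size; the function name and this assignment are illustrative choices.

```python
import math

def grasp_rectangle_corners(x, y, width, height, theta):
    """Corners of an oriented grasp rectangle in image coordinates.

    (x, y) is the center, `width` the assumed gripper opening,
    `height` the assumed jaw size, and `theta` the rotation in radians.
    """
    c, s = math.cos(theta), math.sin(theta)
    half_w, half_h = width / 2.0, height / 2.0
    local = [(-half_w, -half_h), (half_w, -half_h),
             (half_w, half_h), (-half_w, half_h)]
    # Rotate each local corner by theta and shift it to the center.
    return [(x + c * dx - s * dy, y + s * dx + c * dy) for dx, dy in local]

axis_aligned = grasp_rectangle_corners(10.0, 20.0, 4.0, 2.0, 0.0)
rotated = grasp_rectangle_corners(10.0, 20.0, 4.0, 2.0, math.pi / 2.0)
```

With `theta = 0` the rectangle reduces to an ordinary axis-aligned bounding box, which makes the relation to object detection explicit.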

For the scenario where a single object is placed on a plane surface, Redmon et al. [14] proposed a system called SingleGrasp which can predict an oriented rectangle and simultaneously classify the object for a given RGB-D image using a neural network. Since an object can be grasped in multiple different ways, they also introduced MultiGrasp, which can predict multiple grasp poses per image. This approach led to the You Only Look Once (YOLO) [56, 57] approach for object detection. Lenz et al. [21] proposed a learning-based two-stage system that samples candidates and ranks them using a second neural network. In their work, they demonstrated that their approach can be used for real-world robotic grasping tasks. An increased performance is obtained by utilizing more sophisticated network architectures [3].

A public dataset for robotic grasp detection is the Cornell grasping dataset [59] which comprises 1035 images from 280 objects with human annotated grasps. Due to the low number of samples, the dataset has been heavily augmented for good performance [14]. The Jacquard dataset [60] comprises over 50,000 synthetic samples of more than 11,000 objects with grasps obtained from grasping trials in simulation and enables better generalization due to the increased diversity.

Utilizing these public datasets, GG-CNN [15, 22] outputs a grasp configuration together with a quality estimate for each pixel in the image using a small fully convolutional architecture. Due to its low computational demands, the approach can be used for closed-loop grasping in dynamic environments. Furthermore, this approach can grasp in clutter although the model is trained on single isolated objects only, because convolution is a local operation.
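Given per-pixel outputs, selecting the grasp to execute reduces to finding the pixel with the highest quality estimate. The following sketch assumes three equally sized maps (quality, angle, gripper width) stored as nested lists; the map names and the example values are hypothetical.

```python
def best_pixel_grasp(quality, angle, width):
    """Return the grasp stored at the pixel with the highest quality."""
    best = None
    for row in range(len(quality)):
        for col in range(len(quality[row])):
            if best is None or quality[row][col] > quality[best[0]][best[1]]:
                best = (row, col)
    row, col = best
    return {"row": row, "col": col,
            "quality": quality[row][col],
            "angle": angle[row][col],
            "width": width[row][col]}

# Hypothetical 2x3 network outputs.
quality_map = [[0.1, 0.7, 0.2],
               [0.3, 0.9, 0.4]]
angle_map = [[0.0, 0.5, 1.0],
             [-0.5, 0.25, 0.75]]
width_map = [[30.0, 40.0, 35.0],
             [25.0, 45.0, 20.0]]

grasp = best_pixel_grasp(quality_map, angle_map, width_map)
```

Since all maps come from a single forward pass, this selection can be repeated for every new camera frame in a closed-loop setting.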

TossingBot [30] learns to throw arbitrary objects to given target locations, which increases the physical reachability of a robot arm. The authors propose an end-to-end formulation that jointly learns to infer control parameters for grasping and throwing from images of objects in a bin by trial and error. As a result, the system learns to select grasps that lead to predictable throws through self-supervision. The problem of throwing is simplified to predicting only the release velocity. The release velocity is estimated using a physics-based controller and adjusted based on the residual estimate of the neural network.
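The combination of a physics-based estimate and a learned correction can be sketched with ideal projectile motion. The snippet below is an illustration under stated assumptions: it ignores air drag, fixes the release angle at 45 degrees, and treats the residual as a plain number instead of a network output; the function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def ballistic_release_speed(distance, height_diff=0.0,
                            release_angle=math.radians(45.0)):
    """Release speed that lets an ideal projectile land at `distance`.

    `height_diff` is the target height minus the release height.
    """
    denom = 2.0 * math.cos(release_angle) ** 2 * (
        distance * math.tan(release_angle) - height_diff)
    if denom <= 0.0:
        raise ValueError("target not reachable with this release angle")
    return math.sqrt(G * distance ** 2 / denom)

def release_speed(distance, residual, height_diff=0.0):
    """Physics-based estimate plus a (learned) residual correction."""
    return ballistic_release_speed(distance, height_diff) + residual

physics_only = ballistic_release_speed(1.0)
corrected = release_speed(1.0, residual=0.15)
```

The physics term already gives a reasonable throw for any target, so the learned part only has to capture object-specific effects such as drag or an off-center grasp.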

Generative approaches are fast because they require one forward pass only. They usually provide multiple grasp candidates simultaneously and the highest quality grasp is executed by the robot.

Reinforcement Learning for Robotic Grasping and Manipulation

Deep reinforcement learning has emerged as a promising and powerful technique to automatically acquire control policies by trial and error. By processing raw sensory inputs, such as images, complex behaviors can be performed.

Pre-grasp manipulations such as pushing or shifting [61, 62] are also of major importance to rearrange cluttered objects and ensure that the objects can be grasped at all or more robustly. Using reinforcement learning, the trained policies also demonstrate generalization to novel objects [61, 62].

A comparison of a variety of methods based on deep reinforcement learning on grasping tasks is provided in [63]. QT-Opt [29••] demonstrates a rich set of manipulation strategies and responds dynamically to disturbances and perturbations. The robot observes a reward of 1 for successfully lifting an object and 0 for a failed grasp. Their closed-loop vision-based control framework operates in a similar setup as in [24, 25•, 64•] and reports a grasp success rate of 96% on unseen objects by optimizing long-horizon grasp success with a total of about 800 robot hours collected within 4 months and across 7 robots.
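The sparse binary reward and its use in a value-based method can be sketched with a generic Q-learning target. This is a textbook formulation shown for illustration, not the training procedure of QT-Opt; the function names, the discount factor, and the example values are assumptions.

```python
def grasp_reward(object_lifted):
    """Sparse reward: 1 for a successful lift, 0 for a failed grasp."""
    return 1.0 if object_lifted else 0.0

def q_target(reward, episode_done, next_q_values, discount=0.9):
    """Generic Q-learning target for one transition.

    `next_q_values` holds the Q-values of candidate actions in the
    next state; the best one is bootstrapped unless the episode ended.
    """
    if episode_done:
        return reward
    return reward + discount * max(next_q_values)

terminal_target = q_target(grasp_reward(True), True, [])
intermediate_target = q_target(grasp_reward(False), False, [0.2, 0.8])
```

Intermediate steps receive no reward of their own; their value comes entirely from the bootstrapped estimate of eventual grasp success.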

“Grasping in the Wild” [33••] allows a closed-loop 6D grasping of novel objects based on human demonstrations and can operate in dynamic scenes with moving objects, up to some speed constraint.

Simulations and Sim-to-Real Transfer

Despite all advantages w.r.t. performance and robustness, deep learning has the disadvantage of requiring large amounts of data for training. This is especially problematic in robotics, where the generation of training data on real-world systems can be expensive and time-consuming. For instance, Pinto et al. [23] trained a robot to grasp novel objects by collecting 50,000 trials in more than 700 h, Levine et al. [24, 25•] required 800,000 grasps parallelized over 14 robots in 2 months for robust grasping performance, and QT-Opt [29••] collected over 580,000 grasps within the course of several months across 7 robots. Additionally, these systems are not invariant to changes in the hardware setup such as changing the gripper, table height, or moving the camera. To avoid the need to set up “arm farms” for learning robust robotic grasping and manipulation policies, using simulations is an attractive alternative.


Simulations

Commonly used physics simulations are V-REP/CoppeliaSim [65], PyRep [66], MuJoCo [67], Blender [68], and Gazebo [69], to name only a few. Simulations can overcome the aforementioned limitations because they provide an abundant source of data with flawless annotations. Furthermore, simulations are fast and can be parallelized across multiple machines for rapid learning or data generation. Physics simulations allow training the robots without wear and tear of the components and without interrupting production in the field. On the downside, simulations require the explicit programming of the desired application, may incur license costs, and do not perfectly capture the properties of the real world.

Techniques for Sim-to-Real Transfer

Generally, models trained in simulations do not tend to directly transfer well to the real world due to the “reality gap” [64•, 70, 71]. This section discusses different approaches to allow bridging the simulation-to-reality gap. Models can be transferred to the real world by providing better simulations, domain randomization [48••], or domain adaptation [64•, 70, 72, 73].

Domain Randomization

Domain randomization [48••] applies various randomizations to the observations (vision randomization) or the system dynamics (dynamics randomization) such that the real world appears to the model as just another variation. Randomizing various visual aspects of the simulator such as textures and colors of the objects and the background, lighting, object placement including camera placement, and type and amount of noise added to the image forces the network to learn to focus on the essential features of the image (vision randomization). Randomizations can also be applied to the dynamics of the system or environment [71], including gravity, mass of each link in the robot’s body, damping of each joint, pose of the robot base as well as mass, friction, and damping of the manipulated objects (dynamics randomization) for a robust transfer from simulation to the real world.
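In practice, domain randomization amounts to drawing a fresh set of simulator parameters for every training episode. The sketch below samples a few visual and dynamics parameters; the parameter names and ranges are hypothetical and would have to be mapped onto the settings of the simulator in use.

```python
import random

def sample_randomization(rng):
    """Draw one random simulator configuration for a training episode."""
    return {
        # Vision randomization
        "object_rgb": tuple(rng.random() for _ in range(3)),
        "light_intensity": rng.uniform(0.5, 1.5),
        "camera_offset_m": tuple(rng.uniform(-0.02, 0.02) for _ in range(3)),
        "image_noise_std": rng.uniform(0.0, 0.05),
        # Dynamics randomization
        "object_mass_kg": rng.uniform(0.05, 0.5),
        "friction": rng.uniform(0.3, 1.0),
        "joint_damping_scale": rng.uniform(0.8, 1.2),
    }

rng = random.Random(42)             # fixed seed for reproducibility
episode_configs = [sample_randomization(rng) for _ in range(3)]
```

If the ranges are wide enough to cover the real system, the real world becomes one more sample from the training distribution.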

This technique has been successfully used for object localization [48••], segmentation [74], robot control for pick-and-place [75], swing-peg-in-hole [76], opening a cabinet drawer [76], in-hand manipulation [77], one-handed Rubik’s Cube solving [78], precise 6D pose regression in highly cluttered environments [20•], etc. Modifications propose an automatic scheduling of the intensity of the randomization based on the current performance of the system [78] or adapting simulation randomizations by using real-world data to identify distributions that are particularly suited for a successful transfer [76]. Synthesizing millions of random object shapes for training [79] indicates further potentials of this technique for robotic grasping.
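Scheduling the randomization intensity based on performance can be sketched as a simple feedback rule. This is a deliberately simplified illustration and not the algorithm of the cited work; the thresholds, the step size, and the function name are assumptions.

```python
def update_intensity(intensity, success_rate,
                     raise_above=0.8, lower_below=0.5, step=0.1):
    """Widen the randomization when the policy performs well and
    narrow it when the policy struggles; keep the result in [0, 1]."""
    if success_rate >= raise_above:
        intensity += step
    elif success_rate < lower_below:
        intensity -= step
    return min(1.0, max(0.0, intensity))

intensity = 0.0
for success_rate in [0.9, 0.85, 0.4, 0.95]:   # hypothetical evaluations
    intensity = update_intensity(intensity, success_rate)
```

Starting with little randomization and increasing it gradually yields a curriculum in which the task becomes harder only once the policy can cope with the current level.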

Domain Adaptation

Domain adaptation is a process that allows a machine learning model trained with samples from a source domain to generalize to a target domain, which can be achieved by utilizing unlabeled data from the target domain. In sim-to-real transfer, the source domain is (usually) the simulation and the target domain is the real world. Prior work can be grouped into feature-level domain adaptation [80, 81], which focuses on learning domain-invariant features, and pixel-level domain adaptation [70], which focuses on restyling images to bridge the domain gap [16, 64•].

Domain adaptation techniques are usually based on generative adversarial networks (GANs) [82]. With some unlabeled real-world data, those approaches allow a drastic reduction in the number of real-world samples needed. Using a similar system for hand-eye coordination as in [24, 25•], GraspGAN [64•] reduces the number of real-world samples needed to approximately 2% for similar system performance, which allows a faster deployment of the solution in different setups.

Still, these approaches require data from the target domain (i.e., some samples from the real world are needed) which negatively affects scalability. Apart from being hard to train and often yielding fragile training results, the output images from the generator network (refiner) are not perfectly realistic and may include inaccuracies and artifacts.

RCAN [16] translates randomized simulation images into canonical simulation images, which are then used for policy training. The trained system can also translate real-world images into canonical images and consequently allows a sim-to-real transfer of the grasping policy, which is demonstrated by using QT-Opt [29••].


Because many new approaches to pose estimation are evaluated on only a small number of datasets, the Benchmark for 6D Object Pose Estimation (BOP) [51] aims to standardize datasets and thereby improve comparability. Like the “Occluded Object Challenge” [83], SIXD [50], and the “Object Pose Estimation Challenge for Bin-Picking” [49], BOP also organizes challenges for pose estimation.

Challenges focusing on robotic grasping and manipulation [84, 85] are of great value to the research community because they capture and advance the current state of the art in the field. The Amazon Picking/Robotics Challenge [2, 86,87,88,89,90,91] focused on autonomous picking in warehouse scenarios. Still, participation can be difficult because it requires on-site attendance and involves hardware costs. Introducing detailed instructions on how to place the objects for picking [92] allows a comparison of different approaches. In particular, simulation environments allow a benchmarking of grasping and manipulation approaches under reproducible scenarios without hardware costs and are of high importance to measure scientific progress [93].


Learning-based approaches to robotic grasping enable picking of diverse sets of objects and are able to demonstrate high grasping success rates even in cluttered scenes and non-static environments. Machine learning and simulation allow fast and easy deployment due to the automatic configuration of model-based solutions and generalization abilities to novel objects of model-free approaches.

Despite impressive results, robotic grasping and manipulation is not solved. Most of the discussed model-free approaches execute top-down grasps and have a limited flexibility in the gripper orientation. There is only a limited number of works focusing on learning-based approaches to robotic grasping in 6D for single objects [32•, 94,95,96] or in clutter [31, 33••, 34, 97]. Model-free grasping in 6D is receiving increasing attention in research and is especially relevant for picking objects from a cluttered bin [35], from a shelf [10], or for more robust grasps in general.

Usually, the task of the robot is to “grasp anything.” Some works focus on a directed grasping to pick a specific object from a cluttered scene [63, 73, 98]. Model-free approaches do not allow a precise placement of the objects. Instead of simply dropping the picked object, many practical applications require an at least semi-precise or gentle placement of the components, which has been addressed less. While solutions for avoiding the entanglement of objects exist [99, 100], no general solution has been proposed to unhook complex object geometries.