
1 Introduction and Motivation

In the manufacturing industry, a trend towards robot-based automation has been observed for years. From 2013 to 2018, the number of new robot installations increased by an average of 19% per year [1]. Reasons for this are rising quality standards and labor costs, which lead to high competitive pressure. At the same time, economical automation is becoming increasingly difficult as product life cycles become shorter and batch sizes smaller. This places high demands on the flexibility and autonomy of the technologies required to handle this variety [2]. Especially in assembly, the degree of automation is often still very low, as the generation of variants is usually shifted as far back in the value chain as possible to final assembly in order to minimize its impact. Therefore, a key challenge is the flexible interaction of the robot with its environment. It must be able to handle a wide range of components, which are often fed in unknown positions and orientations, with an economically viable implementation effort. In recent years, the robotics and computer vision community has contributed a wide range of different approaches to solve the grasping problem. Analytical approaches consider kinematic and dynamic formulations in grasp synthesis [3]. However, these approaches are characterized by high computational complexity and do not generalize well to unknown objects, which is why ML-based methods have emerged as promising alternatives [4].

The aim of this paper is to provide an overview of these current approaches in research and to highlight the remaining challenges for their use in production, especially in assembly: In Sect. 2, the state of the art of ML-based grasping approaches in research is presented. Section 3 analyzes the production requirements for the application, followed by the presentation of a derived approach for integrating grasping into the digital process chain of assembly in Sect. 4. Finally, Sect. 5 identifies the gaps that need to be closed in order to implement an integration that meets these requirements.

2 State of the Art

Sensor-based perception of the environment is a fundamental capability of a smart robot. In this regard, autonomous or partially autonomous grasping based on vision systems is one of the sub-disciplines of robotics that can contribute greatly towards improving the flexibility of robotic applications. According to Kumra et al., the vision-based grasping process can be seen as a sequence of three sub-steps: grasp detection, trajectory planning and execution of the grasp [5]. This paper will mainly focus on the first step of this sequence, which in turn can be divided into the three sub-problems shown in Fig. 1.

Fig. 1: Subtask classification of vision-based grasp detection systems [6]

2.1 Object Localization

The object localization task can be further divided into pure localization and localization including the detection of the object's class. Since manipulation requires spatial knowledge about the object, only 3D localization methods are described here. These methods use an RGB-D camera, which provides depth data in addition to the RGB image [6]. Pure localization can be applied to simple objects such as cubes or cylinders, and Rusu et al. improve this approach by combining shape primitives with triangular meshes in order to map various types of objects [7]. The localization of objects without any restriction regarding their shape is called salient object detection. Many approaches take the pixels of an image as inputs; alternatively, salient feature vectors can be used as inputs to a CNN to learn the combination of different salient features for the recognition of salient objects [8]. The outputs of object detection are the 3D bounding box and the class label of the object, which can either be detected sequentially as in ImVoteNet [9] or at once with a regression method such as 3DSSD [10]. To further refine the position of the object, instance segmentation can be used. The starting point is the bounding box of the object, within which the 3D position of the object is detected. OccuSeg, for example, uses occupancy information to cluster segments despite partial occlusions [11].
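Learned detectors such as ImVoteNet and 3DSSD are usually applied as trained models; purely to illustrate what the localization step has to deliver, the following minimal sketch derives candidate 3D bounding boxes from an RGB-D frame with simple geometric clustering in Open3D. File names, camera intrinsics and clustering thresholds are placeholder assumptions, and the sketch does not reproduce any of the cited methods.

```python
# Minimal geometric localization sketch (not ImVoteNet/3DSSD): RGB-D frame ->
# point cloud -> DBSCAN clusters -> axis-aligned 3D bounding boxes.
import numpy as np
import open3d as o3d

# Placeholder inputs: file paths and camera intrinsics are assumptions.
color = o3d.io.read_image("color.png")
depth = o3d.io.read_image("depth.png")
rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
    color, depth, depth_scale=1000.0, convert_rgb_to_intensity=False)
intrinsic = o3d.camera.PinholeCameraIntrinsic(
    o3d.camera.PinholeCameraIntrinsicParameters.PrimeSenseDefault)
pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)

# Cluster the scene into object candidates; eps/min_points depend on part size.
labels = np.array(pcd.cluster_dbscan(eps=0.02, min_points=50))
boxes = []
for label in range(labels.max() + 1):
    cluster = pcd.select_by_index(np.where(labels == label)[0].tolist())
    boxes.append(cluster.get_axis_aligned_bounding_box())
# 'boxes' now holds one candidate 3D bounding box per detected segment;
# a learned detector would additionally return a class label per box.
```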

2.2 Object Pose Estimation

The second subproblem is object pose estimation, in which the 6D pose of the localized part must be determined. The degrees of freedom to be determined can be reduced by a predefined part feeding. Du et al. cluster the existing methods into three categories [6]. Firstly, there are correspondence-based methods, which search for corresponding feature points between the captured image information and the object to be grasped. Deep learning algorithms can be applied both to 2D RGB images, as in HybridPose [12], and to 3D point clouds, as in 3DMatch [13]. These methods are suitable if the object has rich texture and geometric detail.

The second group of algorithms are the template-based methods. A multitude of templates is labeled with the corresponding 6D poses of the object to be grasped. If 2D images are used as in [14], the 6D problem is reduced to an image retrieval problem, because the captured image is only compared with a known set of 2D images of the object. If, on the other hand, 3D template-based methods are used, the recorded point cloud is compared directly with the 3D model of the object, as in MaskedFusion [15]. In general, template-based methods are particularly suitable if the object has few distinctive textures and geometric details.

The third category are the voting-based methods. Here, the whole image is not analyzed at once; instead, every single 2D pixel or 3D point is considered separately and contributes a vote to the estimation. Voting-based methods can be effective if the objects to be grasped are heavily occluded. On the one hand, there are indirect voting-based methods, in which the image points first vote for higher-level features from which the 6D pose can be derived indirectly, as in YOLOff by Gonzalez et al. [16]. On the other hand, direct voting methods can be used, in which the pixels vote directly for the 6D pose of the object, as in DenseFusion [17].
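The aggregation step of direct voting can be illustrated independently of a specific network: every pixel predicts a pose hypothesis together with a confidence, and the final estimate is taken from these votes, for instance as the most confident hypothesis as in DenseFusion. In the following sketch the per-pixel predictions are random placeholders standing in for network outputs.

```python
# Schematic aggregation step of direct voting: per-pixel pose hypotheses
# (here random placeholders) plus confidences -> final 6D pose estimate.
import numpy as np

n_pixels = 5000
# Each vote: quaternion (4) + translation (3); in practice predicted by a CNN.
rotations = np.random.randn(n_pixels, 4)
rotations /= np.linalg.norm(rotations, axis=1, keepdims=True)
translations = np.random.randn(n_pixels, 3)
confidences = np.random.rand(n_pixels)

# DenseFusion-style selection: take the single most confident per-pixel vote.
best = np.argmax(confidences)
estimated_rotation = rotations[best]
estimated_translation = translations[best]

# Alternative aggregation: confidence-weighted mean of the translation votes,
# which can be more robust when many pixels of the object are visible.
weighted_translation = (confidences[:, None] * translations).sum(0) / confidences.sum()
```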

2.3 Grasp Estimation

The goal of grasp estimation is to find a robust grasp pose. According to Du et al., the algorithms for grasp estimation can be divided into 2D planar grasps and 6D grasps [6]. A 2D planar grasp has two fixed axes of rotation, so that only the height of the plane, the position in the plane and the rotation around the plane normal are determined. The algorithms of both categories follow either analytical or ML-based approaches. Due to their dependence on assumptions (friction, object stiffness, object complexity, etc.), analytical approaches do not generalize well to new objects in practice [18].
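To make the distinction concrete, a 2D planar grasp can be described by a position in the grasp plane, a fixed grasp height, a rotation about the plane normal and a gripper opening width, whereas a 6D grasp carries a full spatial orientation. The following dataclasses are only an illustrative representation and are not taken from the cited works.

```python
# Illustrative representations of the two grasp families discussed above.
from dataclasses import dataclass

@dataclass
class PlanarGrasp:
    x: float        # position in the grasp plane [m]
    y: float
    z: float        # fixed grasp height above the plane [m]
    theta: float    # rotation about the plane normal [rad]
    width: float    # gripper opening width [m]

@dataclass
class Grasp6D:
    position: tuple[float, float, float]            # (x, y, z) in robot base frame [m]
    orientation: tuple[float, float, float, float]  # quaternion (x, y, z, w)
    width: float                                    # gripper opening width [m]
```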

One of the ML-based grasp methods is the Dex-Net project presented by Mahler et al. [19]. The input is a recorded point cloud, which is evaluated by a CNN regarding the grasp quality of all grasp candidates. Zeng et al. perform a pixel-wise evaluation of the grasp affordance for different grasping primitive actions and execute the end-effector position and orientation with the highest affordance. Furthermore, the Form2Fit project [21] not only deals with grasping new objects but also with placing them in the desired position. A trained fully convolutional network (FCN) detects correspondences between the object surface and the shape of the target position.
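The pixel-wise affordance idea can be sketched as follows: for each grasping primitive the network outputs an affordance map over the image, and the primitive, pixel and (discretized) orientation with the highest affordance are executed. The maps below are random placeholders standing in for network outputs, and the map shape is an assumption.

```python
# Selecting the best grasp from pixel-wise affordance maps (schematic).
import numpy as np

n_primitives, n_rotations, h, w = 4, 16, 224, 224
# Placeholder for network output: affordance per primitive, rotation and pixel.
affordance = np.random.rand(n_primitives, n_rotations, h, w)

primitive, rotation, row, col = np.unravel_index(
    np.argmax(affordance), affordance.shape)
angle = rotation * (2 * np.pi / n_rotations)
print(f"Execute primitive {primitive} at pixel ({row}, {col}), angle {angle:.2f} rad")
# The selected pixel is then back-projected to a 3D end-effector position using
# the depth image and the camera intrinsics.
```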

3 Production Requirements on Vision-Based Grasping

In order to evaluate the industrial applicability of today’s algorithms for ML-based grasping, the production requirements for such a system must first be analyzed. In this section, these requirements are grouped into six categories (Fig. 2) in order to identify gaps in the usability and functionality of today’s solutions.

Fig. 2: Clustering of production requirements on vision-based grasping

First, the required performance of the system is derived directly from quality and productivity requirements, which can be translated into the required precision and speed of the grasp detection. The second category is the robustness of the system against external influences such as poor lighting conditions, humidity and a dynamic image background. Another important factor is the components to be grasped: on the one hand, the components themselves, i.e. their variance, dimensions, shape, transparency and surface, have to be considered, and on the other hand the way they are fed to the process. The feeding can vary in the level of order, the degree of occlusion and hooking, as well as the distance between the components. The hardware is required to provide the necessary computing power for executing the algorithms in a cost-effective manner in order to enable profitable operation of the system. The available interfaces of the software as well as the range of compatible hardware, such as robots, grippers and sensors like cameras, have a great influence on the integrability and transferability of the solution. Finally, the required data sets and programming effort should be mentioned, which directly impact the implementation effort and the competence hurdle for the programmer. The number, scope and quality of compatible data sets for training the algorithms in turn have a great influence on the performance of the system. Furthermore, to achieve good industrializability, it must be possible to integrate existing product and process data. Physical component data, functional surfaces and the requirements of subsequent process steps are examples of important parameters when selecting a grasp. In addition, parameters of the equipment, such as force limits and workspaces of robots and grippers, must be taken into account.

Fig. 3: Integration of ML-based grasping approaches in the digital process chain

4 Integration of ML-based Grasping in Assembly Processes

To meet these requirements, the three presented steps of a grasping system need to be embedded into a novel end-to-end system and closely linked to the digital process chain and the corresponding product lifecycle. Such a concept is proposed in Fig. 3. The product lifecycle can be subdivided into engineering, production planning, production, usage and recycling. During the engineering phase, the product is designed and can be broken down into product specifications, drawings and CAD models of each individual part. In production planning, the data of the engineering phase is used to plan the production and especially the process and assembly sequence. The grasping system is implemented in this phase. In the subsequent production, the actual grasping process is carried out. Throughout the entire life cycle, product and process data must be made available in accessible formats via a central digital process chain, which is an important enabler for the seamless integration of engineering data into the robot-based assembly process.

Before the individual components are selected, the overall performance and robustness of the system required to fulfill the task at hand have to be defined. The precision and speed required to assemble the components are determined by the product, while the lighting and background as robustness parameters are given by the environment. This narrows down the suitable algorithms to perform the tasks. Another important factor for selecting the algorithms is defined by the programming requirements: in order to make the system versatile, it should be intuitively operable and have enough autonomy to make supervision by a human operator redundant.
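How such requirement parameters could narrow down the candidate algorithms is sketched below as a machine-readable requirement profile matched against declared capabilities of the algorithm families from Sect. 2. All field names, threshold values and capability entries are illustrative assumptions, not benchmark results.

```python
# Hypothetical requirement profile matched against declared algorithm
# capabilities; field names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class RequirementProfile:
    max_cycle_time_s: float       # speed requirement from productivity
    position_tolerance_mm: float  # precision requirement from product quality
    handles_occlusion: bool       # feeding situation
    handles_low_texture: bool     # component surface properties

candidates = {
    "correspondence_based": {"cycle_time_s": 0.3, "tolerance_mm": 1.0,
                             "occlusion": False, "low_texture": False},
    "template_based":       {"cycle_time_s": 0.5, "tolerance_mm": 1.0,
                             "occlusion": False, "low_texture": True},
    "voting_based":         {"cycle_time_s": 0.8, "tolerance_mm": 2.0,
                             "occlusion": True,  "low_texture": True},
}

def suitable(req: RequirementProfile) -> list[str]:
    """Return the algorithm families whose declared capabilities meet the profile."""
    return [name for name, cap in candidates.items()
            if cap["cycle_time_s"] <= req.max_cycle_time_s
            and cap["tolerance_mm"] <= req.position_tolerance_mm
            and (cap["occlusion"] or not req.handles_occlusion)
            and (cap["low_texture"] or not req.handles_low_texture)]

print(suitable(RequirementProfile(1.0, 2.0, True, True)))  # -> ['voting_based']
```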

The object localization task requires RGB and depth images that correspond to a CAD model. With the ImVoteNet architecture, for example, an object detector is trained on both RGB and depth data to efficiently detect the 3D bounding boxes of the objects as well as their classes [9]. As batch sizes of products continue to shrink, multiple object classes are placed at the assembly station at once. Classification is therefore a crucial step during object localization in order to choose the right object that must be assembled next. Depending on the algorithm, the hardware is chosen based on the required interfaces, the transferability of the system and economic aspects. This is closely connected to the data input coming from the digital process chain. The latter serves as the connection between the product lifecycle and the grasping process and has to deliver the product data, process data and hardware parameters in a format processable by the algorithms.
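Filtering the detections for the class required by the next assembly step could, for example, look like the following sketch; the detection format and the way the assembly sequence is represented are assumptions.

```python
# Pick the detection of the class that the assembly sequence requires next.
# Detection format (class label, confidence, 3D bounding box) is an assumption.
def select_next_part(detections, assembly_sequence, assembled_count):
    """detections: list of dicts with 'label', 'score', 'bbox';
    assembly_sequence: ordered list of class labels from process planning."""
    required_class = assembly_sequence[assembled_count]
    matching = [d for d in detections if d["label"] == required_class]
    if not matching:
        return None  # part not visible -> trigger refeeding or error handling
    return max(matching, key=lambda d: d["score"])

detections = [{"label": "housing", "score": 0.91, "bbox": ...},
              {"label": "cover", "score": 0.87, "bbox": ...}]
next_part = select_next_part(detections, ["housing", "cover"], assembled_count=1)
```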

Based on this first classification and the calculated bounding box, the pose estimation of the object follows. Most objects assembled in production do not have rich texture, which makes correspondence-based methods unsuitable in many cases. For weak texture and geometric detail, template-based methods perform well, while for occlusion, which is common in production, voting-based methods are a good choice. The DenseFusion algorithm uses both RGB images and depth data for the pose estimation of objects, which are fed into the process via the digital process chain [17]. Before estimating the pose of the object, DenseFusion performs an object segmentation on the RGB image to detect the pixels belonging to a specific object. After this step, the RGB and depth data are fused to accurately predict the 6D pose of the desired object. Each pixel of the RGB image votes for a 6D pose, which results in a good estimation even if parts of the object are occluded.
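The segmentation step can be illustrated schematically: the mask predicted for the target object selects the corresponding pixels and depth values, which are then back-projected into a partial point cloud before fusion. The camera intrinsics and array shapes in this sketch are placeholder assumptions.

```python
# Masking the RGB-D data with the segmentation result before pose estimation
# (schematic; the mask would come from a segmentation network).
import numpy as np

h, w = 480, 640
rgb = np.zeros((h, w, 3), dtype=np.uint8)      # placeholder camera frame
depth = np.zeros((h, w), dtype=np.float32)     # placeholder depth in meters
mask = np.zeros((h, w), dtype=bool)            # True where the target object is

object_pixels = rgb[mask]                      # (N, 3) colors of the object
object_depths = depth[mask]                    # (N,) depth values of the object
rows, cols = np.nonzero(mask)

# Back-project the masked pixels to 3D points with assumed camera intrinsics.
fx, fy, cx, cy = 600.0, 600.0, w / 2, h / 2
x = (cols - cx) * object_depths / fx
y = (rows - cy) * object_depths / fy
object_points = np.stack([x, y, object_depths], axis=1)  # (N, 3) partial point cloud
# These per-pixel colors and points are the inputs that are fused for the
# per-pixel 6D pose predictions.
```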

The last step is the selection of grasps based on the object pose. With the ML approach of Dex-Net, object localization and pose estimation do not have to be performed; instead, possible grasps are generated directly from depth data [19]. The biggest drawback of this approach is the lack of object-specific data for grasp generation. In production, the exact grasping location is highly relevant. The functional surfaces, the weight, the center of gravity and the position where the object has to be placed in the assembly are known from the engineering phase. These factors are combined in the component requirements, consisting of the parts themselves and their feeding. Using these factors, grasp positions are generated. During production, after the 6D pose of the object has been detected, one grasp is selected. The selection process takes into account the pose of the object, the position of the assembly, the robotic hardware used and the environment in order to avoid collisions while at the same time minimizing the time needed to assemble the object. To make use of Dex-Net’s good grasp selection and at the same time use object-specific data, we used Dex-Net’s grasp selection as a starting point. Based on this selection, we estimated the 6D pose of the object with an iterative closest point algorithm. The advantage of this approach is the good quality of the pre-selection by Dex-Net, followed by an exact pose estimation of the object.
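The ICP refinement mentioned above can be sketched with Open3D by sampling the CAD model from the engineering phase into a point cloud and registering it against the observed points around the grasp region pre-selected by Dex-Net. File names, the initial transform and the correspondence threshold are placeholder assumptions.

```python
# ICP refinement of the object pose against the CAD model (sketch with Open3D).
import numpy as np
import open3d as o3d

# CAD model from the engineering phase, sampled into a point cloud (assumed file).
mesh = o3d.io.read_triangle_mesh("part.stl")
model = mesh.sample_points_uniformly(number_of_points=5000)

# Observed partial point cloud around the grasp region pre-selected by Dex-Net.
observed = o3d.io.read_point_cloud("observed_crop.ply")

# Coarse initial pose, e.g. derived from the grasp candidate; identity here.
init = np.eye(4)
result = o3d.pipelines.registration.registration_icp(
    model, observed, max_correspondence_distance=0.01, init=init,
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint())

object_pose = result.transformation  # 4x4 pose of the CAD model in the camera frame
print("fitness:", result.fitness, "\n", object_pose)
```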

To train the ML algorithms, training data sets generated either synthetically or by physical experiments are necessary. The advantage of synthetic data is its cheap generation and the possibility to include unexpected scenarios, but the differing physical conditions and parameters have to be considered nonetheless. This makes the transfer of the algorithms from simulation to reality a challenging task. Conducting physical experiments to collect the data is more expensive and time-consuming, but the data is closer to reality and can thus lead to more robust solutions [19].
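A common way to narrow this simulation-to-reality gap is domain randomization, i.e. randomly varying lighting, object poses and sensor noise for every rendered sample. The following sketch only shows such a per-sample parameter randomization with assumed ranges; the rendering itself would be performed by a simulator.

```python
# Domain randomization sketch for synthetic training data generation.
# Only the per-sample randomization of scene parameters is shown; the
# rendering would be done by a simulator. Parameter ranges are assumptions.
import numpy as np

rng = np.random.default_rng(42)

def sample_scene_parameters():
    return {
        "light_intensity": rng.uniform(0.3, 1.5),          # relative brightness
        "light_direction": rng.normal(size=3),              # randomized lighting
        "object_yaw_rad": rng.uniform(0.0, 2 * np.pi),      # random part orientation
        "object_xy_m": rng.uniform(-0.1, 0.1, size=2),      # random position in the bin
        "camera_noise_std_m": rng.uniform(0.0005, 0.003),   # simulated depth noise
    }

dataset_params = [sample_scene_parameters() for _ in range(10000)]
# Each parameter set would be rendered to an RGB-D image plus ground-truth
# object pose and grasp labels for training.
```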

5 Current Challenges

The previous section highlighted examples of how an intelligent combination of information from the product life cycle with existing ML approaches can sustainably improve the robustness and performance of grasping systems and thus foster their use in assembly. However, it also becomes clear that it is difficult to compare the existing algorithms on a common ground. This is partly because they sometimes focus on individual steps or combine several steps, and partly because they are tested with different data sets. This makes it difficult to find the optimal combination for an individual application. In order to make this possible, a test framework is required in which the models or a combination of algorithms can be tested against each other in a defined setting, as shown in Fig. 3. In such a framework, the constraints of the environment are set. Since the environmental conditions and specific hardware properties can only be modeled to a limited extent, there must be a defined input stream that, in addition to the input data, also provides reference data for evaluating the result. Based on this, the individual models can then be exchanged or arranged differently until the intended requirements are met. Via defined interfaces, the algorithms can also access information from the product life cycle to improve the overall result.
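One possible structure for such a test framework is sketched below: the three subtasks are hidden behind common interfaces so that different algorithms can be swapped and evaluated against the same reference data. The interface names and the evaluation loop are assumptions, not an existing framework.

```python
# Possible structure of a benchmarking framework: exchangeable modules behind
# common interfaces, evaluated on a fixed input stream with reference data.
from typing import Protocol, Any

class Localizer(Protocol):
    def localize(self, rgb, depth) -> list[Any]: ...        # 3D boxes + classes

class PoseEstimator(Protocol):
    def estimate(self, rgb, depth, detection) -> Any: ...   # 6D pose

class GraspPlanner(Protocol):
    def plan(self, pose, product_data) -> Any: ...          # grasp pose

def benchmark(localizer, pose_estimator, grasp_planner, samples, evaluate):
    """samples: iterable of (rgb, depth, product_data, reference) tuples;
    evaluate: metric comparing the planned grasp with the reference data."""
    scores = []
    for rgb, depth, product_data, reference in samples:
        for detection in localizer.localize(rgb, depth):
            pose = pose_estimator.estimate(rgb, depth, detection)
            grasp = grasp_planner.plan(pose, product_data)
            scores.append(evaluate(grasp, reference))
    return sum(scores) / max(len(scores), 1)
```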

Besides that, the robustness and safety of such systems must be further improved. While robustness to different lighting conditions can be achieved by training with a heterogeneous data set, the problem of reflective surfaces remains even when using stereo camera systems. Strategies must also be developed to continue operating efficiently in the event of a system failure. The system should be able to overcome such errors by having an alternative solution, especially in safety-critical processes, and to learn from its mistakes for a continuous optimization of the solution.

Moreover, it is important to incorporate significantly more process and product knowledge into the decision-making processes of the algorithms. Therefore, there is a demand for research on incorporating domain knowledge into the training process, on transfer learning, and on data augmentation methods for those data types that are particularly relevant for industrial use. On the one hand, the algorithms have to offer appropriate interfaces, and on the other hand, the corresponding data has to be converted into compatible and standardized formats. In general, ML approaches must be considered more in the overall tool and value chain of the process in which they are to be integrated.

Finally, the acceptance and transparency of ML solutions must also be addressed. It is important that ML-based systems shift from current black-box models to comprehensible systems. Explainable AI is an important keyword here; without it, broad industrial use of the algorithms is difficult to achieve.

6 Conclusion and Outlook

In the context of this paper, it became obvious that there are still some challenges to be solved in order to enable ML-based grasping for broad industrial use in assembly. It was shown that the requirements from production are very complex and multilayered. In particular, the parameters influence each other strongly, so that generalization is only possible to a limited extent or only for individual domains. At the same time, it became clear that the described approaches offer advantages over classical, analytical approaches. For example, flexibility was derived from the assembly perspective as a central requirement, which can be achieved much more easily through ML.

However, it also became apparent that chaining different modules with an underlying end-to-end data process chain is absolutely necessary to achieve the higher-level objectives. For daily use in production, the whole tool chain should be considered in a holistic approach, and it should be clarified how the individual modules can be linked together in an effective way and how robustness and precision can be increased by the use of underlying data. To do this, a test framework is needed to benchmark the existing models and approaches against each other in a defined environment. Therefore, it is planned to develop such a framework to enable users to decide which approach fits their requirements and to reveal remaining potential for further research.