
1 Motivation

Short product life cycles, an increasing number of product variants and more complex goods pose challenges to the manufacturing industry. Flexible automation involving robot systems contributes to improving the situation. However, conventional automation reaches its limits in scenarios with uncertainties. These include manipulation and grasping operations in bin picking for material supply and machine feeding [1].

Deep Reinforcement Learning (RL) is an enabler for autonomous robot skills able to cope with complex grasping operations. However, the implementation of RL within industrial applications is limited to specific, low-demanding use cases, caused by the complexity of learning environment setups [2]. Deep Imitation Learning (IL) represents an alternative in which human cognitive skills are involved in the learning process, so that less parametrization is required. As demonstrations formulate an explicit, often intuitive learning objective, more manipulation scenarios are covered [3].

Limiting factors of IL are the restriction to human demonstrations and the effort required to generate a sufficient amount of annotations [4]. The utilized Human Machine Interface (HMI) represents another factor for IL performance. Existing approaches lack a real-time, complementary exploitation of multiple sensor and semantic data sources.

In this context, the contribution of this paper is to propose an Augmented Virtuality (AV)-based input demonstration refinement method. The method enables efficient hybrid learning for manipulation operations. Hybrid learning combines known RL and IL algorithms by formulating weighted objective functions within shared constraints. In computer science, AV refers to the augmentation of Virtual Reality (VR) with real-world elements, enriching the user experience [5]. The overall objective of the method is to reduce the adaptation effort required for new and changing scenarios. In addition, the improved annotation quality through demonstrations increases grasping success rates. The hybrid learning method is further characterized by flexible, iterative learning. Moreover, successive AV-based dataset refinement and fault interventions during system ramp-up enable application tuning up to operational productive deployment.

2 Related Work: Learning Strategies in Industrial Bin Picking

While fully autonomous robots have not yet proven deployable on the shop floor, the industrial application of partly autonomous robot capabilities is an active field of research. Skill- and behavior-based abstraction architectures, as a flexible design paradigm for robot software, find application across industrial research domains [6].

Industrial bin picking is a manipulation subdomain within the aforementioned setting. Stereo-vision-based object recognition and pose estimation based on Convolutional Neural Networks (CNN) improve the success rates of bin picking applications [7, 8].

However, the underlying manipulation skill is often a source of grasping failures, still inhibiting success rates [9]. Current research focuses on flexible grasping strategies, with increased activity in the RL domain [1, 2]. Markovian Q-learning within Actor-Critic neural networks such as Soft Actor-Critic (SAC) can enable the application of RL to complex robot scenarios [10]. As a result, high success rates are achieved for simple handling tasks [11]. Through Policy Gradient methods such as Proximal Policy Optimization (PPO), collision avoidance can be integrated into learning strategies [12].

In [11] and [12], manipulation tasks are either presented in a simplified manner or some grasping attempts remain unsolved. RL is sensitive to case-specific hyperparametrization and environment stochasticity [13]. In general, heterogeneous manipulation tasks benefit from more abstract approaches rather than case-specific RL implementations.

IL, as part of the learning-from-demonstrations domain, is an alternative Machine Learning (ML) paradigm suitable for complex manipulation. In particular, it integrates well with AV teleoperation and offers a more abstract character, enabling a wider range of robotic applications [14]. IL is mostly applied in scenarios requiring sequences of specific state transitions to reduce search space complexity [15]. Since algorithms such as Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL) require multiple high-quality demonstrations, IL is combined with Human-in-the-Loop (HuITL) approaches [3, 16]. Here, failure intervention through teleoperation realizes dataset refinement [4]. This strategy is costly in terms of input data generation and tends to overfit when utilized for easy-to-solve bin picking tasks.

Since RL and IL share the theoretical paradigm of Markov Decision Processes, approaches combining their target functions exist [17]. There, initial learning from demonstrations is proposed to enable faster collection of positive rewards. The combined RL and IL approach serves for policy improvement and diversification. This hybrid learning strategy is referred to as reward-consistent Imitation Learning [17].

Consequently, although IL and HuITL approaches for teleoperated intervention as well as hybrid RL-IL concepts are already considered in research, more in-depth R&D is required for industrial manipulation scenarios (e.g. bin picking). Related methods are either tailored to specific setups and manipulation scenarios or do not explicitly address IL for manipulation bottlenecks. Furthermore, the potential of IL often remains unexploited due to inappropriate VR-HMIs, which are inferior to an AV-based real-world environment reconstruction deployed for dataset refinement.

3 Augmented Virtuality-Based Hybrid Manipulation Learning

In the following section, the hybrid RL-IL method involving AV-based input refinement for bin picking scenarios is introduced (see Fig. 1). Subsequently, Fig. 2 describes an architectural concept for method integration along a suggested ramp-up process.

Fig. 1. Hybrid learning strategy for weighted Reinforcement Learning and reward-consistent Imitation Learning, improving the bin picking grasping policy by utilizing AV-based input refinement.

The proposed hybrid learning strategy (see Fig. 1) is designed to iteratively improve the grasping policies underlying autonomous bin picking skills. In the upper right section of Fig. 1, sensor data obtained from a 3D RGB camera is processed by a CNN for object localization and pose estimation. This data serves as input for the grasping policy (top).

As hybrid learning strategy (dashed outlined box), Deep Learning implemented as weighted RL-IL training (Fig. 1, center, green box) is utilized. The initial training is performed either via conventional RL (Fig. 1, lower left) or via hybrid learning. For RL, the PPO and SAC algorithms are used for training. For hybrid learning, a VR-based IL environment (Fig. 1, bottom, center) serves during virtual commissioning. The latter enriches the initial dataset with further human demonstrations for subsequent training iterations.
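
To make the weighted RL-IL objective more tangible, the following minimal sketch combines a generic policy-gradient term with BC and GAIL terms in one loss. It is an illustration under stated assumptions (a Gaussian policy, a plain policy-gradient surrogate instead of the full PPO objective, a hypothetical discriminator network), not the authors' implementation or the ML-Agents internals; the weights w_bc and w_gail mirror the objective strengths reported in Sect. 4.2.

```python
"""Minimal sketch of a weighted RL-IL objective (illustration only)."""
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Gaussian policy over continuous actions (TCP translations/rotations)."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())


def hybrid_loss(policy, discriminator, obs, act, adv, demo_obs, demo_act,
                w_bc: float = 0.2, w_gail: float = 0.1) -> torch.Tensor:
    """Weighted combination of RL and IL objectives within shared constraints."""
    # RL term: policy-gradient surrogate (PPO clipping omitted for brevity)
    rl_loss = -(policy.dist(obs).log_prob(act).sum(-1) * adv).mean()
    # BC term: maximize the likelihood of demonstrated actions
    bc_loss = -policy.dist(demo_obs).log_prob(demo_act).sum(-1).mean()
    # GAIL term: encourage policy state-action pairs to look like demonstrations
    d_policy = torch.sigmoid(discriminator(torch.cat([obs, act], dim=-1)))
    gail_loss = -torch.log(d_policy + 1e-8).mean()
    return rl_loss + w_bc * bc_loss + w_gail * gail_loss
```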

A simple RL environment for industrial bin picking serves for digital grasping failure simulation (Fig. 1, lower left). It consists of a virtual agent (blue) with a collision model, a virtual Small Load Carrier (SLC), the collision environment and virtual objects within the SLC (top, left). The physics engine of the VR environment is utilized for realistic random multi-object arrangement and for filling the SLC with virtual grasping objects.

The Tool Center Point (TCP) of a simulated gripper represents the virtual agent. Continuous actions in the form of single translations or rotations within global Cartesian space are taken with each step of an episode. An episode ends as soon as the agent either surpasses a maximum number of steps or receives a sparse reward. Positive sparse rewards are triggered by the collision of the TCP with the grasping areas of an object to grasp. Negative rewards, on the other hand, are triggered by collision with any environmental element. Optionally, dense rewards are awarded each time the agent frame approaches the grasping area of a target object.
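
A self-contained toy sketch of this episode design is given below. It uses simplified geometry (translations only, no rotations or object physics) and assumed values for the step limit, bin size and reward magnitudes; it illustrates the sparse/dense reward structure rather than reproducing the repository environment.

```python
"""Toy sketch of the episode and reward design (assumed values, not the repository code)."""
import numpy as np

MAX_STEPS = 500                            # assumed episode cap
SLC_BOUNDS = np.array([0.3, 0.2, 0.15])    # assumed half-extents of the SLC in meters


class ToyBinPickingEnv:
    def __init__(self, dense_reward=True, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dense_reward = dense_reward   # dense shaping is optional (cf. Sect. 4.2)

    def reset(self):
        self.tcp = np.array([0.0, 0.0, SLC_BOUNDS[2]])                  # TCP starts above the bin
        self.target = self.rng.uniform(-SLC_BOUNDS, SLC_BOUNDS) * 0.8   # random grasp point
        self.steps = 0
        return self._obs()

    def step(self, action):
        """action: small Cartesian TCP translation increments (rotations omitted)."""
        prev_dist = np.linalg.norm(self.tcp - self.target)
        self.tcp = self.tcp + np.clip(action, -0.01, 0.01)
        self.steps += 1
        dist = np.linalg.norm(self.tcp - self.target)
        if dist < 0.01:                                  # sparse positive reward: grasp area reached
            return self._obs(), 1.0, True
        if np.any(np.abs(self.tcp) > SLC_BOUNDS):        # sparse negative reward: collision
            return self._obs(), -1.0, True
        reward = 0.1 * (prev_dist - dist) if self.dense_reward else 0.0  # dense approach term
        return self._obs(), reward, self.steps >= MAX_STEPS

    def _obs(self):
        return np.concatenate([self.tcp, self.target])


# Usage: obs = env.reset(), then repeated env.step(action) calls until done is True.
```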

Fig. 2. Method utilization along the ramp-up process (left); sample architecture for method integration involving scene generation for fault reproduction and a demonstration HMI (right).

Once initial data generation and virtual refinement are completed (Fig. 1), the grasping policy is deployed to the robot system. In case of failed autonomous grasping attempts during ramp-up or subsequent productive operation, an AV teleoperation interface serves as fault intervention mechanism. Hence, human fault-solving capabilities serve as demonstration input for subsequent weighted RL-IL retraining.
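
The resulting ramp-up loop can be summarized in a short orchestration sketch. All helper functions below are hypothetical placeholders standing in for hybrid training, autonomous deployment and AV teleoperated intervention; the actual orchestration in the repository may differ.

```python
# Hypothetical orchestration sketch of the ramp-up loop (placeholder helpers).

def train_weighted_rl_il(policy, demos):
    return policy              # placeholder: weighted RL-IL training (Fig. 1)


def deploy_and_collect_failures(policy):
    return []                  # placeholder: autonomous grasping; returns failed scenes


def teleoperated_demonstration(failure_scene):
    return failure_scene       # placeholder: AV teleoperated clearance demonstration


def ramp_up(policy, demos, max_iterations=5):
    """Iterative AV-based dataset refinement up to productive deployment."""
    for _ in range(max_iterations):
        policy = train_weighted_rl_il(policy, demos)
        failures = deploy_and_collect_failures(policy)
        if not failures:
            break                                    # productive operation reached
        demos += [teleoperated_demonstration(f) for f in failures]
    return policy
```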

Human demonstrations (Fig. 2, (A)) are captured during virtual commissioning (VR scenes, (B)) as well as during online intervention in occurring grasping failures (AV scenes, (C)). The AV serves as the scene for HuITL input data refinement involving online or recorded offline sensor data. The initial demonstration dataset, generated in randomized scenarios, aims to reduce search space complexity.

The AV scene generator (D) provides the required storing and snapshotting of virtual as well as real-world robot scenes. This additionally enables asynchronous offline input data refinement (C). For this purpose, a digital twin is generated involving stored raw and processed sensor data from a defined area of interest. This includes the point cloud of the SLC area, the environment configuration as well as related component-specific information (e.g. derived object classifications, localizations and six Degrees of Freedom (6DoF) pose estimates).
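
One possible way to organize such a snapshot is sketched below as a Python data structure; the field names and types are illustrative assumptions, not the schema used in the repository.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

import numpy as np


@dataclass
class ObjectEstimate:
    """Component-specific information derived from the perception pipeline."""
    class_label: str            # e.g. "shifting_rod" (derived object classification)
    confidence: float
    pose_6dof: np.ndarray       # 4x4 homogeneous transform (localization + orientation)


@dataclass
class AVSceneSnapshot:
    """Stored raw and processed sensor data of the defined area of interest."""
    timestamp: float
    point_cloud: np.ndarray                 # Nx3 points of the SLC area
    environment_config: Dict[str, Any]      # e.g. SLC pose, robot base frame, gripper type
    objects: List[ObjectEstimate] = field(default_factory=list)
```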

In principle, VR and AV share the same 3D rendering engine. In AV mode, however, the real-world robot environment is rendered by a soft-real-time-capable environment reconstruction pipeline. The latter operates on multiple sensor data inputs and their subsequent processing and characterization (e.g. object localization, pose estimation and knowledge augmentation) [5]. The IL stack (Fig. 1, right) used for VR and AV scenes employs reward-consistent IL utilizing the BC and GAIL algorithms.

4 Setup and Procedure of Experiments

A demonstrator and a digital twin are set up for method validation. The repository is provided at: https://github.com/FAU-FAPS/hybrid_manipulationlearning_unity3dros.

4.1 Demonstrator Setup

The proposed method and architecture are implemented using Unity3D as physics simulation engine running on an Industrial PC (IPC) equipped with an NVIDIA RTX 2080 GPU. Unity ML-Agents is utilized as the Deep Learning API for episode design and as runtime environment for demonstrations, training and inference. An HTC Vive Pro serves as HMI, using the SteamVR Unity plugin and OpenVR [5]. On a second IPC, the Robot Operating System (ROS) is installed for communication with the robot. Motion commands generated through the HMI are sent to a teleoperation middleware [18]. Point clouds are gathered by a robot-wrist-mounted stereo camera. Pose estimates of grasping objects are provided by a combined image processing pipeline, in which the DL-based Frustum PointNets algorithm is complemented by the fifth release of the You Only Look Once (YOLOv5) algorithm for region proposal.
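
The hand-off within this combined pipeline can be sketched as follows. The function bodies are placeholders (the actual YOLOv5 and Frustum PointNets integrations are not reproduced here); the sketch only illustrates the flow from 2D region proposal over frustum cropping to 6DoF pose estimation.

```python
"""Sketch of the two-stage perception hand-off (placeholder function bodies)."""
import numpy as np


def detect_regions(rgb_image: np.ndarray) -> list:
    """Placeholder for YOLOv5 inference; would return 2D boxes with class labels."""
    return []


def crop_frustum(point_cloud: np.ndarray, box_2d, intrinsics: np.ndarray) -> np.ndarray:
    """Keep only the points whose image projection falls inside the proposed 2D box."""
    return point_cloud


def estimate_pose(frustum_points: np.ndarray) -> np.ndarray:
    """Placeholder for Frustum-PointNets-style 6DoF pose regression (4x4 transform)."""
    return np.eye(4)


def perceive(rgb_image, point_cloud, intrinsics):
    """Combined pipeline: region proposal, frustum cropping, 6DoF pose estimation."""
    estimates = []
    for box in detect_regions(rgb_image):
        frustum = crop_frustum(point_cloud, box, intrinsics)
        estimates.append((box, estimate_pose(frustum)))
    return estimates   # object classes and 6DoF poses for the grasping policy
```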

Regarding robot and grasping components, industrial standard systems and semi-finished goods facilitate comparability. A YASKAWA HC10 six-joint articulated robot equipped with a conventional electro-mechanical two-finger gripper is utilized. As the major benchmark component, a shifting rod from a lorry’s limited-slip differential is chosen (see Fig. 3). The method is proven adaptable to components with differing characteristics, as shown and described in the GitHub repository documentation.

4.2 Procedure of Experiments

For evaluation, three scenes within Unity3D are implemented: Scene A for AV fault virtualization and demonstration, Scene B for training based on a defined set of hyperparameters, and Scene C for virtual or real robot inference (see Fig. 3).

Fig. 3. Evaluation workflow involving AV fault virtualization, demonstration and training as well as virtual inference and ROS trajectory export; in addition, the real-world bin picking setup with the exemplary grasping component “shifting rod” and the YASKAWA HC10 articulated robot.

Reward accumulation during training serves as the evaluation metric for comparing (i) RL not requiring human demonstrations, (ii) weighted RL-IL with 350 initial demonstrations in randomized scenarios, and (iii) weighted RL-IL based on 100 initial demonstrations and 250 fault scenario demonstrations as input refinement. The latter is referred to as weighted REF.

For every learning method, ten training runs are initiated for statistical validation of the results. Each run consists of \(3\times {10}^{7}\) steps. Weightings of the objective functions are adapted during runs involving IL: the BC algorithm is active with a weighted objective strength of 20% until step \(1\times {10}^{7}\), whereas GAIL is active with a strength of 10% throughout an entire run. Dense reward functions are active until step \(2\times {10}^{7}\). Fault scenario demonstrations are performed in 25 real-world bin picking intervention scenarios caused by failed RL- or RL-IL-based component grasping. For each individual fault scenario, ten subsequent clearance demonstrations are performed. Three volunteers experienced with the system performed the teleoperated demonstration process.
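
This staged weighting can be restated as a small schedule helper (an illustrative summary of the reported settings, not the configuration format of the training framework):

```python
def objective_weights(step: float):
    """Objective strengths over a 3e7-step run, as reported above (illustrative helper).

    Returns (bc_strength, gail_strength, dense_reward_active).
    """
    bc = 0.2 if step < 1e7 else 0.0    # BC at 20% objective strength until step 1e7
    gail = 0.1                         # GAIL at 10% throughout the entire run
    dense = step < 2e7                 # dense reward functions active until step 2e7
    return bc, gail, dense


# Example: at step 1.5e7 only GAIL (10%) and the dense reward remain active.
print(objective_weights(1.5e7))        # -> (0.0, 0.1, True)
```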

In a second experiment, the grasping success rates achieved and the resulting grasping durations during inference with the virtual and the real robot system are compared. This is performed for RL, RL-IL and REF. To this end, a sample size of \(N=51\) runs is chosen. The sample size is validated through calculation of the corresponding \(p\)-values for RL, RL-IL and REF. For each method, the network with the most representative accumulated reward achieved during the first experiment is chosen. Every failed grasp in the virtual scene is also counted as a failure for the real-world robot system; the resulting trajectories are not exported in order to prevent collision damage.

RL, RL-IL and REF share the same hyperparameter configuration. The experiments do not focus on hyperparameter optimization; hence, one configuration leading to satisfactory learning has been chosen and remains unchanged for a valid comparison. The utilized hyperparameter configuration files will be provided within the repository.

5 Results and Discussion of the Hybrid Learning Evaluation

For all methods, better results are achieved with PPO than with SAC. For the graphs in Fig. 4 (A), the final mean accumulated reward is 0.765 for RL (Standard Deviation (SD): 0.016), 0.833 for weighted RL-IL (SD: 0.005) and 0.865 for weighted REF (SD: 0.006). With these networks, mean grasping success rates during virtual inference over \(1\times {10}^{7}\) steps of 76.61% for RL, 82.45% for RL-IL and 84.38% for REF are obtained. RL learns within a shorter time span and converges faster.

Fig. 4. Mean cumulative reward (A) and mean episode length (B) across training steps with 2.5% and 97.5% quantiles, each for five trainings in three learning sets.

For the graphs shown in Fig. 4 (B), the mean episode length given in steps for RL is 34.0 (SD: 0.44), for RL-IL 117.8 (SD: 3.56), and for REF 97.6 (SD: 2.07).

Virtually achieved grasping success rates (see Fig. 5 (A)) show statistical significance, as \(p_{\text{RL}} = 0.0078\), \(p_{\text{RL-IL}} = 0.0024\) and \(p_{\text{REF}} = 0.0064\) are calculated for ten inference runs over the chosen sample size \(N\). Experiments using the real robot system show a drop in grasping success rate for RL, which is not as drastic for RL-IL and REF. Moreover, the drop is smaller for REF than for RL-IL.

Fig. 5. Results for grasping success rate (A) and grasping duration (B) within the real robot environment in comparison to virtual inference before trajectory export to ROS.

Mean grasping durations of 1.1 s (SD: 0.01) for RL, 5.9 s (SD: 0.61) for weighted RL-IL and 4.9 s (SD: 0.63) for weighted REF are obtained. The measured superiority of RL in Fig. 5 (B) matches the observations in Fig. 4 (B). While the distribution of grasping durations is broader for RL-IL and REF, the values lie closer together for RL. This improves slightly for REF over RL-IL, although some outliers above the median are measured. The grasping duration of all methods is adjustable by scaling the trajectory execution; the grasping success itself is not influenced by this.

The graphs plotted in Fig. 4 (A) reveal an improved cumulative reward due to hybrid learning in virtual training environments; REF increases rewards even further. The advantages of RL over RL-IL and REF with regard to productivity lie in faster learning between steps \(0.3\times {10}^{7}\) and \(1\times {10}^{7}\) as well as in a shorter episode length (Fig. 4 (B)). The increased episode length in RL-IL and REF is explainable by the more elaborate and tentative nature of the imitated human grasping. The superior performance of REF aligns with a shorter episode length during fault scenario demonstration; an underlying reason for this could be increasing human routine over repeated demonstrations. Considering the computed quantiles across both graphs in Fig. 4, a more stable and reliable learning of RL-IL and REF compared to RL is concluded.

Figure 5 (A) verifies the observations made with regard to accumulated reward. During inference, RL-IL and REF reveal a higher grasping success rate compared to RL. The grasping success rate drops during inference for RL on the real robot. This is explained by simplifications within the training environment. As RL, in contrast to IL, optimizes manipulation movements, some trajectories learned under a simplified setting do not lead to success in the real world. Nevertheless, RL-IL and REF almost replicate their virtual success rates in the real environment. It is concluded that, even with simpler but efficient learning models, sufficient results are achieved by IL. In particular, this applies to input refinement by demonstration within intervention scenarios (REF).

6 Conclusion and Outlook

In this work, a method for AV input demonstration refinement improving hybrid manipulation learning is described. Within a bin picking experimental setup involving standard components and systems, the method proves to considerably reduce the component adaptation effort required for successful grasping. At the same time, grasping success rates for the application are noticeably increased. Major enablers are the weighted enrichment of RL with IL as well as the successive reward-consistent demonstration within the immersive AV (exploiting human cognitive skills). In contrast to solely RL-based learning, the hybrid strategy is less affected by the domain gap between virtual commissioning and reality. Compared to pure IL, the effort required to generate a sufficient number of annotations for autonomous operation is considerably reduced.

Even though the hybrid learning strategy shows promising outcomes, further R&D is required. Future work will therefore investigate the optimization of hyperparameters. In addition, iterative, continuous AV input demonstration refinement along the ramp-up process will be emphasized. Further research is required regarding the transferability of human demonstrations to similar use cases through higher levels of abstraction.