
1 Motivation

Short product life cycles, an increasing number of product variants and more complex goods pose challenges to the manufacturing industry. Flexible automation involving robot systems contributes to improving the situation. However, conventional automation reaches its limits in scenarios with uncertainties. These include manipulation and grasping operations in bin picking for material supply and machine feeding [1].

Deep Reinforcement Learning (RL) is an enabler for autonomous robot skills able to cope with complex grasping operations. However, the implementation of RL within industrial applications is limited to specific, low-demanding use cases, caused by the complexity of learning environment setups [2]. Deep Imitation Learning (IL) represents an alternative in which human cognitive skills are involved in the learning process, so that less parametrization is required. As demonstrations formulate an explicit, often intuitive learning objective, more manipulation scenarios are covered [3].

Limiting factors of IL are the restriction to human demonstrations and the effort required to generate a sufficient amount of annotations [4]. The utilized Human Machine Interface (HMI) represents another factor for IL performance. Existing approaches lack a real-time, complementary exploitation of multiple sensor and semantic data sources.

In this context, the contribution of this paper is to propose an Augmented Virtuality (AV)-based input demonstration refinement method. The method enables efficient hybrid learning for manipulation operations. Hybrid learning combines known RL and IL algorithms by formulating weighted objective functions within shared constraints. In computer science, AV refers to the augmentation of Virtual Reality (VR) with real-world elements, enriching the user experience [5]. The overall objective of the method is to reduce the adaptation effort required for new and changing scenarios. In addition, the improved annotation quality through demonstrations increases grasping success rates. The hybrid learning method is further characterized by flexible, iterative learning. Moreover, successive AV-based dataset refinement and fault interventions during system ramp-up enable application tuning up to operational productive deployment.

2 Related Work: Learning Strategies in Industrial Bin Picking

While fully autonomous robots have not yet proven deployable on the shop floor, the industrial application of partly autonomous robot capabilities is an active field of research. Skill- and behavior-based abstraction architectures, as a flexible design paradigm for robot software, find application across industrial research domains [6].

Industrial bin picking is a manipulation subdomain within the aforementioned setting. Stereo-vision-based object recognition and pose estimation based on Convolutional Neural Networks (CNN) improve the success rates of bin picking applications [7, 8].

However, the underlying manipulation skill is often a source of grasping failures, still inhibiting success rates [9]. Current research focuses on flexible grasping strategies, with increased activity in the RL domain [1, 2]. Markovian Q-learning within Actor-Critic neural networks such as Soft Actor-Critic (SAC) can enable the application of RL to complex robot scenarios [10]. As a result, high success rates are achieved for simple handling tasks [11]. Through Policy Gradient methods such as Proximal Policy Optimization (PPO), collision avoidance can be integrated into learning strategies [12].

In [11] and [12], manipulation tasks are either presented in a simplified manner or some grasping attempts remain unsolved. RL is sensitive to case-specific hyperparametrization and environment stochasticity [13]. In general, heterogeneous manipulation tasks benefit from more abstract approaches rather than case-specific RL implementations.

IL, as part of the learning-from-demonstrations domain, is an alternative Machine Learning (ML) paradigm suitable for complex manipulation. In particular, it integrates well with AV teleoperation and offers a more abstract character, enabling a wider range of robotic applications [14]. IL is mostly applied in scenarios requiring sequences of specific state transitions to reduce search space complexity [15]. Since algorithms such as Behavioral Cloning (BC) and Generative Adversarial Imitation Learning (GAIL) require multiple high-quality demonstrations, IL is combined with Human-in-the-Loop (HuITL) approaches [3, 16]. Here, failure intervention through teleoperation realizes dataset refinement [4]. This strategy is costly in terms of input data generation and tends to overfit when utilized for easy-to-solve bin picking tasks.

Since RL and IL share the theoretical paradigm of Markov Decision Processes, approaches combining their target functions exist [17]. There, initial learning from demonstrations is proposed to enable faster collection of positive rewards. The combined RL and IL approach serves for policy improvement and diversification. This hybrid learning strategy is referred to as reward-consistent Imitation Learning [17].

Consequently, although IL and HuITL approaches for teleoperated intervention as well as hybrid RL-IL concepts are already considered in research, more in-depth R&D is required for industrial manipulation scenarios (e.g. bin picking). Related methods are either tailored to specific setups and manipulation scenarios or do not explicitly address IL for manipulation bottlenecks. Furthermore, the potential of IL often remains unexploited due to inappropriate VR-HMIs, which are inferior to an AV-based real-world environment reconstruction deployed for dataset refinement.

3 Augmented Virtuality-Based Hybrid Manipulation Learning

In the following section, the hybrid RL-IL method involving AV-based input refinement for bin picking scenarios is introduced (see Fig. 1). Subsequently, Fig. 2 describes an architectural concept for method integration along a suggested ramp-up process.

Fig. 1. Hybrid learning strategy for weighted Reinforcement Learning and reward-consistent Imitation Learning, improving the bin picking grasping policy by utilizing AV-based input refinement.

The proposed hybrid learning strategy (see Fig. 1) is designed to iteratively improve the grasping policies underlying autonomous bin picking skills. In the upper right section of Fig. 1, sensor data obtained from a 3D RGB camera is processed by a CNN for object localization and pose estimation. This data serves as input for the grasping policy (top).

As hybrid learning strategy (dashed outlined box), Deep Learning implemented as weighted RL-IL training (Fig. 1, center, green box) is utilized. The initial training is performed either via conventional RL (Fig. 1, lower left) or via hybrid learning. For RL, the PPO and SAC algorithms are used for training. For hybrid learning, a VR-based IL environment (Fig. 1, bottom, center) serves during virtual commissioning. The latter enriches the initial dataset with further human demonstrations for subsequent training iterations.
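
To make the weighted RL-IL objective more tangible, the following minimal sketch combines a generic policy-gradient term with BC and GAIL terms in one loss. It is an illustration under stated assumptions (a Gaussian policy, a plain policy-gradient surrogate instead of the full PPO objective, a hypothetical discriminator network), not the authors' implementation or the ML-Agents internals; the weights w_bc and w_gail mirror the objective strengths reported in Sect. 4.2.

```python
"""Minimal sketch of a weighted RL-IL objective (illustration only)."""
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Gaussian policy over continuous actions (TCP translations/rotations)."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.mean = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs: torch.Tensor) -> torch.distributions.Normal:
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())


def hybrid_loss(policy, discriminator, obs, act, adv, demo_obs, demo_act,
                w_bc: float = 0.2, w_gail: float = 0.1) -> torch.Tensor:
    """Weighted combination of RL and IL objectives within shared constraints."""
    # RL term: policy-gradient surrogate (PPO clipping omitted for brevity)
    rl_loss = -(policy.dist(obs).log_prob(act).sum(-1) * adv).mean()
    # BC term: maximize the likelihood of demonstrated actions
    bc_loss = -policy.dist(demo_obs).log_prob(demo_act).sum(-1).mean()
    # GAIL term: encourage policy state-action pairs to look like demonstrations
    d_policy = torch.sigmoid(discriminator(torch.cat([obs, act], dim=-1)))
    gail_loss = -torch.log(d_policy + 1e-8).mean()
    return rl_loss + w_bc * bc_loss + w_gail * gail_loss
```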

A simple RL environment for industrial bin picking serves for digital grasping failure simulation (Fig. 1, lower left). It consists of a virtual agent (blue) with a collision model, a virtual Small Load Carrier (SLC), the collision environment and virtual objects within the SLC (top, left). The physics engine of the VR environment is utilized for realistic random multi-object arrangement and for filling the SLC with virtual grasping objects.

The Tool Center Point (TCP) of a simulated gripper represents the virtual agent. Continuous actions in the form of single translations or rotations within global Cartesian space are taken with each step of an episode. An episode ends as soon as the agent either surpasses a maximum number of steps or receives a sparse reward. Positive sparse rewards are triggered by the collision of the TCP with the grasping areas of an object to grasp. Negative rewards, on the other hand, are triggered by collision with any environmental element. Optionally, dense rewards are awarded each time the agent frame approaches the grasping area of a target object.
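
A self-contained toy sketch of this episode design is given below. It uses simplified geometry (translations only, no rotations or object physics) and assumed values for the step limit, bin size and reward magnitudes; it illustrates the sparse/dense reward structure rather than reproducing the repository environment.

```python
"""Toy sketch of the episode and reward design (assumed values, not the repository code)."""
import numpy as np

MAX_STEPS = 500                            # assumed episode cap
SLC_BOUNDS = np.array([0.3, 0.2, 0.15])    # assumed half-extents of the SLC in meters


class ToyBinPickingEnv:
    def __init__(self, dense_reward=True, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dense_reward = dense_reward   # dense shaping is optional (cf. Sect. 4.2)

    def reset(self):
        self.tcp = np.array([0.0, 0.0, SLC_BOUNDS[2]])                  # TCP starts above the bin
        self.target = self.rng.uniform(-SLC_BOUNDS, SLC_BOUNDS) * 0.8   # random grasp point
        self.steps = 0
        return self._obs()

    def step(self, action):
        """action: small Cartesian TCP translation increments (rotations omitted)."""
        prev_dist = np.linalg.norm(self.tcp - self.target)
        self.tcp = self.tcp + np.clip(action, -0.01, 0.01)
        self.steps += 1
        dist = np.linalg.norm(self.tcp - self.target)
        if dist < 0.01:                                  # sparse positive reward: grasp area reached
            return self._obs(), 1.0, True
        if np.any(np.abs(self.tcp) > SLC_BOUNDS):        # sparse negative reward: collision
            return self._obs(), -1.0, True
        reward = 0.1 * (prev_dist - dist) if self.dense_reward else 0.0  # dense approach term
        return self._obs(), reward, self.steps >= MAX_STEPS

    def _obs(self):
        return np.concatenate([self.tcp, self.target])


# Usage: obs = env.reset(), then repeated env.step(action) calls until done is True.
```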

Fig. 2. Method utilization along the ramp-up process (left); sample architecture for method integration involving scene generation for fault reproduction and a demonstration HMI (right).

Once initial data generation and virtual refinement are completed (Fig. 1), the grasping policy is deployed to the robot system. In case of failed autonomous grasping attempts during ramp-up or subsequent productive operation, an AV teleoperation interface serves as fault intervention mechanism. Hence, human fault-solving capabilities serve as demonstration input for subsequent weighted RL-IL retraining.
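
The resulting ramp-up loop can be summarized in a short orchestration sketch. All helper functions below are hypothetical placeholders standing in for hybrid training, autonomous deployment and AV teleoperated intervention; the actual orchestration in the repository may differ.

```python
# Hypothetical orchestration sketch of the ramp-up loop (placeholder helpers).

def train_weighted_rl_il(policy, demos):
    return policy              # placeholder: weighted RL-IL training (Fig. 1)


def deploy_and_collect_failures(policy):
    return []                  # placeholder: autonomous grasping; returns failed scenes


def teleoperated_demonstration(failure_scene):
    return failure_scene       # placeholder: AV teleoperated clearance demonstration


def ramp_up(policy, demos, max_iterations=5):
    """Iterative AV-based dataset refinement up to productive deployment."""
    for _ in range(max_iterations):
        policy = train_weighted_rl_il(policy, demos)
        failures = deploy_and_collect_failures(policy)
        if not failures:
            break                                    # productive operation reached
        demos += [teleoperated_demonstration(f) for f in failures]
    return policy
```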

Human demonstrations (Fig. 2, (A)) are captured during virtual commissioning (VR scenes, (B)) as well as during online intervention in occurring grasping failures (AV scenes, (C)). The AV serves as the scene for HuITL input data refinement involving online or recorded offline sensor data. The initial demonstration dataset, generated in randomized scenarios, aims to reduce search space complexity.

The AV scene generator (D) provides the required storing and snapshotting of virtual as well as real-world robot scenes. This additionally enables asynchronous offline input data refinement (C). For this purpose, a digital twin is generated involving stored raw and processed sensor data from a defined area of interest. This includes the point cloud of the SLC area, the environment configuration as well as related component-specific information (e.g. derived object classifications, localizations and six Degrees of Freedom (6DoF) pose estimates).
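
One possible way to organize such a snapshot is sketched below as a Python data structure; the field names and types are illustrative assumptions, not the schema used in the repository.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

import numpy as np


@dataclass
class ObjectEstimate:
    """Component-specific information derived from the perception pipeline."""
    class_label: str            # e.g. "shifting_rod" (derived object classification)
    confidence: float
    pose_6dof: np.ndarray       # 4x4 homogeneous transform (localization + orientation)


@dataclass
class AVSceneSnapshot:
    """Stored raw and processed sensor data of the defined area of interest."""
    timestamp: float
    point_cloud: np.ndarray                 # Nx3 points of the SLC area
    environment_config: Dict[str, Any]      # e.g. SLC pose, robot base frame, gripper type
    objects: List[ObjectEstimate] = field(default_factory=list)
```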

In principle, VR and AV share the same 3D rendering engine. In AV mode, however, the real-world robot environment is rendered by a soft-real-time-capable environment reconstruction pipeline. The latter operates on multiple sensor data inputs and their subsequent processing and characterization (e.g. object localization, pose estimation and knowledge augmentation) [5]. The IL stack (Fig. 1, right) used for VR and AV scenes employs reward-consistent IL utilizing the BC and GAIL algorithms.

4 Setup and Procedure of Experiments

A demonstrator and a digital twin are set up for method validation. The repository is provided at: https://github.com/FAU-FAPS/hybrid_manipulationlearning_unity3dros.

4.1 Demonstrator Setup

The proposed method and architecture are implemented using Unity3D as physics simulation engine running on an Industrial PC (IPC) equipped with an NVIDIA RTX 2080 GPU. Unity ML-Agents is utilized as the Deep Learning API for episode design and as runtime environment for demonstrations, training and inference. An HTC Vive Pro serves as HMI, using the SteamVR Unity plugin and OpenVR [5]. On a second IPC, the Robot Operating System (ROS) is installed for communication with the robot. Motion commands generated through the HMI are sent to a teleoperation middleware [18]. Point clouds are gathered by a robot-wrist-mounted stereo camera. Pose estimates of grasping objects are provided by a combined image processing pipeline, in which the DL-based Frustum PointNets algorithm is complemented by the fifth release of the You Only Look Once (YOLOv5) algorithm for region proposal.
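
The hand-off within this combined pipeline can be sketched as follows. The function bodies are placeholders (the actual YOLOv5 and Frustum PointNets integrations are not reproduced here); the sketch only illustrates the flow from 2D region proposal over frustum cropping to 6DoF pose estimation.

```python
"""Sketch of the two-stage perception hand-off (placeholder function bodies)."""
import numpy as np


def detect_regions(rgb_image: np.ndarray) -> list:
    """Placeholder for YOLOv5 inference; would return 2D boxes with class labels."""
    return []


def crop_frustum(point_cloud: np.ndarray, box_2d, intrinsics: np.ndarray) -> np.ndarray:
    """Keep only the points whose image projection falls inside the proposed 2D box."""
    return point_cloud


def estimate_pose(frustum_points: np.ndarray) -> np.ndarray:
    """Placeholder for Frustum-PointNets-style 6DoF pose regression (4x4 transform)."""
    return np.eye(4)


def perceive(rgb_image, point_cloud, intrinsics):
    """Combined pipeline: region proposal, frustum cropping, 6DoF pose estimation."""
    estimates = []
    for box in detect_regions(rgb_image):
        frustum = crop_frustum(point_cloud, box, intrinsics)
        estimates.append((box, estimate_pose(frustum)))
    return estimates   # object classes and 6DoF poses for the grasping policy
```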

Regarding robot and grasping components, industrial standard systems and semi-finished goods facilitate comparability. A YASKAWA HC10 six-joint articulated robot equipped with a conventional electro-mechanical two-finger gripper is utilized. As the major benchmark component, a shifting rod from a lorry’s limited-slip differential is chosen (see Fig. 3). The method is proven adaptable to components with differing characteristics, as shown and described in the GitHub repository documentation.

4.2 Procedure of Experiments

For evaluation, three scenes within Unity3D are implemented: Scene A for AV fault virtualization and demonstration, Scene B for training based on a defined set of hyperparameters, and Scene C for virtual or real robot inference (see Fig. 3).

Fig. 3. Evaluation workflow involving AV fault virtualization, demonstration and training as well as virtual inference and ROS trajectory export; in addition, the real-world bin picking setup with the exemplary grasping component “shifting rod” and the YASKAWA HC10 articulated robot.

Reward accumulation during training serves as the evaluation metric for comparing (i) RL not requiring human demonstrations, (ii) weighted RL-IL with 350 initial demonstrations in randomized scenarios, and (iii) weighted RL-IL based on 100 initial demonstrations and 250 fault scenario demonstrations as input refinement. The latter is referred to as weighted REF.

For every learning method, ten training runs are initiated for statistical validation of the results. Each run consists of \(3\times {10}^{7}\) steps. Weightings of the objective functions are adapted during runs involving IL: the BC algorithm is active with a weighted objective strength of 20% until step \(1\times {10}^{7}\), whereas GAIL is active with a strength of 10% throughout an entire run. Dense reward functions are active until step \(2\times {10}^{7}\). Fault scenario demonstrations are performed in 25 real-world bin picking intervention scenarios caused by failed RL- or RL-IL-based component grasping. For each individual fault scenario, ten subsequent clearance demonstrations are performed. Three volunteers experienced with the system performed the teleoperated demonstration process.
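
This staged weighting can be restated as a small schedule helper (an illustrative summary of the reported settings, not the configuration format of the training framework):

```python
def objective_weights(step: float):
    """Objective strengths over a 3e7-step run, as reported above (illustrative helper).

    Returns (bc_strength, gail_strength, dense_reward_active).
    """
    bc = 0.2 if step < 1e7 else 0.0    # BC at 20% objective strength until step 1e7
    gail = 0.1                         # GAIL at 10% throughout the entire run
    dense = step < 2e7                 # dense reward functions active until step 2e7
    return bc, gail, dense


# Example: at step 1.5e7 only GAIL (10%) and the dense reward remain active.
print(objective_weights(1.5e7))        # -> (0.0, 0.1, True)
```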

In a second experiment, the grasping success rates achieved and the resulting grasping durations during inference with the virtual and the real robot system are compared. This is performed for RL, RL-IL and REF. To this end, a sample size of \(N=51\) runs is chosen. The sample size is validated through calculation of the corresponding \(p\)-values for RL, RL-IL and REF. For each method, the network with the most representative accumulated reward achieved during the first experiment is chosen. Every failed grasp in the virtual scene is also counted as a failure for the real-world robot system; the resulting trajectories are not exported in order to prevent collision damage.

RL, RL-IL and REF share the same hyperparameter configuration. The experiments do not focus on hyperparameter optimization; hence, one configuration leading to satisfactory learning has been chosen and remains unchanged for a valid comparison. The utilized hyperparameter configuration files will be provided within the repository.

5 Results and Discussion of the Hybrid Learning Evaluation

For all methods, better results are achieved with PPO than with SAC. For the graphs in Fig. 4 (A), the final mean accumulated reward is 0.765 for RL (Standard Deviation (SD): 0.016), 0.833 for weighted RL-IL (SD: 0.005) and 0.865 for weighted REF (SD: 0.006). With these networks, mean grasping success rates during virtual inference over \(1\times {10}^{7}\) steps of 76.61% for RL, 82.45% for RL-IL and 84.38% for REF are obtained. RL learns within a shorter time span and converges faster.

Fig. 4. Mean cumulative reward (A) and mean episode length (B) across training steps with 2.5% and 97.5% quantiles, each for five trainings in three learning sets.

For the graphs shown in Fig. 4 (B), the mean episode length given in steps for RL is 34.0 (SD: 0.44), for RL-IL 117.8 (SD: 3.56), and for REF 97.6 (SD: 2.07).

Virtually achieved grasping success rates (see Fig. 5 (A)) show statistical significance, as \(p_{\text{RL}} = 0.0078\), \(p_{\text{RL-IL}} = 0.0024\) and \(p_{\text{REF}} = 0.0064\) are calculated for ten inference runs over the chosen sample size \(N\). Experiments using the real robot system show a drop in grasping success rate for RL, which is not as drastic for RL-IL and REF. Moreover, the drop is smaller for REF than for RL-IL.

Fig. 5. Results for grasping success rate (A) and grasping duration (B) within the real robot environment in comparison to virtual inference before trajectory export to ROS.

Mean grasping durations of 1.1 s (SD: 0.01) for RL, 5.9 s (SD: 0.61) for weighted RL-IL and 4.9 s (SD: 0.63) for weighted REF are obtained. The measured superiority of RL in Fig. 5 (B) matches the observations in Fig. 4 (B). While the distribution of grasping durations is broader for RL-IL and REF, the values lie closer together for RL. This improves slightly for REF over RL-IL, although some outliers above the median are measured. The grasping duration of all methods is adjustable by scaling the trajectory execution; the grasping success itself is not influenced by this.

The graphs plotted in Fig. 4 (A) reveal an improved cumulative reward due to hybrid learning in virtual training environments; REF increases rewards even further. The advantages of RL over RL-IL and REF with regard to productivity lie in faster learning between steps \(0.3\times {10}^{7}\) and \(1\times {10}^{7}\) as well as in a shorter episode length (Fig. 4 (B)). The increased episode length in RL-IL and REF is explainable by the more elaborate and tentative nature of the imitated human grasping. The superior performance of REF aligns with a shorter episode length during fault scenario demonstration; an underlying reason for this could be increasing human routine over repeated demonstrations. Considering the computed quantiles across both graphs in Fig. 4, a more stable and reliable learning of RL-IL and REF compared to RL is concluded.

Figure 5 (A) verifies the observations made with regard to accumulated reward. During inference, RL-IL and REF reveal a higher grasping success rate compared to RL. The grasping success rate drops during inference for RL on the real robot. This is explained by simplifications within the training environment. As RL, in contrast to IL, optimizes manipulation movements, some trajectories learned under a simplified setting do not lead to success in the real world. Nevertheless, RL-IL and REF almost replicate their virtual success rates in the real environment. It is concluded that, even with simpler but efficient learning models, sufficient results are achieved by IL. In particular, this applies to input refinement by demonstration within intervention scenarios (REF).

6 Conclusion and Outlook

In this work, a method for AV input demonstration refinement improving hybrid manipulation learning is described. Within a bin picking experimental setup involving standard components and systems, the method proves to considerably reduce the component adaptation effort required for successful grasping. At the same time, grasping success rates for the application are noticeably increased. Major enablers are the weighted enrichment of RL with IL as well as the successive reward-consistent demonstration within the immersive AV (exploiting human cognitive skills). In contrast to solely RL-based learning, the hybrid strategy is less affected by the domain gap between virtual commissioning and reality. Compared to pure IL, the effort required to generate a sufficient number of annotations for autonomous operation is considerably reduced.

Even though the hybrid learning strategy shows promising outcomes, further R&D is required. Future work will therefore investigate the optimization of hyperparameters. In addition, iterative, continuous AV input demonstration refinement along the ramp-up process will be emphasized. Further research is required regarding the transferability of human demonstrations to similar use cases through higher levels of abstraction.