Abstract
In this work, an approach for robot skill learning from voice commands and hand movement sequences is proposed. The motion is recorded by a 3D camera. The proposed framework consists of three elements. Firstly, a hand detector is applied to each frame to extract key points, which are represented by 21 landmarks. The trajectory of the index fingertip is then taken as the hand motion for further processing. Secondly, the trajectories are divided by voice commands and fingertip velocities into five segments: reach, grasp, move, position and release, which are considered skills in this work. The required voice commands are grasp and release, as they have short duration and can be viewed as discrete events. Finally, dynamic movement primitives are learned to represent reach, move and position. To demonstrate the approach, a human demonstration of a pick-and-place task is recorded and evaluated.
1 Introduction
The demand for customized products has increased rapidly in the last decades, and manufacturing processes must be adjusted upon individual request. Collaborative robots can work hand-in-hand with human workers on assembly tasks, which improves flexibility in task execution. However, the application of such hybrid systems is still in its infancy. One obstacle is the complex robot programming process; another is the expertise required from the worker for each specific type of robot. Moreover, the tasks have to be re-programmed each time a new request arrives at the factory, which is time-consuming and increases production cost.
Learning from demonstration is a promising programming paradigm for non-experts. Kinesthetic teaching has been widely explored for data collection in the last decades [1]. However, the process can be tedious for a human worker, especially for multi-step tasks. Instead of guiding the robot directly by hand, visual observation has recently gained more attention, thanks to developments in the field of computer vision. Hand movement can be tracked and recorded by optical sensors. The trajectories from a demonstration are then segmented into elementary action sequences, such as picking and placing objects, also known as skills. A task model is then defined as a sequence of skills [2]. The basic motions (reach, grasp, move, position and release) in methods-time measurement (MTM) [3] are considered skills in this work, so that a learned task model can be optimized more flexibly during execution. For instance, the move motion can be optimized while reach and grasp remain unchanged. This representation is also beneficial for integrating natural language voice commands such as grasp and release, since these can be considered discrete events both for human speech and for robot execution. The aim of this work is to develop a framework in which the robot is able to learn a task from integrated natural language instructions and video demonstration. The main contributions of this work are:
- Proposal of a pipeline to extract hand motion from 3D video sequences.
- Proposal of integrating voice commands with velocity-based motion segmentation.
- Definition of skills according to methods-time measurement (MTM): extraction of discrete skills from voice commands and extraction of continuous skills from visual observation.
2 Related Work
This section provides a summary of recent literature on robot learning from visual observation. Ding et al. developed a learning strategy for assembly tasks in which continuous human hand movements are tracked by a 3D camera [4]. Finn et al. presented a visual imitation learning method that enables a robot to learn new skills from raw pixel input [5]. It allows the robot to acquire task knowledge from a single demonstration. However, the training process is time consuming and the learned model is sensitive to environment changes. Qiu et al. presented a system which observes human demonstrations with a camera [6]. A human worker demonstrates an object handling task wearing a hand glove. The hand pose is estimated by a deep learning model trained on 3D input data [7]. The human demonstration is segmented by Hidden Markov Models (HMMs) into motion primitives, so-called skills. The skills are represented by Dynamic Movement Primitives (DMPs), which allow generalization to new goal positions. However, there are no rules for defining the semantics of skills in the existing works: pick up, place and locate are considered skills by Qiu et al. [6], whereas Kyrarini et al. define them as start arm moving, object grasp and object release [8]. This causes difficulty when comparing the performance of different approaches. Shao et al. developed a framework which allows a robot to learn manipulation concepts from human visual demonstrations and natural language instructions [9]. By manipulation concepts they mean, for instance, “put [something] behind/into/in front of [something]”. The model’s inputs are a natural language instruction and an RGB image of the initial scene. The outputs are the parameters of a motion trajectory to accomplish the task in the given environment. Task policies are trained by an integrated reinforcement and supervised learning algorithm.
Instead of classifying all possible actions in a video demonstration, the focus of this work is on extracting motion trajectories from each video.
3 Motion Segmentation
In this section, the methods for extracting hand motion trajectories and segmentation are described.
3.1 Data Collection
This work aims to extract human motion from video sequences, which contain both color and depth information of the hand motion. Given recorded motion data, a pipeline consisting of the following three steps is proposed. Firstly, objects that are more than one meter away from the camera origin are removed. Since the depth and color streams have different viewpoints, alignment is necessary before further processing; in the second step, the depth frame is therefore aligned to the color frame, so that the resulting frames have the same shape as the color image. Thirdly, the hands are detected by the MediaPipe framework [10] in the recorded color image sequences. The output of the hand detector is 21 3D hand-knuckle coordinates inside the detected hand regions; Fig. 1(a) shows an example. Each landmark is composed of an x-, y- and z-coordinate: x and y are normalized to [0.0, 1.0] by the image width and height respectively, and z represents the landmark depth with the wrist being the origin. An illustration of the landmarks on the hand can be found on the MediaPipe website. If the hand is not detected, the time stamp is excluded from the output time sequences. Otherwise, key points in pixel coordinates are transformed to the camera coordinate system shown in Fig. 3(a). A detailed illustration of the hand landmarks in the world coordinate system, with the wrist as origin, is given in Fig. 1(b). A flowchart of the proposed pipeline is summarized in Fig. 2.
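The transformation of detected key points from pixel to camera coordinates in the third step can be sketched with a pinhole camera model as below. The intrinsic parameters and the helper names are illustrative placeholders, not the calibrated values of the camera used here:

```python
import numpy as np

def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with known depth (in meters)
    into the camera coordinate system using a pinhole model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

def landmarks_to_camera(landmarks, depth_frame, fx, fy, cx, cy):
    """Convert MediaPipe's normalized landmarks (x, y in [0, 1]) to
    3D camera-frame points, looking up depth at each pixel location."""
    h, w = depth_frame.shape
    points = []
    for lm_x, lm_y in landmarks:
        u, v = int(lm_x * w), int(lm_y * h)
        points.append(pixel_to_camera(u, v, depth_frame[v, u], fx, fy, cx, cy))
    return np.stack(points)
```

Samples whose depth look-up fails (e.g. zero depth from the sensor) would be dropped here, matching the exclusion of undetected frames above.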
3.2 Motion Representation
The goal of motion segmentation is to split the recorded time series into five basic motions: reach, grasp, move, position and release [11]. Together they build up the motion cycle of pick-and-place in multi-step manipulation tasks. The trajectories can be represented by \(P(x,y,z) = [p(t_{0}), p(t_{1}), \dots , p(t_{i}), \dots , p(t_{n})]\), where \(t_{i}\) represents the temporal information. The segmentation task is to define the starting and ending time of each motion. Grasp and release are two basic motions with short duration. By recognizing voice commands from the human, the time stamps of the voice input can be mapped to the hand motion trajectories. The move and position motions are segmented by hand moving speed, based on the assumption that the speed of the position motion decreases monotonically. The results of the segmentation are outlined in Table 1.
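The velocity-based part of the segmentation can be sketched as follows. Taking the final monotonically decreasing stretch of the speed profile as the position motion is an illustrative reading of the assumption above, not necessarily the exact rule used here:

```python
import numpy as np

def split_move_position(positions, times):
    """Split the trajectory between grasp and release into 'move' and
    'position'. Heuristic: 'position' starts where the hand speed begins
    its final monotonic decrease toward the goal."""
    # finite-difference speed along the 3D trajectory
    disp = np.diff(positions, axis=0)
    dt = np.diff(times)
    speed = np.linalg.norm(disp, axis=1) / dt
    # walk backwards from the end while the speed keeps decreasing
    k = len(speed) - 1
    while k > 0 and speed[k - 1] >= speed[k]:
        k -= 1
    return positions[: k + 1], positions[k:]  # move part, position part
```

In practice the noisy speed signal would be low-pass filtered before this split; the boundary sample is shared by both segments.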
4 DMP for Skills Representation
Dynamic Movement Primitives (DMPs) are a way to learn motor actions, which are formalized as stable nonlinear attractor systems [12]. There are many variations of DMPs. As summarized by Fabisch [13], they have in common that
- they have an internal time variable (phase), which is defined by a so-called canonical system,
- they can be adapted by tuning the weights of a forcing term, and
- a transformation system generates goal-directed accelerations.
The canonical system uses the phase variable z, which replaces explicit timing in DMPs. Its values are generated by

$$\tau \dot{z} = -\alpha z,$$

where z starts at 1 and approaches 0, \(\tau \) is the duration of the movement primitive, and \(\alpha \) is a constant that has to be set such that z approaches 0 sufficiently fast. The transformation system is a spring-damper system and generates a goal-directed motion that is controlled by the phase variable z and modified by a forcing term f.
The transformation system takes the common form

$$\tau ^{2} \ddot{y} = \alpha _{y}\bigl (\beta _{y}(g - y) - \tau \dot{y}\bigr ) + f(z),$$

where y, \(\dot{y}\), \(\ddot{y}\) are interpreted as the desired position, velocity and acceleration for a control system, \(y_{0}\) is the start and g is the goal of the movement, and \(\alpha _{y}\), \(\beta _{y}\) are gain constants. The forcing term f can be chosen, for instance, as a normalized weighted sum of radial basis functions:
$$f(z) = \frac{\sum _{i} w_{i}\,\psi _{i}(z)}{\sum _{i} \psi _{i}(z)}\, z\,(g - y_{0}),$$

with parameters w that control the shape of the trajectory. The influence of the forcing term decays as the phase variable approaches 0. \(\psi _{i}(z) = \exp (-\frac{d_{i}}{2}(z-c_{i})^{2})\) are radial basis functions with constants \(d_{i}\) (widths) and \(c_{i}\) (centers). The DMP formulation presented in [14] is considered in this work, such that a desired final velocity can be incorporated.
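To make the formulation concrete, a minimal one-dimensional DMP with Euler integration can be sketched as below. This is a didactic illustration, not the implementation of [13] used in the paper, and it omits the final-velocity extension of [14]; all parameter values are illustrative:

```python
import numpy as np

class SimpleDMP:
    """Minimal one-dimensional discrete DMP, integrated with Euler steps.

    Canonical system:       tau * z' = -alpha * z
    Transformation system:  tau^2 * y'' = a * (b * (g - y) - tau * y') + f(z)
    Forcing term:           f(z) = sum_i w_i psi_i(z) / sum_i psi_i(z) * z * (g - y0)
    """

    def __init__(self, n_basis=30, alpha=4.0, a=25.0):
        self.alpha, self.a, self.b = alpha, a, a / 4.0
        # RBF centers follow the phase decay; widths derived from their spacing
        self.c = np.exp(-alpha * np.linspace(0.0, 1.0, n_basis))
        self.d = 1.0 / np.diff(self.c, append=0.9 * self.c[-1]) ** 2
        self.w = np.zeros(n_basis)

    def _psi(self, z):
        return np.exp(-0.5 * self.d * (z - self.c) ** 2)

    def imitate(self, y_demo, tau):
        """Fit the RBF weights so the DMP reproduces a demonstration."""
        t = np.linspace(0.0, tau, len(y_demo))
        y0, g = y_demo[0], y_demo[-1]
        yd = np.gradient(y_demo, t)
        ydd = np.gradient(yd, t)
        z = np.exp(-self.alpha * t / tau)
        # forcing values the demonstration requires at each phase value
        f_target = tau ** 2 * ydd - self.a * (self.b * (g - y_demo) - tau * yd)
        s = z * (g - y0)
        for i in range(len(self.w)):
            psi = np.exp(-0.5 * self.d[i] * (z - self.c[i]) ** 2)
            self.w[i] = np.sum(psi * s * f_target) / (np.sum(psi * s ** 2) + 1e-10)

    def rollout(self, y0, g, tau, dt=0.005):
        """Integrate the system toward a (possibly new) goal g."""
        y, v, z = y0, 0.0, 1.0   # v is the scaled velocity tau * y'
        ys = [y]
        for _ in range(int(tau / dt)):
            psi = self._psi(z)
            f = (psi @ self.w) / (psi.sum() + 1e-10) * z * (g - y0)
            v += dt * (self.a * (self.b * (g - y) - v) + f) / tau
            y += dt * v / tau
            z += dt * (-self.alpha * z) / tau
            ys.append(y)
        return np.array(ys)
```

Because the forcing term vanishes with z, the rollout converges to whatever goal g is given, which is exactly the goal-adaptation property exploited in Sect. 6.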
5 Experimental Setup
The setup is illustrated in Fig. 3(a). An Intel® RealSense™ L515 3D camera is mounted on the robot to record hand movements. The working space of the robot can be captured by the camera, as shown in Fig. 3(b). As a LiDAR camera, it projects an infrared laser at 860 nm wavelength as an active light source. 3D data is obtained by evaluating the time required for the projected signal to bounce off the objects in the scene and return to the camera [15]. The size of the color images recorded by the L515 is \(1280\,\times \,720\) and the size of the depth images is \(640\,\times \,480\). The Intel RealSense Viewer is used to record the video sequences. The natural language text for segmentation is manually inserted into the recorded sequences.
6 Experimental Results
To validate the proposed approach, a task demonstration is recorded with the setup in Fig. 3, in which the human demonstrated a pick-and-place task. The methods proposed in Sects. 3 and 4 are applied on the recorded sequence. The results are discussed in the following.
6.1 Segmentation
The index fingertip trajectory in X is outlined in Fig. 4. It shows that some data are missing due to depth errors in the data collection process. The segmentation based on the voice commands grasp and release, achieved by mapping the temporal information of the voice input onto the trajectory, can also be found in Fig. 4. In the next step, the sequence between grasp and release is segmented into move and position, as illustrated in Fig. 5.
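Mapping the voice-command time stamps onto the recorded trajectory reduces to a nearest-sample lookup; a sketch (function and parameter names are illustrative):

```python
import numpy as np

def segment_by_commands(times, grasp_time, release_time):
    """Map voice-command time stamps onto trajectory sample indices and
    return the index ranges before grasp (reach), between grasp and
    release (move + position), and after release."""
    i_grasp = int(np.searchsorted(times, grasp_time))
    i_release = int(np.searchsorted(times, release_time))
    return (0, i_grasp), (i_grasp, i_release), (i_release, len(times))
```

The middle range is what the velocity-based rule then splits further into move and position.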
6.2 Learning DMPs
Before learning the DMPs, linear interpolation is applied to both time and trajectory data. Additionally, the time series are shifted such that every trajectory starts at time zero. Three DMPs are learned to represent the trajectories in X, Y and Z. The implementation by Fabisch [13] is used to learn the DMPs. For the sake of simplicity, only X and Y are shown in Fig. 6 and Fig. 7. The learned model can be adapted to new goal positions such as 0, 1 or 2.
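The preprocessing described above can be sketched as follows; the uniform resampling grid and the helper name are illustrative choices, since linear interpolation over a common time base both fills the depth-error gaps and aligns the start at time zero:

```python
import numpy as np

def preprocess(times, values, n_samples=200):
    """Shift a trajectory so it starts at time zero and linearly
    interpolate it onto a uniform time grid, filling missing samples."""
    t = np.asarray(times, dtype=float)
    t = t - t[0]                               # shift start to time zero
    grid = np.linspace(0.0, t[-1], n_samples)  # uniform resampling grid
    return grid, np.interp(grid, t, values)    # fills gaps linearly
```

Each of the X, Y and Z series would be passed through this step before fitting its DMP.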
To represent the move motion, it is essential that the DMPs can be adapted to different final velocities. The results for trajectories and velocities in Fig. 8 show that the learned DMPs can be adapted to different final velocities such as 0, 1 and 2, where \(x_{d}\) and \(y_{d}\) represent the goal velocity. Furthermore, the results show that the recorded hand trajectories are not smooth and cannot be applied to the robot directly. The smoothness can be improved by the learned DMPs.
7 Conclusion
To reduce the complexity of robot programming, an integrated approach is introduced in this work for robot learning of skills from voice commands and a single video demonstration. The index finger trajectories extracted from the video are first segmented into five basic motions: reach, grasp, move, position and release. This is realized by voice input of grasp and release during video recording, followed by segmenting move and position based on hand moving velocities. DMPs are then learned to represent reach, move and position; they are adaptable to new goal positions and velocities. The experimental results show the feasibility of the proposed approach. As future work, the missing-data problem caused by depth errors should be addressed, and the learned DMPs should be evaluated on a real robot.
References
Pervez, A.: Task parameterized robot skill learning via programming by demonstrations. (2018)
Berg, J.K.: System zur aufgabenorientierten Programmierung für die Mensch-Roboter-Kooperation (2020)
Julian K., Lukas B., Martin G., Schüppstuhl, T.: A methods-time-measurement based approach to enable action recognition for multi-variant assembly in human-robot collaboration. In: 9th CIRP Conference on Assembly Technology and Systems, Procedia CIRP 106, 233–238 (2022)
Ding, G., Liu, Y., Zang, X., Zhang, X., Liu, G., Zhao, J.: A task-learning strategy for robotic assembly tasks from human demonstrations. Sensors 20(19), 5505 (2020)
Finn, C., Yu, T., Zhang, T., Abbeel, P., Levine, S.: One-shot visual imitation learning via meta-learning. In Conference on Robot Learning, pp. 357–368. PMLR (2017)
Qiu, Z., Eiband, T., Li, S., Lee, D.: Hand pose-based task learning from visual observations with semantic skill extraction. In 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 596–603. IEEE (2020)
Li, S., Lee, D.: Point-to-pose voting based hand pose estimation using residual permutation equivariant layer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11927–11936 (2019)
Kyrarini, M., Haseeb, M.A., Ristić-Durrant, D., Gräser, A.: Robot learning of industrial assembly task via human demonstrations. Auton. Robot. 43(1), 239–257 (2019)
Shao, L., Migimatsu, T., Zhang, Q., Yang, K., Bohg, J.: Concept2robot: learning manipulation concepts from instructions and human demonstrations. Int. J. Robot. Res. 40(12–14), 1419–1434 (2021)
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., Zhang, F., Chang, C.-L., Yong, M.G., Lee, J., et al.: MediaPipe: a framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019)
Keyvani, A., Lämkull, D., Bolmsjö, G., Örtengren, R.: Using methods-time measurement to connect digital humans and motion databases. In HCI (2013)
Schaal, S.: Dynamic movement primitives-a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, pp. 261–280. Springer (2006)
Fabisch, A.: Learning and generalizing behaviors for robots from human demonstration. PhD thesis, University of Bremen (2020)
Mülling, K., Kober, J., Kroemer, O., Peters, J.: Learning to select and generalize striking movements in robot table tennis. Int. J. Robot. Res. 32(3), 263–279 (2013)
Servi, M., Mussi, E., Profili, A., Furferi, R., Volpe, Y., Governi, L., Buonamici, F.: Metrological characterization and comparison of d415, d455, l515 real sense devices in the close range. Sensors 21(22), 7770 (2021)
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2023 The Author(s)
Lu, S., Berger, J., Schilp, J. (2023). An Integrated Approach for Hand Motion Segmentation and Robot Skills Representation. In: Schüppstuhl, T., Tracht, K., Fleischer, J. (eds) Annals of Scientific Society for Assembly, Handling and Industrial Robotics 2022. MHI 2022. Springer, Cham. https://doi.org/10.1007/978-3-031-10071-0_24
Print ISBN: 978-3-031-10070-3
Online ISBN: 978-3-031-10071-0