
1 Introduction

The demand for customized products has increased rapidly over the last decades, and manufacturing processes must be adjusted to individual requests. Collaborative robots can work hand-in-hand with human workers on assembly tasks, which improves flexibility in task execution. However, the application of such hybrid systems is still in its infancy. One obstacle is the complex robot programming process; another is the expertise required from the worker for each specific type of robot. Moreover, the tasks have to be re-programmed each time a new request arrives at the factory, which is time-consuming and increases production cost.

Learning from demonstration is a promising programming paradigm for non-experts. Kinesthetic teaching has been widely explored for data collection over the last decades [1]. However, the process can be tedious for a human worker, especially for multi-step tasks. Instead of guiding the robot directly by hand, visual observation has recently gained more attention, thanks to developments in the field of computer vision. Hand movements can be tracked and recorded by optical sensors. The trajectories from the demonstration are then segmented into elementary action sequences such as picking and placing objects, also known as skills. A task model is then defined as a sequence of skills [2]. The basic motions (reach, grasp, move, position and release) of methods-time measurement (MTM) [3] are considered as skills in this work, so that a learned task model can be optimized more flexibly during execution. For instance, the move motion can be optimized while reach and grasp remain unchanged. This representation is also beneficial for integrating natural language as voice commands such as grasp and release, since these can be treated as discrete events both in human speech and in robot execution. The aim of this work is to develop a framework in which the robot learns a task from integrated natural language instructions and video demonstration. The main contributions of this work are:

  • Proposal of a pipeline to extract hand motion from 3D video sequences.

  • Proposal of integrating voice commands with velocity-based motion segmentation.

  • Definition of skills according to methods-time measurement (MTM): extraction of discrete skills from voice command and extraction of continuous skills from visual observation.

2 Related Work

This section provides a summary of recent literature on robot learning from visual observation. Ding et al. developed a learning strategy for assembly tasks in which continuous human hand movement is tracked by a 3D camera [4]. Finn et al. presented a visual imitation learning method that enables a robot to learn new skills from raw pixel input [5]. It allows the robot to acquire task knowledge from a single demonstration. However, the training process is time-consuming and the learned model is sensitive to environment changes. Qiu et al. presented a system which observes human demonstrations with a camera [6]. A human worker demonstrates an object-handling task wearing a hand glove. The hand pose is estimated by a deep learning model trained on 3D input data [7]. The human demonstration is segmented by Hidden Markov Models (HMMs) into motion primitives, so-called skills. The skills are represented by Dynamic Movement Primitives (DMPs), which allow generalization to new goal positions. However, there are no common rules for defining the semantics of skills in existing works. Pick up, place and locate are considered as skills by Qiu et al. [6], whereas Kyrarini et al. define them as start arm moving, object grasp and object release [8]. This causes difficulty when comparing the performance of different approaches. Shao et al. developed a framework which allows a robot to learn manipulation concepts from human visual demonstrations and natural language instructions [9]. By manipulation concepts they mean, for instance, “put [something] behind/into/in front of [something]”. The model’s inputs are a natural language instruction and an RGB image of the initial scene. The outputs are the parameters of a motion trajectory that accomplishes the task in the given environment. Task policies are trained by an integrated reinforcement and supervised learning algorithm. Instead of classifying all possible actions in a video demonstration, the focus of this work is to extract motion trajectories from each video.

3 Motion Segmentation

In this section, the methods for extracting hand motion trajectories and segmentation are described.

3.1 Data Collection

This work aims to extract human motion from video sequences that contain both color and depth information of the hand movement. Given the recorded motion data, a pipeline consisting of the following three steps is proposed. Firstly, objects that are more than one meter away from the camera origin are removed. Since the depth and color streams have different viewpoints, an alignment is necessary before further processing. In the second step, the depth frame is therefore aligned to the color frame, so that the resulting frames have the same shape as the color image. Thirdly, the hands are detected by the MediaPipe framework [10] in the recorded color image sequences. The output of the hand detector is 21 3D hand-knuckle coordinates inside the detected hand region. Figure 1(a) shows an example. Each landmark is represented by its x-, y- and z-coordinates: x and y are normalized to [0.0, 1.0] by the image width and height, respectively, and z represents the landmark depth with the wrist being the origin. An illustration of the landmarks on the hand can be found on the MediaPipe website (see Footnote 1). If the hand is not detected, the time stamp is excluded from the output time sequences. Otherwise, the key points in pixel coordinates are transformed to the camera coordinate system shown in Fig. 3(a). A detailed illustration of the hand landmarks in the world coordinate system with the wrist as origin is given in Fig. 1(b). A flowchart of the proposed pipeline is summarized in Fig. 2, and a minimal code sketch of the pipeline is given below.
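The following is a minimal sketch of these three steps, assuming a live pyrealsense2 stream and the MediaPipe Hands solution. The 1 m threshold and the choice of the index finger tip (landmark 8) follow the description above; the stream settings, frame rate and variable names are illustrative assumptions.

```python
# Sketch of the pipeline in Sect. 3.1: depth clipping at 1 m, depth-to-color
# alignment, MediaPipe hand detection, and deprojection to camera coordinates.
import numpy as np
import pyrealsense2 as rs
import mediapipe as mp

MAX_DEPTH_M = 1.0  # step 1: discard everything farther than one meter

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)

align = rs.align(rs.stream.color)                   # step 2: align depth to color
hands = mp.solutions.hands.Hands(max_num_hands=1)   # step 3: MediaPipe hand detector

trajectory = []  # (t, x, y, z) of the index finger tip in camera coordinates
try:
    while True:
        frames = align.process(pipeline.wait_for_frames())
        color, depth = frames.get_color_frame(), frames.get_depth_frame()
        if not color or not depth:
            continue
        image = np.asanyarray(color.get_data())
        result = hands.process(np.ascontiguousarray(image[:, :, ::-1]))  # BGR -> RGB
        if not result.multi_hand_landmarks:
            continue  # hand not detected: exclude this time stamp
        lm = result.multi_hand_landmarks[0].landmark[8]  # index finger tip
        px = min(max(int(lm.x * image.shape[1]), 0), image.shape[1] - 1)
        py = min(max(int(lm.y * image.shape[0]), 0), image.shape[0] - 1)
        z = depth.get_distance(px, py)
        if z == 0.0 or z > MAX_DEPTH_M:
            continue  # invalid depth or outside the 1 m working range
        intr = color.profile.as_video_stream_profile().intrinsics
        x, y, z = rs.rs2_deproject_pixel_to_point(intr, [px, py], z)
        trajectory.append((color.get_timestamp() / 1000.0, x, y, z))
finally:
    pipeline.stop()
    hands.close()
```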

Fig. 1

(a) Detected hand key points on color image; (b) 21 hand joint positions with wrist as origin

Fig. 2

Flowchart of the process for generating hand motion trajectories

3.2 Motion Representation

The goal of motion segmentation is to split the recorded time series into the five basic motions reach, grasp, move, position and release [11], which build up the pick-and-place cycle in multi-step manipulation tasks. A trajectory is represented as \(P(x,y,z) = [p(t_{0}), p(t_{1}), \dots , p(t_{i}), \dots , p(t_{n})]\), where \(t_{i}\) denotes the temporal information. The segmentation task is to define the start and end time of each motion. Grasp and release are two basic motions of short duration. By recognizing the human voice commands, the time stamps of the voice input can be mapped onto the hand motion trajectory. The move and position motions are segmented by the hand moving speed, based on the assumption that the speed of the position motion decreases monotonically. The resulting representation of the skills is outlined in Table 1, and a sketch of the segmentation step is shown below.
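As a rough illustration of this logic, the following sketch maps the voice command time stamps onto the trajectory and splits the segment between grasp and release at the speed maximum, after which the speed is assumed to decrease monotonically. The function name, the use of np.gradient and the concrete split criterion are assumptions, not the exact implementation of this work.

```python
# Minimal segmentation sketch (Sect. 3.2): grasp/release come from voice command
# time stamps, move/position are split by the hand moving speed.
import numpy as np

def segment(t, p, t_grasp, t_release):
    """t: (n,) time stamps, p: (n, 3) hand positions, t_grasp/t_release: voice events."""
    speed = np.linalg.norm(np.gradient(p, t, axis=0), axis=1)

    i_grasp = np.searchsorted(t, t_grasp)
    i_release = np.searchsorted(t, t_release)

    # position starts at the speed maximum between grasp and release; from there
    # on the speed is assumed to decrease monotonically.
    i_position = i_grasp + int(np.argmax(speed[i_grasp:i_release]))

    return {
        "reach":    (0, i_grasp),            # approach before the grasp command
        "grasp":    (i_grasp, i_grasp),      # discrete event from voice input
        "move":     (i_grasp, i_position),   # transport at higher speed
        "position": (i_position, i_release), # monotonically decelerating part
        "release":  (i_release, i_release),  # discrete event from voice input
    }
```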

4 DMP for Skills Representation

A Dynamic Movement Primitive (DMP) is a way to learn motor actions, which are formalized as stable nonlinear attractor systems [12]. There are many variations of DMPs. As summarized by Fabisch [13], they have in common that

  • they have an internal time variable (phase), which is defined by a so-called canonical system,

  • they can be adapted by tuning the weights of a forcing term and

  • a transformation system generates goal-directed accelerations.

The canonical system uses a phase variable z, which replaces explicit timing in DMPs. Its values are generated by the differential equation:

$$\begin{aligned} \tau \dot{z} = -\alpha z \end{aligned}$$
(1)

where z starts at 1 and approaches 0, \(\tau \) is the duration of the movement primitive, and \(\alpha \) is a constant that has to be set such that z approaches 0 sufficiently fast. The transformation system is a spring-damper system that generates a goal-directed motion; it is controlled by the phase variable z and modified by a forcing term f:

$$\begin{aligned} \begin{aligned} \tau \dot{v} = K(g-y) -Dv -K(g-y_{0})z + Kf(z) \\ \tau \dot{y} = v \\ \end{aligned} \end{aligned}$$
(2)

The variables y, \(\dot{y}\), \(\ddot{y}\) are interpreted as the desired position, velocity and acceleration for a control system, \(y_{0}\) is the start and g is the goal of the movement. The forcing term f is typically chosen as:

$$\begin{aligned} f(z) = \frac{\sum _{i=1}^{N}\psi _{i}(z)w_{i}}{\sum _{i=1}^{N}\psi _{i}(z)} \end{aligned}$$
(3)

with weights \(w_{i}\) that control the shape of the trajectory. The influence of the forcing term decays as the phase variable approaches 0. \(\psi _{i}(z) = \exp (-\frac{d_{i}}{2}(z-c_{i})^{2})\) are radial basis functions with widths \(d_{i}\) and centers \(c_{i}\). The DMP formulation presented in [14] is used in this work, such that a desired final velocity can be incorporated. An illustrative numerical integration of Eqs. (1)-(3) is sketched below.
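For illustration, Eqs. (1)-(3) can be integrated with an explicit Euler scheme as in the following sketch for one dimension. The gains K and D, the constant \(\alpha \) and the basis-function parameters are placeholder choices and not the values used in the experiments, which rely on the formulation of [14] via the library referenced in Sect. 6.2.

```python
# Illustrative Euler integration of the DMP equations (1)-(3) for one dimension.
import numpy as np

def dmp_rollout(y0, g, w, tau=1.0, dt=0.001, K=100.0, D=20.0, alpha=4.6):
    n = len(w)
    c = np.exp(-alpha * np.linspace(0.0, 1.0, n))  # centers spread over the phase
    d = np.full(n, n**1.5 / 2.0)                   # widths (heuristic choice)

    z, y, v = 1.0, y0, 0.0
    ys = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-0.5 * d * (z - c) ** 2)
        f = psi @ w / (psi.sum() + 1e-10)                               # Eq. (3)
        v_dot = (K * (g - y) - D * v - K * (g - y0) * z + K * f) / tau  # Eq. (2)
        y_dot = v / tau                                                 # Eq. (2)
        z_dot = -alpha * z / tau                                        # Eq. (1)
        z, v, y = z + dt * z_dot, v + dt * v_dot, y + dt * y_dot
        ys.append(y)
    return np.array(ys)

# With zero weights (f = 0) the system converges from y0 toward the goal g.
print(dmp_rollout(y0=0.0, g=1.0, w=np.zeros(10))[-1])  # approaches g = 1.0
```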

Table 1 Representation of skills

5 Experimental Setup

The setup is illustrated in Fig. 3(a). An Intel® RealSense™ L515 3D camera is mounted on the robot to record the hand movements. The working space of the robot is captured by the camera, as shown in Fig. 3(b). As a LiDAR camera, the L515 projects an infrared laser at 860 nm wavelength as an active light source; 3D data is obtained by evaluating the time required for the projected signal to bounce off the objects in the scene and return to the camera [15]. The size of the color images recorded by the L515 is \(1280\,\times \,720\) and the size of the depth images is \(640\,\times \,480\). The Intel RealSense Viewer is used to record the video sequences, which can later be replayed as sketched below. The natural language text for segmentation is manually inserted into the recorded sequences.
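A recording made with the Intel RealSense Viewer can be replayed offline with pyrealsense2, for example as in the following sketch; the file name and the termination handling are assumptions.

```python
# Hedged sketch of replaying a sequence recorded with the Intel RealSense Viewer,
# which stores recordings as .bag files.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_device_from_file("demonstration.bag", repeat_playback=False)
pipeline.start(config)

align = rs.align(rs.stream.color)  # same depth-to-color alignment as in Sect. 3.1
while True:
    try:
        frames = align.process(pipeline.wait_for_frames())
    except RuntimeError:           # no more frames in the recording
        break
    # ... feed the aligned color/depth frames into the hand-tracking pipeline ...
pipeline.stop()
```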

Fig. 3

(a) Demonstration setup; (b) Color image; (c) Depth image

Fig. 4

Hand motion segmentation based on voice command

6 Experimental Results

To validate the proposed approach, a task demonstration is recorded with the setup in Fig. 3, in which a human demonstrates a pick-and-place task. The methods proposed in Sects. 3 and 4 are applied to the recorded sequence. The results are discussed in the following.

6.1 Segmentation

The index finger tip trajectory along X is outlined in Fig. 4. It shows that some data are missing due to depth errors in the data collection process. The segmentation based on the voice commands grasp and release, which is obtained by mapping the temporal information of the voice input onto the trajectory, is also shown in Fig. 4. In the next step, the sequence between grasp and release is segmented into move and position based on the hand moving speed, as illustrated in Fig. 5.

Fig. 5

Segmentation into move and position

Fig. 6

Learned DMP for representing reach

6.2 Learning DMPs

Before learning the DMPs, a linear interpolation is applied to both the time and trajectory data. Additionally, the time series are shifted such that every trajectory starts at time zero. Three DMPs are learned to represent the trajectories of X, Y and Z. The implementation by Fabisch [13] is used to learn the DMPs (see Footnote 2); a sketch of this step is given below. For the sake of simplicity, only X and Y are shown in Figs. 6 and 7. The learned model can be adapted to new goal positions such as 0, 1 or 2.
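The following sketch shows how such DMPs could be learned with the movement_primitives library by Fabisch [13]. The stand-in trajectory, the parameter values and the use of DMPWithFinalVelocity for the velocity-adaptable move skill (following [14]) are assumptions; for brevity, a single three-dimensional DMP is used here instead of three one-dimensional ones.

```python
# Sketch of learning DMPs from a segmented trajectory (Sect. 6.2).
import numpy as np
from movement_primitives.dmp import DMP, DMPWithFinalVelocity

# Stand-in for one segmented index-finger-tip trajectory: t (n,), P (n, 3) for X, Y, Z.
t = np.linspace(0.0, 2.0, 101)
P = np.column_stack([np.sin(t), np.cos(t), 0.1 * t])

# Preprocessing as described above: shift to time zero and resample uniformly.
t_uniform = np.linspace(0.0, t[-1] - t[0], len(t))
P_uniform = np.column_stack(
    [np.interp(t_uniform, t - t[0], P[:, k]) for k in range(3)]
)

# reach / position: adapt to a new goal position.
dmp = DMP(n_dims=3, execution_time=t_uniform[-1], dt=0.01, n_weights_per_dim=10)
dmp.imitate(t_uniform, P_uniform)
dmp.configure(goal_y=np.array([0.0, 1.0, 2.0]))        # new goal position
_, P_new_goal = dmp.open_loop()

# move: adapt to a new final velocity (formulation of [14]).
dmp_move = DMPWithFinalVelocity(n_dims=3, execution_time=t_uniform[-1], dt=0.01)
dmp_move.imitate(t_uniform, P_uniform)
dmp_move.configure(goal_yd=np.array([0.0, 1.0, 2.0]))  # desired final velocity
_, P_new_velocity = dmp_move.open_loop()
```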

Fig. 7

Learned DMP for representing position

Fig. 8

Learned DMP for representing move

To represent the move motion, it is essential that the DMPs can be adapted to different final velocities. The results for the trajectories and velocities in Fig. 8 show that the learned DMPs can be adapted to different final velocities such as 0, 1 and 2, where \(x_{d}\) and \(y_{d}\) denote the goal velocity. Furthermore, the results show that the recorded hand trajectories are not smooth and cannot be applied to the robot directly; the smoothness is improved by the learned DMPs.

7 Conclusion

To reduce the complexity of robot programming, an integrated approach is introduced in this work for robot learning of skills from voice commands and a single video demonstration. The index finger trajectories extracted from the video are first segmented into the five basic motions reach, grasp, move, position and release. This is realized by the voice inputs grasp and release during video recording, followed by segmenting move and position based on the hand moving velocity. DMPs are then learned to represent reach, move and position; they are adaptable to new goal positions and velocities. The experimental results show the feasibility of the proposed approach. As future work, the missing-data problem caused by depth errors should be addressed. Furthermore, the learned DMPs should be evaluated on a real robot.