Research on Transfer Learning of Vision-based Gesture Recognition

Gesture recognition has been widely used for human-robot interaction. A problem at present is that researchers seldom reuse knowledge learned in existing domains to discover and recognize gestures in new domains. For each new domain, a large amount of data must be collected and annotated, and the training of the algorithm does not benefit from prior knowledge, leading to redundant computation and excessive time investment. To address this problem, the paper proposes a method that could transfer gesture data between different domains. We use a red-green-blue (RGB) Camera to collect images of the gestures, and use Leap Motion to collect the coordinates of 21 joint points of the human hand. Then, we extract a set of novel feature descriptors from the two differently distributed data sources for the study of transfer learning. This paper compares the effects of three classification algorithms, i.e., support vector machine (SVM), broad learning system (BLS) and deep learning (DL). We also compare learning performance with and without the joint distribution adaptation (JDA) algorithm. The experimental results show that the proposed method could effectively solve the transfer problem between the RGB Camera and Leap Motion. In addition, we found that when using DL to classify the data, excessive training on the source domain may reduce recognition accuracy in the target domain.


Introduction
Recently, human-robot interaction has developed rapidly. Gesture is an important modality for human-robot interaction since it gives accurate and intuitive instructions to robots, and it has been widely studied for decades [1]. Gesture recognition could enable effective and efficient interactions between human workers and robots. There are many kinds of devices for vision-based gesture recognition. For example, the camera is the main sensor used in the field of gesture recognition. Previously, most researchers used red-green-blue (RGB) images for gesture recognition [2]. With the development of technology, new devices have sprung up, such as Leap Motion, Kinect, etc. Leap Motion is an interactive hardware device based on infrared radiation (IR) sensors, and it could precisely capture and extract the positions and angles of finger joints. Specifically, Leap Motion is designed to detect and track human hand gestures, and its tracking error for the 3D coordinates of fingertips is about 200 μm [3].
However, data from different domains may be distributed differently. Therefore, classifiers trained on one domain are likely to perform poorly in other domains. Moreover, for each domain, it is too expensive to collect a mass of examples manually and build a separate classifier. Therefore, how to make better use of a model trained in the source domain and reduce the learning cost in the target domain has become an urgent problem to be solved.
In recent years, transfer learning has attracted wide interest from researchers. Transfer learning refers to the application of existing knowledge to other related domains. Researchers have studied transfer learning with different methods, e.g., broad learning system (BLS) [4,5], neural network (NN) [6], Bayesian model [7] and other methods. Although transfer learning has received a lot of attention [8], there are very few applications in gesture recognition. The goal of this paper is to propose a method in the field of gesture recognition that enables a model trained in the source domain to be used in the target domain directly. Therefore, the time for collecting data is reduced, and the time for annotating data could be minimized or eliminated [9]. At present, transfer learning has been effectively used in text classification [10,11], sentiment classification [12−14], image classification [15−21] and other fields. It could be divided into feature representation transfer learning, instance transfer learning, parameter transfer learning and relational knowledge transfer learning [8]. Feature representation transfer learning transfers through feature transformation, either to decrease the difference between the source domain and the target domain [22−24], or to convert the data of the source and target domains into a unified feature space [25−27], on which a classification algorithm is then applied for identification. Feature representation transfer learning is one of the most popular research methods in the field of transfer learning. This paper uses this method to convert the original data of the RGB Camera and Leap Motion into a unified feature space, and then applies classification algorithms for recognition.
In the process of gesture recognition, it is generally necessary to assume: 1) the same feature space, i.e., the training and test data use the same set of sensors; 2) the same overall distribution, i.e., the experimenters′ preferences or habits in the training and test data are similar; 3) the same label space, i.e., the same label set in the training and test data [25]. With conventional unsupervised data mining methods for gesture recognition, the long data collection cycle is a practical problem. If a supervised method is used, it puts a great burden on users, who must annotate enough data to train the algorithm. Labeling raw sensor data manually is time-consuming and requires experts to spend a lot of time annotating the sensor data. In addition, learning the model of each device independently and neglecting knowledge learned in other domains results in redundant computation, excessive time cost, and loss of useful knowledge. Consequently, it is very profitable to develop models in a new field by reusing learned information. Using transferable knowledge could decrease data collection, reduce the need for data annotation, and increase the learning speed [9]. There is very little work on transferring knowledge between two or more sensor models. Kurz et al. [28] and Roggen et al. [29] used teacher/learner models to handle the transfer problem in action recognition. Hu and Yang [24] introduced a transfer learning method to effectively transfer knowledge between models, but larger-scale knowledge transfer between different domains remains to be explored. Marin et al. [30] proposed jointly exploiting the Camera and Leap Motion for accurate gesture recognition. However, this still requires collecting a large amount of data from various devices and does not reuse the model learned from a particular device. The focus of this paper is to effectively solve the transfer problem between the RGB Camera and Leap Motion, thereby improving the learning efficiency of cross-device transfer.
This paper presents a method to apply the model learned on one device to another. The RGB Camera and Leap Motion were used to collect gesture data from several human users to verify the presented method. The main contributions are as follows: 1) A transfer learning framework for gesture recognition across different devices is proposed. Here, these devices have different data distributions, but all of them share the same output labels.
2) For the transfer of gesture recognition between the RGB Camera and Leap Motion, we extract several different features, and the experimental results show that the average accuracy of the new coordinate features is the highest.
3) When using the back propagation neural network (BP NN) algorithm for classification, we found that in some cases the number of training epochs has some unusual effects on the transfer results. Too many training epochs may lead to overfitting in the source domain and reduce the generalization ability in the target domain.
Fig. 1 shows a general overview of our approach. The structure of this paper is organized as follows. In Section 2, the preliminaries of transfer learning are reviewed. In Section 3, the data collection and feature extraction are described. Then, we introduce the experiments in Section 4. We further discuss the problems found in the experiments in Section 5, and Section 6 concludes our work.

Preliminaries
Joint distribution adaptation (JDA)

The difference between the source domain and the target domain is approximated by the distance between the marginal distributions $P(x_s)$ and $P(x_t)$ plus the distance between the conditional distributions $P(y_s|x_s)$ and $P(y_t|x_t)$, as shown in (1) [31]:

$$Distance(D_s, D_t) \approx \|P(x_s) - P(x_t)\| + \|P(y_s|x_s) - P(y_t|x_t)\| \tag{1}$$

The JDA algorithm realizes transfer by reducing the distances between the marginal distributions and between the conditional distributions of the two domains. In this paper, we use the JDA algorithm to reduce the distance between the RGB Camera domain and the Leap Motion domain. For clarity, the related notations and descriptions are shown in Table 1.

Feature transformation
Dimensionality reduction could be used to transfer the data. For clarity, principal component analysis (PCA) is used to reconstruct the data. The goal of PCA is to find a transformation matrix $A$ that maximizes the embedded data variance, as shown in (2):

$$\max_{A^{\mathrm{T}}A = I} \operatorname{tr}\left(A^{\mathrm{T}} X H X^{\mathrm{T}} A\right) \tag{2}$$

where $X$ is the input data matrix and $H$ is the centering matrix. Eigendecomposition can deal with this optimization problem effectively.

Marginal distribution adaptation
Although PCA could extract k-dimensional features from the data, the difference between the distributions of the two domains may still be large. The difference of the marginal distributions needs to be reduced first; in other words, the distance between $P(A^{\mathrm{T}}x_s)$ and $P(A^{\mathrm{T}}x_t)$ should be as small as possible. The maximum mean discrepancy (MMD) [32] is used to compute this distance between the source domain and the target domain:

$$\left\| \frac{1}{n_s}\sum_{i=1}^{n_s} A^{\mathrm{T}} x_i - \frac{1}{n_t}\sum_{j=n_s+1}^{n_s+n_t} A^{\mathrm{T}} x_j \right\|^2 = \operatorname{tr}\left(A^{\mathrm{T}} X M_0 X^{\mathrm{T}} A\right) \tag{3}$$

where $M_0$ is the MMD matrix and is computed as follows:

$$(M_0)_{ij} = \begin{cases} \dfrac{1}{n_s n_s}, & x_i, x_j \in D_s \\ \dfrac{1}{n_t n_t}, & x_i, x_j \in D_t \\ -\dfrac{1}{n_s n_t}, & \text{otherwise} \end{cases} \tag{4}$$
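For illustration, a minimal NumPy sketch that builds the MMD matrix $M_0$ of (4); the function name and arguments are ours, not from the paper:

```python
import numpy as np

def mmd_matrix_marginal(ns: int, nt: int) -> np.ndarray:
    """Build the marginal-distribution MMD matrix M0 of (4).

    The first ns rows/columns index source samples, the
    remaining nt index target samples.
    """
    e = np.vstack([np.full((ns, 1), 1.0 / ns),
                   np.full((nt, 1), -1.0 / nt)])
    return e @ e.T  # yields (M0)_ij exactly as in (4)
```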

Conditional distribution adaptation
Then, the difference of the conditional distributions needs to be reduced, i.e., the distance between $P(y_s|x_s)$ and $P(y_t|x_t)$ should be reduced. Since the target domain has no labels, pseudo-labels predicted by a classifier trained on the source domain are used, and a modified MMD measures the distance between $P(x_s|y_s=c)$ and $P(x_t|y_t=c)$ for each class $c \in \{1, \cdots, C\}$:

$$\left\| \frac{1}{n_s^{(c)}}\sum_{x_i \in D_s^{(c)}} A^{\mathrm{T}} x_i - \frac{1}{n_t^{(c)}}\sum_{x_j \in D_t^{(c)}} A^{\mathrm{T}} x_j \right\|^2 = \operatorname{tr}\left(A^{\mathrm{T}} X M_c X^{\mathrm{T}} A\right) \tag{5}$$

where $M_c$ is computed as follows:

$$(M_c)_{ij} = \begin{cases} \dfrac{1}{n_s^{(c)} n_s^{(c)}}, & x_i, x_j \in D_s^{(c)} \\ \dfrac{1}{n_t^{(c)} n_t^{(c)}}, & x_i, x_j \in D_t^{(c)} \\ -\dfrac{1}{n_s^{(c)} n_t^{(c)}}, & x_i \in D_s^{(c)}, x_j \in D_t^{(c)} \text{ or } x_j \in D_s^{(c)}, x_i \in D_t^{(c)} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
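Similarly, a sketch of the class-conditional MMD matrix $M_c$ of (6); the target pseudo-labels are assumed to come from a classifier trained on the source domain, and all names are illustrative:

```python
import numpy as np

def mmd_matrix_conditional(ys: np.ndarray, yt_pseudo: np.ndarray, c: int) -> np.ndarray:
    """Build M_c of (6) for class c from source labels ys and
    target pseudo-labels yt_pseudo."""
    ns, nt = len(ys), len(yt_pseudo)
    e = np.zeros((ns + nt, 1))
    src = np.where(ys == c)[0]               # source samples of class c
    tgt = ns + np.where(yt_pseudo == c)[0]   # target samples of class c
    if len(src) > 0:
        e[src] = 1.0 / len(src)
    if len(tgt) > 0:
        e[tgt] = -1.0 / len(tgt)
    return e @ e.T  # source-source, target-target and cross terms of (6)
```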

Optimization problem
In JDA, the distances of the marginal distributions and the conditional distributions are minimized at the same time, which makes the transfer learning more robust. Thus, by combining the above two distances, a total optimization goal could be obtained:

$$\min \sum_{c=0}^{C} \operatorname{tr}\left(A^{\mathrm{T}} X M_c X^{\mathrm{T}} A\right) + \lambda \|A\|_F^2 \tag{7}$$

Since the variance of the data is maintained before and after the transformation, another constraint is obtained:

$$A^{\mathrm{T}} X H X^{\mathrm{T}} A = I \tag{8}$$

Therefore, by combining the above constraints, the optimization goal is transformed into

$$\min_{A^{\mathrm{T}} X H X^{\mathrm{T}} A = I} \sum_{c=0}^{C} \operatorname{tr}\left(A^{\mathrm{T}} X M_c X^{\mathrm{T}} A\right) + \lambda \|A\|_F^2 \tag{9}$$

Using the Rayleigh quotient and the Lagrange method, (9) turns into the generalized eigendecomposition problem

$$\left( X \sum_{c=0}^{C} M_c X^{\mathrm{T}} + \lambda I \right) A = X H X^{\mathrm{T}} A \Phi \tag{10}$$

where $\Phi$ is the diagonal matrix of Lagrange multipliers. Thus, we could use the eigs function in Matlab to solve the transformation matrix $A$ easily.
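Combining the pieces, one JDA iteration could be sketched as follows, using SciPy's generalized eigensolver in place of Matlab's eigs. This relies on the two helper functions sketched above; `k` (subspace dimension) and `lam` (the regularizer λ) are assumed hyperparameters, not values from the paper:

```python
import numpy as np
import scipy.linalg

def jda_step(Xs, Xt, ys, yt_pseudo, k=30, lam=1.0):
    """One JDA iteration: returns k-dimensional embeddings Zs, Zt.

    Xs: (ns, d) source features; Xt: (nt, d) target features.
    """
    X = np.vstack([Xs, Xt]).T                  # d x n, columns are samples
    X = X / np.linalg.norm(X, axis=0)          # column normalization (common preprocessing)
    d, n = X.shape
    ns, nt = len(Xs), len(Xt)

    M = mmd_matrix_marginal(ns, nt)            # M0 of (4)
    for c in np.unique(ys):                    # add M_c of (6) per class
        M += mmd_matrix_conditional(ys, yt_pseudo, c)
    M /= np.linalg.norm(M, "fro")              # scale normalization (optional)

    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    left = X @ M @ X.T + lam * np.eye(d)       # left-hand side of (10)
    right = X @ H @ X.T                        # right-hand side of (10)
    vals, vecs = scipy.linalg.eig(left, right) # generalized eigenproblem of (10)
    A = np.real(vecs[:, np.argsort(vals.real)[:k]])  # k smallest eigenvalues
    Z = A.T @ X                                # embedded data
    return Z[:, :ns].T, Z[:, ns:].T            # Zs, Zt
```

In the full algorithm, the pseudo-labels `yt_pseudo` are refreshed after each step by retraining a classifier on `Zs` and predicting on `Zt`, and the loop repeats for a few iterations.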

Feature extraction and selection
We use the RGB Camera and Leap Motion to collect 10 static gestures from multiple experimenters (Figs. 2 and 3), about 800 sets of data in total. In order to find the features most suitable for transfer between the two devices, we extract various features from the obtained raw data for experimental comparison. We introduce each feature in the following sections.

Feature 1: The coordinates
Thanks to existing hand key point detection technology, it is easy to extract the coordinates of the 21 joint points of the hand from the gesture images taken by the RGB Camera, as shown in Fig. 4. We use $(x_{c_i}, y_{c_i})$, $c_i \in [0, 1, \cdots, 20]$, to represent the hand joint point coordinates obtained by the RGB Camera. The upper left corner of the image is the origin of the coordinate system, and the positive directions of the x-axis and y-axis are shown in Fig. 4. Leap Motion could directly collect the three-dimensional coordinates of the 21 joint points of the hand. Fig. 5 shows the coordinate system with the center of the Leap Motion device as the origin. In this paper, $(x_{l_i}, y_{l_i})$, $l_i \in [0, 1, \cdots, 20]$, represents the hand joint point coordinates obtained by Leap Motion, and the depth information is not used in this work.
The joint point coordinates extracted from the RGB Camera and Leap Motion for the same hand pose are shown in Fig. 6. It could be seen that the coordinates obtained by the two devices correspond to the same joint points. However, the coordinate systems of the two devices are different, so their distributions are different.

Feature 2: The length
Using the coordinates of the joint points obtained by the two devices, the length information could be easily calculated. The positions of the fingertips are the most variable points, so we use (12) to calculate the following distances: 1) the distance between the root of each finger and its fingertip; 2) the distance between the root of the palm (point 0 in Fig. 6) and each fingertip; 3) the distances between the fingertips.
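A minimal Python sketch of the length feature, assuming (12) is the Euclidean distance between two joint points; the fingertip indices follow Fig. 6, while the finger-root indices are our assumption:

```python
import numpy as np
from itertools import combinations

FINGERTIPS = [4, 8, 12, 16, 20]   # fingertip indices (Fig. 6)
ROOTS = [1, 5, 9, 13, 17]         # assumed finger-root indices
PALM_ROOT = 0                     # root of the palm (Fig. 6)

def length_features(joints: np.ndarray) -> np.ndarray:
    """joints: (21, 2) array of joint coordinates for one gesture."""
    d = lambda i, j: np.linalg.norm(joints[i] - joints[j])  # presumed (12)
    f = [d(r, t) for r, t in zip(ROOTS, FINGERTIPS)]        # root-to-tip
    f += [d(PALM_ROOT, t) for t in FINGERTIPS]              # palm-to-tip
    f += [d(i, j) for i, j in combinations(FINGERTIPS, 2)]  # tip-to-tip
    return np.asarray(f)                                    # 20 lengths
```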

Feature 3: The angle
Using the obtained joint point coordinates, we could easily calculate the angle information. We use (13)−(15) to calculate the following features: 1) taking the points 2, 5, 9, 13 and 17 as the vertices, the angle formed by point 0 and any one of the points 4, 8, 12, 16 and 20 (an example is shown in Fig. 6); 2) the angle formed by any three points among the fingertips.
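Equations (13)−(15) are not reproduced here; presumably the angle at a vertex is obtained from the arccosine of the normalized dot product of the two edge vectors. A sketch under that assumption (all names are ours):

```python
import numpy as np
from itertools import combinations

VERTICES = [2, 5, 9, 13, 17]      # finger-base vertices (Fig. 6)
FINGERTIPS = [4, 8, 12, 16, 20]

def angle_at(joints: np.ndarray, vertex: int, a: int, b: int) -> float:
    """Angle (radians) at `vertex` between the rays to points a and b."""
    u = joints[a] - joints[vertex]
    v = joints[b] - joints[vertex]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def angle_features(joints: np.ndarray) -> np.ndarray:
    f = [angle_at(joints, v, 0, t)               # vertex on a finger base,
         for v in VERTICES for t in FINGERTIPS]  # rays to point 0 and a tip
    f += [angle_at(joints, v, a, b)              # triples of fingertips,
          for a, v, b in combinations(FINGERTIPS, 3)]  # middle one as vertex
    return np.asarray(f)
```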

Feature 4: The new coordinates
In order to weaken the influence of different coordinate systems on the joint point coordinates, the coordinate origin could be unified as the root of the middle finger (point 9 in Fig. 6). Take the direction from point 0 to point 9 as the positive direction of the y-axis; the direction perpendicular to the y-axis and pointing to the right is the positive direction of the x-axis. The positive y-axis direction vector is thus the vector from point 0 to point 9, the positive x-axis direction vector is perpendicular to it, and every joint point $i$ is re-expressed in the new coordinate system as $(x_i^{new}, y_i^{new})$, yielding the new coordinate feature.
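A sketch of this transform under our reading of the (partially garbled) original: translate so that point 9 is the origin, then project onto the rotated axes so that the 0→9 direction becomes +y:

```python
import numpy as np

def new_coordinate_features(joints: np.ndarray) -> np.ndarray:
    """joints: (21, 2). Returns the 21 points expressed in a frame
    with origin at point 9 and +y along the 0 -> 9 direction."""
    y_axis = joints[9] - joints[0]
    y_axis = y_axis / np.linalg.norm(y_axis)
    x_axis = np.array([y_axis[1], -y_axis[0]])  # perpendicular, to the right
    shifted = joints - joints[9]                # unify origin at point 9
    return np.stack([shifted @ x_axis, shifted @ y_axis], axis=1)
```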
Experiment

Experimental setup
This section mainly introduces the relevant parameter settings of the algorithms used in the experiments. The parameter settings of the support vector machine (SVM) and BLS are shown in Table 2.
We use a BP neural network in this paper. The number of nodes in the input layer is determined by the dimension N of the feature described in Section 3, the numbers of nodes in the first and second hidden layers are set between 0.5N and 2N, and the number of nodes in the output layer is set to 10 (one per gesture class). Fig. 7 shows the transfer process between different devices.
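As an illustration only (the paper's exact hyperparameters are in Table 2 and Fig. 7, not reproduced here), a BP network with layer sizes in the stated range could be instantiated as follows; the feature dimension N is an assumed example value:

```python
from sklearn.neural_network import MLPClassifier

N = 42  # assumed feature dimension, e.g., 21 joints x 2 coordinates

# Two hidden layers sized between 0.5N and 2N; the 10-way output
# layer is implied by the label set. Other settings are our defaults.
clf = MLPClassifier(hidden_layer_sizes=(N, N), max_iter=500)
# clf.fit(source_features, source_labels)
# predictions = clf.predict(target_features)
```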

Experimental results
In Section 3, the different gesture features of the RGB Camera and Leap Motion were obtained. This section uses the algorithm introduced in Section 2 to transfer the data between these two domains.

Experiment 1: Transfer of raw data collected by two devices
The results of Experiment 1 are shown in Fig. 8. It could be found from the experimental results that if the images taken by the RGB Camera and the coordinates of the joint points obtained by Leap Motion are transferred to each other directly, the results are poor.

Experiment 2: Mutual transfer of coordinate features
The results of Experiment 2 are shown in Fig. 9. From the results of Experiment 2, the following conclusions could be drawn: 1) Compared with Experiment 1, after extracting the coordinate features, the accuracy of the transfer between the two devices is greatly improved.
2) The JDA algorithm could reduce the distance between two domains, and improve the accuracy of the experimental results in most cases.

Experiment 5: Transfer between two devices after coordinate conversion
The results of Experiment 5 are shown in Fig. 12. From the comparison of the experimental results of Experiments 3−5, it could be found that most of the results are improved little, or not at all, by the JDA algorithm. The analysis shows that the length feature, the angle feature and the new coordinate feature are only weakly affected by the original coordinate system, so different original coordinate systems do not have much influence on them. For this paper, the main function of the JDA algorithm is to reduce the impact of different coordinate systems on the data, thus reducing the difference between domains. According to [31], the JDA algorithm needs complex calculations to obtain the transformation matrix, which is time-consuming. In this paper, we could directly use the extracted length, angle and new coordinate features for transfer learning, which not only guarantees the accuracy, but also greatly reduces the training time.
Comparing the five experiments, the average accuracy of Experiment 5 is the highest; in other words, the best results are obtained with the new coordinate feature. In addition, across the five experiments, the accuracy of transfer from Leap Motion to the RGB Camera is generally higher than that from the RGB Camera to Leap Motion. Our analysis is that the coordinates originally extracted by Leap Motion lie in three-dimensional space, while those extracted from the RGB Camera images are two-dimensional. The Leap Motion features therefore carry more information, so the transfer accuracy is higher when Leap Motion data is used as the source domain.

Discussions
Some interesting phenomena are found when using the neural network algorithm to transfer and classify the data. From the experimental results shown in Figs. 13(d)−13(f), we could draw the following conclusions: 1) In a small interval to the left of the intersection (the rightmost intersection), the accuracy on the target domain is higher than on the source domain, and the highest target-domain accuracy is reached in this interval. This means that in this interval, the model trained on the source domain is more suitable for the target domain. We speculate that this is because the Leap Motion data is originally in three-dimensional space, while the Camera data is in two-dimensional space. In other words, Leap Motion has a richer feature space than the Camera, so the Camera data could perform better on a model trained from it. Therefore, in this interval, the accuracy on the target domain is higher than on the source domain.
2) In the region to the right of the intersection (the rightmost intersection), the accuracy on the target domain decreases as the source-domain accuracy improves. This may mean that the model becomes increasingly specialized to the source domain as the number of training epochs grows, which reduces its generalization ability in the target domain. Therefore, it could be concluded that in some cases, the number of training epochs in the source domain affects the accuracy in the target domain.
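This observation suggests monitoring target-domain accuracy per epoch and transferring the model snapshot from the epoch where it peaks, rather than after source training converges. A minimal sketch of this selection loop (all names are illustrative, and target labels are assumed available for evaluation, as in Fig. 13; the paper does not prescribe an implementation):

```python
import copy

def train_with_transfer_point(model, train_epoch, eval_target, max_epochs=200):
    """Train on the source domain, keeping the snapshot with the
    best target-domain accuracy (the 'best time to transfer')."""
    best_acc, best_model, best_epoch = 0.0, None, -1
    for epoch in range(max_epochs):
        train_epoch(model)           # one pass over the source data
        acc = eval_target(model)     # accuracy on the target domain
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
            best_model = copy.deepcopy(model)
    return best_model, best_epoch, best_acc
```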
For transfer learning, we have not yet found a discussion of these two points in the literature. In contrast to the discussion in [33] about "1) Which layers in the source domain could be transferred to the target domain? 2) How many layers of knowledge in the source domain are transferred to the target domain?", we ask "When is the best time to transfer during the training of the source domain?" A detailed discussion is given based on the experiments.

Conclusions
In this paper, an effective transfer learning method for gesture recognition between the RGB Camera and Leap Motion has been put forward. The different distributions of the data collected by Leap Motion and the RGB Camera raise challenging problems, for which effective solutions have been presented. We extracted various features from the obtained raw data, such as the coordinate, length and angle features, and compared the learning performance with and without the JDA algorithm. The experimental results show the performance of the different features under different algorithms. Through the comparison of several groups of experimental results, we found that the average accuracy of the new coordinate features is the highest. In future work, we will focus on the following points: 1) At present, only two-dimensional features are used in the transfer learning of gesture recognition, which imposes certain limitations on the direction of the palm. If the palm is not parallel to the device, the classification results are affected. We will use Kinect to extract more reliable features from 3D space.
2) We only discussed the experimental results of the coordinate, length and angle features; more features could be computed for transfer.
3) In the future, the method could also be extended to transfer learning of action recognition among different devices.

Fig. 1 Pipeline of the proposed approach

Figs. 13(a)−13(c) show the experimental results with the Camera as the source domain and the Leap Motion as the target domain.

Fig. 9 Comparison of results in Experiment 2

Fig. 13 Comparison of experimental results using neural network classification

Table 1 Notations and descriptions

$(x_{c_i}, y_{c_i})$, $c_i \in [0, 1, \cdots, 20]$ — The hand joint point coordinates obtained by the RGB Camera
$(x_{l_i}, y_{l_i})$, $l_i \in [0, 1, \cdots, 20]$ — The hand joint point coordinates obtained by Leap Motion
$length$ — The length feature
$\alpha$ — The angle feature
$C$ — SVM penalty coefficient
$A$ — Adaptation matrix
$Z$ — Embedding matrix
$H$ — Centering matrix

Table 2 SVM and BLS related parameter settings