Data integration by two-sensors in a LEAP-based Virtual Glove for human-system interaction

Virtual Glove (VG) is a low-cost computer vision system that utilizes two orthogonal LEAP motion sensors to provide detailed 4D hand tracking in real–time. VG can find many applications in the field of human-system interaction, such as remote control of machines or tele-rehabilitation. An innovative and efficient data-integration strategy, based on the velocity calculation, for selecting data from one of the LEAPs at each time, is proposed for VG. The position of each joint of the hand model, when obscured to a LEAP, is guessed and tends to flicker. Since VG uses two LEAP sensors, two spatial representations are available each moment for each joint: the method consists of the selection of the one with the lower velocity at each time instant. Choosing the smoother trajectory leads to VG stabilization and precision optimization, reduces occlusions (parts of the hand or handling objects obscuring other hand parts) and/or, when both sensors are seeing the same joint, reduces the number of outliers produced by hardware instabilities. The strategy is experimentally evaluated, in terms of reduction of outliers with respect to a previously used data selection strategy on VG, and results are reported and discussed. In the future, an objective test set has to be imagined, designed, and realized, also with the help of an external precise positioning equipment, to allow also quantitative and objective evaluation of the gain in precision and, maybe, of the intrinsic limitations of the proposed strategy. Moreover, advanced Artificial Intelligence-based (AI-based) real-time data integration strategies, specific for VG, will be designed and tested on the resulting dataset.


Introduction
In recent years, computer vision is becoming increasingly important in addressing a wide range of application areas, including human action recognition [4,16], aerial image processing [5,31], person re-identification [33,37], and human-system interaction [14,32]. Concerning the latter, its goal is to improve the communication between users and computers, virtual reality environments, electromechanical devices, and robots. With the use of highly sophisticated sensors, a lot of critical applications on remotely operating systems, e.g., driving robots, rovers, or performing medical procedures [10,19,25,38,39], are becoming possible. Tele-operated systems are expensive, neither replicable nor quickly replaceable. The results of long-planned, critical, costly, and challenging operations depend on their proper use, that requires precise recording and reproduction of the operator's hand and finger movements.
Both non-vision and vision-based gesture recognition are usually employed to finely track the hand and all its joints. The non-vision approaches utilize wearable devices, such as wired gloves, for the detection of finger movements [7,20,34], while vision-based approaches rely on the interpretation of data from video-collecting devices, usually sensors operating also in the infrared (IR) range, placed at a certain distance from the subject [2,9,10,27,28,39,41]. The key advantage of vision-based systems is that no physical contact is required and the movements are free and natural: the hand is not forced to wear anything and can naturally be used to grip specialized tools to carry out a procedure (for example, surgical devices). However, to control remotely operating systems, the movements must be identified with good spatial precision (a few millimetres) and in real-time (at least 30 frames per second, fps, are necessary). In the last few years, the use of immersive Virtual Reality (VR) interfaces driven by natural hand movements for remote control has been growing thanks to the development of innovative optical 3D vision-based systems for gesture recognition [12], and the range of applications that benefit from them has increased, as is occurring in rehabilitation [3,30]. One of the most recent optical 3D sensors, based on stereo vision, is the LEAP motion controller (LEAP). LEAP is a high-resolution 3D hand-sensing device which allows freehand natural interaction, crucial for the implementation of real-time, realistic VR systems [6,30]. It uses 3 IR light sources and two detectors to obtain 3D visual information, saved and reproduced almost simultaneously (more than 60fps) from the server. It has been successfully integrated with VR environments in rehabilitation and neuroergonomics [8,26,30], and also used as a tool for touchless interfaces, such as 3D writing recognition systems [18].
One of its advantages is that it is appropriate for different hand sizes (adults and children), as well as for different hand shapes (healthy people and patients with residual infirmities). However, if objects have to be handled, e.g., a joystick or the controller of a remotely operated vehicle, they can produce occlusions. Even parts of the hand itself often obstruct the view of the sensor (self-occlusions). Thus, LEAP, like most vision-based systems, can fail to correctly reproduce the hand trajectory because the spatial positions of some joints of the hand, invisible to the sensor, are guessed, resulting in inaccurate and unstable representations. That could be negligible when just rough gestures need to be reproduced, but crucial when finer movements are used in tele-operated applications, such as tele-surgery or operations in dangerous environments. Recently, several works have been published with the aim of improving hand tracking accuracy by combining LEAP data with those of other devices or data from multiple LEAPs [17,21,22,23,40]. In particular, in [21] a LEAP is supported by an RGB webcam to improve the quality of the recognition of symbols in the 3D American sign language datasets. The aim of that system is to reduce the ambiguities, due to occlusions, in gesture recognition: the RGB webcam is used as an auxiliary system, since it cannot provide specific spatial information. The same gesture recognition problem for identification of American sign language and Handicraft-Gesture is solved accurately with just one LEAP [40]. In [22,23] a LEAP is supported by a depth camera. The system has very good accuracy (regarding gesture recognition) but a low frame rate (15fps), making it unsuitable for applications that require a higher frequency (30fps or greater) to track natural hand movements. Moreover, due to the occlusions between fingers, the method can perform well only when the hand is in ideal orientations/positions.
Kiselev et al. [17] use three LEAPs for gesture recognition. The authors show that by increasing the number of sensors, the accuracy also increases due to the fact that the number of occlusions decreases. Moreover, the use of multiple sensors of the same type greatly improves the performance of the data integration strategy because similar models are easy to compare. However, since just one LEAP at a time can be driven by a single operating system, the client/server architecture described in the paper suggests that at least three different computers have been used (not cheap, and critical in terms of synchronization). In addition, as two of the three LEAPs are coplanar, they mostly contribute to increasing the active region but have little influence in reducing occlusions. Finally, the performance of the system, in terms of frame rate, has not been discussed. Shen et al. [35] solve the problem of occlusions in gesture recognition by proposing the use of three LEAPs placed with their long axes on the midpoints of the sides of an equilateral triangle. Although the paper discusses in depth the system assembly, calibration, data fusion and results in terms of position/orientation accuracy, no mention is made of the resulting efficiency of the system in terms of fps.
Virtual Glove (VG) is a system based on the synchronized use of two orthogonal LEAPs ( Fig. 1) for reducing the probability of occlusions [30]. Better results regarding occlusions reduction could have been obtained with three LEAPs on a equilateral/equiangular configuration, as in [35], but we would have had serious problems with the real-time maintenance (at least 30fps) on a low-cost computer. Though the paradigm of VG [29] is applicable to any number of sensors placed in any angular configuration, the choice of two orthogonal sensors represents a good compromise between optimization/positioning of the region of interest (ROI), precision and efficiency. In fact, through project-related considerations and qualitative measurements regarding position/dimensions of the ROI and precision with respect to the angle, it can be argued that: 1) An acute angle between the sensors planes [15], though useful to approach the ROI to sensors and to maximize precision of the sensors individually, would reduce the space between sensors that implies a reduction of the hand movements inside the system and, consequently, a reduction of the ROI. Moreover, IR reciprocal interferences between sensors would increase, thus resulting in a reduction of stability, reliability and, hence, of the final precision of the tracking; 2) An obtuse angle between the sensors planes, though increasing the space between sensors, would move away the ROI from the sensors surface, thus reducing the precision of the system. In the original embodiment of VG, data coming only from one of the LEAPs were used at each time instant by mutual exclusion: the one having the most favourable orientation with respect to the hand palm was chosen. Though simple and efficient, this solution did not solve many cases of occlusions and, to increase efficiency, a lot of useful information coming from the orthogonal sensor got wasted.
To integrate data from both LEAP sensors and to solve the problem of data wasting, we have also considered the possibilities offered by Machine Learning (ML) [24] and Deep Learning (DL) [36] but, though very effective, they could be either too slow or too computationally expensive to be used on a low-cost machine (the VG system is intended for accurate and, at the same time, low-cost human-system interaction [20]). Besides, we would face the difficulty of obtaining sufficiently populated labelled datasets for training, composed of the spatial positions of the hand joints (collected by a position indicator and considered the ground truth) and the corresponding spatial positions measured by both LEAP sensors while moving the hand inside the VG. This last task, necessary for using ML and DL strategies, is a long process that requires an advanced, minimally invasive (LEAP sensors have to view the hand and its joints) and precise position indicator, such as one of those produced by VICON, to be installed on the hand.
Aim of the present paper is to design and test a completely different data integration approach for VG, a good trade-off between simplicity, efficacy and efficiency without the requirement of any training datasets. The rest of the manuscript is structured as follows: Section 2 reviews VG assembly (both hardware and data collection strategy). Section 3 details the proposed data-integration method. Section 4 presents experimental measurements, results, and discussion. Finally, Section 5 concludes the manuscript and delineates future work and developments.

Design, calibration, and sensors management
The VG hardware consists of a rigid support equipped with housings for the two orthogonal LEAP sensors (Fig. 1a). The sensors are fixed inside the housings with plastic screws to avoid vibrations and movements. The center of each LEAP is positioned at 18.5 cm from the internal corner of the support: this distance was optimized for maximizing the signal within a region with a 21 cm side, while also reducing VG's dimensions.
Both sensors were calibrated to a common right-handed Cartesian coordinate system, the center of which lies on the LEAPs plane, at the intersection of their vertical axes. Calibration was performed by accurately measuring, with a high precision positioning system (spatial precision 0.01mm), the position of the tip of a stick on a set of m points inside the region of interest of the VG. On the same points, spatial measurements were collected by both LEAPs (one sensor at a time) and used to calculate the transformation matrix [11]. Given the clouds of points measured by the two LEAPs, each in its own reference system, the method finds the rotation matrix R and the translation vector s that minimize the registration error between the two sets. The centers of mass of both sets are calculated and used to center the sets on the origin, which allows the cross-covariance matrix to be computed. The resulting transformation is expressed in homogeneous coordinates.
Regarding the operation of two sensors at the same time, the software development kit (SDK) of the LEAP does not allow the use of two devices on the same operating system, so an architecture based on virtual machines is necessary. In our architecture, two virtual machines (slaves) are installed on the physical machine (master) and each of them manages one of the two sensors. Data provided by the SDK through the websocket are captured by a JavaScript router and returned to a server hosted on the master machine. In the same way, the server sends data from both devices to one or more clients running on the master. The server receives data from the routers and processes them by performing the coordinate transformation and by constructing, and representing in a virtual environment, the numerical hand model.
The hand model structure can vary depending on the specific programming language SDK. VG uses the JavaScript API and the computations are performed by employing the bone class: a Hand instance has access to the Arm (bone class) and to the Finger classes. Each Finger gives access to its bones (i.e., metacarpal, proximal, intermediate, and distal) and joints (i.e., attributes carpPosition, mcpPosition, pipPosition, dipPosition, btipPosition). Fig. 1 shows the hand inside the VG (a) and the corresponding numerical model (b).
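The calibration steps described in this section (error to minimize, centering on the centers of mass, cross-covariance matrix, homogeneous transformation) match the standard SVD-based rigid registration of two point sets. The following is a sketch in that standard form, not necessarily the notation of [11]: the symbols a_k and b_k (the m corresponding points in the two reference systems), their means, H, U, V and T are assumed here for illustration.

```latex
% Assumed notation: a_k = source points, b_k = target points, k = 1..m
\begin{align}
E(R, s) &= \sum_{k=1}^{m} \lVert R\,a_k + s - b_k \rVert^{2} \\
\bar{a} &= \frac{1}{m}\sum_{k=1}^{m} a_k, \qquad
\bar{b} = \frac{1}{m}\sum_{k=1}^{m} b_k \\
a'_k &= a_k - \bar{a}, \qquad b'_k = b_k - \bar{b} \\
H &= \sum_{k=1}^{m} a'_k \, {b'_k}^{\top} = U \Sigma V^{\top} \\
R &= V \,\mathrm{diag}\bigl(1,\, 1,\, \det(V U^{\top})\bigr)\, U^{\top}, \qquad
s = \bar{b} - R\,\bar{a} \\
T &= \begin{pmatrix} R & s \\ 0^{\top} & 1 \end{pmatrix}
\end{align}
```

The determinant term only guards against a reflection being returned instead of a proper rotation; with well-spread, non-coplanar calibration points it normally equals 1.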

Original data collection strategy
The original hand tracking strategy is based on a switching approach, i.e., at any given time instant t, only one sensor, the same for all joints, is used to track the hand. Both sensors are switched on, but only one LEAP at a time is active and provides data (Fig. 2, top row). To determine which LEAP is active (the "favorite" sensor), the palm's normal vector p (a vector orthogonal to the palm of the hand) is used to find the angle between the X-axis of the horizontal LEAP reference system and the projection of p on the X-Y plane. If the angle is between 225° and 315° (the palm is facing downwards) or between 45° and 135° (the palm is directed upwards), the horizontal LEAP is active, while data from the vertical sensor are ignored. Outside these ranges, the roles of the sensors are inverted: the vertical LEAP becomes active and data from the horizontal LEAP are ignored. Though this approach is very efficient (just the hand orientation is necessary to choose which model to use) and capable of fixing occlusions caused by the hand's palm, it performs poorly when the hand is not perfectly oriented toward one of the sensors and/or when the hand is bending and some fingers obscure the others (this could occur in any orientation of the hand). In fact, with mutual exclusion, only data coming from one sensor are used at each time instant and, by discarding those of the other, a lot of potentially useful information is lost. These effects are accentuated when occlusions increase because an object is being handled. In what follows we describe the new strategy we propose to exploit data from both LEAPs at the same time, thus improving the VG's capability of reducing occlusions.
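The switching rule of the original strategy can be sketched in a few lines. This is a minimal illustration, not the VG implementation: the function name `favourite_sensor`, the counter-clockwise angle convention measured from the X-axis, and the inclusive range boundaries are assumptions.

```python
import math


def favourite_sensor(palm_normal):
    """Return 'horizontal' or 'vertical' from the palm normal (x, y, z).

    The normal is projected on the X-Y plane (z is dropped) and the angle
    with the X-axis is mapped to [0, 360) degrees.
    """
    x, y, _ = palm_normal
    # Assumed convention: angle measured counter-clockwise from the X-axis.
    angle = math.degrees(math.atan2(y, x)) % 360.0
    facing_down = 225.0 <= angle <= 315.0  # palm facing downwards
    facing_up = 45.0 <= angle <= 135.0     # palm directed upwards
    return "horizontal" if (facing_down or facing_up) else "vertical"
```

Because the choice depends only on the palm orientation, it is made once per frame for the whole hand, which is why it is cheap but blind to finger-level occlusions.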

The proposed data integration strategy
We aim at using data coming from both sensors. In particular, the role of "favourite view", assigned to the sensor which has the palm of the hand oriented toward it as in the mutual exclusion strategy, is maintained just at the beginning of acquisition. After that, data from both sensors are checked for each joint and, each time, just those from one of the sensors are selected in terms of stability and used to track that joint, i.e., at any time t, different joints of the model could be associated with different sensors (Fig. 2, bottom row). The reason for this choice is two-fold: the hand is a dynamic structure and, over time, a joint could be alternately obscured or visible; and, when a joint is lost by one sensor, its guessed position could be very far from the correct value, correctly represented by the other sensor (in this case, taking that joint from the other sensor would reduce position errors). In fact, when LEAP is tracking the hand, it correctly represents the joints that it sees and guesses those that it does not see due to occlusions.
Fig. 2 Running example. The switching modality (top row) and the proposed integration strategy (bottom row). In the top row, the switching modality selects data from just one of the LEAPs. With the proposed modality (bottom row), the joints are selected according to their stability and data coming from both LEAPs are fused in the same model. Blue, red, and green refer to the horizontal LEAP, the vertical LEAP, and the proposed data integration strategy, respectively.
When LEAP loses a joint, the joint first becomes temporally unstable (producing high-frequency flickers and shakes); then LEAP stabilizes the guessed position and holds it still until the joint becomes visible again, at which point the position is updated to the right one (this update also occurs abruptly). This produces jumps and spikes in the trajectory that can amount to errors of centimetres (see Section 4 below). A LEAP hand model contains data for all joints at each time, even if some of them are invisible to the sensor. In this last case, the positions of the invisible joints are guessed on the basis of the hand shape and of the previous temporal view (a proprietary LEAP strategy). The strategy we propose is to check, for each joint, the data flow coming from both sensors and to choose the data coming from the more stable of the two. A joint's stability is inversely related to its velocity: when the model is unstable, spikes and jumps are produced in the trajectory and the velocity is high. Since we have data for the same joint from both sensors, the velocity is computable and finite for both LEAPs and the corresponding values can always be compared: for each time instant, data are selected from the LEAP showing the lower velocity, in magnitude. The data flow from both sensors is shown in Fig. 3. Each sensor collects data with a varying frame rate, which also differs between LEAPs, and data from the closest times are compared. Two velocity values are calculated: one is the velocity along the same sensor, which we call Internal (intra-sensor) velocity, and the other is the velocity "produced" by skipping from one sensor to the other, which we call External (inter-sensors) velocity. External velocity is usually present because of the spatial differences between the two sensors (see Fig. 3).
Fig. 3 Example of data production from both LEAPs (blue and red dots), along time (vertical axis) and space (horizontal axis). The fps can change with time: the slower sensor is always used to produce data. Empty dots, from the faster sensor, are discarded. The parameters (space and time changes) for the calculation of Internal and External velocities are also shown. The reported example is just representative (spatial discrepancies between sensors have been enlarged for graphical purposes).
We first define these velocities and then we describe the stabilization strategy. For each joint (i = 1, 2, ..., 24) of the sensor L_j (j = 1, 2), we calculate the Internal velocity, as the ratio between the space and time changes along the same sensor, and the External velocity, as the ratio between the space and time changes when skipping from one sensor to the other (it does not depend on one specific sensor). By indicating with C_L,i the LEAP currently used for the joint i, the resulting integration algorithm is the following:
1. At the starting time, take the hand model from the favourite LEAP for all the joints and update C_L (here the index i is missing because the LEAP is the same for all the joints);
2. Step to the following time t (that of the slower LEAP);
3. For each joint i:
4. Verify the conditions in Table 1, take the data from the appropriate LEAP and update C_L,i accordingly;
5. Go to step 2.
The conditions in Table 1, a truth table, define from which LEAP we have to select data for the joint i at the time t. We first find the lowest value of the Internal velocity and, if a change of sensor is necessary with respect to the current C_L,i, we also check whether the External velocity is lower than the current Internal one. If this condition is met (data across sensors are more stable than those within the current one), a sensor switch is allowed; otherwise data are collected from the current sensor. Time also affects the LEAP choice because the number of fps changes with time for both sensors, and the two frame rates could even be always different. However, we always use the lower fps. In Fig. 4 an example is illustrated. The derivative calculation is obviously discrete.
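Comparing "data from closest times" on two streams with different frame rates can be illustrated as follows. This is only a sketch under assumptions: the function name `pair_frames` is hypothetical, timestamps are assumed sorted and non-empty, and the slower stream is assumed to be known in advance (in VG the slower sensor can change over time).

```python
from bisect import bisect_left


def pair_frames(slow_times, fast_times):
    """For each timestamp of the slower stream, return the index of the
    closest timestamp in the faster stream (both lists sorted ascending).

    Frames of the faster stream that are never selected are discarded.
    """
    pairs = []
    for t in slow_times:
        k = bisect_left(fast_times, t)
        # Only the neighbours around the insertion point can be closest.
        candidates = [c for c in (k - 1, k) if 0 <= c < len(fast_times)]
        best = min(candidates, key=lambda c: abs(fast_times[c] - t))
        pairs.append(best)
    return pairs


# Timestamps in milliseconds: a ~20 fps stream against a ~60 fps stream.
matched = pair_frames([0, 50, 100], [0, 16, 33, 49, 66, 83, 100])
```

Using the slower stream as the time base means every output sample has a measured counterpart from both sensors, so both velocities can always be computed.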
The resulting hand model is a mixture of joints tracked by both sensors, resulting in a smoother train of points. The compared values refer to the same joint. No velocity threshold is needed: no constraint needs to be set regarding the maximum velocity of the hand. In fact, having two sensors to register the same joint, it can be supposed that, if the joint is moving, the smoother track is the more precise of the two. To further improve precision, it could be useful to verify how to merge data when both sensors are operating correctly. In that case, however, additional calculations would be necessary to verify the correctness of the data, and these could preclude real-time operation.
Table 1 Truth table indicating, inside the cells, the sensor that has to be chosen if the logical conditions (row and column) are met at the same time (logical AND)
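The per-joint selection rule described in this section can be sketched as below. It is a hedged illustration, not the authors' code: the names `speed` and `select_sensor` are hypothetical, and the External velocity is assumed here to be the jump from the current sensor's previous position to the other sensor's current position, divided by the elapsed time.

```python
import math


def speed(p_from, p_to, dt):
    """Magnitude of the discrete velocity between two 3D points."""
    return math.dist(p_from, p_to) / dt


def select_sensor(current, prev, curr, dt):
    """Choose the sensor (1 or 2) for one joint at the current time step.

    current: sensor used for this joint at the previous step (C_L,i).
    prev, curr: dicts {1: (x, y, z), 2: (x, y, z)} with the joint position
                reported by each sensor at the previous and current step.
    dt: elapsed time between the two steps (must be > 0).
    """
    other = 2 if current == 1 else 1
    v_current = speed(prev[current], curr[current], dt)  # Internal, current
    v_other = speed(prev[other], curr[other], dt)        # Internal, other
    # Assumed definition of the External (inter-sensor) velocity.
    v_external = speed(prev[current], curr[other], dt)
    # Switch only if the other sensor is smoother AND the jump across
    # sensors is itself smoother than staying on the current one.
    if v_other < v_current and v_external < v_current:
        return other
    return current
```

No velocity threshold appears anywhere: the decision is purely a comparison between velocities of the same joint, which is what makes the rule independent of how fast the hand moves.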

Data collection
To demonstrate the effectiveness of the proposed strategy in comparison with the usage of a single sensor, measurements were collected while a subject moved the left hand inside the VG (the usage of the right hand would have been the same). We performed the experiment with the hand free, without grabbing any object, in order to highlight that: 1) the instability effects on the reproduced trajectories are due to the loss of the signal (occlusions) and not to disturbances due to the grabbed object; 2) also in a free-hand mode the number of occlusions during the tracking process is high. The experiment started with the hand still oriented toward the horizontal LEAP, followed by wrist rotations alternated to a sequence of hand open-close operations. The number of wrist rotations was 5, corresponding to 6 hand positions with respect to the LEAPs (these are important to establish the changes of orientation of the hand with respect to both sensors, as clarified below). The duration of the sequence was 25 seconds for a total of 961 4D positions (x, y, z, and t all referred to the world system of VG). The hand model reconstructed in real-time by the proposed strategy was shown on a computer screen and saved into a database (DB). Apart from the reconstructed model, also original models obtained by each LEAP were stored into the same DB.
The whole experiment was also recorded using an external video camera and time was monitored by a stopwatch. Conditions of the room were maintained normal in order to avoid favorable conditions: no particular attention was paid to maintain external interferences low (controlled light, temperature, electromagnetic disturbances, and so on) and to maintain the background free of objects.

Results and discussion
Data obtained with the proposed method and those from each single LEAP have been recovered from the DB to be shown into the same plots. To this aim, the trajectories of just the 5 fingertips, organized by axes, are presented in Fig. 4 where three lines are reported: data from the horizontal LEAP (blue line), data from the vertical LEAP (red line) and data obtained with the proposed integration strategy (green line). As it can be observed, the green curve follows alternatively values from the blue or from the red curves by remaining on the smoother one. In fact, the green curves are smoother than blue and red ones and, in that way, also spikes and jumps, obviously representing tracking losses or outliers, are reduced. A summary of the outliers' reduction by using the proposed solution with respect to the switching approach is reported in Table 2. Though most of the outliers are removed, some of them remain mostly where both sensors are unstable at the same time.
Obviously, the data collection strategy originally used in VG would maintain all the outliers occurring for the currently active sensor, since the only information used to get data from one of the LEAPs was the orientation of the hand. Fig. 4 also indicates, with vertical dashed lines, the instants where the switching between sensors occurred in the original procedure because the limit angles were exceeded. Further discontinuities could be produced by the transition from one LEAP to the other, as can be observed at both sides of the vertical lines in the plots. Another important aspect is that the selection of the blue or the red trajectory by the green one depends on the specific joint and not on the orientation of the hand palm (in that way, at a given time instant, a joint uses data by looking at its own smoother trajectory, which could be different from that used by another joint). Particular attention should be paid to the difference between the spatial representations of the two LEAPs for the same joint that, in some segments, can be very high and remain high for a long time. As said before, this depends on the behavior of the sensor: when a LEAP has to guess the position of a joint, it chooses the best estimated position and maintains it until it sees the joint again. During this time, the positions indicated by the two sensors could greatly differ, and this justifies the usage of data coming from just one of them instead of merging data from both. Such an effect is evident in Fig. 4, especially for the time interval between t=1 sec and t=3 sec, where the pinkie finger is shadowed by the rest of the hand with respect to the vertical sensor: its positions guessed by the vertical LEAP are quite different from those collected (correctly) by the horizontal one. When data are collected by the proposed integration strategy, the correct position is selected.
Figure 4 also shows relevant snapshots of the experiment on which some fingertip occlusions are highlighted: thanks to the proposed integration strategy, the trajectories are smoothed, as can be observed in the plots, and the final reconstructed hand model is correctly reproduced on the computer screen in real-time. The system can follow and adequately record hand movements without any special preparation. The presented results have been acquired under normal conditions, which indicates that the VG system is capable of performing well also outside a laboratory and, after further development, makes it a good candidate for future applications in external environments. Based on these qualitative results, the model shape resembled the real hand accurately and, most importantly, the model followed the hand movements in real-time when operated on a PC with an Intel i7 processor, 32 GB of RAM, and an NVIDIA GeForce GTX 1080. The results confirmed that the model was represented on the screen at 47fps (demonstrated by registering the timestamps of the presentations on the screen), which is about 1.5 times the frequency required to consider human-system interaction useful for real-time (about 30fps). This high-frequency image acquisition is what allows us to combine and synchronize the two LEAP systems, even when they do not work at the same fps. Table 2 shows that outliers in all five finger trajectories have been reduced by more than half. In particular, the average reduction is 58% for the x, 51.6% for the y, and 55.2% for the z direction. To conclude the analysis, we want to remark that, both from Fig. 4 and from Table 2, certain fingers show a lower number of outliers than others.
The thumb has the fewest outliers in both approaches (its averages on the three axes are 19.3 and 10.3 for the old and the new method, respectively), while the pinkie comes second using the old method (25.7 outliers on average) and the ring finger has the second fewest spikes with the new method (12 on average). This is obviously due to their more favorable (external) position. To provide an overview of outlier average values for both fingers and axes, Table 3 is also presented. Moreover, two other considerations can be made: 1) the x coordinate is more stable than the other coordinates (there is no evident explanation for that behavior and we have to explore it); 2) there is a different behavior between the two LEAPs of the VG (the horizontal one was more stable than the vertical). This point is probably due to internal hardware differences between the two sensors and it is not an environmental effect. In fact, by rotating the VG by 90°, the two LEAPs maintain the same behaviour.
Table 3 Average values of the plot outliers with the old and the new data integration method, together with their percentage reduction, separated by finger and grouped by axis (first four columns) and separated by axis and grouped by finger (the remaining four columns)
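The percentages in this section are relative reductions of the outlier counts, and the real-time margin is a simple ratio of frame rates. As a small worked check, using only the thumb averages and frame rates quoted in the text (the helper name `percent_reduction` is hypothetical):

```python
def percent_reduction(old, new):
    """Relative reduction, in percent, from an old count to a new one."""
    return 100.0 * (old - new) / old


# Thumb: average outliers over the three axes, old vs. new method.
thumb_reduction = percent_reduction(19.3, 10.3)

# Display rate achieved by VG relative to the 30 fps real-time requirement.
fps_margin = 47 / 30
```

With these inputs the thumb reduction is a little under 47%, and the frame-rate margin is a little above 1.5.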

Conclusion
Occlusions are one of the biggest and most studied issues in hand movement tracking. The LEAP system, with its low cost and simple setup, offers the opportunity of significantly reducing the problem, by using more than one detector to multiply the visual information.
We presented a new strategy for selecting data coming from both sensors forming the VG, a system composed of two orthogonally placed LEAP detectors that provides 4D hand tracking in real-time. The proposed strategy has made it possible to reduce occlusions, to avoid outliers and false position indications (errors) with respect to using data from just one sensor, and to increase the stability of VG. These are the necessary conditions for VG, a touchless system which leaves the hand free to perform natural movements, to be effectively used for reproducing hand and finger movements with good spatial and temporal resolution and, hence, to drive systems remotely with high accuracy. However, the presented results can only demonstrate qualitative improvements with respect to the original mutual exclusion strategy. Future work will be dedicated to organizing measurements from which quantitative evaluations can also be obtained, and to studying possible countermeasures to the residual instabilities produced when some views are obscured with respect to both sensors at the same time (the usage of a third detector, as in [35], could help, but the real-time conditions have to be checked). Moreover, advanced data integration strategies based on AI will be designed and tested to improve VG precision and stability, while keeping the computational load low enough for a low-cost machine to maintain real-time operation. In particular, we aim to reach this goal by using AI-based approaches, such as those in [1,13,40], applied to the temporal trajectories described by each joint of the hand.