Exploring Machine Learning Object Classification for Interactive Proximity Surfaces
Capacitive proximity sensors are a variant of the sensing technology that drives most finger-controlled touch screens today, but they work over a larger distance. As they are not disturbed by non-conductive materials, they can be used to track hands above arbitrary surfaces, creating flexible interactive surfaces. Since their resolution is lower than that of many other sensing technologies, sophisticated data processing methods are necessary for object recognition and tracking. In this work we explore machine learning methods for the detection and tracking of hands above an interactive surface created with capacitive proximity sensors. We discuss suitable methods and present our implementation based on Random Decision Forests. The system has been evaluated on a prototype interactive surface, the CapTap. Using a Kinect-based hand tracking system, we collect training data and compare the results of the learning algorithm to ground truth data.
Keywords: Capacitive proximity sensing · Interactive surfaces · Machine learning
Capacitive sensing drives the most common interaction device of recent years: the finger-controlled touch screen. A generated electric field is disturbed by the presence of grounded objects, such as fingers. This disturbance can be measured and is used to detect the location of an object on the surface. Capacitive proximity sensors are a variant of this technology that is able to detect the presence of a human body over a distance. They can be used to create interaction devices that are hidden below non-conductive surfaces [2, 3]. However, object recognition is a major challenge for these sensors, due to their comparatively low resolution and the ambiguity of the detected objects.
In recent years researchers have investigated a number of different processing methods, including algorithms that try to distinguish multiple hands. Machine learning methods have become more popular in object recognition for 3D interaction devices. The Microsoft Kinect is a popular example that extensively uses machine learning for posture classification. Le Goc et al. used the Random Decision Forest (RDF) method to improve the object recognition of a commercially available, small-area capacitive proximity interaction device. They used a stereo camera system to track the position of a fingertip with high precision as ground truth and trained an RDF, resulting in a finger tracking precision that exceeded the manufacturer's methods. In this paper we evaluate how machine learning methods such as RDF can be used to realize object recognition and tracking on large-area interaction systems. The CapTap is an interactive table that uses capacitive proximity sensors to provide a 3D interaction paradigm. It comprises 24 sensors hidden under a wooden or plastic cover and starts detecting objects at a distance of approximately 40 cm. The sensors use the OpenCapSense rapid prototyping toolkit for capacitive sensing.
Using a second system that supports precise hand tracking, we can acquire ground truth data. The idea is to classify the volume element that the center of the hand resides in. For this it is necessary to collect training samples from within each volume element. The hand tracker uses background subtraction on a depth image and looks for the innermost point of a recognized hand shape. In order to compute the 3D position, the extrinsic camera parameters have to be estimated in a calibration routine. This idea was implemented in a prototype that uses the Kinect v2 to acquire the hand position. Additionally, we propose several modifications to improve the localization accuracy.
The contributions of this work are:
- A method to track multiple hands over a capacitive proximity interaction device using RDF classification.
- A system to accurately measure the relative position of hands from a surface, using a fast hand detection algorithm for the Microsoft Kinect v2.
- An evaluation of the system using 1500 training samples, proposing several methods to improve localization based on RDF classification results.
2 Related Work
Our proposed method is designed for large-area interactive tables that support interaction above the surface. One of those systems is the MirageTable by Benko et al. This curved interactive surface combines hand tracking in three dimensions using the Microsoft Kinect and augmented reality with a stereo projector mounted above the table. Users have to wear shutter glasses for the stereo projection. An early system that uses capacitive sensors for touch recognition is DiamondTouch by Dietz et al. It uses a multilayer electrode setup and is thus able to track several touch points from different users. The system we use in this work is CapTap, an interactive table that combines capacitive proximity sensors for object detection above the surface with an acoustic sensor for detecting and distinguishing different touches (Fig. 1). It creates capacitive images and applies various image processing methods to detect the location and orientation of hands and arms. The acoustic sensor uses analysis in the frequency domain and a trained classifier to recognize different touch events, such as taps and knocks.
The system requires training data as ground truth. Therefore, we need methods that track the position of a hand in 3D. The Microsoft Kinect is an affordable sensor that provides reasonably precise depth data over a sufficiently large area. It is primarily intended for pose tracking. Some methods have been proposed for the first version that track the position of multiple hands. A novel method for the Kinect v2 enables the tracking of fingers with high precision. However, for our proposed system the focus was on fast tracking of the palm position above a surface. Therefore, a custom method was implemented that is outlined in the following section.
3 Hand Tracking Using Random Decision Forests
The training setup consists of the following components:
- A stand that holds a video system at an elevation above the surface of the interaction device.
- The Kinect v2, attached to the stand, observing the scene with both depth and RGB cameras.
- A computer vision algorithm that detects the centers of the palms in the interaction area in three dimensions.
- A software suite that collects the data of the capacitive sensors and the detected hand positions as training and test data.
3.1 Hand Tracking Algorithm for Kinect v2
We use an edge detector on the depth image to find the boundaries of the CapTap in the scene. As the depth images are fairly noisy, an average of 150 frames is used. The four corner points of the boundary allow us to create a transformation matrix that corrects for position in x, y, and z. This transformation matrix is applied to points in the Kinect coordinate system, translating them to points in the CapTap coordinate system.
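The planar part of this corner-based calibration can be sketched as a standard four-point homography estimate. The listing below is an illustrative numpy-only version; the actual transformation in the paper also incorporates the depth axis, which is omitted here:

```python
import numpy as np

def homography_from_corners(src, dst):
    """Estimate the 3x3 perspective transform mapping four source points
    (e.g. table corners detected in the depth image) onto four destination
    points (the table's own coordinate frame), via the DLT method."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of A (smallest singular vector).
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def transform(H, pt):
    """Apply a homography to a 2D point (homogeneous divide included)."""
    x, y, w = H @ np.array([pt[0], pt[1], 1.0])
    return (x / w, y / w)
```

In practice OpenCV's `getPerspectiveTransform` computes the same matrix from four point pairs.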
- Segment the hands using background subtraction, as shown in Fig. 3 on the right, producing a binary image.
- Apply a combination of the morphological operations erode and dilate to reduce noise in the remaining image.
- Apply a region growing algorithm to find pixels that belong to the interior or exterior of a hand.
- Find the interior point of a hand that is furthest from any edge, as shown in Fig. 3 on the left, and transform its coordinates to the CapTap coordinate system.
This approach is fast enough for real-time execution (>30 frames per second). We applied several optimizations that reduce the number of candidate pixels for the palm center, such as discarding candidates very close to the edges or candidates that are part of a finger.
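The four steps above can be sketched as follows. This is an illustrative implementation using scipy's image-processing routines; the threshold and minimum-area parameters are assumptions, not values from the paper:

```python
import numpy as np
from scipy import ndimage

def palm_centers(depth, background, threshold=30, min_area=200):
    """Detect palm centers in a depth frame given a background frame.
    Steps: background subtraction, morphological opening, connected-
    component labeling, and the innermost point of each hand blob."""
    # 1. Background subtraction -> binary foreground mask.
    mask = np.abs(depth.astype(int) - background.astype(int)) > threshold
    # 2. Erode then dilate (morphological opening) to remove speckle noise.
    mask = ndimage.binary_dilation(ndimage.binary_erosion(mask))
    # 3. Region growing: label connected foreground components.
    labels, n = ndimage.label(mask)
    centers = []
    for i in range(1, n + 1):
        blob = labels == i
        if blob.sum() < min_area:
            continue  # discard small noise blobs
        # 4. Innermost point = maximum of the distance-to-edge transform.
        dist = ndimage.distance_transform_edt(blob)
        cy, cx = np.unravel_index(np.argmax(dist), dist.shape)
        centers.append((cx, cy))
    return centers
```

The distance-transform maximum is one common way to realize "furthest from any edge"; an iterative-erosion variant would give a similar result.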
3.2 Acquiring Training Data
As we are limited to collecting 30 samples per second, the whole process would take approximately 110 h even if the hands moved perfectly. The complexity increases further if we want to recognize multiple hands.
This is not feasible within the scope of this work, thus several simplifications have been applied. The resolution of the system was restricted to voxels with a 20 mm edge length, which cuts down the training time significantly. Additionally, the classification of multiple hands assumes a minimal distance between them, which reduces the number of training samples required.
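A minimal sketch of the resulting training setup is shown below, assuming a simple row-major voxel labeling and using scikit-learn's RandomForestClassifier as a stand-in for the RDF implementation. The interaction-volume dimensions and the synthetic sensor data are purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

EDGE = 20  # voxel edge length in mm, as restricted in the text

def voxel_label(pos, dims=(30, 20, 20)):
    """Map a hand position (x, y, z in mm, relative to one table corner)
    to the integer class label of its enclosing voxel. The volume
    dimensions `dims` (in voxels) are an assumed example."""
    ix, iy, iz = (int(c // EDGE) for c in pos)
    nx, ny, nz = dims
    return (iz * ny + iy) * nx + ix

# Hypothetical training: one feature vector per frame (24 sensor values)
# paired with the voxel label of the Kinect-tracked palm position.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 24))       # stand-in capacitive readings
y = rng.integers(0, 50, size=500)    # stand-in voxel labels
forest = RandomForestClassifier(n_estimators=50).fit(X, y)
probs = forest.predict_proba(X[:1])  # per-voxel class probabilities
```

The per-voxel probabilities returned by `predict_proba` are exactly what the interpolation strategies in the next section consume.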
4 Random Decision Forest Location Classification
We compare three methods for deriving a hand position from the voxel classification:
- Assigning the center of the dominant voxel as the location of the palm of the hand, subsequently called naïve-center.
- Linear interpolation between the centers of the voxels with the highest probability, subsequently called linear-interpolation.
- Weighted interpolation between the centers of the most probable voxels according to their probability, subsequently called weighted-interpolation.
Naïve-center is the basic approach that does not require any additional knowledge about the geometric distribution of the voxels in the interaction area. Its disadvantage is a highly quantized localization, whose resolution is limited to the voxel edge length. If multiple voxels have a similar probability, it is likely that the true position lies somewhere between the detected voxels.
Linear-interpolation solves the problem of similar probabilities and can successfully localize hands that are between voxels. However, the simple linear interpolation can lead to overrepresentation of unlikely voxels. If the most likely voxel has a probability of 0.9 and a voxel that is far away has 0.05, the linear-interpolation would lead to a position somewhere between the voxels that is likely wrong. A first improvement is discarding voxels that have a low probability.
Weighted-interpolation tries to overcome this disadvantage by attaching a weight to the voxels according to their probability. This leads to a location that has a tendency towards the center of the most probable voxel. A remaining disadvantage of this approach is the lack of correlation between classification probability and hand position. If the hand is at the border between two voxels there is no guarantee that the classification assigns both the same probability. However, in practical applications we found that this often leads to the best results.
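The three localization strategies can be sketched in a few lines. The probability threshold for discarding unlikely voxels is an assumed value, not one from the paper:

```python
import numpy as np

def localize(centers, probs, method="weighted", threshold=0.1):
    """Derive a hand position from per-voxel classification probabilities.
    `centers` is an (n, 3) array of voxel-center coordinates and `probs`
    the matching probabilities."""
    centers = np.asarray(centers, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if method == "naive":
        # Naïve-center: center of the single most probable voxel.
        return centers[np.argmax(probs)]
    keep = probs >= threshold          # discard low-probability voxels
    c, p = centers[keep], probs[keep]
    if method == "linear":
        # Linear-interpolation: unweighted mean of remaining centers.
        return c.mean(axis=0)
    if method == "weighted":
        # Weighted-interpolation: probability-weighted mean, pulled
        # toward the center of the likeliest voxel.
        return (c * p[:, None]).sum(axis=0) / p.sum()
    raise ValueError(method)
```

For two voxels with probabilities 0.9 and 0.1, linear-interpolation lands at the midpoint while weighted-interpolation stays close to the likelier voxel, matching the behavior described above.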
5 Evaluation and Discussion
In the following we discuss findings from the system design and implementation stages, as well as the results of a study performed on a fully trained system.
Table 1. Evaluation results of several algorithm varieties and RDF settings using 827 training and test samples. For each variety, the ratio of correctly classified voxels and the average distance error are reported for the RDF settings N = 100, r = 0.75 and N = 50, r = 0.66.
The ratio of correctly classified voxels was comparatively low for both varieties, ranging from 28 % to 48 %. The average distance error was better: the classified hand position was between 2.3 cm and 4.4 cm from the ground truth. In this case the distance is measured from the center of the voxel as collected by the Kinect.
The RDF with a lower number of trees generally performed better than the RDF with a higher number of trees. During the evaluation we observed the expected strong correlation between RDF classification probability and proximity of voxels to the training data, even if they often do not match exactly. Even with a very coarse voxel resolution, this creates the opportunity to achieve a good average distance error of the palm from the true position. This finding can be used to optimize the training process in the future by using suitable interpolation methods.
6 Conclusion and Future Work
In this work we explored the use of RDF classification for the tracking of hands in three dimensions above a large-area interaction device using capacitive proximity sensors. There are several challenges in using this method. The number of training samples grows cubically with the intended resolution, as the interaction area is a three-dimensional space. In addition, sensor noise and calibration of the initial sensor values become a challenge in unconstrained conditions.
We have proposed two interpolation methods for improving the localization of objects based on classification results. An efficient algorithm was developed to calculate the center of the palm based on simple computer vision operations on the depth image of a Kinect v2. We evaluated the system and achieved a good localization error, even though the classifier did not find the correct voxel in the majority of cases.
We would like to use the symmetry of the sensor setup to create more efficient training methods that are performed over a smaller area but whose results can be transferred to other areas of the interaction device. Using this method, we will collect more training data to obtain measurements at smaller voxel resolutions. Based on that, we can evaluate whether resolutions comparable to vision-based systems are achievable.
Acknowledgments. We would like to thank all volunteers who participated in our studies and provided valuable feedback for future iterations. This work was supported by the European Commission under the 7th Framework Programme (Grant Agreement No. 611421).
- 1. Barrett, G., Omote, R.: Projected capacitive touch technology. Inf. Disp. 28, 16–21 (2010)
- 2. Braun, A., Hamisu, P.: Designing a multi-purpose capacitive proximity sensing input device. In: Proceedings of PETRA (2011). Article No. 15
- 3. Zimmerman, T.G., Smith, J.R., Paradiso, J.A., Allport, D., Gershenfeld, N.: Applying electric field sensing to human-computer interfaces. In: Proceedings of CHI, pp. 280–287 (1995)
- 4. Braun, A., Wichert, R., Kuijper, A., Fellner, D.W.: Capacitive proximity sensing in smart environments. J. Ambient Intell. Smart Environ. 7, 1–28 (2015)
- 6. Le Goc, M., Taylor, S., Izadi, S., Keskin, C., et al.: A low-cost transparent electric field sensor for 3D interaction on mobile devices. In: Proceedings of CHI, pp. 3167–3170 (2014)
- 7. Braun, A., Zander-Walz, S., Krepp, S., Rus, S., Wichert, R., Kuijper, A.: CapTap - combining capacitive gesture recognition and knock detection. Working paper (2016)
- 8. Grosse-Puppendahl, T., Berghoefer, Y., Braun, A., Wimmer, R., Kuijper, A.: OpenCapSense: a rapid prototyping toolkit for pervasive interaction using capacitive sensing. In: 2013 IEEE International Conference on Pervasive Computing and Communications, PerCom 2013, pp. 152–159 (2013)
- 9. Grosse-Puppendahl, T., Braun, A., Kamieth, F., Kuijper, A.: Swiss-cheese extended: an object recognition method for ubiquitous interfaces based on capacitive proximity sensing. In: Proceedings of CHI, pp. 1401–1410 (2013)
- 10. Microchip Technology Inc.: GestIC® Design Guide: Electrodes and System Design MGC3130 (2013)
- 11. Benko, H., Jota, R., Wilson, A.: MirageTable: freehand interaction on a projected augmented reality tabletop. In: Proceedings of CHI, pp. 199–208 (2012)
- 12. Dietz, P., Leigh, D.: DiamondTouch: a multi-user touch technology. In: Proceedings of UIST, pp. 219–226 (2001)
- 13. Harrison, C., Schwarz, J., Hudson, S.E.: TapSense: enhancing finger interaction on touch surfaces. In: Proceedings of UIST, pp. 627–636 (2011)
- 14. Ren, Z., Meng, J., Yuan, J.: Depth camera based hand gesture recognition and its applications in human-computer interaction. In: 2011 8th International Conference on Information, Communications and Signal Processing, pp. 1–5 (2011)
- 15. Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., Izadi, S.: Accurate, robust, and flexible real-time hand tracking. In: Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3633–3642. ACM, New York (2015)
- 17. Cerezo, F.T.: 3D hand and finger recognition using Kinect. Project report, University of Granada (2011)