
1 Introduction

Unlike traditional invasive technologies such as displays, foils, or infrared (IR) grids, which can only detect interactions on a dedicated assembly, tabletop interaction integrates an existing environment into the field of interaction. This integration allows a high degree of freedom in incorporating elements of the environment and provides new interaction possibilities that can fundamentally change traditional user interfaces. Projection frees us from the notion of a fixed image format: the physical environment provides the natural boundaries of interaction and allows the physical world to be fused with interaction and projection possibilities, without invasive assemblies in the environment.

This review covers methods for multi-touch tracking combined with image projection on a flat surface (e.g., a table or wall), using a projection device (e.g., a projector) and sensors (e.g., a depth sensor). One of the research topics of the Institute for Interactive Technologies at the University of Applied Sciences and Arts Northwestern Switzerland lies within the scope of interactive blended projections. The institute has introduced “Live Paper” [5], an interactive tabletop projection-based environment based on the findings of Xiao et al. [18], to enhance user interaction scenarios. The field of tabletop projections is continuously developing in terms of interaction and interface design, with potential in multiple areas of application. So far, development has been limited by the capabilities of existing sensors and projection technologies. We are now at a point where there is potential for further development, as new sensors and methods have been introduced.

This review aims to establish the current state of touch tracking for tabletop projections for use in future projects, and to identify the state of the art and the potential of such environments as a new generation of interfaces for current and future research at the institute. It examines which methods are relevant to the next development steps and identifies trends for future research. The main research question, “What is the current state and where lies the potential for current research?”, is intended to cover emerging trends and identify further research areas for this application. The following criteria must be fulfilled by a published paper to be included in this review:

  • published between 2015–2019

  • projection-based interaction on a non-moving surface (e.g., wall or table)

  • no assembly of the surface needed

  • multiple fingertip detection independent of the sensor angle (i.e., a fingertip behind another fingertip should be recognized if in the detection area)

This review does not intend to cover the history of projection-based touch tracking; where required, papers published before 2015 or not meeting the selection criteria may still be mentioned. The review emerged from a systematic overview of different methods in tabletop projections. As the selected papers do not report sufficient statistical data and are difficult to compare, this review omits statistical comparisons and takes a narrative approach.

2 Overview of Interactive Tabletop Projection Methods

Table 1 shows an overview of milestones in touch tracking for tabletop projections meeting the criteria. For a more detailed summary of the history of projected interfaces, see Xiao et al. [16].

Table 1. Timeline of selected papers meeting the selection criteria in touch tracking for tabletop projections. Above the dashed line, papers with the most references in the citation graph (see Figure 1) are listed.
Fig. 1. Citation graph of selected papers published after 2015

Table 2. Strengths and drawbacks of tabletop projection systems

Comparability. Unfortunately, not all methods and measurements can be compared to each other, as they use different sensors or platforms, or do not publish all of their results. For example, Chai et al. [2] state that they could not compare their results to Xiao et al. [18] because of the limitations of the environment and sensors. Figure 1 shows the citation graph of the chosen papers, revealing that Lee et al. [10] did not refer to any of the other selected papers, which makes comparison difficult. To give an overview, Table 2 shows the primary subjective strengths and drawbacks of the selected papers together with their characteristics. Fujinawa et al. [6], Gao et al. [7], Gregor et al. [8], and Wu et al. [15] published too little technical information to be included.

2.1 Review of Selected Methodical Approaches

Since the approaches differ on a conceptual level (e.g., depth segmentation or hand pose estimation) as well as in their use of separate or combined sensors, we identify distinct tasks in touch recognition, which also structure the following sections. Each section contains the approaches of selected papers that made notable public contributions to that topic. Touch must first be detected (see Sect. 2.2). A detected touch can be associated with an approximate region (see Sect. 2.3) and reduced to a touchpoint whose accuracy is measured as the distance from the intended target point to the estimated point (see Sect. 2.4). The detected touchpoint can then be associated with the left or right hand (see Sect. 2.5), or further classified by fingertip, such as thumb, index finger, middle finger, ring finger, or little finger (see Sect. 2.6).

2.2 Touch Detection

The challenge of touch detection is sensing when a finger has physically contacted a surface [18, p. 1]. Cadena et al. [1] compare the average of the assumed touchpoint and neighboring depth values (due to noise) to a threshold above the learned background. Both Cadena et al. [1] and Xiao et al. [18] apply hysteresis to avoid rapid changes in the touch state. Xiao et al. [18] distinguish hovering from contact by including the neighborhood around the estimated touchpoint before comparing it to the threshold. Lee et al. [10] determine the touch region by averaging the depth values of several measurements, to reduce the noise of the touch panel, and comparing them to a threshold relative to the initially calibrated background [10, p. 4]. To detect the touchpoint, Lee et al. [10] then apply 3 × 3 block patterns to the non-background neighboring pixels of a touch region whose depth values lie below a threshold. Zhang et al. [19] propose a method to determine whether a fingertip contacts the surface based on a finger-modeling approach (see Sect. 2.3), comparing the depth values to the part of the finger around the fingertip.
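
To make this shared depth-threshold idea concrete, the following minimal sketch averages the depth around a candidate point and applies hysteresis against a calibrated background; the neighborhood size and threshold values are assumptions of this sketch and are not taken from the cited papers.

```python
import numpy as np

# Hysteresis thresholds in millimetres of finger height above the background.
# The concrete values are illustrative assumptions, not taken from the papers.
TOUCH_DOWN_MM = 10.0   # a new touch registers only below this height
TOUCH_UP_MM = 18.0     # an existing touch is released only above this height


def touch_state(depth, background, point, was_touching, radius=2):
    """Return True if the candidate `point` (row, col) is in contact.

    Assumes an overhead depth sensor measuring in millimetres, so the finger
    sits at a smaller depth value than the background surface. Depth is
    averaged over a small neighborhood around the point to suppress sensor
    noise, and two different thresholds (hysteresis) keep the touch state
    from flickering near the boundary.
    """
    r, c = point
    win = (slice(max(r - radius, 0), r + radius + 1),
           slice(max(c - radius, 0), c + radius + 1))
    height = float(np.mean(background[win].astype(np.float32)
                           - depth[win].astype(np.float32)))
    threshold = TOUCH_UP_MM if was_touching else TOUCH_DOWN_MM
    return height < threshold
```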

2.3 Touch Region Detection

Touch region detection describes how the region of interest (ROI), such as hands and fingers, is extracted while omitting the background. Xiao et al. [16, 18] divide touch tracking systems into background modeling and finger modeling approaches. A background modeling approach detects contact with the surface by comparing current depth values with a background map [18, pp. 2–3]. In contrast, finger modeling does not require background data because fingers are segmented based on their characteristics [18, pp. 2–3].

Wilson [14] suggested a static background model, comparing the surface depth values against a minimal and a maximal threshold. Choi et al. [4], Son et al. [13], and Lee et al. [10] also capture a static background depth image once. Cadena et al. [1] extended such methods by initially calibrating the background model with the arithmetic mean of each depth pixel. Chai et al. [2] use the first 30 frames to build the background model and calculate a per-pixel depth-difference background histogram. Lee et al. [10] let the user choose how the background depth values are obtained: the average, which is faster but can be incorrect if not enough depth values are available, or the mode, which is slower because sorting is needed. Xiao et al. [18] introduced a statistical model of the background inspired by “WorldKit” [17] to enable dynamic updating, allowing changes on the projection surface. Son et al. [13] detect the ROI using thresholds and extract the three largest connected components. Zhang et al. [19] remove static objects from the background to extract the hand blobs in the foreground image.
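
As an illustration of the static background-modeling idea, the following sketch calibrates a per-pixel mean background from an initial set of depth frames and then masks pixels within a height band above it; the function names, frame handling, and threshold values are assumptions made here, not the exact procedures of the cited implementations.

```python
import numpy as np


def calibrate_background(frames):
    """Per-pixel mean over calibration frames captured while the surface is empty.

    A per-pixel mode over the same frames is an alternative: it is more robust
    to transient outliers but slower, since it requires sorting or counting.
    """
    stack = np.stack([np.asarray(f, dtype=np.float32) for f in frames], axis=0)
    return stack.mean(axis=0)


def segment_foreground(depth, background, min_mm=8.0, max_mm=300.0):
    """Mask pixels lying between `min_mm` and `max_mm` above the background.

    The min/max band mirrors the static-threshold idea; the values here are
    illustrative. Connected-component filtering (e.g., keeping only the
    largest blobs) would typically follow this step.
    """
    height = background.astype(np.float32) - depth.astype(np.float32)
    return (height > min_mm) & (height < max_mm)
```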

2.4 Touchpoint Estimation

Both Xiao et al. [18] and Cadena et al. [1] complement the noisy depth data with IR information to better detect edge contours. Cadena et al. [1] extract the farthest point with respect to the center of the hand, additionally use a k-curvature algorithm to support closed fists or joined fingers, and correct the touchpoint towards the edges extracted from the IR image. Xiao et al. [18] extract the farthest point with respect to the center of each finger; if this fails, forward projection using the arm and hand position is used. Son et al. [13] compute the skeleton of the connected components using morphological operations, calculate the centroid and 20 corresponding extremal points using Dijkstra’s algorithm, and test them against an introduced complementary fingertip model with a cost function. Lee et al. [10] determine the touchpoint using a bounding box and 3 × 3 block patterns; the touch path is corrected using a predicted position, the measured position, and a weight parameter to compensate for the slow response in case of a sudden change in touchpoint movement. Fujinawa et al. [6] introduce interaction modes for different hand poses: a convolutional neural network (CNN) detects positions on the hand (e.g., fingertips, finger joints, center of the palm), obtains the touch position by comparing the depth value of each fingertip to the background depth, and selects an interaction mode (such as an index posture) to reduce false fingertip detections. Matsubara et al. [11] propose a method to detect touch by the shape of the finger’s shadow using two IR lights.
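
Two recurring building blocks of this section can be sketched compactly: the farthest-point heuristic for locating a fingertip candidate and a weighted blend of predicted and measured positions for path correction. The function names and the weight value below are illustrative assumptions rather than the exact formulations of the cited papers.

```python
import numpy as np


def farthest_touch_candidate(hand_mask, centroid):
    """Return the foreground pixel (row, col) farthest from the hand centroid.

    A bare-bones version of the 'farthest point from the hand/finger center'
    heuristic; refinements such as k-curvature analysis or snapping to IR
    edges, as used in the cited papers, are omitted.
    """
    ys, xs = np.nonzero(hand_mask)
    if ys.size == 0:
        return None
    d2 = (ys - centroid[0]) ** 2 + (xs - centroid[1]) ** 2
    i = int(np.argmax(d2))
    return (int(ys[i]), int(xs[i]))


def smooth_touch_path(predicted, measured, weight=0.6):
    """Blend predicted and measured touchpoints with a weight parameter.

    A simple weighted correction in the spirit of the path correction by
    Lee et al.; the weight value here is an assumption of this sketch.
    """
    return tuple(weight * m + (1.0 - weight) * p
                 for p, m in zip(predicted, measured))
```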

2.5 Arm, Hand, and Fingertip Detection

Cadena et al. [1] extract the arm-hand contours from the ROI and use the K-means algorithm to create two clusters, classifying the area containing more points as the hand region. Xiao et al. [18] use flood-fill segmentation to form a hierarchy containing arm, hand, finger, and fingertip, and reject finger-like objects. They use only the IR edges to fill fingertips, as the fingers otherwise merge with noise, but fall back to depth-only touch tracking if the IR image is unusable (e.g., holes in the edge image). Fujinawa et al. [6] distinguish hands from objects and detect the positions of fingertips, finger joints, and the center of the palm using their CNN. Zhang et al. [19] extract the hand region using the IR and depth images and apply a modified convex hull algorithm to detect candidate fingertips on the hand contour.
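
To illustrate the convex-hull step, the sketch below extracts fingertip candidates from a binary hand mask using standard OpenCV primitives; it is a simplified stand-in for the modified convex hull algorithm of Zhang et al. [19], and the distance threshold is an assumption of this sketch.

```python
import cv2
import numpy as np


def fingertip_candidates(hand_mask, min_dist_px=25.0):
    """Candidate fingertips as convex-hull points far from the hand centroid.

    Uses a plain convex hull on the largest contour of the hand mask and keeps
    hull points beyond a distance threshold from the centroid (OpenCV 4 API).
    """
    contours, _ = cv2.findContours(hand_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return []
    contour = max(contours, key=cv2.contourArea)
    m = cv2.moments(contour)
    if m["m00"] == 0:
        return []
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    hull = cv2.convexHull(contour).reshape(-1, 2)
    return [(int(x), int(y)) for x, y in hull
            if np.hypot(x - cx, y - cy) > min_dist_px]
```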

2.6 Fingertip Classification

Knowing which finger triggered a touch can bring extended functionality to the user (e.g., using the thumb as an eraser). As Choi et al. [4] stated, “determining the identity of a fingertip is more difficult than detecting fingertips” [4, p. 1488]. Choi et al. [4] noted that “index fingers and thumbs are the most frequently used fingers for human-computer interaction” and classify fingertips into (left and right) index and thumb fingertips using cascaded random forests and a score function [4, p. 1487]. Chai et al. [2] developed a deep CNN-based hand pose estimation method to identify fingertips.
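
As a hedged sketch of such a classification step, the following example fits a single random forest on hand-crafted per-fingertip feature vectors, standing in for the cascaded random forests and score function of Choi et al. [4]; the feature design and label set are assumptions made here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_fingertip_classifier(features, labels):
    """Fit a random forest on per-fingertip feature vectors.

    `features` is an (n_samples, n_features) array of illustrative descriptors
    (e.g., fingertip position relative to the hand centroid); `labels` holds
    strings such as 'left_index' or 'right_thumb'.
    """
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(features, labels)
    return clf


def classify_fingertip(clf, feature_vector):
    """Return the predicted fingertip label and its probability as a score."""
    probs = clf.predict_proba(np.asarray(feature_vector).reshape(1, -1))[0]
    best = int(np.argmax(probs))
    return clf.classes_[best], float(probs[best])
```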

3 Conclusion

Concerning the main research question, “What is the current state and where lies the potential for current research?”, touch tracking on tabletop projections using different sensors and methods has been reviewed. The analysis of the selected papers suggests an increasing and promising trend towards machine learning for estimating the hand posture. These approaches are closely related to mid-air recognition, an emerging area in current research that will also provide results for touch tracking in interactive tabletop projections. Nevertheless, implementations using background modeling are more refined in terms of fingertip recognition and accuracy. Different sensors (e.g., depth sensor, IR sensor) are used, or even combined, to make surfaces touchable. Since it is a challenge to select a specific method for an applied solution, a testing framework would ensure the comparability of methods in the domain of interactive tabletop projections. It would be interesting to use low-power computing devices to enable further implementations in the emerging area of the Internet of Things (IoT). New technological possibilities, such as solid-state LIDAR or radar technology, could also provide new opportunities for the application of interactive tabletop projections. The classification of fingers and hands, but also of touch gestures, can improve touch recognition and offer new functionality for touch input (e.g., using the little finger as an eraser). Combining the mature results of sensor processing with new insights from anatomy and machine learning, or even new sensors (e.g., radar), would create more accurate and functional input systems.