# Estimate Hand Poses Efficiently from Single Depth Images

## Abstract

This paper tackles the practically challenging problem of *efficient* and *accurate* hand pose estimation from single depth images. A dedicated two-step regression forest pipeline is proposed: given an input hand depth image, step one mainly estimates the 3D location and in-plane rotation of the hand using a pixel-wise regression forest. Its output is utilized in step two, which delivers the final hand pose estimate via a similar regression forest model operating on the entire hand image patch. Moreover, our estimation is guided by an internally maintained 3D hand kinematic chain model. For an unseen test image, the kinematic model parameters are estimated by a proposed dynamically weighted scheme. As a combined effect of these building blocks, our approach delivers more precise estimates of hand poses. In practice, our approach runs at 15.6 frames per second (FPS) on an average laptop when implemented on the CPU, and is further sped up to 67.2 FPS when running on a GPU. In addition, we introduce and make publicly available a data-glove-annotated depth image dataset covering various hand shapes and gestures, which enables us to conduct quantitative analyses on real-world hand images. The effectiveness of our approach is verified empirically on both synthetic and annotated real-world datasets for hand pose estimation, as well as on related applications including part-based labeling and gesture classification. In addition to the empirical studies, the consistency property of our approach is also analyzed theoretically.

### Keywords

Hand pose estimation · Depth images · GPU acceleration · Regression forests · Consistency analysis · Annotated hand image dataset

## 1 Introduction

Vision-based hand interpretation plays an important role in diverse applications including humanoid animation (Sueda et al. 2008; Wang and Popović 2009), robotic control (Gustus et al. 2012), and human-computer interaction (Hackenberg et al. 2011; Melax et al. 2013), among others. At its core lies the challenging problem of 3D hand pose estimation (Erol et al. 2007; Gorce et al. 2011), owing mostly to the complex and dexterous nature of hand articulations (Gustus et al. 2012). Facilitated by emerging commodity-level depth cameras (Kinect 2011; Softkinetic 2012), recent efforts such as Keskin et al. (2012), Ye et al. (2013), Xu and Cheng (2013), and Tang et al. (2014) have led to noticeable progress in the field. The problem is, however, still far from satisfactorily solved. For example, little quantitative analysis has been conducted on annotated real-world 3D datasets, partly due to the practical difficulty of setting up such testbeds. This imposes significant restrictions on the evaluation of existing efforts, which are often either judged visually on a number of real depth images, or verified quantitatively on synthetic images only, where the ground truths are naturally known. As each work utilizes its own set of images, their results are not entirely comparable. These issues inevitably raise concerns about progress evaluation and reproducibility.

In this work,^{1} we tackle the problem of efficient hand pose estimation from single depth images. The main contributions of our work are fourfold:

- (1)
For an unseen test image, a dynamically weighted scheme is proposed to regress our hand model parameters based on a two-step pipeline, which is empirically shown to lead to significantly reduced errors compared to the state of the art. As presented in Fig. 1 as well as in the supplementary video, our system estimates hand poses, including 3D orientations, from single images. This also enables our system to work with a mobile depth camera.

- (2)
We provide an extensive, data-glove-annotated benchmark of depth images for general hand pose estimation. The benchmark dataset, together with the ground truths and the evaluation metric implementation, has been made publicly available. To our knowledge this is the first benchmark of its kind, and we hope it can provide an option for researchers in the field to compare performance on the same ground.

- (3)
To maintain efficiency and to reduce the CPU load, the most time-consuming components of our approach have been identified and accelerated with GPU implementations, which gains us a further fivefold overall speed-up. This enables us to deliver a practical hand pose estimation system that works efficiently, at about 15.6 frames per second (FPS) on an average laptop and 67.2 FPS with access to a mid-range GPU.

- (4)
The reliance on synthetic training examples naturally brings up the consistency question when infinitely many examples are potentially available for training. Our paper makes a first attempt to propose a regression forest-based hand pose system that is theoretically motivated. To this end, we are able to provide a consistency analysis for a simple variant of our learning system. Although the complete analysis remains open, we believe this is a necessary and important step toward a full understanding of random forests, which have worked so well in a number of practical applications.

### 1.1 Related Work

An earlier version of our work appeared in Xu and Cheng (2013), which deals with the problem of depth image-based hand pose estimation. Our work here differs from that of Xu and Cheng (2013) in a number of ways: first, a simple two-step pipeline is utilized in our approach, in contrast to the more complicated three-step approach of Xu and Cheng (2013). Second, in this work we consider random forest models that can be analyzed theoretically, whereas the random forest models in Xu and Cheng (2013) do not lend themselves to theoretical study. Third, there are many other differences: the kinematic model parameters are estimated by a dynamically weighted scheme that leads to a significant error reduction in empirical evaluations; the information gains and split criteria, the usage of the whole hand image patch rather than individual pixels, as well as the DOT features to be detailed later, are also quite different.

Meanwhile, various related regression forest models have been investigated recently: in Fanelli et al. (2011), the head pose has 6 degrees of freedom (DoF), divided into two parts: 3D translation and 3D orientation. In each leaf node, the distribution is approximated by a 3D Gaussian. In Gall and Lempitsky (2013), the Hough forest model is instead utilized to represent the underlying distributions as votes with a set of 3D vectors. In Shotton et al. (2013), a *fixed* weight is assigned to each of the 3D voting vectors during training, and the experimental results suggest that the weight plays a crucial role in body pose estimation. Our scheme of *dynamical* weights can be regarded as a further extension of this idea that allows adaptive weight estimation at test time, dedicated to the current test example. A binary latent tree model is used in Tang et al. (2014) to guide the search process for 3D locations of hand joints.
For the related problem of video-based 3D hand tracking, a user-specific modeling method is proposed by Taylor et al. (2014), while Oikonomidis et al. (2014) adopt an evolutionary optimization method to capture hand and object interactions.

Leapmotion (2013) is a commercial system designed for close-range (within about 50 cm in depth) hand pose estimation. As a closed system based on proprietary hardware, its inner working mechanism remains undisclosed. Our observation is that it does not tolerate self-occlusion of the fingertips well. In contrast, our system works beyond half a meter in depth, and works well when some of the fingertips are occluded, as it does not rely on detecting fingertips.

Additionally, instead of directly estimating 3D locations of finger joints from the depth image as in e.g. Girshick et al. (2011), our model predicts the parameters of a predefined hand kinematic chain, which is further utilized to build the 3D hand. This is mainly because, first, compared to the 3D locations of joints, the kinematic chain is a global representation and is more tolerant to self-occlusion, a scenario often encountered in our hand pose estimation context. Second, for human pose estimation, once the body location is known (i.e., the torso is fixed), the limbs and the head can be roughly considered as independent sub-chains: e.g., a movement of the left hand will not affect the other parts significantly. In contrast, motions of the five fingers and the palm are tightly correlated.

For the related problem of optical motion capture, depth cameras have been utilized either on their own (Ballan et al. 2012), or together with an existing marker-based system (Zhao et al. 2012; Sridhar et al. 2013) for markerless optical motion capture. While able to produce more precise results, these methods typically rely on more than one camera, and operate in an off-line fashion. In terms of annotated datasets of hand images, existing efforts are typically annotated manually with either part-based labels (e.g. Tang et al. 2014) or fingertips (Sridhar et al. 2013). These annotations, however, do not explicitly offer 3D information of the skeletal joints. Tzionas and Gall (2013) instead engage a human annotator to annotate 2D locations of joints, and aggregate them to infer the 3D hand joint locations. Unfortunately, this dataset does not provide depth image input; one possible concern is also that its annotation might not be fully objective.

In what follows, we start by giving an overall account of the regression forest models that are the core modules in our proposed learning system.

## 2 Our Theoretically Motivated Regression Forest Models

As will become clear in later sections, the training of our regression forests relies on large quantities of synthetic examples. It is thus of central interest to provide a *consistency* analysis characterizing their asymptotic behavior, which concerns the convergence of the estimate to an optimal estimate as the sample size goes to infinity. Most existing papers (Breiman 2004; Biau et al. 2008; Biau 2012; Denil et al. 2014) on the consistency of regression forests focus on stylized and simplified algorithms. The unpublished manuscript of Breiman (2004) suggests a simplified version of random forests and provides a heuristic analysis of its consistency. This model is further analyzed in Biau (2012) where, besides consistency, the author also shows that the rate of convergence depends only on the number of strong features. An important work on the consistency of random forests for classification is Biau et al. (2008), which provides consistency theorems for various versions of random forests and other randomized ensemble classifiers. Despite these efforts, there is still a noticeable gap between the theory and practice of regression forest learning systems. This is particularly true for the pose estimation systems that have made tremendous progress during the past few years on full-body, head, and hand poses, where random forests have been very successful. Yet little theoretical analysis has been provided for the learning systems underpinning these empirical successes.

Different from most existing practical random forest models, the random forest model considered in our approach is theoretically motivated, inspired by existing theoretical works (Breiman 2004; Biau et al. 2008; Biau 2012) and in particular Denil et al. (2014). The theoretical analysis of the resulting random forest model closely follows that of Denil et al. (2014). Meanwhile, our proposed random forest model is sufficiently sophisticated to be practically capable of addressing real-world problems. Our model and its variants are specifically applied to the real problem of hand pose estimation with competitive empirical performance. It is worth noting that the proposed models are generic and can work on problems beyond hand pose estimation.

In what follows, we introduce our generic regression forest model in terms of training data partition, split criteria during tree construction, prediction, as well as its variants. Its asymptotic *consistency* analysis is also offered.

### 2.1 Training Data: A Partition into Structure Data and Estimation Data

Formally, let \(\overline{X}\) denote a \([0,1]^d\)-valued random variable and \(\overline{Y}\) denote a \({\mathbb {R}}^{q}\)-valued vector of random variables, where \(d\) is the dimension of the normalized feature space, and \(q\) is the dimension of the label space. Denote by \(\overline{\mathbf {x}}\) (or \(\overline{\mathbf {y}}\)) a realization of the random variable \(\overline{X}\) (or \(\overline{Y}\)). A training example can be defined as an (instance, label) pair, \(\left( \overline{\mathbf {x}}, \overline{\mathbf {y}} \right) \). Therefore, the set of \(n\) training examples is represented as \(D_n= \big \{ (\overline{\mathbf {x}}_{i}, \overline{\mathbf {y}}_{i}) \big \}_{i=1}^n\). Inspired by Denil et al. (2014), during tree construction we partition \(D_n\) randomly into two parts, *structure data* \(U_n\) and *estimation data* \(E_n\), by randomly selecting \(\lfloor \frac{n}{2}\rfloor \) examples as structure data and the rest as estimation data. Examples in the structure data are used to determine the tests used in split nodes (i.e. internal nodes of the trees), and examples in the estimation data are retained in each leaf node of the tree for making predictions at test time. This way, once the partition of the training sample is provided, the randomness in the construction of the tree remains independent of the estimation data, which is necessary to ensure consistency in the follow-up theoretical analysis.
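A minimal sketch of this random partition in plain Python (the function name and seeding are illustrative, not from the paper):

```python
import random

def partition_training_data(examples, seed=0):
    # Randomly split D_n into structure data U_n (floor(n/2) examples,
    # used to choose split tests) and estimation data E_n (the rest,
    # retained in the leaves for prediction), as in Sect. 2.1.
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    half = len(examples) // 2
    structure = [examples[i] for i in idx[:half]]
    estimation = [examples[i] for i in idx[half:]]
    return structure, estimation
```

Keeping the two subsets disjoint is what makes the tree structure independent of the estimation data in the analysis.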

### 2.2 The Split Criteria

The performance of regression forest models is known to be crucially determined by the decision tree construction and particularly the split criteria, which are our focus here.

In the regression forests, each decision tree is independently constructed. The tree construction process can be equivalently interpreted as a successive partition of the feature space, \([0,1]^d\), with axis-aligned split functions. That is, starting from the root node of a decision tree, which encloses the entire feature space, each tree node corresponds to a specific rectangular hypercube whose volume decreases monotonically as we visit nodes deeper in the tree. Finally, the union of the hypercubes associated with the set of leaf nodes forms a complete partition of the feature space.

Similar to existing regression forests in the literature, including Fanelli et al. (2011), Shotton et al. (2011), and Denil et al. (2014), at a split node we randomly select a relatively small set of \(s\) distinct features \(\varPhi := \{\phi _i\}_{i=1}^s\) from the \(d\)-dimensional space as candidate features (i.e. entries of the feature vector). \(s\) is obtained via \(s \sim 1+ \mathrm {Binomial}(d-1,p)\), where \(\mathrm {Binomial}\) denotes the binomial distribution and \(p>0\) is a predefined probability. Denote by \(t \in {\mathbb {R}}\) a threshold. At every candidate feature dimension, we first randomly select \(M\) structure examples in this node, where \(M\) is the smaller of the number of structure examples in this node and a user-specified integer \(m_0\) (\(m_0\) is independent of the training size); we then project them onto the candidate feature dimension and uniformly select a set of candidate thresholds \({\mathcal {T}}\) over the projections of the \(M\) chosen examples. The best test \((\phi ^*, t^*)\) is chosen from these \(s\) features and accompanying thresholds by maximizing the information gain that is defined next. This procedure is then repeated until the tree has \(\lceil \log _2 L_n\rceil \) levels *or* until further splitting of a node would result in fewer than \(k_n\) estimation examples.
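The candidate-test sampling above can be sketched as follows; parameter values are hypothetical, and the binomial draw for \(s\) is simulated by summing \(d-1\) Bernoulli trials:

```python
import random

def sample_split_candidates(structure_data, d, p=0.5, m0=32, n_thresholds=10, seed=0):
    # structure_data: list of d-dimensional feature vectors reaching this node.
    rng = random.Random(seed)
    # s ~ 1 + Binomial(d - 1, p): number of candidate feature dimensions.
    s = 1 + sum(rng.random() < p for _ in range(d - 1))
    dims = rng.sample(range(d), s)
    candidates = []
    for phi in dims:
        # Project at most m0 randomly chosen structure examples onto dimension phi.
        M = min(len(structure_data), m0)
        proj = sorted(x[phi] for x in rng.sample(structure_data, M))
        lo, hi = proj[0], proj[-1]
        # Thresholds spread uniformly over the projected range.
        for j in range(1, n_thresholds + 1):
            candidates.append((phi, lo + (hi - lo) * j / (n_thresholds + 1)))
    return candidates
```

The best test \((\phi^*, t^*)\) is then picked from these candidates by maximizing the information gain.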

### 2.3 Prediction

*Baseline-M*. Before proceeding to its asymptotic analysis in Sect. 2.4, we would like to further discuss two more sophisticated variants that share the same training stage as illustrated above and differ only at the prediction stage: the first is the forest model using static weights, and is called *Baseline-S*; the second is a regression forest model using dynamical weights at the leaf nodes, and is termed *DHand*.

#### 2.3.1 Baseline-S versus DHand: Static Versus Dynamical Weights at the Leaf Nodes

Instead of making predictions with the empirical average as in *Baseline-M*, we consider delivering the final prediction by mode-seeking over the votes, as in the typical Hough forests of Gall and Lempitsky (2013). More specifically, let \(l\) denote the current leaf node, and let \(i\in \{1,\dots , k_l\}\) index over the training examples of leaf node \(l\). These examples are subsequently included as vote vectors in the voting space. Now, consider a more general scenario where each of the training examples has its own weight. Let \(\mathbf {z}_{li}\) represent the parameter vector of a particular training example \(i\) of leaf node \(l\), together with \(w_{li} > 0\) as its corresponding weight. The set of weighted training examples at leaf node \(l\) can thus be defined as \({\mathcal {V}}_{l}=\big \{ (\mathbf {z}_{li}, w_{li}) \big \}_{i=1}^{k_l}\). Note this empirical vote set defines a point set or, equivalently, an empirical distribution. In existing literature such as Gall and Lempitsky (2013), \(w_{li} =1\) for any training example \(i\) and any leaf node \(l\). In other words, the empirical distribution \({\mathcal {V}}_{l}\) is determined during tree construction in the training stage, and remains unchanged during the prediction stage. This is referred to as the statically weighted scheme, or *Baseline-S*. Instead, we consider a dynamically weighted scheme (i.e. *DHand*) where each of the weights, \(w_{li}\), can be decided at runtime. This is inspired by the observation that the typical distribution of \({\mathcal {V}}_{l}\) tends to be highly multi-modal. It is therefore crucial to assign each instance \(\mathbf {z}_{li}\) a weight \(w_{li}\) that properly reflects its influence on the test instance.
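To make the weighted mode-seeking concrete, here is a one-dimensional weighted mean-shift sketch with a Gaussian kernel; the bandwidth and kernel choice are illustrative, and the actual voting space in the paper is multi-dimensional:

```python
import math

def weighted_mean_shift(votes, weights, bandwidth=1.0, iters=30):
    # Seek the dominant mode of the weighted empirical vote distribution
    # {(z_i, w_i)}: each iteration re-centres on the kernel-weighted mean.
    x = sum(w * v for v, w in zip(votes, weights)) / sum(weights)
    for _ in range(iters):
        ks = [w * math.exp(-0.5 * ((v - x) / bandwidth) ** 2)
              for v, w in zip(votes, weights)]
        x = sum(k * v for k, v in zip(ks, votes)) / sum(ks)
    return x
```

With static weights (*Baseline-S*) the \(w_i\) are fixed at training time; *DHand* recomputes them per test image before this mode-seeking step.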

*DOT* (Hinterstoisser et al. 2010)*:* As illustrated in Fig. 3, the DOT feature is used in our context to compute a similarity score \(C(I, I_T)\) between an input hand patch \(I\) and a reference hand patch \(I_T\). DOT works by dividing the image patch into a series of blocks of size \(8\times 8\) pixels, where each block is encoded using the pixel gradient information as follows: denote by \(\eta \) the orientation of the gradient at a pixel, with its range \([0, 2\pi )\) quantized into \(n_{\eta }\) bins, \(\{0,1,\ldots ,n_{\eta }-1\}\). We empirically set \(n_{\eta }=8\); the span of each bin is thus \(45^{\circ }\). This way, \(\eta \) can be encoded as a vector \(\mathbf {o}\) of length \(n_{\eta }\), by assigning 1 to the bin in which it resides and 0 otherwise. We set \(\mathbf {o}\) to the zero vector if there is no dominant orientation at the pixel. Now, for each block of the input patch, its local dominant orientation \(\eta ^*\) is simply defined as the orientation of the maximum gradient within this block, which gives the corresponding vector \(\mathbf {o}^*\). Meanwhile, for each block in a template patch, to improve the robustness of DOT matching, we utilize a list of local dominant orientations \(\{\eta _{1}^*,\eta _{2}^*,\ldots ,\eta _{r}^*\}\), each corresponding to the template under a slight translation. Each entry of the list is mapped to the aforementioned orientation vector, and by applying bitwise OR operations successively to these vectors, they are merged into the vector \(\mathbf {o}_T^*\).
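A sketch of this encoding and matching in Python; the one-hot byte layout follows the \(n_\eta = 8\) quantization above, while the block-wise similarity rule (counting blocks whose orientation sets intersect) is a plausible reading of DOT rather than the paper's exact scoring:

```python
import math

N_BINS = 8  # n_eta = 8: each bin spans 45 degrees

def orientation_bit(gx, gy, min_sq_mag=1e-6):
    # Encode the gradient orientation in [0, 2*pi) as a one-hot byte;
    # return 0 when there is no dominant gradient at the pixel.
    if gx * gx + gy * gy < min_sq_mag:
        return 0
    eta = math.atan2(gy, gx) % (2.0 * math.pi)
    return 1 << (int(eta / (2.0 * math.pi) * N_BINS) % N_BINS)

def merge_template(bits):
    # Bitwise-OR the dominant-orientation bytes of slightly translated
    # templates into a single byte o_T^* per block.
    out = 0
    for b in bits:
        out |= b
    return out

def dot_similarity(input_bits, template_bits):
    # Count blocks whose input orientation matches any template orientation.
    return sum(1 for i, t in zip(input_bits, template_bits) if i & t)
```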

### 2.4 Theoretical Analysis

We analyze the consistency of our basic model (*Baseline-M*). Denote by \((\overline{X},\overline{Y})\) a pair of random variables following a certain joint distribution, and by \(\mu \) the marginal distribution of \(\overline{X}\in [0,1]^d\). In regression analysis, one is interested in estimating the regression function \(r(\overline{\mathbf {x}}):={\mathbb {E}}\{\overline{Y}|\overline{X}=\overline{\mathbf {x}}\}\) for fixed \(\overline{\mathbf {x}}\) based on the training sample. A sequence of regression estimates \(r_n(\overline{\mathbf {x}})\) is called *weakly consistent* for a certain distribution of \((\overline{X},\overline{Y})\) if

\[ \lim _{n\rightarrow \infty } {\mathbb {E}} \left\{ \int \big \Vert r_n(\overline{\mathbf {x}}) - r(\overline{\mathbf {x}}) \big \Vert ^2 \mu (\mathrm {d}\overline{\mathbf {x}}) \right\} = 0. \]

We are now ready to state the consistency result of *Baseline-M*.

**Theorem 1**

Assume that \(\overline{X}\) is uniformly distributed on \([0,1]^d\) and \({\mathbb {E}} \Big \{ \big \Vert \overline{Y} \big \Vert ^2 \Big \}<\infty \), and suppose the regression function \(r(\overline{\mathbf {x}})\) is bounded. Then the regression forest estimates \(\big \{ r_n^{(Z)} \big \}\) of our *Baseline-M* model in (4) are consistent whenever \(\lceil \log _2L_n\rceil \rightarrow \infty \), \(\frac{L_n}{n}\rightarrow 0\) and \(\frac{k_n}{n}\rightarrow 0\) as \(n\rightarrow \infty \).

Proof details of our consistency theorem are relegated to the appendix. Recall that the optimal estimator is the regression function \(r(\overline{\mathbf {x}})\), which is usually unknown. The theorem guarantees that as the amount of data increases, the probability that the estimate \(r_n(\overline{\mathbf {x}})\) of our regression forests lies within a small neighbourhood of the optimal estimator approaches one. In our context, where infinitely many synthetic examples are potentially available for training, it suggests that our estimate, constructed by learning from a large amount of examples, is optimal with high probability.

Recall from the tree construction of Sect. 2.2 that a node stops splitting (i.e. is declared a leaf) when either of the following conditions holds:

- (1)
The maximum tree depth \(\lceil \log _{2}L_n\rceil \) is reached.

- (2)
The splitting of the node using the selected split point would result in a child with fewer than \(k_n\) estimation points.

Note that the above consistency analysis applies to the basic model (*Baseline-M*), while the analysis of *Baseline-S* and *DHand* remains open for future study. Meanwhile, *DHand* is empirically shown to outperform the other two models by a large margin.

## 3 The Pipeline of Our Learning System

### 3.1 Preprocessing

Our approach relies on *synthetic* hand examples for training, where each training example contains a synthetic hand depth image and its corresponding 3D pose. The learned system is then applied to *real* depth images at test time for pose estimation. In particular, existing commodity-level depth cameras commonly produce depth noise, which renders real images noticeably different from synthetic ones. For TOF cameras, this is addressed by applying a median filter to clear away outliers, followed by a Gaussian filter to smooth out random noise. The amplitude image, also known as the confidence map, is used to filter out the so-called “flying pixel” noise (Hansard et al. 2013): pixels with low confidence values are treated as background. For structured-illumination cameras, the preprocessing strategy of Xu and Cheng (2013) can be applied. To obtain a hand image patch, a simple background removal technique similar to that of Shotton et al. (2011) is adopted, followed by image cropping to obtain a hand-centered bounding box. Moreover, to accommodate hand size variations, a simple calibration process is applied to properly scale a new hand to match the size of the training hands, by acquiring an initial image with the hand fully stretched and flat, and all fingers spread wide. Empirically, these preprocessing ingredients are shown to work sufficiently well.
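The outlier-removal and confidence-filtering steps can be sketched in pure Python (the threshold, background value, and border handling are illustrative; a real system would use an image-processing library):

```python
def median3x3(img):
    # Median filter over 3x3 windows to clear away depth outliers;
    # border pixels are left untouched for simplicity.
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            win = sorted(img[yy][xx]
                         for yy in (y - 1, y, y + 1)
                         for xx in (x - 1, x, x + 1))
            out[y][x] = win[4]  # the median of 9 values
    return out

def mask_low_confidence(depth, confidence, threshold):
    # Treat pixels with low amplitude/confidence as background (here: 0),
    # suppressing "flying pixel" noise.
    return [[d if c >= threshold else 0
             for d, c in zip(drow, crow)]
            for drow, crow in zip(depth, confidence)]
```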

### 3.2 Our Two-Step Pipeline

After preprocessing, our approach consists of two major steps, as in Fig. 2: step one mainly estimates the 3D location and in-plane rotation of the hand base (i.e. wrist) using a regression forest. This is utilized in step two, which establishes its coordinate frame based on the estimated 3D location and in-plane rotation. In step two, a similar regression forest model estimates the remaining parameters of our hand kinematic model by a dynamically weighted scheme, which produces the final pose estimation. Note that, different from existing methods such as Keskin et al. (2012), where the conditional model requires many forest models (each catering to one particular condition) to be prepared and kept in memory, our pipeline design requires only one forest model after the translation and in-plane rotation of step one establish the canonical coordinate frame.

In both steps of our pipeline, two almost identical regression forests are adopted. In what follows, separate descriptions are provided that underline their differences. This allows us to present three variants of our learning system with a slight abuse of notation: the *Baseline-M* system employs the basic *Baseline-M* regression model in both steps; similarly, the *Baseline-S* system utilizes the *Baseline-S* models in both steps; finally, the *DHand* system applies the *DHand* regression forest model only at step two, while the *Baseline-S* model is still engaged in step one. It is worth mentioning that for the *Baseline-M* system, our theoretical analysis applies to the regression forest models used in both steps of our pipeline.

Before proceeding to our main steps, we would like to introduce the 3D hand poses, the related depth features and tests utilized in our approach, which are based on existing techniques as follows:

*Our Kinematic Chain Model:* As displayed in Fig. 4, the representation of our 3D hand poses follows that of Xu and Cheng (2013): 4 DoF are used for each of the five fingers, 1 DoF is explicitly for palm bending, and 6 DoF are reserved for the global hand location (\(x_1, x_2, x_3\)) and orientation (\(\alpha , \beta , \gamma \)), where \(\alpha \) stands for the in-plane rotation. This amounts to a 27-dimensional vector \(\varTheta := \big ( x_1, x_2, x_3, \alpha , \beta , \gamma , \ldots \big )\) as the hand kinematic chain model, used in our system to represent each of the 3D hand poses. Sometimes it is more convenient to write \(\varTheta = \big ( x_1, x_2, x_3, \alpha , \mathbf {z} \big )\), with \(\mathbf {z}\) being a 23-dimensional sub-vector.
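The 27-DoF pose can be packed as a flat vector as follows; the exact ordering of the entries after \((x_1, x_2, x_3, \alpha)\) is an assumption for illustration:

```python
def make_theta(x1, x2, x3, alpha, beta, gamma, palm_bend, finger_dofs):
    # 6 global DoF + 1 palm-bend DoF + 5 fingers x 4 DoF = 27 DoF.
    # finger_dofs: five 4-tuples of joint angles (ordering assumed).
    assert len(finger_dofs) == 5 and all(len(f) == 4 for f in finger_dofs)
    theta = [x1, x2, x3, alpha, beta, gamma, palm_bend]
    for f in finger_dofs:
        theta.extend(f)
    return theta
```

The sub-vector \(\mathbf {z}\) then corresponds to everything after the first four entries.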

*Depth Features and Binary Tests:* Let \(I\) denote the hand image patch obtained from the raw depth image. Without loss of generality, one depth image is assumed to contain only one right hand. The depth features as mentioned in Shotton et al. (2011) are adapted to our context here. That is, at a given pixel location \(\mathbf {x}=(\hat{x}_1, \hat{x}_2)\) of a hand patch \(I\), denote its depth value as a mapping \(d_I (\mathbf {x})\), and construct a feature \(\phi (\mathbf {x})\) by considering two 2D offset positions \(\mathrm {u}, \mathrm {v}\) from \(\mathbf {x}\):
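Since Eq. (7) is not reproduced in this excerpt, the sketch below assumes the standard depth-comparison form of Shotton et al. (2011), where both offsets are normalized by the depth at \(\mathbf {x}\) and out-of-patch probes return a large constant:

```python
LARGE_DEPTH = 1e6  # assumed background value for probes falling outside the patch

def depth_feature(depth, x, u, v):
    # depth: 2D list of depth values; x, u, v: (row, col) tuples.
    # phi(x) = d_I(x + u / d_I(x)) - d_I(x + v / d_I(x)), with the
    # offsets scaled by 1 / d_I(x) to make the feature depth-invariant.
    def probe(p):
        r, c = p
        if 0 <= r < len(depth) and 0 <= c < len(depth[0]):
            return depth[r][c]
        return LARGE_DEPTH
    d_x = probe(x)
    pu = (x[0] + int(round(u[0] / d_x)), x[1] + int(round(u[1] / d_x)))
    pv = (x[0] + int(round(v[0] / d_x)), x[1] + int(round(v[1] / d_x)))
    return probe(pu) - probe(pv)
```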

#### 3.2.1 Step One: Estimation of Coordinate Origin and In-Plane Rotation

This step is to estimate the 3D location and in-plane rotation of the hand base, namely (\(x_1, x_2, x_3, \alpha \)), which forms the origin of our to-be-used coordinate in step two. The (instance, label) pair of an example in step one is specified as follows: the instance (aka feature vector) \(\overline{\mathbf {x}}\) is obtained from an image patch centered at current pixel location, \(\mathbf {x}=(\hat{x}_1, \hat{x}_2)\). Each element of \(\overline{\mathbf {x}}\) is realized by feeding particular \(u,v\) offset values in (7). Correspondingly the label of each example \(\overline{\mathbf {y}}\) is the first four elements of the full pose label vector \(\varTheta \), namely (\(x_1, x_2, x_3, \alpha \)). A regression forest is used to predict these parameters as follows: every pixel location in the hand image patch determines a training example, which is parsed by each of the \(T_1\) trees, resulting in a path from the root to certain leaf node that stores a collection of training examples. Empirically we observe that this 3D origin location and in-plane rotation are usually estimated fairly accurately.

*Split Criterion of the First Step*

For the regression forest of the first step, its input is an image patch centered at current pixel, from which it produces the 4-dimensional parameters (\(x_1, x_2, x_3, \alpha \)). The entropy term of (2) is naturally computed in this 4-dimensional space (i.e. \(q=4\)).

#### 3.2.2 Step Two: Pose Estimation

The depth pixel values of a hand image patch naturally form a 3D point cloud. With the output of step one, the point cloud is translated so that (\(x_1, x_2, x_3\)) becomes the coordinate origin, which is followed by a reverse rotation to the canonical hand pose using the estimated in-plane rotation \(\alpha \). An almost identical regression forest is then constructed to deliver the hand pose estimation: with the location output of step one, (\(x_1, x_2, x_3\)), as the coordinate origin, each *entire* hand patch from training is parsed by each of the \(T_2\) trees, leading down the tree path to a certain leaf node. The regression forest model of this step then delivers a 23-dimensional parameter vector \(\mathbf {z}\), by aggregating the votes of the training examples of the leaf nodes. The final 27-dimensional parameter estimate \(\varTheta \) is then obtained by direct composition of the results from both steps. Meanwhile, for step two, \(\overline{\mathbf {x}}\) stands for a feature vector of the entire hand image patch, while \(\overline{\mathbf {y}}:=\mathbf {z}\) represents the remaining 23 elements of \(\varTheta \).
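The translate-then-reverse-rotate canonicalization of the point cloud can be sketched as follows, assuming the in-plane rotation \(\alpha \) is about the camera's depth axis:

```python
import math

def canonicalize_cloud(points, origin, alpha):
    # Translate so that origin = (x1, x2, x3) becomes the coordinate
    # origin, then rotate by -alpha to undo the in-plane rotation.
    ca, sa = math.cos(-alpha), math.sin(-alpha)
    out = []
    for px, py, pz in points:
        tx, ty, tz = px - origin[0], py - origin[1], pz - origin[2]
        out.append((ca * tx - sa * ty, sa * tx + ca * ty, tz))
    return out
```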

*Split Criterion of the Second Step*

The second step focuses on estimating the remaining 23-dimensional parameters, which reside in a much larger space than the one we considered during the first step. Straightforwardly following the same procedure as in step one would thus inevitably lead us to work with a very sparsely distributed empirical point set in a relatively high-dimensional space. Consequently, it would consume a considerable amount of time, while the results might be unstable. Instead, we consider an alternative strategy.

## 4 GPU Acceleration

### 4.1 Motivation

Typically, a leaf node is expected to contain similar poses. The vast set of feasible poses, however, implies a conflicting aim: on the one hand, this can be achieved by making ready as many training examples as possible; on the other hand, we practically prefer a small memory footprint for our system, thus limiting the amount of data. A good compromise is obtained by imposing a set of small random perturbations, including 2D translations, rotations, and hand size scalings, on each existing training instance \(I_t\). This way, a leaf node usually has a better chance to work with an enriched set of similar poses. Concretely, we remap \(I_t\) using \(m_t\) transformation maps. Every transformation map is generated from such a set of small random perturbations of the same hand gesture, and is of the same dimensions (i.e. \(w \times h\)) as \(I_t\). After remapping the values of \(I_t\) using a transformation map, its DOT features are generated and compared with each of the features of the \(k_l\) instances \(I_{li}\) to obtain a similarity score. These DOT-related executions turn out to be *the* computation bottleneck in our CPU implementation, and they can be substantially accelerated on GPU by exploiting the massive parallelism inherent in these steps. It is worth noting that the purpose here is to duplicate our CPU system with a GPU-native implementation, in order to obtain the same performance in much reduced time.

### 4.2 Using Texture Units for Remapping

To perform this remapping on GPU, we first launch one thread for every \(\mathbf {x}\) to read its \(\mathbf {x}_f\) from \(M_t\). Since all the threads in a thread block read from adjacent locations in GPU memory in sequence, the memory reads are perfectly coalesced. To obtain a depth value at \(\mathbf {x}_f\), we use the four pixels whose locations are the closest to it in \(I_t\). The resulting depth value is computed by bi-linear interpolation of the depth values at these four pixels (Andrews and Patterson 1976; Hamming 1998). Reading the four pixels around \(\mathbf {x}_f\) is inefficient, since the image is stored in memory in row order and memory accesses by adjacent threads can span multiple rows and columns of \(I_t\), and thus cannot be coalesced. This type of memory access is not a problem in CPU computation due to its deep hierarchy of caches with large cache memories at each level. However, the data caches in GPU architectures are tiny in comparison and are not very useful for this computation. The row-order memory layout that is commonly used has poor locality of reference; an isotropic memory layout is needed instead, with no preferred access direction. This operation can therefore be performed on GPU by utilizing its two-dimensional texture memory, which ensures that pixels that are local in image space are almost always local in the memory layout (Peachey 1990).
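For reference, the bi-linear interpolation each thread performs (and which the texture unit can carry out in hardware) looks like this on the CPU side:

```python
def bilerp(img, xf, yf):
    # Interpolate a 2D image at fractional location (xf, yf) from its
    # four nearest pixels; coordinates are clamped at the image border.
    x0, y0 = int(xf), int(yf)
    x1 = min(x0 + 1, len(img[0]) - 1)
    y1 = min(y0 + 1, len(img) - 1)
    fx, fy = xf - x0, yf - y0
    top = img[y0][x0] * (1.0 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1.0 - fx) + img[y1][x1] * fx
    return top * (1.0 - fy) + bot * fy
```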

### 4.3 Computing DOT Features

Computing the DOT features for each of the \(m_t\) remapped images takes two steps: computing the gradient at every pixel, and then the dominant orientation of every block in the image. One thread is launched on GPU for every pixel to compute its X and Y gradient values. We apply a \(3 \times 3\) Sobel filter to compute the gradient, and the memory reads are coalesced across a warp of threads for efficiency. Using the gradient values, the magnitude and angle of the gradient vector are computed and stored in GPU memory. We use the fast intrinsic functions available on GPU to compute these quickly.

To pick the orientation of the pixel whose magnitude is largest in a block, the common strategy of launching one thread per pixel is not practical: the cost of synchronization between the threads of a DOT block is not worthwhile, since the dimensions of the block (\(8 \times 8\)) are quite small in practice. Instead, we launch one thread for every DOT block to compare the magnitudes across its pixels and record the orientation of the largest-magnitude vector.
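The two DOT steps can be sketched sequentially on CPU as below; the quantization into 8 orientation bins and the one-hot bit encoding follow the spirit of DOT (Hinterstoisser et al. 2010), but the exact constants and the function name are our assumptions.

```python
import numpy as np

def dot_features(img, block=8, n_bins=8):
    """Per-pixel Sobel gradients, then for every block x block cell keep the
    quantized orientation of the strongest gradient (the 'dominant' one),
    encoded as a one-hot bit mask so features can be matched bitwise."""
    p = img.astype(float)
    gx = np.zeros_like(p)
    gy = np.zeros_like(p)
    # 3x3 Sobel on the interior (borders left at zero for simplicity)
    gx[1:-1, 1:-1] = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
                      - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy[1:-1, 1:-1] = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
                      - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # orientation, sign ignored
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    h, w = img.shape
    feat = np.zeros((h // block, w // block), dtype=np.uint8)
    for by in range(h // block):
        for bx in range(w // block):
            sl = (slice(by * block, (by + 1) * block),
                  slice(bx * block, (bx + 1) * block))
            m = mag[sl]
            if m.max() > 0:
                iy, ix = np.unravel_index(np.argmax(m), m.shape)
                feat[by, bx] = 1 << bins[sl][iy, ix]
    return feat
```

A vertical step edge yields the bin-0 (horizontal-gradient) bit in every block touching the edge, while a flat image yields an all-zero feature.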

### 4.4 Computing DOT Feature Matching

The DOT feature comparison is essentially composed of two steps: bitwise comparison, and accumulation of the comparison results. The bitwise comparisons can be conveniently performed using one thread per orientation in the DOT feature. A straightforward way to accumulate the comparison results is parallel segmented reduction; however, this can be wasteful, because the size of a DOT feature is typically small while the number of training examples is typically large. To accumulate efficiently, we instead use the atomic addition operations that have recently been implemented in GPU hardware.
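A sketch of this matching step; `dot_similarity` and `best_match` are hypothetical names, and the vectorized sum here plays the role of the GPU's atomic accumulation.

```python
import numpy as np

def dot_similarity(feat_a, feat_b):
    """Match two DOT features: a cell scores 1 when the one-hot orientation
    masks share a bit (bitwise AND is non-zero); the similarity is the count
    of agreeing cells over the whole template."""
    hits = np.bitwise_and(feat_a, feat_b) != 0
    return int(hits.sum())

def best_match(query, templates):
    """Return the index of the training template with the highest similarity,
    together with all scores. On GPU, one thread handles one cell comparison
    and per-template scores are accumulated with atomic additions."""
    scores = np.array([dot_similarity(query, t) for t in templates])
    return int(np.argmax(scores)), scores
```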

## 5 Our Annotated Real-World Dataset and Performance Evaluation

### 5.1 The Annotated Real-World Dataset

^{2}We hope this can provide an option for researchers in the field to compare performance on the same ground. As presented in Fig. 6, the depth images in our dataset are acquired with a TOF camera (SoftKinetic DS325, Softkinetic 2012). The corresponding annotations are obtained from a state-of-the-art data-glove (ShapeHand 2009). Our data capture setup is depicted in Fig. 7, where a data-gloved hand performs in a desktop setting with the depth camera positioned overhead. Each depth image contains only one hand; without loss of generality we consider only the right hand, and fix the camera-to-hand distance to around 50 cm. Our dataset contains images collected from 30 volunteers varying in age (18–60 years), gender (15 male and 15 female), race, and hand shape. 29 images are obtained for each volunteer during the capture sessions, where 8 of these images are from the Chinese number counting system (from 1 to 10, excluding 3 and 7), and the remaining 21 are from the American Sign Language (ASL) alphabet (from A to Z, excluding J, R, T, W and Z), as illustrated in Fig. 9. Together, these amount to 870 annotated examples, with each example consisting of a hand depth image and its label (the data-glove annotation).

In addition to our kinematic chain model of Fig. 4, an alternative characterization (Girshick et al. 2011) of a 3D hand pose consists of a sequence of joint locations \(\mathbf {v} = \big \{ v_i \in {\mathbb {R}}^3 : i=1, \ldots , m \big \}\), where \(m\) refers to the number of joints and \(v_i\) specifies the 3D location of a joint. In terms of performance evaluation, this characterization by joint locations (as illustrated in Fig. 8) is usually more easily interpreted when it comes to comparing pose estimation results. As this hand pose characterization is obtained from the ShapeHand data-glove, there exist some slight differences in joints when comparing with the kinematic model: first, all five finger tips are additionally considered in Fig. 8; second, there are three thumb joints in Fig. 4 while only two of them are retained in Fig. 8, as ShapeHand does not measure the thumb base joint directly. Nevertheless there exists a unique mapping between the two characterizations.

### 5.2 Performance Evaluation Metric and Its Computation

Our performance evaluation metric is based on the *joint error*, defined as the average Euclidean distance in 3D space over all joints. Note that the joints in this context refer to the 20 joints defined in Fig. 8, namely all the joints of our skeletal model of Fig. 4 plus the five finger tips and the hand base, except for one thumb joint that is excluded due to the compatibility issue with the ShapeHand data-glove used in our empirical experiments. Formally, denote \(\mathbf {v}_g\) and \(\mathbf {v}_e\) as the ground truth and estimated joint locations. The *joint error* of the hand pose estimate \(\mathbf {v}_e\) is defined as \(e = \frac{1}{m}\sum _{i=1}^m \Vert v_{gi} - v_{ei} \Vert \), where \(\Vert \cdot \Vert \) is the Euclidean norm in 3D space. Moreover, as we are dealing with a number of test hand images, letting \(j=1,\ldots , n_t\) run over the test images with corresponding joint errors \(\{e_1, \ldots , e_{n_t}\}\), the *mean joint error* is defined as \(\frac{1}{n_t}\sum _{j} e_j\), and the *median joint error* is simply the median of this set of errors.
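These metrics follow directly from the definitions; a small sketch (helper names are ours):

```python
import numpy as np

def joint_error(v_gt, v_est):
    """Mean 3D Euclidean distance over the m joints of one pose.
    v_gt, v_est: arrays of shape (m, 3)."""
    return float(np.mean(np.linalg.norm(v_gt - v_est, axis=1)))

def summarize(errors):
    """Mean and median joint error over a set of test images."""
    e = np.asarray(errors, dtype=float)
    return float(e.mean()), float(np.median(e))
```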

When working with annotated real depth images, there are a number of practical issues to be addressed; below we present the major ones. To avoid interference from the tapes fixed at the back of the ShapeHand data-glove, our depth images focus on near-frontal views. Empirically, we have evaluated the reliability of the data-glove annotations via a number of simple but informative tests, where we observed that the ShapeHand device produces reasonable and consistent measurements (i.e. within mm accuracy) on all finger joints except for the thumb, where significant errors are observed; we believe the source of this error lies in the design of the instrument. As a result, even though we have included the thumb-related joints in our dataset, they are presently ignored during performance evaluation: the three thumb-related joints are not considered while evaluating hand-pose estimation algorithms, which, as displayed in Fig. 8, gives \(m=20-3=17\) in practice. The data-glove also gives inaccurate measurements when the palm arches (bends) deeply; therefore we have to withdraw several gestures from consideration, including 3, 7, R, T and W. Note that on synthetic data all finger joints are considered, as discussed previously.

The last, and nevertheless the most significant, issue is an alignment problem: due to the physical principle of ShapeHand data acquisition, its coordinate frame originates at the hand base, which differs from the coordinate frame of the hand pose estimated from the depth camera. The two are related by a linear coordinate transformation; in other words, the estimated joint positions need to be transformed from the camera coordinate frame to the ShapeHand coordinate frame. More specifically, denote by \(v_i^S\) and \(v_i^C\) the 3D location of a joint in the ShapeHand (\(S\)) and the camera (\(C\)) coordinate frames, respectively, where \(i\) indexes over the \(m\) joints. The transformation matrix \(T^S_C\) from \((C)\) to \((S)\) can then be uniquely obtained following the least-squares 3D alignment method of Umeyama (1991).
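A compact sketch of the Umeyama (1991) least-squares alignment, assuming known joint correspondences between the two frames; `umeyama_alignment` is a hypothetical helper name and the return convention (rotation, translation, scale) is our choice.

```python
import numpy as np

def umeyama_alignment(src, dst, with_scale=True):
    """Least-squares similarity transform (R, t, s) with s*R@src_i + t ≈ dst_i,
    following Umeyama (1991). src, dst: (n, 3) corresponding point sets."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                          # keep R a proper rotation
    R = U @ S @ Vt
    s = (np.trace(np.diag(D) @ S) * len(src) / (xs ** 2).sum()) if with_scale else 1.0
    t = mu_d - s * R @ mu_s
    return R, t, s
```

The recovered transform reproduces a known rotation, translation, and scale exactly on noise-free points, which is an easy self-check.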

## 6 Experiments

Throughout the experiments we set \(T_1=7\) and \(T_2=12\), and the depth of the trees to 20. Altogether 460K *synthetic* training examples are used, as illustrated in Fig. 11. These training examples cover generic gestures from American sign language and Chinese number counting, together with their out-of-plane pitch and roll rotations, as well as in-plane rotational perturbations. The minimum number of estimation examples stored at leaf nodes is set to \(k_n=30\). \(m_0\) is set to the large constant \(10^7\), which practically allows all training examples to be considered when choosing a threshold \(t\) at a split node. The evaluation of depth features requires access to a local image window centered at the current pixel during the first step, and to the whole hand patch during the second step, of sizes \((w,h) = (50, 50)\) and \((w,h) = (120, 160)\), respectively. The size of the feature space \(d\) is fixed to 3000, and the related probability to \(p=0.2\). Distance is defined as the Euclidean distance between the hand and the camera. By default we focus on our *DHand* model during the experiments.

In what follows, empirical simulations are carried out on the synthetic dataset to investigate myriad aspects of our system under controlled setting. This is followed by extensive experiments with real-world data. In addition to hand pose estimation, our system is also shown to work with related tasks such as part-based labeling and gesture classification.

### 6.1 Experiments on Synthetic Data

To conduct quantitative analysis, we first work with an in-house dataset of 1.6K synthesized hand depth images that covers a range of distances (from 350 to 700 mm). As with the real data, the resolution of the depth camera is set to \(320\times 240\). When the distance from the hand to the camera is \(\mathrm {dist}=350\) mm, the bounding box of the hand in the image plane is typically of size \(70 \times 100\); when \(\mathrm {dist}=500\) mm, the size is reduced to \(49\times 70\); when \(\mathrm {dist}=700\) mm, it further decreases to \(35\times 50\). White noise with standard deviation 15 mm is added to the synthetic depth images.
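The quoted bounding-box sizes are mutually consistent under a simple pinhole camera model, where the image-plane extent of an object scales inversely with its distance; a quick sanity check (the helper name is ours):

```python
def projected_size(size_at_ref_px, ref_dist_mm, dist_mm):
    """Pinhole model: image-plane size scales as ref_dist / dist."""
    return round(size_at_ref_px * ref_dist_mm / dist_mm)
```

For example, scaling the 350 mm sizes by 350/500 and 350/700 reproduces the 49×70 and 35×50 boxes quoted above.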

#### 6.1.1 Estimation Error of Step One

We further investigate, on the same synthetic dataset, how perturbing the estimates of step one affects the final estimates of our system. A systematic study is presented in the three-dimensional bar plot of Fig. 13, where the hand position and in-plane rotation errors of step one form the two-dimensional input and the mean joint error is the output: assuming the inputs from step one are perfect (i.e. with zero errors in both dimensions), the final error of our system is around 15 mm. As both input errors increase, the final mean joint error goes up to over 40 mm. So it is fair to say that our system is reasonably robust against perturbations of the results from step one. Interestingly, our pipeline seems particularly insensitive to the in-plane rotation error of step one: the final error changes by only 5 mm when the in-plane rotation error varies between 0 and 30 degrees. Finally, as shown in Fig. 13, where the errors of our first step (the green bar) are relatively small, our final estimation error is around 22 mm (mean joint error).

#### 6.1.2 Number of Trees

#### 6.1.3 Split Criteria

Two different split criteria are used for tree training in the second forest. When all 23 parameters are used to compute the entropy, the mean and median joint errors are 21.7 and 19.1 mm, respectively. The hand rotation around the Y-axis plays an important role in training the forest: when each node considers only the distribution of the Y-axis rotation and the balance of the split (8), the mean and median joint errors are 21.5 and 19.2 mm, respectively. In other words, considering only the Y-axis rotation performs as well as considering all 23 parameters.

#### 6.1.4 Performance Evaluation of the Proposed Versus Existing Methods

Figure 15 provides a performance comparison (in terms of mean/median joint error) among several competing methods: the two proposed baseline methods (*Baseline-M* and *Baseline-S*), the proposed main method (*DHand*), as well as a comparison method (Xu and Cheng 2013) denoted as *ICCV'13*. Overall our method *DHand* delivers the best results across all distances, followed by *Baseline-M* and *Baseline-S*, which matches well with our expectation, while *ICCV'13* achieves the worst performance. In addition, our proposed methods are shown to be rather insensitive to distance changes (anywhere between 350 and 700 mm), whereas *ICCV'13* performs best around 450 mm and declines at larger distances.

The relatively poor performance of *ICCV'13* can be attributed to its relatively flat error distribution, which suggests that some joints deviate seriously from their true locations. This is in sharp contrast to the error distribution of *DHand* shown at the top-left corner, where the majority of the errors reside in the small-error zone. Finally, *Baseline-M* and *Baseline-S* lie somewhere in-between, with their peaks on the small-error side.

*Comparison over Different Matching Methods:* There are a few state-of-the-art object template matching methods commonly used for related tasks, including DOT (Hinterstoisser et al. 2010), HOG (Dalal and Triggs 2005), and NCC (Lewis 1995). Figure 17 presents a performance comparison of our approach when adopting each of these matching methods. It is clear that DOT consistently performs the best, followed by HOG, while NCC always delivers the worst results. In terms of memory usage, DOT consumes 100 MB, HOG takes 4 GB, and NCC needs 2 GB; clearly DOT is the most cost-effective option. Note that in addition to the 100 MB DOT consumption, the 288 MB memory footprint of our system also includes other overheads such as third-party libraries.

### 6.2 Experiments on Real Data

Experiments of this section focus on our in-house real-world depth images that are introduced in Sect. 5. By default, the distance from the hand to the camera is fixed to 500 mm. Throughout experiments three sets of depth images are used as presented in Fig. 1: (1) bare hand imaged by top-mounted camera; (2) bare hand imaged by front-mounted camera; (3) hand with data-glove imaged by top-mounted camera.

We first examine the empirical error distribution of *DHand* on this test set. Only a small fraction of the estimates incur very large errors (e.g. 40 mm and above), while most reside in the relatively small-error (e.g. 10–25 mm) area. This suggests that the pose estimation of *DHand* usually does not incur gross errors such as mistaking a palm for a hand dorsum (i.e. the back of the hand) or vice versa. As a summary of this empirical error distribution, its median/mean joint errors are 21.1/22.7 mm, which are comparable with those on the synthetic dataset, where the median/mean joint errors are 19.2/21.5 mm. We further look into its spread over different subjects and gestures, as displayed in Fig. 19: in the top plot, we can see that the errors over subjects are quite similar; the remaining differences may be due to hand size, as smaller hands tend to incur smaller errors. In the bottom plot, it is clear that simple gestures such as “5” receive relatively small errors, while gestures such as “6”, “10”, and “F” tend to have larger errors, as many finger joints are not directly observable.

Qualitative results of *DHand* are also presented on real-world depth images of various gestures, orientations, and hand sizes, with versus without gloves, as well as different camera set-ups. Throughout, our system is shown to consistently deliver visually plausible results. Some failure cases are shown in Fig. 21, which will be analyzed later.

#### 6.2.1 Comparisons with State-of-the-Art Methods

Experiments are also conducted to qualitatively evaluate *DHand* against the state-of-the-art methods (Tang et al. 2014; Oikonomidis and Argyros 2011; Leapmotion 2013) on pose estimation and tracking tasks, as shown in Figs. 22 and 23. Note that (Tang et al. 2014) is re-implemented by ourselves, while the original implementations of the other two methods are employed.

Recently the latent regression forest (LRF) method has been developed in Tang et al. (2014) to estimate finger joints from single depth images. As presented in Fig. 22, for the eight distinct hand images from left to right, LRF gives relatively reasonable results on the first four and makes noticeable mistakes on the remaining four, while our method consistently offers visually plausible estimates. Note that in this experiment all hand images are acquired in frontal view only, as LRF has been observed to deteriorate significantly when the hand rotates around the Y-axis, as also revealed in Fig. 5 of our paper, an issue we consider one of the leading challenges in hand pose estimation.

We further compare *DHand* with two state-of-the-art hand tracking methods: the well-known tracker of Oikonomidis and Argyros (2011), and a commercial software (Leapmotion 2013), of which the stable version 1.2.2 is used. Unfortunately each of the trackers operates on different hardware: Oikonomidis and Argyros (2011) works with Kinect (2011) to take as input a stream of paired color and depth images, while Leap Motion runs on proprietary camera hardware (Leapmotion 2013); moreover, being a closed system, its results in Fig. 23 are screen captures from its visualizer. To accommodate these differences, we engage the cameras at the same time during data acquisition, where the Kinect and the Softkinetic cameras are closely placed to ensure their inputs are from similar side views, and the hands hover on top of the Leap Motion at a distance of about 17 cm. We also allow both trackers sufficient lead time so that both function well before being exposed to each of the hand scenes displayed in the first row of Fig. 23. Taking each of these ten images as input, *DHand* consistently delivers plausible results, while the performance of both tracking methods is rather mixed: Oikonomidis and Argyros (2011) does not seem to fit our hand sizes entirely well, and in particular, we have observed that its performance degrades when the palm arches, as e.g. in the seventh case. Leap Motion produces reasonable results for the first five cases and performs less well on the remaining five.

#### 6.2.2 Part-Based Labeling

Our proposed approach can also be used to label hand parts, where the objective is to assign each pixel to one of a list of prescribed hand parts. Here we adopt the color-coded part labels of Fig. 10, and a simple scheme to convert our hand pose estimate into part-based labels: from the input depth image, the hand area is first segmented from the background; our predicted hand pose is then applied to a synthetic 3D hand and projected onto the input image, and each overlapping pixel is assigned the proper color label. Each pixel not covered by the synthetic hand model is assigned the label of the nearest overlapped region.
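The fill-in rule for uncovered pixels can be sketched as a brute-force nearest-neighbour assignment; `fill_uncovered` is a hypothetical name, and a real implementation may well use a faster scheme such as a distance transform.

```python
import numpy as np

def fill_uncovered(labels, covered):
    """Assign every pixel not covered by the projected hand model the label
    of its nearest covered pixel (brute-force nearest neighbour).
    labels: (h, w) int label map; covered: (h, w) bool mask."""
    cy, cx = np.nonzero(covered)
    out = labels.copy()
    for y, x in zip(*np.nonzero(~covered)):
        d2 = (cy - y) ** 2 + (cx - x) ** 2   # squared pixel distances
        j = int(np.argmin(d2))
        out[y, x] = labels[cy[j], cx[j]]
    return out
```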

Figure 24 presents exemplar labeling results on real-world depth images where the data-glove is put on. To illustrate the variety of images we present horizontally a series of unique gestures and vertically instances of the same gesture but from different subjects. Visually the results are quite satisfactory, where the color labels are mostly correct and consistent across different gestures and subjects. It is also observed that our annotation results are remarkably insensitive to background changes including the wires of the data-glove.

#### 6.2.3 Gesture Classification

Instead of emphasizing connections and comparisons with existing ASL-focused research efforts, the aim here is to showcase the capacity of our pose estimation system to address the related task of gesture recognition. We therefore take the liberty of considering a combined set of gestures, namely the 29 gestures discussed previously in the dataset and evaluation section; instead of pose estimation, here we consider the problem of assigning each test hand image to its corresponding gesture category. Notice that some gestures are very similar to each other, e.g. {“t”, “s”, “m”, “n”, “a”, “e”, “10”}, {“p”, “v”, “h”, “r”, “2”}, {“b”, “4”}, and {“x”, “9”}, as also illustrated in Fig. 9. Overall the average accuracy is 0.53.

The overall low score is mainly due to the similarity of several gesture types under consideration. For example, \(X\) in ASL is very similar to the number counting gesture \(9\), and is also very close to number \(1\). This explains why letter \(X\) is correctly predicted with probability only \(0.27\), while it is wrongly classified as number \(9\) with probability \(0.27\) and as number \(1\) with probability \(0.13\), as displayed in the confusion matrix of Fig. 25.

#### 6.2.4 Execution Time

Our efficient CPU implementation enables the system to run in near real-time at 15.6 FPS, while our GPU implementation further boosts the speed to 67.2 FPS.

#### 6.2.5 Limitations

Some failure cases of our hand pose estimation are presented in Fig. 21. A closed hand is usually difficult to deal with, since no finger joints or tips are visible: as demonstrated in the first column, it might be confused with similar gestures. Even when a few fingers are in sight, different hand gestures might still be confused, as they can look very similar when projected onto the image plane from certain viewing angles, as presented in the second to fourth columns. The last two columns display scenarios of overlapping fingers, which might also be wrongly estimated.

As our approach is based on single depth images, the results may appear jittery when working with video streams. We also remark that our method is fundamentally different from tracking-based methods, where gradient-based or stochastic optimization methods are used to exploit the available temporal information. As a result, the accuracy of our method might slightly lag behind a tracking-enabled approach with good initialization.

## 7 Conclusion and Outlook

This paper presents an efficient and effective two-step pipeline for hand pose estimation. GPU acceleration of the computational bottleneck component is also presented, which significantly speeds up the runtime execution. A data-glove annotated hand depth image dataset is described as an option for performance comparison of different approaches. Extensive empirical evaluations demonstrate the competitive performance of our approach, complemented by a theoretical consistency analysis of a slightly simplified version. For future work, we plan to incorporate temporal information so as to eliminate the jittering effect of our system when working with live streams.

## Footnotes

- 1.
A project webpage can be found at http://web.bii.a-star.edu.sg/~xuchi/handengine.htm, which contains supplementary information of this paper such as the demo video.

- 2.
Our annotated dataset of depth images and the online performance evaluation system for 3D hand pose estimation are publicly available at http://hpes.bii.a-star.edu.sg/.

- 3.
\(diam(A)=\sup _{\overline{\mathbf {x}}_1,\overline{\mathbf {x}}_2\in A}\Vert \overline{\mathbf {x}}_1-\overline{\mathbf {x}}_2\Vert \) for a set \(A\subset {\mathbb {R}}^d\).

## Notes

### Acknowledgments

This research was partially supported by A*STAR JCO and IAF grants. We would like to thank Vaghul Aditya Balaji for helping with the data collection and website design processes during his intern attachment at BII.

### References

- Andrews, H. C., & Patterson, C. L. (1976). Digital interpolation of discrete images. *IEEE Transactions on Computers*, *C-25*(2), 196–202.
- Ballan, L., Taneja, A., Gall, J., Gool, L., & Pollefeys, M. (2012). Motion capture of hands in action using discriminative salient points. In *ECCV*.
- Biau, G., Devroye, L., & Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. *Journal of Machine Learning Research*, *9*, 2015–2033.
- Biau, G. (2012). Analysis of a random forests model. *Journal of Machine Learning Research*, *13*, 1063–1095.
- Breiman, L. (2004). *Consistency for a simple model of random forests*. Tech. rep., UC Berkeley.
- Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *24*, 603–619.
- Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In *CVPR* (Vol. 1, pp. 886–893).
- de La Gorce, M., Fleet, D., & Paragios, N. (2011). Model-based 3D hand pose estimation from monocular video. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *33*(9), 1793–1805.
- Denil, M., Matheson, D., & de Freitas, N. (2014). Narrowing the gap: Random forests in theory and in practice. In *ICML*.
- Erol, A., Bebis, G., Nicolescu, M., Boyle, R., & Twombly, X. (2007). Vision-based hand pose estimation: A review. *Computer Vision and Image Understanding*, *108*(1–2), 52–73.
- Fanelli, G., Gall, J., & Gool, L. V. (2011). Real time head pose estimation with random regression forests. In *CVPR*.
- Gall, J., & Lempitsky, V. (2013). Class-specific Hough forests for object detection. In *Decision forests for computer vision and medical image analysis* (pp. 143–157). Berlin: Springer.
- Girshick, R., Shotton, J., Kohli, P., Criminisi, A., & Fitzgibbon, A. (2011). Efficient regression of general-activity human poses from depth images. In *ICCV*.
- Gustus, A., Stillfried, G., Visser, J., Jorntell, H., & van der Smagt, P. (2012). Human hand modelling: Kinematics, dynamics, applications. *Biological Cybernetics*, *106*(11–12), 741–755.
- Györfi, L., Kohler, M., Krzyzak, A., & Walk, H. (2002). *A Distribution-Free Theory of Nonparametric Regression*. Berlin: Springer.
- Hackenberg, G., McCall, R., & Broll, W. (2011). Lightweight palm and finger tracking for real-time 3D gesture control. In *IEEE Virtual Reality Conference* (pp. 19–26).
- Hamming, R. W. (1997). *Digital Filters* (3rd ed.). Dover Publications.
- Hansard, M., Lee, S., Choi, O., & Horaud, R. (2013). *Time-of-Flight Cameras: Principles, Methods and Applications*. Berlin: Springer.
- Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., & Navab, N. (2010). Dominant orientation templates for real-time detection of textureless objects. In *CVPR*.
- Keskin, C., Kirac, F., Kara, Y., & Akarun, L. (2012). Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In *ECCV*.
- Kinect. (2011). http://www.xbox.com/en-US/kinect/.
- Leapmotion. (2013). http://www.leapmotion.com.
- Lewis, J. (1995). Fast normalized cross-correlation. In *Vision Interface* (Vol. 10, pp. 120–123).
- Melax, S., Keselman, L., & Orsten, S. (2013). Dynamics based 3D skeletal hand tracking. In *Graphics Interface*.
- Oikonomidis, N., & Argyros, A. (2011). Efficient model-based 3D tracking of hand articulations using Kinect. In *BMVC*.
- Oikonomidis, I., Lourakis, M., & Argyros, A. (2014). Evolutionary quasi-random search for hand articulations tracking. In *CVPR*.
- Peachey, D. (1990). *Texture on demand*. Tech. rep.
- ShapeHand. (2009). http://www.shapehand.com/shapehand.html.
- Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In *CVPR*.
- Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., et al. (2013). Real-time human pose recognition in parts from single depth images. *Communications of the ACM*, *56*(1), 116–124.
- Softkinetic. (2012). http://www.softkinetic.com.
- Sridhar, S., Oulasvirta, A., & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using RGB and depth data. In *ICCV*.
- Sueda, S., Kaufman, A., & Pai, D. (2008). Musculotendon simulation for hand animation. In *SIGGRAPH* (pp. 83:1–83:8).
- Tang, D., Tejani, A., Chang, H., & Kim, T. (2014). Latent regression forest: Structured estimation of 3D articulated hand posture. In *CVPR*.
- Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., Izadi, S., Fitzgibbon, A., & Hertzmann, A. (2014). User-specific hand modeling from monocular depth sequences. In *CVPR*.
- Tzionas, D., & Gall, J. (2013). A comparison of directional distances for hand pose estimation. In *German Conference on Pattern Recognition*.
- Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *13*, 376–380.
- Wang, R., & Popović, J. (2009). Real-time hand-tracking with a color glove. In *SIGGRAPH* (pp. 63:1–63:8).
- Xu, C., & Cheng, L. (2013). Efficient hand pose estimation from a single depth image. In *ICCV*.
- Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., & Gall, J. (2013). A survey on human motion analysis from depth data. In *Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications* (pp. 149–187). Berlin: Springer.
- Zhao, W., Chai, J., & Xu, Y. (2012). Combining marker-based mocap and RGB-D camera for acquiring high-fidelity hand motion data. In *Eurographics Symposium on Computer Animation*.

## Copyright information

**Open Access**This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.