Estimate Hand Poses Efficiently from Single Depth Images
Abstract
This paper aims to tackle the practically very challenging problem of efficient and accurate hand pose estimation from single depth images. A dedicated two-step regression forest pipeline is proposed: given an input hand depth image, step one mainly involves estimating the 3D location and in-plane rotation of the hand using a pixel-wise regression forest. This is utilized in step two, which delivers the final hand estimation by a similar regression forest model based on the entire hand image patch. Moreover, our estimation is guided by an internally executed 3D hand kinematic chain model. For an unseen test image, the kinematic model parameters are estimated by a proposed dynamically weighted scheme. As a combined effect of these proposed building blocks, our approach is able to deliver more precise estimates of hand poses. In practice, our approach runs at 15.6 frames per second (FPS) on an average laptop when implemented on CPU, which is further sped up to 67.2 FPS when running on GPU. In addition, we introduce and make publicly available a data-glove-annotated depth image dataset covering various hand shapes and gestures, which enables us to conduct quantitative analyses on real-world hand images. The effectiveness of our approach is verified empirically on both synthetic and the annotated real-world datasets for hand pose estimation, as well as on related applications including part-based labeling and gesture classification. In addition to empirical studies, the consistency property of our approach is also theoretically analyzed.
Keywords
Hand pose estimation · Depth images · GPU acceleration · Regression forests · Consistency analysis · Annotated hand image dataset
1 Introduction
Vision-based hand interpretation plays important roles in diverse applications including humanoid animation (Sueda et al. 2008; Wang and Popović 2009), robotic control (Gustus et al. 2012), and human-computer interaction (Hackenberg et al. 2011; Melax et al. 2013), among others. At its core lies the nonetheless challenging problem of 3D hand pose estimation (Erol et al. 2007; Gorce et al. 2011), owing mostly to the complex and dexterous nature of hand articulations (Gustus et al. 2012). Facilitated by the emerging commodity-level depth cameras (Kinect 2011; Softkinetic 2012), recent efforts such as Keskin et al. (2012), Ye et al. (2013), Xu and Cheng (2013), and Tang et al. (2014) have led to noticeable progress in the field. The problem is, however, still far from being satisfactorily solved: for example, not much quantitative analysis has been conducted on annotated real-world 3D datasets, partly due to the practical difficulty of setting up such testbeds. This imposes significant restrictions on the evaluation of existing efforts, which are often either visually judged on a number of real depth images, or quantitatively verified on synthetic images only, as the ground-truths are then naturally known. As each work utilizes its own set of images, their results are not entirely comparable. These issues inevitably raise concerns about progress evaluation and reproducibility.

For an unseen test image, a dynamically weighted scheme is proposed to regress our hand model parameters based on a two-step pipeline, which is empirically shown to lead to significantly reduced errors compared to the state-of-the-art. As presented in Fig. 1 as well as in the supplementary video, our system estimates hand poses, including 3D orientations, from single images. This also enables our system to work with a mobile depth camera.

We provide an extensive, data-glove-annotated benchmark of depth images for general hand pose estimation. The benchmark dataset, together with the ground-truths and the evaluation metric implementation, has been made publicly available. This is, to our knowledge, the first benchmark of its kind, and we hope it provides an option for researchers in the field to compare performance on the same ground.

To maintain efficiency and to reduce the CPU footprint, the most time-consuming components of our approach have also been identified and accelerated by GPU implementations, which gains us a further fivefold overall speed-up. These enable us to deliver a practical hand pose estimation system that works efficiently, at about 15.6 frames per second (FPS) on an average laptop, and at 67.2 FPS when given access to a mid-range GPU.

The reliance on synthetic training examples naturally brings up the consistency question, since infinitely many examples are potentially available for training. Our paper makes a first attempt to propose a regression forest-based hand pose system that is theoretically motivated. To this end, we are able to provide a consistency analysis for a simple variant of our learning system. Although the complete analysis remains open, we believe this is a necessary and important step toward full comprehension of the random forest theory that has been working so well in a number of practical applications.
1.1 Related Work
An earlier version of our work appears in Xu and Cheng (2013), which deals with the problem of depth image-based hand pose estimation. There are a number of differences between our work here and that of Xu and Cheng (2013): first, a simple two-step pipeline is utilized in our approach, in contrast to the more complicated three-step approach of Xu and Cheng (2013). Second, in this work we attempt to consider random forest models that can be analyzed theoretically, while the random forest models in Xu and Cheng (2013) do not lend themselves to such analysis. Third, there are also many other differences: the kinematic model parameters are estimated by a dynamically weighted scheme that leads to a significant error reduction in empirical evaluations; the information gains and split criteria, the usage of the whole hand image patch rather than individual pixels, as well as the DOT features to be detailed later, are also quite different. Meanwhile, various related regression forest models have been investigated recently: in Fanelli et al. (2011), the head pose has 6 degrees of freedom (DoF), which are divided into two parts: 3D translation and 3D orientation. In each leaf node, the distribution is approximated by a 3D Gaussian. In Gall and Lempitsky (2013), the Hough forest model is instead utilized to represent the underlying distributions as votes with a set of 3D vectors. In Shotton et al. (2013), a fixed weight is assigned to each of the 3D voting vectors during training, and the experimental results suggest that the weight plays a crucial role in body pose estimation. Our scheme of dynamic weights can be regarded as a further extension of this idea that allows adaptive weight estimation at test time, dedicated to the current test example. A binary latent tree model is used in Tang et al. (2014) to guide the search process for the 3D locations of hand joints.
For the related problem of video-based 3D hand tracking, a user-specific modeling method is proposed by Taylor et al. (2014), while Oikonomidis et al. (2014) adopt an evolutionary optimization method to capture hand and object interactions.
Leapmotion (2013) is a commercial system designed for close-range (within about 50 cm in depth) hand pose estimation. As a closed system based on proprietary hardware, its inner working mechanism remains undisclosed. Our observation is that it does not tolerate self-occlusions of fingertips well. In contrast, our system works beyond half a meter in depth, and works well when some of the fingertips are occluded, as it does not rely on detecting fingertips.
Additionally, instead of directly estimating the 3D locations of finger joints from the depth image as in e.g. Girshick et al. (2011), our model predicts the parameters of a predefined hand kinematic chain, which are further utilized to build the 3D hand. This is mainly because, compared to the 3D locations of joints, the kinematic chain is a global representation and is more tolerant to self-occlusion, a scenario often encountered in our hand pose estimation context. Second, for human pose estimation, once the body location is known (i.e., the torso is fixed), the limbs and the head can be roughly considered as independent sub-chains: e.g. a movement of the left hand will not affect the other parts significantly. In contrast, the motions of the five fingers and the palm are tightly correlated.
For the related problem of optical motion capture, depth cameras have been utilized either on their own (Ballan et al. 2012), or together with existing marker-based systems (Zhao et al. 2012; Sridhar et al. 2013) for markerless optical motion capture. While able to produce more precise results, these approaches typically rely on more than one camera and operate in an offline fashion. In terms of annotated datasets of hand images, existing efforts are typically annotated manually with either part-based labels (e.g. Tang et al. 2014) or fingertips (Sridhar et al. 2013). However, these annotations do not explicitly offer 3D information on the skeletal joints. Tzionas and Gall (2013) instead engage a human annotator to annotate the 2D locations of joints, and aggregate them to infer the 3D hand joint locations. This dataset unfortunately does not provide depth image input; one possible concern is also that its annotation might not be fully objective.
In what follows, we start by giving an overall account of the regression forest models that are the core modules in our proposed learning system.
2 Our Theoretically Motivated Regression Forest Models
As will become clear in later sections, the training of our regression forests relies on large quantities of synthetic examples. It is thus of central interest to provide a consistency analysis to characterize their asymptotic behavior, which concerns the convergence of the estimate to an optimal estimate as the sample size goes to infinity. Most existing papers (Breiman 2004; Biau et al. 2008; Biau 2012; Denil et al. 2014) on the consistency of regression forests focus on stylized and simplified algorithms. The unpublished manuscript of Breiman (2004) suggests a simplified version of random forests and provides a heuristic analysis of its consistency. This model is further analyzed in Biau (2012) where, besides consistency, the author also shows that the rate of convergence depends only on the number of strong features. An important work on the consistency of random forests for classification is Biau et al. (2008), which provides consistency theorems for various versions of random forests and other randomized ensemble classifiers. Despite these efforts, there is still a noticeable gap between the theory and practice of regression forest learning systems. This is particularly true for the pose estimation systems that have made tremendous progress during the past few years on human full-body, head, and hand poses, where random forests have been very successful. On the other hand, little theoretical analysis has been provided for the learning systems underpinning these empirical successes.
Different from most existing practical random forest models, the random forest model considered in our approach is theoretically motivated; it is inspired by existing theoretical works (Breiman 2004; Biau et al. 2008; Biau 2012) and in particular Denil et al. (2014). The theoretical analysis of the resulting random forest model closely follows that of Denil et al. (2014). Meanwhile, our proposed random forest model is sufficiently sophisticated to be practically capable of addressing real-world problems. Our model and its variants are specifically applied to the real problem of hand pose estimation with competitive empirical performance. Meanwhile, it is worth noting that the proposed models are generic and can work with problems beyond hand pose estimation.
In what follows, we introduce our generic regression forest model in terms of training data partition, split criteria during tree construction, prediction, as well as its variants. Its asymptotic consistency analysis is also offered.
2.1 Training Data: A Partition into Structure Data and Estimation Data
Formally, let \(\overline{X}\) denote a \([0,1]^d\)-valued random variable and \(\overline{Y}\) denote a \({\mathbb {R}}^{q}\)-valued vector of random variables, where \(d\) is the dimension of the normalized feature space, and \(q\) is the dimension of the label space. Denote by \(\overline{\mathbf {x}}\) (or \(\overline{\mathbf {y}}\)) a realization of the random variable \(\overline{X}\) (or \(\overline{Y}\)). A training example can be defined as an (instance, label) pair, \(\left( \overline{\mathbf {x}}, \overline{\mathbf {y}} \right) \). Therefore, the set of \(n\) training examples is represented as \(D_n= \big \{ (\overline{\mathbf {x}}_{i}, \overline{\mathbf {y}}_{i}) \big \}_{i=1}^n\). Inspired by Denil et al. (2014), during tree construction we partition \(D_n\) randomly into two parts, structure data \(U_n\) and estimation data \(E_n\), by randomly selecting \(\lfloor \frac{n}{2}\rfloor \) examples as structure data and the rest as estimation data. Examples in the structure data are used to determine the tests used in split nodes (i.e. internal nodes of the trees), and examples in the estimation data are retained in each leaf node of the tree for making predictions at the test phase. This way, once the partition of the training sample is provided, the randomness in the construction of the tree remains independent of the estimation data, which is necessary to ensure consistency in the follow-up theoretical analysis.
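The random split into structure and estimation data can be sketched as follows (a minimal illustration; the function and variable names are our own, not from the paper):

```python
import numpy as np

def partition_training_data(D, rng):
    """Randomly split the n training examples into structure data U
    (floor(n/2) examples, used to choose the split tests) and
    estimation data E (the rest, stored at leaves for prediction)."""
    n = len(D)
    idx = rng.permutation(n)
    n_struct = n // 2  # floor(n/2) structure examples
    U = [D[i] for i in idx[:n_struct]]
    E = [D[i] for i in idx[n_struct:]]
    return U, E
```

Because the permutation is drawn before any split is chosen, the tree structure depends only on `U`, keeping it independent of the estimation data as required by the analysis.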
2.2 The Split Criteria
The performance of regression forest models is known to be crucially determined by the decision tree construction and particularly the split criteria, which are our focus here.
In the regression forests, each decision tree is independently constructed. The tree construction process can be equivalently interpreted as a successive partition of the feature space, \([0,1]^d\), with axis-aligned split functions. That is, starting from the root node of a decision tree, which encloses the entire feature space, each tree node corresponds to a specific rectangular hypercube with monotonically decreasing volumes as we visit nodes deeper in the tree. Finally, the union of the hypercubes associated with the set of leaf nodes forms a complete partition of the feature space.
Similar to existing regression forests in the literature, including (Fanelli et al. 2011; Shotton et al. 2011; Denil et al. 2014), at a split node we randomly select a relatively small set of \(s\) distinct features \(\varPhi := \{\phi _i\}_{i=1}^s\) from the \(d\)-dimensional space as candidate features (i.e. entries of the feature vector). \(s\) is obtained via \(s \sim 1+ \mathrm {Binomial}(d-1,p)\), where \(\mathrm {Binomial}\) denotes the binomial distribution, and \(p>0\) is a predefined probability. Denote by \(t \in {\mathbb {R}}\) a threshold. At every candidate feature dimension, we first randomly select \(M\) structure examples in this node, where \(M\) is the smaller value between the number of structure examples in this node and a user-specified integer \(m_0\) (\(m_0\) is independent of the training size), then project them onto the candidate feature dimension and uniformly select a set of candidate thresholds \({\mathcal {T}}\) over the projections of the \(M\) chosen examples. The best test \((\phi ^*, t^*)\) is chosen from these \(s\) features and accompanying thresholds by maximizing the information gain that is to be defined next. This procedure is then repeated until the tree has \(\lceil \log _2 L_n\rceil \) levels or further splitting of a node would result in fewer than \(k_n\) estimation examples.
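The split selection above can be sketched as follows. This is an illustrative simplification: the information gain is replaced by a variance-reduction surrogate, since the paper's actual gain (Eq. (2)) is defined later; the names and default values are our own.

```python
import numpy as np

def choose_split(X, Y, p=0.05, m0=32, n_thresh=10, rng=None):
    """Pick a split (feature, threshold) over structure data X (n x d)
    with labels Y (n x q). s ~ 1 + Binomial(d-1, p) candidate features
    are drawn; thresholds are sampled over the projections of at most
    m0 structure examples. Gain here is variance reduction, a stand-in
    for the paper's information gain."""
    rng = rng or np.random.default_rng()
    n, d = X.shape
    s = 1 + rng.binomial(d - 1, p)                # number of candidate features
    feats = rng.choice(d, size=s, replace=False)
    M = min(n, m0)                                # examples used for thresholds
    sub = rng.choice(n, size=M, replace=False)
    best, best_gain = None, -np.inf
    for phi in feats:
        proj = X[sub, phi]
        lo, hi = proj.min(), proj.max()
        for t in np.linspace(lo, hi, n_thresh + 2)[1:-1]:  # interior thresholds
            left = X[:, phi] < t
            if left.all() or (~left).all():
                continue
            gain = Y.var(axis=0).sum() - (
                left.mean() * Y[left].var(axis=0).sum()
                + (~left).mean() * Y[~left].var(axis=0).sum())
            if gain > best_gain:
                best, best_gain = (int(phi), float(t)), gain
    return best
```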
2.3 Prediction
2.3.1 BaselineS Versus DHand: Static Versus Dynamic Weights at the Leaf Nodes
Instead of making predictions with the empirical average as in BaselineM, we consider delivering the final prediction by mode-seeking over the votes, as in the typical Hough forests of Gall and Lempitsky (2013). More specifically, let \(l\) denote the current leaf node, and let \(i\in \{1,\dots , k_l\}\) index the training examples of leaf node \(l\). These examples are subsequently included as vote vectors in the voting space. Now, consider a more general scenario where each of the training examples has its own weight. Let \(\mathbf {z}_{li}\) represent the parameter vector of a particular training example \(i\) of leaf node \(l\), with \(w_{li} > 0\) as its corresponding weight. The set of weighted training examples at leaf node \(l\) can thus be defined as \({\mathcal {V}}_{l}=\big \{ (\mathbf {z}_{li}, w_{li}) \big \}_{i=1}^{k_l}\). Note this empirical vote set defines a point set or, equivalently, an empirical distribution. In existing literature such as Gall and Lempitsky (2013), \(w_{li} =1\) for any training example \(i\) and any leaf node \(l\). In other words, the empirical distribution \({\mathcal {V}}_{l}\) is determined during tree construction in the training stage, and remains unchanged during the prediction stage. This is referred to as the statically weighted scheme or BaselineS. Instead, we consider a dynamically weighted scheme (i.e. DHand) where each of the weights, \(w_{li}\), can be decided at runtime. This is inspired by the observation that the typical distribution of \({\mathcal {V}}_{l}\) tends to be highly multimodal. It is therefore crucial to assign each instance \(\mathbf {z}_{li}\) a weight \(w_{li}\) that properly reflects its influence on the test instance.
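A weighted mode-seeking step over the leaf votes \((\mathbf {z}_{li}, w_{li})\) can be sketched with a weighted mean-shift iteration. This is a simplification: the kernel, bandwidth, and initialization are our own choices, not the paper's.

```python
import numpy as np

def weighted_mode_seek(votes, weights, bandwidth=1.0, n_iter=30):
    """Seek the dominant mode of weighted votes {(z_i, w_i)} by
    weighted mean-shift with a Gaussian kernel. Static weights
    (BaselineS) fix w_i = 1; dynamic weights (DHand) set w_i at
    test time for each example."""
    z = votes[np.argmax(weights)].astype(float)   # start at the heaviest vote
    for _ in range(n_iter):
        d2 = ((votes - z) ** 2).sum(axis=1)       # squared distances to z
        k = weights * np.exp(-0.5 * d2 / bandwidth ** 2)
        z_new = (k[:, None] * votes).sum(axis=0) / k.sum()
        if np.linalg.norm(z_new - z) < 1e-8:
            break
        z = z_new
    return z
```

With all weights equal this reduces to ordinary mode-seeking over the vote set; up-weighting one cluster of votes pulls the returned mode toward it.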
DOT (Hinterstoisser et al. 2010): As illustrated in Fig. 3, the DOT feature is used in our context to compute a similarity score \(C(I, I_T)\) between an input hand patch (\(I\)) and a reference hand patch (\(I_T\)). DOT works by dividing the image patch into a series of blocks of size \(8\times 8\) pixels, where each block is encoded using the pixel gradient information as follows: denote by \(\eta \) the orientation of the gradient at a pixel, with its range \([0, 2\pi )\) quantized into \(n_{\eta }\) bins, \(\{0,1,\ldots ,n_{\eta }-1\}\). We empirically set \(n_{\eta }=8\); the span of each bin is thus \(45^{\circ }\). This way, \(\eta \) can be encoded as a vector \(\mathbf {o}\) of length \(n_{\eta }\), by assigning 1 to the bin in which it resides and 0 otherwise. We set \(\mathbf {o}\) to the zero vector if there is no dominant orientation at the pixel. Now, consider each block of the input patch: its local dominant orientation \(\eta ^*\) is simply defined as the orientation of the maximum gradient within this block, which gives the corresponding vector \(\mathbf {o}^*\). Meanwhile, for each block in a template patch, to improve the robustness of DOT matching, we utilize a list of local dominant orientations \(\{\eta _{1}^*,\eta _{2}^*,\ldots ,\eta _{r}^*\}\), each corresponding to the template under a slight translation. Each entry of the list is mapped to the aforementioned orientation vector, and by applying bitwise OR operations successively to these vectors, they are merged into the vector \(\mathbf {o}_T^*\).
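The DOT encoding above can be sketched as follows, representing each orientation vector \(\mathbf {o}\) as an \(n_{\eta }\)-bit bitmask. The magnitude threshold used to decide whether a pixel has a dominant orientation is an illustrative assumption.

```python
import numpy as np

N_ETA = 8  # orientation bins over [0, 2*pi); each bin spans 45 degrees

def encode_orientation(eta, magnitude, min_mag=1e-3):
    """One-hot bitmask for a gradient orientation eta (radians);
    zero if there is no dominant orientation at the pixel."""
    if magnitude < min_mag:
        return 0
    b = int(eta / (2 * np.pi) * N_ETA) % N_ETA
    return 1 << b

def block_dominant_orientation(etas, mags):
    """Dominant orientation of a block: the orientation of its
    maximum-magnitude gradient."""
    i = int(np.argmax(mags))
    return encode_orientation(etas[i], mags[i])

def merge_template_orientations(masks):
    """Merge the dominant orientations of slightly translated
    templates with bitwise OR, as in DOT template encoding."""
    out = 0
    for m in masks:
        out |= m
    return out
```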
2.4 Theoretical Analysis
Theorem 1
Assume that \(\overline{X}\) is uniformly distributed on \([0,1]^d\) and \({\mathbb {E}} \Big \{ \big \Vert \overline{Y} \big \Vert ^2 \Big \}<\infty \), and suppose the regression function \(r(\overline{\mathbf {x}})\) is bounded. Then the regression forest estimates \(\big \{ r_n^{(Z)} \big \}\) of our BaselineM model in (4) are consistent whenever \(\lceil \log _2L_n\rceil \rightarrow \infty \), \(\frac{L_n}{n}\rightarrow 0\) and \(\frac{k_n}{n}\rightarrow 0\) as \(n\rightarrow \infty \).
Proof details of our consistency theorem are relegated to the appendix. Recall that the optimal estimator is the regression function \(r(\overline{\mathbf {x}})\), which is usually unknown. The theorem guarantees that as the amount of data increases, the probability that the estimate \(r_n(\overline{\mathbf {x}})\) of our regression forests is within a small neighbourhood of the optimal estimator approaches one. In our context, where infinitely many synthetic examples are potentially available for training, it suggests that our estimate, constructed by learning from a large number of examples, is optimal with high probability.
A node becomes a leaf when either of the following conditions holds:
(1) The maximum tree depth \(\lceil \log _{2}L_n\rceil \) is reached.
(2) The splitting of the node using the selected split point results in any child with fewer than \(k_n\) estimation points.
3 The Pipeline of Our Learning System
3.1 Preprocessing
Our approach relies on synthetic hand examples for training, where each training example contains a synthetic hand depth image and its corresponding 3D pose. The learned system is then applied to real depth images at the test stage for pose estimation. In particular, depth noise is commonly produced by existing commodity-level depth cameras, which renders noticeable differences from the synthetic images. For ToF cameras, this is overcome by applying a median filter to clear away the outliers, followed by a Gaussian filter to smooth out random noise. The amplitude image, also known as the confidence map, is used to filter out the so-called "flying pixel" noise (Hansard et al. 2013): pixels with low confidence values are treated as background. For structured-illumination cameras, the preprocessing strategy of Xu and Cheng (2013) can be applied. To obtain a hand image patch, a simple background removal technique similar to that of Shotton et al. (2011) is adopted, followed by image cropping to obtain a hand-centered bounding box. Moreover, to accommodate hand size variations, a simple calibration process is applied to properly scale a new hand to match the size of the training ones, by acquiring an initial image with the hand fully stretched and flat, and all fingers spread wide. Empirically, these preprocessing ingredients are shown to work sufficiently well.
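The ToF denoising chain (median filter for outliers, Gaussian filter for residual noise, confidence thresholding for flying pixels) can be sketched with SciPy. The filter sizes and the confidence threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter

def preprocess_tof(depth, confidence, conf_thresh=0.1, bg_value=0.0):
    """Denoise a ToF depth image: a median filter clears outliers,
    a Gaussian filter smooths residual noise, and pixels with low
    amplitude/confidence ("flying pixels") are sent to background.
    conf_thresh and the filter sizes are illustrative choices."""
    d = median_filter(depth, size=3)
    d = gaussian_filter(d, sigma=1.0)
    d[confidence < conf_thresh] = bg_value
    return d
```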
3.2 Our TwoStep Pipeline
After preprocessing, our approach consists of two major steps as in Fig. 2: step one mainly involves estimating the 3D location and in-plane rotation of the hand base (i.e. wrist) using a regression forest. This is utilized in step two, which subsequently establishes its coordinate frame based on the estimated 3D location and in-plane rotation. In step two, a similar regression forest model estimates the remaining parameters of our hand kinematic model by a dynamically weighted scheme, which produces the final pose estimation. Note that, different from existing methods such as Keskin et al. (2012), where by introducing a conditional model a large number of forest models (each catering to one particular condition) have to be prepared and kept in memory, our pipeline design requires only one forest model once the translation and in-plane rotation of step one establish the canonical coordinate frame.
In both steps of our pipeline, two almost identical regression forests are adopted. In what follows, separate descriptions are provided that highlight their differences. This allows us to present three variants of our learning system with a slight abuse of notation: the BaselineM system employs the basic BaselineM regression model in both steps; similarly, the BaselineS system utilizes instead the BaselineS models in both steps; finally, the DHand system applies the DHand regression forest model only at step two, while the BaselineS model is still engaged in step one. It is worth mentioning that for the BaselineM system, our theoretical analysis applies to the regression forest models used in both steps of our pipeline.
Before proceeding to our main steps, we would like to introduce the 3D hand poses, the related depth features and tests utilized in our approach, which are based on existing techniques as follows:
3.2.1 Step One: Estimation of Coordinate Origin and InPlane Rotation
This step estimates the 3D location and in-plane rotation of the hand base, namely (\(x_1, x_2, x_3, \alpha \)), which forms the origin of the coordinate frame to be used in step two. The (instance, label) pair of an example in step one is specified as follows: the instance (aka feature vector) \(\overline{\mathbf {x}}\) is obtained from an image patch centered at the current pixel location, \(\mathbf {x}=(\hat{x}_1, \hat{x}_2)\). Each element of \(\overline{\mathbf {x}}\) is realized by feeding particular \(u,v\) offset values into (7). Correspondingly, the label of each example, \(\overline{\mathbf {y}}\), is the first four elements of the full pose label vector \(\varTheta \), namely (\(x_1, x_2, x_3, \alpha \)). A regression forest is used to predict these parameters as follows: every pixel location in the hand image patch determines a training example, which is parsed by each of the \(T_1\) trees, resulting in a path from the root to a certain leaf node that stores a collection of training examples. Empirically, we observe that this 3D origin location and in-plane rotation are usually estimated fairly accurately.
Split Criterion of the First Step
For the regression forest of the first step, its input is an image patch centered at the current pixel, from which it produces the 4-dimensional parameters (\(x_1, x_2, x_3, \alpha \)). The entropy term of (2) is naturally computed in this 4-dimensional space (i.e. \(q=4\)).
3.2.2 Step Two: Pose Estimation
The depth pixel values of a hand image patch naturally form a 3D point cloud. With the output of step one, the point cloud is translated so that (\(x_1, x_2, x_3\)) becomes the coordinate origin, which is followed by a reverse rotation to the canonical hand pose by the estimated in-plane rotation \(\alpha \). An almost identical regression forest is then constructed to deliver the hand pose estimation: with the location output of step one, (\(x_1, x_2, x_3\)), as the coordinate origin, each entire hand patch from training is parsed by each of the \(T_2\) trees, leading down the tree path to a certain leaf node. The regression forest model of this step then delivers a 23-dimensional parameter vector \(\mathbf {z}\) by aggregating the votes of the training examples at the leaf nodes. The final 27-dimensional parameter estimate \(\varTheta \) is then obtained by direct composition of the results from both steps. Meanwhile, for step two, \(\overline{\mathbf {x}}\) stands for a feature vector of the entire hand image patch, while \(\overline{\mathbf {y}}:=\mathbf {z}\) represents the remaining 23 elements of \(\varTheta \).
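The canonicalization at the start of step two (translate the point cloud to the estimated origin, then undo the in-plane rotation \(\alpha \)) can be sketched as follows. Taking the depth (z) axis as the in-plane rotation axis is our assumption for this sketch.

```python
import numpy as np

def canonicalize(points, origin, alpha):
    """Translate a hand point cloud (n x 3) so the estimated base
    location (x1, x2, x3) becomes the origin, then undo the estimated
    in-plane rotation alpha about the depth (z) axis."""
    c, s = np.cos(-alpha), np.sin(-alpha)   # reverse rotation by alpha
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])
    return (points - origin) @ R.T
```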
Split Criterion of the Second Step
The second step focuses on estimating the remaining 23-dimensional parameters, which reside in a much larger space than the one considered in the first step. Straightforwardly following the same procedure as in step one, we would thus inevitably work with a very sparsely distributed empirical point set in a relatively high-dimensional space; this consumes a considerable amount of time, while the results might be unstable. Instead we consider an alternative strategy.
4 GPU Acceleration
4.1 Motivation
Typically a leaf node is expected to contain similar poses. The vast set of feasible poses, however, implies a conflicting aim: on the one hand, this can be achieved by making available as many training examples as possible; on the other hand, we practically prefer a small memory footprint for our system, thus limiting the amount of data. A good compromise is obtained by imposing a set of small random perturbations, including 2D translations, rotations, and hand size scaling, on each existing training instance, \(I_t\). This way, a leaf node usually has a better chance of working with an enriched set of similar poses. To this end, we remap \(I_t\) using \(m_t\) transformation maps. Every transformation map is generated using a set of small random perturbations (2D translations, rotations, and hand size scaling) of the same hand gesture, and is of the same dimensions (i.e. \(w \times h\)) as \(I_t\). After remapping the values of \(I_t\) using a transformation map, its DOT features are generated and compared with each of the features of the \(k_l\) instances \(I_{li}\) to obtain a similarity score. These DOT-related executions turn out to be the computational bottleneck in our CPU implementation, which can be substantially accelerated on GPU by exploiting the massive parallelism inherent in these steps. It is worth noting that the purpose here is to duplicate our CPU system with a GPU-native implementation, in order to obtain the same performance in much reduced time.
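The perturbation maps can be sketched as small random 2D similarity transforms applied to a patch. The perturbation ranges below are illustrative, and the remap uses nearest-neighbour lookup for brevity (the paper's GPU path uses bilinear interpolation via texture units).

```python
import numpy as np

def random_perturbation(rng, max_shift=4.0, max_rot=np.radians(10), max_scale=0.1):
    """Draw one small random similarity transform (2D translation,
    in-plane rotation, scale) used to enrich a training patch.
    Returns a 2x3 affine matrix mapping output pixel coordinates
    to input coordinates; the ranges are illustrative."""
    tx, ty = rng.uniform(-max_shift, max_shift, size=2)
    theta = rng.uniform(-max_rot, max_rot)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    c, sn = s * np.cos(theta), s * np.sin(theta)
    return np.array([[c, -sn, tx],
                     [sn,  c, ty]])

def remap(image, A):
    """Remap an image patch through affine A with nearest-neighbour
    lookup, clamping out-of-range source coordinates to the border."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src = A @ np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    sx = np.clip(np.rint(src[0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(src[1]).astype(int), 0, h - 1)
    return image[sy, sx].reshape(h, w)
```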
4.2 Using Texture Units for Remapping
To perform this remapping on GPU, we first launch one thread for every \(\mathbf {x}\) to read its \(\mathbf {x}_f\) from \(M_t\). Since all the threads in a thread block read from adjacent locations in GPU memory in sequence, the memory reads are perfectly coalesced. To obtain a depth value at \(\mathbf {x}_f\), we use the four pixels whose locations are the closest to it in \(I_t\). The resulting depth value is computed by performing a bilinear interpolation of the depth values at these four pixels (Andrews and Patterson 1976; Hamming 1998). Reading the four pixels around \(\mathbf {x}_f\) is inefficient, since the image is stored in memory in row order and memory accesses by adjacent threads can span multiple rows and columns of \(I_t\) and thus cannot be coalesced. This type of memory access is not a problem in CPU computation due to its deep hierarchy of caches with large cache memories at each level. However, the data caches in GPU architectures are tiny in comparison and are not very useful for this computation. The row-order memory layout that is commonly used has poor locality of reference; what is needed instead is an isotropic memory layout with no preferred access direction. This operation can therefore be performed on GPU by utilizing its two-dimensional texture memory, which ensures that pixels that are local in image space are almost always local in the memory layout (Peachey 1990).
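What a GPU texture fetch computes here can be stated as a small CPU reference: the depth at a fractional location is the bilinear interpolation of its four nearest pixels. This sketch clamps to the image interior for simplicity.

```python
import numpy as np

def bilinear_sample(img, xf, yf):
    """CPU reference for a GPU texture fetch: the depth at a
    fractional location (xf, yf) as the bilinear interpolation
    of the four nearest pixels (coordinates clamped to the image)."""
    h, w = img.shape
    x0 = int(np.clip(np.floor(xf), 0, w - 2))
    y0 = int(np.clip(np.floor(yf), 0, h - 2))
    fx, fy = xf - x0, yf - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1]
    bot = (1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bot
```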
4.3 Computing DOT Features
Computing the DOT features for each of the \(m_t\) remapped images takes two steps: computing the gradient at every pixel and then the dominant orientation of every block in the image. One thread is launched on GPU for every pixel to compute its X and Y gradient values. We apply a \(3 \times 3\) Sobel filter to compute the gradient, and the memory reads are coalesced across a warp of threads for efficiency. Using the gradient values, the magnitude and angle of the gradient vector are computed and stored in GPU memory. We use the fast intrinsic functions available on GPU to compute these quickly.
To pick the orientation of the pixel whose magnitude is largest in a block, the common strategy of launching one thread per pixel is not practical. The cost of synchronization between threads of a DOT block is not worthwhile since the dimensions of the block (\(8 \times 8\)) are quite small in practice. Instead, we launch one thread for every DOT block to compare the magnitude values across its pixels and note the orientation of the largest magnitude vector.
4.4 Computing DOT Feature Matching
The DOT feature comparison is essentially composed of two steps: bitwise comparison and accumulation of the comparison result. The bitwise comparisons can be conveniently performed by using one thread per orientation in the DOT feature. A straightforward method to accumulate the comparison result is to use parallel segmented reduction. However, this can be wasteful because the size of DOT feature is typically small and the number of training examples is typically large. To accumulate efficiently, we use the special atomic addition operations that have been recently implemented in GPU hardware.
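The matching step above can be sketched on CPU with integer bitmasks: a bitwise comparison per block followed by accumulation of the hits (for which the GPU path uses atomic additions; a plain sum stands in here). The "intersection counts as a match" scoring is a simplified stand-in for the paper's similarity \(C(I, I_T)\).

```python
import numpy as np

def dot_similarity(input_masks, template_masks):
    """Similarity between an input patch and a template: the number
    of blocks whose dominant-orientation bitmask intersects the
    template's OR-merged mask (bitwise AND != 0). On GPU the
    per-block results are accumulated with atomic adds."""
    inp = np.asarray(input_masks, dtype=np.uint32)
    tmp = np.asarray(template_masks, dtype=np.uint32)
    return int(np.count_nonzero(inp & tmp))
```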
5 Our Annotated RealWorld Dataset and Performance Evaluation
5.1 The Annotated RealWorld Dataset
In addition to our kinematic chain model of Fig. 4, an alternative characterization (Girshick et al. 2011) of a 3D hand pose consists of a sequence of joint locations \(\mathbf {v} = \big \{ v_i \in {\mathbb {R}}^3 : i=1, \ldots , m \big \}\), where \(m\) refers to the number of joints, and \(v_i\) specifies the 3D location of a joint. In terms of performance evaluation, this characterization by joint locations (as illustrated in Fig. 8) is usually more easily interpreted when it comes to comparing pose estimation results. As this hand pose characterization is obtained from the ShapeHand data-glove, there exist some slight differences in joints compared with the kinematic model: first, all five fingertips are additionally considered in Fig. 8; second, there are three thumb joints in Fig. 4 while only two of them are retained in Fig. 8, as ShapeHand does not measure the thumb base joint directly. Nevertheless, there exists a unique mapping between the two characterizations.
5.2 Performance Evaluation Metric and Its Computation
Our performance evaluation metric is based on the joint error, defined as the average Euclidean distance in 3D space over all the joints. Note that the joints in this context refer to the 20 joints defined in Fig. 8, i.e. all the joints of our skeletal model of Fig. 4 plus the tips of the five fingers and the hand base, except for one thumb joint that is excluded due to a compatibility issue with the ShapeHand dataglove used in our empirical experiments. Formally, denote by \(\mathbf {v}_g\) and \(\mathbf {v}_e\) the ground-truth and estimated joint locations. The joint error of the hand pose estimate \(\mathbf {v}_e\) is defined as \(e = \frac{1}{m}\sum _{i=1}^m \Vert v_{gi} - v_{ei} \Vert \), where \(\Vert \cdot \Vert \) is the Euclidean norm in 3D space. Moreover, as we deal with a number of test hand images, let \(j=1,\ldots , n_t\) run over the test images with corresponding joint errors \(\{e_1, \ldots , e_{n_t}\}\); the mean joint error is then defined as \(\frac{1}{n_t}\sum _{j} e_j\), and the median joint error is simply the median of this set of errors.
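The metric is straightforward to compute; a minimal NumPy sketch (function names are ours) following the definitions above:

```python
import numpy as np

def joint_error(v_g, v_e):
    """e = (1/m) * sum_i ||v_gi - v_ei||, with v_g, v_e of shape (m, 3)."""
    return np.linalg.norm(v_g - v_e, axis=1).mean()

def summarize(errors):
    """Mean and median joint error over a set of per-image joint errors."""
    errors = np.asarray(errors, dtype=float)
    return errors.mean(), np.median(errors)
```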
When working with annotated real depth images, a number of practical issues have to be addressed. Below we present the major ones. To avoid interference from the tapes fixed at the back of the ShapeHand dataglove, our depth images focus on frontal views. Empirically, we have evaluated the reliability of the dataglove annotations via a number of simple but informative tests, where we observed that the ShapeHand device produces reasonable and consistent measurements (i.e. within mm accuracy) on all the finger joints except for the thumb, where significant errors are observed. We believe the source of this error lies in the design of the instrument. As a result, even though we have included the thumb-related joints in our dataset, they are presently ignored during performance evaluation. In other words, the three thumb-related joints are not considered when evaluating hand pose estimation algorithms; as displayed in Fig. 8, this gives \(m=20-3=17\) in practice. The dataglove also gives inaccurate measurements when the palm arches (bends) deeply, so we have to withdraw several gestures from consideration, including 3, 7, R, T and W. Note that on synthetic data all finger joints are considered, as discussed previously.
The last, and most significant, issue is alignment: due to the physical principle of ShapeHand data acquisition, its coordinate frame originates at the hand base, which differs from the coordinate frame of the hand pose estimated from the depth camera. The two frames are related by a linear coordinate transformation; in other words, the estimated joint positions need to be transformed from the camera coordinate frame to the ShapeHand coordinate frame. More specifically, denote by \(v_i^S\) and \(v_i^C\) the 3D locations of a joint in the ShapeHand (\(S\)) and camera (\(C\)) coordinate frames, respectively, where \(i\) indexes over the \(m\) joints. The transformation matrix \(T^S_C\) from \((C)\) to \((S)\) can then be uniquely obtained following the least-squares 3D alignment method of Umeyama (1991).
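A minimal sketch of this least-squares alignment in the rigid (rotation plus translation) case, assuming both frames are metric so the scale factor of Umeyama (1991) can be dropped; the function name and API are ours:

```python
import numpy as np

def umeyama_align(src, dst):
    """Least-squares rigid alignment (after Umeyama, 1991) mapping
    camera-frame joints `src` (m, 3) onto ShapeHand-frame joints
    `dst` (m, 3).  Returns R, t such that dst ~= src @ R.T + t."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    # Cross-covariance of the centered point sets.
    cov = (dst - mu_d).T @ (src - mu_s) / len(src)
    U, _, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0          # guard against reflections
    R = U @ S @ Vt
    t = mu_d - R @ mu_s
    return R, t
```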
6 Experiments
Throughout the experiments we set \(T_1=7\) and \(T_2=12\), and the depth of the trees to 20. Altogether \(460\hbox {K}\) synthetic training examples are used, as illustrated in Fig. 11. These training examples cover generic gestures from American Sign Language and Chinese number counting, together with their out-of-plane pitch and roll rotations, as well as in-plane rotational perturbations. The minimum number of estimation examples stored at leaf nodes is set to \(k_n=30\). \(m_0\) is set to a large constant of \(10^7\), which in practice allows all training examples to be considered when choosing a threshold \(t\) at a split node. Evaluating the depth features requires access to the local image window centered at the current pixel during the first step, and to the whole hand patch during the second step, of sizes \((w,h) = (50, 50)\) and \((w,h) = (120, 160)\), respectively. The size of the feature space \(d\) is fixed to 3000, and the related probability is \(p=0.2\). Distance is defined as the Euclidean distance between hand and camera. By default we focus on our DHand model during experiments.
In what follows, empirical simulations are carried out on the synthetic dataset to investigate various aspects of our system under controlled settings. This is followed by extensive experiments with real-world data. In addition to hand pose estimation, our system is also shown to work on related tasks such as part-based labeling and gesture classification.
6.1 Experiments on Synthetic Data
To conduct quantitative analysis, we first work with an in-house dataset of \(1.6\,\hbox {K}\) synthesized hand depth images covering a range of distances (from 350 to 700 mm). As with the real data, the resolution of the depth camera is set to \(320\times 240\). When the distance from the hand to the camera is \(\mathrm {dist}=350\) mm, the bounding box of the hand in the image plane is typically of size \(70 \times 100\); when \(\mathrm {dist}=500\) mm, the size is reduced to \(49\times 70\); when \(\mathrm {dist}=700\) mm, it further decreases to \(35\times 50\). White noise with standard deviation 15 mm is added to the synthetic depth images.
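The noise step can be sketched as follows; note the function name is ours, and leaving background (zero-depth) pixels untouched is our own assumption rather than a detail stated in the paper:

```python
import numpy as np

def add_depth_noise(depth_mm, sigma=15.0, seed=None):
    """Perturb a synthetic depth map (values in mm) with white Gaussian
    noise of standard deviation 15 mm, as used for our synthetic set.
    Zero-depth (background) pixels are left untouched (our assumption)."""
    rng = np.random.default_rng(seed)
    noisy = depth_mm + rng.normal(0.0, sigma, depth_mm.shape)
    return np.where(depth_mm > 0, noisy, depth_mm)
```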
6.1.1 Estimation Error of Step One
We further investigate, on the same synthetic dataset, the effects of perturbing the step-one estimates on the final estimates of our system. A systematic study is presented in the three-dimensional bar plot of Fig. 13, where the hand position and in-plane rotation errors of step one form the two-dimensional input and the mean joint error is the output: assuming the inputs from step one are perfect (i.e. zero error in both dimensions), the final error of our system is around 15 mm. As both input errors increase, the final mean joint error goes up to over 40 mm. It is therefore fair to say that our system is reasonably robust against perturbations of the step-one results. Interestingly, our pipeline seems particularly insensitive to the in-plane rotation error of step one: the final error changes by only 5 mm when the in-plane rotation error varies between 0 and 30 degrees. Finally, as shown in Fig. 13, where the errors of our first step (the green bar) are relatively small, our final estimation error is around 22 mm (mean joint error).
6.1.2 Number of Trees
6.1.3 Split Criteria
Two different split criteria are used for tree training in the second forest. When all 23 parameters are used to compute the entropy, the mean and median joint errors are 21.7 and 19.1 mm, respectively. The hand rotation around the Y-axis plays an important role in training the forest: when each node considers only the distribution of the Y-axis rotation and the balance of the split (8), the mean and median joint errors are 21.5 and 19.2 mm, respectively. Considering only the Y-axis rotation thus performs as well as considering all 23 parameters.
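To make the reduced criterion concrete, a toy sketch of scoring one candidate threshold by the size-weighted drop in Y-axis-rotation variance it produces; the function, names, and exact weighting are our own illustration, not the paper's Eq. (8):

```python
import numpy as np

def y_rotation_split_gain(feature_vals, y_rot, threshold):
    """Score a candidate split threshold by how much it reduces the
    variance of the Y-axis rotation across the two children
    (illustrative sketch only)."""
    left = y_rot[feature_vals < threshold]
    right = y_rot[feature_vals >= threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0                      # degenerate split: no gain
    n = len(y_rot)
    before = y_rot.var()
    after = (len(left) * left.var() + len(right) * right.var()) / n
    return before - after
```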
6.1.4 Performance Evaluation of the Proposed Versus Existing Methods
Figure 15 provides a performance comparison (in terms of mean/median joint error) among several competing methods: the two proposed baseline methods (BaselineM and BaselineS), the proposed main method (DHand), and a comparison method (Xu and Cheng 2013), denoted ICCV’13. Overall, our method DHand delivers the best results across all distances, followed by BaselineM and BaselineS, which matches our expectation, while ICCV’13 achieves the worst performance. In addition, our proposed methods are shown to be rather insensitive to distance changes (anywhere between 350 and 700 mm), while ICCV’13 performs best around 450 mm and declines at larger distances.
Comparison over Different Matching Methods: A few state-of-the-art object template matching methods are commonly used for related tasks, including DOT (Hinterstoisser et al. 2010), HOG (Dalal and Triggs 2005), and NCC (Lewis 1995). Figure 17 presents a performance comparison of our approach when adopting each of these matching methods. DOT consistently performs best, followed by HOG, while NCC always delivers the worst results. In terms of memory usage, DOT consumes 100 MB, HOG takes 4 GB, and NCC needs 2 GB; clearly DOT is the most cost-effective option. Note that in addition to the 100 MB consumed by DOT, the 288 MB memory footprint of our system also includes other overheads such as third-party libraries.
6.2 Experiments on Real Data
Experiments in this section focus on our in-house real-world depth images introduced in Sect. 5. By default, the distance from the hand to the camera is fixed at 500 mm. Throughout the experiments three sets of depth images are used, as presented in Fig. 1: (1) bare hand imaged by a top-mounted camera; (2) bare hand imaged by a front-mounted camera; (3) hand with dataglove imaged by a top-mounted camera.
6.2.1 Comparisons with State-of-the-Art Methods
Experiments are also conducted to qualitatively compare DHand with state-of-the-art methods (Tang et al. 2014; Oikonomidis and Argyros 2011; Leapmotion 2013) on pose estimation and tracking tasks, as shown in Figs. 22 and 23. Note that Tang et al. (2014) is re-implemented by ourselves, while the original implementations of the other two methods are employed.
Recently, the latent regression forest (LRF) method was developed in Tang et al. (2014) to estimate finger joints from single depth images. As presented in Fig. 22, for the eight distinct hand images from left to right, LRF gives relatively reasonable results on the first four and makes noticeable mistakes on the remaining four, while our method consistently offers visually plausible estimates. Note that in this experiment all hand images are acquired in frontal view only, as LRF has been observed to deteriorate significantly when the hand rotates around the Y-axis, as also revealed in Fig. 5 of our paper, an issue we consider among the leading challenges of hand pose estimation.
We further compare DHand with two state-of-the-art hand tracking methods: the well-known tracker of Oikonomidis and Argyros (2011), and a commercial software, Leap Motion (Leapmotion 2013), of which the stable version 1.2.2 is used. Unfortunately, each tracker operates on different hardware: Oikonomidis and Argyros (2011) works with Kinect (2011), taking as input streaming pairs of color and depth images, while Leap Motion runs on proprietary camera hardware (Leapmotion 2013); as Leap Motion is a closed system, its results in Fig. 23 are screen captures from its visualizer. To accommodate these differences, we engage all cameras at the same time during data acquisition: the Kinect and Softkinetic cameras are placed close together to ensure their inputs come from similar side views, and the hands hover about 17 cm above the Leap Motion. We also allow both trackers sufficient lead time, so that both function well before being exposed to each of the hand scenes displayed in the first row of Fig. 23. Taking each of these ten images as input, DHand consistently delivers plausible results, while the performance of both tracking methods is rather mixed: Oikonomidis and Argyros (2011) does not seem to fit our hand size entirely well, and in particular we observed that its performance degrades when the palm arches, as e.g. in the seventh case. Leap Motion produces reasonable results for the first five cases and performs less well on the remaining five.
6.2.2 PartBased Labeling
Our proposed approach can also be used to label hand parts, where the objective is to assign each pixel to one of a list of prescribed hand parts; here we adopt the color-coded part labels of Fig. 10. A simple scheme converts our hand pose estimate into part-based labels: from the input depth image, the hand area is first segmented from the background; our predicted hand pose is then applied to a synthetic 3D hand and projected onto the input image, and each overlapping pixel is assigned the corresponding color label. Each pixel not covered by the synthetic hand model is allocated a label from the nearest overlapped region.
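The final fill-in step can be sketched as a brute-force nearest-neighbour assignment over the segmented hand region (the function name, `-1` sentinel, and brute-force search are our own illustrative choices):

```python
import numpy as np

def fill_uncovered(labels, covered):
    """Give each pixel not covered by the projected synthetic model the
    label of its nearest covered pixel.  `labels` is an integer label
    map (-1 where unlabeled); `covered` is a boolean mask of pixels
    overlapped by the projected model."""
    ys, xs = np.nonzero(covered)
    src = np.stack([ys, xs], axis=1)          # coordinates of covered pixels
    out = labels.copy()
    for y, x in zip(*np.nonzero(~covered)):
        d2 = ((src - (y, x)) ** 2).sum(1)     # squared pixel distances
        ny, nx = src[np.argmin(d2)]
        out[y, x] = labels[ny, nx]
    return out
```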
Figure 24 presents exemplar labeling results on real-world depth images where the dataglove is worn. To illustrate the variety of images, we present horizontally a series of distinct gestures and vertically instances of the same gesture from different subjects. Visually the results are quite satisfactory: the color labels are mostly correct and consistent across gestures and subjects. Our labeling results are also remarkably insensitive to background changes, including the wires of the dataglove.
6.2.3 Gesture Classification
Rather than emphasizing connections and comparisons with existing ASL-focused research efforts, the aim here is to showcase the capability of our pose estimation system on the related task of gesture recognition. We therefore take the liberty of considering a combined set of gestures, exactly the 29 gestures discussed previously in the dataset and evaluation section: instead of pose estimation, here we consider the problem of assigning each test hand image to its corresponding gesture category. Notice that some gestures are very similar to each other, including e.g. {“t”, “s”, “m”, “n”, “a”, “e”, “10”}, {“p”, “v”, “h”, “r”, “2”}, {“b”, “4”}, and {“x”, “9”}, as also illustrated in Fig. 9. Overall the average accuracy is 0.53.
The overall low score is mainly due to the similarity of several gesture types under consideration. For example, \(X\) in ASL is very similar to the number counting gesture \(9\), and is also very close to the number \(1\). This explains why letter \(X\) is correctly predicted with probability only \(0.27\), while with probability \(0.27\) it is wrongly classified as number \(9\) and with \(0.13\) as number \(1\), as displayed in the confusion matrix of Fig. 25.
6.2.4 Execution Time
Our efficient CPU implementation enables the system to run in near real-time at 15.6 FPS, while our GPU implementation further boosts the speed to 67.2 FPS.
6.2.5 Limitations
Some failure cases of our hand pose estimation are presented in Fig. 21. A closed hand is usually difficult to deal with since no finger joints or tips are visible: as demonstrated in the first column, it may be confused with similar gestures. Even when a few fingers are visible, different hand gestures may still be confused, as they can look very similar when projected onto the image plane from certain viewing angles, as presented in the second to fourth columns. The last two columns display scenarios with overlapping fingers, which may also be wrongly estimated.
As our approach is based on single depth images, the results may appear jittery when working with video streams. We also remark that our method is fundamentally different from tracking-based methods, where gradient-based or stochastic optimization would be used to exploit the available temporal information. As a result, the accuracy of our method may slightly lag behind that of a tracking-enabled approach with good initialization.
7 Conclusion and Outlook
This paper presents an efficient and effective two-step pipeline for hand pose estimation. GPU acceleration of the computational bottleneck is also presented, which significantly speeds up runtime execution. A dataglove-annotated hand depth image dataset is also described as an option for performance comparison across approaches. Extensive empirical evaluations demonstrate the competitive performance of our approach, in addition to a theoretical consistency analysis of a slightly simplified version. For future work, we plan to integrate temporal information to eliminate the jittering effects of our system when working with live streams.
Footnotes
 1.
A project webpage can be found at http://web.bii.astar.edu.sg/~xuchi/handengine.htm, which contains supplementary information of this paper such as the demo video.
 2.
Our annotated dataset of depth images and the online performance evaluation system for 3D hand pose estimation are publicly available at http://hpes.bii.astar.edu.sg/.
 3.
\(diam(A)=\sup _{\overline{\mathbf {x}}_1,\overline{\mathbf {x}}_2\in A}\Vert \overline{\mathbf {x}}_1-\overline{\mathbf {x}}_2\Vert \) for a set \(A\subset {\mathbb {R}}^d\).
Acknowledgments
This research was partially supported by A*STAR JCO and IAF grants. We would like to thank Vaghul Aditya Balaji for helping with the data collection and website design processes during his intern attachment at BII.
References
 Andrews, H. C., & Patterson, C. L. (1976). Digital interpolation of discrete images. IEEE Transactions on Computers, C-25(2), 196–202.
 Ballan, L., Taneja, A., Gall, J., Gool, L., & Pollefeys, M. (2012). Motion capture of hands in action using discriminative salient points. In ECCV.
 Biau, G., Devroye, L., & Lugosi, G. (2008). Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9, 2015–2033.
 Biau, G. (2012). Analysis of a random forests model. Journal of Machine Learning Research, 13, 1063–1095.
 Breiman, L. (2004). Consistency for a simple model of random forests. Tech. rep., UC Berkeley.
 Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
 Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603–619.
 Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In CVPR (Vol. 1, pp. 886–893).
 de La Gorce, M., Fleet, D., & Paragios, N. (2011). Model-based 3D hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(9), 1793–1805.
 Denil, M., Matheson, D., & de Freitas, N. (2014). Narrowing the gap: Random forests in theory and practice. In ICML.
 Erol, A., Bebis, G., Nicolescu, M., Boyle, R., & Twombly, X. (2007). Vision-based hand pose estimation: A review. Computer Vision and Image Understanding, 108(1–2), 52–73.
 Fanelli, G., Gall, J., & Gool, L. V. (2011). Real time head pose estimation with random regression forests. In CVPR.
 Gall, J., & Lempitsky, V. (2013). Class-specific Hough forests for object detection. In Decision forests for computer vision and medical image analysis (pp. 143–157). Berlin: Springer.
 Girshick, R., Shotton, J., Kohli, P., Criminisi, A., & Fitzgibbon, A. (2011). Efficient regression of general-activity human poses from depth images. In ICCV.
 Gustus, A., Stillfried, G., Visser, J., Jorntell, H., & van der Smagt, P. (2012). Human hand modelling: Kinematics, dynamics, applications. Biological Cybernetics, 106(11–12), 741–755.
 Györfi, L., Kohler, M., Krzyzak, A., & Walk, H. (2002). A Distribution-Free Theory of Nonparametric Regression. Berlin: Springer.
 Hackenberg, G., McCall, R., & Broll, W. (2011). Lightweight palm and finger tracking for real-time 3D gesture control. In IEEE Virtual Reality Conference (pp. 19–26).
 Hamming, R. W. (1997). Digital filters (3rd ed.). Dover Publications.
 Hansard, M., Lee, S., Choi, O., & Horaud, R. (2013). Time-of-flight cameras: Principles, methods and applications. Berlin: Springer.
 Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., & Navab, N. (2010). Dominant orientation templates for real-time detection of texture-less objects. In CVPR.
 Keskin, C., Kirac, F., Kara, Y., & Akarun, L. (2012). Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV.
 Kinect. (2011). http://www.xbox.com/enUS/kinect/.
 Leapmotion. (2013). http://www.leapmotion.com.
 Lewis, J. (1995). Fast normalized cross-correlation. In Vision Interface (Vol. 10, pp. 120–123).
 Melax, S., Keselman, L., & Orsten, S. (2013). Dynamics based 3D skeletal hand tracking. In Graphics Interface.
 Oikonomidis, N., & Argyros, A. (2011). Efficient model-based 3D tracking of hand articulations using Kinect. In BMVC.
 Oikonomidis, I., Lourakis, M., & Argyros, A. (2014). Evolutionary quasi-random search for hand articulations tracking. In CVPR.
 Peachey, D. (1990). Texture on demand. Tech. rep.
 ShapeHand. (2009). http://www.shapehand.com/shapehand.html.
 Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR.
 Shotton, J., Sharp, T., Kipman, A., Fitzgibbon, A., Finocchio, M., Blake, A., et al. (2013). Real-time human pose recognition in parts from single depth images. Communications of the ACM, 56(1), 116–124.
 Softkinetic. (2012). http://www.softkinetic.com.
 Sridhar, S., Oulasvirta, A., & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using RGB and depth data. In ICCV.
 Sueda, S., Kaufman, A., & Pai, D. (2008). Musculotendon simulation for hand animation. In SIGGRAPH (pp. 83:1–83:8).
 Tang, D., Tejani, A., Chang, H., & Kim, T. (2014). Latent regression forest: Structured estimation of 3D articulated hand posture. In CVPR.
 Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., Izadi, S., Fitzgibbon, A., & Hertzmann, A. (2014). User-specific hand modeling from monocular depth sequences. In CVPR.
 Tzionas, D., & Gall, J. (2013). A comparison of directional distances for hand pose estimation. In German Conference on Pattern Recognition.
 Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 376–380.
 Wang, R., & Popović, J. (2009). Real-time hand-tracking with a color glove. In SIGGRAPH (pp. 63:1–63:8).
 Xu, C., & Cheng, L. (2013). Efficient hand pose estimation from a single depth image. In ICCV.
 Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., & Gall, J. (2013). A survey on human motion analysis from depth data. In Time-of-flight and depth imaging: Sensors, algorithms, and applications (pp. 149–187). Berlin: Springer.
 Zhao, W., Chai, J., & Xu, Y. (2012). Combining marker-based mocap and RGB-D camera for acquiring high-fidelity hand motion data. In Eurographics Symposium on Computer Animation.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.