Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input
Abstract
Real-time simultaneous tracking of hands manipulating and interacting with external objects has many potential applications in augmented reality, tangible computing, and wearable computing. However, due to difficult occlusions, fast motions, and uniform hand appearance, jointly tracking hand and object pose is more challenging than tracking either of the two separately. Many previous approaches resort to complex multi-camera setups to remedy the occlusion problem and often employ expensive segmentation and optimization steps, which make real-time tracking impossible. In this paper, we propose a real-time solution that uses a single commodity RGB-D camera. The core of our approach is a 3D articulated Gaussian mixture alignment strategy tailored to hand-object tracking that allows fast pose optimization. The alignment energy uses novel regularizers to address occlusions and hand-object contacts. For added robustness, we guide the optimization with discriminative part classification of the hand and segmentation of the object. We conducted extensive experiments on several existing datasets and introduce a new annotated hand-object dataset. Quantitative and qualitative results show the key advantages of our method: speed, accuracy, and robustness.
Keywords
Random Forest · Augmented Reality · Gaussian Mixture Model · Leap Motion · Part Label

1 Introduction

The main contributions of this work are:

A 3D articulated Gaussian mixture alignment approach for jointly tracking hand and object accurately.

Novel contact point and occlusion objective terms that are motivated by the physics of grasps and can handle difficult hand-object interactions.

A multi-layered classification architecture to segment hand and object, and classify hand parts in RGB-D sequences.

An extensive evaluation on public datasets, as well as a new, fully annotated dataset consisting of diverse hand-object interactions.
2 Related Work
Single Hand Tracking. Single hand tracking has received a lot of attention in recent years, with discriminative and generative methods being the two main classes of approaches. Discriminative methods for monocular RGB tracking index into a large database of poses or learn a mapping from image to pose space [3, 42]. However, accuracy and temporal stability of these methods are limited. Monocular generative methods optimize the pose of more sophisticated 3D or 2.5D hand models by optimizing an alignment energy [6, 9, 15]. Occlusions and appearance ambiguities are less problematic with multi-camera setups [5]. [41] use a physics-based approach to optimize the pose of a hand using silhouette and color constraints at slow non-interactive frame rates. [28] use multiple RGB cameras and a single depth camera to track single hand poses in near real-time by combining generative tracking and fingertip detection. More lightweight setups with a single depth camera are preferred for many interactive applications. Among single-camera methods, examples of discriminative methods are based on decision forests for hand part labeling [11], a latent regression forest in combination with a coarse-to-fine search [33], fast hierarchical pose regression [31], or Hough voting [43]. Real-time performance is feasible, but temporal instability remains an issue. [19] generatively track a hand by optimizing a depth- and appearance-based alignment metric with particle swarm optimization (PSO). A real-time generative tracking method with a physics-based solver was proposed in [16]. The stabilization of real-time articulated ICP based on a learned subspace prior on hand poses was used in [32]. Template-based non-rigid deformation tracking of arbitrary objects in real-time from RGB-D was shown in [45], but only very simple unoccluded hand poses can be tracked. Combining generative and discriminative tracking enables recovery from some tracking failures [25, 28, 39].
[27] showed real-time single hand tracking from depth using generative pose optimization under detection constraints. Similarly, re-initialization of generative estimates via fingertip detection [23], multi-layer discriminative re-initialization [25], or joints detected with convolutional networks [36] is feasible. [34] employ hierarchical sampling from partial pose distributions and a final hypothesis selection based on a generative energy. None of the above methods is able to track interacting hands and objects simultaneously and in non-trivial poses in real-time.
Tracking Hands in Interaction. Tracking two interacting hands, or a hand and a manipulated object, is a much harder problem. The straightforward combination of methods for object tracking, e.g. [4, 35], and hand tracking does not lead to satisfactory solutions, as only a combined formulation can methodically exploit mutual constraints between object and hand. [40] track two well-separated hands from stereo by efficient pose retrieval and IK refinement. In [18], two hands in interaction are tracked at 4 Hz with an RGB-D camera by optimizing a generative depth and image alignment measure. Tracking of interacting hands from multi-view video at slow non-interactive runtimes was shown in [5]. They use generative pose optimization supported by salient point detection. The method in [32] can track very simple two-hand interactions with little occlusion. Commercial solutions, e.g. Leap Motion [1] and NimbleVR [2], fail if two hands interact closely or interact with an object. In [17], a markerless method based on a generative pose optimization of a combined hand-object model is proposed. They explicitly model collisions, but need multiple RGB cameras. In [8], the most likely pose is found through belief propagation using part-based trackers. This method is robust under occlusions, but does not explicitly track the object. In [24], a temporally coherent nearest-neighbor search tracks a hand manipulating an object in real time, but not the object itself; results are prone to temporal jitter. [13] perform frame-to-frame tracking of hand and objects from RGB-D using physics-based optimization. This approach has a slow non-interactive runtime. An ensemble of Collaborative Trackers (ECT) for RGB-D-based multi-object and multiple-hand tracking is used in [14]. Their accuracy is high, but runtime is far from real-time. [21] infer contact forces from a tracked hand interacting with an object at slow non-interactive runtimes. [20, 38] propose methods for in-hand RGB-D object scanning.
Both methods use known generative methods to track finger contact points to support ICP-like shape scanning. Recently, [37] introduced a method for hand-only, hand-hand, and hand-object tracking (we include a comparison with this method). None of the above methods can track the hand and the manipulated object in real-time in non-trivial poses from a single depth camera view, which is what our approach achieves.
3 Discriminative Hand Part Classification
As a preprocessing step, we classify depth pixels as hand or object, and further into hand parts. The obtained labeling is later used to guide the generative pose optimization. Our part classification strategy is based on a two-layer random forest that takes occlusions into account. Classification is based on a three-step pipeline (see Fig. 3). Input is the color \(\mathcal {C}_t\) and depth \(\mathcal {D}_t\) frames captured by the RGB-D sensor. We first perform hand-object segmentation based on color cues to remove the object from the depth map. Afterwards, we select a suitable two-layer random forest to obtain the classification. The final output per pixel is a part probability histogram that encodes the class likelihoods. Note that object pixel histograms are set to an object class probability of 1. The forests are trained on a set of training images that consists of real hand motions retargeted to a virtual hand model to generate synthetic data from multiple viewpoints. A virtual object is automatically inserted in the scene to simulate occlusions. To this end, we randomly sample uniform object positions between the thumb and one other finger and prune implausible poses based on intersection tests.
Viewpoint Selection. We trained two-layer forests for hand part classification from different viewpoints. Four cases are distinguished: observing the hand from the front, the back, the thumb side, and the little finger side. We select the forest that best matches the hand orientation computed in the last frame. The selected two-layer forest is then used for hand part classification.
Color-Based Object Segmentation. As a first step, we segment out the object from the captured depth map \(\mathcal {D}_t\). Similar to many previous hand-object tracking approaches [19], we use the color image \(\mathcal {C}_t\) in combination with an HSV color segmentation strategy. As we show in the results, we are able to support objects with different colors. Object pixels are removed to obtain a new depth map \(\mathcal {\hat{D}}_t\), which we then feed to the next processing stage.
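As an illustration, the segmentation step can be sketched as a simple threshold in HSV space followed by removal of the masked pixels from the depth map. The helper names (`segment_object_hsv`, `remove_object`) and the hue/saturation thresholds are illustrative assumptions, not the values used in the paper:

```python
import numpy as np

def segment_object_hsv(hsv, hue_range=(0.55, 0.70), min_sat=0.4):
    """Mask pixels whose hue/saturation fall in the object's color range.

    hsv: (H, W, 3) float array with hue, saturation, value in [0, 1].
    The thresholds are hypothetical; in practice they are tuned per object.
    Returns a boolean mask that is True for object pixels.
    """
    h, s = hsv[..., 0], hsv[..., 1]
    return (h >= hue_range[0]) & (h <= hue_range[1]) & (s >= min_sat)

def remove_object(depth, object_mask, invalid=0.0):
    """Produce the object-free depth map fed to hand part classification."""
    cleaned = depth.copy()
    cleaned[object_mask] = invalid  # removed pixels read as missing depth
    return cleaned
```

The object mask itself defines \(\mathcal {D}_t\)'s object pixels, while the cleaned map plays the role of \(\mathcal {\hat{D}}_t\).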
Two-Layer Hand Part Classification. We use a two-layer random forest for hand part classification. The first layer classifies hand and arm pixels, while the second layer takes the hand pixels and further classifies them into one of several distinct hand parts. Both layers are per-pixel classification forests [26]. The hand-arm classification forest is trained on \(N = 100k\) images with diverse hand-object poses. For each of the four viewpoints, a random forest is trained on \(N=38k\) images. The random forests are based on three trees, each trained on a random distinct subset. In each image, 2000 example foreground pixels are chosen. Split decisions at nodes are based on 100 random feature offsets and 40 thresholds. Candidate features are a uniform mix of unary and binary depth difference features [26]. Nodes are split as long as the information gain is sufficient and the maximum tree depth of 19 (21 for the hand-arm forest) has not been reached. On the first layer, we use 3 part labels: 1 for the hand, 1 for the arm, and 1 for the background. On the second layer, classification is based on 7 part labels: 6 for the hand parts and 1 for the background. We use one label for each finger and one for the palm, see Fig. 3c. We use a cross-validation procedure to find the best hyperparameters. On the disjoint test set, the hand-arm forest has a classification accuracy of 65.2%. The forests for the four camera views had accuracies of 59.8% (front), 64.7% (back), 60.9% (little), and 53.5% (thumb).
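The depth difference features of [26] admit a compact sketch: offsets are divided by the depth at the center pixel, which makes the response invariant to the distance of the hand from the camera. The function name and the out-of-bounds constant below are our own illustrative choices:

```python
import numpy as np

def depth_difference_feature(depth, px, py, u, v=None, large=10000.0):
    """Depth-normalized difference feature in the style of [26].

    depth: (H, W) depth map in mm; (px, py): query pixel; u, v: 2D offsets
    in pixel-millimeters. With v=None, the unary variant
    d(x + u/d(x)) - d(x) is computed; otherwise the binary variant
    d(x + u/d(x)) - d(x + v/d(x)).
    """
    h, w = depth.shape
    d0 = depth[py, px]

    def probe(off):
        # Scale the offset by the inverse center depth, then read the map.
        ox = px + int(round(off[0] / d0))
        oy = py + int(round(off[1] / d0))
        if 0 <= ox < w and 0 <= oy < h:
            return depth[oy, ox]
        return large  # out-of-bounds probes read as far background

    if v is None:
        return probe(u) - d0
    return probe(u) - probe(v)
```

A split node in a tree stores one such feature and a threshold; training searches over the random offsets and thresholds mentioned above.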
4 Gaussian Mixture Model Representation
Joint hand-object tracking requires a representation that allows for accurate tracking, is robust to outliers, and enables fast pose optimization. Gaussian mixture alignment, initially proposed for rigid point-set alignment (e.g. [10]), satisfies all these requirements. It features the advantages of ICP-like methods, without requiring a costly, error-prone correspondence search. We extend this approach to 3D articulated Gaussian mixture alignment tailored to hand-object tracking. Compared to our 3D formulation, 2.5D approaches [27] are discontinuous. This causes instabilities, since the spatial proximity between model and data is not fully considered. We quantitatively show this for hand-only tracking (Sect. 8).
5 Unified Density Representation
We parameterize the articulated motion of the human hand using a kinematic skeleton with \(\mathcal {X}_h = 26\) degrees of freedom (DOF). Non-rigid hand motion is expressed based on 20 joint angles in twist representation. The remaining 6 DOFs specify the global rigid transform of the hand with respect to the root joint. The manipulated object is assumed to be rigid and its motion is parameterized using \(\mathcal {X}_o = 6\) DOFs. In the following, we deal with the hand and object in a unified way. To this end, we refer to the vector of all unknowns as \(\mathcal {X}\). For pose optimization, both the input depth and the scene (hand and object) are expressed as 3D Gaussian Mixture Models (GMMs). This allows for fast and analytical pose optimization. We first define the following generic probability density distribution \(\mathcal {M}(\mathbf {x}) = \sum _{i=1}^{K}{w_i \, \mathcal {G}_i(\mathbf {x} \mid \varvec{\mu }_i, \sigma _i)}\) at each point \(\mathbf {x} \in \mathbb {R}^3\) in space. This mixture contains K unnormalized, isotropic Gaussian functions \(\mathcal {G}_i\) with mean \(\varvec{\mu }_i \in \mathbb {R}^3\) and variance \(\sigma _i^2 \in \mathbb {R}\). In the case of the model distribution, the positions of the Gaussians are parameterized by the unknowns \(\mathcal {X}\). For the hand, this means each Gaussian is rigidly rigged to one bone of the hand. The probability density is defined and non-vanishing over the whole domain \(\mathbb {R}^3\).
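Evaluating the mixture at a point is a direct transcription of the formula above; the function name below is our own:

```python
import numpy as np

def gmm_density(x, mus, sigmas, ws):
    """Evaluate M(x) = sum_i w_i G_i(x | mu_i, sigma_i) with unnormalized,
    isotropic Gaussians G_i(x) = exp(-||x - mu_i||^2 / (2 sigma_i^2)).

    x: 3D point; mus: (K, 3) means; sigmas: (K,) std devs; ws: (K,) weights.
    """
    x = np.asarray(x, dtype=float)
    d2 = np.sum((np.asarray(mus, float) - x) ** 2, axis=1)  # squared distances
    return float(np.sum(np.asarray(ws) * np.exp(-0.5 * d2 / np.asarray(sigmas) ** 2)))
```

Because each Gaussian is unnormalized, the density equals the weight \(w_i\) exactly at the mean \(\varvec{\mu }_i\) and decays smoothly everywhere else, which is what gives the alignment energy its large basin of convergence.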
Hand and Object Model. The three-dimensional shape of the hand and object is represented in a similar fashion as probability density distributions \(\mathcal {M}_{h}\) and \(\mathcal {M}_{o}\), respectively. We manually attach \(N_h = 30\) Gaussian functions to the kinematic chain of the hand to model its volumetric extent. Standard deviations are set such that they roughly correspond to the distance to the actual surface. The object is represented by automatically fitting a predefined number \(N_o\) of Gaussians to its spatial extent, such that the one-standard-deviation spheres model the object's volumetric extent. \(N_o\) is a user-defined parameter that controls the trade-off between tracking accuracy and runtime performance. We found that \(N_o \in [12, 64]\) provides a good trade-off between speed and accuracy for the objects used in our experiments. We refer to the combined hand-object distribution as \(\mathcal {M}_{s}\), with \(N_s = N_h + N_o\) Gaussians. Each Gaussian is assigned a class label \(l_i\) based on its semantic location in the scene. Note that the input depth only captures the visible surface of the hand/object. Therefore, we incorporate a visibility factor \(f_i \in [0, 1]\) (0 completely occluded, 1 completely visible) per Gaussian. This factor is approximated by rendering an occlusion map with each Gaussian as a circle (radius equal to its standard deviation). The GMM is restricted to the visible surface by setting \(w_i=f_i\) in the mixture. These operations are performed based on the solution \(\mathcal {X}_{old}\) of the previous frame.
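The visibility factor can be sketched as a z-buffer splatting pass: each Gaussian is drawn as a circle at its projected center, and \(f_i\) is the fraction of its circle at which it is the front-most Gaussian. The function name and the discrete rasterization below are our own simplifications of this heuristic:

```python
import numpy as np

def visibility_factors(centers_px, radii_px, depths, shape):
    """Approximate per-Gaussian visibility f_i in [0, 1].

    Splats each Gaussian as a circle into a z-buffer and counts the pixels
    it wins. centers_px: (N, 2) projected centers; radii_px: (N,) projected
    standard deviations; depths: (N,) camera-space depths; shape: (H, W).
    """
    H, W = shape
    zbuf = np.full((H, W), np.inf)
    owner = np.full((H, W), -1, dtype=int)
    area = np.zeros(len(depths))
    ys, xs = np.mgrid[0:H, 0:W]
    for i, ((cx, cy), r, z) in enumerate(zip(centers_px, radii_px, depths)):
        disk = (xs - cx) ** 2 + (ys - cy) ** 2 <= r ** 2
        area[i] = disk.sum()
        win = disk & (z < zbuf)  # circle i is nearest at these pixels
        zbuf[win] = z
        owner[win] = i
    visible = np.bincount(owner[owner >= 0], minlength=len(depths))
    return np.where(area > 0, visible / np.maximum(area, 1), 0.0)
```

Setting \(w_i = f_i\) with these factors then suppresses occluded model Gaussians so that they do not compete for visible depth data.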
Input Depth Data. To convert the input to the density-based representation, we first perform bottom-up hierarchical quadtree clustering of adjacent pixels with similar depth. We cluster at most \((2^{(4-1)})^2=64\) pixels, which corresponds to a maximum tree depth of 4. Clustering is performed as long as the depth variance in the corresponding subdomain is smaller than \(\epsilon _{cluster}=30\) mm. Each leaf node is represented as a Gaussian function \(\mathcal {G}_i\) with \(\varvec{\mu }_i\) corresponding to the 3D center of gravity of the quad and \(\sigma _i^2=(\frac{a}{2})^2\), where a is the backprojected side length of the quad. Note that the mean \(\varvec{\mu }_i \in \mathbb {R}^3\) is obtained by backprojecting the 2D center of gravity of the quad based on the computed average depth and displacing it by a in the camera viewing direction, to obtain a representation that matches the model of the scene. In addition, each \(\mathcal {G}_i\) stores the probability \(p_i\) and index \(l_i\) of the best associated semantic label. We obtain the best label and its probability by summing over all corresponding per-pixel histograms obtained in the classification stage. Based on this data, we define the input depth distribution \(\mathcal {M}_{d_h}(\mathbf {x})\) for the hand and \(\mathcal {M}_{d_o}(\mathbf {x})\) for the object. The combined input distribution \(\mathcal {M}_{d}(\mathbf {x})\) has \(N_d = N_{d_o} + N_{d_h}\) Gaussians. We set uniform weights \(w_i=1\) based on the assumption of equal contribution. \(N_d\) is much smaller than the number of pixels, enabling real-time hand-object tracking.
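The clustering step can be sketched with a recursive subdivision of 8x8 tiles (the 64-pixel maximum above): a tile becomes one quad if its depth spread is small enough, and is otherwise split into four children. The function name, the use of standard deviation as the spread measure, and the top-down formulation are our own simplifying assumptions:

```python
import numpy as np

def cluster_depth_quadtree(depth, eps=30.0, max_size=8):
    """Cluster a depth map (mm) into quads of at most max_size x max_size
    pixels with small depth spread; one Gaussian is created per quad.

    Returns a list of (row, col, size, mean_depth) tuples. The center of
    gravity and sigma = size/2 of each quad's Gaussian follow directly.
    """
    quads = []

    def recurse(r, c, s):
        tile = depth[r:r + s, c:c + s]
        if s == 1 or tile.std() < eps:
            quads.append((r, c, s, float(tile.mean())))  # one leaf = one quad
        else:
            h = s // 2
            for dr in (0, h):
                for dc in (0, h):
                    recurse(r + dr, c + dc, h)

    H, W = depth.shape
    for r in range(0, H, max_size):
        for c in range(0, W, max_size):
            recurse(r, c, max_size)
    return quads
```

On smooth surfaces this collapses thousands of pixels into a few dozen quads, which is what keeps \(N_d\) small enough for real-time optimization.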
6 Multiple Proposal Optimization
7 HandObject Tracking Objectives
8 Experiments and Results
We evaluate and compare our method on more than 15 sequences spanning 3 public datasets, which have been recorded with 3 different RGB-D cameras (see Fig. 7). Additional live sequences (see Fig. 8 and supplementary materials) show that our method handles fast object and finger motion and difficult occlusions, and fares well even if two hands are present in the scene. Our method supports commodity RGB-D sensors like the Creative Senz3D, Intel RealSense F200, and Primesense Carmine. We rescale depth and color to resolutions of \(320\times 240\) and \(640\times 480\), respectively, and capture at 30 Hz. Furthermore, we introduce a new hand-object tracking benchmark dataset with ground truth fingertip and object annotations.
Comparison to the State-of-the-Art. We quantitatively and qualitatively evaluate on two publicly available hand-object datasets [37, 38] (see Fig. 8 and also supplementary material). Only one dataset (IJCV [37]) contains ground truth joint annotations. We test on 5 rigid-object sequences from IJCV. We track the right hand only, but our method works even when multiple hands are present. Ground truth annotations are provided for 2D joint positions, but not object pose. Our method achieves a fingertip pixel error of 8.6 px, which is comparable (a difference of only 2 px) to that reported for the slower method of [37]. This small difference is well within the uncertainty of manual annotation and sensor noise. Note that our approach runs over 60 times faster, while producing visual results that are on par (see Fig. 8). We also track the dataset of [38] (see also Fig. 8). While they solve a different problem (offline in-hand scanning), this shows that our real-time method copes well with differently shaped objects (e.g. bowling pin, bottle, etc.) under occlusion.
Average error (mm) for hand and object tracking on our dataset:

              Rigid  Rotate  Occlusion  Grasp1  Grasp2  Pinch  Overall
Fingertips    14.2   16.3    17.5       18.1    17.5    10.3   15.6
Object        13.5   26.8    11.9       15.3    15.7    13.9   16.2
Combined (E)  14.1   18.0    16.4       17.6    17.2    10.9   15.7
Secondly, we show that the average error on our hand-object dataset is worse without viewpoint selection, semantic alignment, occlusion handling, and the contact points term. Figure 6 shows a consistency plot with different components of the energy disabled. Using only the data term often results in large errors; the errors are even larger without viewpoint selection. Semantic alignment, occlusion handling, and contact points help improve the robustness of the tracking results and recovery from failures. Figure 5 shows that [27] clearly fails when fingers are occluded. Our hand-object specific terms are more robust in these difficult occlusion cases while achieving real-time performance.
Limitations. Although we demonstrated robustness against reasonable occlusions, situations where a high fraction of the hand is occluded for a long period remain challenging. This is mostly due to degraded classification performance under such occlusions. Misalignments can appear if the underlying assumption of the occlusion heuristic is violated, i.e. occluded parts do not move rigidly. Fortunately, our discriminative classification strategy enables the pose optimization to recover once previously occluded regions become visible again, as shown in Fig. 9. Further research has to focus on better priors for occluded regions, for example grasp and interaction priors learned from data. Improvements to hand part classification using different learning approaches, or the regression of dense correspondences, are also interesting topics for future work. Another source of error is very fast motion. While the current implementation achieves 30 Hz, higher frame rate sensors in combination with a faster pose optimization would lead to higher robustness due to improved temporal coherence. We show diverse object shapes being tracked; however, increasing object complexity (shape and color) affects runtime performance. We would like to further explore how multiple complex objects and hands can be tracked.
9 Conclusion
We have presented the first real-time approach for simultaneous hand-object tracking based on a single commodity depth sensor. Our approach combines the strengths of discriminative classification and generative pose optimization. Classification is based on a multi-layer forest architecture with viewpoint selection. We use 3D articulated Gaussian mixture alignment tailored for hand-object tracking, along with novel analytic occlusion and contact handling constraints that enable successful tracking of challenging hand-object interactions based on multiple proposals. Our qualitative and quantitative results demonstrate that our approach is both accurate and robust. Additionally, we have captured a new benchmark dataset (with hand and object annotations) and make it publicly available. We believe that future research will significantly benefit from this.
Acknowledgments
This research was funded by the ERC Starting Grant projects CapReal (335545) and COMPUTED (637991), and the Academy of Finland. We would like to thank Christian Richardt.
References
1. Leap Motion. https://www.leapmotion.com/
2. NimbleVR. http://nimblevr.com/
3. Athitsos, V., Sclaroff, S.: Estimating 3D hand pose from a cluttered image. In: Proceedings of IEEE CVPR, pp. 432–442 (2003)
4. Badami, I., Stückler, J., Behnke, S.: Depth-enhanced Hough forests for object-class detection and continuous pose estimation. In: Workshop on Semantic Perception, Mapping and Exploration (SPME) (2013)
5. Ballan, L., Taneja, A., Gall, J., Van Gool, L., Pollefeys, M.: Motion capture of hands in action using discriminative salient points. In: ECCV 2012. LNCS, vol. 7577, pp. 640–653. Springer, Heidelberg (2012). doi:10.1007/978-3-642-33783-3_46
6. Bray, M., Koller-Meier, E., Van Gool, L.: Smart particle filtering for 3D hand tracking. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 675–680 (2004)
7. Campbell, D., Petersson, L.: GOGMA: globally-optimal Gaussian mixture alignment (2016). arXiv preprint arXiv:1603.00150
8. Hamer, H., Schindler, K., Koller-Meier, E., Van Gool, L.: Tracking a hand manipulating an object. In: Proceedings of IEEE ICCV, pp. 1475–1482 (2009)
9. Heap, T., Hogg, D.: Towards 3D hand tracking using a deformable model. In: Proceedings of the International Conference on Automatic Face and Gesture Recognition, pp. 140–145, October 1996
10. Jian, B., Vemuri, B.C.: Robust point set registration using Gaussian mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1633–1645 (2011)
11. Keskin, C., Kıraç, F., Kara, Y.E., Akarun, L.: Real time hand pose estimation using depth sensors. In: ICCV Workshops, pp. 1228–1234. IEEE (2011). http://dblp.uni-trier.de/db/conf/iccvw/iccvw2011.html#KeskinKKA11
12. Kurmankhojayev, D., Hasler, N., Theobalt, C.: Monocular pose capture with a depth camera using a sums-of-Gaussians body model. In: GCPR 2013. LNCS, vol. 8142, pp. 415–424. Springer, Heidelberg (2013)
13. Kyriazis, N., Argyros, A.: Physically plausible 3D scene tracking: the single actor hypothesis. In: Proceedings of IEEE CVPR, pp. 9–16 (2013)
14. Kyriazis, N., Argyros, A.: Scalable 3D tracking of multiple interacting objects. In: Proceedings of IEEE CVPR, pp. 3430–3437, June 2014
15. de La Gorce, M., Fleet, D., Paragios, N.: Model-based 3D hand pose estimation from monocular video. IEEE TPAMI 33(9), 1793–1805 (2011)
16. Melax, S., Keselman, L., Orsten, S.: Dynamics based 3D skeletal hand tracking. In: Proceedings of GI, pp. 63–70 (2013)
17. Oikonomidis, I., Kyriazis, N., Argyros, A.: Full DOF tracking of a hand interacting with an object by modeling occlusions and physical constraints. In: Proceedings of IEEE ICCV, pp. 2088–2095 (2011)
18. Oikonomidis, I., Kyriazis, N., Argyros, A.: Tracking the articulated motion of two strongly interacting hands. In: Proceedings of IEEE CVPR, pp. 1862–1869 (2012)
19. Oikonomidis, I., Kyriazis, N., Argyros, A.A.: Efficient model-based 3D tracking of hand articulations using Kinect. In: Proceedings of BMVC, pp. 1–11 (2011)
20. Panteleris, P., Kyriazis, N., Argyros, A.A.: 3D tracking of human hands in interaction with unknown objects. In: Proceedings of BMVC (2015). https://dx.doi.org/10.5244/C.29.123
21. Pham, T.H., Kheddar, A., Qammaz, A., Argyros, A.A.: Towards force sensing from vision: observing hand-object interactions to infer manipulation forces. In: Proceedings of IEEE CVPR (2015)
22. Plankers, R., Fua, P.: Articulated soft objects for multi-view shape and motion capture. IEEE TPAMI 25(9), 1182–1187 (2003). http://dx.doi.org/10.1109/TPAMI.2003.1227995
23. Qian, C., Sun, X., Wei, Y., Tang, X., Sun, J.: Realtime and robust hand tracking from depth. In: Proceedings of IEEE CVPR (2014)
24. Romero, J., Kjellström, H., Kragic, D.: Hands in action: real-time 3D reconstruction of hands in interaction with objects. In: Proceedings of ICRA, pp. 458–463 (2010)
25. Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., Izadi, S.: Accurate, robust, and flexible real-time hand tracking. In: Proceedings of ACM CHI (2015)
26. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: Proceedings of IEEE CVPR, pp. 1297–1304 (2011). http://dx.doi.org/10.1109/CVPR.2011.5995316
27. Sridhar, S., Mueller, F., Oulasvirta, A., Theobalt, C.: Fast and robust hand tracking using detection-guided optimization. In: Proceedings of IEEE CVPR (2015). http://handtracker.mpi-inf.mpg.de/projects/FastHandTracker/
28. Sridhar, S., Oulasvirta, A., Theobalt, C.: Interactive markerless articulated hand motion tracking using RGB and depth data. In: Proceedings of IEEE ICCV (2013)
29. Stenger, B., Mendonça, P.R., Cipolla, R.: Model-based 3D tracking of an articulated hand. In: Proceedings of IEEE CVPR 2001, vol. 2, pp. II-310 (2001)
30. Stoll, C., Hasler, N., Gall, J., Seidel, H., Theobalt, C.: Fast articulated motion tracking using a sums of Gaussians body model. In: Proceedings of IEEE ICCV, pp. 951–958 (2011)
31. Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: Proceedings of IEEE CVPR (2015)
32. Tagliasacchi, A., Schröder, M., Tkach, A., Bouaziz, S., Botsch, M., Pauly, M.: Robust articulated-ICP for real-time hand tracking. In: Computer Graphics Forum (Proceedings of SGP), vol. 34, no. 5 (2015)
33. Tang, D., Chang, H.J., Tejani, A., Kim, T.: Latent regression forest: structured estimation of 3D articulated hand posture. In: Proceedings of IEEE CVPR, pp. 3786–3793 (2014). http://dx.doi.org/10.1109/CVPR.2014.490
34. Tang, D., Taylor, J., Kim, T.K.: Opening the black box: hierarchical sampling optimization for estimating human hand pose. In: Proceedings of IEEE ICCV (2015)
35. Tejani, A., Tang, D., Kouskouridas, R., Kim, T.K.: Latent-class Hough forests for 3D object detection and pose estimation. In: ECCV 2014. LNCS, vol. 8694, pp. 462–477. Springer, Heidelberg (2014). doi:10.1007/978-3-319-10599-4_30
36. Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM TOG 33(5), 169:1–169:10 (2014)
37. Tzionas, D., Ballan, L., Srikantha, A., Aponte, P., Pollefeys, M., Gall, J.: Capturing hands in action using discriminative salient points and physics simulation. IJCV 118, 172–193 (2016)
38. Tzionas, D., Gall, J.: 3D object reconstruction from hand-object interactions. In: Proceedings of IEEE ICCV (2015)
39. Tzionas, D., Srikantha, A., Aponte, P., Gall, J.: Capturing hand motion with an RGB-D sensor, fusing a generative model with salient points. In: GCPR 2014. LNCS, vol. 8753, pp. 277–289. Springer, Heidelberg (2014). doi:10.1007/978-3-319-11752-2_22
40. Wang, R., Paris, S., Popović, J.: 6D hands: markerless hand-tracking for computer aided design. In: Proceedings of ACM UIST, pp. 549–558 (2011)
41. Wang, Y., Min, J., Zhang, J., Liu, Y., Xu, F., Dai, Q., Chai, J.: Video-based hand manipulation capture through composite motion control. ACM TOG 32(4), 43:1–43:14 (2013)
42. Wu, Y., Huang, T.: View-independent recognition of hand postures. In: Proceedings of IEEE CVPR, pp. 88–94 (2000)
43. Xu, C., Cheng, L.: Efficient hand pose estimation from a single depth image. In: Proceedings of IEEE ICCV (2013)
44. Ye, M., Yang, R.: Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In: Proceedings of IEEE CVPR, pp. 2353–2360, June 2014
45. Zollhöfer, M., Nießner, M., Izadi, S., Rehmann, C., Zach, C., Fisher, M., Wu, C., Fitzgibbon, A., Loop, C., Theobalt, C., Stamminger, M.: Real-time non-rigid reconstruction using an RGB-D camera. ACM TOG 33(4), 156 (2014)