Real-Time Joint Tracking of a Hand Manipulating an Object from RGB-D Input

  • Srinath Sridhar
  • Franziska Mueller
  • Michael Zollhöfer
  • Dan Casas
  • Antti Oulasvirta
  • Christian Theobalt
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9906)


Real-time tracking of hands that manipulate and interact with external objects has many potential applications in augmented reality, tangible computing, and wearable computing. However, because of difficult occlusions, fast motions, and uniform hand appearance, jointly tracking hand and object pose is more challenging than tracking either one separately. Many previous approaches resort to complex multi-camera setups to remedy the occlusion problem, and they often employ expensive segmentation and optimization steps that make real-time tracking impossible. In this paper, we propose a real-time solution that uses a single commodity RGB-D camera. The core of our approach is a 3D articulated Gaussian mixture alignment strategy, tailored to hand-object tracking, that allows fast pose optimization. The alignment energy uses novel regularizers to address occlusions and hand-object contacts. For added robustness, we guide the optimization with discriminative part classification of the hand and segmentation of the object. We conduct extensive experiments on several existing datasets and introduce a new annotated hand-object dataset. Quantitative and qualitative results show the key advantages of our method: speed, accuracy, and robustness.
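To give a rough intuition for Gaussian mixture alignment in general, the sketch below scores how well a model mixture overlaps a data mixture using the closed-form integral of the product of two isotropic 3D Gaussians. It is not the paper's energy: the function names, the (mean, sigma, weight) tuple layout, and the translation-only motion are all illustrative assumptions, and the articulated pose parameters, regularizers, and classification guidance described in the abstract are left out.

```python
import math

def gaussian_overlap(mu_a, sigma_a, mu_b, sigma_b):
    """Integral over R^3 of the product of two unnormalized isotropic Gaussians
    exp(-||x - mu||^2 / (2 * sigma^2)). Hypothetical helper for illustration."""
    var_sum = sigma_a ** 2 + sigma_b ** 2
    dist_sq = sum((p - q) ** 2 for p, q in zip(mu_a, mu_b))
    scale = (2.0 * math.pi * sigma_a ** 2 * sigma_b ** 2 / var_sum) ** 1.5
    return scale * math.exp(-dist_sq / (2.0 * var_sum))

def mixture_overlap(model, data):
    """Weighted sum of pairwise overlaps between two mixtures.
    Each mixture is a list of (mean, sigma, weight) tuples (assumed layout)."""
    return sum(
        w_a * w_b * gaussian_overlap(mu_a, s_a, mu_b, s_b)
        for mu_a, s_a, w_a in model
        for mu_b, s_b, w_b in data
    )

def alignment_energy(model, data, translation):
    """Negative overlap after translating the model; lower means better aligned.
    Translation-only motion is a simplification chosen for this sketch."""
    moved = [
        (tuple(m + t for m, t in zip(mu, translation)), sigma, weight)
        for mu, sigma, weight in model
    ]
    return -mixture_overlap(moved, data)

# Toy example: the data mixture is the model mixture shifted by one unit along x.
model = [((0.0, 0.0, 0.0), 1.0, 1.0), ((2.0, 0.0, 0.0), 1.0, 1.0)]
data = [((1.0, 0.0, 0.0), 1.0, 1.0), ((3.0, 0.0, 0.0), 1.0, 1.0)]

energy_aligned = alignment_energy(model, data, (1.0, 0.0, 0.0))
energy_unaligned = alignment_energy(model, data, (0.0, 0.0, 0.0))
```

In this toy case the energy at the true shift is lower than at zero shift. Because the overlap is a smooth, closed-form function of the Gaussian means, such an energy can be differentiated analytically with respect to motion parameters, which is one reason mixture-based alignment lends itself to fast optimization.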


Keywords: Random Forest · Augmented Reality · Gaussian Mixture Model · Leap Motion · Part Label



This research was funded by the ERC Starting Grant projects CapReal (335545) and COMPUTED (637991), and the Academy of Finland. We would like to thank Christian Richardt.

Supplementary material

Supplementary material 1 (mp4, 23187 KB)
Supplementary material 2 (pdf, 2163 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Srinath Sridhar (1)
  • Franziska Mueller (1)
  • Michael Zollhöfer (1)
  • Dan Casas (1)
  • Antti Oulasvirta (2)
  • Christian Theobalt (1)
  1. Max Planck Institute for Informatics, Saarbrücken, Germany
  2. Aalto University, Espoo, Finland
