International Journal of Computer Vision, Volume 118, Issue 2, pp 172–193

Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation

  • Dimitrios Tzionas
  • Luca Ballan
  • Abhilash Srikantha
  • Pablo Aponte
  • Marc Pollefeys
  • Juergen Gall


Hand motion capture is a popular research field that has recently gained more attention due to the ubiquity of RGB-D sensors. However, even the most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error, and with collision detection and physics simulation to achieve physically plausible estimates even in the case of occlusions and missing visual data. Since all components are unified in a single objective function that is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as for setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
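The idea of folding several cues into one almost-everywhere differentiable objective can be illustrated with a deliberately tiny sketch. This is not the energy used in the paper: the 1-D "pose", the three terms, the weights, and all function names (`objective`, `numerical_gradient`, `minimize`, `demo`) are assumptions made for illustration, the physics simulation component is not modelled, and plain gradient descent stands in for whatever standard optimizer one would actually choose.

```python
def objective(theta, observed, salient, min_gap, weights):
    """Weighted sum of three toy energy terms over a 1-D 'pose' theta."""
    w_data, w_salient, w_collision = weights
    # Data term: squared residual to each observation; None marks an occluded point.
    data = sum((t - o) ** 2 for t, o in zip(theta, observed) if o is not None)
    # Salient-point term: squared residual to detections, given as {index: position}.
    sal = sum((theta[i] - s) ** 2 for i, s in salient.items())
    # Collision term: squared hinge penalty when two points are closer than min_gap.
    # The hinge is differentiable everywhere except on the contact boundary.
    col = 0.0
    for i in range(len(theta)):
        for j in range(i + 1, len(theta)):
            overlap = max(0.0, min_gap - abs(theta[i] - theta[j]))
            col += overlap ** 2
    return w_data * data + w_salient * sal + w_collision * col


def numerical_gradient(f, theta, eps=1e-6):
    """Central-difference gradient of f at theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += eps
        down = list(theta)
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def minimize(f, theta0, step=0.02, iters=2000):
    """Fixed-step gradient descent; a stand-in for any standard optimizer."""
    theta = list(theta0)
    for _ in range(iters):
        g = numerical_gradient(f, theta)
        theta = [t - step * gk for t, gk in zip(theta, g)]
    return theta


def demo():
    # Point 0 is observed at 0.0; point 1 is occluded and has no salient detection.
    observed = [0.0, None]
    salient = {}
    f = lambda th: objective(th, observed, salient, min_gap=1.0,
                             weights=(1.0, 1.0, 10.0))
    # The initial guess places the two points only 0.3 apart (interpenetrating).
    return minimize(f, [0.0, 0.3])


if __name__ == "__main__":
    print(demo())
```

In this toy setup the occluded point receives no data or salient-point support, so only the collision term moves it: the optimizer pushes it away until the two points are roughly `min_gap` apart while the observed point stays near its measurement. This mirrors, in a very reduced form, how a collision term can keep estimates plausible when visual data is missing.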


Hand motion capture · Hand–object interaction · Fingertip detection · Physics simulation



Financial support was provided by the DFG Emmy Noether program (GA 1927/1-1).



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Dimitrios Tzionas (1, 2; corresponding author)
  • Luca Ballan (3)
  • Abhilash Srikantha (1, 2)
  • Pablo Aponte (1)
  • Marc Pollefeys (3)
  • Juergen Gall (1)

  1. Institute of Computer Science III, University of Bonn, Bonn, Germany
  2. Perceiving Systems Department, Max Planck Institute for Intelligent Systems, Tübingen, Germany
  3. Institute for Visual Computing, ETH Zurich, Universitätstraße 6, Zurich, Switzerland
