Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation


Hand motion capture is a popular research field, recently gaining more attention due to the ubiquity of RGB-D sensors. However, even most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error and with collision detection and physics simulation to achieve physically plausible estimates even in case of occlusions and missing visual data. Since all components are unified in a single objective function which is almost everywhere differentiable, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.

This is a preview of subscription content, log in to check access.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21


  1. 1.

    All annotated sequences are available at

  2. 2.

    All annotated sequences are available at

  3. 3.


  1. Aggarwal, A., Klawe, M. M., Moran, S., Shor, P., & Wilber, R. (1987). Geometric applications of a matrix-searching algorithm. Algorithmica, 2(1–4), 195–208.

    MathSciNet  Article  MATH  Google Scholar 

  2. Albrecht, I., Haber, J., & Seidel, H. P. (2003). Construction and animation of anatomically based human hand models. In: SCA (pp. 98–109).

  3. Athitsos, V., & Sclaroff, S. (2003). Estimating 3d hand pose from a cluttered image. In CVPR (pp 432–439).

  4. Ballan, L., & Cortelazzo, G. M. (2008). Marker-less motion capture of skinned models in a four camera set-up using optical flow and silhouettes. In 3DPVT.

  5. Ballan, L., Taneja, A., Gall, J., Van Gool, L., & Pollefeys, M. (2012) Motion capture of hands in action using discriminative salient points. In ECCV (pp. 640–653).

  6. Baran, I., & Popović, J. (2007). Automatic rigging and animation of 3d characters. TOG, 26(3).

  7. Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. PAMI, 24(4), 509–522.

    Article  Google Scholar 

  8. Bray, M., Koller-Meier, E., & Van Gool, L. (2007). Smart particle filtering for high-dimensional tracking. CVIU, 106(1), 116–129.

    Google Scholar 

  9. Bregler, C., Malik, J., & Pullen, K. (2004). Twist based acquisition and tracking of animal and human kinematics. IJCV, 56(3), 179–194.

    Article  Google Scholar 

  10. Brox, T., Rosenhahn, B., Gall, J., & Cremers, D. (2010). Combined region- and motion-based 3d tracking of rigid and articulated objects. PAMI, 32(3), 402–415.

    Article  Google Scholar 

  11. Canny, J. (1986). A computational approach to edge detection. PAMI, 8(6), 679–698.

    Article  Google Scholar 

  12. Chen, Y., & Medioni, G. (1991). Object modeling by registration of multiple range images. In ICRA.

  13. Coumans, E. (2013) Bullet real-time physics simulation.

  14. de Campos, T., & Murray, D. (2006). Regression-based hand pose estimation from multiple cameras. In CVPR.

  15. de La Gorce, M., Fleet, D. J., & Paragios, N. (2011). Model-based 3d hand pose estimation from monocular video. PAMI, 33(9), 1793–1805.

    Article  Google Scholar 

  16. Delamarre, Q., & Faugeras, O. D. (2001). 3d articulated models and multiview tracking with physical forces. CVIU, 81(3), 328–357.

  17. Ekvall, S., & Kragic, D. (2005). Grasp recognition for programming by demonstration. In ICRA (pp. 748–753).

  18. Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., & Twombly, X. (2007). Vision-based hand pose estimation: A review. CVIU, 108(1–2), 52–73.

    Google Scholar 

  19. Everingham, M., Van Gool, L., Williams, C., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. IJCV, 88(2), 303–338.

    Article  Google Scholar 

  20. Felzenszwalb, P. F., & Huttenlocher, D. P. (2004). Distance transforms of sampled functions. Cornell Computing and Information Science: Tech. rep.

  21. Gall, J., Fossati, A., & Van Gool, L. (2011a). Functional categorization of objects using real-time markerless motion capture. In CVPR (pp. 1969–1976).

  22. Gall, J., Yao, A., Razavi, N., Van Gool, L., & Lempitsky, V. (2011b). Hough forests for object detection, tracking, and action recognition. PAMI, 33(11), 2188–2202.

    Article  Google Scholar 

  23. Gärtner, B., & Schönherr, S. (2000). An efficient, exact, and generic quadratic programming solver for geometric optimization. In SCG ’00 (pp 110–118).

  24. Hamer, H., Gall, J., Weise, T., & Van Gool, L. (2010). An object-dependent hand pose prior from sparse training data. In CVPR (pp. 671–678).

  25. Hamer, H., Schindler, K., Koller-Meier, E., & Van Gool, L. (2009). Tracking a hand manipulating an object. In ICCV (pp. 1475–1482).

  26. Heap, T., & Hogg, D. (1996). Towards 3d hand tracking using a deformable model. In: FG (pp. 140–145).

  27. Holzer, S., Rusu, R., Dixon, M., Gedikli, S., & Navab, N. (2012). Adaptive neighborhood selection for real-time surface normal estimation from organized point cloud data using integral images. In: IROS (pp 2684–2689).

  28. Jones, M. J., & Rehg, J. M. (2002). Statistical color models with application to skin detection. IJCV, 46(1), 81–96.

    Article  MATH  Google Scholar 

  29. Keskin, C., Kra, F., Kara, Y., & Akarun, L. (2012). Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In ECCV.

  30. Kim, D., Hilliges, O., Izadi, S., Butler, A.D., Chen, J., Oikonomidis, I., & Olivier, P. (2012). Digits: Freehand 3d interactions anywhere using a wrist-worn gloveless sensor. In UIST (pp. 167–176).

  31. Kyriazis, N., & Argyros, A. (2013). Physically plausible 3d scene tracking: The single actor hypothesis. In CVPR (pp. 9–16).

  32. Kyriazis, N., & Argyros, A. (2014) Scalable 3d tracking of multiple interacting objects. In CVPR.

  33. Lewis, J. P., Cordner, M., & Fong, N. (2000). Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In SIGGRAPH.

  34. Lu, S., Metaxas, D., Samaras, D., & Oliensis, J. (2003). Using multiple cues for hand tracking and model refinement. In CVPR (pp. 443–450).

  35. MacCormick, J., & Isard, M. (2000) Partitioned sampling, articulated objects, and interface-quality hand tracking. In ECCV (pp. 3–19).

  36. Murray, R. M., Sastry, S. S., & Zexiang, L. (1994). A mathematical introduction to robotic manipulation.

  37. Oikonomidis, I., Kyriazis, N., & Argyros, A. (2011a). Efficient model-based 3d tracking of hand articulations using kinect. In BMVC (pp 101.1–101.11).

  38. Oikonomidis, I., Kyriazis, N., & Argyros, A. (2011b). Full dof tracking of a hand interacting with an object by modeling occlusions and physical constraints. In ICCV.

  39. Oikonomidis, I., Kyriazis, N., & Argyros, A. A. (2012). Tracking the articulated motion of two strongly interacting hands. In CVPR (pp 1862–1869).

  40. Oikonomidis, I., Lourakis, M. I., & Argyros, A. A. (2014). Evolutionary quasi-random search for hand articulations tracking. In CVPR.

  41. Paris, S., & Durand, F. (2009). A fast approximation of the bilateral filter using a signal processing approach. IJCV, 81(1), 24–52.

    Article  Google Scholar 

  42. Pons-Moll, G., & Rosenhahn, B. (2011). Model-based Pose estimation (pp. 139–170).

  43. Qian, C., Sun, X., Wei, Y., Tang, X., & Sun, J. (2014). Realtime and robust hand tracking from depth. In CVPR.

  44. Rehg, J. M., & Kanade, T. (1994). Visual tracking of high dof articulated structures: An application to human hand tracking. In ECCV (pp. 35–46).

  45. Rehg, J., & Kanade, T. (1995). Model-based tracking of self-occluding articulated objects. In ICCV (pp. 612–617).

  46. Romero, J., Kjellström, H., & Kragic, D. (2009). Monocular real-time 3d articulated hand pose estimation. In HUMANOIDS (pp. 87–92).

  47. Romero, J., Kjellström, H., & Kragic, D. (2010). Hands in action: Real-time 3d reconstruction of hands in interaction with objects. In ICRA (pp. 458–463).

  48. Rosales, R., Athitsos, V., Sigal, L., & Sclaroff, S. (2001). 3d hand pose reconstruction using specialized mappings. In ICCV (pp. 378–387).

  49. Rosenhahn, B., Brox, T., & Weickert, J. (2007). Three-dimensional shape knowledge for joint image segmentation and pose tracking. IJCV, 73(3), 243–262.

    Article  Google Scholar 

  50. Rusinkiewicz, S., & Levoy, M. (2001). Efficient variants of the icp algorithm. In 3DIM (pp 145–152).

  51. Rusinkiewicz, S., Hall-Holt, O., & Levoy, M. (2002). Real-time 3d model acquisition. TOG, 21(3), 438–446.

    Article  Google Scholar 

  52. Schmidt, T., Newcombe, R., & Fox, D. (2014). Dart: Dense articulated real-time tracking. In Proceedings of robotics: Science and systems, Berkeley, USA.

  53. Sharp, T., Keskin, C., Robertson, D., Taylor, J., Shotton, J., Kim, D., Rhemann, C., Leichter, I., Vinnikov, A., Wei, Y., Freedman, D., Kohli, P., Krupka, E., Fitzgibbon, A., & Izadi, S. (2015). Accurate, robust, and flexible real-time hand tracking. In CHI.

  54. Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR (pp. 1297–1304).

  55. Sridhar, S., Mueller, F., Oulasvirta, A., & Theobalt, C. (2015). Fast and robust hand tracking using detection-guided optimization. In: CVPR.

  56. Sridhar, S., Oulasvirta, A., & Theobalt, C. (2013). Interactive markerless articulated hand motion tracking using rgb and depth data. In ICCV (pp. 2456–2463).

  57. Sridhar, S., Rhodin, H., Seidel, H.P., Oulasvirta, A., & Theobalt, C. (2014). Real-time hand tracking using a sum of anisotropic gaussians model. In 3DV.

  58. Stenger, B., Mendonca, P., & Cipolla, R. (2001). Model-based 3D tracking of an articulated hand. In CVPR.

  59. Stolfi, J. (1991). Oriented projective geometry: A framework for geometric computation. Boston: Academic Press.

    Google Scholar 

  60. Sudderth, E., Mandel, M., Freeman, W., & Willsky, A. (2004) Visual hand tracking using nonparametric belief propagation. In Workshop on generative model based vision (pp. 189–189).

  61. Tang, D., Chang, H. J., Tejani, A., & Kim, T. K. (2014). Latent regression forest: Structured estimation of 3d articulated hand posture. In CVPR.

  62. Tang, D., Yu, T. H., & Kim, T. K. (2013). Real-time articulated hand pose estimation using semi-supervised transductive regression forests. In ICCV (pp. 3224–3231).

  63. Taylor, J., Stebbing, R., Ramakrishna, V., Keskin, C., Shotton, J., Izadi, S., Hertzmann, A., & Fitzgibbon. A. (2014). User-specific hand modeling from monocular depth sequences. In CVPR.

  64. Teschnerm, M., Kimmerle, S., Heidelberger, B., Zachmann, G., Raghupathi, L., Fuhrmann, A., Cani, M. P., Faure, F., Magnetat-Thalmann, N., & Strasser, W. (2004). Collision detection for deformable objects. In Eurographics.

  65. Thayananthan, A., Stenger, B., Torr, P. H. S., & Cipolla, R. (2003). Shape context and chamfer matching in cluttered scenes. In CVPR (pp. 127–133).

  66. Tompson, J., Stein, M., Lecun, Y., & Perlin, K. (2014). Real-time continuous pose recovery of human hands using convolutional networks. In TOG 33.

  67. Tzionas, D., & Gall, J. (2013). A comparison of directional distances for hand pose estimation. In GCPR.

  68. Tzionas, D., Srikantha, A., Aponte, P., & Gall, J. (2014). Capturing hand motion with an rgb-d sensor, fusing a generative model with salient points. In GCPR.

  69. Vaezi, M., & Nekouie, M. A. (2011). 3d human hand posture reconstruction using a single 2d image. IJHCI, 1(4), 83–94.

    Google Scholar 

  70. Wang, R. Y., & Popović, J. (2009). Real-time hand-tracking with a color glove. TOG, 28(3), 63:1–63:8.

    Google Scholar 

  71. Wu, Y., Lin, J., & Huang, T. (2001). Capturing natural hand articulation. In ICCV (pp. 426–432).

  72. Ye, M., Zhang, Q., Wang, L., Zhu, J., Yang, R., & Gall, J. (2013). A survey on human motion analysis from depth data. In Time-of-flight and depth imaging. sensors, algorithms, and applications (pp. 149–187).

Download references


Financial support was provided by the DFG Emmy Noether program (GA 1927/1-1).

Author information



Corresponding author

Correspondence to Dimitrios Tzionas.

Additional information

Communicated by Junsong Yuan, Wanqing Li, Zhengyou Zhang, David Fleet, Jamie Shotton.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Tzionas, D., Ballan, L., Srikantha, A. et al. Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation. Int J Comput Vis 118, 172–193 (2016).

Download citation


  • Hand motion capture
  • Hand–object interaction
  • Fingertip detection
  • Physics simulation