
International Journal of Computer Vision, Volume 126, Issue 11, pp 1180–1198

Depth-Based Hand Pose Estimation: Methods, Data, and Challenges

  • James Steven Supančič III
  • Grégory Rogez
  • Yi Yang
  • Jamie Shotton
  • Deva Ramanan

Abstract

Hand pose estimation has matured rapidly in recent years. The introduction of commodity depth sensors and a multitude of practical applications have spurred new advances. We provide an extensive analysis of the state of the art, focusing on hand pose estimation from a single depth frame. To do so, we have implemented a considerable number of systems, and have released software and evaluation code. We summarize important conclusions here: (1) Coarse pose estimation appears viable for scenes with isolated hands. However, high-precision pose estimation (required for immersive virtual reality) and cluttered scenes (where hands may be interacting with nearby objects and surfaces) remain a challenge. To spur further progress, we introduce a challenging new dataset with diverse, cluttered scenes. (2) Many methods evaluate themselves with disparate criteria, making comparisons difficult. We define a consistent evaluation criterion, rigorously motivated by human experiments. (3) We introduce a simple nearest-neighbor baseline that outperforms most existing systems. This implies that most systems do not generalize beyond their training sets; it also reinforces the under-appreciated point that training data is as important as the model itself. We conclude with directions for future progress.
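To make conclusions (2) and (3) concrete, the following is a minimal sketch (in Python/NumPy) of a 1-nearest-neighbor pose baseline paired with a max-joint-error evaluation criterion, under which a frame counts as correct only when every predicted joint lies within a fixed distance of its ground-truth position. This is an illustration rather than the authors' released code: the 64-D feature vectors, the 21-joint layout, and the 20 mm threshold are assumptions made for exposition.

```python
import numpy as np

def nn_pose_baseline(train_feats, train_poses, test_feats):
    """For each test feature vector, return the pose of the nearest
    training example under Euclidean distance (1-NN lookup)."""
    preds = []
    for f in test_feats:
        idx = np.argmin(np.linalg.norm(train_feats - f, axis=1))
        preds.append(train_poses[idx])
    return np.stack(preds)

def prop_frames_correct(pred, gt, thresh_mm=20.0):
    """Proportion of frames whose worst (max) joint error is below
    thresh_mm: a frame passes only if *every* joint is accurate."""
    err = np.linalg.norm(pred - gt, axis=-1)  # (frames, joints)
    return float(np.mean(err.max(axis=1) < thresh_mm))

# Toy usage with synthetic data: 500 training and 50 test "frames",
# 64-D depth features and 21 joints in 3-D (millimeters).
rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 64))
train_poses = rng.normal(scale=50.0, size=(500, 21, 3))
test_feats = train_feats[:50] + rng.normal(scale=0.01, size=(50, 64))
pred = nn_pose_baseline(train_feats, train_poses, test_feats)
print(prop_frames_correct(pred, train_poses[:50]))  # ~1.0 here
```

Because such a baseline can only return poses it has already seen, its strong performance relative to published systems supports the abstract's point: test poses in existing benchmarks are often near-duplicates of training poses.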

Keywords

Hand pose · RGB-D sensor · Datasets · Benchmarking


Acknowledgements

National Science Foundation Grant 0954083, Office of Naval Research-MURI Grant N00014-10-1-0933, and the Intel Science and Technology Center-Visual Computing supported JS&DR. The European Commission FP7 Marie Curie IOF grant “Egovision4Health” (PIOF-GA-2012-328288) supported GR.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. University of California, Irvine, USA
  2. Univ. Grenoble Alpes, Inria, CNRS, Grenoble, France
  3. Institute of Engineering Univ. Grenoble Alpes, France
  4. Baidu Institute of Deep Learning, Sunnyvale, USA
  5. Microsoft Research, Cambridge, UK
  6. Carnegie Mellon University, Pittsburgh, USA
