Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation

  • Qi Ye
  • Shanxin Yuan
  • Tae-Kyun Kim
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9912)


Discriminative methods often generate kinematically implausible hand poses; in a hybrid method, generative methods are then used to correct (or verify) these results. Estimating 3D hand pose in a hierarchy, where the high-dimensional output space is decomposed into smaller ones, has been shown to be effective. Existing hierarchical methods focus mainly on decomposing the output space, while the input space remains almost the same along the hierarchy. In this paper, a hybrid hand pose estimation method is proposed that applies the kinematic hierarchy strategy in two places: to the input space (as well as the output space) of the discriminative method, through a spatial attention mechanism, and to the optimization of the generative method, through hierarchical Particle Swarm Optimization (PSO). The spatial attention mechanism integrates cascaded and hierarchical regression into a CNN framework by transforming the input (and feature space) as well as the output space, which greatly reduces viewpoint and articulation variations. Between the levels of the hierarchy, the hierarchical PSO enforces kinematic constraints on the results of the CNNs. The experimental results show that our method significantly outperforms four state-of-the-art methods and three baselines on three public benchmarks.
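For readers unfamiliar with PSO, the following is a minimal sketch of the generic algorithm with an inertia weight, not the paper's hierarchical or partial variant. The function name `pso_minimize`, its parameters, and the toy quadratic cost in the usage note are all assumptions made for illustration; in a hand pose setting the cost would instead score how well a candidate pose agrees with the observation and with the kinematic constraints.

```python
import numpy as np


def pso_minimize(cost, lower, upper, n_particles=30, n_iters=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Generic PSO over a box [lower, upper] (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    # Initialise particle positions uniformly in the box, with zero velocity.
    pos = rng.uniform(lower, upper, size=(n_particles, dim))
    vel = np.zeros_like(pos)

    # Each particle remembers its own best position so far.
    best_pos = pos.copy()
    best_cost = np.array([cost(p) for p in pos])

    # The swarm shares a single global best.
    g_idx = int(np.argmin(best_cost))
    g_pos, g_cost = best_pos[g_idx].copy(), float(best_cost[g_idx])

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Velocity update: inertia + pull to personal best + pull to global best.
        vel = (inertia * vel
               + c_personal * r1 * (best_pos - pos)
               + c_global * r2 * (g_pos - pos))
        # Keep particles inside the search box.
        pos = np.clip(pos + vel, lower, upper)

        costs = np.array([cost(p) for p in pos])
        improved = costs < best_cost
        best_pos[improved] = pos[improved]
        best_cost[improved] = costs[improved]

        g_idx = int(np.argmin(best_cost))
        if best_cost[g_idx] < g_cost:
            g_pos, g_cost = best_pos[g_idx].copy(), float(best_cost[g_idx])

    return g_pos, g_cost
```

As a usage example under the same assumptions, `pso_minimize(lambda x: float(np.sum((x - target) ** 2)), [-1, -1, -1], [1, 1, 1])` searches a three-dimensional box for a hypothetical `target` vector and returns the best position found together with its cost.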


Keywords: Hierarchical hand pose estimation; Particle Swarm Optimization; Convolutional neural network; Iterative refinement; Spatial attention; Hybrid method; Kinematic constraints

Supplementary material

Supplementary material 1 (mp4 18182 KB)

Supplementary material 2 (pdf 920 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Department of Electrical and Electronic Engineering, Imperial College London, London, UK
