Towards Viewpoint Invariant 3D Human Pose Estimation

  • Albert Haque
  • Boya Peng
  • Zelun Luo
  • Alexandre Alahi
  • Serena Yeung
  • Li Fei-Fei
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9905)


We propose a viewpoint invariant model for 3D human pose estimation from a single depth image. To achieve this, our discriminative model embeds local regions into a learned viewpoint invariant feature space. Formulated as a multi-task learning problem, our model is able to selectively predict partial poses in the presence of noise and occlusion. Our approach leverages a convolutional and recurrent network architecture with a top-down error feedback mechanism to self-correct previous pose estimates in an end-to-end manner. We evaluate our model on a previously published depth dataset and a newly collected human pose dataset containing 100K annotated depth images from extreme viewpoints. Experiments show that our model achieves competitive performance on frontal views while achieving state-of-the-art performance on alternate viewpoints.




Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Albert Haque (1)
  • Boya Peng (1)
  • Zelun Luo (1)
  • Alexandre Alahi (1)
  • Serena Yeung (1)
  • Li Fei-Fei (1)

  1. Stanford University, Stanford, USA
