Learning Markerless Human Pose Estimation from Multiple Viewpoint Video

  • Matthew Trumble
  • Andrew Gilbert
  • Adrian Hilton
  • John Collomosse
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9915)


We present a novel human performance capture technique capable of robustly estimating the pose (articulated joint positions) of a performer observed passively via multiple viewpoint video (MVV). An affine invariant pose descriptor is learned using a convolutional neural network (CNN) trained over volumetric data extracted from an MVV dataset of diverse human pose and appearance. A manifold embedding is learned via Gaussian Processes for the CNN descriptor and articulated pose spaces, enabling regression and hence estimation of human pose from MVV input. The learned descriptor and manifold are shown to generalise over a wide range of human poses, providing an efficient performance capture solution that requires no fiducials or other markers to be worn. The system is evaluated against ground truth joint configuration data from a commercial marker-based pose estimation system.
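The regression step from descriptor space to pose space can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Gaussian Process regression with a squared-exponential kernel in place of the learned manifold embedding, and the function names, the toy 2-D "descriptors", the 3-value "poses", and the `noise` and `length_scale` values are all assumptions made for illustration.

```python
import math

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two descriptor vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq_dist / length_scale ** 2)

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Build the augmented matrix [A | B]; copying keeps the inputs untouched.
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def gp_fit(descriptors, poses, noise=1e-6, length_scale=1.0):
    """Return alpha = (K + noise * I)^-1 Y for the training set."""
    n = len(descriptors)
    K = [[rbf_kernel(descriptors[i], descriptors[j], length_scale)
          + (noise if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, poses)

def gp_predict(descriptors, alpha, query, length_scale=1.0):
    """GP posterior mean: kernel-weighted sum of the alpha rows."""
    k = [rbf_kernel(query, d, length_scale) for d in descriptors]
    dims = len(alpha[0])
    return [sum(k[i] * alpha[i][d] for i in range(len(k)))
            for d in range(dims)]

# Hypothetical toy data: three 2-D descriptors, each paired with a
# flattened pose vector (here one joint with x, y, z coordinates).
train_descriptors = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
train_poses = [[0.0, 1.0, 0.0], [0.5, 1.0, 0.2], [-0.5, 0.9, 0.1]]

alpha = gp_fit(train_descriptors, train_poses)
estimate = gp_predict(train_descriptors, alpha, [1.0, 0.0])
print(estimate)
```

In a real pipeline the descriptors would come from the trained CNN and each pose vector would concatenate all joint positions; a zero-mean prior as used here also pulls predictions towards zero for queries far from the training data, so poses would normally be mean-centred first.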


Deep learning · Pose estimation · Multiple viewpoint video



The work was supported by the REFRAME project, InnovateUK grant agreement 101854. The Ballet dataset is courtesy of the EU FP7 RE@CT project.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Matthew Trumble (1)
  • Andrew Gilbert (1)
  • Adrian Hilton (1)
  • John Collomosse (1)
  1. Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, Guildford, UK
