Human Pose Estimation in Space and Time Using 3D CNN

  • Agne Grinciunaite
  • Amogh Gudi
  • Emrah Tasli
  • Marten den Uyl
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9915)


This paper explores the ability of convolutional neural networks to handle a task that humans find easy: perceiving the 3D pose of a human body from varying viewing angles. Our approach is restricted to a monocular vision system. We apply a convolutional neural network to RGB videos and extend it to three-dimensional convolutions by encoding the time dimension of the video as the third dimension of the convolutional space, and we regress directly to human body joint positions in 3D coordinate space. The network achieves state-of-the-art performance on the selected Human3.6M dataset, demonstrating that temporal data can be represented successfully with an additional dimension in the convolution operation.
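The core idea in the abstract, treating time as a third convolutional axis and regressing straight to 3D joint coordinates, can be illustrated with a minimal pure-Python sketch. This is not the authors' network: the function names `conv3d_valid` and `regress_joints`, the single averaging kernel, the toy clip size, and the untrained placeholder weights are all assumptions made for illustration only.

```python
from itertools import product


def conv3d_valid(clip, kernel):
    """Valid (no padding, stride 1) 3D cross-correlation.

    clip:   nested list indexed [t][y][x], i.e. frames stacked along time
    kernel: nested list indexed [dt][dy][dx]
    """
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kT, kH, kW = len(kernel), len(kernel[0]), len(kernel[0][0])
    oT, oH, oW = T - kT + 1, H - kH + 1, W - kW + 1
    out = [[[0.0] * oW for _ in range(oH)] for _ in range(oT)]
    for t, y, x in product(range(oT), range(oH), range(oW)):
        # The kernel slides over time (dt) exactly as it slides over space.
        out[t][y][x] = sum(
            clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
            for dt, dy, dx in product(range(kT), range(kH), range(kW))
        )
    return out


def regress_joints(features, weights, bias, n_joints):
    """Flatten the feature volume, apply ReLU, and map linearly to
    n_joints (x, y, z) triples."""
    flat = [max(0.0, v) for frame in features for row in frame for v in row]
    raw = [sum(w * f for w, f in zip(w_row, flat)) + b
           for w_row, b in zip(weights, bias)]
    return [tuple(raw[3 * j: 3 * j + 3]) for j in range(n_joints)]


# Toy clip (assumption): 3 frames of 4x4 pixels, intensity equal to frame index.
clip = [[[float(t)] * 4 for _ in range(4)] for t in range(3)]
# 2x2x2 averaging kernel that spans two consecutive frames.
kernel = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]

features = conv3d_valid(clip, kernel)  # feature volume of shape 2 x 3 x 3
n_joints = 2
n_feat = 2 * 3 * 3
weights = [[0.01] * n_feat for _ in range(n_joints * 3)]  # placeholder, untrained
bias = [0.0] * (n_joints * 3)
joints = regress_joints(features, weights, bias, n_joints)
```

Because the kernel extends over two frames, each output value mixes information from adjacent time steps, which is how a 3D convolution can pick up motion cues that a per-frame 2D convolution cannot. A practical model would stack many such layers with learned kernels and train the regression weights against ground-truth joint positions.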


Keywords (machine-generated, not supplied by the authors): Recurrent Neural Network; Joint Position; Convolutional Neural Network; Joint Location; Convolutional Layer



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Agne Grinciunaite (1)
  • Amogh Gudi (2)
  • Emrah Tasli (3)
  • Marten den Uyl (2)

  1. Vilniaus Gedimino Technikos Univ., Vilnius, Lithuania
  2. VicarVision, Amsterdam, Netherlands
  3. Booking.com, Amsterdam, Netherlands
