Human action recognition in videos with articulated pose information by deep networks

  • M. Farrajota
  • João M. F. Rodrigues
  • J. M. H. du Buf
Original Article


Action recognition is central to understanding human motion in video. It is an important topic in computer vision because of its many applications, such as video surveillance, human–machine interaction and video retrieval. One key problem is to automatically recognize low-level actions and high-level activities of interest. This paper proposes a way to cope with low-level actions by combining information about human body joints to aid action recognition. This is achieved by using high-level features computed by a convolutional neural network pre-trained on ImageNet, together with articulated body joints as low-level features. These features are then fed to a Long Short-Term Memory network to learn the temporal dependencies of an action. For pose prediction, we focus on articulated relations between body joints. We employ a series of residual auto-encoders to produce multiple predictions, which are then combined to provide a likelihood map of body joints. In the network topology, features are processed across all scales to capture the various spatial relationships associated with the body. Repeated bottom-up and top-down processing is applied, with intermediate supervision of each auto-encoder network. We demonstrate state-of-the-art results on the popular FLIC, LSP and UCF Sports datasets.
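As a rough illustration of the data flow described in the abstract, the sketch below fuses several per-stage joint predictions into one likelihood map, reads off the joint position, and concatenates it with appearance features to form the per-frame vector that a recurrent network would consume. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract only says the predictions are "combined", so plain averaging is an assumption, and the names `fuse_heatmaps`, `joint_location` and `frame_feature` are hypothetical. The CNN and LSTM themselves are omitted.

```python
from typing import List, Tuple

# One 2D likelihood map (rows x cols) for a single body joint.
Heatmap = List[List[float]]


def fuse_heatmaps(stage_maps: List[Heatmap]) -> Heatmap:
    """Combine the per-stage maps for one joint by averaging.

    Averaging is an assumption; the source only states that the
    predictions of the auto-encoder series are combined.
    """
    n = len(stage_maps)
    rows, cols = len(stage_maps[0]), len(stage_maps[0][0])
    return [
        [sum(m[r][c] for m in stage_maps) / n for c in range(cols)]
        for r in range(rows)
    ]


def joint_location(heatmap: Heatmap) -> Tuple[float, float]:
    """Return the (x, y) position of the map's peak, normalized to [0, 1]."""
    rows, cols = len(heatmap), len(heatmap[0])
    # Ties are broken in favor of the largest (row, col) index.
    _, r, c = max(
        (v, r, c) for r, row in enumerate(heatmap) for c, v in enumerate(row)
    )
    return (c / max(cols - 1, 1), r / max(rows - 1, 1))


def frame_feature(
    appearance: List[float], joints: List[Tuple[float, float]]
) -> List[float]:
    """Concatenate appearance features with flattened joint coordinates."""
    return list(appearance) + [coord for xy in joints for coord in xy]


# Two hypothetical stage predictions for a single joint on a 2 x 3 grid.
stages = [
    [[0.0, 0.8, 0.1], [0.0, 0.1, 0.0]],
    [[0.1, 0.6, 0.1], [0.0, 0.3, 0.0]],
]
fused = fuse_heatmaps(stages)
location = joint_location(fused)
# Hypothetical 2-D appearance feature standing in for a CNN descriptor.
feature = frame_feature([0.5, 0.25], [location])
```

In a full pipeline, one such feature vector per frame would form the input sequence of the recurrent network; the sequence length and feature sizes used here are placeholders.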


Keywords: Human action, Human pose, ConvNet, Neural networks, Auto-encoders, LSTM



This work was supported by the FCT Project LARSyS (UID/EEA/50009/2013) and FCT Ph.D. Grant to author MF (SFRH/BD/79812/2011).



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2018

Authors and Affiliations

  1. Vision Laboratory, LARSyS, University of the Algarve, Faro, Portugal
