Part-Activated Deep Reinforcement Learning for Action Prediction

  • Lei Chen
  • Jiwen LuEmail author
  • Zhanjie Song
  • Jie Zhou
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11207)


In this paper, we propose a part-activated deep reinforcement learning (PA-DRL) method for action prediction. Most existing methods for action prediction utilize the evolution of whole frames to model actions, which cannot avoid the noise of the current action, especially in the early prediction. Moreover, the loss of structural information of human body diminishes the capacity of features to describe actions. To address this, we design the PA-DRL to exploit the structure of the human body by extracting skeleton proposals under a deep reinforcement learning framework. Specifically, we extract features from different parts of the human body individually and activate the action-related parts in features to enhance the representation. Our method not only exploits the structure information of the human body, but also considers the saliency part for expressing actions. We evaluate our method on three popular action prediction datasets: UT-Interaction, BIT-Interaction and UCF101. Our experimental results demonstrate that our method achieves the performance with state-of-the-arts.


Action prediction Deep reinforcement learning Skeleton Part model 



This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700802, in part by the National Natural Science Foundation of China under Grant 61672306, Grant U1713214, Grant 61572271, Grant 91746107, in part by the Shenzhen Fundamental Research Fund (Subject Arrangement) under Grant JCYJ20170412170602564 and in part by the Natural Science Foundation of Tianjin under Grant 16JCYBJC15900.


  1. 1.
    Caicedo, J.C., Lazebnik, S.: Active object localization with deep reinforcement learning. In: ICCV, pp. 2488–2496 (2015)Google Scholar
  2. 2.
    Cao, Y., et al.: Recognize human activities from partially observed videos. In: CVPR, pp. 2658–2665 (2013)Google Scholar
  3. 3.
    Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2D pose estimation using part affinity fields. In: CVPR, pp. 7291–7299 (2017)Google Scholar
  4. 4.
    Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the kinetics dataset. In: CVPR, pp. 6299–6308 (2017)Google Scholar
  5. 5.
    Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: CVPR, pp. 1110–1118 (2015)Google Scholar
  6. 6.
    Goodrich, B., Arel, I.: Reinforcement learning based visual attention with application to face detection. In: CVPRW, pp. 19–24 (2012)Google Scholar
  7. 7.
    Hinami, R., Mei, T., Satoh, S.: Joint detection and recounting of abnormal events by learning deep generic knowledge. In: ICCV, pp. 3619–3627 (2017)Google Scholar
  8. 8.
    Hu, J.-F., Zheng, W.-S., Ma, L., Wang, G., Lai, J.: Real-time RGB-D activity prediction by soft regression. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 280–296. Springer, Cham (2016). Scholar
  9. 9.
    Huang, C., Lucey, S., Ramanan, D.: Learning policies for adaptive tracking with deep feature cascades. In: ICCV, pp. 105–114 (2017)Google Scholar
  10. 10.
    Jie, Z., Liang, X., Feng, J., Jin, X., Lu, W., Yan, S.: Tree-structured reinforcement learning for sequential object localization. In: NIPS, pp. 127–135 (2016)Google Scholar
  11. 11.
    Karayev, S., Baumgartner, T., Fritz, M., Darrell, T.: Timely object recognition. In: NIPS, pp. 890–898 (2012)Google Scholar
  12. 12.
    Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (2015)Google Scholar
  13. 13.
    Kong, X., Xin, B., Wang, Y., Hua, G.: Collaborative deep reinforcement learning for joint object search. In: CVPR, pp. 1695–1704 (2017)Google Scholar
  14. 14.
    Kong, Y., Fu, Y.: Max-margin action prediction machine. TPAMI 38(9), 1844–1858 (2016)CrossRefGoogle Scholar
  15. 15.
    Kong, Y., Jia, Y., Fu, Y.: Learning human interaction by interactive phrases. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7572, pp. 300–313. Springer, Heidelberg (2012). Scholar
  16. 16.
    Kong, Y., Kit, D., Fu, Y.: A discriminative model with multiple temporal scales for action prediction. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 596–611. Springer, Cham (2014). Scholar
  17. 17.
    Kong, Y., Tao, Z., Fu, Y.: Deep sequential context networks for action prediction. In: CVPR, pp. 1473–1481 (2017)Google Scholar
  18. 18.
    Krull, A., Brachmann, E., Nowozin, S., Michel, F., Shotton, J., Rother, C.: Poseagent: budget-constrained 6D object pose estimation via reinforcement learning. In: CVPR, pp. 6702–6710 (2017)Google Scholar
  19. 19.
    Lai, S., Zheng, W.S., Hu, J.F., Zhang, J.: Global-local temporal saliency action prediction. TIP 27, 2272–2285 (2017)MathSciNetGoogle Scholar
  20. 20.
    Lan, T., Chen, T.-C., Savarese, S.: A hierarchical representation for future action prediction. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 689–704. Springer, Cham (2014). Scholar
  21. 21.
    Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion GAN for future-flow embedded video prediction. In: ICCV (2017)Google Scholar
  22. 22.
    Liang, X., Lee, L., Xing, E.P.: Deep variation-structured reinforcement learning for visual relationship and attribute detection. In: CVPR, pp. 848–857 (2017)Google Scholar
  23. 23.
    Littman, M.L.: Reinforcement learning improves behaviour from evaluative feedback. Nature 521(7553), 445–451 (2015)CrossRefGoogle Scholar
  24. 24.
    Ma, S., Sigal, L., Sclaroff, S.: Learning activity progression in LSTMs for activity detection and early detection. In: CVPR, pp. 1942–1950 (2016)Google Scholar
  25. 25.
    Mahmud, T., Hasan, M., Roy-Chowdhury, A.K.: Joint prediction of activity labels and starting times in untrimmed videos. In: ICCV, pp. 5773–5782 (2017)Google Scholar
  26. 26.
    Martinez, J., Black, M.J., Romero, J.: On human motion prediction using recurrent neural networks. In: CVPR, pp. 2891–2900 (2017)Google Scholar
  27. 27.
    Mathe, S., Pirinen, A., Sminchisescu, C.: Reinforcement learning for visual object detection. In: CVPR, pp. 2894–2902 (2016)Google Scholar
  28. 28.
    Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015)CrossRefGoogle Scholar
  29. 29.
    Qi, S., Huang, S., Wei, P., Zhu, S.C.: Predicting human activities using stochastic grammar. In: ICCV, pp. 1164–1172 (2017)Google Scholar
  30. 30.
    Ren, Z., Wang, X., Zhang, N., Lv, X., Li, L.J.: Deep reinforcement learning-based image captioning with embedding reward. In: CVPR, pp. 290–298 (2017)Google Scholar
  31. 31.
    Ryoo, M.S.: Human activity prediction: early recognition of ongoing activities from streaming videos. In: ICCV, pp. 1036–1043 (2011)Google Scholar
  32. 32.
    Ryoo, M.S., Aggarwal, J.: UT-interaction dataset, ICPR contest on semantic description of human activities (SDHA). In: ICPR, vol. 2, p. 4 (2010)Google Scholar
  33. 33.
    Ryoo, M.S., Matthies, L.: First-person activity recognition: what are they doing to me? In: CVPR, pp. 2730–2737 (2013)Google Scholar
  34. 34.
    Shi, Q., Cheng, L., Wang, L., Smola, A.: Human action segmentation and recognition using discriminative semi-markov models. IJCV 93(1), 22–32 (2011)CrossRefGoogle Scholar
  35. 35.
    Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS, pp. 568–576 (2014)Google Scholar
  36. 36.
    Singh, G., Saha, S., Sapienza, M., Torr, P.H.S., Cuzzolin, F.: Online real-time multiple spatiotemporal action localisation and prediction. In: ICCV, pp. 3637–3646 (2017)Google Scholar
  37. 37.
    Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  38. 38.
    Supancic III, J.S., Ramanan, D.: Tracking as online decision-making: learning a policy from streaming videos with reinforcement learning. In: ICCV, pp. 322–331 (2017)Google Scholar
  39. 39.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV, pp. 4489–4497 (2015)Google Scholar
  40. 40.
    Tudor Ionescu, R., Smeureanu, S., Alexe, B., Popescu, M.: Unmasking the abnormal events in video. In: ICCV, pp. 2895–2903 (2017)Google Scholar
  41. 41.
    Wang, J., Liu, Z., Wu, Y., Yuan, J.: Mining actionlet ensemble for action recognition with depth cameras. In: CVPR, pp. 1290–1297 (2012)Google Scholar
  42. 42.
    Wang, L., Xiong, Y., Lin, D., Van Gool, L.: Untrimmednets for weakly supervised action recognition and detection. In: CVPR, pp. 4325–4334 (2017)Google Scholar
  43. 43.
    Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: CVPR, pp. 4724–4732 (2016)Google Scholar
  44. 44.
    Xu, Z., Qing, L., Miao, J.: Activity auto-completion: predicting human activities from partial videos. In: ICCV, pp. 3191–3199 (2015)Google Scholar
  45. 45.
    Yoo, Y., Yun, S., Choi, J., Yun, K., Choi, J.Y.: Action-decision networks for visual tracking with deep reinforcement learning. In: CVPR, pp. 2711–2720 (2017)Google Scholar
  46. 46.
    Zaki, H.F.M., Shafait, F., Mian, A.: Modeling sub-event dynamics in first-person action recognition. In: CVPR, pp. 7253–7262 (2017)Google Scholar
  47. 47.
    Zhuang, B., Liu, L., Shen, C., Reid, I.: Towards context-aware interaction recognition for visual relationship detection. In: ICCV, pp. 589–598 (2017)Google Scholar

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. 1.Tianjin UniversityTianjinChina
  2. 2.Tsinghua UniversityBeijingChina

Personalised recommendations