Action prediction via deep residual feature learning and weighted loss

  • Shuangshuang Guo
  • Laiyun Qing (corresponding author)
  • Jun Miao
  • Lijuan Duan


Action prediction based on partially observed videos is challenging because the information provided by partial videos is not discriminative enough for classification. In this paper, we propose a Deep Residual Feature Learning (DeepRFL) framework that extracts more discriminative information from partial videos, yielding representations similar to those of complete videos. The framework operates as a teacher-student network: the teacher network provides complete-video features as supervision, and the student network uses residual feature learning to capture the salient differences between partial videos and their corresponding complete videos. The teacher and student networks are trained simultaneously, and a technique called partial feature detach is employed to prevent the teacher network from being disturbed by the student network. We also design a novel weighted loss function that penalizes partial videos with small observation ratios less heavily. Extensive evaluations on the challenging UCF101 and HMDB51 datasets demonstrate that the proposed method outperforms state-of-the-art methods without requiring the observation ratios of test videos. The code will be publicly available soon.
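
The idea of a loss that penalizes early (small observation ratio) partial videos less can be illustrated with a minimal sketch. The abstract does not state the actual form of the weighting, so the linear weight used here (weight equal to the observation ratio) is purely an assumption for illustration, and the names `ratio_weight` and `weighted_loss` are hypothetical, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard cross-entropy of one sample against its true class index."""
    return -math.log(softmax(logits)[label])


def ratio_weight(observation_ratio):
    """ASSUMPTION: weight grows linearly with the observation ratio.

    The paper's actual weighting function is not given in the abstract;
    any increasing function of the ratio would express the same idea.
    """
    if not 0.0 < observation_ratio <= 1.0:
        raise ValueError("observation ratio must be in (0, 1]")
    return observation_ratio


def weighted_loss(batch):
    """Mean of ratio-weighted cross-entropy terms.

    batch: list of (logits, label, observation_ratio) tuples, one per
    partial video seen during training.
    """
    total = 0.0
    for logits, label, ratio in batch:
        total += ratio_weight(ratio) * cross_entropy(logits, label)
    return total / len(batch)


# The same wrong prediction costs less when only 10% of the video was seen.
early = [([2.0, 0.0, 0.0], 1, 0.1)]
late = [([2.0, 0.0, 0.0], 1, 0.9)]
print(weighted_loss(early) < weighted_loss(late))
```

In this sketch the observation ratio is needed only while training; at test time the classifier scores are used directly, which is consistent with the abstract's statement that the ratios of test videos need not be known.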


Action prediction, Action recognition, Deep residual feature learning, Teacher-student network



This research is partially sponsored by the Natural Science Foundation of China (Nos. 61872333, 61472387 and 61650201) and the Beijing Natural Science Foundation (Nos. 4152005 and 4162058).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  • Shuangshuang Guo (1)
  • Laiyun Qing (1, corresponding author)
  • Jun Miao (2)
  • Lijuan Duan (3)

  1. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
  2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, School of Computer Science, Beijing Information Science and Technology University, Beijing, China
  3. Faculty of Information Technology, Beijing University of Technology, Beijing, China
