Action prediction via deep residual feature learning and weighted loss


Action prediction from partially observed videos is challenging because a partial video does not carry enough discriminative information for classification. In this paper, we propose a Deep Residual Feature Learning (DeepRFL) framework that extracts more discriminative information from partial videos, yielding representations similar to those of complete videos. The framework operates as a teacher-student network: the teacher network supplies complete-video features as supervision, and the student network uses residual feature learning to capture the salient differences between partial videos and their corresponding complete videos. The teacher and student networks are trained simultaneously, and a technique called partial feature detach prevents the teacher network from being disturbed by the student network. We also design a novel weighted loss function that penalizes partial videos with small observation ratios less heavily. Extensive evaluations on the challenging UCF101 and HMDB51 datasets demonstrate that the proposed method outperforms state-of-the-art methods without knowing the observation ratios of the test videos. The code will be made publicly available soon.
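The weighted loss idea can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the linear weight in `ratio_weight` below is purely an assumption chosen for illustration, and the function names are hypothetical; the only property taken from the abstract is that samples with small observation ratios are penalized less.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Standard cross-entropy for a single sample."""
    return -math.log(softmax(logits)[label])


def ratio_weight(observation_ratio):
    """Hypothetical weight: grows linearly with the observed fraction.

    Assumption for illustration only; the paper's actual weighting
    function is not stated in the abstract.
    """
    return observation_ratio


def weighted_loss(batch):
    """Mean of ratio-weighted cross-entropy terms.

    batch: list of (logits, label, observation_ratio) tuples,
    with observation_ratio in (0, 1].
    """
    terms = [ratio_weight(r) * cross_entropy(logits, y) for logits, y, r in batch]
    return sum(terms) / len(batch)


# The same wrong prediction costs less when only 10% of the video was seen.
scores = [2.0, 0.5, -1.0]
early = weighted_loss([(scores, 1, 0.1)])
late = weighted_loss([(scores, 1, 0.9)])
```

Under this assumed weighting, an identical misclassification contributes nine times more loss at an observation ratio of 0.9 than at 0.1. The observation ratio is used only during training here, which is consistent with the claim that the ratios of test videos need not be known.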






This research is partially sponsored by the Natural Science Foundation of China (Nos. 61872333, 61472387 and 61650201) and the Beijing Natural Science Foundation (Nos. 4152005 and 4162058).

Author information



Corresponding author

Correspondence to Laiyun Qing.


Cite this article

Guo, S., Qing, L., Miao, J. et al. Action prediction via deep residual feature learning and weighted loss. Multimed Tools Appl 79, 4713–4727 (2020).



  • Action prediction
  • Action recognition
  • Deep residual feature learning
  • Teacher-student network