Learning motion and content-dependent features with convolutions for action recognition

Multimedia Tools and Applications

Abstract

A variety of recognition architectures based on deep convolutional neural networks have been devised for labeling videos containing human motion with action labels. However, most existing works cannot properly handle the temporal dynamics encoded in multiple contiguous frames, which is what distinguishes action recognition from other recognition tasks. This paper develops a temporal extension of convolutional neural networks that exploits motion-dependent features for recognizing human actions in video. Our approach differs from other recent attempts in that it uses multiplicative interactions between convolutional outputs to describe motion information across contiguous frames. Interestingly, a representation of image content arises while the model extracts motion patterns, which lets it incorporate both kinds of information when analyzing video. Additional theoretical analysis proves that motion- and content-dependent features arise simultaneously from the developed architecture, whereas previous works mostly deal with the two separately. Our architecture is trained and evaluated on the standard video action benchmarks KTH and UCF101, where it matches the state of the art and has distinct advantages over previous attempts to use deep convolutional architectures for action recognition.
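To make the idea of multiplicative interactions between convolutional outputs concrete, the following is a minimal sketch and not the architecture developed in the paper. It assumes two grayscale frames, two small random filter banks, and a plain element-wise product of the per-frame responses. The helper names `conv2d_valid` and `multiplicative_features`, the filter sizes, and the synthetic frames are all illustrative assumptions.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation of one image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    # Accumulate one shifted, weighted copy of the image per kernel tap.
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

def multiplicative_features(frame_t, frame_t1, filters_a, filters_b):
    """Element-wise product of filter responses from two contiguous frames.

    Hypothetical helper: filters_a is applied to frame_t, filters_b to
    frame_t1, and the k-th responses are multiplied position by position.
    """
    feats = []
    for ka, kb in zip(filters_a, filters_b):
        resp_t = conv2d_valid(frame_t, ka)
        resp_t1 = conv2d_valid(frame_t1, kb)
        feats.append(resp_t * resp_t1)
    return np.stack(feats)

# Synthetic data (assumption): the second frame is the first shifted by one pixel.
rng = np.random.default_rng(0)
frame_t = rng.standard_normal((16, 16))
frame_t1 = np.roll(frame_t, shift=1, axis=1)
filters_a = rng.standard_normal((4, 5, 5))
filters_b = rng.standard_normal((4, 5, 5))

features = multiplicative_features(frame_t, frame_t1, filters_a, filters_b)
print(features.shape)
```

In this sketch, passing the same frame and the same filter bank on both sides reduces each product to a squared filter response, which depends only on image content. That is one simple way to see how a product of responses can carry content information as well as motion information, though the paper's own analysis may proceed differently.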

Author information

Corresponding author

Correspondence to Cong Liu.

About this article

Cite this article

Liu, C., Xu, W., Wu, Q. et al. Learning motion and content-dependent features with convolutions for action recognition. Multimed Tools Appl 75, 13023–13039 (2016). https://doi.org/10.1007/s11042-015-2550-4

  • DOI: https://doi.org/10.1007/s11042-015-2550-4
