Human Action Recognition in Videos of Realistic Scenes Based on Multi-scale CNN Feature

Zhou, Yongsheng; Pu, Nan; Qian, Li; Wu, Song; Xiao, Guoqiang

doi:10.1007/978-3-319-77383-4_31

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10736)

Included in the following conference series:

Pacific Rim Conference on Multimedia

Abstract

In this paper, we develop a novel method for human action recognition that builds a robust feature representation from deep convolutional features and the Latent Dirichlet Allocation (LDA) model. In contrast to traditional CNN features, which are taken from the outputs of the fully connected layers, we show that a low-dimensional feature representation generated from the deep convolutional layers is more discriminative. In addition, we apply a multi-scale pooling strategy to the convolutional feature maps to better handle objects of different scales and deformations. Moreover, we adopt LDA to explore the semantic relationships in video sequences and generate a topic histogram to represent each video, since LDA puts more emphasis on content coherence than on mere spatial contiguity. Extensive experimental results on two challenging datasets show that the proposed approach outperforms or is competitive with state-of-the-art methods for human action recognition.
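As a rough illustration of what pooling a convolutional feature map at several scales can look like, the sketch below max-pools a single (channels, height, width) array over 1×1, 2×2, and 4×4 grids and concatenates the results. The abstract does not specify the pooling operator, the grid sizes, or the layer used, so the function name `multi_scale_pool`, the `levels` parameter, and the choice of max pooling are all assumptions made for illustration; the LDA step is not shown.

```python
import numpy as np


def multi_scale_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over an n x n grid for each n in levels.

    Hypothetical helper: the grid sizes and the max operator are assumptions,
    not details taken from the paper.
    """
    channels, height, width = feature_map.shape
    if min(height, width) < max(levels):
        raise ValueError("feature map is smaller than the finest pooling grid")

    pooled = []
    for n in levels:
        # Integer bin edges that cover the whole map even when H or W
        # is not divisible by n.
        row_edges = [(k * height) // n for k in range(n + 1)]
        col_edges = [(k * width) // n for k in range(n + 1)]
        for i in range(n):
            for j in range(n):
                cell = feature_map[
                    :,
                    row_edges[i]:row_edges[i + 1],
                    col_edges[j]:col_edges[j + 1],
                ]
                # One value per channel for this grid cell.
                pooled.append(cell.max(axis=(1, 2)))
    # Fixed-length descriptor of size C * sum(n * n for n in levels).
    return np.concatenate(pooled)


if __name__ == "__main__":
    demo = np.random.rand(512, 14, 14)  # stand-in for one frame's conv output
    print(multi_scale_pool(demo).shape)
```

Because the descriptor length depends only on the channel count and the grid sizes, frames with feature maps of different spatial sizes map to vectors of the same length, which is what makes a fixed-size per-frame representation possible before any later quantization or topic modeling.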

Author information

Authors and Affiliations

The School of Computer and Information Science in Southwest University, Chongqing, China
Yongsheng Zhou, Nan Pu, Li Qian, Song Wu & Guoqiang Xiao

Corresponding author

Correspondence to Guoqiang Xiao.

Editor information

Editors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Bing Zeng
University of Chinese Academy of Sciences, Beijing, China
Qingming Huang
University of Ottawa, Ottawa, Ontario, Canada
Abdulmotaleb El Saddik
University of Electronic Science and Technology of China, Chengdu, China
Hongliang Li
Chinese Academy of Sciences, Beijing, China
Shuqiang Jiang
Harbin Institute of Technology, Harbin, China
Xiaopeng Fan

About this paper

Cite this paper

Zhou, Y., Pu, N., Qian, L., Wu, S., Xiao, G. (2018). Human Action Recognition in Videos of Realistic Scenes Based on Multi-scale CNN Feature. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. PCM 2017. Lecture Notes in Computer Science, vol 10736. Springer, Cham. https://doi.org/10.1007/978-3-319-77383-4_31

DOI: https://doi.org/10.1007/978-3-319-77383-4_31
Published: 10 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77382-7
Online ISBN: 978-3-319-77383-4
eBook Packages: Computer Science, Computer Science (R0)
