Abstract
This paper addresses automatic human action recognition in videos. Rather than relying on traditional hand-crafted features such as HOG, HOF, and MBH, we explore how to learn both static and motion features from CNNs trained on large-scale datasets such as ImageNet and UCF101. We propose a novel method, the multi-resolution latent concept descriptor (mLCD), to encode two-stream CNNs. Extensive experiments demonstrate the performance of the proposed model. By combining our mLCD features with improved dense trajectory features, we achieve performance comparable to state-of-the-art algorithms on both the Hollywood2 and Olympic Sports datasets.
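To illustrate the general idea of encoding local CNN activations into a fixed-length video descriptor, the following is a minimal, hypothetical sketch of a VLAD-style aggregation over per-location feature vectors (e.g., pool5 activations from spatial and temporal streams at several resolutions). It is not the authors' implementation; the function name encode_vlad, the descriptor dimensions, and the number of clusters are illustrative assumptions only.

```python
# Illustrative sketch (not the paper's code): VLAD-style encoding of local CNN
# activations, in the spirit of latent concept descriptors. All names and sizes
# here are hypothetical placeholders.
import numpy as np
from sklearn.cluster import KMeans

def encode_vlad(local_descriptors, kmeans):
    """Aggregate local descriptors (N x D) into a single VLAD vector of size K * D."""
    centers = kmeans.cluster_centers_                   # (K, D) codebook
    assignments = kmeans.predict(local_descriptors)     # nearest center per descriptor
    k, d = centers.shape
    vlad = np.zeros((k, d), dtype=np.float64)
    for i, c in enumerate(assignments):
        vlad[c] += local_descriptors[i] - centers[c]    # accumulate residuals per center
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))        # power normalization
    vlad = vlad.ravel()
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad            # L2 normalization

# Example usage: descriptors would come from CNN feature maps of a video,
# one D-dimensional vector per spatial location and frame; here they are random.
rng = np.random.default_rng(0)
descriptors = rng.standard_normal((500, 512))           # placeholder local descriptors
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(descriptors)
video_representation = encode_vlad(descriptors, kmeans)
print(video_representation.shape)                        # (64 * 512,)
```

In such a pipeline, one fixed-length vector per stream (and per resolution) could then be concatenated or fused with improved dense trajectory features before classification; the exact fusion strategy used in the paper is not reproduced here.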
Acknowledgements
The work was supported by the National Natural Science Foundation of China (61272251), the Key Basic Research Program of Shanghai Municipality, China (15JC1400103) and the National Basic Research Program of China (2015CB856004).
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Xue, W., Zhao, H., Zhang, L. (2016). Encoding Multi-resolution Two-Stream CNNs for Action Recognition. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds) Neural Information Processing. ICONIP 2016. Lecture Notes in Computer Science(), vol 9949. Springer, Cham. https://doi.org/10.1007/978-3-319-46675-0_62
DOI: https://doi.org/10.1007/978-3-319-46675-0_62
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46674-3
Online ISBN: 978-3-319-46675-0