Skip to main content

Compressed Video Action Recognition Using Motion Vector Representation

  • Conference paper
  • First Online:
Pattern Recognition. ICPR International Workshops and Challenges (ICPR 2021)

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12661))

Included in the following conference series:

Abstract

Action recognition is an important task for video understanding. Due to expensive time consumption, the conventional approaches employing the optical flow are difficult to be used for real-time purpose. Recently, the Motion Vector (MV), which can be directly extracted from the compressed video, has been introduced for action recognition. In this paper, we propose a novel approach by utilizing motion vector representation for action recognition. On the one hand, we use the motion vector information to select key information sequences for recognition. On the other hand, we further use the motion vector to formulate the representation of the selected sequences. We evaluate the proposed approach on UCF101 and HMDB51 datasets. The experimental results demonstrate that the proposed approach is able to achieve competitive recognition performance, and is able to maintain a 461.5 fps end-to-end processing rate at the same time.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Bilen, H., Fernando, B., Gavves, E., Vedaldi, A., Gould, S.: Dynamic image networks for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3034–3042 (2016)

    Google Scholar 

  2. Bross, B., Han, W.J., Ohm, J.R., Sullivan, G.J., Wang, Y.K., Wiegand, T.: High efficiency video coding (hevc) text specification draft 10 (for fdis & final call). Joint Collaborative Team on Video Coding (JCT-VC) of ITU-T SG16 WP3 and ISO/IEC JTC1/SC29/WG11, JCTVCL1003. v34 (2013)

    Google Scholar 

  3. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308 (2017)

    Google Scholar 

  4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

    Google Scholar 

  5. Diba, A., Sharma, V., Van Gool, L.: Deep temporal linear encoding networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2329–2338 (2017)

    Google Scholar 

  6. Du, Y., Wang, W., Wang, L.: Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1110–1118 (2015)

    Google Scholar 

  7. Fernando, B., Gavves, E., Oramas, J.M., Ghodrati, A., Tuytelaars, T.: Modeling video evolution for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5378–5387 (2015)

    Google Scholar 

  8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  9. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: 2011 International Conference on Computer Vision, pp. 2556–2563. IEEE (2011)

    Google Scholar 

  10. Le Gall, D.: Mpeg: a video compression standard for multimedia applications. Commun. ACM 34(4), 46–58 (1991)

    Article  Google Scholar 

  11. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. Circ. Syst. Video Technol 4(4), 438–442 (1994)

    Article  Google Scholar 

  12. Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 816–833. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_50

    Chapter  Google Scholar 

  13. Lu, T., Ai, S., Jiang, Y., Xiong, Y., Min, F.: Deep optical flow feature fusion based on 3D convolutional networks for video action recognition. In: 2018 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp. 1077–1080. IEEE (2018)

    Google Scholar 

  14. Shi, Y., Tian, Y., Wang, Y., Huang, T.: Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans. Multimedia 19(7), 1510–1520 (2017)

    Article  Google Scholar 

  15. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: Advances in Neural Information Processing Systems, pp. 568–576 (2014)

    Google Scholar 

  16. Song, X., Lan, C., Zeng, W., Xing, J., Sun, X., Yang, J.: Temporal-spatial mapping for action recognition. IEEE Trans. Circ. Syst. Video Technol. 30, 748–759 (2019)

    Google Scholar 

  17. Soomro, K., Zamir, A., Shah, M.: Ucf101-action recognition data set (2017)

    Google Scholar 

  18. Sullivan, G.J., Baker, R.L.: Efficient quadtree coding of images and video. In: Proceedings ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing (2002)

    Google Scholar 

  19. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)

    Google Scholar 

  20. Tran, D., Ray, J., Shou, Z., Chang, S.F., Paluri, M.: Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038 (2017)

  21. Varol, G., Laptev, I., Schmid, C.: Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1510–1517 (2017)

    Article  Google Scholar 

  22. Wang, L., et al.: Temporal segment networks: towards good practices for deep action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2

    Chapter  Google Scholar 

  23. Wang, X., Gao, L., Wang, P., Sun, X., Liu, X.: Two-stream 3-D convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimedia 20(3), 634–644 (2017)

    Article  Google Scholar 

  24. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the h. 264/avc video coding standard. IEEE Trans. Circ. Syst. Video Technol. 13(7), 560–576 (2003)

    Google Scholar 

  25. Wu, C.Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krähenbühl, P.: Compressed video action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6026–6035 (2018)

    Google Scholar 

  26. Zhang, B., Wang, L., Wang, Z., Qiao, Y., Wang, H.: Real-time action recognition with deeply transferred motion vector CNNs. IEEE Trans. Image Process. 27(5), 2326–2339 (2018)

    Article  MathSciNet  Google Scholar 

  27. Zhu, Y., Lan, Z., Newsam, S., Hauptmann, A.: Hidden two-stream convolutional networks for action recognition. In: Jawahar, C.V., Li, H., Mori, G., Schindler, K. (eds.) ACCV 2018. LNCS, vol. 11363, pp. 363–378. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20893-6_23

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Guanwen Zhang .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Zhou, C., Chen, X., Sun, P., Zhang, G., Zhou, W. (2021). Compressed Video Action Recognition Using Motion Vector Representation. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science(), vol 12661. Springer, Cham. https://doi.org/10.1007/978-3-030-68763-2_53

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-68763-2_53

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-68762-5

  • Online ISBN: 978-3-030-68763-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics