
Exploiting local spatio-temporal characteristics for effective video understanding

Published in: Multimedia Tools and Applications

Abstract

The explosive growth of online video streaming makes it challenging to understand videos with high accuracy and low computational complexity. Recent methods learn global video representations without considering the local spatial structure of videos over time. In this paper, we propose a method called partial channel fusion (PCF), which exploits local spatio-temporal characteristics for video understanding. We also present an architecture-agnostic and effective PCF module that provides both high efficiency and high performance across a variety of networks. Rather than modeling the spatial and motion structure of videos independently, the PCF module enables information exchange among multiple frames by partially fusing channels along the temporal dimension. By inserting the PCF module into different layers of a 2D convolutional network (2D-ConvNet), both the local and global spatio-temporal characteristics of videos can be captured. Experimental results on two challenging datasets demonstrate the superiority of PCF in improving the accuracy of 2D-ConvNets, advancing the state of the art without increasing computational complexity.
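To make the mechanism concrete, the following is a minimal, hypothetical PyTorch sketch of a partial-channel-fusion-style module: frame-level features from a 2D-ConvNet layer are reshaped per clip so that a fixed fraction of channels can be averaged with the corresponding channels of the neighbouring frame, while the remaining channels pass through unchanged. The class name, the fusion_ratio parameter, and the simple averaging rule are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class PartialChannelFusion(nn.Module):
    """Mixes a fraction of channels across neighbouring frames (illustrative sketch)."""

    def __init__(self, num_frames: int, fusion_ratio: float = 0.25):
        super().__init__()
        self.num_frames = num_frames      # frames sampled per clip (T)
        self.fusion_ratio = fusion_ratio  # fraction of channels mixed over time

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W) frame-level features from a 2D convolutional layer.
        nt, c, h, w = x.shape
        n = nt // self.num_frames
        x = x.view(n, self.num_frames, c, h, w)

        k = int(c * self.fusion_ratio)  # number of channels fused temporally
        out = x.clone()
        # One possible fusion rule: average the first k channels of each frame
        # with those of the next frame; the remaining channels stay untouched.
        out[:, :-1, :k] = 0.5 * (x[:, :-1, :k] + x[:, 1:, :k])
        return out.view(nt, c, h, w)


# Usage sketch: insert the module in front of an existing 2D ResNet stage so
# that the fused features are consumed by ordinary 2D convolutions.
pcf = PartialChannelFusion(num_frames=8, fusion_ratio=0.25)
feats = torch.randn(2 * 8, 64, 56, 56)   # 2 clips, 8 frames each
fused = pcf(feats)                        # shape stays (16, 64, 56, 56)

Because such a module only reindexes and averages existing activations, it adds no learnable parameters and negligible computation, which is consistent with the abstract's claim of improving accuracy without increasing computational complexity.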




Acknowledgements

This research was jointly supported by (1) the National Natural Science Foundation of China (Nos. 61771068, 61671079, 61471063, 61372120, 61421061, and 31971493); (2) the Beijing Municipal Natural Science Foundation (Nos. 4182041 and 4152039); (3) the National Basic Research Program of China (No. 2013CB329102); (4) the Research and Development Fund Talent Startup Project of Zhejiang A&F University (No. 2019FR070).

Author information


Corresponding author

Correspondence to Tongcun Liu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Liu, T., Liu, H. & Wang, Y. Exploiting local spatio-temporal characteristics for effective video understanding. Multimed Tools Appl 80, 31821–31836 (2021). https://doi.org/10.1007/s11042-021-11093-7

