Exploiting local spatio-temporal characteristics for effective video understanding

Liu, Tongcun; Liu, Haoxin; Wang, Yulong

doi:10.1007/s11042-021-11093-7

Published: 20 July 2021

Multimedia Tools and Applications, volume 80, pages 31821–31836 (2021)

Abstract

The explosive growth in online video streaming makes it challenging to understand video with both high accuracy and low computational complexity. Recent methods build a global video representation without considering how the local spatial structures of a video change over time. In this paper, we propose a method called partial channel fusion (PCF), which exploits local spatio-temporal characteristics for video understanding. We also present a network-agnostic and effective PCF module that provides both high efficiency and high performance in a variety of networks. Rather than modeling the spatial structure and the motion structure of videos independently, the PCF module enables information exchange among multiple frames by partially fusing channels over the temporal dimension. Inserting the PCF module into different layers of 2D convolutional networks (2D-ConvNets) captures both the local and the global spatio-temporal characteristics of videos. Experimental results on two challenging datasets demonstrate that PCF improves the accuracy of 2D-ConvNets, advancing the state of the art without increasing computational complexity.
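The abstract does not specify the fusion operator, so the following is only a minimal sketch of what "partially fusing channels over the temporal dimension" could look like, not the authors' implementation. It assumes that a fixed fraction of channels is replaced by the average of the same channel in the neighbouring frames, while the remaining channels are left untouched. The function name `partial_channel_fusion`, the parameter `fuse_ratio`, the three-frame window, and the use of averaging are all hypothetical choices; spatial dimensions are omitted, so each frame is reduced to a plain list of channel values.

```python
def partial_channel_fusion(frames, fuse_ratio=0.25):
    """Illustrative sketch only (assumed operator, not the paper's definition).

    frames: list of T frames, each a list of C channel values.
    The first int(C * fuse_ratio) channels of every frame are replaced by
    the mean of that channel over frames t-1, t, t+1 (clipped at the ends).
    All other channels are copied through unchanged.
    """
    if not frames:
        return []
    num_frames = len(frames)
    num_channels = len(frames[0])
    num_fused = int(num_channels * fuse_ratio)  # how many channels take part in fusion

    fused = []
    for t in range(num_frames):
        lo = max(t - 1, 0)
        hi = min(t + 1, num_frames - 1)
        window = frames[lo:hi + 1]              # neighbouring frames, including frame t
        out = list(frames[t])                   # untouched channels keep per-frame content
        for c in range(num_fused):
            out[c] = sum(f[c] for f in window) / len(window)
        fused.append(out)
    return fused


# Three frames with four channels each; half of the channels are fused.
clip = [
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
    [8.0, 9.0, 10.0, 11.0],
]
result = partial_channel_fusion(clip, fuse_ratio=0.5)
```

Under these assumptions the operation adds no learnable parameters, and the output has the same shape as the input, so a layer of this kind could in principle be placed between existing 2D convolutional layers without changing them.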

Acknowledgements

This research was jointly supported by (1) the National Natural Science Foundation of China (Nos. 61771068, 61671079, 61471063, 61372120, 61421061, and 31971493); (2) the Beijing Municipal Natural Science Foundation (Nos. 4182041 and 4152039); (3) the National Basic Research Program of China (No. 2013CB329102); (4) the Research and Development Fund Talent Startup Project of Zhejiang A&F University (No. 2019FR070).

Author information

Authors and Affiliations

School of Information Engineering, Zhejiang A&F University, Hangzhou, China
Tongcun Liu
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Haoxin Liu & Yulong Wang

Corresponding author

Correspondence to Tongcun Liu.

About this article

Cite this article

Liu, T., Liu, H. & Wang, Y. Exploiting local spatio-temporal characteristics for effective video understanding. Multimed Tools Appl 80, 31821–31836 (2021). https://doi.org/10.1007/s11042-021-11093-7

Received: 15 September 2020
Revised: 05 January 2021
Accepted: 21 May 2021
Published: 20 July 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s11042-021-11093-7
