Abstract
The explosive growth in online video streaming makes it challenging to understand video with both high accuracy and low computational complexity. Recent methods learn global video representations without considering how the local spatial structures of videos change over time. In this paper, we propose a method called partial channel fusion (PCF), which exploits local spatio-temporal characteristics for video understanding. We also present a network-agnostic and effective PCF module that provides both high efficiency and high performance in a variety of networks. Rather than modeling the spatial structure and motion structure of videos independently, the PCF module enables information exchange among multiple frames by partially fusing channels over the temporal dimension. By inserting the PCF module into different layers of a 2D convolutional network (2D-convNet), both the local and global spatio-temporal characteristics of videos can be captured. Experimental results on two challenging datasets demonstrate that PCF improves the accuracy of 2D-convNets, advancing the state of the art without increasing computational complexity.
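To make the idea of partially fusing channels over the temporal dimension concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the fusion operator, so the choices here are assumptions made purely for illustration: the function name `partial_channel_fusion`, the `fuse_ratio` parameter, the `(T, C, H, W)` layout, and the use of a three-frame average as the fusion step.

```python
import numpy as np


def partial_channel_fusion(x, fuse_ratio=0.25):
    """Hypothetical sketch of partial channel fusion over time.

    x: array of shape (T, C, H, W), i.e. T frames of C-channel feature maps.
    The first int(C * fuse_ratio) channels of every frame are replaced by the
    average of those channels over frames t-1, t and t+1. The remaining
    channels are left untouched, so each frame keeps most of its own
    spatial features. The averaging operator is an assumption; the source
    abstract only states that channels are partially fused over time.
    """
    out = x.astype(np.float64)  # astype returns a copy, so x is not modified
    num_fused = int(out.shape[1] * fuse_ratio)
    if num_fused == 0:
        return out

    part = out[:, :num_fused]
    # Repeat the first and last frame so that edge frames also have
    # a "previous" and "next" neighbour.
    padded = np.pad(part, ((1, 1), (0, 0), (0, 0), (0, 0)), mode="edge")
    out[:, :num_fused] = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    return out


# Example: 4 frames, 8 channels, 2x2 feature maps.
frames = np.arange(4 * 8 * 2 * 2, dtype=np.float64).reshape(4, 8, 2, 2)
fused = partial_channel_fusion(frames, fuse_ratio=0.25)
```

In this sketch the fusion step has no learnable weights, and the output has the same shape as the input, so it could be placed in front of an existing 2D convolution without changing that layer. Whether the published PCF module works this way is not stated in the abstract.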
Acknowledgements
This research was jointly supported by (1) the National Natural Science Foundation of China (Nos. 61771068, 61671079, 61471063, 61372120, 61421061, and 31971493); (2) the Beijing Municipal Natural Science Foundation (Nos. 4182041 and 4152039); (3) the National Basic Research Program of China (No. 2013CB329102); (4) the Research and Development Fund Talent Startup Project of Zhejiang A&F University (No. 2019FR070).
About this article
Cite this article
Liu, T., Liu, H. & Wang, Y. Exploiting local spatio-temporal characteristics for effective video understanding. Multimed Tools Appl 80, 31821–31836 (2021). https://doi.org/10.1007/s11042-021-11093-7
DOI: https://doi.org/10.1007/s11042-021-11093-7