Two stream multi-layer convolutional network for keyframe-based video summarization

Khurana, Khushboo; Deshpande, Umesh

doi:10.1007/s11042-023-14665-x

Published: 16 March 2023

Multimedia Tools and Applications, volume 82, pages 38467–38508 (2023)

Abstract

In this paper, we propose an unsupervised static video summarization method that extracts keyframes representing the entire video. The method is two-stream: it extracts both motion and visual features from the video. For the visual stream, features are drawn from several levels of abstraction through multi-level feature extraction and fusion. Using features from different layers yields a better frame representation because it captures both coarse and fine-grained details of the frames. Neighborhood peak detection and redundancy removal algorithms are then applied to the fused features to produce the final keyframes that form the video summary. The proposed method is aimed particularly at the summarization of industrial surveillance videos. Extensive experiments are performed on both domain-specific and domain-independent datasets to demonstrate the wide applicability of the proposed model. Results on the publicly available benchmark datasets OVP and YouTube show an increase in F-score compared with other unsupervised methods. We also report results on a new dataset that we created from the CCTV footage of an industrial site. The results show that the proposed method outperforms existing methods by about 10% in terms of F-score.
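The abstract names the stages of the pipeline but not their details, so the following is only a minimal sketch of how fused two-stream features might pass through neighborhood peak detection and redundancy removal. Everything specific here is an assumption for illustration, not the authors' implementation: fusion by concatenating L2-normalized vectors, cosine distance between consecutive frames as the change score, the window `radius`, and the `min_score` and `max_similarity` thresholds. The per-frame visual and motion vectors are assumed to come from some pretrained network and are replaced by toy values.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(visual, motion):
    """Assumed fusion rule: concatenate the normalized stream vectors."""
    return l2_normalize(visual) + l2_normalize(motion)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def change_scores(features):
    """Score frame i by its cosine distance to frame i-1 (frame 0 gets 0)."""
    scores = [0.0]
    for prev, cur in zip(features, features[1:]):
        scores.append(1.0 - cosine(prev, cur))
    return scores


def neighborhood_peaks(scores, radius=1, min_score=0.05):
    """Indices whose score is the first maximum within +/- radius frames."""
    peaks = []
    for i, s in enumerate(scores):
        lo = max(0, i - radius)
        window = scores[lo:min(len(scores), i + radius + 1)]
        if s >= min_score and s == max(window) and window.index(s) == i - lo:
            peaks.append(i)
    return peaks


def remove_redundant(candidates, features, max_similarity=0.9):
    """Greedily keep a candidate only if it differs from all kept frames."""
    kept = []
    for i in candidates:
        if all(cosine(features[i], features[j]) < max_similarity for j in kept):
            kept.append(i)
    return kept


def summarize(visual_feats, motion_feats, radius=1):
    """Fuse the streams, detect candidate peaks, then drop near-duplicates."""
    fused = [fuse(v, m) for v, m in zip(visual_feats, motion_feats)]
    candidates = neighborhood_peaks(change_scores(fused), radius)
    return remove_redundant(candidates, fused)


# Toy video: scene A, scene B, scene A again, scene B again (two frames each).
A, B = [1.0, 0.0], [0.0, 1.0]
visual = [A, A, B, B, A, A, B, B]
motion = [[1.0, 0.0]] * 8  # constant motion descriptor in this toy case
keyframes = summarize(visual, motion)
print(keyframes)
```

In this toy run the scene changes at frames 2, 4 and 6 are all detected as peaks, and the frame at index 6 is then discarded because it is nearly identical to the already selected frame 2.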

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shri Ramdeobaba College of Engineering and Management, Nagpur, Maharashtra, India
Khushboo Khurana
Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India
Umesh Deshpande

Corresponding author

Correspondence to Khushboo Khurana.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khurana, K., Deshpande, U. Two stream multi-layer convolutional network for keyframe-based video summarization. Multimed Tools Appl 82, 38467–38508 (2023). https://doi.org/10.1007/s11042-023-14665-x

Received: 08 November 2021
Revised: 07 February 2022
Accepted: 03 February 2023
Published: 16 March 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11042-023-14665-x
