Abstract
In this paper, we propose an unsupervised static video summarization method that extracts keyframes representing the entire video. A two-stream method is presented, that extracts motion and visual features from the video. Features are also considered from different levels of abstraction for the visual stream by performing multi-level feature extraction and fusion. The utilization of features from different layers facilitates better frame representation by focusing on both coarse and fine-grained details of the frames. Neighborhood peak detection and redundancy removal algorithms are then applied to the fused features to produce the final keyframes representing the video summary. The proposed method particularly aims towards the summarization of industrial surveillance videos. Extensive experimentation is performed on both domain-specific as well as domain-independent datasets, to demonstrate the wide applicability of the proposed model. Results of the experimentation on publicly available benchmark datasets namely, OVP and YouTube, show an increase in the F-score as compared to other unsupervised methods. We also report results on a new dataset that we created from the CCTV footage of an industry. The results show that the proposed method outperforms the existing methods by about 10% in terms of the F-score.
Similar content being viewed by others
References
Abd-Almageed W (2008) Online, simultaneous shot boundary detection and key frame extraction for sports videos using rank tracing. In: 2008 15th IEEE international conference on image processing. IEEE, pp 3200–3203
Almeida J, Leite NJ, Torres RdS (2012) Vison: video summarization for online applications. Pattern Recogn Lett 33(4):397–409
Asim M, Almaadeed N, Al-Máadeed S, Bouridane A, Beghdadi A (2018) A key frame based video summarization using color features. In: 2018 colour and visual computing symposium (CVCS). IEEE, pp 1–6
Chatfield K, Simonyan K, Vedaldi A, Zisserman A (2014) Return of the devil in the details: delving deep into convolutional nets. arXiv:1405.3531
Cong Y, Liu J, Sun G, You Q, Li Y, Luo J (2016) Adaptive greedy dictionary selection for web media summarization. IEEE Trans Image Process 26(1):185–195
Datt M, Mukhopadhyay J (2018) Content based video summarization: finding interesting temporal sequences of frames. In: 2018 25th IEEE international conference on image processing (ICIP). IEEE, pp 1268–1272
De Avila SEF, Lopes APB, da Luz Jr A, de Albuquerque Araújo A (2011) Vsumm: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recogn Lett 32(1):56–68
DeMenthon D, Kobla V, Doermann D (1998) Video summarization by curve simplification. In: Proceedings of the Sixth ACM international conference on multimedia, pp 211–218
Feichtenhofer C, Pinz A, Wildes RP (2017) Spatiotemporal multiplier networks for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4768–4777
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1933–1941
Fu T-J, Tai S-H, Chen H-T (2019) Attentive and adversarial learning for video summarization. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, pp 1579–1587
Furini M, Geraci F, Montangero M, Pellegrini M (2007) Visto: visual storyboard for web video browsing. In: Proceedings of the 6th ACM international conference on image and video retrieval, pp 635–642
Furini M, Geraci F, Montangero M, Pellegrini M (2010) Stimo: still and moving video storyboard for the web scenario. Multimed Tools Appl 46 (1):47–69
Gong B, Chao W-L, Grauman K, Sha F (2014) Diverse sequential subset selection for supervised video summarization. Adv Neural Inf Process Syst 27:2069–2077
Goodale MA, Milner AD (1992) Separate visual pathways for perception and action. Trends Neurosci 15(1):20–25
He X, Hua Y, Song T, Zhang Z, Xue Z, Ma R, Robertson N, Guan H (2019) Unsupervised video summarization with attentive conditional generative adversarial networks. In: Proceedings of the 27th ACM international conference on multimedia, pp 2296–2304
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Herranz L, Martínez JM (2009) An efficient summarization algorithm based on clustering and bitstream extraction. In: 2009 IEEE international conference on multimedia and Expo. IEEE, pp 654–657
Huang C, Wang H (2019) A novel key-frames selection framework for comprehensive video summarization. IEEE Trans Circuits Syst Video Technol 30(2):577–589
Jadon S, Jasim M (2020) Video summarization using keyframe extraction and video skimming. Tech Rep EasyChair
Ji Z, Xiong K, Pang Y, Li X (2019) Video summarization with attention-based encoder–decoder networks. IEEE Trans Circuits Syst Video Technol 30 (6):1709–1717
Jiang Y, Cui K, Peng B, Xu C (2019) Comprehensive video understanding: video summarization with content-based video recommender design. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, pp 0–0
Jung Y, Cho D, Kim D, Woo S, Kweon IS (2019) Discriminative feature learning for unsupervised video summarization. In: Proceedings of the AAAI conference on artificial intelligence, vol 33, pp 8537–8544
Kang H-W, Matsushita Y, Tang X, Chen X-Q (2006) Space-time video montage. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06). IEEE, vol 2, pp 1331–1338
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp 1097–1105
Kuanar SK, Panda R, Chowdhury AS (2013) Video key frame extraction through dynamic delaunay clustering with a structural constraint. J Vis Commun Image Represent 24(7):1212–1227
Kwon H, Shim W, Cho M (2019) Temporal u-nets for video summarization with scene and action recognition. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, pp 0–0
Lee YJ, Ghosh J, Grauman K (2012) Discovering important people and objects for egocentric video summarization. In: 2012 IEEE conference on computer vision and pattern recognition. IEEE, pp 1346–1353
Li E, Xia J, Du P, Lin C, Samat A (2017) Integrating multilayer features of convolutional neural networks for remote sensing scene classification. IEEE Trans Geosci Remote Sens 55(10):5653–5665
Lin J, Gan C, Han S (2019) TSM: Temporal Shift Module for efficient video understanding. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 7083–7093
Ma C, Mu X, Sha D (2019) Multi-layers feature fusion of convolutional neural network for scene classification of remote sensing. IEEE Access 7:121685–121694
Mahasseni B, Lam M, Todorovic S (2017) Unsupervised video summarization with adversarial lstm networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 202–211
Martins GB, Afonso LCS, Osaku D, Almeida J, Papa JP (2014) Static video summarization through optimum-path forest clustering. In: Iberoamerican congress on pattern recognition. Springer, pp 893–900
Martins GB, Papa JP, Almeida J (2016) Temporal-and spatial-driven video summarization using optimum-path forest. In: 2016 29th SIBGRAPI conference on graphics, patterns and images (SIBGRAPI). Ieee, pp 335–339
Mathews RP, Panicker MR, Hareendranathan AR, Chen YT, Jaremko JL, Buchanan B, Narayan KV, Mathews G, et al. (2021) Unsupervised multi-latent space reinforcement learning framework for video summarization in ultrasound imaging. arXiv:2109.01309
Muhammad K, Hussain T, Del Ser J, Palade V, De Albuquerque VHC (2019) Deepres: a deep learning-based video summarization strategy for resource-constrained industrial surveillance scenarios. IEEE Trans Indust Inf 16(9):5938–5947
Mundur P, Rao Y, Yesha Y (2006) Keyframe-based video summarization using delaunay clustering. Int J Digit Libr 6(2):219–232
Niu Y, Lu Z, Wen J-R, Xiang T, Chang S-F (2018) Multi-modal multi-scale deep learning for large-scale image annotation. IEEE Trans Image Process 28(4):1720–1731
Open Video Project (2021) https://www.open-video.org. [Online; accessed 25-May-2021]
Pass G, Zabih R, Miller J (1997) Comparing images using color coherence vectors. In: Proceedings of the fourth ACM international conference on multimedia, pp 65–73
Pritch Y, Rav-Acha A, Peleg S (2008) Nonchronological video synopsis and indexing. IEEE Trans Patt Anal Mach Intell 30(11):1971–1984
Rochan M, Ye L, Wang Y (2018) Video summarization using fully convolutional sequence networks. In: Proceedings of the European conference on computer vision (ECCV), pp 347–363
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision, pp 618–626
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. arXiv:1406.2199
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556
Soomro K, Zamir AR, Shah M (2012) Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402
Sugano M, Nakajima Y, Yanagihara H (2002) Automated mpeg audio-video summarization and description. In: Proceedings international conference on image processing. IEEE, vol 1
Swain MJ, Ballard DH (2004) Color indexing. Int J Comput Vision 7:11–32
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1–9
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. https://doi.org/10.1109/CVPR.2016.308
Takahashi Y, Nitta N, Babaguchi N (2005) Video summarization for large sports video archives. In: 2005 IEEE international conference on multimedia and Expo. IEEE, pp 1170–1173
Tiwari V, Bhatnagar C (2021) A survey of recent work on video summarization: approaches and techniques. Multimed Tools Appl:1–35
Wang M, Yang G-W, Hu S-M, Yau S-T, Shamir A (2019) Write-a-video: computational video montage from themed text. ACM Trans Graph 38 (6):177–1
Wang W, Zhang Q, Luo B, Tang J, Ruan R, Li C (2017) Selecting attentive frames from visually coherent video chunks for surveillance video summarization. In: 2017 IEEE international conference on image processing (ICIP). IEEE, pp 2408–2412
Wei H, Ni B, Yan Y, Yu H, Yang X, Yao C (2018) Video summarization via semantic attended networks. In: Proceedings of the AAAI conference on artificial intelligence, vol 32
Wu J, Zhong S. -h., Jiang J, Yang Y (2017) A novel clustering method for static video summarization. Multimed Tools Appl 76(7):9625–9641
Yan X, Gilani SZ, Qin H, Feng M, Zhang L, Mian A (2018) Deep keyframe detection in human action videos. arXiv:1804.10021
Yang H, Wang B, Lin S, Wipf D, Guo M, Guo B (2015) Unsupervised extraction of video highlights via robust recurrent auto-encoders. In: Proceedings of the IEEE international conference on computer vision, pp 4633–4641
Yao T, Mei T, Rui Y (2016) Highlight detection with pairwise deep ranking for first-person video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 982–990
Zhang K, Chao W-L, Sha F, Grauman K (2016) Summary transfer: exemplar-based subset selection for video summarization. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1059–1067
Zhang K, Chao W-L, Sha F, Grauman K (2016) Video summarization with long short-term memory. In: European conference on computer vision. Springer, pp 766–782
Zhang Y, Kampffmeyer M, Zhao X, Tan M (2019) Dtr-gan: dilated temporal relational adversarial network for video summarization. In: Proceedings of the ACM turing celebration conference-China, pp 1–6
Zhang Y, Kampffmeyer M, Zhao X, Tan M (2019) Deep reinforcement learning for query-conditioned video summarization. Appl Sci 9(4):750
Zhao Y, Guo Y, Sun R, Liu Z, Guo D (2020) Unsupervised video summarization via clustering validity index. Multimed Tools Appl 79 (45):33417–33430
Zhou K, Qiao Y, Xiang T (2018) Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Proceedings of the AAAI conference on artificial intelligence, vol 32
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no confict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Khurana, K., Deshpande, U. Two stream multi-layer convolutional network for keyframe-based video summarization. Multimed Tools Appl 82, 38467–38508 (2023). https://doi.org/10.1007/s11042-023-14665-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-023-14665-x