Two stream multi-layer convolutional network for keyframe-based video summarization

Multimedia Tools and Applications

Abstract

In this paper, we propose an unsupervised static video summarization method that extracts keyframes representing the entire video. The method is two-stream: one stream extracts motion features and the other extracts visual features. For the visual stream, features are taken from several levels of abstraction through multi-level feature extraction and fusion. Using features from different layers yields a better frame representation, because it captures both coarse and fine-grained details of the frames. Neighborhood peak detection and redundancy removal algorithms are then applied to the fused features to produce the final keyframes, which form the video summary. The proposed method is aimed particularly at summarizing industrial surveillance videos. To demonstrate the wide applicability of the proposed model, we run extensive experiments on both domain-specific and domain-independent datasets. On the publicly available benchmark datasets OVP and YouTube, the method achieves a higher F-score than other unsupervised methods. We also report results on a new dataset that we created from CCTV footage of an industrial site. The results show that the proposed method outperforms existing methods by about 10% in terms of F-score.
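
The pipeline outlined in the abstract (fuse per-frame features, detect neighborhood peaks, remove redundant candidates) can be illustrated with a minimal Python sketch. The abstract does not give the networks, the fusion operator, the peak criterion, or the similarity measure, so everything below is an assumption made for illustration: the function names, fusion by concatenating L2-normalized vectors, Euclidean distance between consecutive frames as the change score, a local-maximum test within a fixed radius, and a cosine-similarity threshold for redundancy. The per-frame descriptors are toy vectors, not CNN outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse(visual, motion):
    """Assumed fusion: concatenate normalized visual and motion descriptors per frame."""
    return [l2_normalize(v) + l2_normalize(m) for v, m in zip(visual, motion)]

def change_scores(features):
    """Euclidean distance between each frame and its predecessor (0.0 for frame 0)."""
    scores = [0.0]
    for prev, cur in zip(features, features[1:]):
        scores.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, cur))))
    return scores

def neighborhood_peaks(scores, radius):
    """Frames whose positive score is the first maximum within +/- radius frames."""
    peaks = []
    for i, score in enumerate(scores):
        lo, hi = max(0, i - radius), min(len(scores), i + radius + 1)
        window = scores[lo:hi]
        if score > 0 and window.index(max(window)) == i - lo:
            peaks.append(i)
    return peaks

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def remove_redundant(candidates, features, max_similarity):
    """Keep a candidate only if it is dissimilar to every keyframe kept so far."""
    kept = []
    for idx in candidates:
        if all(cosine(features[idx], features[k]) < max_similarity for k in kept):
            kept.append(idx)
    return kept

def summarize(visual, motion, radius=2, max_similarity=0.95):
    """Return keyframe indices for per-frame visual and motion descriptors."""
    fused = fuse(visual, motion)
    peaks = neighborhood_peaks(change_scores(fused), radius)
    return remove_redundant(peaks, fused, max_similarity)

# Toy video: four shots of four frames each, with content A, B, A, B.
scene_a, scene_b = [1.0, 0.0], [0.0, 1.0]
visual = [scene_a] * 4 + [scene_b] * 4 + [scene_a] * 4 + [scene_b] * 4
motion = [[1.0, 0.0]] * 16  # constant motion descriptor

print(summarize(visual, motion))  # [4, 8]
```

In this toy run the change score peaks at frames 4, 8, and 12. Frame 12 repeats the content of frame 4, so the redundancy step discards it and the summary is frames 4 and 8. The `radius` and `max_similarity` defaults are arbitrary illustrative values, not settings reported by the authors.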

Author information

Corresponding author

Correspondence to Khushboo Khurana.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Khurana, K., Deshpande, U. Two stream multi-layer convolutional network for keyframe-based video summarization. Multimed Tools Appl 82, 38467–38508 (2023). https://doi.org/10.1007/s11042-023-14665-x

  • DOI: https://doi.org/10.1007/s11042-023-14665-x
