Abstract
Video surveillance has become a major tool in security maintenance, but manually reviewing recorded footage to detect motion is tedious: motion typically occupies only a small fraction of a long video, much reviewing time is wasted, and it is difficult to pinpoint the exact frame where a transition occurs. A summary video that captures only the segments containing changes or motion is therefore needed. With advances in image processing using OpenCV and in deep learning, video summarization is no longer an impossible task. Captions are generated for the summarized videos using an encoder–decoder captioning model; large, well-labelled datasets such as Common Objects in Context (COCO) and the Microsoft Video Description (MSVD) corpus make video captioning feasible. Encoder–decoder models built on long short-term memory (LSTM) networks are used extensively to generate text from visual features, and attention mechanisms are widely applied on the decoder side for video captioning. Keyframes are extracted from very long videos using methods such as dynamic mode decomposition (DMD), an algorithm originating in fluid dynamics, and OpenCV's absdiff(). We propose these tools for motion detection and video/image captioning on the very long videos that are common in video surveillance.
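The frame-differencing idea behind OpenCV's absdiff() can be sketched in pure NumPy (shown instead of cv2.absdiff so the snippet stays dependency-free); the function names and the thresholds `thresh` and `min_change` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def motion_score(prev, curr, thresh=25):
    """Fraction of pixels whose absolute grayscale difference exceeds thresh.

    Equivalent in spirit to thresholding the output of cv2.absdiff(prev, curr).
    Frames are cast to a signed type so the subtraction cannot wrap around.
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return (diff > thresh).mean()

def select_keyframes(frames, min_change=0.01):
    """Keep a frame whenever it differs enough from the last kept frame.

    Returns the indices of the retained frames; the first frame is always kept.
    """
    keep = [0]
    for i in range(1, len(frames)):
        if motion_score(frames[keep[-1]], frames[i]) >= min_change:
            keep.append(i)
    return keep
```

In a real pipeline the frames would come from a video reader (e.g. cv2.VideoCapture) and the motionless stretches would simply be dropped, yielding the short summary video that the abstract describes.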
References
Min Z (2007) Key frame extraction from scenery video. In: 2007 International conference on wavelet analysis and pattern recognition, pp 540–543. https://doi.org/10.1109/ICWAPR.2007.4420729
Shi Y, Yang H, Gong M, Liu X, Xia Y (2017) A fast and robust key frame extraction method for video copyright protection. J Electr Comput Eng 2017:1231794
Xu N, Liu AA, Wong Y, Zhang Y, Nie W, Su Y, Kankanhalli M (2018) Dual-stream recurrent neural network for video captioning. IEEE Trans Circuits Syst Video Technol 29(8):2482–2493
Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2018) From deterministic to generative: multimodal stochastic RNNs for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058
Gao L, Li X, Song J, Shen HT (2019) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans Pattern Anal Mach Intell 42(5):1112–1131
Pan Y, Yao T, Li H, Mei T (2017) Video captioning with transferred semantic attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6504–6512
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Lin CY (2004) ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, pp 74–81
Elliott D, Keller F (2013) Image description using visual dependency representations. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1292–1302
Vedantam R, Lawrence Zitnick C, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4566–4575
Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, PMLR, pp 6105–6114
Sturman DJ, Zeltzer D (1994) A survey of glove-based input. IEEE Comput Graph Appl 14(1):30–39
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Yamashita R, Nishio M, Do RKG et al (2018) Convolutional neural networks: an overview and application in radiology. Insights Imaging 9:611–629
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Kenter T, Borisov A, De Rijke M (2016) Siamese CBOW: optimizing word embeddings for sentence representations. University of Amsterdam, Amsterdam, Yandex, Moscow
Potdar K, Pardawala TS, Pai CD (2017) A comparative study of categorical variable encoding techniques for neural network classifiers. Int J Comput Appl 175(4):7–9
Pan R, Tian Y, Wang Z (2010) Key-frame extraction based on clustering. In: 2010 IEEE international conference on progress in informatics and computing IEEE, vol 2, pp 867–871
Basaldella M, Antolli E, Serra G, Tasso C (2018) Bidirectional LSTM recurrent neural network for keyphrase extraction. https://doi.org/10.1007/978-3-319-73165-0
Wang Y, Sun Y, Ma Z, Gao L, Xu Y, Wu Y (2020) A method of relation extraction using pre-training models. In: 2020 13th International Symposium on Computational Intelligence and Design (ISCID) pp 176–179. https://doi.org/10.1109/ISCID51228.2020.00046
Shi Y, Yang H, Gong M, Liu X, Xia Y (2017) A fast and robust key frame extraction method for video copyright protection. J Electr Comput Eng 2017:1–7. https://doi.org/10.1155/2017/1231794
Pandey S, Dwivedy P, Meena S, Potnis A (2017) A survey on key frame extraction methods of a MPEG video. In: 2017 International Conference on Computing, Communication and Automation (ICCCA) IEEE, pp 1192–1196
Sun L, Zhou Y (2011) A key frame extraction method based on mutual information and image entropy. In: 2011 International Conference on Multimedia Technology, IEEE pp 35–38
Mentzelopoulos M, Psarrou A (2004) Key-frame extraction algorithm using entropy difference. In: Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval pp 39–45
Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based LSTM and semantic consistency. IEEE Trans Multimedia 19(9):2045–2055. https://doi.org/10.1109/TMM.2017.2729019
Lei X, Jiang X, Wang C (2013) Design and implementation of a real-time video stream analysis system based on FFMPEG. In: 2013 Fourth World Congress on Software Engineering IEEE, pp 212–216
Qaiser S, Ali R (2018) Text mining: use of TF-IDF to examine the relevance of words to documents. Int J Comput Appl 181(1):25–29
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2818–2826
Chen D, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies pp 190–200
Funding
Our research was supported by Dr. Anand Kumar M. Due references have been provided for all supporting literature and resources.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Availability of data and material
The Microsoft Video Description (MSVD) [29] dataset, containing YouTube video clips with well-labelled captions, was used.
Code availability
Custom code was used to build the models with TensorFlow 1.15.0. Google Colab was used to train and evaluate the models on a GPU.
Cite this article
Radarapu, R., Gopal, A.S.S., NH, M. et al. Video summarization and captioning using dynamic mode decomposition for surveillance. Int. j. inf. tecnol. 13, 1927–1936 (2021). https://doi.org/10.1007/s41870-021-00668-0