Abstract
Video surveillance has become a major tool in security maintenance, but manually reviewing recorded footage to detect motion is tedious: motion typically occupies only a small fraction of a long video, much reviewing time is wasted, and it is difficult to pinpoint the exact frame where a transition occurs. A summary video that captures only the segments containing changes or motion is therefore needed. With advances in image processing using OpenCV and in deep learning, video summarization is no longer an impossible task. Captions are generated for the summarized videos using an encoder–decoder captioning model; large, well-labelled datasets such as Common Objects in Context (COCO) and the Microsoft Video Description (MSVD) corpus make video captioning feasible. Encoder–decoder models built on long short-term memory (LSTM) networks are used extensively to generate text from visual features, and attention mechanisms are widely applied on the decoder side for video captioning. Keyframes are extracted from very long videos using methods such as dynamic mode decomposition (DMD), an algorithm originating in fluid dynamics, and OpenCV's absdiff(). We propose these tools for motion detection and video/image captioning on the very long videos that are common in video surveillance.
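The frame-differencing idea behind OpenCV's absdiff() can be sketched in pure NumPy (shown instead of cv2.absdiff so the snippet stays dependency-free); the function names and the thresholds `thresh` and `min_change` are illustrative assumptions, not values from the paper:

```python
import numpy as np

def motion_score(prev, curr, thresh=25):
    """Fraction of pixels whose absolute grayscale difference exceeds thresh.

    Equivalent in spirit to thresholding the output of cv2.absdiff(prev, curr).
    Frames are cast to a signed type so the subtraction cannot wrap around.
    """
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    return (diff > thresh).mean()

def select_keyframes(frames, min_change=0.01):
    """Keep a frame whenever it differs enough from the last kept frame.

    Returns the indices of the retained frames; the first frame is always kept.
    """
    keep = [0]
    for i in range(1, len(frames)):
        if motion_score(frames[keep[-1]], frames[i]) >= min_change:
            keep.append(i)
    return keep
```

In a real pipeline the frames would come from a video reader (e.g. cv2.VideoCapture) and the motionless stretches would simply be dropped, yielding the short summary video that the abstract describes.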
References
Min Z (2007) Key frame extraction from scenery video. In: 2007 International conference on wavelet analysis and pattern recognition, pp 540–543. https://doi.org/10.1109/ICWAPR.2007.4420729
Shi Y, Yang H, Gong M, Liu X, Xia Y (2017) A fast and robust key frame extraction method for video copyright protection. J Electr Comput Eng 2017:1231794
Xu N, Liu AA, Wong Y, Zhang Y, Nie W, Su Y, Kankanhalli M (2018) Dual-stream recurrent neural network for video captioning. IEEE Trans Circuits Syst Video Technol 29(8):2482–2493
Song J, Guo Y, Gao L, Li X, Hanjalic A, Shen HT (2018) From deterministic to generative: multimodal stochastic RNNs for video captioning. IEEE Trans Neural Netw Learn Syst 30(10):3047–3058
Gao L, Li X, Song J, Shen HT (2019) Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Trans Pattern Anal Mach Intell 42(5):1112–1131
Pan Y, Yao T, Li H, Mei T (2017) Video captioning with transferred semantic attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 6504–6512
Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp 311–318
Lin CY (2004) ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain, pp 74–81
Elliott D, Keller F (2013) Image description using visual dependency representations. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp 1292–1302
Vedantam R, Lawrence Zitnick C, Parikh D (2015) CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 4566–4575
Tan M, Le Q (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, PMLR, pp 6105–6114
Sturman DJ, Zeltzer D (1994) A survey of glove-based input. IEEE Comput Graph Appl 14(1):30–39
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805
Yamashita R, Nishio M, Do RKG et al (2018) Convolutional neural networks: an overview and application in radiology. Insights Imaging 9:611–629
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Kenter T, Borisov A, De Rijke M (2016) Siamese CBOW: optimizing word embeddings for sentence representations. University of Amsterdam, Amsterdam, Yandex, Moscow
Potdar K, Pardawala TS, Pai CD (2017) A comparative study of categorical variable encoding techniques for neural network classifiers. Int J Comput Appl 175(4):7–9
Pan R, Tian Y, Wang Z (2010) Key-frame extraction based on clustering. In: 2010 IEEE international conference on progress in informatics and computing IEEE, vol 2, pp 867–871
Basaldella M, Antolli E, Serra G, Tasso C (2018) Bidirectional LSTM recurrent neural network for keyphrase extraction. https://doi.org/10.1007/978-3-319-73165-0
Wang Y, Sun Y, Ma Z, Gao L, Xu Y, Wu Y (2020) A method of relation extraction using pre-training models. In: 2020 13th International Symposium on Computational Intelligence and Design (ISCID) pp 176–179. https://doi.org/10.1109/ISCID51228.2020.00046
Shi Y, Yang H, Gong M, Liu X, Xia Y (2017) A fast and robust key frame extraction method for video copyright protection. J Electr Comput Eng 2017:1–7. https://doi.org/10.1155/2017/1231794
Pandey S, Dwivedy P, Meena S, Potnis A (2017) A survey on key frame extraction methods of a MPEG video. In: 2017 International Conference on Computing, Communication and Automation (ICCCA) IEEE, pp 1192–1196
Sun L, Zhou Y (2011) A key frame extraction method based on mutual information and image entropy. In: 2011 International Conference on Multimedia Technology, IEEE pp 35–38
Mentzelopoulos M, Psarrou A (2004) Key-frame extraction algorithm using entropy difference. In: Proceedings of the 6th ACM SIGMM international workshop on Multimedia information retrieval pp 39–45
Gao L, Guo Z, Zhang H, Xu X, Shen HT (2017) Video captioning with attention-based LSTM and semantic consistency. IEEE Trans Multimedia 19(9):2045–2055. https://doi.org/10.1109/TMM.2017.2729019
Lei X, Jiang X, Wang C (2013) Design and implementation of a real-time video stream analysis system based on FFMPEG. In: 2013 Fourth World Congress on Software Engineering IEEE, pp 212–216
Qaiser S, Ali R (2018) Text mining: use of TF-IDF to examine the relevance of words to documents. Int J Comput Appl 181(1):25–29
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2818–2826
Chen D, Dolan WB (2011) Collecting highly parallel data for paraphrase evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies pp 190–200
Funding
Our research was supported by Dr. Anand Kumar M. Due references have been provided for all supporting literature and resources.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Availability of data and material
The Microsoft Video Description (MSVD) [29] dataset, containing YouTube video clips with well-labelled captions, was used.
Code availability
Custom code was used to build the models with TensorFlow 1.15.0. Google Colab was used to train and evaluate the models on a GPU.
Cite this article
Radarapu, R., Gopal, A.S.S., NH, M. et al. Video summarization and captioning using dynamic mode decomposition for surveillance. Int. j. inf. tecnol. 13, 1927–1936 (2021). https://doi.org/10.1007/s41870-021-00668-0