Abstract
Video captions on social media platforms such as YouTube and Facebook play a major role in understanding a video even when its audio is unclear. In this work, we propose a key frame-based model for video captioning: instead of using every frame of a video, only its key frames are used for the video representation. Key frames are extracted by comparing frames with the structural similarity index (SSIM), so that only informative frames are retained for captioning. We extract visual features of the key frames using a pre-trained convolutional neural network and also extract semantic features from them. The key frames are further passed to an object detection algorithm to identify objects and extract object-level features. Hierarchical attention is applied over the key-frame features, semantic features, and object features, and the result is fed to an LSTM to generate a caption for the video.
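The SSIM-based key-frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a global (single-window) form of the SSIM formula on grayscale frames, and the threshold value `0.7` is an assumption, since the abstract does not specify one.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Global (single-window) structural similarity between two
    grayscale frames, following the standard SSIM formula."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def select_key_frames(frames, threshold=0.7):
    """Keep the first frame, then every frame whose SSIM against the
    last kept key frame falls below the threshold, i.e. the frame is
    different enough from the previous key frame to be informative."""
    key_frames = [frames[0]]
    for frame in frames[1:]:
        if ssim_global(key_frames[-1], frame) < threshold:
            key_frames.append(frame)
    return key_frames
```

In practice a windowed SSIM (e.g. `skimage.metrics.structural_similarity`) would be used on frames decoded from the video; the selected key frames are then the input to the CNN feature extractor and the object detector.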
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hemalatha, M., Karthik, P. (2022). Hierarchical Attention-Based Video Captioning Using Key Frames. In: Raje, R.R., Hussain, F., Kannan, R.J. (eds) Artificial Intelligence and Technologies. Lecture Notes in Electrical Engineering, vol 806. Springer, Singapore. https://doi.org/10.1007/978-981-16-6448-9_30
Print ISBN: 978-981-16-6447-2
Online ISBN: 978-981-16-6448-9
eBook Packages: Computer Science (R0)