Abstract
Video captions on social media platforms such as YouTube and Facebook play a major role in understanding a video even when its audio is unclear. In this work, we propose a key frame-based model for video captioning: instead of using every frame of a video, only its key frames are used for the video representation. Key frames are extracted by comparing frames with the structural similarity index (SSIM), so that only informative frames are retained for captioning. We extract visual features of the key frames using a pre-trained convolutional neural network and also extract semantic features from them. The key frames are further passed to an object detection algorithm to identify objects and extract object-level features. Hierarchical attention is applied over the key-frame features, semantic features, and object features, and the result is fed to an LSTM to generate a caption for the video.
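The SSIM-based key-frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a global (single-window) form of the SSIM formula on grayscale frames, and the threshold value `0.7` is an assumption, since the abstract does not specify one.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Global (single-window) structural similarity between two
    grayscale frames, following the standard SSIM formula."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def select_key_frames(frames, threshold=0.7):
    """Keep the first frame, then every frame whose SSIM against the
    last kept key frame falls below the threshold, i.e. the frame is
    different enough from the previous key frame to be informative."""
    key_frames = [frames[0]]
    for frame in frames[1:]:
        if ssim_global(key_frames[-1], frame) < threshold:
            key_frames.append(frame)
    return key_frames
```

In practice a windowed SSIM (e.g. `skimage.metrics.structural_similarity`) would be used on frames decoded from the video; the selected key frames are then the input to the CNN feature extractor and the object detector.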
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Hemalatha, M., Karthik, P. (2022). Hierarchical Attention-Based Video Captioning Using Key Frames. In: Raje, R.R., Hussain, F., Kannan, R.J. (eds) Artificial Intelligence and Technologies. Lecture Notes in Electrical Engineering, vol 806. Springer, Singapore. https://doi.org/10.1007/978-981-16-6448-9_30
Print ISBN: 978-981-16-6447-2
Online ISBN: 978-981-16-6448-9
eBook Packages: Computer Science (R0)