
Hierarchical Attention-Based Video Captioning Using Key Frames

  • Conference paper
Artificial Intelligence and Technologies

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 806)


Abstract

Video captions on social media platforms such as YouTube and Facebook play a major role in understanding a video, especially when its audio is unclear. In this work, we propose a key frame-based model for video captioning. Instead of using all the frames in a video, only its key frames are used to represent it. The key frames are extracted by comparing frames with the structural similarity index (SSIM), so that only the informative frames are retained for captioning. We extract visual features from the key frames using a pre-trained convolutional neural network, and we also extract semantic features from them. The key frames are further passed to an object detection algorithm to identify objects and extract object-level features. Hierarchical attention is applied over the key frame features, the semantic features, and the object features, and the result is given as input to an LSTM that generates the caption for the video.
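
The abstract describes key frame extraction by comparing frames with the structural similarity index. A minimal sketch of that step follows, assuming OpenCV and scikit-image; the threshold value, resize dimensions, and function name are illustrative assumptions, not values taken from the paper.

import cv2
from skimage.metrics import structural_similarity as ssim

def extract_key_frames(video_path, ssim_threshold=0.7):
    # Keep a frame only when it is sufficiently dissimilar (low SSIM)
    # from the most recently kept key frame.
    cap = cv2.VideoCapture(video_path)
    key_frames, last_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # Compare on a small grayscale version to keep SSIM cheap and stable.
        gray = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2GRAY)
        if last_gray is None or ssim(last_gray, gray) < ssim_threshold:
            key_frames.append(frame)  # informative frame: keep it
            last_gray = gray
    cap.release()
    return key_frames

In the full pipeline sketched by the abstract, the frames returned here would feed the pre-trained CNN, the semantic feature extractor, and the object detector before hierarchical attention and the LSTM decoder.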




Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Hemalatha, M., Karthik, P. (2022). Hierarchical Attention-Based Video Captioning Using Key Frames. In: Raje, R.R., Hussain, F., Kannan, R.J. (eds) Artificial Intelligence and Technologies. Lecture Notes in Electrical Engineering, vol 806. Springer, Singapore. https://doi.org/10.1007/978-981-16-6448-9_30


  • DOI: https://doi.org/10.1007/978-981-16-6448-9_30

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-6447-2

  • Online ISBN: 978-981-16-6448-9

  • eBook Packages: Computer Science, Computer Science (R0)
