Diverse Video Captioning by Adaptive Spatio-temporal Attention

  • Conference paper
  • In: Pattern Recognition (DAGM GCPR 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13485)

Included in the conference series: DAGM GCPR

Abstract

To generate proper captions for videos, the inference needs to identify relevant concepts and to attend both to the spatial relationships between them and to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures: an adapted transformer for joint spatio-temporal video analysis and a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme that reduces the number of required input frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD dataset as well as on the large-scale MSR-VTT and VATEX benchmark datasets with respect to multiple Natural Language Generation (NLG) metrics. Additional evaluations of diversity scores highlight the expressiveness and the structural diversity of our generated captions.
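
As a rough illustration of the setup summarized in the abstract, the PyTorch sketch below wires a transformer encoder over (adaptively selected) video frames to a self-attention caption decoder. It is a minimal sketch under assumed names and dimensions, not the authors' released model; in particular, the flattened-pixel embedding and the frame-difference selection heuristic are illustrative stand-ins.

```python
# Minimal conceptual sketch of an encoder-decoder captioning model with a
# transformer video encoder and a self-attention caption decoder, as outlined
# in the abstract. NOT the authors' implementation: the visual embedding,
# all layer sizes, and the frame-selection heuristic are assumptions.
import torch
import torch.nn as nn


def select_frames(frames: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k frames that differ most from their predecessor
    (a naive stand-in for an adaptive frame selection scheme)."""
    # frames: (T, C, H, W)
    diffs = (frames[1:] - frames[:-1]).abs().flatten(1).mean(dim=1)
    idx = torch.topk(diffs, k=min(k, diffs.numel())).indices.sort().values
    return frames[idx + 1]


class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        # placeholder visual embedding; a real system would use a CNN/ViT backbone
        self.frame_proj = nn.Linear(3 * 224 * 224, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)  # video encoder
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)  # text decoder
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, 224, 224); tokens: (B, L) caption token ids so far
        memory = self.encoder(self.frame_proj(frames.flatten(2)))
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(self.tok_emb(tokens), memory, tgt_mask=causal)
        return self.lm_head(out)  # next-token logits over the caption vocabulary
```

In such a setup one would first subsample each clip (e.g. select_frames(frames, k=16)) and then train the model with teacher forcing on the ground-truth captions.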

Notes

  1. Categorizing and POS tagging using NLTK (https://www.nltk.org/); a short illustrative example follows below.
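
The snippet below is a small, hedged illustration of this footnote: it uses NLTK's tokenizer and part-of-speech tagger to pull noun and verb candidates out of a set of ground-truth captions. The caption strings and the noun/verb filter are assumptions for demonstration, not the paper's actual pipeline.

```python
# Illustrative example (not the paper's pipeline) of deriving candidate
# semantic concepts from aggregated ground-truth captions via NLTK POS tagging.
# The caption strings below are made up.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

captions = ["a man is playing a guitar", "someone plays the guitar on a stage"]

concepts = set()
for caption in captions:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(caption)):
        # keep nouns and verbs as candidate concepts
        if tag.startswith(("NN", "VB")):
            concepts.add(word.lower())

print(sorted(concepts))  # e.g. ['guitar', 'is', 'man', 'playing', 'plays', ...]
```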

Acknowledgements

This work has been supported by the German Research Foundation: EXC 2064/1 - Project number 390727645, the CRC 1233 - Project number 276693517, as well as by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Zohreh Ghaderi and Leonard Salewski.

Author information

Corresponding author

Correspondence to Zohreh Ghaderi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 30953 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ghaderi, Z., Salewski, L., Lensch, H.P.A. (2022). Diverse Video Captioning by Adaptive Spatio-temporal Attention. In: Andres, B., Bernard, F., Cremers, D., Frintrop, S., Goldlücke, B., Ihrke, I. (eds) Pattern Recognition. DAGM GCPR 2022. Lecture Notes in Computer Science, vol 13485. Springer, Cham. https://doi.org/10.1007/978-3-031-16788-1_25

  • DOI: https://doi.org/10.1007/978-3-031-16788-1_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16787-4

  • Online ISBN: 978-3-031-16788-1

  • eBook Packages: Computer Science, Computer Science (R0)
