Diverse Video Captioning by Adaptive Spatio-temporal Attention

  • Conference paper
  • In: Pattern Recognition (DAGM GCPR 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13485)

Included in the conference series: DAGM GCPR

Abstract

To generate proper captions for videos, the inference needs to identify relevant concepts and to attend both to the spatial relationships between them and to the temporal development in the clip. Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures: an adapted transformer for joint spatio-temporal video analysis and a self-attention-based decoder for advanced text generation. Furthermore, we introduce an adaptive frame selection scheme that reduces the number of required input frames while maintaining the relevant content when training both transformers. Additionally, we estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample. Our approach achieves state-of-the-art results on the MSVD dataset as well as on the large-scale MSR-VTT and VATEX benchmark datasets with respect to multiple Natural Language Generation (NLG) metrics. Additional evaluations of diversity scores highlight the expressiveness and the structural diversity of our generated captions.
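
As a rough illustration of the setup summarized in the abstract, the PyTorch sketch below wires a transformer encoder over (adaptively selected) video frames to a self-attention caption decoder. It is a minimal sketch under assumed names and dimensions, not the authors' released model; in particular, the flattened-pixel embedding and the frame-difference selection heuristic are illustrative stand-ins.

```python
# Minimal conceptual sketch of an encoder-decoder captioning model with a
# transformer video encoder and a self-attention caption decoder, as outlined
# in the abstract. NOT the authors' implementation: the visual embedding,
# all layer sizes, and the frame-selection heuristic are assumptions.
import torch
import torch.nn as nn


def select_frames(frames: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k frames that differ most from their predecessor
    (a naive stand-in for an adaptive frame selection scheme)."""
    # frames: (T, C, H, W)
    diffs = (frames[1:] - frames[:-1]).abs().flatten(1).mean(dim=1)
    idx = torch.topk(diffs, k=min(k, diffs.numel())).indices.sort().values
    return frames[idx + 1]


class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512):
        super().__init__()
        # placeholder visual embedding; a real system would use a CNN/ViT backbone
        self.frame_proj = nn.Linear(3 * 224 * 224, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)  # video encoder
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)  # text decoder
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, 224, 224); tokens: (B, L) caption token ids so far
        memory = self.encoder(self.frame_proj(frames.flatten(2)))
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        out = self.decoder(self.tok_emb(tokens), memory, tgt_mask=causal)
        return self.lm_head(out)  # next-token logits over the caption vocabulary
```

In such a setup one would first subsample each clip (e.g. select_frames(frames, k=16)) and then train the model with teacher forcing on the ground-truth captions.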

Notes

  1. Categorizing and POS tagging using NLTK (https://www.nltk.org/); a short illustrative example follows below.
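
The snippet below is a small, hedged illustration of this footnote: it uses NLTK's tokenizer and part-of-speech tagger to pull noun and verb candidates out of a set of ground-truth captions. The caption strings and the noun/verb filter are assumptions for demonstration, not the paper's actual pipeline.

```python
# Illustrative example (not the paper's pipeline) of deriving candidate
# semantic concepts from aggregated ground-truth captions via NLTK POS tagging.
# The caption strings below are made up.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

captions = ["a man is playing a guitar", "someone plays the guitar on a stage"]

concepts = set()
for caption in captions:
    for word, tag in nltk.pos_tag(nltk.word_tokenize(caption)):
        # keep nouns and verbs as candidate concepts
        if tag.startswith(("NN", "VB")):
            concepts.add(word.lower())

print(sorted(concepts))  # e.g. ['guitar', 'is', 'man', 'playing', 'plays', ...]
```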

Acknowledgements

This work has been supported by the German Research Foundation: EXC 2064/1 - Project number 390727645, the CRC 1233 - Project number 276693517, as well as by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Zohreh Ghaderi and Leonard Salewski.

Author information

Corresponding author

Correspondence to Zohreh Ghaderi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 30953 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Ghaderi, Z., Salewski, L., Lensch, H.P.A. (2022). Diverse Video Captioning by Adaptive Spatio-temporal Attention. In: Andres, B., Bernard, F., Cremers, D., Frintrop, S., Goldlücke, B., Ihrke, I. (eds) Pattern Recognition. DAGM GCPR 2022. Lecture Notes in Computer Science, vol 13485. Springer, Cham. https://doi.org/10.1007/978-3-031-16788-1_25

  • DOI: https://doi.org/10.1007/978-3-031-16788-1_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16787-4

  • Online ISBN: 978-3-031-16788-1

  • eBook Packages: Computer Science, Computer Science (R0)
