Multimodal-enhanced hierarchical attention network for video captioning

Zhong, Maosheng; Chen, Youde; Zhang, Hao; Xiong, Hao; Wang, Zhixiang

doi:10.1007/s00530-023-01130-w

Regular Paper
Published: 15 July 2023

Multimedia Systems, Volume 29, pages 2469–2482 (2023)

Abstract

In video captioning, many pioneering approaches generate higher-quality captions by exploring and adding new video feature modalities. However, as the number of modalities increases, negative interactions between them gradually reduce the gain in caption quality. To address this problem, we propose a three-layer hierarchical attention network, based on a bidirectional decoding transformer, that enhances multimodal features. In the first layer, we apply a different encoder to each modality, chosen according to that modality's characteristics, to enhance its vector representation. In the second layer, we select keyframes from all sampled frames of each modality by calculating the attention value between the generated words and each frame. In the third layer, we allocate weights to the different modalities to reduce redundancy between them before generating the current word. Additionally, we use a bidirectional decoder to take the context of the ground-truth caption into account when generating captions. Experiments on two mainstream benchmark datasets, MSVD and MSR-VTT, demonstrate the effectiveness of the proposed model: it achieves state-of-the-art performance on significant metrics, and the generated sentences are more in line with human language habits. Overall, the network effectively enhances multimodal features and generates high-quality video captions. Code is available at https://github.com/nickchen121/MHAN.
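The second and third layers described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes scaled dot-product attention, which the abstract does not specify, and it assumes that a single query vector stands in for the decoder's representation of the generated words. The function names (`attend`, `hierarchical_attention`) and the toy feature vectors are hypothetical, and the first layer (the modality-specific encoders) is omitted, so the inputs are taken to be already encoded.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors.

    Assumption: the paper's exact scoring function may differ.
    Returns (weights, weighted sum of the vectors).
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    dim = len(vectors[0])
    context = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
    return weights, context


def hierarchical_attention(query, modalities):
    """Two-level attention over encoded multimodal features.

    modalities maps a modality name to its list of frame vectors.
    Frame level (layer 2): weight the frames within each modality.
    Modality level (layer 3): weight the per-modality summaries.
    """
    names = list(modalities)
    frame_weights = {}
    summaries = []
    for name in names:
        weights, summary = attend(query, modalities[name])
        frame_weights[name] = weights
        summaries.append(summary)
    modal_weights, fused = attend(query, summaries)
    return frame_weights, dict(zip(names, modal_weights)), fused


# Toy example with hypothetical 2-dimensional features.
query = [1.0, 0.0]
modalities = {
    "appearance": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    "motion": [[0.0, 1.0], [0.0, 1.0]],
}
frame_w, modal_w, fused = hierarchical_attention(query, modalities)
```

In this toy input, the first appearance frame is the one most aligned with the query, so it receives the largest frame-level weight, and the appearance modality receives a larger modality-level weight than motion. The fused vector is what a decoder would consume before predicting the current word under these assumptions.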

Data availability

This paper uses two common datasets in the field of video captioning, MSVD and MSR-VTT. Data availability is not applicable to this article as no new data were created or analyzed in this study.

Acknowledgements

The authors would like to express their gratitude to the anonymous reviewers for their valuable comments, which have helped to improve the quality of the paper. This research has been partially supported by the National Natural Science Foundation of China (Grant No. 61877031) and the Jiangxi Normal University Graduate Innovation Fund (Grant No. YJS2022029).

Author information

Authors and Affiliations

Jiangxi Normal University, 99 Ziyang Avenue, Nanchang, 330022, China
Maosheng Zhong, Youde Chen, Hao Zhang, Hao Xiong & Zhixiang Wang

Contributions

YC wrote most of the code and carried out the experiments; the other authors contributed to parts of the code and to the experimental design. MZ and YC wrote the main manuscript text. All authors reviewed the manuscript.

Corresponding author

Correspondence to Youde Chen.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by P. Pala.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhong, M., Chen, Y., Zhang, H. et al. Multimodal-enhanced hierarchical attention network for video captioning. Multimedia Systems 29, 2469–2482 (2023). https://doi.org/10.1007/s00530-023-01130-w

Received: 18 March 2023
Accepted: 26 June 2023
Published: 15 July 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s00530-023-01130-w
