Abstract
The semantic information of objects in an image is vital for image captioning. Although some methods exploit semantic information, the alignment between each specific semantic feature and its corresponding visual feature has not been explored. In this paper, we propose a novel Multi-Feature Fusion enhanced Transformer (MFF-Transformer) for image captioning, which realizes multi-feature fusion by aligning specific semantic features with their corresponding visual features to obtain a better visual feature representation. Within the MFF-Transformer, a novel Interacted Multi-Feature REpresentation (IMFRE) module fuses the visual features with the semantic features through adaptive pooling, capturing both global and local multi-feature information to enhance the visual features. In addition, a new Multi-Layer Features Fusion (MLFF) module aggregates the decoding information of all decoder layers, exploiting the hierarchical context of the MFF-Transformer. Experiments on the MSCOCO dataset show that the proposed MFF-Transformer achieves strong performance and outperforms other state-of-the-art methods.
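To make the two modules concrete, the following is a minimal PyTorch sketch of the ideas the abstract describes: semantic features condensed by adaptive pooling and fused into the visual features, plus a learned fusion over all decoder layers' outputs. All names, dimensions, and fusion details here (`IMFREBlock`, `MLFFHead`, the number of pooled semantic summaries, the softmax layer-mixture) are our own illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class IMFREBlock(nn.Module):
    """Fuse region-level visual features with semantic (attribute/word) features.

    Adaptive pooling condenses the variable-length semantic features into a
    fixed number of global/local summaries, which the visual features then
    attend over; the result is added back to enhance the visual features.
    """

    def __init__(self, d_model=512, n_heads=8, n_pooled=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(n_pooled)  # fixed-size semantic summaries
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual, semantic):
        # visual: (B, N_regions, d), semantic: (B, N_words, d)
        pooled = self.pool(semantic.transpose(1, 2)).transpose(1, 2)  # (B, n_pooled, d)
        fused, _ = self.cross_attn(query=visual, key=pooled, value=pooled)
        return self.norm(visual + fused)  # enhanced visual features


class MLFFHead(nn.Module):
    """Weighted fusion of every decoder layer's hidden states before the vocab head."""

    def __init__(self, n_layers, d_model=512, vocab_size=10000):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))  # learned mixture
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, layer_states):
        # layer_states: list of (B, T, d) tensors, one per decoder layer
        stacked = torch.stack(layer_states, dim=0)           # (L, B, T, d)
        w = torch.softmax(self.layer_weights, dim=0)         # (L,)
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, T, d)
        return self.proj(fused)                              # per-token logits


if __name__ == "__main__":
    B, N, T, d = 2, 36, 12, 512
    imfre = IMFREBlock(d_model=d)
    mlff = MLFFHead(n_layers=6, d_model=d, vocab_size=9487)
    enhanced = imfre(torch.randn(B, N, d), torch.randn(B, 10, d))      # (2, 36, 512)
    logits = mlff([torch.randn(B, T, d) for _ in range(6)])            # (2, 12, 9487)
```

The sketch follows the standard pattern of cross-attention fusion with a residual connection; the paper's actual IMFRE and MLFF modules may differ in how the pooled summaries and layer outputs are combined.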
Acknowledgments
This study was funded by the Natural Science Foundation of Shanghai under the project "Research on image sentiment analysis and expression based on human vision and cognitive psychology" (grant number 22ZR1418400).
About this article
Cite this article
Zhang, J., Fang, Z. & Wang, Z. Multi-feature fusion enhanced transformer with multi-layer fused decoding for image captioning. Appl Intell 53, 13398–13414 (2023). https://doi.org/10.1007/s10489-022-04202-y