M-FFN: multi-scale feature fusion network for image captioning

Abstract

In this work, we present a novel multi-scale feature fusion network (M-FFN) for the image captioning task that incorporates discriminative features and scene contextual information of an image. We construct the network by coupling a spatial transformation network and a multi-scale feature pyramid network through a feature fusion block, enriching both spatial and global semantic information. In particular, the multi-scale feature pyramid network captures global contextual information by applying atrous convolutions to the top layers of a convolutional neural network (CNN), while the spatial transformation network operates on the early layers of the CNN to remove intra-class variability caused by spatial transformations. The feature fusion block then integrates the global contextual information and the spatial features to encode the visual content of an input image. In addition, a spatial-semantic attention module learns attentive contextual features that guide the captioning module. The efficacy of the proposed model is evaluated on the COCO dataset.
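The encoder described above combines three pieces: atrous convolutions over deep CNN features for multi-scale global context, a spatial transformation branch over early CNN features, and a fusion block that merges the two streams before attention. The following is a minimal PyTorch sketch of that arrangement, not the authors' implementation; all module names, channel sizes, dilation rates, and feature shapes are illustrative assumptions.

# Minimal sketch of the encoder idea in the abstract: multi-scale atrous
# convolutions over deep CNN features, a small spatial transformer over
# early features, and a fusion block that concatenates and projects both
# streams. Every name, channel size, and dilation rate here is an
# assumption for illustration, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleContext(nn.Module):
    """Parallel atrous convolutions capture context at several scales."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        feats = [F.relu(b(x)) for b in self.branches]
        return F.relu(self.project(torch.cat(feats, dim=1)))


class SpatialTransform(nn.Module):
    """Tiny spatial transformer: predicts an affine grid and resamples
    early-layer features to reduce spatial (pose/scale) variability."""
    def __init__(self, in_ch):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(in_ch * 16, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialise the localisation head to the identity transform.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)


class FusionBlock(nn.Module):
    """Fuses the global-context and spatial streams into one feature map."""
    def __init__(self, ctx_ch, spa_ch, out_ch):
        super().__init__()
        self.fuse = nn.Conv2d(ctx_ch + spa_ch, out_ch, 1)

    def forward(self, ctx, spa):
        # Bring the spatial stream to the context stream's resolution.
        spa = F.adaptive_avg_pool2d(spa, ctx.shape[-2:])
        return F.relu(self.fuse(torch.cat([ctx, spa], dim=1)))


if __name__ == "__main__":
    early = torch.randn(2, 256, 56, 56)   # early CNN features (assumed shapes)
    deep = torch.randn(2, 2048, 7, 7)     # top-layer CNN features
    ctx = MultiScaleContext(2048, 512)(deep)
    spa = SpatialTransform(256)(early)
    fused = FusionBlock(512, 256, 512)(ctx, spa)
    print(fused.shape)                    # torch.Size([2, 512, 7, 7])

The fused map would then feed an attention-guided caption decoder; that part is omitted here since the abstract does not specify its structure.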

Author information

Corresponding author

Correspondence to Jeripothula Prudviraj.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Multi-view Learning

Guest Editors: Guoqing Chao, Xingquan Zhu, Weiping Ding, Jinbo Bi and Shiliang Sun

About this article

Cite this article

Prudviraj, J., Vishnu, C. & Mohan, C.K. M-FFN: multi-scale feature fusion network for image captioning. Appl Intell 52, 14711–14723 (2022). https://doi.org/10.1007/s10489-022-03463-x
