MRRC: multiple role representation crossover interpretation for image captioning with R-CNN feature distribution composition (FDC)

Published in Multimedia Tools and Applications

Abstract

Image captioning by machines requires structured learning and a basis for interpretation, and improving it requires understanding and combining multiple contexts in a meaningful way. This research introduces a novel concept for context combination and benefits many applications that treat visual features as equivalents of descriptions of objects, activities, and events. Our architecture has three components: the Feature Distribution Composition (FDC) Layer Attention, the Multiple Role Representation Crossover (MRRC) Attention Layer, and the Language Decoder. The FDC Layer Attention generates weighted attention from R-CNN features; the MRRC Attention Layer performs intermediate representation processing and generates the attention for the next word; and the Language Decoder estimates the likelihood of the next probable word in the sentence. We demonstrate the effectiveness of FDC, MRRC, regional object feature attention, and reinforcement learning for learning to generate better captions from images. Our model improved on previous performance, reaching 35.3%, and establishes a new standard and theory for representation generation based on logic, better interpretability, and context.
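As a rough illustration of how these three components fit together, the snippet below wires them for a single decoding step. This is a minimal sketch only: the module names (FDCAttention, MRRCAttention, LanguageDecoder), layer sizes, and the way the attended feature is fused with the previous word embedding are assumptions made for illustration, not the paper's exact formulation.

```python
# Minimal PyTorch sketch of the three components described in the abstract,
# for a single decoding step. All names, dimensions, and fusion details are
# illustrative assumptions, not the paper's equations.
import torch
import torch.nn as nn


class FDCAttention(nn.Module):
    """Produces weighted attention over R-CNN region features (FDC layer)."""
    def __init__(self, feat_dim=2048, hidden_dim=512, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, state):
        # regions: (batch, num_regions, feat_dim); state: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(regions)
                                  + self.state_proj(state).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # attention weights per region
        return (alpha * regions).sum(dim=1)      # attended visual feature


class MRRCAttention(nn.Module):
    """Intermediate representation that crosses the attended visual feature
    with the previous word embedding to drive next-word attention."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.mix = nn.Linear(feat_dim + embed_dim, hidden_dim)

    def forward(self, attended_feat, prev_word_emb):
        return torch.tanh(self.mix(torch.cat([attended_feat, prev_word_emb], dim=-1)))


class LanguageDecoder(nn.Module):
    """LSTM decoder that estimates the likelihood of the next word."""
    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, mrrc_vec, h, c):
        h, c = self.lstm(mrrc_vec, (h, c))
        return torch.log_softmax(self.out(h), dim=-1), h, c


# One decoding step on toy shapes: batch of 2 images, 36 R-CNN regions each.
regions = torch.randn(2, 36, 2048)
h = c = torch.zeros(2, 512)
prev_word_emb = torch.randn(2, 512)        # embedding of the previous word
fdc, mrrc, decoder = FDCAttention(), MRRCAttention(), LanguageDecoder()
attended = fdc(regions, h)
log_probs, h, c = decoder.step(mrrc(attended, prev_word_emb), h, c)
```

The actual model also employs reinforcement learning for sequence-level training, as stated in the abstract; that objective is omitted from this sketch.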


Acknowledgment

The author made extensive use of the University of Florida HiperGator, equipped with an NVIDIA Tesla K80 GPU, for the experiments. The author acknowledges University of Florida Research Computing for providing the computational resources and support that contributed to the research results reported in this publication. URL: http://www.researchcomputing.ufl.edu

Author information

Corresponding author

Correspondence to Chiranjib Sur.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sur, C. MRRC: multiple role representation crossover interpretation for image captioning with R-CNN feature distribution composition (FDC). Multimed Tools Appl 80, 18413–18443 (2021). https://doi.org/10.1007/s11042-021-10578-9

