
Novel model to integrate word embeddings and syntactic trees for automatic caption generation from images

  • Hongbin Zhang
  • Diedie Qiu
  • Renzhong Wu
  • Donghong Ji
  • Guangli Li
  • Zhenyu Niu
  • Tao Li
Methodologies and Application

Abstract

Automatic caption generation from images is an active and mainstream research direction in machine learning. It enables the construction of computer models that can interpret the implicit semantic information of images. However, current research faces significant challenges, including extracting robust image features, suppressing noisy words, and improving caption coherence. For the first problem, a novel computer vision system is presented that creates a new image feature called MK–KDES-1 (MK–KDES: Multiple Kernel–Kernel Descriptors) by extracting three KDES features and fusing them with an MKL (Multiple Kernel Learning) model. The MK–KDES-1 feature captures both the textural and shape characteristics of images, which helps improve the BLEU_1 (BLEU: Bilingual Evaluation Understudy) scores of captions. For the second problem, an effective, newly designed two-layer TR (Tag Refinement) strategy is integrated into our NLG (Natural Language Generation) algorithm: the words most semantically relevant to an image are summarized to generate N-gram phrases, while noisy words are suppressed by the TR strategy. For the last problem, on the one hand, a popular WE (Word Embeddings) model and a novel metric called PDI (Positive Distance Information) are introduced together to generate N-gram phrases, and the phrases are evaluated by the AWSC (Accumulated Word Semantic Correlation) metric; on the other hand, the phrases are fused into captions using STs (Syntactic Trees). Experimental results demonstrate that informative captions with high BLEU_3 scores can be obtained to describe images.
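The pipeline outlined above depends on judging how semantically coherent the words of a candidate N-gram phrase are, using pre-trained word embeddings (for example, vectors from a word2vec- or GloVe-style model). The sketch below is a minimal, illustrative stand-in for that idea: it scores a phrase by accumulating pairwise cosine similarities between its word vectors. The function names, the toy embedding vectors, and the pairwise-cosine formula are assumptions made for illustration; they are not the paper's actual PDI or AWSC definitions.

    import numpy as np

    def cosine_similarity(u, v):
        # Cosine similarity between two embedding vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def accumulated_semantic_correlation(phrase, embeddings):
        # Accumulate pairwise cosine similarities over the words of an N-gram
        # phrase; an illustrative stand-in for a word-embedding-based phrase
        # coherence score, not the paper's exact AWSC metric.
        words = [w for w in phrase if w in embeddings]
        score = 0.0
        for i in range(len(words)):
            for j in range(i + 1, len(words)):
                score += cosine_similarity(embeddings[words[i]], embeddings[words[j]])
        return score

    # Toy usage with hypothetical 4-dimensional embeddings; a real system would
    # load vectors from a pre-trained word embedding model.
    toy_embeddings = {
        "black": np.array([0.9, 0.1, 0.0, 0.2]),
        "leather": np.array([0.8, 0.2, 0.1, 0.3]),
        "handbag": np.array([0.7, 0.3, 0.2, 0.1]),
    }
    print(accumulated_semantic_correlation(["black", "leather", "handbag"], toy_embeddings))

A phrase whose words lie close together in embedding space receives a higher score, which reflects the intuition behind favoring N-gram phrases whose words are semantically consistent with one another and with the refined image tags.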

Keywords

Caption generation from images · Word embeddings · Syntactic trees · Kernel descriptors · Tag refinement · Natural language generation · N-gram phrases · BLEU

Acknowledgements

I would like to express my warmest gratitude to Yi Yin, my first graduate student, for her valuable work on the writing of the original manuscript. Our work is supported by the National Natural Science Foundation of China under Grant Nos. 61762038, 61741108 and 61861016, the Humanity and Social Science Foundation of the Ministry of Education under Grant Nos. 17YJAZH117 and 16YJAZH029, the Natural Science Foundation of Jiangxi under Grant No. 20171BAB202023, the Key Research and Development Plan of Jiangxi Provincial Science and Technology Department under Grant No. 20171BBG70093, the Humanity and Social Science Foundation of Jiangxi Province under Grant No. 16TQ02, and the Humanity and Social Science Foundation of Jiangxi University under Grant Nos. TQ1503 and XW1502.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.

Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. Software School, East China Jiaotong University, Nanchang, China
  2. Computer School, Wuhan University, Wuhan, China
  3. School of Information Engineering, East China Jiaotong University, Nanchang, China
  4. Baidu Company, Beijing, China
  5. School of Computing and Information Science, Florida International University, Miami, USA
