
Adaptive Multi-attention for Image Sentence Generator Using C-LSTM

  • Conference paper
Proceedings of Seventh International Congress on Information and Communication Technology

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 448)


Abstract

Capturing the features and multi-object regions of an image and translating them into a natural-language sentence is a research problem that spans computer vision and natural language processing. Technically, an attention mechanism forces every word representation to align with a corresponding image region; however, it can mishandle function words such as 'the' in the description text, which misleads the interpretation. Captioning an image involves not only detecting features in various images but also decoding the interactions between the detected objects into a meaningful sentence. The proposed work predicts the image sentence in greater detail for every region/frame of an image: image features are extracted with a CNN, an LSTM generates the sentence, and an adaptive attention mechanism is added to the LSTM layer so that a better image sentence is constructed. The resulting deep network has been analyzed using two output combinations. Experiments were conducted on the Flickr8k dataset. The analysis shows that the model with adaptive attention performs significantly better than the one without it and generates more meaningful captions than any of the individual models. On the test images, the proposed network achieves an accuracy of 81.53% and a BLEU score of 61.94% with adaptive attention in the LSTM, compared with 73.53% and 57.94% without it.
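The authors' implementation is not reproduced on this page. As a rough illustration of the mechanism the abstract describes, the following is a minimal PyTorch sketch of adaptive attention with a visual sentinel, in the style of Lu et al.'s "Knowing When to Look": the decoder attends over CNN region features plus a sentinel vector, so it can fall back on its language state for non-visual words such as 'the'. All class names, dimensions, and tensors below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveAttention(nn.Module):
    """Adaptive attention with a visual sentinel (illustrative sketch).

    At each decoding step the scorer attends over k CNN region features
    plus a sentinel vector derived from the LSTM memory cell. Assumes
    feat_dim == hidden_dim so the visual and sentinel contexts can be
    mixed directly.
    """

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)    # project region features
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)   # project decoder hidden state
        self.sent_proj = nn.Linear(hidden_dim, attn_dim)  # project sentinel vector
        self.score = nn.Linear(attn_dim, 1)               # scalar attention logits

    def forward(self, feats, hidden, sentinel):
        # feats: (B, k, feat_dim); hidden, sentinel: (B, hidden_dim)
        h = self.hid_proj(hidden).unsqueeze(1)                   # (B, 1, attn_dim)
        z_v = self.score(torch.tanh(self.feat_proj(feats) + h))  # (B, k, 1)
        s = self.sent_proj(sentinel).unsqueeze(1)                # (B, 1, attn_dim)
        z_s = self.score(torch.tanh(s + h))                      # (B, 1, 1)
        # Joint softmax over k regions + 1 sentinel slot.
        alpha = F.softmax(torch.cat([z_v, z_s], dim=1), dim=1)   # (B, k+1, 1)
        beta = alpha[:, -1]                                      # (B, 1): sentinel weight
        # Visual weights already sum to 1 - beta, so no renormalization is needed.
        visual_ctx = (alpha[:, :-1] * feats).sum(dim=1)          # (B, feat_dim)
        context = beta * sentinel + visual_ctx                   # adaptive context
        return context, alpha.squeeze(-1)


# Illustrative step: 49 region features (a 7x7 CNN grid), batch of 2.
attn = AdaptiveAttention(feat_dim=512, hidden_dim=512, attn_dim=256)
feats = torch.randn(2, 49, 512)       # CNN encoder output
hidden = torch.randn(2, 512)          # LSTM hidden state h_t
sentinel = torch.randn(2, 512)        # sentinel s_t = sigmoid(gate) * tanh(c_t)
context, alpha = attn(feats, hidden, sentinel)
print(context.shape, alpha.shape)     # torch.Size([2, 512]) torch.Size([2, 50])
```

A sentinel weight near 1 means the decoder is relying on its language model state for the next word rather than on any image region, which is the behavior the abstract motivates for words like 'the'.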



Author information


Corresponding author

Correspondence to K. A. Vidhya.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Vidhya, K.A., Krishnakumar, S., Cynddia, B. (2023). Adaptive Multi-attention for Image Sentence Generator Using C-LSTM. In: Yang, XS., Sherratt, S., Dey, N., Joshi, A. (eds) Proceedings of Seventh International Congress on Information and Communication Technology. Lecture Notes in Networks and Systems, vol 448. Springer, Singapore. https://doi.org/10.1007/978-981-19-1610-6_51


  • DOI: https://doi.org/10.1007/978-981-19-1610-6_51

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-1609-0

  • Online ISBN: 978-981-19-1610-6

  • eBook Packages: Engineering, Engineering (R0)
