Abstract
Image captioning is the challenging task of producing syntactically and semantically correct natural-language descriptions of an image, with context related to the image. Existing notable research in Bengali Image Captioning (BIC) is based on the encoder-decoder architecture. This paper presents an end-to-end image captioning system that uses a multimodal architecture, combining a one-dimensional convolutional neural network (CNN) that encodes sequence information with a pre-trained ResNet-50 image encoder that extracts region-based visual features. We evaluate our approach on the BanglaLekhaImageCaptions dataset using existing evaluation metrics and conduct a human evaluation for qualitative analysis. Experiments show that the language encoder captures fine-grained information in the caption and, combined with the image features, generates accurate and diverse captions. Our work outperforms all existing BIC works and achieves a new state-of-the-art (SOTA) performance, scoring 0.651 on BLEU-1, 0.572 on CIDEr, 0.297 on METEOR, 0.434 on ROUGE, and 0.357 on SPICE.
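To make the first of the reported metrics concrete, the following is a minimal sketch of sentence-level BLEU-1 (clipped unigram precision multiplied by a brevity penalty). It is not the evaluation code behind the scores above; the function name `bleu1` and the example captions are hypothetical, and corpus-level toolkits aggregate counts over the whole test set rather than scoring one sentence at a time.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1 for one tokenized candidate and its references.

    Illustrative only: `bleu1` is a hypothetical helper, not a library API.
    """
    if not candidate:
        return 0.0
    cand_counts = Counter(candidate)
    # For each word, keep the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    # Clip each candidate word count by its maximum reference count.
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty uses the reference length closest to the candidate
    # length (ties go to the shorter reference).
    cand_len = len(candidate)
    ref_len = min((len(ref) for ref in references),
                  key=lambda n: (abs(n - cand_len), n))
    penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return penalty * precision


# Hypothetical English stand-ins for tokenized Bengali captions.
candidate = "a boy is playing football".split()
references = [
    "a boy plays football in the field".split(),
    "a boy is playing in a field".split(),
]
print(round(bleu1(candidate, references), 3))
```

In this example every candidate word appears in some reference, so the precision is 1 and the score is determined entirely by the brevity penalty for the shorter candidate.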
Notes
- 1. https://github.com/FaiyazKhan11/Improved-Bengali-Image-Captioning-via-deep-convolutional-neural-network-based-encoder-decoder-model.
- 2. https://github.com/salaniz/pycocoevalcap.
- 3. https://github.com.
Acknowledgements
We thank the Natural Language Processing Group, Dept. of CSE, SUST, for their valuable guidance in our research work.
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Faiyaz Khan, M., Sadiq-Ur-Rahman, S.M., Saiful Islam, M. (2021). Improved Bengali Image Captioning via Deep Convolutional Neural Network Based Encoder-Decoder Model. In: Uddin, M.S., Bansal, J.C. (eds) Proceedings of International Joint Conference on Advances in Computational Intelligence. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-0586-4_18
DOI: https://doi.org/10.1007/978-981-16-0586-4_18
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-0585-7
Online ISBN: 978-981-16-0586-4
eBook Packages: Intelligent Technologies and Robotics (R0)