SACIC: A Semantics-Aware Convolutional Image Captioner Using Multi-level Pervasive Attention

Parameswaran, Sandeep Narayan; Das, Sukhendu

doi:10.1007/978-3-030-36718-3_6

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11955)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Attention mechanisms, together with encoder-decoder architectures, have become integral to solving the image captioning problem. To generate the caption sequence, the attention mechanism recombines an encoding of the image depending on the state of the decoder, which is predominantly recurrent in nature. In contrast, we propose a novel network whose attention-like properties are pervasive through its layers: it uses a convolutional neural network (CNN) to refine and combine representations at multiple levels of the architecture for captioning images. We also enable the model to use explicit higher-level semantic information obtained by performing panoptic segmentation on the image. The attention capability of the model is demonstrated visually, and an experimental evaluation is presented on the MS-COCO dataset. We show that the approach is more robust and efficient, and yields better performance, than state-of-the-art architectures for image captioning.
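
As an illustration of the conventional attention step the abstract contrasts against, the following is a minimal sketch of dot-product soft attention, in which a decoder state re-weights a set of image region encodings into a single context vector. It is not the SACIC architecture, which replaces this scheme with convolutional layers; the function name `soft_attention`, the dot-product scoring, and the toy vectors are all assumptions made for illustration.

```python
import math

def soft_attention(regions, state):
    """Recombine image region encodings into one context vector.

    regions: list of region feature vectors (lists of floats, equal length)
    state:   decoder state vector of the same length (assumed here so that
             a plain dot product can serve as the relevance score)
    """
    # Relevance score of each region for the current decoder state.
    scores = [sum(r_i * s_i for r_i, s_i in zip(r, state)) for r in regions]
    # Softmax over the scores, shifted by the maximum for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region encodings.
    dim = len(regions[0])
    context = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    return weights, context

# Toy example (hypothetical values): three regions with 2-d encodings.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = soft_attention(regions, state)
```

In a recurrent captioner, the decoder state changes at every generated word, so the weights and the context vector are recomputed at each step.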

Author information

Authors and Affiliations

Visualization and Perception Lab, Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India
Sandeep Narayan Parameswaran & Sukhendu Das

Authors

Sandeep Narayan Parameswaran
Sukhendu Das

Corresponding author

Correspondence to Sandeep Narayan Parameswaran.

Editor information

Editors and Affiliations

Australian National University, Canberra, ACT, Australia
Tom Gedeon
Murdoch University, Murdoch, WA, Australia
Kok Wai Wong
Kyungpook National University, Daegu, Korea (Republic of)
Minho Lee

About this paper

Cite this paper

Parameswaran, S.N., Das, S. (2019). SACIC: A Semantics-Aware Convolutional Image Captioner Using Multi-level Pervasive Attention. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Lecture Notes in Computer Science, vol 11955. Springer, Cham. https://doi.org/10.1007/978-3-030-36718-3_6

DOI: https://doi.org/10.1007/978-3-030-36718-3_6
Published: 09 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36717-6
Online ISBN: 978-3-030-36718-3
eBook Packages: Computer Science, Computer Science (R0)
