
Remote sensing image caption generation via transformer and reinforcement learning

Multimedia Tools and Applications

Abstract

Image captioning is the task of generating a natural-language semantic description of a given image, and it plays an essential role in enabling machines to understand image content. Remote sensing image captioning is a branch of this field. Most current remote sensing image captioning models fail to fully exploit the semantic information in images and suffer from overfitting caused by the small size of the available datasets. To address these issues, we propose a new model that uses a Transformer to decode image features into target sentences. To make the Transformer better suited to the remote sensing image captioning task, we additionally employ dropout layers, residual connections, and adaptive feature fusion within the Transformer. Reinforcement learning is then applied to further improve the quality of the generated sentences. We demonstrate the validity of the proposed model on three remote sensing image captioning datasets. Our model achieves higher scores on all seven evaluation metrics on the Sydney dataset and on the Remote Sensing Image Caption Dataset (RSICD), and on four metrics on the UCM dataset, indicating that the proposed method outperforms previous state-of-the-art models in remote sensing image caption generation.
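To make the architecture and training procedure above more concrete, the following is a minimal PyTorch-style sketch of a Transformer decoder over CNN image features combined with a policy-gradient captioning loss. It is an illustration under assumptions, not the authors' implementation: the `CaptionTransformer` module, the gated feature-fusion step, the hyperparameters, and the use of a self-critical (greedy-baseline) reward are choices made here for clarity; the paper's exact fusion mechanism and reward formulation may differ.

```python
# Minimal sketch (PyTorch), under assumptions: a Transformer decoder over CNN
# image features with dropout, a gated residual feature fusion, and a
# self-critical (greedy-baseline) policy-gradient loss. All names, shapes,
# and hyperparameters are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, d_model=512,
                 nhead=8, num_layers=4, dropout=0.3, max_len=60):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)        # learned positions
        # Project CNN grid features (e.g. ResNet conv maps) into the model space.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        # Hypothetical gated fusion of region features with their mean context,
        # added back through a residual connection.
        self.gate = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead,
                                           dim_feedforward=2048, dropout=dropout)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, regions, feat_dim); captions: (B, T) token ids
        mem = self.feat_proj(feats)                                  # (B, R, D)
        ctx = mem.mean(dim=1, keepdim=True).expand_as(mem)           # global context
        g = torch.sigmoid(self.gate(torch.cat([mem, ctx], dim=-1)))
        mem = mem + g * ctx                                          # residual fusion
        pos = torch.arange(captions.size(1), device=captions.device)
        tgt = self.drop(self.embed(captions) + self.pos(pos))        # (B, T, D)
        mem, tgt = mem.transpose(0, 1), tgt.transpose(0, 1)          # to (seq, B, D)
        T = tgt.size(0)
        causal = torch.triu(torch.full((T, T), float('-inf'),
                                       device=tgt.device), diagonal=1)
        hid = self.decoder(tgt, mem, tgt_mask=causal)
        return self.out(hid).transpose(0, 1)                         # (B, T, vocab)


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE with a greedy-decoding baseline (self-critical style).

    sample_logprobs: (B,) summed log-probabilities of sampled captions.
    sample_reward, greedy_reward: (B,) sentence-level scores (e.g. CIDEr).
    """
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * sample_logprobs).mean()
```

In practice such a model is typically first trained with cross-entropy on ground-truth captions and then fine-tuned with the policy-gradient loss, using a sentence-level metric such as CIDEr as the reward; the sketch shows only the pieces, not the full training loop.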



Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities, China (2017XKQY082).

Author information

Corresponding author

Correspondence to Bing Liu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Shen, X., Liu, B., Zhou, Y. et al. Remote sensing image caption generation via transformer and reinforcement learning. Multimed Tools Appl 79, 26661–26682 (2020). https://doi.org/10.1007/s11042-020-09294-7

