
Remote sensing image caption generation via transformer and reinforcement learning

Multimedia Tools and Applications

Abstract

Image captioning is the task of generating a natural-language semantic description of a given image, and it plays an essential role in enabling machines to understand image content. Remote sensing image captioning is a branch of this field. Most current remote sensing image captioning models fail to fully exploit the semantic information in images and suffer from overfitting caused by the small size of the available datasets. To address these issues, we propose a new model that uses a Transformer to decode image features into target sentences. To make the Transformer better suited to the remote sensing image captioning task, we additionally employ dropout layers, residual connections, and adaptive feature fusion within the Transformer. Reinforcement learning is then applied to further improve the quality of the generated sentences. We demonstrate the validity of the proposed model on three remote sensing image captioning datasets. Our model achieves higher scores on all seven evaluation metrics on the Sydney dataset and on the Remote Sensing Image Caption Dataset (RSICD), and on four metrics on the UCM dataset, indicating that the proposed method outperforms previous state-of-the-art models in remote sensing image caption generation.
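To make the architecture and training procedure above more concrete, the following is a minimal PyTorch-style sketch of a Transformer decoder over CNN image features combined with a policy-gradient captioning loss. It is an illustration under assumptions, not the authors' implementation: the `CaptionTransformer` module, the gated feature-fusion step, the hyperparameters, and the use of a self-critical (greedy-baseline) reward are choices made here for clarity; the paper's exact fusion mechanism and reward formulation may differ.

```python
# Minimal sketch (PyTorch), under assumptions: a Transformer decoder over CNN
# image features with dropout, a gated residual feature fusion, and a
# self-critical (greedy-baseline) policy-gradient loss. All names, shapes,
# and hyperparameters are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class CaptionTransformer(nn.Module):
    def __init__(self, vocab_size, feat_dim=2048, d_model=512,
                 nhead=8, num_layers=4, dropout=0.3, max_len=60):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)        # learned positions
        # Project CNN grid features (e.g. ResNet conv maps) into the model space.
        self.feat_proj = nn.Linear(feat_dim, d_model)
        # Hypothetical gated fusion of region features with their mean context,
        # added back through a residual connection.
        self.gate = nn.Linear(2 * d_model, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead,
                                           dim_feedforward=2048, dropout=dropout)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.drop = nn.Dropout(dropout)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, feats, captions):
        # feats: (B, regions, feat_dim); captions: (B, T) token ids
        mem = self.feat_proj(feats)                                  # (B, R, D)
        ctx = mem.mean(dim=1, keepdim=True).expand_as(mem)           # global context
        g = torch.sigmoid(self.gate(torch.cat([mem, ctx], dim=-1)))
        mem = mem + g * ctx                                          # residual fusion
        pos = torch.arange(captions.size(1), device=captions.device)
        tgt = self.drop(self.embed(captions) + self.pos(pos))        # (B, T, D)
        mem, tgt = mem.transpose(0, 1), tgt.transpose(0, 1)          # to (seq, B, D)
        T = tgt.size(0)
        causal = torch.triu(torch.full((T, T), float('-inf'),
                                       device=tgt.device), diagonal=1)
        hid = self.decoder(tgt, mem, tgt_mask=causal)
        return self.out(hid).transpose(0, 1)                         # (B, T, vocab)


def self_critical_loss(sample_logprobs, sample_reward, greedy_reward):
    """REINFORCE with a greedy-decoding baseline (self-critical style).

    sample_logprobs: (B,) summed log-probabilities of sampled captions.
    sample_reward, greedy_reward: (B,) sentence-level scores (e.g. CIDEr).
    """
    advantage = (sample_reward - greedy_reward).detach()
    return -(advantage * sample_logprobs).mean()
```

In practice such a model is typically first trained with cross-entropy on ground-truth captions and then fine-tuned with the policy-gradient loss, using a sentence-level metric such as CIDEr as the reward; the sketch shows only the pieces, not the full training loop.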



Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities, China (2017XKQY082).

Author information

Corresponding author

Correspondence to Bing Liu.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Shen, X., Liu, B., Zhou, Y. et al. Remote sensing image caption generation via transformer and reinforcement learning. Multimed Tools Appl 79, 26661–26682 (2020). https://doi.org/10.1007/s11042-020-09294-7

