Abstract
Image captioning is the task of generating a natural-language description of a given image, and it plays an essential role in enabling machines to understand image content. Remote sensing image captioning is a subfield of this task. Most current remote sensing image captioning models fail to fully exploit the semantic information in images and suffer from overfitting induced by the small size of available datasets. To this end, we propose a new model that uses the Transformer to decode image features into target sentences. To make the Transformer better suited to the remote sensing image captioning task, we additionally employ dropout layers, residual connections, and adaptive feature fusion within the Transformer. Reinforcement learning is then applied to enhance the quality of the generated sentences. We demonstrate the validity of the proposed model on three remote sensing image captioning datasets. Our model obtains higher scores on all seven metrics on the Sydney dataset and the Remote Sensing Image Caption Dataset (RSICD), and on four metrics on the UCM dataset, which indicates that the proposed methods outperform previous state-of-the-art models in remote sensing image caption generation.
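The reinforcement-learning stage described above typically optimizes a non-differentiable caption metric (such as CIDEr) with a policy gradient whose baseline is the model's own greedy-decoded caption, following the self-critical sequence training idea. The sketch below is a minimal, hypothetical illustration of that reward computation only; the function name and score values are assumptions for the example, not the paper's implementation.

```python
# Hypothetical sketch of the self-critical advantage used when fine-tuning
# a captioning model with reinforcement learning: each sampled caption's
# metric score (e.g. CIDEr) is compared against the score of the greedy
# (test-time) caption, which serves as a variance-reducing baseline.

def self_critical_advantages(sampled_scores, greedy_scores):
    """Return per-sample advantages r(w_sampled) - r(w_greedy).

    sampled_scores: metric scores of captions sampled from the model.
    greedy_scores:  metric scores of the greedy baseline captions.
    """
    return [s - g for s, g in zip(sampled_scores, greedy_scores)]

# Samples scoring above the greedy baseline get a positive advantage
# (their tokens are reinforced); those below get a negative one.
advantages = self_critical_advantages([0.75, 0.25, 0.5], [0.5, 0.5, 0.5])
print(advantages)  # [0.25, -0.25, 0.0]
```

In training, each advantage would weight the negative log-probability of its sampled caption, so no gradient flows through the metric itself.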
Acknowledgments
This work was supported by Fundamental Research Funds for the Central Universities, China (2017XKQY082).
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Shen, X., Liu, B., Zhou, Y. et al. Remote sensing image caption generation via transformer and reinforcement learning. Multimed Tools Appl 79, 26661–26682 (2020). https://doi.org/10.1007/s11042-020-09294-7