
Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12359)

Abstract

Change captioning aims to describe the differences between two images in natural language. Most existing methods treat the problem as pure difference judgment and ignore distractors such as viewpoint changes. In practice, however, viewpoint changes occur frequently and can overwhelm the semantic difference to be described. In this paper, we propose a novel visual encoder that explicitly distinguishes viewpoint changes from semantic changes in the change captioning task. We further simulate the attention preference of humans and propose a novel reinforcement learning process that fine-tunes the attention directly with language evaluation rewards. Extensive experiments show that our method outperforms state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets.
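The encoder idea above can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch illustration of matching features across two viewpoints before differencing, so that viewpoint shifts are factored out and only the semantic change remains. The class name, the single cross-attention layer, and the tensor shapes are all assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ViewpointMatchingEncoderSketch(nn.Module):
    """Illustrative only: soft-match the 'after' features to the 'before'
    view with cross-attention, then difference the aligned features.
    All names and shapes here are hypothetical, not the paper's model."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, feat_before: torch.Tensor, feat_after: torch.Tensor):
        # feat_*: (B, N, D) -- N spatial positions of D-dim CNN features.
        q = self.query(feat_before)                        # (B, N, D)
        k = self.key(feat_after)                           # (B, N, D)
        scores = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)               # (B, N, N)
        # Re-render the 'after' features in the 'before' viewpoint.
        aligned_after = attn @ feat_after                  # (B, N, D)
        # The residual after alignment approximates the semantic change,
        # with the viewpoint change largely matched away.
        return aligned_after - feat_before, attn
```

The reinforcement-learning component can be read as a self-critical-style recipe: sample a caption from the decoder, score it with a language metric such as CIDEr, subtract the score of a greedy-decoded baseline, and weight the sampled caption's log-probabilities by that advantage, so the attention is tuned directly toward the evaluation reward.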

Keywords

Image captioning · Change captioning · Attention · Reinforcement learning

Notes

Acknowledgement

This research is partially supported by the MOE Tier-1 research grant RG28/18 (S) and the Monash University FIT Start-up Grant.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Nanyang Technological University, Singapore, Singapore
  2. Adobe Research, College Park, USA
  3. Monash University, Clayton, Australia
