
Adversarial Learning for Visual Storytelling with Sense Group Partition


Part of the Lecture Notes in Computer Science book series (LNIP, volume 11364)


Visual storytelling aims to generate a paragraph that describes the content of a photo stream. Despite substantial progress in vision and language research, techniques for sequential vision-to-language generation are still far from perfect. Because of the limitations of maximum likelihood estimation in training, most existing models encourage high resemblance to the texts in the training database, which makes the descriptions overly rigid and lacking in diverse expressions. We therefore cast the task as a reinforcement learning problem and propose an Adversarial All-in-one Learning (AAL) framework that learns a reward model, which simultaneously incorporates the information of all images in the photo stream and all texts in the paragraph, and optimizes a generative model with the estimated reward. Specifically, in light of the linguistic reading theory that takes the sense group as the unit, we propose to generate the paragraph at the sense group level instead of the sentence level. Experiments on the widely used dataset show that our approach generates higher-quality descriptions than previous baselines.


  • Vision and language
  • Sense group
  • Adversarial learning

DOI: 10.1007/978-3-030-20870-7_11
Chapter length: 16 pages


  1. We compute a sense group embedding by summing the embeddings of all words in the sense group.
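
The note above describes a sum over word embeddings. A minimal sketch of that operation follows, assuming word embeddings are available as a lookup table from word to vector. The function name `sense_group_embedding`, the table `toy_table`, and all numeric values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, List


def sense_group_embedding(words: List[str],
                          table: Dict[str, List[float]]) -> List[float]:
    """Return the element-wise sum of the embeddings of the given words."""
    if not words:
        raise ValueError("a sense group must contain at least one word")
    dim = len(table[words[0]])
    total = [0.0] * dim
    for word in words:
        vector = table[word]
        for i in range(dim):
            total[i] += vector[i]
    return total


# Toy 3-dimensional embeddings (made-up values, not from the paper).
toy_table = {
    "we": [1.0, 0.0, 2.0],
    "went": [0.5, 1.5, 0.0],
    "to": [0.0, 0.5, 0.5],
    "the": [0.25, 0.25, 0.25],
    "beach": [2.0, 1.0, 0.0],
}

# One sense group from the sentence "we went to the beach".
group = ["to", "the", "beach"]
embedding = sense_group_embedding(group, toy_table)
print(embedding)  # [2.25, 1.75, 0.75]
```

Because the result is a plain sum, it has the same dimension as a single word embedding regardless of how many words the sense group contains.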




This work is partially supported by the Funds for Creative Research Groups of China (No. 61421061) and the Natural Science Foundation of China (No. 61601046, No. 61602048).

Author information

Corresponding author

Correspondence to Lingbo Mo.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper


Cite this paper

Mo, L., Zhang, C., Ji, Y., Hu, Z. (2019). Adversarial Learning for Visual Storytelling with Sense Group Partition. In: Jawahar, C., Li, H., Mori, G., Schindler, K. (eds) Computer Vision – ACCV 2018. ACCV 2018. Lecture Notes in Computer Science, vol 11364. Springer, Cham.

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-20869-1

  • Online ISBN: 978-3-030-20870-7

  • eBook Packages: Computer Science, Computer Science (R0)