Diverse and Coherent Paragraph Generation from Images

  • Moitreya Chatterjee
  • Alexander G. Schwing
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11206)


Paragraph generation from images, which has gained popularity recently, is an important task for video summarization, editing, and support of the disabled. Traditional image captioning methods fall short on this front, since they are not designed to generate long, informative descriptions. Moreover, the vanilla approach of simply concatenating multiple short sentences, possibly synthesized from a classical image captioning system, does not embrace the intricacies of paragraphs: coherent sentences, globally consistent structure, and diversity. To address these challenges, we propose to augment paragraph generation techniques with “coherence vectors” and “global topic vectors,” and to model the inherent ambiguity of associating paragraphs with images via a variational auto-encoder formulation. We demonstrate the effectiveness of the developed approach on two datasets, outperforming existing state-of-the-art techniques on both.


Captioning, Review generation, Variational autoencoders



This material is based upon work supported in part by the National Science Foundation under Grant No. 1718221, Samsung, and 3M. We thank NVIDIA for providing the GPUs used for this research.

Supplementary material

474176_1_En_45_MOESM1_ESM.pdf (PDF, 2503 KB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of Illinois at Urbana-Champaign, Urbana, USA
