Auto-Illustrating Poems and Songs with Style

  • Katharina Schwarz
  • Tamara L. Berg
  • Hendrik P. A. Lensch
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10114)


We develop an optimization-based framework to automatically illustrate poems and songs. Our method produces illustrations that are both semantically relevant and visually coherent while matching a particular user-selected visual style. We demonstrate our method on a selection of 200 popular poems and songs collected from the internet, operating on around 14M Flickr images. A user study evaluates variations of our optimization procedure. Finally, we present two applications: identifying textual style and automatic music video generation.


Text Line, Music Video, Candidate Image, Amazon Mechanical Turk, Song Lyric
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work originated from a research stay of Katharina Schwarz at the University of North Carolina (UNC).

Supplementary material

Supplementary material 1: 416263_1_En_6_MOESM1_ESM.pdf (PDF, 6453 KB)



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Katharina Schwarz (1)
  • Tamara L. Berg (2)
  • Hendrik P. A. Lensch (1)
  1. University of Tübingen, Tübingen, Germany
  2. University of North Carolina, Chapel Hill, USA
