Connecting Vision and Language with Localized Narratives

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12350)


We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
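
The word-level grounding described above can be illustrated with a minimal Python sketch. It assumes that each transcribed word carries start and end timestamps and that each mouse-trace point carries a timestamp on the same clock. The names TimedWord, TracePoint, trace_segments, and bounding_box are hypothetical, chosen for illustration only; they are not taken from the released data format or tooling.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TimedWord:
    """A transcribed word with its spoken interval in seconds (assumed format)."""
    word: str
    start: float
    end: float


@dataclass
class TracePoint:
    """A mouse position with the time it was recorded (assumed format)."""
    x: float
    y: float
    t: float


def trace_segments(
    words: List[TimedWord], trace: List[TracePoint]
) -> List[Tuple[str, List[TracePoint]]]:
    """Pair each word with the trace points recorded while it was spoken.

    Intervals are treated as half-open, [start, end), so a point that falls
    exactly on a word boundary is assigned to the later word only.
    """
    return [
        (w.word, [p for p in trace if w.start <= p.t < w.end])
        for w in words
    ]


def bounding_box(
    points: List[TracePoint],
) -> Optional[Tuple[float, float, float, float]]:
    """Return (x_min, y_min, x_max, y_max) of a segment, or None if it is empty."""
    if not points:
        return None
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return (min(xs), min(ys), max(xs), max(ys))
```

Taking the bounding box of a segment is only one possible way to turn a trace segment into a region; it is used here because it is simple, not because the paper prescribes it.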

Supplementary material

Supplementary material 1: 504441_1_En_38_MOESM1_ESM.pdf (PDF, 11.4 MB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Google Research, Zurich, Switzerland
  2. Google Research, Venice, USA
