“Show Me the Cup”: Reference with Continuous Representations

  • Marco Baroni
  • Gemma Boleda
  • Sebastian Padó
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10761)


One of the most basic functions of language is to refer to objects in a shared scene. Modeling reference with continuous representations is challenging because it requires individuation, i.e., tracking and distinguishing an arbitrary number of referents. We introduce a neural network model that, given a definite description and a set of objects represented by natural images, points to the intended object if the expression has a unique referent, or indicates failure if it does not. The model, directly trained on reference acts, is competitive with a pipeline manually engineered to perform the same task, both when referents are purely visual and when they are characterized by a combination of visual and linguistic properties.
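The reference mechanism the abstract describes can be pictured as scoring each candidate object's representation against the embedding of the definite description, with a dedicated "failure" outcome for expressions with no unique referent. The following is a minimal illustrative sketch, not the authors' architecture: the function name, the dot-product scoring, and the learned `fail_bias` slot are all assumptions made for illustration.

```python
import numpy as np

def point_or_fail(desc_vec, obj_vecs, fail_bias=0.0):
    """Point to the candidate object best matching a description,
    or signal reference failure.

    desc_vec : (d,) embedding of the definite description
    obj_vecs : (n, d) embeddings of the candidate objects
    fail_bias: hypothetical learned score for the "no unique referent" outcome
    """
    scores = obj_vecs @ desc_vec               # similarity of each object to the description
    all_scores = np.append(scores, fail_bias)  # extra slot encodes reference failure
    probs = np.exp(all_scores - all_scores.max())
    probs /= probs.sum()                       # softmax over objects + failure slot
    best = int(np.argmax(probs))
    return ("fail", probs) if best == len(scores) else (best, probs)
```

A model of this shape can be trained directly on reference acts: the supervision signal is simply which object (or the failure slot) the speaker intended, and the embeddings and bias are learned end to end.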



We are grateful to Elia Bruni for the CNN baseline idea, and to Angeliki Lazaridou for providing us with the visual vectors used in the paper. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 655577 (LOVe) and ERC grant agreement No 715154 (AMORE); ERC 2011 Starting Independent Research Grant n. 283554 (COMPOSES); DFG (SFB 732, Project D10); and Spanish MINECO (grant FFI2013-41301-P). This paper reflects the authors’ view only, and the EU is not responsible for any use that may be made of the information it contains.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Center for Mind/Brain Sciences, University of Trento, Trento, Italy
  2. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart, Germany