Utilizing Visual Attention for Cross-Modal Coreference Interpretation

  • Donna Byron
  • Thomas Mampilly
  • Vinay Sharma
  • Tianfang Xu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3554)


In this paper, we describe an exploratory study to develop a model of visual attention that could aid automatic interpretation of exophors in situated dialog. The model is intended to support the reference resolution needs of embodied conversational agents, such as graphical avatars and robotic collaborators. The model tracks the attentional state of one dialog participant as represented by that participant's visual input stream, taking into account the recency, exposure time, and visual distinctness of each viewed item. The model predicts the correct referent of 52% of referring expressions produced by speakers in human-human dialog while they were collaborating on a task in a virtual world. This accuracy is comparable with reference resolution based on calculating linguistic salience for the same data.
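The abstract's three salience factors can be illustrated with a minimal sketch. The paper does not publish its scoring formula, so the linear weights, the recency decay, and the exposure saturation below are illustrative assumptions, not the authors' model:

```python
from dataclasses import dataclass

@dataclass
class ViewedItem:
    name: str
    last_seen: float    # seconds since the item was last in view (recency)
    exposure: float     # cumulative seconds the item has been in view
    distinctness: float # visual distinctness of the item, in [0, 1]

def salience(item: ViewedItem,
             w_recency: float = 0.5,
             w_exposure: float = 0.3,
             w_distinct: float = 0.2) -> float:
    # Hypothetical linear combination; the paper gives no exact formula.
    recency_score = 1.0 / (1.0 + item.last_seen)      # more recent -> higher
    exposure_score = min(item.exposure / 10.0, 1.0)   # saturates after 10 s in view
    return (w_recency * recency_score
            + w_exposure * exposure_score
            + w_distinct * item.distinctness)

def resolve_exophor(candidates: list[ViewedItem]) -> ViewedItem:
    """Resolve an exophoric reference to the most visually salient candidate."""
    return max(candidates, key=salience)

scene = [
    ViewedItem("red box", last_seen=0.5, exposure=8.0, distinctness=0.9),
    ViewedItem("blue door", last_seen=6.0, exposure=2.0, distinctness=0.4),
]
print(resolve_exophor(scene).name)  # red box
```

Under these assumed weights, a recently and prominently viewed item outranks one seen longer ago, matching the intuition the abstract describes.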


Keywords: Visual Attention · Noun Phrase · Virtual World · Linguistic Context · Visual Salience
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Donna Byron (1)
  • Thomas Mampilly (1)
  • Vinay Sharma (1)
  • Tianfang Xu (1)
  1. Department of Computer Science and Engineering, The Ohio State University, Columbus, USA
