Attention driven reference resolution in multimodal contexts

Abstract

In recent years a number of psycholinguistic experiments have pointed to the interaction between language and vision, in particular to the interaction between visual attention and linguistic reference. In parallel with this, several theories of discourse have attempted to provide an account of the relationship between the types of referential expressions on the one hand and the degree of mental activation of their referents on the other. Building on both of these traditions, this paper describes an attention-based approach to visually situated reference resolution. The framework uses the relationship between referential form and preferred mode of interpretation as the basis for a weighted integration of linguistic and visual attention scores for each entity in the multimodal context. The resulting integrated attention scores are then used to rank the candidate referents during the resolution process, and the highest-scoring candidate is selected as the referent. One advantage of this approach is that resolution occurs within the full multimodal context, in so far as the referent is selected from the full list of objects in the multimodal context. As a result, situations where the intended target of a reference is erroneously excluded due to an individual assumption within the resolution process are avoided. Moreover, the system can recognise situations where attention cues from different modalities make a reference potentially ambiguous.
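
The mechanism the abstract outlines lends itself to a compact illustration. Below is a minimal sketch, in Python, of a weighted-integration ranker of the general kind described: a weight derived from the referential form blends each entity's linguistic and visual attention scores, the candidates are ranked by the integrated score, and a near-tie between the top two candidates is flagged as a potentially ambiguous reference. The weights, scores, entity names, and the ambiguity margin are all hypothetical placeholders, not values or identifiers taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical per-form weights: the share of the integrated score that
# comes from linguistic (discourse) attention; the remainder comes from
# visual attention. Illustrative placeholders only.
FORM_WEIGHTS = {
    "pronoun": 0.8,               # pronouns lean on discourse salience
    "definite_description": 0.4,  # full NPs lean on the visual scene
}

@dataclass
class Entity:
    name: str
    linguistic_attention: float  # normalised to [0, 1]
    visual_attention: float      # normalised to [0, 1]

def resolve_reference(entities, form, ambiguity_margin=0.1):
    """Rank every entity in the multimodal context by a weighted
    integration of its attention scores; return the top candidate and
    a flag marking the reference as potentially ambiguous when the
    runner-up scores within `ambiguity_margin` of the winner."""
    w = FORM_WEIGHTS[form]
    scored = sorted(
        ((w * e.linguistic_attention + (1 - w) * e.visual_attention, e)
         for e in entities),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_score, best = scored[0]
    ambiguous = len(scored) > 1 and best_score - scored[1][0] < ambiguity_margin
    return best, ambiguous

# A pronoun whose discourse salience favours the cube, while visual
# salience favours the ball: the integrated score still picks the cube.
context = [
    Entity("red_cube", linguistic_attention=0.9, visual_attention=0.3),
    Entity("ball", linguistic_attention=0.2, visual_attention=0.8),
]
referent, ambiguous = resolve_reference(context, "pronoun")
print(referent.name, "(ambiguous)" if ambiguous else "(unambiguous)")
```

Because every object in the multimodal context receives a score, the intended target can never be filtered out by an early hard constraint; at worst it is out-ranked, and near-ties surface explicitly as ambiguity.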

Author information

Correspondence to J. D. Kelleher.

About this article

Cite this article

Kelleher, J.D. Attention driven reference resolution in multimodal contexts. Artif Intell Rev 25, 21–35 (2006). https://doi.org/10.1007/s10462-007-9022-9
