Abstract
In recent years, a number of psycholinguistic experiments have pointed to an interaction between language and vision and, in particular, between visual attention and linguistic reference. In parallel, several theories of discourse have attempted to account for the relationship between types of referential expression on the one hand and degrees of mental activation on the other. Building on both of these traditions, this paper describes an attention-based approach to visually situated reference resolution. The framework uses the relationship between referential form and preferred mode of interpretation as the basis for a weighted integration of linguistic and visual attention scores for each entity in the multimodal context. The resulting integrated attention scores are then used to rank the candidate referents during resolution, and the highest-scoring candidate is selected as the referent. One advantage of this approach is that resolution occurs within the full multimodal context, insofar as the referent is selected from a full list of the objects in that context. As a result, situations in which the intended target of the reference is erroneously excluded because of an individual assumption within the resolution process are avoided. Moreover, the system can recognise situations in which attention cues from different modalities make a reference potentially ambiguous.
Cite this article
Kelleher, J.D. Attention driven reference resolution in multimodal contexts. Artif Intell Rev 25, 21–35 (2006). https://doi.org/10.1007/s10462-007-9022-9
DOI: https://doi.org/10.1007/s10462-007-9022-9