Attention driven reference resolution in multimodal contexts

Kelleher, J. D.

doi:10.1007/s10462-007-9022-9

Published: 25 August 2007

Volume 25, pages 21–35, (2006)

Artificial Intelligence Review

J. D. Kelleher¹

Abstract

In recent years a number of psycholinguistic experiments have pointed to the interaction between language and vision, in particular the interaction between visual attention and linguistic reference. In parallel with this, several theories of discourse have attempted to account for the relationship between types of referential expression on the one hand and degree of mental activation on the other. Building on both of these traditions, this paper describes an attention-based approach to visually situated reference resolution. The framework uses the relationship between referential form and preferred mode of interpretation as the basis for a weighted integration of linguistic and visual attention scores for each entity in the multimodal context. The resulting integrated attention scores are then used to rank the candidate referents during the resolution process, and the highest-scoring candidate is selected as the referent. One advantage of this approach is that resolution occurs within the full multimodal context, insofar as the referent is selected from a full list of the objects in that context. As a result, situations where the intended target of the reference is erroneously excluded because of an individual assumption within the resolution process are avoided. Moreover, the system can recognise situations where attention cues from different modalities make a reference potentially ambiguous.
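The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the referential forms listed, the weight values in `FORM_WEIGHTS`, the `margin` threshold used to flag ambiguity, and the example scores are all hypothetical and chosen only to show the idea of a form-dependent weighted sum over every entity in the context.

```python
from dataclasses import dataclass

# Hypothetical weights (linguistic, visual) per referential form.
# The values are illustrative assumptions, not taken from the paper.
FORM_WEIGHTS = {
    "pronoun": (0.8, 0.2),
    "definite": (0.5, 0.5),
    "demonstrative": (0.2, 0.8),
}


@dataclass(frozen=True)
class Entity:
    name: str
    linguistic: float  # linguistic attention score, assumed to lie in [0, 1]
    visual: float      # visual attention score, assumed to lie in [0, 1]


def integrate(entity, form):
    """Weighted sum of the two attention scores for the given form."""
    w_ling, w_vis = FORM_WEIGHTS[form]
    return w_ling * entity.linguistic + w_vis * entity.visual


def resolve(entities, form, margin=0.05):
    """Rank every entity in the context; return (best, ambiguous, ranking).

    No candidate is filtered out beforehand, so the intended referent
    cannot be excluded by an earlier assumption. The reference is flagged
    as ambiguous when the top two integrated scores differ by less than
    `margin` (a hypothetical threshold).
    """
    ranking = sorted(entities, key=lambda e: integrate(e, form), reverse=True)
    if not ranking:
        return None, False, []
    ambiguous = (
        len(ranking) > 1
        and integrate(ranking[0], form) - integrate(ranking[1], form) < margin
    )
    return ranking[0], ambiguous, ranking


# Made-up multimodal context: one recently mentioned object, one visually
# prominent object, and one background object.
context = [
    Entity("red_box", linguistic=0.9, visual=0.25),
    Entity("blue_box", linguistic=0.1, visual=0.75),
    Entity("green_ball", linguistic=0.0, visual=0.25),
]

best, ambiguous, _ = resolve(context, "pronoun")
print(best.name, ambiguous)
```

Under these assumed weights, a pronoun favours the recently mentioned object while a demonstrative favours the visually prominent one, and widening `margin` makes closely scored candidates register as potentially ambiguous.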

Author information

Authors and Affiliations

School of Computing, Dublin Institute of Technology, Dublin 8, Ireland
J. D. Kelleher

Corresponding author

Correspondence to J. D. Kelleher.

About this article

Cite this article

Kelleher, J.D. Attention driven reference resolution in multimodal contexts. Artif Intell Rev 25, 21–35 (2006). https://doi.org/10.1007/s10462-007-9022-9

Issue Date: April 2006
DOI: https://doi.org/10.1007/s10462-007-9022-9