Abstract
Situated language production requires the integration of visual attention and linguistic processing. Previous work has not conclusively disentangled the role of perceptual scene information and structural sentence information in guiding visual attention. In this paper, we present an eye-tracking study that demonstrates that three types of guidance, perceptual, conceptual, and structural, interact to control visual attention. In a cued language production experiment, we manipulate perceptual (scene clutter) and conceptual guidance (cue animacy) and measure structural guidance (syntactic complexity of the utterance). Analysis of the time course of language production, before and during speech, reveals that all three forms of guidance affect the complexity of visual responses, quantified in terms of the entropy of attentional landscapes and the turbulence of scan patterns, especially during speech. We find that perceptual and conceptual guidance mediate the distribution of attention in the scene, whereas structural guidance closely relates to scan pattern complexity. Furthermore, the eye–voice spans of the cued object and of its perceptual competitor are similar, with their latencies mediated by both perceptual and structural guidance. These results rule out a strict interpretation of structural guidance as the single dominant form of visual guidance in situated language production. Rather, the phase of the task and the associated demands of cross-modal cognitive processing determine the mechanisms that guide attention.
Notes
Note that we use small caps to denote visual referents and italics to denote linguistic referents.
We include the Primary/Secondary distinction as a random variable in all mixed models concerning latencies, and in all four cases for the analysis of fixation distribution and scan pattern complexity.
We use -l and -r to denote the leftmost and rightmost object in case of ambiguous objects.
The logarithm of \(\phi (x)\) is taken to avoid large numbers as \(\phi (x)\) increases exponentially with increasing x.
References
Allopenna P, Magnuson J, Tanenhaus M (1998) Tracking the time course of spoken word recognition: evidence for continuous mapping models. J Mem Lang 38:419–439
Altmann G, Kamide Y (1999) Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition 73:247–264
Andersson R, Ferreira F, Henderson J (2011) I see what you are saying: the integration of complex speech and scenes during language comprehension. Acta Psychol 137:208–216
Arnold J, Griffin Z (2007) The effect of additional characters on choice of referring expression: everything counts. J Mem Lang 56:521–536
Baayen R, Davidson D, Bates D (2008) Mixed-effects modeling with crossed random effects for subjects and items. J Mem Lang 59:390–412
Barr DJ, Levy R, Scheepers C, Tily HJ (2013) Random-effects structure for confirmatory hypothesis testing: keep it maximal. J Mem Lang 68(3):255–278
Bock K, Irwin D, Davidson D, Levelt W (2003) Minding the clock. J Mem Lang 48(4):653–685
Branigan H, Pickering M, Tanaka M (2008) Contribution of animacy to grammatical function assignment and word order during production. Lingua 118(2):172–189
Brown-Schmidt S, Tanenhaus M (2006) Watching the eyes when talking about size: an investigation of message formulation and utterance planning. J Mem Lang 54:592–609
Castelhano M, Mack M, Henderson J (2009) Viewing task influences eye-movement control during active scene perception. J Vis 9:1–15
Coco MI, Keller F (2009) The impact of visual information on referent assignment in sentence production. In: Taatgen NA, van Rijn H (eds) Proceedings of the 31st annual conference of the cognitive science society, Amsterdam
Coco MI, Keller F (2012) Scan patterns predict sentence production in the cross-modal processing of visual scenes. Cogn Sci 36(7):1204–1223
Dahan D, Tanenhaus M (2005) Looking at the rope when looking for the snake: conceptually mediated eye movements during spoken-word recognition. Psychol Bull Rev 12:455–459
Daumé III H, Marcu D (2005) Learning as search optimization: approximate large margin methods for structured prediction. In: International conference on machine learning (ICML)
Elazary L, Itti L (2008) Interesting objects are visually salient. J Vis 8(14:18):1–15
Elzinga C, Liefbroer A (2007) Destandardization of the life course: a cross-national comparison using sequence analysis. Eur J Popul 23(3–4):225–250
Ferreira V, Slevc L, Rogers E (2007) How do speakers avoid ambiguous linguistic expressions? Cognition 96:263–284
Findlay J, Gilchrist I (2001) Visual attention: the active vision perspective. In: Jenkins M, Harris L (eds) Vision and attention. Springer, New York, pp 83–103
Fletcher-Watson S, Findlay J, Leekam S, Benson V (2008) Rapid detection of person information in a naturalistic scene. Perception 37(4):571–583
Frank M, Vul E, Johnson S (2009) Development of infants’ attention to faces during the first year. Cognition 110:160–170
Fukumura K, Van Gompel R (2011) The effects of animacy in the choice of referring expressions. Lang Cogn Process 26:1472–1504
Fukumura K, Van Gompel R, Pickering MJ (2010) The use of visual context during the production of referring expressions. Q J Exp Psychol 63:1700–1715
Gabadinho A, Ritschard G, Müller N, Studer M (2011) Analyzing and visualizing state sequences in R with TraMineR. J Stat Softw 40:1–37
Gleitman L, January D, Nappa R, Trueswell J (2007) On the give and take between event apprehension and utterance formulation. J Mem Lang 57:544–569
Griffin Z, Bock K (2000) What the eyes say about speaking. Psychol Sci 11:274–279
Griffin Z, Davison J (2011) A technical introduction to using speakers’ eye movements to study language. In: The mental lexicon, vol 6. John Benjamins Publishing Company, Reading, MA, pp 55–82
Henderson J (2003) Human gaze control during real-world scene perception. Trends Cogn Sci 7:498–504
Henderson J, Chanceaux M, Smith T (2009) The influence of clutter on real-world scene search: evidence from search efficiency and eye movements. J Vis 9(1):1–32
Henderson J, Hollingworth A (1999) High-level scene perception. Annu Rev Psychol 50:243–271
Huettig F, Altmann G (2005) Word meaning and the control of eye fixation: semantic competitor effects and the visual world paradigm. Cognition 96(1):B23–B32
Hwang A, Wang H, Pomplun M (2011) Semantic guidance of eye movements in real-world scenes. Vis Res 51:1192–1205
Itti L, Koch C (2000) A saliency-based search mechanism for overt and covert shifts of visual attention. Vis Res 40(10–12):1489–1506
Kennedy A, Pynte J (2005) Parafoveal-on-foveal effects in normal reading. Vis Res 45:153–168
Knoeferle P, Crocker M (2007) The influence of recent scene events on spoken comprehension: evidence from eye movements. J Mem Lang 57(4):519–543
Kuchinsky S, Bock K, Irwin D (2011) Reversing the hands of time: changing the mapping from seeing to saying. J Exp Psychol Learn Mem Cogn 37:748–756
Kukona A, Fang S, Aicher K, Chen H, Magnuson J (2011) The time course of anticipatory constraint integration. Cognition 119:23–42
Levelt W, Roelofs A, Meyer A (1999) A theory of lexical access in speech production. Behav Brain Sci 22:1–75
Levin H, Buckler-Addis A (1979) The eye–voice span. MIT Press, Cambridge, MA
McDonald J, Bock J, Kelly M (1993) Word and world order: semantic, phonological and metrical determinants of serial position. Cogn Psychol 25(2):188–230
Meyer A, Sleiderink A, Levelt W (1998) Viewing and naming objects: eye movements during noun phrase production. Cognition 66(2):B25–B33
Myachykov A, Thompson D, Scheepers C, Garrod S (2011) Visual attention and structural choice in sentence production across languages. Lang Linguist Compass 5(2):95–107
Nelson W, Loftus G (1980) The functional visual field during picture viewing. J Exp Psychol Hum Learn Mem 7:369–376
Noton D, Stark L (1971) Eye movements and visual perception. Sci Am 224(1):34–43
Nuthmann A, Henderson J (2010) Object-based attentional selection in scene viewing. J Vis 10(8:20):1–20
Papafragou A, Hulbert J, Trueswell J (2008) Does language guide event perception? Cognition 108:155–184
Parkhurst D, Law K, Niebur E (2002) Modeling the role of salience in the allocation of overt visual attention. Vis Res 42(1):107–123
Pinheiro J, Bates D (2000) Mixed-effects models in S and S-PLUS. Statistics and Computing Series. Springer, New York, NY
Pomplun M, Ritter H, Velichkovsky B (1996) Disambiguating complex visual information: toward communication of personal views of a scene. Perception 25:931–948
Prat-Sala M, Branigan H (2000) Discourse constraints on syntactic processing in language production: a cross-linguistic study in English and Spanish. J Mem Lang 42:168–182
Pynte J, New B, Kennedy A (2008) On-line contextual influences during reading normal text: a multiple-regression analysis. Vis Res 48(21):2172–2183
Qu S, Chai J (2008) Incorporating temporal and semantic information with eye gaze for automatic word acquisition in multimodal conversational systems. In: Proceedings of the 2008 conference on empirical methods in natural language processing (EMNLP). Honolulu
Qu S, Chai J (2010) User language behavior, domain knowledge, and conversation context in automatic word acquisition for situated dialogue. J Artif Intell Res 37:247–277
Rayner K (1998) Eye movements in reading and information processing: 20 years of research. Psychol Bull 124(3):372–422
Rosenholtz R, Li Y, Nakano L (2007) Measuring visual clutter. J Vis 7:1–22
Rosenholtz R, Mansfield J, Jin Z (2005) Feature congestion, a measure of display clutter. In: SIGCHI, pp 761–770
Rothkopf CA, Ballard DH, Hayhoe MM (2007) Task and context determine where you look. J Vis 7(14):1–20
Russell B, Torralba A, Murphy K, Freeman W (2008) LabelMe: a database and web-based tool for image annotation. Int J Comput Vis 77(1–3):151–173
Tanenhaus M, Spivey-Knowlton M, Eberhard K, Sedivy J (1995) Integration of visual and linguistic information in spoken language comprehension. Science 268:632–634
Underwood G, Foulsham T (2006) Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Q J Exp Psychol 59:2031–2038
Zelinsky G, Schmidt J (2009) An effect of referential scene constraint on search implies scene segmentation. Vis Cogn 17(6):1004–1028
Acknowledgements
European Research Council award "Synchronous Linguistic and Visual Processing" (number 203427) to FK, and a Fundação para a Ciência e a Tecnologia Individual Post-Doctoral Fellowship award to MC (number SFRH/BDP/88374/2012), are gratefully acknowledged.
Appendices
Appendix 1: Syntactic chunking
In order to quantify syntactic guidance, the sentences produced in our experiment were decomposed into syntactic chunks. Syntactic chunks are an intermediate representation between simple parts of speech and a full parse, which provides basic structural information about the complexity of a sentence.
Chunking was performed automatically using the TagChunk system developed by Daumé and Marcu (2005), which performs combined part-of-speech tagging and syntactic chunking. It assigns syntactic labels to a sequence of words using the BIO encoding, in which the beginning of phrase X (e.g., noun phrase or NP) is tagged B-X (e.g., B-NP), the non-beginning (inside) of the X phrase is tagged I-X (e.g., I-NP), and any word that is not in a phrase is tagged O (outside). The TagChunk system achieves a chunking accuracy of 97.4 % on the CoNLL 2000 data set (8,936 training sentences and 2,012 test sentences). As an example, consider a description of the scene in Fig. 1, and its chunked version:
(1) There is a clipboard sitting on a coffee table and another clipboard next to it on which a US army man is writing.

(2) [B-NP There] [B-VP is] [B-NP a [I-NP clipboard]] [B-VP sitting] [B-PP on] [B-NP a [I-NP coffee [I-NP table]]] [B-O and] [B-NP another [I-NP clipboard]] [B-ADJP next] [B-PP to] [B-NP it] [B-PP on] [B-NP which] [B-NP a [I-NP US [I-NP army [I-NP man]]]] [B-VP is [I-VP writing]].
Based on the TagChunk output, we can calculate the frequency of each syntactic type and the total number of constituents (e.g., the frequency of B-NP is 7 in the above sentence), which we use as factors in the mixed models reported in the main text.
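The counting step over the BIO output can be sketched as follows (this is a minimal illustration, not part of TagChunk; the function name and placeholder words are ours):

```python
from collections import Counter

def chunk_counts(tagged_tokens):
    """Count syntactic chunk types from BIO-encoded (tag, word) pairs.

    Each B-X tag opens a new chunk of type X, I-X continues the current
    chunk, and O marks a word outside any chunk, so counting B-X tags
    yields per-type chunk frequencies and the number of constituents.
    """
    counts = Counter(tag.split("-", 1)[1]
                     for tag, _ in tagged_tokens
                     if tag.startswith("B-"))
    return counts, sum(counts.values())

# BIO tags for a fragment of example (2); words are placeholders.
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-PP",
        "B-NP", "I-NP", "I-NP"]
counts, total = chunk_counts([(t, "w") for t in tags])
print(counts["NP"], total)  # 3 NPs, 6 constituents in this fragment
```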
Appendix 2: Turbulence
The concept of turbulence of a sequence comes from population research, where the goal is to quantify the complexity of different life trajectories (e.g., Single, Married, Married with Children, sequence: S–M–MC, vs. Single only, sequence: S; Elzinga and Liefbroer 2007). Turbulence is a composite measure calculated by integrating information about: (1) the number of states a sequence is composed of, (2) their relative durations, and (3) the number of unique subsequences that can be derived from it.
Suppose we have three different scan patterns (x = man-R, man-L, clipboard, window, y = man-R, clipboard, man-R, man-L, z = man-R, clipboard, man-L, man-R), each of which consists of four fixated objects. Intuitively, x is more turbulent than y and z, as in x all fixated objects are different and inspected only once. When we compare y and z instead, they have the same number of uniquely fixated objects, but in z more objects are fixated before man-R is looked at again. Therefore, z can be considered more turbulent than y, as more events have occurred before man-R is inspected again.
A combinatorial implication of this reasoning is that more distinct subsequences, denoted as \(\phi (x)\), can be generated from z than from y. In fact, when computing the number of distinct subsequences for the three scan patterns, we find that \(\phi (x) = 16 > \phi (z) = 15 > \phi (y) = 14\).Footnote 4 Each state of a sequence, however, is often associated with a certain duration. Let's assume man-R is fixated for 200 ms before looking at another object. If we include fixation durations on x, we will obtain something like man-R/200, man-L/200, clipboard/200, window/200, and turbulence increases as the variance of the state durations decreases. So, x has a relatively high turbulence (variance of 0 across states), compared to a scan pattern such as man-R/400, man-L/50, clipboard/75, man-L/50, where the duration is concentrated on a single state.
These intuitions can be formalized by defining the turbulence of a sequence \(x\) as (Elzinga and Liefbroer 2007):

\[ T(x) = \log _2 \left( \phi (x) \, \frac{s_{t,\mathrm{max}}^2(x) + 1}{s_{t}^2(x) + 1} \right) \]

where \(\phi (x)\) denotes the number of distinct subsequences, \(s_{t}^2\) is the variance of the state durations and \(s_{t,\mathrm{max}}^2\) the maximum of that variance given the total duration of the sequence, calculated as \(s_{t,\mathrm{max}}^2 = (n - 1) (1 - \bar{t})^2\), with \(n\) denoting the number of states and \(\bar{t}\) the average state duration of the sequence \(x\). To compute the turbulence for the analyses in the main text, we utilized the R package TraMineR, a toolbox developed by Gabadinho et al. (2011) to perform categorical sequence analysis.
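The computation above can be sketched as follows. This is an illustrative reimplementation (in practice we used TraMineR, as noted); the distinct-subsequence count uses the standard dynamic-programming recurrence, and the use of the sample variance for \(s_t^2\) is our assumption:

```python
import math

def phi(seq):
    """Number of distinct subsequences of seq (including the empty one),
    via the standard recurrence: doubling minus previously counted
    subsequences ending in the repeated symbol."""
    count, last = 1, {}
    for s in seq:
        prev = count
        count = 2 * count - last.get(s, 0)
        last[s] = prev
    return count

def turbulence(states, durations):
    """T(x) = log2(phi(x) * (s2_max + 1) / (s2 + 1)), with
    s2_max = (n - 1) * (1 - t_bar)^2 as in Elzinga and Liefbroer (2007).
    Sample variance is assumed for s2."""
    n = len(states)
    t_bar = sum(durations) / n
    s2 = sum((t - t_bar) ** 2 for t in durations) / (n - 1)
    s2_max = (n - 1) * (1 - t_bar) ** 2
    return math.log2(phi(states) * (s2_max + 1) / (s2 + 1))

x = ["man-R", "man-L", "clipboard", "window"]
y = ["man-R", "clipboard", "man-R", "man-L"]
z = ["man-R", "clipboard", "man-L", "man-R"]
print(phi(x), phi(z), phi(y))  # 16 15 14, matching the appendix
```

With equal durations, x receives a higher turbulence than the same scan pattern with duration concentrated on a single state, reflecting the intuition described above.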
Appendix 3: Distance from mentioned, but not fixated, objects
Since the distance from a fixation is known to have processing implications (Nelson and Loftus 1980), we performed an additional analysis in which we looked at the closest fixation to the referent when it is not fixated, and calculated the distance of this fixation from the object centroid. We find that, on average, fixations are \(11.65 \pm 5.04\) degrees of visual angle from the object centroid. This indicates that peripheral vision is sufficient to identify and select the object for mentioning: for parafoveal effects, the object would have to be fixated within 4–5 degrees of visual angle. Moreover, a mixed model analysis reveals that the distance varies with the amount of visual clutter and the animacy of the target. In particular, an animate object tends to have a smaller visual distance (\({\beta }_{{\rm Animate}} = -3.12;\,p < 0.05\)), especially in a scene with minimal clutter (\({\beta }_{{\rm Animate:Minimal}} = -6.25;\,p < 0.01\)). This indicates that foveating the object is not an obligatory element of mentioning.
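The fixation-to-centroid distance in degrees of visual angle can be computed from pixel coordinates as sketched below; the screen geometry parameters are hypothetical placeholders (the actual display geometry is not reported in this appendix):

```python
import math

def visual_angle_deg(dist_px, px_per_cm, viewing_cm):
    """Convert an on-screen distance in pixels to degrees of visual angle,
    using the standard small-target formula 2 * atan(size / (2 * d))."""
    size_cm = dist_px / px_per_cm
    return math.degrees(2 * math.atan(size_cm / (2 * viewing_cm)))

def centroid_distance_deg(fixation, centroid, px_per_cm, viewing_cm):
    """Euclidean distance between a fixation and an object centroid
    (both (x, y) in pixels), expressed in degrees of visual angle."""
    dx = fixation[0] - centroid[0]
    dy = fixation[1] - centroid[1]
    return visual_angle_deg(math.hypot(dx, dy), px_per_cm, viewing_cm)

# Example with placeholder geometry: 1 px/cm, 57 cm viewing distance
# (at 57 cm, roughly 1 cm on screen subtends about 1 degree).
d = centroid_distance_deg((0, 0), (3, 4), px_per_cm=1, viewing_cm=57)
print(round(d, 2))
```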
Cite this article
Coco, M.I., Keller, F. Integrating mechanisms of visual guidance in naturalistic language production. Cogn Process 16, 131–150 (2015). https://doi.org/10.1007/s10339-014-0642-0