Abstract
Situated language production requires the integration of visual attention and linguistic processing. Previous work has not conclusively disentangled the role of perceptual scene information and structural sentence information in guiding visual attention. In this paper, we present an eye-tracking study that demonstrates that three types of guidance, perceptual, conceptual, and structural, interact to control visual attention. In a cued language production experiment, we manipulate perceptual (scene clutter) and conceptual guidance (cue animacy) and measure structural guidance (syntactic complexity of the utterance). Analysis of the time course of language production, before and during speech, reveals that all three forms of guidance affect the complexity of visual responses, quantified in terms of the entropy of attentional landscapes and the turbulence of scan patterns, especially during speech. We find that perceptual and conceptual guidance mediate the distribution of attention in the scene, whereas structural guidance closely relates to scan pattern complexity. Furthermore, the eye–voice spans of the cued object and of its perceptual competitor are similar, with their latencies mediated by both perceptual and structural guidance. These results rule out a strict interpretation of structural guidance as the single dominant form of visual guidance in situated language production. Rather, the phase of the task and the associated demands of cross-modal cognitive processing determine the mechanisms that guide attention.
Notes
Note that we use small caps to denote visual referents and italics to denote linguistic referents.
We include the Primary/Secondary distinction as a random variable in all mixed models concerning latencies, and in all four cases for the analysis of fixation distribution and scan pattern complexity.
We use -l and -r to denote the leftmost and rightmost object in case of ambiguous objects.
The logarithm of \(\phi (x)\) is taken to avoid large numbers as \(\phi (x)\) increases exponentially with increasing x.
References
Allopenna P, Magnuson J, Tanenhaus M (1998) Tracking the time course of spoken word recognition: evidence for continuous mapping models. J Mem Lang 38:419–439
Altmann G, Kamide Y (1999) Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition 73:247–264
Andersson R, Ferreira F, Henderson J (2011) I see what you are saying: the integration of complex speech and scenes during language comprehension. Acta Psychol 137:208–216
Arnold J, Griffin Z (2007) The effect of additional characters on choice of referring expression: everything counts. J Mem Lang 56:521–536
Baayen R, Davidson D, Bates D (2008) Mixed-effects modeling with crossed random effects for subjects and items. J Mem Lang 59:390–412
Barr DJ, Levy R, Scheepers C, Tily HJ (2013) Random-effects structure for confirmatory hypothesis testing: keep it maximal. J Mem Lang 68(3):255–278
Bock K, Irwin D, Davidson D, Levelt W (2003) Minding the clock. J Mem Lang 48(4):653–685
Branigan H, Pickering M, Tanaka M (2008) Contribution of animacy to grammatical function assignment and word order during production. Lingua 118(2):172–189
Brown-Schmidt S, Tanenhaus M (2006) Watching the eyes when talking about size: an investigation of message formulation and utterance planning. J Mem Lang 54:592–609
Castelhano M, Mack M, Henderson J (2009) Viewing task influences eye-movement control during active scene perception. J Vis 9:1–15
Coco MI, Keller F (2009) The impact of visual information on referent assignment in sentence production. In: Taatgen NA, van Rijn H (eds) Proceedings of the 31st annual conference of the cognitive science society, Amsterdam
Coco MI, Keller F (2012) Scan patterns predict sentence production in the cross-modal processing of visual scenes. Cogn Sci 36(7):1204–1223
Dahan D, Tanenhaus M (2005) Looking at the rope when looking for the snake: conceptually mediated eye movements during spoken-word recognition. Psychol Bull Rev 12:455–459
Daumé III H, Marcu D (2005) Learning as search optimization: approximate large margin methods for structured prediction. In: International conference on machine learning (ICML)
Elazary L, Itti L (2008) Interesting objects are visually salient. J Vis 8(14:18):1–15
Elzinga C, Liefbroer A (2007) Destandardization of the life course: a cross-national comparison using sequence analysis. Eur J Popul 23(3–4):225–250
Ferreira V, Slevc L, Rogers E (2007) How do speakers avoid ambiguous linguistic expressions? Cognition 96:263–284
Findlay J, Gilchrist I (2001) Visual attention: the active vision perspective. In: Jenkins M, Harris L (eds) Vision and attention. Springer, New York, pp 83–103
Fletcher-Watson S, Findlay J, Leekam S, Benson V (2008) Rapid detection of person information in a naturalistic scene. Perception 37(4):571–583
Frank M, Vul E, Johnson S (2009) Development of infants’ attention to faces during the first year. Cognition 110:160–170
Fukumura K, Van Gompel R (2011) The effects of animacy in the choice of referring expressions. Lang Cogn Process 26:1472–1504
Fukumura K, Van Gompel R, Pickering MJ (2010) The use of visual context during the production of referring expressions. Q J Exp Psychol 63:1700–1715
Gabadinho A, Ritschard G, Müller N, Studer M (2011) Analyzing and visualizing state sequences in R with TraMineR. J Stat Softw 40:1–37
Gleitman L, January D, Nappa R, Trueswell J (2007) On the give and take between event apprehension and utterance formulation. J Mem Lang 57:544–569
Griffin Z, Bock K (2000) What the eyes say about speaking. Psychol Sci 11:274–279
Griffin Z, Davison J (2011) A technical introduction to using speakers’ eye movements to study language. In: The mental lexicon, vol 6. John Benjamins Publishing Company, Reading, MA, pp 55–82
Henderson J (2003) Human gaze control during real-world scene perception. Trends Cogn Sci 7:498–504
Henderson J, Chanceaux M, Smith T (2009) The influence of clutter on real-world scene search: evidence from search efficiency and eye movements. J Vis 9(1):1–32
Henderson J, Hollingworth A (1999) High-level scene perception. Annu Rev Psychol 50:243–271
Huettig F, Altmann G (2005) Word meaning and the control of eye fixation: semantic competitor effects and the visual world paradigm. Cognition 96(1):B23–B32
Hwang A, Wang H, Pomplun M (2011) Semantic guidance of eye movements in real-world scenes. Vis Res 51:1192–1205
Itti L, Koch C (2000) A saliency-based search mechanism for overt and covert shifts of visual attention. Vis Res 40(10–12):1489–1506
Kennedy A, Pynte J (2005) Parafoveal-on-foveal effects in normal reading. Vis Res 45:153–168
Knoeferle P, Crocker M (2007) The influence of recent scene events on spoken comprehension: evidence from eye movements. J Mem Lang 57(4):519–543
Kuchinsky S, Bock K, Irwin D (2011) Reversing the hands of time: changing the mapping from seeing to saying. J Exp Psychol Learn Mem Cogn 37:748–756
Kukona A, Fang S, Aicher K, Chen H, Magnuson J (2011) The time course of anticipatory constraint integration. Cognition 119:23–42
Levelt W, Roelofs A, Meyer A (1999) A theory of lexical access in speech production. Behav Brain Sci 22:1–75
Levin H, Buckler-Addis A (1979) The eye–voice span. MIT Press, Cambridge, MA
McDonald J, Bock J, Kelly M (1993) Word and world order: semantic, phonological and metrical determinants of serial position. Cogn Psychol 25(2):188–230
Meyer A, Sleiderink A, Levelt W (1998) Viewing and naming objects: eye movements during noun phrase production. Cognition 66(2):B25–B33
Myachykov A, Thompson D, Scheepers C, Garrod S (2011) Visual attention and structural choice in sentence production across languages. Lang Linguist Compass 5(2):95–107
Nelson W, Loftus G (1980) The functional visual field during picture viewing. J Exp Psychol Hum Learn Mem 7:369–376
Noton D, Stark L (1971) Eye movements and visual perception. Sci Am 224(1):34–43
Nuthmann A, Henderson J (2010) Object-based attentional selection in scene viewing. J Vis 10(8:20):1–20
Papafragou A, Hulbert J, Trueswell J (2008) Does language guide event perception? Cognition 108:155–184
Parkhurst D, Law K, Niebur E (2002) Modeling the role of salience in the allocation of overt visual attention. Vis Res 42(1):107–123
Pinheiro J, Bates D (2000) Mixed-effects models in S and S-PLUS. Statistics and Computing Series. Springer, New York, NY
Pomplun M, Ritter H, Velichkovsky B (1996) Disambiguating complex visual information: toward communication of personal views of a scene. Perception 25:931–948
Prat-Sala M, Branigan H (2000) Discourse constraints on syntactic processing in language production: a cross-linguistic study in English and Spanish. J Mem Lang 42:168–182
Pynte J, New B, Kennedy A (2008) On-line contextual influences during reading normal text: a multiple-regression analysis. Vis Res 48(21):2172–2183
Qu S, Chai J (2008) Incorporating temporal and semantic information with eye gaze for automatic word acquisition in multimodal conversational systems. In: Proceedings of the 2008 conference on empirical methods in natural language processing (EMNLP). Honolulu
Qu S, Chai J (2010) User language behavior, domain knowledge, and conversation context in automatic word acquisition for situated dialogue. J Artif Intell Res 37:247–277
Rayner K (1998) Eye movements in reading and information processing: 20 years of research. Psychol Bull 124(3):372–422
Rosenholtz R, Li Y, Nakano L (2007) Measuring visual clutter. J Vis 7:1–22
Rosenholtz R, Mansfield J, Jin Z (2005) Feature congestion, a measure of display clutter. In: SIGCHI, pp 761–770
Rothkopf CA, Ballard DH, Hayhoe MM (2007) Task and context determine where you look. J Vis 7(14):1–20
Russell B, Torralba A, Murphy K, Freeman W (2008) LabelMe: a database and web-based tool for image annotation. Int J Comput Vis 77(1–3):151–173
Tanenhaus M, Spivey-Knowlton M, Eberhard K, Sedivy J (1995) Integration of visual and linguistic information in spoken language comprehension. Science 268:632–634
Underwood G, Foulsham T (2006) Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Q J Exp Psychol 59:2031–2038
Zelinsky G, Schmidt J (2009) An effect of referential scene constraint on search implies scene segmentation. Vis Cogn 17(6):1004–1028
Acknowledgements
European Research Council award "Synchronous Linguistic and Visual Processing" (number 203427) to FK, and a Fundação para a Ciência e a Tecnologia Individual Post-Doctoral Fellowship award to MC (number SFRH/BDP/88374/2012), are gratefully acknowledged.
Appendices
Appendix 1: Syntactic chunking
In order to quantify syntactic guidance, the sentences produced in our experiment were decomposed into syntactic chunks. Syntactic chunks are an intermediate representation between simple parts of speech and a full parse, which provides basic structural information about the complexity of a sentence.
Chunking was performed automatically using the TagChunk system developed by Daumé and Marcu (2005), which performs combined part-of-speech tagging and syntactic chunking. It assigns syntactic labels to a sequence of words using the BIO encoding, in which the beginning of phrase X (e.g., noun phrase or NP) is tagged B-X (e.g., B-NP), the non-beginning (inside) of the X phrase is tagged I-X (e.g., I-NP), and any word that is not in a phrase is tagged O (outside). The TagChunk system achieves a chunking accuracy of 97.4 % on the CoNLL 2000 data set (8,936 training sentences and 2,012 test sentences). As an example, consider a description of the scene in Fig. 1, and its chunked version:
(1) There is a clipboard sitting on a coffee table and another clipboard next to it on which a US army man is writing.

(2) [B-NP There] [B-VP is] [B-NP a [I-NP clipboard]] [B-VP sitting] [B-PP on] [B-NP a [I-NP coffee [I-NP table]]] [B-O and] [B-NP another [I-NP clipboard]] [B-ADJP next] [B-PP to] [B-NP it] [B-PP on] [B-NP which] [B-NP a [I-NP US [I-NP army [I-NP man]]]] [B-VP is [I-VP writing]].
Based on the TagChunk output, we can calculate the frequency of each syntactic type and the total number of constituents (e.g., the frequency of B-NP is 7 in the above sentence), which we use as factors in the mixed models reported in the main text.
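The counting step over the BIO output can be sketched as follows (this is a minimal illustration, not part of TagChunk; the function name and placeholder words are ours):

```python
from collections import Counter

def chunk_counts(tagged_tokens):
    """Count syntactic chunk types from BIO-encoded (tag, word) pairs.

    Each B-X tag opens a new chunk of type X, I-X continues the current
    chunk, and O marks a word outside any chunk, so counting B-X tags
    yields per-type chunk frequencies and the number of constituents.
    """
    counts = Counter(tag.split("-", 1)[1]
                     for tag, _ in tagged_tokens
                     if tag.startswith("B-"))
    return counts, sum(counts.values())

# BIO tags for a fragment of example (2); words are placeholders.
tags = ["B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-PP",
        "B-NP", "I-NP", "I-NP"]
counts, total = chunk_counts([(t, "w") for t in tags])
print(counts["NP"], total)  # 3 NPs, 6 constituents in this fragment
```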
Appendix 2: Turbulence
The concept of turbulence of a sequence comes from population research, where the goal is to quantify the complexity of different life trajectories (e.g., Single, Married, Married with Children, sequence: S–M–MC, vs. Single only, sequence: S; Elzinga and Liefbroer 2007). Turbulence is a composite measure calculated by integrating information about: (1) the number of states a sequence is composed of, (2) their relative durations, and (3) the number of unique subsequences that can be derived from it.
Suppose we have three different scan patterns (x = man-R, man-L, clipboard, window, y = man-R, clipboard, man-R, man-L, z = man-R, clipboard, man-L, man-R), each of which consists of four fixated objects. Intuitively, x is more turbulent than y and z, as in x all fixated objects are different and inspected only once. When we compare y and z instead, they have the same number of uniquely fixated objects, but in z more objects are fixated before man-R is looked at again. Therefore, z can be considered more turbulent than y, as more events have occurred before man-R is inspected again.
A combinatorial implication of this reasoning is that more distinct subsequences, denoted as \(\phi (x)\), can be generated from z than from y. In fact, when computing the number of distinct subsequences for the three scan patterns, we find that \(\phi (x) = 16 > \phi (z) = 15 > \phi (y) = 14\).Footnote 4 Each state of a sequence, however, is often associated with a certain duration. Let's assume man-R is fixated for 200 ms before looking at another object. If we include fixation durations on x, we will obtain something like man-R/200, man-L/200, clipboard/200, window/200, and turbulence increases as the variance of the state durations decreases. So, x has a relatively high turbulence (variance of 0 across states), compared to a scan pattern such as man-R/400, man-L/50, clipboard/75, man-L/50, where the duration is concentrated on a single state.
These intuitions can be formalized by defining the turbulence of a sequence \(x\) as (Elzinga and Liefbroer 2007):

\[ T(x) = \log _2 \left( \phi (x) \, \frac{s_{t,\mathrm{max}}^2(x) + 1}{s_{t}^2(x) + 1} \right) \]

where \(\phi (x)\) denotes the number of distinct subsequences, \(s_{t}^2\) is the variance of the state durations and \(s_{t,\mathrm{max}}^2\) the maximum of that variance given the total duration of the sequence, calculated as \(s_{t,\mathrm{max}}^2 = (n - 1) (1 - \bar{t})^2\), with \(n\) denoting the number of states and \(\bar{t}\) the average state duration of the sequence \(x\). To compute the turbulence for the analyses in the main text, we utilized the R package TraMineR, a toolbox developed by Gabadinho et al. (2011) to perform categorical sequence analysis.
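The computation above can be sketched as follows. This is an illustrative reimplementation (in practice we used TraMineR, as noted); the distinct-subsequence count uses the standard dynamic-programming recurrence, and the use of the sample variance for \(s_t^2\) is our assumption:

```python
import math

def phi(seq):
    """Number of distinct subsequences of seq (including the empty one),
    via the standard recurrence: doubling minus previously counted
    subsequences ending in the repeated symbol."""
    count, last = 1, {}
    for s in seq:
        prev = count
        count = 2 * count - last.get(s, 0)
        last[s] = prev
    return count

def turbulence(states, durations):
    """T(x) = log2(phi(x) * (s2_max + 1) / (s2 + 1)), with
    s2_max = (n - 1) * (1 - t_bar)^2 as in Elzinga and Liefbroer (2007).
    Sample variance is assumed for s2."""
    n = len(states)
    t_bar = sum(durations) / n
    s2 = sum((t - t_bar) ** 2 for t in durations) / (n - 1)
    s2_max = (n - 1) * (1 - t_bar) ** 2
    return math.log2(phi(states) * (s2_max + 1) / (s2 + 1))

x = ["man-R", "man-L", "clipboard", "window"]
y = ["man-R", "clipboard", "man-R", "man-L"]
z = ["man-R", "clipboard", "man-L", "man-R"]
print(phi(x), phi(z), phi(y))  # 16 15 14, matching the appendix
```

With equal durations, x receives a higher turbulence than the same scan pattern with duration concentrated on a single state, reflecting the intuition described above.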
Appendix 3: Distance from mentioned, but not fixated, objects
Since the distance from a fixation is known to have processing implications (Nelson and Loftus 1980), we performed an additional analysis in which we looked at the closest fixation to the referent when it is not fixated, and calculated the distance of this fixation from the object centroid. We find that, on average, fixations are \(11.65 \pm 5.04\) degrees of visual angle from the object centroid. This indicates that peripheral vision is sufficient to identify and select the object for mentioning: for parafoveal effects, the object would have to be fixated within 4–5 degrees of visual angle. Moreover, a mixed model analysis reveals that the distance varies with the amount of visual clutter and the animacy of the target. In particular, an animate object tends to have a smaller visual distance (\({\beta }_{{\rm Animate}} = -3.12;\,p < 0.05\)), especially in a scene with minimal clutter (\({\beta }_{{\rm Animate:Minimal}} = -6.25;\,p < 0.01\)). This indicates that foveating the object is not an obligatory element of mentioning.
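The fixation-to-centroid distance in degrees of visual angle can be computed from pixel coordinates as sketched below; the screen geometry parameters are hypothetical placeholders (the actual display geometry is not reported in this appendix):

```python
import math

def visual_angle_deg(dist_px, px_per_cm, viewing_cm):
    """Convert an on-screen distance in pixels to degrees of visual angle,
    using the standard small-target formula 2 * atan(size / (2 * d))."""
    size_cm = dist_px / px_per_cm
    return math.degrees(2 * math.atan(size_cm / (2 * viewing_cm)))

def centroid_distance_deg(fixation, centroid, px_per_cm, viewing_cm):
    """Euclidean distance between a fixation and an object centroid
    (both (x, y) in pixels), expressed in degrees of visual angle."""
    dx = fixation[0] - centroid[0]
    dy = fixation[1] - centroid[1]
    return visual_angle_deg(math.hypot(dx, dy), px_per_cm, viewing_cm)

# Example with placeholder geometry: 1 px/cm, 57 cm viewing distance
# (at 57 cm, roughly 1 cm on screen subtends about 1 degree).
d = centroid_distance_deg((0, 0), (3, 4), px_per_cm=1, viewing_cm=57)
print(round(d, 2))
```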
Cite this article
Coco, M.I., Keller, F. Integrating mechanisms of visual guidance in naturalistic language production. Cogn Process 16, 131–150 (2015). https://doi.org/10.1007/s10339-014-0642-0