
Integrating mechanisms of visual guidance in naturalistic language production

  • Research Report
  • Published in: Cognitive Processing

Abstract

Situated language production requires the integration of visual attention and linguistic processing. Previous work has not conclusively disentangled the role of perceptual scene information and structural sentence information in guiding visual attention. In this paper, we present an eye-tracking study demonstrating that three types of guidance (perceptual, conceptual, and structural) interact to control visual attention. In a cued language production experiment, we manipulate perceptual (scene clutter) and conceptual guidance (cue animacy) and measure structural guidance (syntactic complexity of the utterance). Analysis of the time course of language production, before and during speech, reveals that all three forms of guidance affect the complexity of visual responses, quantified in terms of the entropy of attentional landscapes and the turbulence of scan patterns, especially during speech. We find that perceptual and conceptual guidance mediate the distribution of attention in the scene, whereas structural guidance closely relates to scan pattern complexity. Furthermore, the eye–voice spans of the cued object and its perceptual competitor are similar, with latency mediated by both perceptual and structural guidance. These results rule out a strict interpretation of structural guidance as the single dominant form of visual guidance in situated language production. Rather, the phase of the task and the associated demands of cross-modal cognitive processing determine the mechanisms that guide attention.


Notes

  1. Note we use small caps to denote visual referents and italics to denote linguistic referents.

  2. We include the Primary/Secondary distinction as a random variable in all mixed models of latencies, and all four cases in the analyses of fixation distribution and scan pattern complexity.

  3. We use the suffixes -l and -r to denote the leftmost and rightmost object in the case of ambiguous objects.

  4. The logarithm of \(\phi (x)\) is taken to avoid large numbers, as \(\phi (x)\) grows exponentially with the length of x.

References

  • Allopenna P, Magnuson J, Tanenhaus M (1998) Tracking the time course of spoken word recognition: evidence for continuous mapping models. J Mem Lang 38:419–439
  • Altmann G, Kamide Y (1999) Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition 73:247–264
  • Andersson R, Ferreira F, Henderson J (2011) I see what you are saying: the integration of complex speech and scenes during language comprehension. Acta Psychol 137:208–216
  • Arnold J, Griffin Z (2007) The effect of additional characters on choice of referring expression: everything counts. J Mem Lang 56:521–536
  • Baayen R, Davidson D, Bates D (2008) Mixed-effects modeling with crossed random effects for subjects and items. J Mem Lang 59:390–412
  • Barr DJ, Levy R, Scheepers C, Tily HJ (2013) Random-effects structure for confirmatory hypothesis testing: keep it maximal. J Mem Lang 68(3):255–278
  • Bock K, Irwin D, Davidson D, Levelt W (2003) Minding the clock. J Mem Lang 48(4):653–685
  • Branigan H, Pickering M, Tanaka M (2008) Contribution of animacy to grammatical function assignment and word order during production. Lingua 118(2):172–189
  • Brown-Schmidt S, Tanenhaus M (2006) Watching the eyes when talking about size: an investigation of message formulation and utterance planning. J Mem Lang 54:592–609
  • Castelhano M, Mack M, Henderson J (2009) Viewing task influences eye-movement control during active scene perception. J Vis 9:1–15
  • Coco MI, Keller F (2009) The impact of visual information on referent assignment in sentence production. In: Taatgen NA, van Rijn H (eds) Proceedings of the 31st annual conference of the Cognitive Science Society, Amsterdam
  • Coco MI, Keller F (2012) Scan patterns predict sentence production in the cross-modal processing of visual scenes. Cogn Sci 36(7):1204–1223
  • Dahan D, Tanenhaus M (2005) Looking at the rope when looking for the snake: conceptually mediated eye movements during spoken-word recognition. Psychon Bull Rev 12:455–459
  • Daumé III H, Marcu D (2005) Learning as search optimization: approximate large margin methods for structured prediction. In: International conference on machine learning (ICML)
  • Elazary L, Itti L (2008) Interesting objects are visually salient. J Vis 8(14:18):1–15
  • Elzinga C, Liefbroer A (2007) Destandardization of the life course: a cross-national comparison using sequence analysis. Eur J Popul 23(3–4):225–250
  • Ferreira V, Slevc L, Rogers E (2007) How do speakers avoid ambiguous linguistic expressions? Cognition 96:263–284
  • Findlay J, Gilchrist I (2001) Visual attention: the active vision perspective. In: Jenkins M, Harris L (eds) Vision and attention. Springer, New York, pp 83–103
  • Fletcher-Watson S, Findlay J, Leekam S, Benson V (2008) Rapid detection of person information in a naturalistic scene. Perception 37(4):571–583
  • Frank M, Vul E, Johnson S (2009) Development of infants’ attention to faces during the first year. Cognition 110:160–170
  • Fukumura K, Van Gompel R (2011) The effects of animacy in the choice of referring expressions. Lang Cogn Process 26:1472–1504
  • Fukumura K, Van Gompel R, Pickering MJ (2010) The use of visual context during the production of referring expressions. Q J Exp Psychol 63:1700–1715
  • Gabadinho A, Ritschard G, Müller N, Studer M (2011) Analyzing and visualizing state sequences in R with TraMineR. J Stat Softw 40:1–37
  • Gleitman L, January D, Nappa R, Trueswell J (2007) On the give and take between event apprehension and utterance formulation. J Mem Lang 57:544–569
  • Griffin Z, Bock K (2000) What the eyes say about speaking. Psychol Sci 11:274–279
  • Griffin Z, Davison J (2011) A technical introduction to using speakers’ eye movements to study language. In: The mental lexicon, vol 6. John Benjamins Publishing Company, Reading, MA, pp 55–82
  • Henderson J (2003) Human gaze control during real-world scene perception. Trends Cogn Sci 7:498–504
  • Henderson J, Chanceaux M, Smith T (2009) The influence of clutter on real-world scene search: evidence from search efficiency and eye movements. J Vis 9(1):1–32
  • Henderson J, Hollingworth A (1999) High-level scene perception. Annu Rev Psychol 50:243–271
  • Huettig F, Altmann G (2005) Word meaning and the control of eye fixation: semantic competitor effects and the visual world paradigm. Cognition 96(1):B23–B32
  • Hwang A, Wang H, Pomplun M (2011) Semantic guidance of eye movements in real-world scenes. Vis Res 51:1192–1205
  • Itti L, Koch C (2000) A saliency-based search mechanism for overt and covert shifts of visual attention. Vis Res 40(10–12):1489–1506
  • Kennedy A, Pynte J (2005) Parafoveal-on-foveal effects in normal reading. Vis Res 45:153–168
  • Knoeferle P, Crocker M (2007) The influence of recent scene events on spoken comprehension: evidence from eye movements. J Mem Lang 57(4):519–543
  • Kuchinsky S, Bock K, Irwin D (2011) Reversing the hands of time: changing the mapping from seeing to saying. J Exp Psychol Learn Mem Cogn 37:748–756
  • Kukona A, Fang S, Aicher K, Chen H, Magnuson J (2011) The time course of anticipatory constraint integration. Cognition 119:23–42
  • Levelt W, Roelofs A, Meyer A (1999) A theory of lexical access in speech production. Behav Brain Sci 22:1–75
  • Levin H, Buckler-Addis A (1979) The eye–voice span. MIT Press, Cambridge, MA
  • McDonald J, Bock J, Kelly M (1993) Word and world order: semantic, phonological and metrical determinants of serial position. Cogn Psychol 25(2):188–230
  • Meyer A, Sleiderink A, Levelt W (1998) Viewing and naming objects: eye movements during noun phrase production. Cognition 66(2):B25–B33
  • Myachykov A, Thompson D, Scheepers C, Garrod S (2011) Visual attention and structural choice in sentence production across languages. Lang Linguist Compass 5(2):95–107
  • Nelson W, Loftus G (1980) The functional visual field during picture viewing. J Exp Psychol Hum Learn Mem 7:369–376
  • Noton D, Stark L (1971) Eye movements and visual perception. Sci Am 224(1):34–43
  • Nuthmann A, Henderson J (2010) Object-based attentional selection in scene viewing. J Vis 10(8:20):1–20
  • Papafragou A, Hulbert J, Trueswell J (2008) Does language guide event perception? Cognition 108:155–184
  • Parkhurst D, Law K, Niebur E (2002) Modeling the role of salience in the allocation of overt visual attention. Vis Res 42(1):107–123
  • Pinheiro J, Bates D (2000) Mixed-effects models in S and S-PLUS. Statistics and Computing Series. Springer, New York, NY
  • Pomplun M, Ritter H, Velichkovsky B (1996) Disambiguating complex visual information: toward communication of personal views of a scene. Perception 25:931–948
  • Prat-Sala M, Branigan H (2000) Discourse constraints on syntactic processing in language production: a cross-linguistic study in English and Spanish. J Mem Lang 42:168–182
  • Pynte J, New B, Kennedy A (2008) On-line contextual influences during reading normal text: a multiple-regression analysis. Vis Res 48(21):2172–2183
  • Qu S, Chai J (2008) Incorporating temporal and semantic information with eye gaze for automatic word acquisition in multimodal conversational systems. In: Proceedings of the 2008 conference on empirical methods in natural language processing (EMNLP), Honolulu
  • Qu S, Chai J (2010) User language behavior, domain knowledge, and conversation context in automatic word acquisition for situated dialogue. J Artif Intell Res 37:247–277
  • Rayner K (1998) Eye movements in reading and information processing: 20 years of research. Psychol Bull 124(3):372–422
  • Rosenholtz R, Li Y, Nakano L (2007) Measuring visual clutter. J Vis 7:1–22
  • Rosenholtz R, Mansfield J, Jin Z (2005) Feature congestion: a measure of display clutter. In: Proceedings of SIGCHI, pp 761–770
  • Rothkopf CA, Ballard DH, Hayhoe MM (2007) Task and context determine where you look. J Vis 7(14):1–20
  • Russell B, Torralba A, Murphy K, Freeman W (2008) LabelMe: a database and web-based tool for image annotation. Int J Comput Vis 77(1–3):151–173
  • Tanenhaus M, Spivey-Knowlton M, Eberhard K, Sedivy J (1995) Integration of visual and linguistic information in spoken language comprehension. Science 268:632–634
  • Underwood G, Foulsham T (2006) Visual saliency and semantic incongruency influence eye movements when inspecting pictures. Q J Exp Psychol 59:2031–2038
  • Zelinsky G, Schmidt J (2009) An effect of referential scene constraint on search implies scene segmentation. Vis Cogn 17(6):1004–1028


Acknowledgements

European Research Council award "Synchronous Linguistic and Visual Processing" (number 203427) to FK and a Fundação para a Ciência e Tecnologia Individual Post-Doctoral Fellowship to MC (number SFRH/BDP/88374/2012) are gratefully acknowledged.

Author information

Corresponding author

Correspondence to Moreno I. Coco.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 13442 KB)

Appendices

Appendix 1: Syntactic chunking

In order to quantify syntactic guidance, the sentences produced in our experiment were decomposed into syntactic chunks. Syntactic chunks are an intermediate representation between simple parts of speech and a full parse, which gives us basic structural information about the complexity of a sentence.

Chunking was performed automatically using the TagChunk system developed by Daumé and Marcu (2005), which performs combined part-of-speech tagging and syntactic chunking. It assigns syntactic labels to a sequence of words using the BIO encoding, in which the beginning of phrase X (e.g., a noun phrase or NP) is tagged B-X (e.g., B-NP), the non-beginning (inside) of phrase X is tagged I-X (e.g., I-NP), and any word that is not in a phrase is tagged O (outside). The TagChunk system achieves a chunking accuracy of 97.4 % on the CoNLL 2000 data set (8,936 training sentences and 2,012 test sentences). As an example, consider a description of the scene in Fig. 1 and its chunked version:

  (1) There is a clipboard sitting on a coffee table and another clipboard next to it on which a US army man is writing.

  (2) [B-NP There] [B-VP is] [B-NP a [I-NP clipboard]] [B-VP sitting] [B-PP on] [B-NP a [I-NP coffee [I-NP table]]] [B-O and] [B-NP another [I-NP clipboard]] [B-ADJP next] [B-PP to] [B-NP it] [B-PP on] [B-NP which] [B-NP a [I-NP US [I-NP army [I-NP man]]]] [B-VP is [I-VP writing]].

Based on the TagChunk output, we can calculate the frequency of each syntactic type and the total number of constituents (e.g., the frequency of B-NP is 7 in the above sentence), which we use as factors in the mixed models reported in the main text.
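To make the derivation of these predictors concrete, the following sketch, written in R purely for illustration (the tag vector is transcribed by hand from example (2) and is not part of the original analysis pipeline), counts the frequency of each phrase type and the total number of constituents.

    # Minimal sketch: chunk-type frequencies from a BIO-tagged sentence.
    # The tag sequence is transcribed from example (2); it is an illustrative assumption.
    tags <- c("B-NP", "B-VP", "B-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "I-NP",
              "B-O", "B-NP", "I-NP", "B-ADJP", "B-PP", "B-NP", "B-PP", "B-NP",
              "B-NP", "I-NP", "I-NP", "I-NP", "B-VP", "I-VP")

    # Every constituent starts at a B-X tag, so counting B-X tags gives both
    # the frequency of each syntactic type and the total number of constituents.
    chunk_starts <- grep("^B-", tags, value = TRUE)
    chunk_freq   <- table(chunk_starts)   # e.g., B-NP occurs 7 times, as noted above
    n_chunks     <- length(chunk_starts)  # total number of constituents

    print(chunk_freq)
    print(n_chunks)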

Appendix 2: Turbulence

The concept of the turbulence of a sequence comes from population research, where the goal is to quantify the complexity of different life trajectories (e.g., Single, Married, Married with Children, sequence: S–M–MC, vs. Single only, sequence: S; Elzinga and Liefbroer 2007). Turbulence is a composite measure that integrates information about: (1) the number of states a sequence is composed of, (2) their relative durations, and (3) the number of distinct subsequences that can be derived from it.

Suppose we have three different scan patterns (x = man-R, man-L, clipboard, window; y = man-R, clipboard, man-R, man-L; z = man-R, clipboard, man-L, man-R), each of which consists of four fixated objects. Intuitively, x is more turbulent than y and z, as in x all fixated objects are different and each is inspected only once. When we compare y and z, we find that they have the same number of uniquely fixated objects, but that in z more objects are fixated before looking back at man-R. Therefore, z can be considered more turbulent than y, as more events have occurred before man-R is inspected again.

A combinatorial implication of this reasoning is that more distinct subsequences, denoted \(\phi (x)\), can be generated from z than from y. In fact, when computing the number of distinct subsequences for the three scan patterns, we find that \(\phi (x) = 16 > \phi (z) = 15 > \phi (y) = 14\) (see Note 4). Each state of a sequence, however, is often associated with a certain duration. Let’s assume man-R is fixated for 200 ms before looking at another object. If we include fixation durations on x, we obtain something like man-R/200, man-L/200, clipboard/200, window/200, and turbulence increases as the variance of durations across states decreases. So, x has a relatively high turbulence (a variance of 0 across states), compared to a scan pattern such as man-R/400, man-L/50, clipboard/75, man-L/50, where the duration is concentrated on a single state.
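These counts can be checked directly. The sketch below, a standard dynamic-programming count of distinct subsequences (including the empty subsequence), is written in base R as an illustration rather than as the code used in the paper; it reproduces \(\phi (x) = 16\), \(\phi (z) = 15\) and \(\phi (y) = 14\).

    # Minimal sketch: count the distinct subsequences (including the empty one)
    # of a categorical sequence with a standard dynamic-programming recurrence.
    phi <- function(x) {
      counts <- 1                  # distinct subsequences of the empty prefix
      last   <- list()             # count recorded before each symbol's previous occurrence
      for (s in x) {
        new_counts <- 2 * counts
        if (!is.null(last[[s]])) new_counts <- new_counts - last[[s]]
        last[[s]] <- counts
        counts    <- new_counts
      }
      counts
    }

    x <- c("man-R", "man-L", "clipboard", "window")
    y <- c("man-R", "clipboard", "man-R", "man-L")
    z <- c("man-R", "clipboard", "man-L", "man-R")

    sapply(list(x = x, y = y, z = z), phi)   # returns x = 16, y = 14, z = 15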

These intuitions can be formalized by defining the turbulence of a sequence as (Elzinga and Liefbroer 2007):

$$\begin{aligned} T(x) = \log _{2}\bigg (\phi (x) \frac{s_{t,\mathrm{max}}^2(x) + 1}{s_{t}^2 (x) + 1} \bigg ) \end{aligned}$$
(2)

where \(\phi (x)\) denotes the number of distinct subsequences, \(s_{t}^2\) is the variance of the state durations, and \(s_{t,\mathrm{max}}^2\) is the maximum of that variance given the total duration of the sequence, calculated as \(s_{t,\mathrm{max}}^2 = (n - 1) (1 - \bar{t})^2\), with n denoting the number of states and \(\bar{t}\) the average state duration of the sequence x. To compute turbulence for the analyses in the main text, we used the R package TraMineR, a toolbox developed by Gabadinho et al. (2011) for categorical sequence analysis.
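For illustration, Eq. (2) can be coded up directly. The sketch below (base R, reusing the phi() function from the previous sketch) computes T(x) for a scan pattern annotated with fixation durations; the scan pattern and durations are invented, and the reported analyses were run with TraMineR rather than with this code. Note that var() is the sample variance, which is an assumption about an implementation detail not specified above.

    # Minimal sketch of Eq. (2): turbulence of a scan pattern with state durations.
    # Relies on the phi() function defined in the previous sketch.
    turbulence <- function(states, durations) {
      n      <- length(states)
      t_bar  <- mean(durations)                  # average state duration
      s2     <- var(durations)                   # variance of the state durations
      if (is.na(s2)) s2 <- 0                     # a single-state sequence has no variance
      s2_max <- (n - 1) * (1 - t_bar)^2          # maximum variance given total duration
      log2(phi(states) * (s2_max + 1) / (s2 + 1))
    }

    # Hypothetical scan pattern with fixation durations (illustrative values only).
    states    <- c("man-R", "man-L", "clipboard", "window")
    durations <- c(200, 200, 200, 200)
    turbulence(states, durations)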

Appendix 3: Distance from mentioned, but not fixated, objects

Since the distance of an object from the current fixation is known to have processing implications (Nelson and Loftus 1980), we performed an additional analysis in which we looked at the fixation closest to the referent when it is mentioned but not fixated, and calculated the distance of this fixation from the object centroid. We find that, on average, fixations are at \(11.65 \pm 5.04\) degrees of visual angle from the object centroid. This indicates that peripheral vision is sufficient to identify and select the object for mentioning: for parafoveal processing, the object would have to lie within 4–5 degrees of visual angle of the fixation. Moreover, a mixed model analysis reveals that the distance varies with the amount of visual clutter and the animacy of the target. In particular, an animate object tends to have a smaller visual distance (\({\beta }_{{\rm Animate}} = -3.12;\,p < 0.05\)), especially in a scene with minimal clutter (\({\beta }_{{\rm Animate:Minimal}} = -6.25;\,p < 0.01\)). This indicates that foveating the object is not an obligatory element of mentioning.
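For readers who wish to reproduce this kind of measurement, the sketch below shows one standard way to convert the pixel distance between a fixation and an object centroid into degrees of visual angle. The screen geometry (resolution, physical width, viewing distance) and the coordinates are hypothetical and are not taken from the paper's apparatus.

    # Minimal sketch: fixation-to-centroid distance in degrees of visual angle.
    # Screen geometry and coordinates are hypothetical, for illustration only.
    px_per_cm <- 1024 / 38            # horizontal resolution / physical screen width (cm)
    view_dist <- 60                   # viewing distance in cm

    visual_angle_deg <- function(fix_xy, centroid_xy) {
      d_px <- sqrt(sum((fix_xy - centroid_xy)^2))   # Euclidean distance in pixels
      d_cm <- d_px / px_per_cm                      # distance on the screen in cm
      2 * atan(d_cm / (2 * view_dist)) * 180 / pi   # subtended angle in degrees
    }

    visual_angle_deg(c(512, 384), c(800, 600))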


Cite this article

Coco, M.I., Keller, F. Integrating mechanisms of visual guidance in naturalistic language production. Cogn Process 16, 131–150 (2015). https://doi.org/10.1007/s10339-014-0642-0
