A focus on language and linguistic experience in theories of language comprehension

Psycholinguistic approaches to sentence comprehension have investigated what mechanisms and what representations we must assume to accommodate how comprehenders recover an interpretation from language input in real time. Early theories of real-time comprehension (Frazier and Fodor 1979), for instance, described a ‘sausage machine’ (p. 291) for which tree diagrams with abstract nodes represented words and syntactic relations between them. The authors proposed a two-stage mechanism of language comprehension, with a focus on building syntactic structure (called a ‘parse’). The postulated stages can be viewed as a mechanism for how structure is built (e.g., without permitting influence from world knowledge on initial syntactic analysis). An initial stage assigned lexical and phrasal nodes (in a tree diagram) to approximately six words within a sentence (Frazier and Fodor 1979). A second stage added nodes in the tree to build a complete sentence (see Bornkessel and Schlesewsky 2006; Friederici 2002, for neuroscientific accounts of related hierarchical models). If needed, the initial structure and interpretation could, at a second stage, be revised.

These early parsing accounts did not consider further contextual information such as objects, actions, and events in the world and their potential impact on comprehension. Nor did they focus on a model of the world and typical situations such as ordering food in a restaurant as a context for comprehension. Instead they focused on principles regarding syntactic analysis (Frazier and Fodor 1979; Kimball 1973) that were oriented towards maximizing syntactic simplicity. One such principle, for instance, postulated that a phrase was closed as soon as possible [unless the next parsed node was an “immediate constituent” of the phrase, p. 36 in Kimball (1973)]. This accommodates that comprehenders seem to interpret they knew the girl as a simple sentence instead of assuming that it might have a different structure as in they knew the girl was in the closet (p. 36).

An alternative view of language processing mechanisms that assumed rapid feedback from higher-order expectations and world knowledge on initial syntactic and semantic processes received support early on from so-called ‘shadowing’ tasks. In these, participants were trained to rapidly repeat (‘shadow’) a speaker’s language production (Marslen-Wilson 1973) at very short lags (254–278 ms for the best shadowers). Participants sometimes made errors in that task; the rationale was that if shadowers’ errors violated the speaker’s preceding syntax or semantics at short lags of shadowing, then one could assume that shadowers are not influenced by higher-level content during ongoing comprehension and production. Crucially, when shadowers made errors, these virtually never conflicted syntactically or semantically with the preceding linguistic context. On the basis of this finding, Marslen-Wilson (1973) argued that ‘higher-order linguistic structure’ (Marslen-Wilson 1973, p. 522f.) is available during shadowing (comprehension and production) at very short lags. The shadowing findings contradicted postulates of ‘informational encapsulation’ in the philosophy of language (Fodor 1983, p. 71). These had assumed that brain systems dealing with incoming language had no immediate access to high-level expectations or to ongoing visual perception (Coltheart 1999, p. 117). However, animacy (Trueswell et al. 1994), plausibility (Garnsey et al. 1997), thematic fit (e.g., a cop being a good agent for arresting a criminal; McRae et al. 1998), and object arrangements (Tanenhaus et al. 1995; Spivey et al. 2002) all influenced the resolution of local structural ambiguity within a few hundred milliseconds. The reach of lexical and associated world knowledge, as well as of visual perception, into the resolution of temporary structural ambiguity boosted accounts of sentence processing that accorded a central role to the lexicon (MacDonald et al. 1994) and supported immediate interaction of syntactic parsing also with visual perception.

While some accounts accorded a central role to the lexicon and lexical information in structural ambiguity resolution, others postulated that parsing decisions at structural ambiguities are modulated not just by lexical information (Mitchell et al. 1995). For instance, in direct object/subject complement ambiguities such as The athlete realized her goals (dir. obj.)/her shoes were out of reach (subject complement)…, the verb biases towards the complement analysis. If lexical experience determined the initial parse, then participants should be biased towards a subject complement analysis. However, Pickering and Traxler (1995) observed more frequent regression when the object was implausible (shoes) than plausible (goals). The authors interpreted this result as suggesting that readers initially adopted an object analysis and that semantic plausibility rapidly modulated that analysis. Mitchell et al. (1995) argued that readers must use coarse-grained rather than fine-grained lexical experience. By coarse-grained they seem to mean the average “usage of the different structural forms over all verbs which share the same ambiguity”, which they expect to reflect a strong bias towards the direct object interpretation (p. 484). This supports early psycholinguistic claims—viz. that parsing principles guide ambiguity resolution—but motivates attachment decisions from experience with language instead of syntactic simplicity. Thus, both coarse-grained and item-specific experience appear to contribute substantially to real-time language comprehension.

The role of language experience also comes to the fore in research that argued comprehenders anticipate upcoming content (Altmann and Kamide 1999; Kutas and Hillyard 1984). Kutas and Hillyard (1984) provided evidence for the view that brain potentials reflected the expectancy of a word and of semantic associations during reading. The amplitude of a negative-going brain potential around 400 ms post stimulus, the so-called N400 (for a review see Kutas and Federmeier 2011), was smaller the more participants expected a sentence-final word (as measured by cloze probability). In psycholinguistics, Clifton et al. (1984) reported that readers relied on lexical (verb) information to anticipate noun phrases and structure (filler-gap relations). Predicting the expectedness of words was taken up in computational modeling via an information-theoretic notion dubbed ‘surprisal’ (Hale 2003). Surprisal captured, via negative log probabilities, how unexpected incoming information was. From such expectation measures researchers derived predictions for processing difficulty and, relatedly, measures such as reading times and event-related brain potentials (Frank 2013; Frank et al. 2015; Delogu et al. 2017; Demberg and Keller 2008).
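To make the notion of surprisal concrete, the following minimal sketch computes it as the negative base-2 logarithm of a word's probability in its context, so that less expected words receive higher values. The function name and the example probabilities are hypothetical, and treating a cloze probability as an estimate of a word's contextual probability is an assumption made here for illustration only.

```python
import math


def surprisal(probability: float) -> float:
    """Surprisal in bits: the negative log (base 2) of a word's probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must lie in the interval (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities for two sentence-final words (illustration only).
expected_word = 0.5      # a highly expected continuation
unexpected_word = 0.125  # a less expected continuation

print(surprisal(expected_word))    # lower surprisal
print(surprisal(unexpected_word))  # higher surprisal
```

On this measure, halving a word's probability adds one bit of surprisal, which is one way of expressing the graded link between expectation and processing difficulty that the cited modeling work builds on.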

The accounts discussed thus far were motivated by accommodating initial structure-building decisions and the incremental or anticipatory nature of language processing informed by linguistic and world knowledge, and visual context. The focus in these theories was further on accommodating comprehension “on average” with little attention to potential individual differences, with the exception of, for instance, studies on language deficits (e.g., Grossman et al. 1992, 1993), comprehension skill (King and Kutas 1995), and cross-linguistic variation (Bates et al. 1987; MacWhinney et al. 1984). Empirically, object contexts in psycholinguistics were examined as a factor that could rapidly modulate parsing decisions, providing evidence against strictly informationally encapsulated processing (see Chambers et al. 2002, 2004; Tanenhaus et al. 1995).

But fleshing out the mechanistic and representational implications of integrating rich visual context (including representations of objects, action, events, and speakers, their gaze, facial expressions, and appearance) was not the focus of the dominant psycholinguistic theories of language comprehension from the late 1970s to the late 1990s. In fact, approaches that investigated how object representations gleaned from a picture are reconciled with language (e.g., in picture-sentence verification paradigms and associated formalized models) were criticized as not revealing comprehension (Tanenhaus et al. 1976). In the meantime, however, empirical research has provided more and more evidence that we must not only consider object and event contexts (and their associated representations) for developing theories of real-time language comprehension but also (representations of) the listener and his or her characteristics such as age or educational background, among others (see Münster and Knoeferle 2018; Huettig et al. 2018; Mishra et al. 2012).

Below I first discuss (representational and mechanistic) assumptions of context-focused theoretical approaches to comprehension. From these theoretical approaches, we can identify key processes and associated mechanisms in comprehension, viz. interpretation and structuring, attentional grounding in context (and associated representations), and verification of the interpretation against the context(ual representations). I discuss variation in these key processes and specifically in contextual grounding as a function of comprehender characteristics (“Variation in expectation-based comprehension and in context effects by comprehender characteristics”) and world-language relations (“Differentiating grounding: do different world-language relations elicit distinct context effects?”). From that discussion, I derive three principles for predicting variability in context effects (“Conclusions: predicting (visual) context effects”).

Context-focused approaches to comprehension and associated representational assumptions

Early contextualized approaches to language processing have specified mechanistic and representational assumptions, but these have failed to influence early models of parsing and linguistic knowledge (Frazier and Fodor 1979; MacDonald et al. 1994; Mitchell et al. 1995). These contextualized approaches can be grouped into approaches that focus on (1) the representations and mechanisms (verification, semantic activation, and listening-looking interaction) of language processing in a physical or object/event context, and (2) a mental model of a (discourse) context, obtained via inferential processing, drawing on script and event/body knowledge.

Verification, lexical-semantic activation, and listening-looking interaction

To examine language comprehension in object contexts, one paradigm had participants verify a sentence in the context of a picture (e.g., Clark and Chase 1972; Gough 1965). Incongruence (vs. congruence) of the sentence against the picture slowed verification latencies. The latter finding was accommodated in the Constituent-Comparison Model (Carpenter and Just 1975) of sentence comprehension via a serial comparison of picture-based and sentence-based representations (e.g., The dots are red, tracking the affirmative nature of the sentence and the picture representations: [AFFIRMATIVE, (RED, DOTS)], Picture of red dots: (RED, DOTS)). Corresponding constituents were retrieved and compared, pair by pair. The ordering of these comparisons was determined by the structure of the representations and inner propositions were compared first. The number of retrieve-and-compare steps determined the predicted verification latencies. The model was criticized (Tanenhaus et al. 1976) as not describing the processes at work while participants derive sentence representations but rather post-perceptual processes of verifying a sentence against an already-perceived picture.
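The serial retrieve-and-compare idea can be sketched as a small step counter. This is a simplified illustration, not the model as specified by Carpenter and Just (1975): the names are hypothetical, the picture is assumed to carry an implicit affirmative marker, and the handling of mismatches (tag the mismatching pair, flip a response index, and restart from the innermost pair) is an assumption that the text above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    """Polarity marker plus inner proposition, e.g. [AFFIRMATIVE, (RED, DOTS)]."""
    polarity: str        # "AFFIRMATIVE" or "NEGATIVE"
    proposition: tuple   # inner proposition, e.g. ("RED", "DOTS")


def verify(sentence: Representation, picture: Representation):
    """Return (number of retrieve-and-compare steps, truth value)."""
    # Constituent pairs, ordered so that inner propositions are compared first.
    pairs = [
        (sentence.proposition, picture.proposition),
        (sentence.polarity, picture.polarity),
    ]
    tagged = set()   # indices of pairs already found to mismatch
    truth = True     # response index, flipped on every newly found mismatch
    steps = 0
    i = 0
    while i < len(pairs):
        steps += 1
        left, right = pairs[i]
        if left == right or i in tagged:
            i += 1               # pair counts as matching; move outward
        else:
            tagged.add(i)        # ASSUMPTION: tag the mismatching pair,
            truth = not truth    # flip the response index,
            i = 0                # and restart from the innermost pair
    return steps, truth


# ASSUMPTION: pictures are coded with an implicit affirmative marker.
red_dots = Representation("AFFIRMATIVE", ("RED", "DOTS"))
black_dots = Representation("AFFIRMATIVE", ("BLACK", "DOTS"))
dots_are_red = Representation("AFFIRMATIVE", ("RED", "DOTS"))
dots_are_not_red = Representation("NEGATIVE", ("RED", "DOTS"))

for sentence, picture in [
    (dots_are_red, red_dots),        # true affirmative
    (dots_are_red, black_dots),      # false affirmative
    (dots_are_not_red, red_dots),    # false negative
    (dots_are_not_red, black_dots),  # true negative
]:
    print(verify(sentence, picture))
```

Under these assumptions the step count increases from true affirmatives through false affirmatives and false negatives to true negatives, which illustrates how a count of comparison steps can be turned into ordinal predictions for verification latencies.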

That criticism fell short, however (Knoeferle et al. 2011) since incongruence in picture and sentence rapidly modulated comprehension also at the word that mismatched the visual context (Knoeferle and Crocker 2005) and not only at the point when participants gave the verification response via button press, post-sentence. For instance, when participants verified whether a word (e.g., red in Touched the small blue circle and the large red square) matched or mismatched the color of a depicted object touched by a hand (e.g., a circle bearing the label ‘blue’), participants’ event-related brain potentials (mean amplitude negativities) from around 200 ms and peaking between approximately 450 and 750 ms increased to mismatches (vs. matches) (D’Arcy and Connolly 1999).

On the basis of these and other results, it has been argued (Knoeferle et al. 2011) that verification of language against context seems to be a fundamental part of language processing in context. However, despite its occasional use in language comprehension research (Goolkasian 1996; Singer 2006; Underwood et al. 2004) insights from the verification task have had minimal impact on psycholinguistic theories of online sentence comprehension, but see Knoeferle et al. (2014).

Verification is one important mechanism; but further, inferential and semantic interpretation processes are necessary to accommodate language comprehension in context. Indeed, empirical research even in the 1970s had begun to examine how words and pictures are processed to obtain insight into their meaning representation. In one study by Potter and Faulconer (1975), participants saw words or object drawings, and read the word or named the object in a first condition pair. In further conditions, the experimenter named a category and the participant responded (‘yes’/‘no’) depending on whether the next word/object drawing belonged to that category. For object drawings (vs. words), naming latencies were longer but categorization latencies were shorter. Drawings were also categorized faster than they were named. The conclusion drawn from this finding was that the category knowledge of an object is linked to an abstract object concept (instead of its name/appearance); for recent related evidence in the case of object color, see Amsel et al. (2014) and Connell (2007).

Another line of research assessed representational issues via semantic ‘priming’. In priming, stimuli are presented in pairs (a ‘prime’ stimulus followed by a ‘target’ stimulus); in Sperber et al. (1979), these were either semantically related or unrelated. Semantic relatedness facilitated naming/reading of pictures and words respectively and interacted with stimulus quality (in vs. out of focus photographs). Priming occurred whether prime and target were both words, both pictures, or mixed (e.g., picture–word), a finding that was interpreted as pictures and words accessing semantic information via a common conceptual store. Priming was, however, more pronounced in picture–picture than word–word or mixed pairs, a finding that Sperber et al. (1979) traced to overlap in pictorial representations of objects from the same semantic category.

Semantic priming is active also as a sentence-level comprehension mechanism in pictorial contexts (Ganis et al. 1996), and could be viewed as related to verification. Ganis et al. (1996) had participants read sentences that were semantically constraining and in which the final word (or picture) was in agreement with the sentence or mismatched (e.g., The old man lay on the grass and lit his pipe/carrot). From the findings of this study (earlier peaking and topographically distinct N400 effects to pictures vs. words), the authors concluded that pictures and words had at least partially distinct representations. Beyond representational insights regarding words and objects, priming research has impacted psycholinguistic research and theories, for instance, in production (Bock 1986), but also in comprehension (Nichol and Pickering 1993), though often with linguistic context as prime and target, see (Pickering and Branigan 1999; Pickering and Garrod 2004).

By contrast, other context-focused psycholinguistic research on language processing failed to influence psycholinguistic thinking of the 1970s and 80s (Cooper 1974; Just and Carpenter 1976). Cooper (1974) recorded eye movements to object drawings during spoken discourse comprehension and interpreted them as reflecting comprehension processes (p. 104). The analysis of the collected gaze record contributed three important findings. First, Cooper discovered that listeners inspected referents upon their mention (e.g., a depiction of a zebra upon hearing zebra), suggesting rapid interaction of auditory comprehension with visual attention. Moreover, the analyses of Cooper’s eye-tracking data provided evidence for the role of expectations in comprehension: Listeners inspected, for instance, the depiction of a zebra and of a lion after they had heard While on a photographic safari in Africa. Neither the zebra nor the lion had been mentioned, suggesting that the linguistic context and mention of a safari and of Africa elicited expectations (via world knowledge) of as-yet-unmentioned referents. Finally, Cooper (1974) observed that picture-language relations such as between a zebra and the referential expression zebra elicited a much higher number of concurrent fixations than other relations (such as between the modifier photographic and the depiction of a camera). This latter finding can be viewed as reflecting a prominent status of referential relations in language processing.

Cooper recognized the potential of the eye-tracking method: “Because the eye-movement response system in the presence of ongoing heard language is at times characterized by a high degree of linguistic sensitivity and small latencies (including a built-in anticipatory characteristic), the present technique of correlating the visual selection of appropriate targets with concurrently heard words could potentially be applied to study in great detail the manner in which people interpret and process spoken language in the context of their contemporary visual field.” (p. 106). He also recommended the method for examining referential ambiguity resolution, speech perception, and memory. Relevant ensuing research has indeed followed these suggestions, reporting, among others, effects of object size, contrast, and of action affordances on the resolution of referential and structural ambiguity (Chambers et al. 2004; Sedivy et al. 1999; Tanenhaus et al. 1995).

The inferential construction of mental models and grounding via bodily schemas

While verification mechanisms, semantic associations, and listening-looking interactions are important for comprehension in context, so are arguably further inferential processes and models of the world. One account by Johnson-Laird (1981) included both of these. It assumed that utterances are conveyed to representations and these can (but need not always) serve as the basis for the inferential construction of a mental model of the world. The latter process enables “one person to have another’s experience of the world by proxy: instead of direct apprehension of a state of affairs, the listener constructs a model of them based on a speaker’s remarks” (1981, p. 139). Utterance interpretation in this proposal depended on script knowledge, viz. of normal event sequences such as lunching at a restaurant (e.g., Schank and Abelson 1975). Understanding non-stereotypical events (e.g., Kafka’s Trial, Johnson-Laird 1981, p. 154) depended on inferences, on coherence and on plausibility, with recursion as a mechanism for manipulating the world model. Much like the verification research, however, situation-model approaches had negligible influence on theories of sentence comprehension in the 80s and early 90s. They have nonetheless influenced empirical research and theory formation since, as discussed in more detail below (for a review of embodiment and situation models see Meteyard et al. 2012; Zwaan 2016).

Another line of research that brings grammatical representations into close proximity with real-world perception and action comprises some versions of construction grammar (‘CG’). Central work in CG has assessed how constructions generalize (Goldberg 1995, 2006; Langacker 1987) but a few approaches have related semantic structure to visual or motor representations, for instance, Fluid Construction Grammar (Trijp et al. 2012), Embodied Construction Grammar (ECG), and Template Construction Grammar (Arbib and Lee 2008). As one example, ECG assumes embodied cognitive ‘schemas’, i.e., representations derived from perceptual and motor experience (Bergen and Chang 2005). ECG focuses on the image schemas of Lakoff and Johnson (1980) and motor schemas (X-schemas). These representations bring language representations in contact with the bodily-experienced world.

ECG assumes that semantic representations are grounded in comprehenders’ perceptual and motor systems. Language comprehension implicates both analysis (what constructions must be built) and associated mental simulation (e.g., of a tossing action). ECG envisages incrementality (Bergen and Chang 2005) but does not specify the activation of constructions, word-by-word (but see Bryant (2008) on reading-time data). To interpret Mary tossed me a bottle in ECG, reference from Mary to the referent Mary is resolved via a referent schema and tossed unlocks a predicate schema (the subject codes the cause of the transfer, the first post-verbal object the recipient, and the second post-verbal object the transferred object), as well as a transfer schema (specifying an agent: Mary; a recipient: me; and a theme: the bottle). These schemas are integrated into the interpretation, evoking an active-ditransitive schema with context verifying its constraints. The verb toss activates a caused-motion schema and a fly schema, specifying low force. X-schemas capture sequential action elements. But it remains open when an X-schema for tossing is activated (e.g., at tossed, or even earlier at Mary depending on whether Mary is inactive or tossing already).
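The role bindings described for Mary tossed me a bottle can be pictured as a simple data structure. The sketch below is purely illustrative and is not ECG's own formalism: the class and function names are hypothetical, and it covers only the mapping from the constituents of an active ditransitive to the agent, recipient, and theme roles of a transfer schema, leaving out simulation, force parameters, and X-schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferSchema:
    """Hypothetical stand-in for a transfer schema with three roles."""
    agent: str      # cause of the transfer
    recipient: str  # who receives the transferred object
    theme: str      # the transferred object


def bind_active_ditransitive(subject: str, first_object: str,
                             second_object: str) -> TransferSchema:
    """Map constituents of an active ditransitive onto transfer roles.

    Subject -> agent, first post-verbal object -> recipient,
    second post-verbal object -> theme.
    """
    return TransferSchema(agent=subject,
                          recipient=first_object,
                          theme=second_object)


# 'Mary tossed me a bottle'
schema = bind_active_ditransitive("Mary", "me", "a bottle")
print(schema)
```

A fuller sketch would have to state when each binding becomes available as the sentence unfolds, which is exactly the point the text identifies as unspecified.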

In summary, by the early 2000s, it was clear that key processes in context-sensitive language comprehension include minimally the incremental interpretation of language, rapidly informed by linguistic and world knowledge; its reconciliation with the world (‘grounding’), mediated via (visual) attention to (depicted) objects in the context; and verification of the interpretation in relation to objects and events. Representationally, it had become clear that mental representations for grounding must include objects and their properties, action events and experience-based event knowledge (e.g. object affordances, scripts, bodily schemas). In addition, research in the cognitive sciences has examined, for instance, how authorship (social speaker identity) can constrain the interpretation of words (Fitneva and Spivey 2005), or how linguistic and para-linguistic information contribute to establishing reference and common ground between a speaker and a listener (e.g., Clark and Brennan 1991). Thus, by the early 2000s, empirically, it began to be clear that social aspects and grounding, too, play an important role for language processing. Below I first outline to what extent the contextualized approaches have impacted psycholinguistic theorizing; I enumerate key processes in situated language comprehension and discuss their variability (“Three key processing steps and associated variability”). Then I focus on one process—grounding and context effects—and discuss comprehender-based variation in context effects (“Variation in expectation-based comprehension and in context effects by comprehender characteristics”). Finally, I discuss distinct world-language relations as one source of variability in context effects (“Differentiating grounding: do different world-language relations elicit distinct context effects?”) and derive from this discussion predictions of context effects (“Conclusions: predicting (visual) context effects”).

Towards predicting context effects

Given the insights from (only a small sample of) context-focused approaches to language comprehension, one may ask to what extent these approaches were integrated with, or have influenced, earlier theories of sentence comprehension (see “A focus on language and linguistic experience in theories of language comprehension”). Verification research and the construction of mental world models received little attention in early psycholinguistic theory formation, which focused rather on parsing principles, the modularity debate, and, gradually, a shift from principle-based to lexicon- and expectation-based approaches to comprehension. Cooper’s study and its evidence for rapid listening-looking interaction were acknowledged in publications on the eye-mind hypothesis (Just and Carpenter 1976) but much psycholinguistic theorizing in the late 1980s and early 1990s seemed unaware of Cooper’s findings and their implications. Conversely, the context-focused approaches to comprehension were (except for ECG) underspecified regarding syntactic structure building and semantic interpretation (in an incremental manner).

But each of the context-based strands of research from the 70s and 80s (verification, semantic activation, the inferential and script-based construction of mental models of the world, and real-time interaction of language comprehension with visual attention) has influenced recent psycholinguistic modeling research. Verification mechanisms have been integrated in sentence processing accounts (Knoeferle et al. 2014; Van Herten et al. 2006). Priming research has made its way into psycholinguistic theory in the form of ‘alignment’ between interlocutors (viz. that in dialogue interlocutors align their mental representations; Pickering and Garrod 2004). Alignment is postulated to occur at all linguistic levels, up to situation models. The construction of mental models has further influenced the ‘immersed experiencer model’ (Zwaan and Ross 2004; see Barsalou 1999, for a comprehensive overview of perceptual theories of cognition) and probabilistic neurocomputational models of expectation-based comprehension (Venhuizen et al. 2018). Research on the interaction of language comprehension with visual attention has been continued in speech perception models (Allopenna et al. 1998; Smith et al. 2017), in processing accounts of situated sentence comprehension (Altmann and Kamide 2007; Huettig et al. 2018; Knoeferle and Crocker 2006, 2007), and in computational models of visual attention and situated language comprehension (Crocker et al. 2010; Kukona and Tabor 2011; Mayberry et al. 2009; Roy and Mukherjee 2005).

Three key processing steps and associated variability

From the discussed theoretical approaches, we can identify key processes and associated mechanisms in comprehension in rich visual context, without which comprehenders would not be able to fully comprehend the meaning of utterances.

  • Building structure and assigning an interpretation (by interpretation we mean a mental representation of sentence meaning, informed by linguistic, and world knowledge, and the immediate linguistic and non-linguistic context)

  • Grounding the interpretation in a model of the world/visual context via (internal/visual) attention (by ‘model’ we mean mental representations of the immediate non-linguistic context; see Knoeferle and Crocker 2007; Knoeferle et al. 2014)

  • Verifying the structure and interpretation against representations of the world/visual context and revising as necessary

Psycholinguists have argued that each of these three processes (structuring and interpretation, grounding, and verification) can vary and apply to different extents regarding depth of interpretation, of grounding and of verification. By contrast, other variability such as that derived from comprehender characteristics and/or world-language relations has only recently been investigated (see “Variation in expectation-based comprehension and in context effects by comprehender characteristics” and “Differentiating grounding: do different world-language relations elicit distinct context effects?”).

Depths of structuring and interpretation. Regarding structure building and interpretation, Frazier and Fodor (1979) assumed in-depth structure building, as did MacDonald et al. (1994) and McRae et al. (1998). By contrast, Ferreira et al. (2002) postulated ‘good-enough’ representations, meaning that comprehenders do not always reconstruct a correct sentence representation. For instance, for While Bill hunted the deer ran into the woods, comprehenders tend to initially attach deer as the direct object of hunted; at sentence end, however, it should become clear that hunted is used intransitively (Christianson et al. 2001, p. 373). Participants in the experiment by Christianson et al. (2001), however, seemed to retain the initial mis-attachment. When they were asked Did Bill hunt the deer? (see Christianson et al. 2001), they responded more often incorrectly that Bill hunted the deer when a sentence initially permitted an object attachment of deer to Bill hunted than when it did not (e.g., While Bill hunted the pheasant the deer ran into the woods). Possibly, late closure effects caused the garden-path representations to linger in working memory with comprehenders constructing perhaps a temporally ordered representation of Bill hunting the deer and then the deer running into the woods (at least this would be a situation model account for the observed findings). For related evidence see research by Kukona and Tabor (2011) and Tabor et al. (2004).

Gradients of grounding. When considering grounding, Johnson-Laird (1981) postulated that a situation model is only sometimes constructed during comprehension; by contrast, with regard to a theory of concepts (not comprehension), Gallese and Lakoff (2005) postulated “that concepts of a wide variety make direct use of the sensory-motor circuitry of the brain” (see Meteyard et al. 2012; Zwaan 2014, for a review of weakly to strongly embodied approaches). Since concepts are accessed during comprehension, Gallese’s and Lakoff’s position would seem to suggest a strongly embodied view of comprehension (e.g., Glenberg and Kaschak 2002, for relevant behavioral evidence). One key notion for grounding is to bring in incrementality, much like in ECG (Bergen and Chang 2005), and in the linguistic focusing hypothesis (Taylor and Zwaan 2008), which predicts that “engagement of the motor system during language comprehension is controlled by the focus of the linguistic message” (Zwaan et al. 2010, p. 143).

Extent of verification. Some have argued that verification is ‘part and parcel’ (Knoeferle et al. 2011, p. 505) of language processing. Verification effects in event-related brain potentials to verb-action mismatches emerged prior to any explicit verification, and thus at a point in time when verification would not yet have been necessary; but that evidence came from a study in which participants explicitly assessed sentence veracity via a button press at sentence end. Other studies used comprehension tasks and have reported that comprehenders failed to notice incongruence between a target sentence and their knowledge. Erickson and Mattson (1981) reported the so-called ‘Moses illusion’, which we could characterize as a semantic version of good-enough processing. Participants read aloud a question such as How many animals of each kind did Moses take on the Ark? (Erickson and Mattson 1981, p. 540). They were instructed to answer the question but also informed that questions can contain errors and that they should indicate errors by saying ‘wrong’. In spite of this instruction, the Moses-Ark question was answered incorrectly on 81 percent of the trials (p. 543; participants answered two instead of ‘wrong’ and failed to clarify that the person on the ark was named Noah).

Below I discuss further variation in grounding/context effects as a function of comprehender characteristics and world-language relations (“Variation in expectation-based comprehension and in context effects by comprehender characteristics”). With regard to grounding and context effects, I outline predictions in the form of relative (not absolute) preferences of how representations of visual context affect comprehension (“Differentiating grounding: do different world-language relations elicit distinct context effects?” and “Conclusions: predicting (visual) context effects”).

Variation in expectation-based comprehension and in context effects by comprehender characteristics

Orthogonal to the depth of interpretation, of grounding, and of verification is the much-debated issue of the extent to which comprehension processes are merely incremental or rather expectation-driven (Huettig 2015; Lau et al. 2006; Pickering and Garrod 2013; Pickering and Gambi 2018). A postulate of strong expectation-based comprehension can be derived from surprisal theory concerning syntactic processes (Levy 2008). For instance, if a comprehender encounters a sentential subject, the expectation of seeing that constituent again in the same clause decreases following the logic that multiple same-type constituents rarely co-occur in a clause (see p. 1146, Levy 2008). A strong view of expectations is also compatible with evidence suggesting highly specific expectations of lexical elements following a constraining sentence context and determiner [a vs. an as in The boy went to the park and flew a/an kite/airplane (DeLong et al. 2005)]. Readers exhibited larger mean amplitude negativities when they encountered an than a, a finding which was attributed to comprehenders noticing an incongruence between the article an and their expectation of a plausible noun starting with a consonant, such as kite. DeLong et al. (2005) concluded that “predictions can be for specific phonological forms—words beginning with either vowels or consonants. In this sense, we propose that prediction can be highly specific, at least under some circumstances” (p. 1119f.). The N400 is, however, modulated by many further factors, among them plausibility and semantic relatedness (Nieuwland 2019); there is further ongoing debate as to how specific (word form or meaning) (Nieuwland et al. 2018) and how robust these expectation effects are (DeLong et al. 2017; Ito et al. 2017).
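The surprisal notion referred to above (Levy 2008) can be made concrete with a small numerical sketch. Surprisal is commonly defined as the negative log probability of a word given its preceding context, so that less expected words carry higher surprisal. The function name and the conditional probabilities below are purely illustrative assumptions and are not estimates from DeLong et al. (2005) or from any corpus.

```python
import math


def surprisal(prob: float) -> float:
    """Return surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prob)


# Hypothetical conditional probabilities for the noun following
# "The boy went to the park and flew a/an ..." (illustrative values only).
next_noun_probs = {"kite": 0.5, "airplane": 0.125}

for noun, prob in next_noun_probs.items():
    # A less expected continuation yields a higher surprisal value.
    print(f"{noun}: {surprisal(prob):.2f} bits")
```

Under these assumed values, the less expected continuation (airplane) receives the higher surprisal; how such values map onto processing difficulty or ERP amplitudes is an empirical question that the sketch does not address.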

Federmeier (2007) explicitly points out variability in expectations and suggests that “… the brain uses context to predict features of likely upcoming items. However, although prediction seems important for comprehension, it also appears susceptible to age-related deterioration and can be associated with processing costs” (p. 491). That specific expectations are not always realized is underscored by recent replication failures of DeLong and colleagues’ finding (DeLong et al. 2017; Nieuwland et al. 2018; Ito et al. 2017), and by the insight that anticipatory processes (how quickly comprehenders visually anticipate an object before its mention given constraining linguistic context) vary by characteristics of the comprehender. Among these are the comprehender’s literacy (Huettig et al. 2018; Mishra et al. 2012), language production skills (Mani and Huettig 2012), working memory and processing speed (Huettig and Janse 2015), reading skills (Huettig and Brouwer 2015), and native (vs. non-native) language command (Ito et al. 2018). These findings extend to age-related variation in the time course of context effects (Münster 2016; Münster and Knoeferle 2018, for a related account). All of these studies share that they manipulate comprehender characteristics by comparing comprehension in different participant groups. Mishra et al. (2012), for instance, presented spoken sentences that did (vs. didn’t) contain highly constraining words (e.g., the adjective uncha/i, ‘high’, and a particle wala/i) to high and low literates. At issue was whether the semantic and syntactic (gender) constraints provided by uncha/i wala/i would prompt comprehenders to anticipate the target object (a door) more than distractor objects incompatible with the semantic and syntactic constraints. Only high, but not low, literates visually anticipated the target door before its mention, a finding that the authors attribute to differences in the mental representations and/or processes in low (vs. high) literates. In Mani and Huettig (2012), only children with a large (vs. smaller) vocabulary at age two engaged in such anticipatory gaze behavior. Ito et al. (2018) observed that speaker competence appeared to modulate the visual anticipation of phonological competitor objects. Only English native speakers but not second-language speakers (first language: Japanese) anticipated the referent of an English-language phonological competitor (clown) when the context constrained meaning towards a highly likely target object (cloud after The tourists expected rain when the sun went behind the…; but see Chambers and Cooke 2009, for failure to find related effects of language proficiency).

Differentiating grounding: do different world-language relations elicit distinct context effects?

The focus in context-based approaches to comprehension (see “Context-focused approaches to comprehension and associated representational assumptions”) has been on modeling comprehension with an average comprehender in mind, and without further differentiating the notion of contextual effects (unlike the distinction of syntactic from semantic processes and investigations of their interplay, see Hagoort 2003). Situation models contained different aspects of context (see also Jackendoff 2002), but it is important to also consider distinct world-language relations (e.g., referential vs. associative, referential vs. role relations), as well as speakers and their gaze and facial expressions as part of the visual context (Knoeferle and Guerra 2012).

That visual context effects are generic is not a claim that has been made explicitly in the literature. However, it appears to be assumed tacitly via labeling context effects generically as ‘visual context effects’ or ‘visual information’ (e.g., Tanenhaus et al. 1995; Huettig et al. 2018; Knoeferle and Crocker 2006; Sedivy et al. 1999). It has been shown that virtually any kind of information in the visual context rapidly modulates language comprehension. Among these kinds of information are object properties such as size (Sedivy et al. 1999), shape (Dahan and Tanenhaus 2005), and color (Huettig et al. 2018). Rapid effects on visual attention and language comprehension have also been reported for action depictions (Knoeferle et al. 2005, 2011), action affordances (Chambers et al. 2004), visual saliency (Coco and Keller 2015), speaker gaze (Hanna and Brennan 2007; Knoeferle and Kreysa 2012), and speaker facial emotions (Carminati and Knoeferle 2013). Parsimony would favor assuming a single mechanism underlying all of these effects. But recent evidence suggests we need a more fine-grained (and perhaps more principled) account of context effects to capture the observed variability. Some evidence for a distinction between context effects has been contributed, for instance, by visual-world research reported in Huettig and McQueen (2007): these studies observed a temporal cascade in the activation of phonological, shape, and semantic competitors during spoken comprehension that varied, moreover, as a function of the preview time.

Robust referential and action effects: a processing preference? This variability notwithstanding, action effects, for example, have been robustly observed in numerous studies, using both eye-tracking and event-related brain potentials as measures; the effects emerged as soon as the verb mediated an action (Knoeferle et al. 2005; Knoeferle and Crocker 2006, 2007; Knoeferle et al. 2008, 2011, 2014). For instance, when comprehenders encountered a verb (e.g., ‘paints’) in the structurally ambiguous spoken sentence ‘The princess (amb.: subj/obj?) paints …’ (example translated from German), they rapidly related it to an action event depicting a princess as being painted by a fencer, eliciting resolution of the temporary ambiguity towards the non-canonical structure with the princess as the object and patient of the event. Action event depiction influenced both syntactic disambiguation (Knoeferle et al. 2005, 2008) and semantic processes implicated in relating an action to the verb (Knoeferle et al. 2011). Processes of establishing reference also seem to be rapid and robust, as evidenced in the findings by Cooper (1974), Tanenhaus et al. (1995) and Sedivy et al. (1999).

Evidence for preferences in cue processing emerged when comparing the effects of referentially-mediated action events with effects that were mediated less directly (stereotypical associations between a verb and an agent). When a verb like bandages mediated a bandaging action and its agent (a chef) in a clipart scene (Den Touristen bandagiert gleich der … , ‘The tourist (obj) bandages soon the … (subj)’), comprehenders anticipated the chef, the agent of the referentially-mediated action, before his mention more than a competing stereotypical agent, a medic (Knoeferle and Crocker 2006).

These findings support the view that not all language-world relations are processed similarly (see also Cooper 1974). Different aspects of context can be distinguished based on their relationship to language (e.g., referential vs. non-referential world-language relations). Cooper (1974) reported that participants inspected objects that were mediated by language via referential and superordinate-token relations (e.g., forest-tree) more often than objects that were mediated otherwise by language (e.g., in The queen was in agony, agony was indirectly related to a picture of a queen, p. 88). Some support for a referential preference also comes from a comparison of looks to target objects [e.g., a piano when hearing piano in Eventually, the man agreed hesitantly but then he looked at the piano and appreciated that it was beautiful compared with an also-present trumpet (p. B25, Huettig and Altmann 2005)], suggesting referential relations trump non-referential ones. In another study (Kukona et al. 2014), participants listened to The boy will eat the white … while inspecting a display depicting a white cake, a brown cake, a white car, and a brown car. In that setting, participants inspected the white cake more than the brown cake, suggesting verb constraint and adjective constraint are rapidly integrated. However, they also inspected the white car more than the brown car. Thus local referential constraints influenced attention and comprehension (a white car was considered as a potential referent) even when verb-based anticipatory constraints should have excluded cars as referents.

Distinguishing referential from non-referential relations, and viewing the former as more central, enables us to derive predictions about how rapidly and strongly a particular aspect of context should modulate comprehension. That reasoning receives support from arguments about core (vs. peripheral) semantic relations. For instance, the word zebra identifies the animal zebra and conjures up memories (even if imperfect) of what zebras look like, and as such the relation between zebra and the animal zebra is core (see Rosch 1973, for related notions on category prototypicality). By contrast, feeding or the zookeeper might conjure up images of a zebra (especially if a zebra is nearby), but in contrast to referential relations, the relation would be more peripheral. Similarly, smell would be associated with a nose but the relation is less core than that between the word nose and its referent (see Duñabeitia et al. 2008, for related evidence). Referential cueing of information in visual context, on this account, would on average elicit stronger effects than non-referential mediation of visual context. This preference is likely not absolute, but we can use it as a test case for hypotheses about what governs rapid context effects during language comprehension.

Actions versus speaker gaze in comprehension and recall Further evidence that distinct language-world relations elicit distinct context effects comes from eye-tracking in visual contexts. For instance, Kreysa et al. (2018) compared verb-mediated depicted action effects with speaker gaze effects in a crossed, two-by-two design (actions cueing a target were present vs. absent; the speaker gazed at the target object or was obscured). Such a direct experimental comparison permits characterizing the relative contribution of distinct language-world relations to situated language comprehension.

Kreysa et al. (2018) monitored participants’ eye movements to characters on a computer display while the participants listened to transitive agent-action-patient (subject-verb-object) sentences. In two eye-tracking experiments, the authors manipulated whether speaker gaze, a depicted action, neither, or both of these were visible. Across the experiments either both cues were deictic world-language relations (Experiment 1) or only speaker gaze (Experiment 2) was. This design permitted teasing apart the contribution of deictic relations (both actions and gaze were used in a deictic manner in Experiment 1) from that of cue type (action vs. speaker gaze). The gaze record revealed that deictic cues (speaker gaze and a single deictic action) modulated comprehension rapidly, and significantly earlier than a non-deictic action depiction. Not only did these two information types (actions, speaker gaze) differ in the effects on real-time language comprehension (reflected in visual attention) but also in how they affected ensuing recall of sentence content (the object was recalled better when an action depiction had been present vs. absent). Together these findings suggest that information type and its relation to language characterize real-time situated language comprehension and recall of sentence content.

Interesting insights into potentially distinct effects of cues such as a speaker’s gaze and an arrow (that can point to something in a deictic fashion) come also from research by Staudte et al. (2014). They compared the effects of a speaker’s gaze and of an arrow in referential processing and observed that when the two cues were matched in visual precision, their effects were comparable (Staudte et al. 2014). This can be taken to support the view that a similar relation of (cues in) the world to conveyed language results in similar effects. It is noteworthy that cue effects on real-time comprehension are similar when precision (Staudte et al. 2014) or language-world relation (Kreysa et al. 2018) is matched. However, one might argue that cues in context differ at least sometimes in their appearance, timing, and relation to language and that we must derive a principled account of what language-world relation elicits what effect on language comprehension (its time course and representations). To the extent that world-language relations govern context effects, the prediction from these findings is that verb-mediated action effects should influence comprehension and recall of the interpretation content more than cues that are not referentially tied into the interpretation [e.g., speaker gaze, relations between the queen was in agony and a depiction of the queen, emotional facial expressions cued by adjectives such as happy, distance between objects related to semantic similarity (Guerra and Knoeferle 2014)]. Kukona et al. (2011) add evidence to the argument of distinct language-world relations modulating context effects: as participants in their studies listened to sentences such as Toby arrested the thief, they inspected both a picture of a thief and of a policeman to a similar extent and more than distractor objects during arrested the. Looks to the thief only exceeded looks to the policeman as the thief was processed. The latter result suggests that sometimes lexical-semantic associations may rapidly guide attention to objects. One interpretation of this gaze pattern is that both policeman and thief were active as part of an event, eliciting inspection; alternatively, as the authors suggested, local constraints imposed by the verb may modulate compositional semantic interpretation [if eye movements follow the unfolding syntactic structure and semantic–thematic interpretation, then strictly speaking, a policeman should not have received a noticeable amount of attention (he is unlikely the patient of an arresting action)].

When events contribute to referential vs. thematic role relation processes Context effects on language comprehension differ not only by world-language relation but also by what sub-processes in comprehension they inform. Different types of picture–sentence mismatches, for instance, implicated different brain responses as indexed by ERP effects (Knoeferle et al. 2014). Participants read sentences such as The gymnast punches the journalist, after they had seen a clipart depiction that either fully matched the sentence, mismatched in the action, the role-relations, or both of these. If context effects implicated a single mechanism only, then we should see comparable mismatch effects in comprehenders’ brain responses. However, this was not the case: When role relations mismatched, effects emerged at the subject noun (gymnast) as anterior negativities, preceding centro-parietal N400 action mismatch effects at punches. Post-verbally, these mismatch manipulations also yielded different effects, and correlated differently with a participant’s mean accuracy in the verification task, verbal working memory and visual–spatial scores, and differed in their interactions with stimulus onset asynchrony (see also Wassenaar and Hagoort 2007).

These results were interpreted as implicating more than a single mismatch mechanism during comprehension. When comparing these results to those by Knoeferle et al. (2008), it is striking that in the latter study, linguistic cues (case marking) and event depictions seemed to affect comprehension similarly (as reflected in similar ERP effects), perhaps because these different information types contributed to one and the same process (structural disambiguation). By contrast, in Knoeferle et al. (2014), the action and role relation depictions contributed to distinct comprehension processes (figuring out who-does-what-to-whom and grounding the action). From this we can derive a further testable prediction, namely that when distinct or similar cues contribute to the same comprehension process, they should affect comprehension in a similar manner; by contrast, if they inform distinct comprehension processes, their effects will show up as distinct.

Conclusions: predicting (visual) context effects

The present review identified three predictions concerning visual context effects in an attempt to derive a more principled account of context effects:

  1. P1: Comprehender characteristics (e.g., age, literacy, language skills) modulate context effects in language comprehension (“Variation in expectation-based comprehension and in context effects by comprehender characteristics”).

  2. P2: Referential cueing of information in visual context elicits, on average and by comparison, more rapid and stronger effects than non-referential mediation of visual context (“Differentiating grounding: do different world-language relations elicit distinct context effects?”); world-language relations thus predict context effects.

  3. P3: When distinct or similar cues contribute to the same comprehension process, they should affect comprehension in a similar manner; by contrast, if they inform distinct comprehension processes, their effects will show up as distinct (“Differentiating grounding: do different world-language relations elicit distinct context effects?”).

While language comprehension in visual context can clearly vary as a function of comprehender characteristics, the reviewed accounts do not explicitly integrate speaker and comprehender characteristics as a modulatory factor in language comprehension. Moreover, they are underspecified regarding how grounding and context effects influence comprehension. Much psycho- and neurolinguistic research has, however, assessed to what extent language comprehension depends on the perceived context, the body, and long-term linguistic knowledge of the comprehender (e.g., Barsalou 2008; Hasson et al. 2018; Holcomb et al. 1992; Kotz 2009; Osterhout et al. 2008; Zwaan and Ross 2004; Zwaan 2016). Extant approaches have constrained predictions of context effects in language processing via, among others,

  1. a hierarchy of physical world, an agent’s body, and context in the ‘TEST’ framework (Myachykov et al. 2014)

  2. levels of embodiment from demonstration to abstraction (Zwaan 2014, p. 229)

  3. the presence of working memory representations, capacity limits, and cognitive control (Huettig et al. 2011), and

  4. a referential mechanism and decay in working memory (Knoeferle and Crocker 2006, 2007), verification mechanisms (Knoeferle et al. 2014), and speaker/listener characteristics (Münster and Knoeferle 2018) in the (social) coordinated interplay account

Situation-based and TEST-based prediction of context effects

Zwaan (2014) differentiates five forms of language comprehension depending on overlap between the context and the communicative situation. In a first high-overlap situation, agents, objects, and actions are present in the situation (‘demonstrations’, e.g., cooking); in a second situation, objects are not immediately present such as when giving instructions to someone to fetch an object. Even less embedded are projections (mapping a past state of the context onto the present context, e.g., a builder explaining the changes to a house following renovation, p. 232), followed by displacements (a context unrelated to the present context), and abstractions (e.g., scientific articles). This situation-based approach would predict strong context effects in language comprehension for referential and instruction situations, with diminishing context effects when the referential context and the communicative situation are less overlapping. The approach would not be able to make finer-grained predictions such as whether within one such situation, referential relations influence language comprehension more than other world-language relations.

Myachykov et al. (2014) present a framework of embodied representations that ranges from invariant aspects of the world such as gravity (p. 446), through somewhat stable embodied aspects (e.g., of an agent’s state such as frontal vision, p. 446), to less stable aspects such as an agent’s interpretation of a specific environment given her goals. Myachykov et al. (2014) view these aspects as hierarchical, and differentially susceptible to learning. The framework can make predictions about context effects when comparing the relative influence of physical aspects of the world with that of aspects of the immediate context or an agent’s body. To accommodate distinct language-world relations, we would, in addition, need the context part of the framework to be specified further.

Working memory as a hub in language-mediated visual attention

Huettig et al. (2011) focus on accommodating listening-looking interactions, and posit that “first the visual display is processed up to a high level, including the creation of conceptual and linguistic representations. At this high level, these representations subsequently match up with those activated by the linguistic input, activation that then feeds back to the linked location” (p. 142 Huettig et al. 2011). Huettig et al. (2011) posit that working memory is central in accommodating language-mediated attention to visual (object) contexts.

From that postulate, they derive a number of predictions such as that working memory representations should suffice to guide attention to objects during comprehension and this may happen automatically. A further prediction was that active working memory representations are needed for attentional guidance, and that long-term memory can also guide attention. The discussed evidence supports the view that representations in working memory are necessary to guide attention but that long-term memory may also influence attention to things. For situated language comprehension, the authors argue that with enough time, comprehenders link semantic and phonological representations to an object and its location (see related accounts of visual search and language-mediated situation model construction Spivey et al. 2001; Spivey and Geng 2001). This approach might be able to accommodate distinct context effects—provided that they fall out of the activation of representations in working memory. To the extent that comprehender characteristics modulate context effects via working or long-term memory, the account might be able to predict comprehender-specific modulation of context effects (but may require further refinement).

Comprehender characteristics and processing preferences

In relation to these proposals, the present approach focuses on differentiating context effects in language comprehension via a characterization of more or less core world-language relations; on the effects of comprehender characteristics (see Münster and Knoeferle 2018, for more detail); and on a more principled account of (mostly visual) context effects, relying on the predictions (P1)–(P3). I have proposed to foreground distinct world-language relations and comprehender characteristics to predict (variability in) context effects. Comprehender characteristics have been shown to cause variability in context effects. Comprehender characteristics such as literacy would on that account set the probability of expecting a specific word such as ‘door’ to a low value for low literates, and to a high value for high literates, effectively capturing the absence of expectations and visual anticipation in low literates (see p. 7, Münster and Knoeferle 2018).

Concerning world-language relations, not every aspect of (visual) context influences language comprehension in the same manner. Indeed, how language mediates context information (e.g., via reference or other links) appears to predict (variability in) context effects. Processing preferences come into play in that some context effects (e.g., of verb-mediated actions and referential relations) seem particularly robust and pervasive: they emerge in reading and listening, across real-world, video-taped, and depicted environments, and across children, young, and older adults, suggesting generalizability beyond individual variation.

An open issue is how to rank these two predictors (comprehender characteristics and referential preferences)—would they be on equal footing, or rather hierarchical in their influence on real-time comprehension? The account by Münster and Knoeferle (2018) predicts that comprehender characteristics would modulate context effects [see Figure 1 in Münster and Knoeferle (2018), where comprehender characteristics can modulate the content of working memory and associated probabilistic expectations].