Rational over-specification in visually-situated comprehension and production

Tourtouri, Elli N.; Delogu, Francesca; Sikos, Les; Crocker, Matthew W.

doi:10.1007/s41809-019-00032-6

Research Paper
Open access
Published: 13 September 2019

Volume 3, pages 175–202, (2019)
Journal of Cultural Cognitive Science

Abstract

Contrary to the Gricean maxims of quantity (Grice, in: Cole, Morgan (eds) Syntax and semantics: speech acts, vol III, pp 41–58, Academic Press, New York, 1975), it has been repeatedly shown that speakers often include redundant information in their utterances (over-specifications). Previous research on referential communication has long debated whether this redundancy is the result of speaker-internal or addressee-oriented processes, while it is also unclear whether referential redundancy hinders or facilitates comprehension. We present an information-theoretic explanation for the use of over-specification in visually-situated communication, which quantifies the amount of uncertainty regarding the referent as entropy (Shannon in Bell Syst Tech J 5:10, https://doi.org/10.1002/j.1538-7305.1948.tb01338.x, 1948). Examining both the comprehension and production of over-specifications, we present evidence that (a) listeners’ processing is facilitated by the use of redundancy as well as by a greater reduction of uncertainty early on in the utterance, and (b) that at least for some speakers, listeners’ processing concerns influence their encoding of over-specifications: Speakers were more likely to use redundant adjectives when these adjectives reduced entropy to a higher degree than adjectives necessary for target identification.

Introduction

In situations that require interlocutors to collaborate in order to manipulate objects around them, the visual environment plays a crucial part in establishing reference and creating meaning. Thus, in order to successfully identify a target object in a visually-situated communication task, speakers need to mention precisely those properties that distinguish an intended referent from the other objects in the environment. For example, in a setting that contains a blue and a green ball, the expression ‘the ball’ fails to select a referent (it is under-specified), because the mentioned property (shape) is shared between the two objects. By contrast, the modified expression ‘the blue ball’ (which is minimally-specified) successfully establishes reference, by exploiting a contrast between the two objects. According to Grice’s maxims of quantity (Grice 1975, 1989), in order for communication to be successful speakers’ expressions should convey the minimal amount of information that is necessary—no less (first maxim), and crucially no more (second maxim)—unless speakers intend their listeners to infer some implicit meaning (implicature). Returning to the previous example, if the visual context contained a blue ball and a green mug, the expression ‘the blue ball’ would no longer be appropriate (it is over-specified) since the target referent could be disambiguated by mention of its shape alone. In this case, the adjective ‘blue’ would be redundant.

Research on reference production has nevertheless shown that speakers frequently use such over-specified expressions to refer to singleton objects (see Engelhardt et al. 2006 for a 10–60% estimation), thereby violating the second maxim of Quantity. Even though the Gricean theory does not make any predictions regarding the online processing of utterances that violate the maxims, it does have implications for the addressees (Grice 1989), in that they should expect speakers to observe the conversational principle and the maxims that follow from it. Redundant information may therefore engage addressees in unintended pragmatic inferencing (e.g., in the previous example, that a second ball is relevant but not visible to them), which might lead to comprehension difficulties (cf. Sedivy et al. 1999). This point raises an important question: What motivates speakers’ tendency to include redundant information in their utterances, if doing so may impede comprehension for their listeners? As we will see below, however, it is still under debate whether over-specification hinders comprehension (e.g., Davies and Katsos 2013; Engelhardt et al. 2006; Engelhardt et al. 2011) or not (e.g., Arts et al. 2011a; Tourtouri et al. 2015).

Whatever the effect of over-specification on comprehension, the frequent use of redundancy poses a challenge to traditional Gricean pragmatic accounts of communication, which argue that rational speakers should not over-specify, when their goal is to merely establish reference to an object. More recent bounded-rational approaches to communication (cf. Hale 2003, 2006; Frank and Jaeger 2008; Jaeger 2010; Levy and Jaeger 2007) may be better suited to explain this behaviour. Hale (2006) argues that processing effort is proportional to the reduction of uncertainty (entropy; Shannon 1948) about upcoming material in a sentence. In addition, Levy and Jaeger (2007) propose that due to cognitive resource limitations, peaks in the amount of information conveyed by words can increase processing effort for the addressees, and consequently speakers’ production choices are motivated by an intent to distribute this information (and thereby processing effort) more evenly across their utterances. Under these accounts, redundant expressions may be preferred to minimal descriptions, because they distribute the same content across more linguistic units, which strengthens the signal and provides addressees with additional cues to guide visual search, thereby making target identification faster and less effortful.

In this article, we crucially consider both the comprehension and production sides of visually-situated communication. In a first experiment, we seek to determine the impact of over-specification on situated comprehension, in order to gain a better understanding of what factors may contribute to speakers’ use of redundancy in a subsequent referential communication experiment. In both experiments, we manipulate the contribution of a word to the reduction of uncertainty regarding the target referent, which we quantify as entropy (Shannon 1948). We then examine how the rate of entropy reduction in a referring expression may influence listeners’ comprehension above and beyond any effects of specificity, and whether an adjective’s potential to reduce uncertainty can explain its redundant use by speakers.

Production of over-specified referring expressions

Since Grice put forth his cooperative principle, much work has investigated whether speakers do in fact observe the conversational maxims in everyday language use (e.g., Arts et al. 2011b; Belke and Meyer 2002; Davies and Katsos 2013; Deutsch and Pechmann 1982; Engelhardt et al. 2006, 2011; Koolen et al. 2011; Koolen et al. 2013; Koolen et al. 2015; Maes et al. 2004; Pechmann 1989; Rubio-Fernández 2016; Tarenskeen et al. 2015; Vogels et al. 2019; Vonk et al. 1992, among others). Despite the different visual settings, tasks or languages employed, these studies share a common finding: that speakers frequently use redundant information in their referring expressions. This redundancy is not in line with (a strict interpretation of) the Gricean maxims (see however Bach 2006; Geurts and Rubio-Fernández 2015), especially when compared to the low proportion of under-specifications (violations of the first maxim of Quantity). The consistency with which over-specification appears in referential communication gives rise to the question: Why do speakers over-specify?

Generally speaking, two kinds of explanations have been offered, namely, that over-specification is the result of production-internal processes (egocentric view) or that it is addressee-oriented (audience-design view) (cf. Arnold 2008). Under the egocentric view, in the presence of a visual display that contains referents differing in various attributes, speakers may start to speak before they have fully scanned the display for possible competitors to their intended referent; they may therefore include attributes that turn out to be unnecessary (cf. Pechmann 1989). It is also possible that in the interest of easing attribute selection and production processes, speakers use features that are visually salient and therefore preferred, such as colour (cf. Belke and Meyer 2002; Koolen et al. 2015; among others). By contrast, the audience-design account holds that speakers include redundant information in an effort to facilitate comprehension for their addressees, for instance by including properties that are visually salient or those which allow the addressees to create a mental image of the target to guide their visual search (cf. Arts et al. 2011b; Paraboni et al. 2007).

To determine the extent to which egocentric or audience-design concerns underlie referential over-specification, past research has tried to identify which factors contribute to the use of redundancy. That is, if speakers are found to over-specify more frequently when the experimentally manipulated factors are associated with the addressees’ performance, this should constitute evidence for the audience-design view. Two studies that manipulated exactly this found a higher over-specification rate for speakers who thought that their addressees had to carry out a demanding task such as performing a surgical operation (Arts et al. 2011b), or learning instructions on how to set an alarm clock (Maes et al. 2004).

On the other hand, if the use of over-specification is influenced by factors that are mostly relevant to the speaker, this would provide evidence for the egocentric view. Previous work has also manipulated such factors. For instance, properties of the target object such as cardinality (Koolen et al. 2011), or perceptual features such as colour salience (Belke and Meyer 2002; Belke 2006; Tarenskeen et al. 2015) have been shown to affect the rate of over-specifications produced. Other research underlines the role of availability in the production of redundant adjectives, such that properties that are conceptually more available to the speaker, such as colour or category, tend to be redundantly included in object descriptions more frequently (e.g., Schriefers and Pechmann 1988).

As speakers need to contrast the intended referent with the distractor objects in order to identify the properties in which they differ, the role of the visual context in the production of over-specified reference has also been investigated. For instance, some studies (Gatt et al. 2017; Rubio-Fernández 2016) have found that the rate of over-specification increased with context size (number of distractors). Furthermore, scene variation was also shown to play a role, as it was more likely for speakers to produce redundant colour adjectives in polychrome compared to monochrome displays (Rubio-Fernández 2016), and in displays where more distinguishing properties were relevant for the disambiguation of the target referent (Koolen et al. 2013). Finally, the presence of visual clutter (thematically related objects) was also shown to contribute to the production of redundant references (Koolen et al. 2015). All these factors are, however, related to perceptual characteristics of the referents, and it is possible that while they ease attribute selection for the speaker, they may also facilitate visual search and identification of the target for the listener.

In sum, despite Gricean expectations speakers frequently over-specify in referential production studies, but there is no agreement regarding whether the use of redundancy is driven by egocentric or audience-design considerations.

Comprehension of over-specified referring expressions

As mentioned above, existing research is divided over whether referential redundancy impedes comprehension or not. Some studies report that over-specification hinders listeners’ online processing and results in slower and less accurate identification of the target referent (cf. Davies and Katsos 2013; Engelhardt et al. 2011), while other work suggests that over-specification may even facilitate comprehension (cf. Arts et al. 2011a; Brodbeck et al. 2015; Tourtouri et al. 2015).

For instance, in an event-related brain potential (ERP) study, Engelhardt et al. (2011) found that when visual scenes contained two objects of different shapes, redundant prenominal adjectives (colour and size) yielded larger N400-like amplitudes time-locked to the onset of the adjective when compared to scenes with two objects of the same shape (i.e., where the adjective was required for identifying the target). The N400 component is generally thought to reflect the degree to which the context supports semantic processing, and larger N400 amplitudes are associated with increased processing difficulty (see Kutas and Federmeier 2011 for a review). Therefore, Engelhardt and colleagues took this N400-like effect to indicate that over-specification hampers comprehension. The observation of this effect may, however, hinge on the simplicity of the visual context. Namely, it is possible that extra information was strikingly redundant with visual contexts as highly simplified as the ones used in this experiment (only two objects appeared in the visual scene, differing in a maximum of two features). Moreover, any effects of over-specification might also emerge on the following noun region, while Engelhardt and colleagues only focused on the adjective. In a similar vein, Davies and Katsos (2013) found evidence that over-specification was dispreferred by listeners as indicated by the lower ratings and longer response times for over-specified compared to minimally-specified utterances. Material in this study, however, comprised expressions containing evaluative and size adjectives, which are known to invoke a contrastive interpretation (cf. Sedivy et al. 1999; Sedivy 2003, 2005).

Other offline and online experiments offer evidence in the opposite direction, namely that over-specification facilitates comprehension. Arts et al. (2011a), for instance, showed that referential redundancy might, in fact, be beneficial to understanding, and ease participants’ identification of the target referent. However, they only measured identification times after participants were exposed to the linguistic stimulus, thus no conclusions about the online processing of the referring expression can be drawn. Another ERP study (Tourtouri et al. 2015) provides further support for the notion that over-specification facilitates processing. In this experiment, participants viewed visual stimuli presenting six objects that differed in colour and pattern, and listened to concurrent spoken instructions to locate one of the objects (e.g., ‘Find the yellow bowl’), while their EEG was recorded. The presence of a shape competitor on the scene made the instruction minimally-specified, while when no competitor was available the instruction was over-specified. The authors found no difference between the two conditions in the adjective region, and an attenuated N400 for the over-specified compared to the minimally-specified condition in the noun region. In this study, however, visual displays in the over-specified condition always depicted exactly one object matching the property mentioned in the adjective, but this was not the case in the minimally-specified condition. In other words, when listeners were instructed to ‘Find the yellow bowl’, there was only one yellow referent in over-specified displays (i.e., the only object that was yellow was the bowl), but two yellow referents in minimally-specified displays (i.e., apart from the bowl, a yellow watering-can was also available). 
As the authors note, the facilitation observed for the over-specified compared to the minimally-specified instructions may merely be due to the relative predictability of the target referent in the two conditions. That is, in the over-specified but not in the minimally-specified condition, after hearing the adjective participants were able to predict the upcoming word. It is therefore possible that processing of over-specification would have been hindered had the scene included a second yellow object, as was the case in the minimally-specified condition, and even more so if this competitor object fitted Gricean considerations (i.e., was part of a contrast pair, thus making an adjective necessary).

Sedivy et al. (1999) manipulated exactly this factor in a visual-world eye-tracking study using either colour or size prenominal adjectives.^{Footnote 1} They report shorter fixation latencies to the target when it was part of a contrast pair (minimally-specified) compared to when it was not (over-specified). The authors interpreted this finding as evidence that participants readily used pragmatic inferencing to inform their interpretation of the utterance as it unfolded. It is, however, possible that this result was due to the specific experimental task rather than listeners’ contrastive interpretation of the adjective. Visual scenes consisted of four objects: a contrast pair differing in one feature (e.g., a yellow and a pink comb), and two singletons, one bearing the same feature as the target (e.g., a yellow bowl) and a distractor object. While the critical instruction mentioned one of the two referents with the shared feature (i.e., one of the yellow objects), it always came second after an instruction that referred to one object in the contrast pair. Therefore, an alternative interpretation of the results is that participants were faster to fixate the target when a contrasting object was available (that is, when the instruction was minimally-specified) because their attention was already allocated to the contrast pair. Two additional experiments in which the critical instruction came first yielded similar results, but these studies used scalar adjectives such as ‘tall’, which inherently invoke a comparison between the members of a contrast pair.

In sum, there is conflicting evidence regarding the comprehension of over-specifications, with some studies suggesting that over-specification hinders comprehension and others indicating a facilitation. This evidence, however, comes from experiments that vary in the size of the referent set, adjectives used, and crucially whether a competitor object fitting Gricean expectations was available in the visual scene. Each of these factors may have contributed to the observed effects.

The current study

The goal of the current study was to explore how the distributional properties of the visual context may (a) influence the comprehension of over-specifications (i.e., whether an adjective’s entropy reduction potential influences comprehension above and beyond specificity), and (b) affect the tendency of speakers to include redundant adjectives in their utterances.

In Experiment 1, we investigated the influence of referential specificity and entropy reduction on visually-situated comprehension by orthogonally manipulating these factors. In order to assess processing effort, we measured the index of cognitive activity (ICA)—a direct measure of cognitive load (see end of “Referential entropy reduction”)—as well as eye movements as participants followed auditory instructions to locate objects in a visual scene. While the instructions always included a prenominal adjective, we manipulated whether the intended referent was a singleton (over-specified reference) or was part of a contrast set (minimally-specified reference), in order to assess whether listeners compute Gricean pragmatic inferences online and whether their comprehension of the expression is adversely affected when expectations based on those inferences are not met. As in Sedivy et al. (1999), both types of referents (singleton and contrasted) were available in the scene regardless of whether instructions were minimally-specified or over-specified. In addition, we examined whether the rate of referential entropy reduction in the expression would further influence processing, and whether this influence is additive to any effects of specificity. We turn to this point in the next section.

Concerning production, Experiment 2 evaluated whether the entropy reduction potential of a property (colour or pattern) in the referential space would influence speakers’ redundant mention of this property. In other words, speakers may over-specify for a feature of the target referent not only because it stands out, but also based on the extent to which it reduces listener uncertainty about which object is the intended referent. For instance, speakers may be inclined to redundantly use an adjective such as ‘blue’ to identify a singleton object, not only because the colour blue is a salient property, and therefore easy to refer to, but also because it may help narrow down the referential space: If the set of objects that ‘blue’ selects is smaller than the set of other objects, the redundant mention of ‘blue’ before the noun would rapidly restrict the search space and at the same time distribute the effort of target identification over a longer sequence of linguistic elements. If, however, the blue objects outnumbered other objects, ‘blue’ would not be as effective as before in reducing uncertainty (the number of remaining referential candidates after hearing ‘blue’ would in this case be greater than before). Although a few recent studies have considered similar notions, such as discriminability, and their effects on referential over-specification (Koolen et al. 2015; Fukumura 2018; Vogels et al. 2019), none of these studies directly manipulated such factors.

Thus, Experiment 2 investigates whether and how the distributional properties of the visual scene influence the production of referential over-specification by carefully manipulating the potential of a word to reduce entropy (uncertainty regarding the target referent; cf. Hale 2006; Frank 2013). Identifying which property is more entropy-reducing in order to include it in a description is arguably more demanding for speakers than relying on simple heuristics, such as mentioning the most salient feature. Our hypothesis, therefore, is that over-specifications that include the most informative property (in terms of uncertainty reduction) aim at making visual search more effective for addressees and thus facilitate referential communication. As this hypothesis rests upon the extent to which over-specification inhibits or facilitates comprehension processes, we first turn to comprehension, before testing these predictions in production.

Referential entropy reduction

In situated communication, the visual and linguistic context similarly influence listeners’ expectations for the upcoming linguistic material in an unfolding utterance (e.g., Altmann and Kamide 1999; Knoeferle et al. 2005; Tanenhaus et al. 1995). For example, when a listener hears ‘Find the blue’ while immersed in a visual environment such as the one in Fig. 1a, he expects either of two objects to be mentioned next,^{Footnote 2} the ball or the oven mitt. In other words, in this context ‘blue’ reduces the set of potential referents from 6 to 2 objects and thus drastically reduces the listener’s uncertainty about the target referent.

We use Shannon’s entropy (Shannon 1948), given in (1) below, to quantify this uncertainty regarding the intended referent (referential entropy).

$$H\left( X \right) = - \sum_{x \in X} P\left( x \right)\log_{2} P\left( x \right)$$

(1)

In the visual context of Fig. 1a, at ‘Find the’ (i.e., before any information about the target becomes available), all objects are equally likely to be referred to and referential entropy is 2.58 bits, as determined by Eq. (1).
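
As a worked instance of Eq. (1), and assuming (as stated above) that each of the six objects in Fig. 1a is equally likely to be the referent, so that P(x) = 1/6 for every object, the sum collapses to a single logarithm:

$$H\left( X \right) = - \sum_{i = 1}^{6} \frac{1}{6}\log_{2} \frac{1}{6} = \log_{2} 6 \approx 2.58\ {\text{bits}}$$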

For communication to be successful, the speaker must provide enough information for the listener to reduce this uncertainty to zero. In other words, the listener’s mental representation of what the target referent is must move from a state of maximum entropy to a state of minimum entropy, so that by the end of the utterance he will be able to unambiguously identify this object. As the referring expression unfolds over time, incoming words (potentially) contribute to the reduction of referential entropy. This reduction is measured by ΔH, given in (2) below, and is the difference in referential entropy between two consecutive states of the listener’s representation (or two consecutive words in the utterance, w-1 and w).

$$\Delta H_{w} = H_{w - 1} - H_{w}$$

(2)

That is, when ‘Find the blue’ is uttered in the context of Fig. 1a, referential entropy at ‘blue’ is 1 bit, and ‘blue’ reduces entropy by ΔH_blue = 1.58 bits. On the other hand, if the expression is ‘Find the green’, referential entropy at ‘green’ is 2 bits, and ‘green’ contributes to the reduction of entropy by ΔH_green = 0.58 bits. That is, while the prenominal adjective in both cases contributes to the reduction of referential entropy, it does so to differing degrees, depending on the size of the referential domain each adjective selects. Thus, in situated communication, information conveyed by a word does not only depend on its probability to occur in a particular (visual and linguistic) context (surprisal), but also on the amount of uncertainty about the target referent that this word reduces (cf. Hale 2003, 2006; Frank 2013, for entropy reduction as a measure of processing difficulty outside visually-situated communication).
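
The arithmetic behind these values can be reproduced with a minimal sketch of Eqs. (1) and (2). It assumes that all remaining candidate referents are equally likely at each point in the utterance, and that the noun always narrows the set to a single object. The function names and the candidate-count lists are illustrative only and are not taken from the study's materials.

```python
import math


def referential_entropy(n_candidates: int) -> float:
    """Entropy in bits over n equally likely candidate referents.

    Under the uniform assumption, Eq. (1) reduces to log2(n).
    """
    if n_candidates < 1:
        raise ValueError("need at least one candidate referent")
    return math.log2(n_candidates)


def entropy_reductions(candidate_counts):
    """Delta-H (Eq. 2) between consecutive states of the candidate set."""
    entropies = [referential_entropy(n) for n in candidate_counts]
    return [prev - cur for prev, cur in zip(entropies, entropies[1:])]


# Candidate counts after 'Find the', after the adjective, and after the noun.
# The final count of 1 is an assumption: the noun identifies the target.
blue_trace = entropy_reductions([6, 2, 1])   # 'blue' matches 2 of 6 objects
green_trace = entropy_reductions([6, 4, 1])  # 'green' matches 4 of 6 objects

print("blue: ", [round(d, 2) for d in blue_trace])
print("green:", [round(d, 2) for d in green_trace])
```

In each trace the first value is the reduction at the adjective and the second is the residual entropy that the noun has to eliminate, so a larger reduction at the adjective leaves less to resolve at the noun.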

Hale’s (2006) entropy reduction hypothesis linked the reduction of entropy to processing difficulty, suggesting that the effort associated with processing a word should be directly proportional to this word’s contribution to the reduction of uncertainty about the rest of the sentence, quantified in bits of information. According to this hypothesis, addressees should experience some difficulty at each entropy reduction point (i.e., on every word in a sentence), but they should encounter greater difficulty the more bits of information this word reduces. This prediction was tested with reading times, both using corpora (Frank 2010, 2013; Wu et al. 2010) and in a self-paced reading experiment (Linzen and Jaeger 2016). Results showed that the rate of entropy reduction brought about by a word was a significant predictor of processing difficulty on that word, with higher reduction resulting in longer reading times. One recent visual world study (Ankener et al. 2018) tested the effects of entropy reduction on the processing of an object noun, based on the selectional restrictions of a preceding verb. That is, when the verb selected fewer objects in the visual scene (high entropy reduction), processing was facilitated on the subsequent noun, as indexed by ICA and visual attention. However, contra the entropy reduction hypothesis, no differences in processing effort were found after the high reduction of entropy on the verb itself.

In the current research, we examine the influence of referential entropy reduction on processing in visually-situated contexts, and seek to determine how the degree of reduction effected by an adjective may modulate listeners’ comprehension processes and explain the use of over-specification by speakers. To estimate processing effort we used the index of cognitive activity (ICA), which yielded reliable results in Ankener et al. (2018) (but see also Demberg and Sayeed 2016, Experiment 7; Sekicki and Staudte 2018; Vogels et al. 2018 for the use of ICA in visual world studies). The ICA is a direct measure of cognitive load that is based on pupillary response. Fluctuations of pupil size index cognitive effort in a variety of tasks, including language processing (e.g., Engelhardt et al. 2010; Frank and Thompson 2012; Just and Carpenter 1993; Scheepers and Crocker 2004). However, changes in the lighting conditions of the environment are also responsible for pupil dilation. The ICA (Marshall 2000) measures cognitive workload by separating variation in pupil size due to cognitive effort from variation due to the light reflex, while also accounting for random noise. The small and rapid pupil dilations that remain are associated with higher cognitive workload (Marshall 2002). Demberg and Sayeed (2016) showed, for example, that the ICA is sensitive to linguistic manipulations such as ungrammaticality, with conditions related to higher processing demands resulting in higher ICA values. They also demonstrated that the ICA is particularly suitable for the visual world paradigm since it is robust to the change of fixation positions and can thus complement the standard visual attention metrics in order to assess cognitive effort during linguistic processing.

Experiment 1

Experiment 1 aimed to establish whether referential over-specification impedes or facilitates comprehension, and also whether this is further modulated by the rate of entropy reduction in the expression. We recorded participants’ ICA and eye movements as they attended to audio instructions to locate a referent in a visual scene (e.g., ‘Find the blue ball’ in German, combined with displays such as those in Fig. 1). While the instruction was held constant, scenes differed in whether the intended referent belonged to a contrast set (cf. Fig. 1a, b, where a shape competitor is available) or was a singleton (cf. Fig. 1c, d, where there is no shape competitor). Thus, depending on the visual context, the prenominal adjective was either necessary or redundant, and the description minimally-specified (MS) or over-specified (OS), respectively. In addition to specificity, we manipulated entropy reduction, that is, the number of objects that matched the adjective (cf. two blue objects in Fig. 1a, c and four blue objects in Fig. 1b, d). Thus, the adjective restricted the set of potential referents to a greater or lesser degree, contributing to a high reduction (HR) of referential entropy (1.58 bits in Fig. 1a, c) or a low reduction (LR) of referential entropy (0.58 bits in Fig. 1b, d), respectively. Importantly, this reduction resulted in a smaller (1 bit) or larger (2 bits) amount of residual entropy, respectively, to be eliminated at the noun. In the analyses below, we report ICA values as a measure of comprehension difficulty, fixation probabilities as a measure of visual attention, and response times for comparisons to prior studies.
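
The 2 × 2 design can be summarised in a short sketch that tabulates, for each condition, whether a shape competitor is present, the entropy reduction at the adjective, and the residual entropy at the noun. The six-object display size and the two or four adjective-matching objects come from the description above. The uniform-probability assumption, the dictionary layout and the names `N_OBJECTS`, `ADJ_MATCHES`, `SPECIFICITY` and `design_table` are illustrative and are not part of the original materials.

```python
import math
from itertools import product

N_OBJECTS = 6                              # objects per display, as in Fig. 1
ADJ_MATCHES = {"HR": 2, "LR": 4}           # objects matching the adjective
SPECIFICITY = {"MS": True, "OS": False}    # is a shape competitor present?


def design_table():
    """One row per condition, assuming equally likely candidate referents."""
    rows = []
    for (spec, competitor), (red, matches) in product(
        SPECIFICITY.items(), ADJ_MATCHES.items()
    ):
        # Entropy before the adjective minus entropy after it (Eq. 2).
        delta_adj = math.log2(N_OBJECTS) - math.log2(matches)
        # Entropy left for the noun to eliminate.
        residual = math.log2(matches)
        rows.append(
            {
                "condition": f"{spec}-{red}",
                "shape_competitor": competitor,
                "delta_h_adjective": round(delta_adj, 2),
                "residual_at_noun": round(residual, 2),
            }
        )
    return rows


for row in design_table():
    print(row)
```

The table makes the orthogonality of the manipulation explicit: specificity changes only whether a shape competitor is present, while the entropy values depend only on how many objects match the adjective.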

We considered two regions of interest: the adjective, and the noun. Note, however, that in the adjective region only the entropy reduction manipulation is of interest, because at this point in the utterance participants were not yet able to determine whether the unfolding expression was minimally- or over-specified. Based on the entropy reduction hypothesis (Hale 2006), we expected to find effects of processing effort at each reduction point, with higher reduction resulting in increased processing difficulty. More specifically, ICA values on the adjective should be higher in HR compared to LR conditions. In contrast, ICA values on the noun should be lower in HR compared to LR conditions, since residual entropy on the noun in the HR condition should be low due to the previous high reduction of entropy on the adjective. It is, however, possible that we only observe an effect on the noun, as in Ankener et al. (2018), where a verb that selected for fewer objects did not itself elicit increased ICA values, but did nevertheless result in lower processing effort on the subsequent noun. Finally, if redundant prenominal adjectives facilitate processing by reducing referential entropy, this should be manifest in an interaction between specificity and entropy reduction, with a larger benefit (lower ICA values) in the OS-HR condition (cf. Fig. 1c).

Anticipatory eye movements triggered by the adjective might, however, reveal how listeners interpret the prenominal adjective (cf. Weber et al. 2006). Within each level of specificity, displays either contained one contrast object that fitted a contrastive reading of the adjective (cf. the blue ball in Fig. 1a, b and the blue mitt in Fig. 1c, d), or one singleton object that did not match a contrastive reading (cf. the mitt in Fig. 1a, b and the ball in Fig. 1c, d). If listeners are Gricean (i.e., if they assume that an adjective identifies a contrast rather than simply providing redundant information), then the adjective should trigger more anticipatory eye movements towards the contrast object compared to the singleton object.

Regarding specificity, the Gricean account predicts greater processing effort on the noun (higher ICA values) in OS compared to MS conditions. In contrast, the bounded-rational view does not predict such a difference; redundancy may be preferred because it distributes information (i.e., processing effort) across a longer sequence of linguistic elements. As visual attention (proportion of fixations) is primarily informative regarding expectations of upcoming material, we do not expect it to reveal anything on the noun beyond correct identification of the target.

Methods

Participants

Twenty-four native speakers of German (mean age = 25, 17 female), with normal or corrected-to-normal vision and no colour blindness were recruited through the Saarland University Psycholinguistic Group’s participant database. Participants were compensated 7 Euros for their participation.

Materials

Pictures of 30 everyday objects (e.g., mugs and bowls) were used to create the visual displays. The objects differed in colour (red, blue, green) and pattern (dotted, striped, checkered). Both colour and pattern were used as distinguishing features, not only to make sure that any effects would not merely be due to colour salience, but also to increase visual complexity and avoid tuning the task to one feature. Pattern was chosen over size, which is more commonly used (cf. Engelhardt et al. 2011; Sedivy et al. 1999), because pattern, like colour, is an intrinsic property of the object and does not invoke a comparison with other objects in the context. We thus made sure that any preference for a contrastive reading of the adjective would be due to the manipulation and not to the contrastive nature of size adjectives. GIMP (Version 2.8.10) was used to adjust colour hue and brightness and match them across objects. The pictures were then submitted to an offline picture naming task measuring naming agreement for the objects. Twenty-four independent participants were presented with the object pictures in all colours and patterns (distributed across 8 lists) and were asked to provide a description including colour and pattern. Only objects with naming agreement of 80% or higher were then used to create the visual stimuli.

Overall, 660 visual displays were created, of which 480 were used to construct the experimental items, and the rest were used in the fillers. Experimental items were the combination of 4 displays and one spoken instruction (cf. Fig. 1). Displays in one experimental item were essentially four versions of the same display, counterbalancing the target position within the item (cf. the position of the blue ball in Fig. 1), and the colour and pattern per object type throughout the experiment. This gave rise to 120 experimental items, half of which were paired with colour instructions (colour items), and the other half with pattern instructions (pattern items; cf. Fig. 10 in the Appendix). All experimental displays were created in a way that neither the target feature nor the target referent would be identifiable before hearing the critical words. To this end, six objects were used per display in two colours and two patterns. Two of the objects were singletons, and the rest were paired in two contrast sets, such that they could potentially serve as an over-specified or minimally-specified referent, respectively, with either a colour or a pattern instruction. Furthermore, because determiners in German are marked for gender, only same-gender objects were used in each display, to make sure that the determiner would not reveal the target and that the first point of entropy reduction would always be the adjective. Similarly, no phonological competitors appeared in the same scene, so that adjective onset would always be the first point of disambiguation across items.

Filler displays differed from experimental displays in several respects. First, 105 filler displays depicted only four objects, thereby introducing some variation in the stimulus set while also making the 6-object experimental displays more complex relative to the filler trials. Furthermore, half of the filler items were minimally-specified, and the other half were either over- or under-specified (with a higher proportion of over- relative to under-specifications). In this way, we introduced more variation in the stimuli, requiring the listener to be more attentive (as reference could not always be resolved), while maintaining a lower proportion of over-specifications, as is normally found in language use (cf. Engelhardt et al. 2006). Moreover, all filler displays apart from the under-specified ones contained a set of three same-shape objects (e.g., three balls) differing in both colour and pattern, thus making the use of a second adjective necessary for disambiguation. Under-specified fillers were similar in structure to the experimental displays, but failed to establish reference (e.g., ‘the green rucksack’ when two objects fit the description; cf. Fig. 1a and c). Twelve fillers were used as practice items in a familiarisation session before the experiment.

Experimental displays were paired with spoken instructions containing a prenominally modified referring expression like ‘Find the blue ball’ in German (‘Finde den blauen Ball’), while filler instructions could mention one, two or no modifiers. The order of mention of colour and pattern adjectives was counterbalanced in the two-modifier fillers. Audio stimuli were recorded with Cubase AI5 in a soundproof booth by a female native speaker of German. Speech was continuous, and no artificial pauses were inserted between words. Sentences were then annotated for adjective and noun onsets using Praat (Version 5.3). Mean word duration was 397.2 ms (SD = 49.6) for colour adjectives, 605.1 ms (SD = 75.1) for pattern adjectives, and 557.2 ms (SD = 75.7) for the nouns.

Stimuli were divided into 4 lists of 288 trials so that one version of an item was in each list, and no participants saw more than one condition of a given item. Lists were pseudo-randomised for each participant, making sure that at least one filler appeared between consecutive experimental items, and items of the same condition did not appear more than two times in a row. The experiment was implemented and run using E-prime 2.0 (Psychology Software Tools, Inc., Pittsburgh, PA, USA).
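The list construction and ordering constraints described above can be sketched as follows. This is a minimal illustration, not the authors' E-prime implementation: the Latin-square rotation, the trial representation and the reading of "in a row" as counting consecutive experimental trials while ignoring intervening fillers are all assumptions.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical trial representation: ("exp", condition index) or ("filler", None).
Trial = Tuple[str, Optional[int]]

def assign_condition(item_index: int, list_index: int, n_conditions: int = 4) -> int:
    """Latin-square rotation: each list contains every item in exactly one condition."""
    return (item_index + list_index) % n_conditions

def order_is_valid(trials: Sequence[Trial], max_run: int = 2) -> bool:
    """Check the two ordering constraints on a pseudo-randomised trial list."""
    previous_kind = None
    run_condition = None
    run_length = 0
    for kind, condition in trials:
        if kind == "exp":
            if previous_kind == "exp":
                return False  # no filler between two experimental items
            if condition == run_condition:
                run_length += 1
            else:
                run_condition, run_length = condition, 1
            if run_length > max_run:
                return False  # same condition more than max_run times in a row
        previous_kind = kind
    return True

# Example: a candidate order that satisfies both constraints.
candidate = [("exp", 0), ("filler", None), ("exp", 0), ("filler", None), ("exp", 1)]
print(order_is_valid(candidate))
```

A randomisation routine could shuffle the trials repeatedly and keep the first order for which `order_is_valid` returns `True`.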

Procedure

Participants’ eye movements were tracked at a rate of 250 Hz using an SMI RED 250 eye tracker (SensoMotoric Instruments GmbH, Berlin, Germany) attached to the bottom of a 22-inch Dell monitor. After participants gave informed consent, they read the instructions, and they were seated at a distance of approximately 60 cm in front of the monitor. A chinrest was used to minimise head movements. A familiarisation phase was first administered, during which the experimenter gave feedback after each trial, to make sure that the task was clear before the experiment began. Each experimental session was divided into 4 blocks, in between which participants could take short breaks. Calibration was performed at the beginning of each block. On average, participants needed 40 min to complete the experiment.

Visual stimuli were presented at a resolution of 1680 × 1050 pixels. At the beginning of each trial a cross appeared in the middle of the display for a period controlled by the experimenter. After that, the objects appeared while the cross remained on the screen for another 500 ms. The audio instruction was played 1500 ms later. After the end of the instruction, the objects remained on the screen for a wrap-up period of 500 ms. At the end of the trial, a prompt screen appeared asking participants to indicate which side of the screen the target referent was on, or whether it was not possible to tell (under-specified fillers) by pressing the corresponding button on a response pad in front of them.

Data analysis

We analysed the ICA, gaze probabilities as well as response times in two time windows, after adjective and after noun onset. For all analyses, we fitted (generalised) linear mixed models (lme4 package; Bates et al. 2015) in R (version 3.5.1; R Core Team 2018) including entropy reduction and specificity as well as the Feature (colour vs. pattern) of the target referent as fixed factors, and crossed random intercepts and slopes for participants and items. All factors were contrast coded, with positive contrast coding (0.5) for the levels of HR, MS and colour, and negative contrast coding (− 0.5) for LR, OS and pattern. Whenever the maximal models did not converge, we simplified the random effects structure as suggested by Barr et al. (2013). All analyses included only trials with correct responses.
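To make the sign of the reported coefficients easier to read, the following Python sketch illustrates the ±0.5 contrast coding described above. It uses a plain least-squares slope on made-up numbers and leaves out the random effects entirely, so it is only an illustration of the coding scheme, not a reimplementation of the lme4 models; the dictionary and function names are hypothetical.

```python
# Contrast codes as described in the text.
CODING = {
    "reduction": {"HR": 0.5, "LR": -0.5},
    "specificity": {"MS": 0.5, "OS": -0.5},
    "feature": {"colour": 0.5, "pattern": -0.5},
}

def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Toy, made-up log-RT values (NOT data from the study).
levels = ["HR", "HR", "LR", "LR"]
log_rt = [6.30, 6.40, 6.45, 6.55]

x = [CODING["reduction"][level] for level in levels]
slope = ols_slope(x, log_rt)

# With +/-0.5 coding and a balanced design, the slope equals
# mean(HR) - mean(LR): 6.35 - 6.50 = -0.15.
print(f"{slope:.2f}")
```

Under this coding, a negative coefficient for entropy reduction means lower values in HR than in LR conditions, and a positive coefficient for specificity means higher values in MS than in OS conditions.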

Response times Response times (RTs) were time locked to the onset of the prompt display. Analyses were carried out on log-transformed response times using linear mixed models.

Index of cognitive activity (ICA) To calculate the ICA we used the BeGaze™ software equipped with the ICA Module (SensoMotoric Instruments GmbH, Berlin, Germany) and Workload RT (EyeTracking, Inc., Solana Beach, CA, USA). Since the ICA values output by the BeGaze™ software are too coarse-grained for the type of effects we expect, we used the ICA Coefficients to compute ICA values per 100 ms (see Demberg and Sayeed 2016, for more details). Data points with a pupil diameter smaller than 2.5 SD per participant were eliminated, and a mean ICA value for both eyes was calculated. We compared mean ICA values across conditions within a window of 600 ms starting from the middle of each region (cf. Sekicki and Staudte 2018).

Fixations Eye-tracking data were pre-processed as follows. First, because the objects used in the visual displays could differ in size (cf. rucksack vs. mitt), areas-of-interest were calculated per object as the surface that the object covered on the screen in pixels plus 30 pixels around it. Next, fixations shorter than 80 ms were pooled with the immediately preceding or following fixation, if the distance between them was smaller than 12 pixels; otherwise they were excluded from the analysis. Finally, trials with recording problems (e.g., miscalibrations, track loss, etc.) were excluded from the analysis. For the analysis in the adjective region, to account for the difference in duration of colour and pattern adjectives, we considered a region from 200 ms before adjective offset until 200 ms after noun onset,^{Footnote 3} since it is known that it takes around 200 ms to plan and execute a saccade (Matin et al. 1993). As discussed above, the specificity manipulation is not relevant for the adjective, as it is based on information given on the noun. We therefore collapsed across MS and OS conditions, and coded looks to the singleton vs. the contrast objects to estimate whether participants assigned a contrastive reading to the prenominal adjective. For the analysis of eye movements during the noun, we were interested in the influence of specificity and entropy reduction on fixating the target referent, and not in possible early effects (anticipatory eye movements are analysed in the adjective region). We therefore considered fixations that started between 300 and 800 ms after noun onset. In both regions, we considered mean log-gaze probability ratios (cf. Knoeferle and Kreysa 2012) of participants’ fixations to (a) the singleton over the contrast object in the adjective region and (b) the target over the competitor object in the noun region. 
A positive ratio for (a) would indicate that the singleton object was more likely to be fixated over the contrast object, and a positive ratio for (b) that the target object was more likely to be fixated over the competitor object. Negative values should be interpreted in the opposite way (i.e., as more looks to the contrast object in the adjective region and as more looks to the competitor object in the noun region). A score of zero would indicate no differences in the probability with which each object was fixated. Because the log ratios are based on aggregation, it is not possible to include crossed random effects of participants and items in the same model. We, therefore, fitted separate linear mixed effects models over participants and over items.
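The log-gaze probability ratio described above can be sketched in Python as follows. The function computes ln(P(a)/P(b)) from fixation counts aggregated over the same set of fixations, so that the shared denominator cancels. How zero counts were handled in the original analysis is not stated, so the optional `smoothing` constant is purely an assumption of this sketch.

```python
import math

def log_gaze_ratio(n_a: int, n_b: int, smoothing: float = 0.0) -> float:
    """ln(P(a) / P(b)) for two objects, from aggregated fixation counts.

    Both probabilities share the same denominator (all fixations in the
    window), so the ratio reduces to a ratio of counts.
    """
    a = n_a + smoothing
    b = n_b + smoothing
    if a <= 0 or b <= 0:
        raise ValueError("ratio undefined for zero counts; set smoothing > 0")
    return math.log(a / b)

# Made-up counts for illustration only.
more_to_a = log_gaze_ratio(30, 10)  # positive: object a fixated more often
equal = log_gaze_ratio(10, 10)      # zero: no difference
more_to_b = log_gaze_ratio(10, 30)  # negative: object b fixated more often
print(more_to_a, equal, more_to_b)
```

In the adjective region, `a` would be the singleton and `b` the contrast object; in the noun region, `a` would be the target and `b` the competitor.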

Results

Response times

All of the factors included in the model significantly influenced RTs. Participants were faster to give a response in HR (611 ms, SD = 374) compared to LR conditions (659 ms, SD = 397; β = − 0.0796, SE = 0.0155, t = − 5.14, p < 0.001), and faster in OS (614 ms, SD = 372) compared to MS conditions (656 ms, SD = 398; β = 0.058, SE = 0.016, t = 3.755, p < 0.001). Faster responses were further observed when the mentioned feature was colour (570 ms, SD = 323) compared to pattern (703 ms, SD = 432; β = − 0.192, SE = 0.027, t = − 7.217, p < 0.001). In addition, the three-way interaction between entropy reduction, specificity and feature significantly influenced RTs (β = 0.135, SE = 0.062, t = 2.181, p < 0.05). We followed up this interaction by fitting separate models for colour and pattern items, and we observed similar results. In colour items, RTs were faster in HR (545 ms, SD = 306) compared to LR conditions (594 ms, SD = 338; β = − 0.086, SE = 0.020, t = − 4.235, p < 0.001), and faster in OS (555 ms, SD = 323) compared to MS conditions (584 ms, SD = 323; β = 0.053, SE = 0.020, t = 2.651, p < 0.01). Similarly in pattern items, RTs were faster in HR (679 ms, SD = 423) vs. LR conditions (726 ms, SD = 439; β = − 0.073, SE = 0.023, t = − 3.147, p < 0.01), and faster in OS (676 ms, SD = 409) vs. MS conditions (729 ms, SD = 452; β = 0.064, SE = 0.023, t = 2.773, p < 0.01). The entropy reduction × specificity interaction was marginally significant (β = − 0.078, SE = 0.046, t = − 1.688, p = 0.092), such that RTs were slower in the MS-LR condition.

ICA

Adjective In the adjective time window (see Fig. 2), the entropy reduction manipulation was found to significantly influence cognitive effort, with higher ICA values in HR vs. LR conditions (β = − 0.026, SE = 0.013, z = − 2.068, p = 0.039). The effect of feature and the interaction between the two factors did not reach significance (p > 0.05).

Noun All of the factors significantly affected participants’ cognitive workload in the noun region (Fig. 3). Specifically, we again observed a significant effect of entropy reduction, this time with higher ICA values in LR compared to HR conditions (β = − 0.073, SE = 0.023, z = − 3.160, p < 0.01). Furthermore, specificity and feature were also found to be significant predictors of cognitive load, with higher ICA values for MS compared to OS conditions (β = 0.079, SE = 0.026, z = 3.069, p < 0.01), and for pattern compared to colour items (β = − 0.076, SE = 0.022, z = − 3.372, p < 0.001). None of the interactions reached significance (p > 0.05).

Log-gaze probabilities

Adjective As mentioned above, the specificity manipulation is not relevant in the adjective window (see “Data analysis”). We therefore collapsed across specificity, and included only entropy reduction and feature as fixed factors in the models. We computed log-gaze probability ratios comparing fixations to the singleton and contrast objects. Table 1 presents the results of this analysis. As indicated by the significant intercept (both by participants and by items), upon hearing the adjective participants were more likely to fixate the contrast object over the singleton object (see negative coefficient). This viewing pattern seemed to be modulated by an interaction between the rate of entropy reduction and the mentioned feature, which we followed up with separate analyses for colour and pattern items. In colour items (Fig. 4), none of the comparisons reached significance; there was only a marginal effect on the intercept in the by-participants analysis. In pattern items (Fig. 5), the contrast object was more likely to be fixated over the singleton, and this effect seemed to be stronger in HR vs. LR conditions.

Table 1 Experiment 1 results—adjective region

Noun The specificity manipulation becomes relevant during the noun region, as it is at this point that the target referent is mentioned. We, therefore, considered fixations to the target vs. the competitor object, and specificity was included as a predictor in the models. The results of these analyses are presented in Table 2. Even though the analysis by participants resulted only in a marginally significant three-way specificity × reduction × feature interaction, and no other comparison reached significance, several significant effects were found in the by-items analysis. First, there was an effect of specificity with more looks to the target over the competitor object in OS vs. MS conditions. We also found an effect of reduction such that the target was more likely to be fixated than the competitor object in HR vs. LR conditions, and an effect of feature with more fixations to the target object in colour vs. pattern items. Additionally, there was a significant specificity × feature interaction with more fixations to the target object in the OS condition for colour items. We followed up the interactions by performing separate analyses for colour and pattern items. In the colour items (Fig. 6), the by-participant analysis resulted only in a marginally significant effect of specificity, with more looks to the target object in OS conditions. The by-items analysis revealed a significant effect of specificity in the same direction and a significant effect of reduction with more looks to the target over the competitor object in HR vs. LR conditions. In the pattern items (Fig. 7), both by-participants and by-items analyses resulted in a specificity × reduction interaction, which was significant and marginally significant, respectively. This interaction seemed to be driven by a smaller log ratio in the MS-LR condition (see Table 3).

Table 2 Experiment 1 results—noun region

Table 3 Experiment 1 results—mean log-gaze probability ratios (SD in parentheses) for fixations to the target over fixations to the competitor object in the noun time window of pattern items

Discussion

In this experiment, we aimed to assess whether comprehension of over-specified expressions is hindered or facilitated relative to minimally-specified expressions, and whether the rate at which referential entropy is reduced in the expression further affects processing. In the noun region, we found no evidence that over-specification hinders comprehension. Participants’ ICA values were in fact lower in OS vs. MS conditions, indicating that over-specification, if anything, facilitates comprehension. These findings are further supported by the log-gaze probabilities and RTs. Unsurprisingly, participants looked more towards the target than the competitor object after hearing the noun, but this effect was modulated by specificity, such that looks to the target vs. the competitor object were more likely when the noun followed a redundant vs. necessary adjective. Furthermore, participants’ RTs in this task were also faster in OS vs. MS conditions. In the adjective region, anticipatory looks to the singleton vs. contrast objects were expected to reflect participants’ interpretation of the adjective. Whereas there is some evidence that contrast objects were fixated more than singletons (supporting the Gricean account), this only occurred with pattern adjectives, which are more difficult to discern than colour. It is therefore possible that this effect is related to the length of pattern adjectives, which in this experiment were on average 200 ms longer than colour adjectives. Thus, with pattern adjectives participants may have had more time to consider which object could possibly be the target referent and to employ Gricean reasoning. 
Nevertheless, participants’ gaze behaviour in the adjective region of colour items, as well as the facilitation found for OS conditions with both colour and pattern items in the noun time window, contradict the Gricean account and support the view that over-specification facilitates comprehension.

Moreover, our findings support the entropy reduction hypothesis (Hale 2006) and show that the reduction of uncertainty is a predictor of comprehension difficulty in visually-situated communication. In contrast to Ankener et al. (2018), we found effects of reduction at each reduction point. A high reduction of entropy on the adjective resulted in increased cognitive effort (higher ICA values) in that region, but facilitated processing (lower ICA values) on the following noun: residual entropy on the noun, and with it the cognitive effort associated with reducing this entropy, was correspondingly lower in HR than in LR trials. The facilitation for HR vs. LR conditions was further indexed by a greater likelihood of fixating the target over the competitor object in the noun region, as well as by faster RTs for HR vs. LR conditions.

In sum, Experiment 1 found evidence that referential redundancy benefits processing. Furthermore, a high reduction of entropy on the adjective was found to increase effort in that region (indexed by the higher ICA values for HR relative to LR conditions), but to facilitate processing on the subsequent noun (ICA values here were lower for HR vs. LR conditions). These effects differed between colour and pattern adjectives, with colour resulting in greater facilitation (main effect of feature with lower ICA values in colour compared to pattern items). We now turn to the question whether speakers are sensitive to these processing concerns and whether they take comprehension effort into account when planning their utterances in situated communication contexts.

Experiment 2

The goal of Experiment 2 was to identify what factors motivate speakers’ over-specifications, and whether these factors are primarily associated with egocentric or addressee-oriented (whether Gricean or bounded-rational) concerns. In a referential communication experiment, pairs of participants sat in front of different monitors and collaborated to identify whether the location of a target object on a visual display, such as those in Fig. 8, was the same for both participants. Objects differed in colour and pattern, and in critical trials one feature (adjective) was necessary to identify the intended referent. We manipulated which feature was necessary for disambiguation (colour-necessary vs. pattern-necessary), and which feature was more entropy-reducing (colour-reducing vs. pattern-reducing vs. equally-reducing). We measured the proportion of over-specifications produced per condition.

The egocentric view holds that production preferences are tuned to minimise speakers’ effort, regardless of the addressees’ needs. Therefore, if over-specifications are the result of egocentric production processes, speakers’ choices should not be affected by the manipulations described above (i.e., the rate of over-specifications that egocentric speakers produce should be independent of the experimental condition).

Conversely, according to the addressee-oriented view, speakers should prefer structures that ease comprehension for their listeners; both the Gricean and the bounded-rational approaches are in accord with this view. The Gricean account predicts that for all conditions speakers should prefer to convey the minimal amount of information that is necessary, as this is what would be expected by the listeners. That is, speakers should use the expression ‘the blue ball’ to refer to the intended referent in the colour-necessary conditions (cf. top panels in Fig. 8) and the expression ‘the striped ball’ in the pattern-necessary conditions (cf. bottom row in Fig. 8), independent of their entropy reduction potential.

By contrast, the bounded-rational approach predicts that speakers should be more likely to over-specify particularly when the entropy reduction potential of the redundant adjective is higher than that of the necessary adjective. For example, in Fig. 8e ‘blue’ would be redundant, but it also reduces entropy to a higher degree than the necessary adjective ‘striped’ (ΔH_blue = 1.58 bits vs. ΔH_striped = 0.58 bits). Thus, the redundant ‘blue’ should be used more often in Fig. 8e than in Fig. 8d, where the necessary adjective (‘striped’) is more entropy-reducing. Such production preferences would be in line with the findings from Experiment 1 that listeners favour utterances that manage entropy more effectively. Finally, colour over-specifications are expected to be more frequent than pattern over-specifications. This prediction is based on the results from Experiment 1 as well as previous research (cf. Sedivy 2003; Rubio-Fernández 2016).
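The bounded-rational prediction can be illustrated with a small Python sketch that works from a feature-annotated display. The six-object display below is hypothetical (it is not the actual layout of Fig. 8e); it was chosen only so that pattern is necessary, colour is redundant, and the two adjectives yield the ΔH values quoted above. Equal prior probability for all objects is likewise an assumption.

```python
import math
from typing import List, Sequence, Tuple

Obj = Tuple[str, str, str]  # (colour, pattern, shape)

# Hypothetical display, NOT the actual Fig. 8e layout.
DISPLAY: List[Obj] = [
    ("blue", "striped", "ball"),  # target
    ("blue", "dotted", "ball"),   # shape competitor: same colour, other pattern
    ("red", "striped", "mug"),
    ("red", "striped", "bowl"),
    ("red", "striped", "sock"),
    ("red", "dotted", "mitt"),
]
TARGET = DISPLAY[0]
COLOUR, PATTERN, SHAPE = 0, 1, 2

def entropy_reduction(display: Sequence[Obj], position: int, value: str) -> float:
    """Bits removed by an adjective, assuming equally likely referents."""
    matching = sum(1 for obj in display if obj[position] == value)
    return math.log2(len(display)) - math.log2(matching)

def uniquely_identifies(display: Sequence[Obj], target: Obj, positions: Sequence[int]) -> bool:
    """True if the adjectives at `positions` plus the noun match only the target."""
    keys = list(positions) + [SHAPE]
    matches = [obj for obj in display if all(obj[k] == target[k] for k in keys)]
    return matches == [target]

dh_colour = entropy_reduction(DISPLAY, COLOUR, TARGET[COLOUR])
dh_pattern = entropy_reduction(DISPLAY, PATTERN, TARGET[PATTERN])

# Pattern is necessary if it (with the noun) picks out the target and colour does not.
pattern_necessary = (
    uniquely_identifies(DISPLAY, TARGET, [PATTERN])
    and not uniquely_identifies(DISPLAY, TARGET, [COLOUR])
)
# Bounded-rational prediction: over-specification is more likely in this case.
redundant_more_reducing = pattern_necessary and dh_colour > dh_pattern

print(f"dH_colour = {dh_colour:.2f}, dH_pattern = {dh_pattern:.2f}")
print(pattern_necessary, redundant_more_reducing)
```

In this toy display the redundant colour adjective removes more uncertainty than the necessary pattern adjective, which is the configuration in which the bounded-rational account expects more over-specification.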