To more rigorously examine the key hypothesis H1 thrown into doubt by the studies reported in the previous section, as well as to address H2, our main study implemented the cancellation paradigm with a combination of eye tracking and plausibility ratings. By combining these two measures, we can separately test hypotheses about what automatic inferences are initially triggered (H1, part 1) and whether they get suppressed or influence further cognition (H1, part 2).
In this study, participants read and rated items in which appearance sentences employing ‘look’, ‘appear’, or ‘seem’ were followed by a sequel that was either inconsistent or consistent with the hypothesised stereotypical inference from the appearance verb to a doxastic conclusion:
(1) The dress seemed blue. Hannah thought it was green. (stereotype-inconsistent)
(2) The dress seemed blue. Hannah thought it was navy. (stereotype-consistent)
As in the study of Fischer and colleagues (2019), stereotype-inconsistent items with ‘look’, ‘appear’, and ‘seem’ were intended to invite phenomenal (re-)interpretation of the verb. Participants also read and rated otherwise identical items with the contrast verb ‘is’, which lacks stereotypical association with doxastic patient properties (where, for convenience, we retain the label stereotype- or ‘s-in/consistent’ for counterparts of stereotype-in/consistent appearance items):
(3) The dress was blue. Hannah thought it was green. (s-inconsistent)
(4) The dress was blue. Hannah thought it was navy. (s-consistent)
When we read sentences, our eyes may pass over the same words several times.[16] Whereas first-pass reading times[17] are largely determined by word length, word frequency, and the word’s predictability in context (‘cloze probability’) (Rayner 1998), difficulties in integrating information from different parts of the sentence may have us reread bits of the sentence (Rayner et al. 2004; Clifton et al. 2007). Specifically, such integration difficulties may have us reread the regions where the difficulty becomes manifest (‘conflict region’) and regions perceived as the source of the difficulty (‘source region’). Where inferences triggered by previous words (‘seemed blue’) clash with subsequent text (‘Hannah thought it was green’), this leads to higher rereading times for either the conflict region (‘green’) or the source region (‘seemed blue’), or both. This increases the total reading times for these regions (defined as the sum of all fixations in a region) and the second pass reading times (defined as total minus first pass reading times). These two measures are known as ‘late’ reading times. The hypothesis that appearance verbs trigger doxastic inferences (H1, part 1) predicts
- [Prediction RT]:
Late reading times for conflict or source regions will be higher in s-inconsistent appearance-items (like 1 above) than in ‘is’-counterparts (like 3 above).
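To make the three measures concrete, the following is a minimal, purely illustrative Python sketch (not the study's analysis code) that derives first-pass, second-pass, and total reading times from a temporally ordered fixation record; the region names and durations are invented.

```python
# Illustrative sketch: deriving reading-time measures from a fixation sequence.
# Region labels and durations below are hypothetical.

def reading_times(fixations):
    """fixations: list of (region, duration_ms) tuples in temporal order.

    Returns {region: (first_pass, second_pass, total)}.
    First-pass time: fixations from first entering a region until first leaving it.
    Total time: sum of all fixations in the region.
    Second-pass time: total minus first-pass time.
    """
    first_pass, total, closed = {}, {}, set()
    prev = None
    for region, dur in fixations:
        if prev is not None and prev != region:
            closed.add(prev)  # first pass on the previous region is over
        total[region] = total.get(region, 0) + dur
        if region not in closed:
            first_pass[region] = first_pass.get(region, 0) + dur
        prev = region
    return {r: (first_pass.get(r, 0), total[r] - first_pass.get(r, 0), total[r])
            for r in total}

# A reader fixates the source region, moves on, then regresses to reread it:
fixes = [("source", 250), ("source", 180), ("conflict", 300), ("source", 220)]
print(reading_times(fixes))
# source: first pass 430, second pass 220, total 650
```

The regression to "source" after visiting "conflict" is what produces a non-zero second-pass time, mirroring the rereading pattern the prediction concerns.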
When initially triggered stereotypical inferences clash with contextual information or with background beliefs, they can be suppressed within one second and before they influence further cognition (Fischer and Engelhardt 2017b). The hypothesis that the doxastic inferences of interest influence further cognition (H1, part 2) therefore needs to be tested separately. It predicts that, in a subsequent non-speeded plausibility rating task, these inferences will reduce the plausibility of s-inconsistent appearance sentences, where they clash with the sequel, but will not affect the plausibility of s-consistent items. Hence:
- [Prediction PL1]:
S-inconsistent appearance items (like 1 above) will be deemed less plausible than s-consistent appearance items (like 2 above).
Since s-inconsistent ‘is’-items claim that protagonists are wrong about typically obvious matters (like the colour of a dress), we would expect participants to find them mildly implausible. However, the doxastic inferences posited by H1 would render s-inconsistent appearance items outright contradictory, and reduce their plausibility even further. Hence:
- [Prediction PL2]:
If s-consistent appearance and ‘is’ items (like 2 and 4) are deemed equally plausible, s-inconsistent appearance items (like 1) will be deemed less plausible than ‘is’-counterparts (like 3).
Such plausibility differences would provide evidence of cognitively influential doxastic inferences from appearance verbs in inappropriate contexts (namely, in s-inconsistent items, which invite a phenomenal interpretation of the verb).
The competing hypothesis H2 suggests that plausibility judgments about s-inconsistent texts (used in the previous plausibility ranking study and figuring among the items in this new study) are driven by subjectivity judgments (rather than by contextually cancelled doxastic inferences). To assess this hypothesis, we elicited subjectivity ratings for the claims under discussion in our items. These claims are expressed by the first sentences of ‘is’ versions (e.g., ‘The dress was blue’). H2 assumes that readers of s-inconsistent appearance items will assign the verb’s patient role to the author. This assignment—and only this assignment—turns the appearance sentence into the expression of an authorial opinion (I think the dress was blue) (Sect. 3.1). H2 suggests that participants then find appearance sentences, interpreted as expressions of opinions, more appropriate, and appearance items more plausible, the more subjective they deem the claim under discussion, and find corresponding ‘is’ sentences more appropriate, and ‘is’-items more plausible, the more objective (less subjective) they deem the claims under discussion (Sect. 3.5). H2 thus predicts:
- [Prediction PL3]:
For appearance items, there will be a positive correlation between subjectivity ratings (for claims under discussion) and plausibility ratings (for items); for corresponding ‘is’ items, there will be a negative correlation between these ratings.
In stereotype-inconsistent items, the patient-role assignment assumed by H2 can be either due to reassignment in response to the perceived inconsistency or made from the start (Sect. 3.1). If that assignment is made from the start, H2 should apply also to stereotype-consistent items. Since we have not ruled out this possibility, we will examine [PL3] first for all items and then for stereotype-consistent and -inconsistent items, separately.
Forty-eight first- and second-year undergraduate psychology students (9 males) from the University of East Anglia participated for course credit. All were native speakers of English with normal or corrected-to-normal vision.
Each participant read 48 critical items (six for each of eight conditions) and 48 fillers. All items were about visual objects. Half of the critical items involved basic visual properties (colour, shape, size; 8 items each). The other half involved less basic, but easily visually ascertainable, properties like material (silver, wood) or age (young, old). S-inconsistent items used antonyms in the first and second sentences. S-consistent items used synonyms, or the second sentence used a subordinate category (blue–navy). Appendix 1 gives a list of critical items. As verbs were rotated across items, mean length and frequency of words in the source regions were the same across verb conditions (except for the unavoidable differences between ‘is’, ‘look’, ‘appear’, and ‘seem’). Following the norming work (described below), we ensured that, in the conflict regions, neither the mean frequencies (consistent: 126, inconsistent: 182 occurrences in reference corpus Leech et al. 2001) nor the mean lengths (consistent: 5.54 characters, inconsistent: 5.25) of the adjectives differed significantly between the s-consistent and the s-inconsistent items (length: t(46) = −0.58, p = 0.57; frequency: t(46) = 0.80, p = 0.43).
To guard against floor effects and ensure the intelligibility of items, we ran a norming study in which twenty-six participants from the same population rated the plausibility of ‘is’-versions of candidate items (half s-consistent, half s-inconsistent) on a 5-point scale. Participants identified words they did not understand and did not rate the items containing them. We excluded all items whose s-inconsistent version attracted a mean rating below 2.5, and excluded or rephrased all items in which at least two participants failed to understand a constituent word.
Eye movements were recorded with an SR Research Ltd. EyeLink 1000 eye-tracker which records the position of the reader’s eye every millisecond. Head movements were minimised with a chin rest. Eye movements were recorded from the right eye. The sentences were presented in 12 pt. Arial black font on a white background.
Design and procedure
We manipulated the verb in the first sentence (‘is’, ‘look’, ‘appear’, ‘seem’) and the consistency of the sequel with hypothesised doxastic inferences (s-consistent vs. s-inconsistent), in a 4 × 2 design. Both variables were manipulated within subjects. We measured first pass, second pass, and total reading times for source regions, conflict regions, and their constituent words.
After a 9-point calibration and validation procedure, participants completed two practice trials and 96 experimental trials. These included 48 critical trials. Each participant saw an equal number of items in each condition, as verbs were rotated across items using a Latin Square Design. Before each trial, participants fixated a drift-correction dot on the left edge of the monitor, centred vertically. The sentence appeared after an interval of 500 ms. The initial letter of each sentence was displayed in the same position as the drift correction dot. The entire sentence appeared on a single line on the screen. The participant read the sentence silently and then pressed the spacebar on the keyboard. A plausibility-rating prompt appeared, and participants rated sentences’ plausibility on a scale from 1 to 5, by pressing the corresponding key on the keyboard. Endpoints were explained as ‘very implausible’ (1) and ‘very plausible’ (5), and the midpoint (3) as ‘neither plausible nor implausible; the decision feels arbitrary’.
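The Latin-square rotation of verbs across items can be illustrated with a short sketch. This is a hypothetical reconstruction under the stated constraints (48 items, four presentation lists, eight conditions with six items each), not the actual experiment script, and it simplifies by fixing each item's consistency level across lists.

```python
# Hypothetical Latin-square rotation: each of 4 lists shows every item once,
# each verb 12 times, and each verb x consistency cell 6 times.
from collections import Counter

VERBS = ["is", "look", "appear", "seem"]

def build_list(list_index, n_items=48):
    """Assign each item a verb and consistency level for one presentation list.

    Verbs rotate across items with the list index, Latin-square style, so that
    across the 4 lists every item appears once in every verb condition.
    """
    trials = []
    for item in range(n_items):
        verb = VERBS[(item + list_index) % len(VERBS)]
        # Blocks of 4 items alternate consistency, balancing cells within a list.
        consistency = "s-consistent" if (item // len(VERBS)) % 2 == 0 else "s-inconsistent"
        trials.append((item, verb, consistency))
    return trials

counts = Counter((verb, cons) for _, verb, cons in build_list(0))
print(sorted(counts.items()))  # 8 cells, 6 items each
```

Counterbalancing of this kind is what licenses the claim that word length and frequency in the source regions were matched across verb conditions.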
To preview findings, results largely bore out predictions derived from H1, but not predictions from H2.
We analysed plausibility ratings for all items with a 2 × 4 (consistency × verb) repeated-measures ANOVA. This revealed large main effects of consistency, F(1,46) = 387.21, p < 0.001, η² = 0.89, and verb, F(1,46) = 7.53, p < 0.01, η² = 0.14, and a marginal two-way interaction, F(1,46) = 3.29, p = 0.076, η² = 0.07 (see Fig. 1). Participants rated s-consistent items distinctly plausible, i.e., significantly above the neutral mid-point, in all verb conditions (p’s < 0.001 for all mean ratings), and deemed s-consistent items with different verbs equally plausible, F(3,138) = 0.59, p = 0.62, η² = 0.01. By contrast, s-inconsistent items with all verbs were deemed distinctly implausible, i.e., significantly below the mid-point (all p’s < 0.001), and there were significant differences between verb conditions, F(3,138) = 3.54, p < 0.05, η² = 0.07. As per prediction [PL1], s-inconsistent items with an appearance verb were deemed less plausible than s-consistent counterparts (‘look’: t(46) = 15.07, p < 0.001; ‘appear’: t(46) = 15.87, p < 0.001; ‘seem’: t(46) = 16.11, p < 0.001). Prediction [PL2] was that, if s-consistent items with appearance verbs and ‘is’ are deemed equally plausible, the consistency manipulation will render appearance items less plausible than ‘is’ items. Participants indeed rated s-consistent items with all verbs equally plausible (above). As predicted by [PL2], s-inconsistent items with ‘appear’ and ‘seem’ were deemed less plausible than s-inconsistent items with the contrast verb ‘is’ (appear vs. is: t(46) = −2.09, p = 0.04; seem vs. is: t(46) = 2.65, p = 0.01). The mean plausibility rating for s-inconsistent items with ‘look’ (2.40) was numerically lower than the mean rating for similar ‘is’-items (2.43), but, against predictions, this difference was not significant, t(46) = 0.36, p = 0.72.
Further comparisons between s-inconsistent items revealed that the difference between ‘appear’ and ‘seem’ was not significant, t(46) = 0.74, p = 0.46, that the difference between ‘look’ and ‘seem’ was significant, t(46) = 2.48, p = 0.02, and that the difference between ‘look’ and ‘appear’ trended towards significance, t(46) = −1.41, p = 0.09.
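The pairwise comparisons above are paired (within-subject) t-tests on participants' condition means. As a transparent illustration of the statistic, here is a stdlib-only Python sketch run on fabricated ratings (not the study's data):

```python
# Illustrative paired t-test: the ratings below are made up for the example.
import math

def paired_t(x, y):
    """Paired t statistic on difference scores, with df = n - 1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of differences
    se = math.sqrt(var / n)                           # standard error of the mean diff
    return mean / se, n - 1

# Hypothetical per-participant mean plausibility for 'seem' items:
cons = [4.8, 4.5, 4.9, 4.2, 4.6, 4.7]    # s-consistent
incons = [2.1, 2.4, 1.9, 2.6, 2.2, 2.0]  # s-inconsistent
t, df = paired_t(cons, incons)
print(round(t, 2), df)
```

Because every participant contributes to both conditions, the test is computed on within-participant difference scores rather than on two independent samples.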
Our prediction [RT] about reading times only concerns later reading times for s-inconsistent items. The observed plausibility ratings suggest that later reading times will vary between verbs only for such items. Our analysis of reading times therefore focused on s-inconsistent items. We predicted that late (total and second pass) reading times for conflict or source regions will be higher for s-inconsistent items with verbs ‘look’, ‘appear’, and ‘seem’ than for corresponding items with ‘is’. However, most relevant trials involved a striking pattern of eye movements: When reading s-inconsistent items, participants regressed from the end of the final sentence to the source region (e.g., to ‘seemed blue’ in ‘The dress seemed blue. Hannah thought it was green’), reread the source region, and then progressed to the plausibility-rating screen without rereading the conflict region (e.g., ‘green’). To take these findings into account, we report total reading times for the conflict region. Since second pass (= total minus first pass) reading times are the most precise measure of integration difficulties, we report these for the crucial source region.
We analysed reading times for s-inconsistent items with a one-way repeated-measures ANOVA with verb type as a four-level factor (manipulated within item). This revealed that total reading times for the conflict region (e.g., ‘green’) did not differ significantly between items with different verbs, F(3,138) = 1.39, p = 0.25, η² = 0.03 (see Fig. 2). By contrast, second pass reading times for the source region, obtained by summing across the first verb and first object (e.g., ‘seemed blue’), showed a large effect of verb, F(3,138) = 4.65, p < 0.01, η² = 0.24. Paired comparisons revealed that second-pass reading times were appreciably higher in ‘appear’-items than in ‘is’-items, t(46) = 3.41, p = 0.001, and in ‘seem’-items than in ‘is’-items, t(46) = −2.85, p = 0.007, consistent with prediction [RT]. By contrast, the difference between ‘look’-items and ‘is’-items was not significant, t(46) = −1.34, p = 0.19. Differences between items with different appearance verbs remained shy of significance: the difference between ‘look’- and ‘appear’-items was marginally significant, t(46) = 1.92, p = 0.06, while that between ‘look’- and ‘seem’-items was not, t(46) = −1.55, p = 0.13. For further reading times, with discussion, see Appendix 2.
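For readers who want the mechanics of the one-way repeated-measures ANOVA spelled out, the following stdlib-only sketch computes the F statistic on a fabricated participants × verb-conditions matrix (not the study's reading times):

```python
# Illustrative one-way repeated-measures ANOVA; the data matrix is fabricated.

def rm_anova(data):
    """data[i][j]: score of participant i in condition j.
    Returns (F, df_effect, df_error)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    # Between-conditions sum of squares
    ss_cond = n * sum((sum(data[i][j] for i in range(n)) / n - grand) ** 2
                      for j in range(k))
    # Between-subjects sum of squares (removed from the error term)
    ss_subj = k * sum((sum(row) / k - grand) ** 2 for row in data)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_err = ss_total - ss_cond - ss_subj
    df_effect, df_error = k - 1, (k - 1) * (n - 1)
    return (ss_cond / df_effect) / (ss_err / df_error), df_effect, df_error

# Hypothetical second-pass times (ms), columns: is, look, appear, seem
data = [[210, 260, 380, 360],
        [180, 240, 350, 330],
        [200, 230, 400, 370],
        [190, 280, 360, 340]]
F, df1, df2 = rm_anova(data)
print(round(F, 2), df1, df2)
```

The within-subject design shows up in the error term: subject-level variability is partialled out before the F ratio is formed, which is what distinguishes this from a between-subjects ANOVA.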
As we will presently discuss, these findings provide evidence that the intended phenomenal uses of ‘appear’ and ‘seem’ in our items triggered doxastic inferences that influence further cognition, and support hypothesis H1 for ‘appear’ and ‘seem’, but not for ‘look’.
To assess whether plausibility ratings were largely based on subjectivity judgments, in the way suggested by H2, we recruited 107 participants with the same approach and from the same population as in Experiments 1–3.[18]
Participants rated the claims under discussion for all critical items used in the main study, as expressed by the first sentence of items’ ‘is’ versions (e.g., ‘The dress was blue’). Participants rated them on a scale from 1 (‘completely objective’) to 7 (‘completely subjective’). We then calculated the mean subjectivity ratings and reanalysed the data from the main study to assess H2’s prediction [PL3] that mean plausibility ratings for items would correlate with mean subjectivity ratings for claims under discussion, positively for items with appearance verbs and negatively for items with ‘is’. Accordingly, the following analyses were conducted on items, rather than participants.
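The by-items reanalysis boils down to Pearson correlations between mean subjectivity ratings and mean plausibility ratings, computed per verb condition. Here is a minimal stdlib-only sketch with invented item means:

```python
# Illustrative Pearson correlation; the item means below are invented.
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-item means: subjectivity (1-7) and plausibility (1-5)
subjectivity = [1.5, 2.8, 3.4, 4.2, 4.9, 5.3]
plausibility = [2.6, 2.5, 2.3, 2.1, 1.9, 1.8]
print(round(pearson_r(subjectivity, plausibility), 2))  # negative correlation
```

Under [PL3], the sign of r is what matters: positive for appearance items, negative for ‘is’ items.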
Mean subjectivity ratings for claims under discussion varied between 1.52 (‘The word’s spelling was correct’) and 5.31 (‘The building was quite grand’), and divided neatly into halves close to the scale’s mid-point (4), with half the claims receiving mean ratings of 4.2 or above. A two-way ANCOVA with context (consistent vs. inconsistent) and verb type (four levels) as factors and subjectivity ratings as a covariate showed significant main effects of context, F(1,22) = 16.87, p < 0.001, η² = 0.43, and subjectivity, F(1,22) = 5.78, p = 0.025, η² = 0.21, as well as a significant interaction between context and subjectivity, F(1,22) = 5.13, p = 0.03, η² = 0.19. Crucially, however, results revealed no significant effect of verb, F(1,22) = 1.29, p = 0.27, η² = 0.001, and no interaction between verb and subjectivity, F(1,22) = 0.15, p = 0.74, η² = 0.001, and there were also no significant correlations between subjectivity and plausibility ratings (all p’s > 0.42). This is inconsistent with prediction [PL3], derived from hypothesis H2.
However, H2 was initially advanced as a hypothesis about s-inconsistent items (Sect. 3.4), and was then tentatively extended to s-consistent items (Sect. 4.1). Next, we therefore considered stereotype-consistent and -inconsistent items separately. The s-consistent items showed no significant main effects or interaction (all p’s > 0.30). The correlations between subjectivity and plausibility ratings were not significant, either (is: r = 0.07, p = 0.75; look: r = 0.16, p = 0.45; appear: r = 0.05, p = 0.83; seem: r = −0.08, p = 0.70). That is, differences in subjectivity did not correspond to any changes in plausibility: the more subjective half of the s-consistent items were deemed roughly as plausible as the more objective half, across all verb conditions (subjective half mean plausibility = 4.51, objective half mean plausibility = 4.41).
The crucial s-inconsistent items showed no main effect of verb, F(1,22) = 2.78, p = 0.11, η² = 0.11, but a significant and large main effect of subjectivity, F(1,22) = 14.40, p = 0.001, η² = 0.40. Crucially, however, there was no interaction between verb and subjectivity, F(1,22) = 1.10, p = 0.31, η² = 0.048. Against prediction [PL3], we found negative correlations between subjectivity and plausibility ratings not only for items with ‘is’ (r = −0.50, p = 0.014) but also for items with ‘look’ (r = −0.56, p = 0.005), ‘appear’ (r = −0.47, p = 0.019), and ‘seem’ (r = −0.44, p = 0.030). That is, the more subjective the relevant claims were deemed, the less plausible s-inconsistent items were judged, regardless of the verb used (i.e., there was no effect of verb). The more subjective half of the s-inconsistent items were deemed less plausible than the more objective half, across all verb conditions (subjective half mean plausibility = 2.07, objective half mean plausibility = 2.53). Strikingly, plausibility ratings were numerically higher for ‘is’ items (2.17) than for appearance items (2.04) in the more subjective half of our s-inconsistent items (where H2 suggested that appearance verbs would be deemed more appropriate, and appearance items more plausible, than items with the supposedly more objective ‘is’).
The follow-up study’s findings speak against H2, and against subjectivity being a confound for the main study. According to H2, participants base plausibility ratings for items on how appropriate the first sentence’s verb (‘is’ vs. ‘appear’, etc.) is in view of the subjectivity of the claim under discussion. Differences in plausibility are then attributed to differences in subjectivity, which are held to affect the plausibility of ‘is’ and appearance sentences in different ways. However, when considering the whole sample, we found no significant correlations between subjectivity ratings (for claims under discussion) and plausibility ratings (for items) in any verb condition, despite a main effect of subjectivity. Moreover, this effect disappeared when we considered only s-consistent items (which employ the same first sentences as s-inconsistent items). This suggests that participants’ plausibility ratings were not influenced mainly by the first sentence and the fit between its verb and the subjectivity of the claim under discussion. Subjectivity did affect the plausibility of items in the s-inconsistent condition. However, it affected the plausibility of all s-inconsistent items in the same way, namely by decreasing it across all verb conditions, rather than increasing the plausibility of appearance items and decreasing that of ‘is’ items (as per H2). Differences in subjectivity therefore cannot explain the differences in plausibility between s-inconsistent items with different verbs.
The follow-up findings help us interpret the main study’s findings concerning H1. They help address the question of how participants interpret the crucial s-inconsistent items in the main study (‘The dress seemed blue. Hannah thought it was green’): whether they assign the patient role of the appearance verb to the protagonist (Hannah) or to the author, either from the start or as a result of reassigning the patient role from protagonist to author in response to the otherwise severe clash with the sequel (see Sect. 3.1). Assignment of the patient role to the author would turn the appearance sentence into the expression of an authorial opinion or hedged judgment about a matter of fact. This would lead participants to interpret s-inconsistent items as expressing a clash of subjective opinions between the author and the protagonist. The claim that there is such a clash should seem more plausible when the matter at hand is deemed more subjective. In line with [PL3], patient-role assignment to the author thus predicts a positive correlation between subjectivity and plausibility ratings for appearance-items. However, we observe a negative correlation for the crucial s-inconsistent items. We infer that, as long as no alternative assignment is explicitly suggested to them (as in Exp.2), participants assign the patient-role of the appearance verb to the protagonist in the text (The dress seemed blue to Hannah).
As long as this assignment is maintained, the most obvious interpretation of s-inconsistent items that avoids a contradiction is the phenomenal interpretation intended by the experimenters. On this reading, the items’ first sentence attributes merely an experiential attitude to the protagonist (The dress visually appeared blue, to Hannah), but no doxastic attitude (Hannah may well still believe that it has another colour than it looks to her, here and now, under these lighting conditions, etc.). This avoids contradiction with the sequel but makes for a mildly implausible scenario. Since the items do not make any deviations from stereotypical viewing situations explicit, participants will infer with the I-heuristic (Levinson 2000; see Sect. 2.1 above) the absence of factors (like odd lighting) that could lead a protagonist to distrust appearances, and will think it mildly implausible that the protagonist should think the object has one visual property (as asserted by the sequel) when it looks another to them. To win through to this non-contradictory, phenomenal interpretation of these implausible items, readers need to completely suppress the doxastic component features of the situation schemas associated with appearance verbs.
Higher second-pass reading times for source regions in s-inconsistent appearance sentences (‘seemed blue’) than in corresponding ‘is’ sentences (‘was blue’) could be due either to such suppression effort or to cognitive effort expended on patient reassignment (from protagonist to author, see above). Having excluded patient reassignment, we interpret the observed elevated second-pass reading times for source regions in sentences with ‘appear’ and ‘seem’ as evidence of the effort to suppress contextually inappropriate schema components that is involved in phenomenal interpretation of the appearance verb.[19] Lower plausibility ratings for s-inconsistent items with ‘seem’ and ‘appear’ than with ‘is’ suggest that this effort meets with only partial success, and that the doxastic conclusions inferred continue to influence ratings. We thus take these findings to support hypothesis H1 for two of the three appearance verbs examined: at any rate, phenomenal uses of ‘seem’ and ‘appear’ trigger doxastic inferences which are at most partially suppressed and continue to influence further judgment.
In contrast with ‘appear’ and ‘seem’, ‘look’-items pattern with ‘is’-items in the s-inconsistent condition: neither the second-pass reading times for the source regions nor the plausibility ratings differ significantly, and both item types are deemed distinctly implausible (if more plausible than ‘appear’- and ‘seem’-items). This suggests that participants construct, for both ‘look’- and ‘is’-sentences, situation models in which the protagonist looks at the object; they then find it roughly equally implausible that the protagonist should judge the object viewed to have one property when it actually possesses (‘is’) or looks another, in stereotypical viewing conditions (see above). Nearest-neighbour analyses of ‘look at’ (Fischer et al. 2019, Fn. 22) and ‘look F’ (Fischer et al. 2015, Appendix) suggest these verbs are associated with epistemic and doxastic features, respectively, but more weakly than ‘seem’ and ‘appear’ are. In both cases, these stereotypical features need to be suppressed to arrive at a consistent (e.g., phenomenal) interpretation of s-inconsistent items. Given the weaker association of the relevant features, suppression requires less effort than with ‘seem’ and ‘appear’, as evidenced by lower second-pass reading times for the source region, and this lesser suppression effort meets with greater success, as evidenced by higher (if still low) plausibility ratings in the main study. Further evidence comes from Exp. 1: there, ‘look’ was the only appearance verb for which s-inconsistent items with visual objects attracted lower contradictoriness ratings than items with abstract objects. This suggests that, with ‘look’, participants find it easier than with ‘appear’ and ‘seem’ to win through to a largely phenomenal interpretation, which attributes experiential features (available for items with visual objects, but not with abstract objects) but is largely devoid of doxastic implications.
Yet further evidence is provided by a follow-up study we undertook in response to a reviewer query (and report in Appendix 3). Participants judged the acceptability of items that make, respectively, doxastic and non-doxastic uses of appearance verbs to describe familiar cases of non-veridical perception (where nobody is taken in). For all three appearance verbs, they deemed non-doxastic uses acceptable (e.g., ‘Seen from the beach, the huge ships anchored out at sea look small’, though nobody believes they are small). This suggests that all three verbs are used—also—in a non-doxastic ‘phenomenal’ sense, in ordinary discourse. But non-doxastic uses were deemed more acceptable for ‘look’ than the other verbs. This suggests that for ‘look’ this non-doxastic phenomenal sense is more salient, and the doxastic sense less salient, than for the other verbs, so that contextually inappropriate doxastic inferences are easier to suppress, and exert less influence on further cognition (as per the Salience Bias Hypothesis).
To sum up, the present findings largely confirm H1 and suggest that phenomenal uses of ‘seem’ and ‘appear’ trigger contextually inappropriate doxastic inferences that influence further cognition, whereas any doxastic inferences readers may make from ‘look’ can be swiftly suppressed. The follow-up study also helped us assess the hypothesis H2, which explains observed plausibility judgments with reference to subjectivity judgments rather than inappropriate doxastic inferences. The present findings speak against this potential confound affecting the main study. They also exclude this confound for a previous plausibility ranking experiment (Fischer et al. 2019; see Appendix 4).
However, the present findings partially diverge from the findings of this and another previous study employing the forced-choice plausibility-ranking task: Fischer et al. (2019) observed significant preferences for ‘is’-sentences over appearance sentences in s-inconsistent items with all three appearance verbs. In the present study, mean plausibility ratings were only minimally (and non-significantly) lower for s-inconsistent ‘look’- than for ‘is’-sentences. The forced-choice plausibility-ranking paradigm may translate such non-significant plausibility differences into significant differences in preference. However, an earlier study (Fischer and Engelhardt 2016) using the same paradigm found preferences between ‘look’- and ‘appear’-sentences to be random, even though their mean plausibility ratings in the present study display a larger numerical difference than those of ‘look’ and ‘is’. The experimental findings specifically for ‘look’ thus present a mixed picture.