Introduction

When experiencing events, our cognitive system routinely makes use of expectations to anticipate upcoming information and to guide action (e.g., Rao & Ballard, 1999; Friston, 2010; Wacongne et al., 2012). The violation of expectations, and its relation to stimulus plausibility, has been explored in areas ranging from language comprehension to visual scene perception (e.g., Kutas & Hillyard, 1980; Van Berkum et al., 1999; Henderson et al., 1999; Ganis & Kutas, 2003; Hagoort et al., 2004; Mudrik et al., 2010; Võ & Wolfe, 2013; Coco et al., 2014). An open challenge remains to understand how the cognitive system uses expectancy mechanisms to hold information from multiple points in time simultaneously and to integrate it to produce action responses (Bar, 2007). The growing attention towards this challenge can be traced to current proposals in the cognitive sciences that aim to bridge low-level perceptual processes, high-level expectancy mechanisms, and motor control within the same predictive processing framework (e.g., Clark, 2013; Pickering & Clark, 2014). Lupyan and Clark (2015), for example, suggest that perception is an inferential process, whereby prior beliefs are combined with incoming sensory data to optimize in-the-moment processing and to improve future predictions. In the current study, we draw from this framework to explore how different types of expectations interactively mediate comprehension, and how these comprehension processes can be captured over time through a fine-grained analysis of response behavior.

Previous research on expectancy mechanisms has largely employed electroencephalography (EEG) measures. A common finding in this research is that a mismatch between predictions and incoming linguistic or non-linguistic stimuli triggers negative shifts in event-related brain potentials (e.g., Ganis & Kutas, 2003; Kutas & Hillyard, 1980; DeLong et al., 2005; Mudrik et al., 2010). Such shifts are typically interpreted as evidence that a prediction had occurred and that prior knowledge has been updated. Moreover, the magnitude of these negative shifts is modulated by the context in which unexpected stimuli are placed (e.g., a sentential context; Marslen-Wilson & Tyler, 1980; Kutas, 1993), as well as by additional information preceding them, such as a narrative sequence of visual scenes (Sitnikova et al., 2008; Cohn et al., 2012).

Although EEG studies have provided invaluable insights into contextual effects and processing costs at the moment unexpected stimuli are encountered, there may also be additional costs that persist after the initial encounter. This is particularly true when expectations must be maintained in memory to perform an explicit decision response, such as in verification tasks where participants are asked to assess the congruency of a pair of sequentially presented stimuli (i.e., whether a sentence and a picture convey the same message or not; e.g., Clark & Chase, 1972; Carpenter & Just, 1975). We hypothesize, in line with a predictive processing framework, that when incongruent stimuli are encountered, activation elicited by the initial stimulus will compete with bottom-up sensory activation from the second stimulus, taking time to resolve as the error signal is adjusted. Moreover, we expect multiple sources of expectancy to contribute to the strength of prediction activation: not only the congruence between consecutive stimuli, but also their plausibility (i.e., whether the stimuli depict something plausible or implausible). The interaction between these sources has been only marginally investigated, and there have been no studies, as far as we are aware, that have examined how the costs driven by congruency and plausibility activate, compete, and are resolved throughout an extended decision response.

The examination of this interaction requires an extension of typical verification tasks, along with novel ways to track competition costs. Beginning with the task, participants in our study were presented with two consecutive stimuli in different modalities (a sentence and a scene) and asked to verify the congruency of their content. As an extension, the content of the stimuli was manipulated to be consistent or inconsistent with prior knowledge expectations (i.e., the stimuli convey plausible or implausible content). For example, participants might be asked to read a sentence that is plausible or not (e.g., the boy is eating a hamburger vs. eating a brick), and then be shown a visual scene that does or does not match the earlier content (this order is reversed for another set of participants in a “scene-first” version; see Fig. 1 for the example design). In this way, expectations generated when processing the first stimulus are allowed to interact with the subsequent second stimulus in terms of both plausibility and mutual congruence (and potentially to be modulated by modality).

Fig. 1

Experimental design and example of experimental stimuli for Order of presentation: Sentence-First (top row) and Scene-First (bottom row), crossing Plausibility and Congruency. For each Order, a sentence or a scene is presented either as the first or as the second stimulus. In Sentence-First, a sentence is read in a self-paced manner, then a scene is presented for 1 second. In Scene-First, a scene is presented for 1 second, then a sentence is read. After being exposed to the pair of stimuli, participants are asked to use the computer mouse to evaluate whether the messages conveyed by the two stimuli were congruent or not; see also Fig. 2 for an example of a trial run. The sentence in Portuguese is o rapaz está a comer ..., and the four versions of the final noun phrase are: um hamburger (Congruent/Plausible), um tijolo (Congruent/Implausible), um peixe (Incongruent/Plausible), and uma alça (Incongruent/Implausible)

The above experimental setup provides a fully crossed design of congruency and plausibility expectations, which we employ to explore how systematic mismatches of expectations result in different competition costs. We implemented this design under an action dynamics approach that involved tracking participants’ computer-mouse cursor movements during verification, specifically as participants navigated from the bottom of the screen to the top, where the “yes” and “no” options were located in opposite corners. In previous work, the resulting movement trajectories have been shown to reveal moment-by-moment competition between response options as decisions unfold, typically as trajectories that deviate toward a competitor response en route to a target selection (Spivey & Dale, 2004; Magnuson, 2005; Spivey et al., 2005; Farmer et al., 2007; Dale et al., 2007; Duran et al., 2010; Papesh & Goldinger, 2012).

In the current paradigm, for example, we may observe response conflicts of this type when the first stimulus conveys plausible information, activating a “yes” response (i.e., initial movement toward this option on the screen), that must be corrected when a subsequent mismatching stimulus appears (toward a “no” response). This result would corroborate previous mouse-tracking research that has found response conflicts to be associated with congruency mismatches in verification tasks (e.g., van Vugt & Cavanagh, 2012). Critically, however, our study makes it possible to examine whether conflicts due to violation of congruency expectations are modulated by the plausibility of the stimuli. In particular, we expect that when an initial stimulus conveys information violating prior knowledge, i.e., it is implausible, a “no” response is activated. This response must be corrected to a “yes” response if the subsequent stimulus is congruent with the initial stimulus, despite the subsequent stimulus also conveying the same implausible content. This is the case, for example, when the participant reads a sentence such as the boy is eating a brick, and is then presented with a scene congruently depicting a boy eating a brick. Specifically, this response conflict should emerge very early, in the initial angle of the movement toward the incorrect response (the “no” response), and should be observed throughout the trajectory as a consistent deviation toward the incorrect choice, or as a higher number of directional changes. Such evidence would help refine the predictive processing account by showing that the matching of congruency expectations and bottom-up sensory information is not alone sufficient to facilitate cognitive processing; much depends on the plausibility of the information being integrated.

Method

Previous literature has adopted the notion of congruency to indicate both stimulus implausibility (e.g., Mudrik et al., 2010) and mismatch between consecutive stimuli (e.g., West & Holcomb, 2002; Sitnikova et al., 2008). In our 2 × 2 experimental design, Congruency indicates whether the content matched between stimuli (Congruent, Incongruent), and Plausibility indicates whether the content was expected or unexpected within a stimulus (Plausible, Implausible). Moreover, the cross-modal Order of presentation (Sentence-First, Scene-First) was manipulated as two separate experiments (between-participants design). In Fig. 1, we provide a schematic description of the experimental conditions for both studies, together with a full set of crossed pairs of stimuli.

Participants

Sixty-four students at the University of Lisbon, all native speakers of Portuguese, participated in the study for course credit. The experiment was approved by the Ethics Committee of the Department of Psychology, in accordance with the University’s Ethics Code of Practice.

Materials

We used 125 photorealistic scenes, originally published in Mudrik et al. (2010), and added another 100 scenes based on open-access material from the Internet (e.g., Flickr). Each of these 225 unique scenes (size = 550 × 550 px) was produced in the two conditions of Plausibility (Plausible, Implausible), for a total of 450 scenes across these two conditions (e.g., the picture of the boy either eating a brick or eating a hamburger). In order to generate the Congruency conditions, we constructed two types of sentences for each condition, for a total of 900 sentences, which is the total number of items (Footnote 1). As exemplified in Fig. 1, top row: Congruent/Plausible is the scene of the boy eating a hamburger paired with the sentence the boy is eating a hamburger; Congruent/Implausible is the scene of the boy eating a brick paired with the sentence the boy is eating a brick; Incongruent/Plausible is the scene of the boy eating a hamburger and the sentence the boy is eating a fish; Incongruent/Implausible is the scene of the boy eating a brick and the sentence the boy is eating a handle.

The sentences were written in Portuguese and checked for grammaticality by two independent native-speaking annotators. The target word (e.g., hamburger vs. brick) was always positioned at the end of the sentence. The annotators also ensured that the target object depicted in the scene would be recognized as the referent of the target word used in the sentence.

We divided these 900 stimuli into four Latin-square lists of 225 items each, such that no item was repeated within a list. From each list, we selected 100 items to present to an individual participant, and we made sure that, across participants, all 900 stimuli distributed over the four lists were presented an equal number of times.
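As an illustration of this rotation, a minimal sketch follows; the item, list, and condition identifiers are hypothetical and only meant to show how each base item can appear once per list, with its condition shifted across the four lists.

```r
# Minimal sketch of a Latin-square rotation (illustrative names only):
# each of the 225 base items appears once per list, and its condition
# is shifted by one position from one list to the next.
conditions <- c("Congruent/Plausible", "Congruent/Implausible",
                "Incongruent/Plausible", "Incongruent/Implausible")

make_lists <- function(n_items = 225, n_lists = 4) {
  lapply(seq_len(n_lists), function(l) {
    data.frame(
      item      = seq_len(n_items),
      list      = l,
      condition = conditions[((seq_len(n_items) + l - 2) %% n_lists) + 1]
    )
  })
}

lists <- make_lists()
# e.g., sample 100 of the 225 items in a list for one participant
one_participant <- lists[[1]][sample(nrow(lists[[1]]), 100), ]
```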

In order to assess how plausibility, congruency, and other possible task covariates (such as the grammaticality of the sentence) were perceived across trials, we asked the participant at the end of each trial to rate, on a scale from 1 to 6 (i.e., from very strongly disagree to very strongly agree): 1) the plausibility of the scene, 2) the visual saliency of the target object, 3) the congruency between the scene and the sentence, and 4) the grammaticality of the sentence. For these ratings, the previously shown stimuli were displayed and there was no time limit to answer.

In the Supplementary Material, we present analyses of these ratings, in terms of both accuracy and response time, which confirm the validity of our experimental manipulations (see ‘Question answer confidence scores’).

Apparatus and procedure

The experiment was designed using Adobe Flash 13.0 (sampling at 60 Hz) and run in the computer laboratory of the Department of Psychology at the University of Lisbon. The stimuli were presented on a 21” plasma screen at a resolution of 1024 × 768 pixels. Participants sat between 60 and 70 cm from the computer screen. Calibration of the mouse position was ensured by requiring participants to click on a black target circle (36 pixels across) located precisely at the bottom-center of the screen at the start of each trial and throughout its different phases. The optical computer mouse was placed directly on the table, rather than on a mouse pad, and participants had enough space around them to produce natural responses.

Participants first read a sentence, using a self-paced, word-by-word presentation method, by clicking on the calibration button located at the bottom of the screen. After the last word was read, a visual scene was displayed for 1 second. This preview time is based on previous work using the same stimuli (Mudrik et al., 2010), and gives enough time to extract scene information and identify the critical target object. The scene then disappeared, and the response options (yes, no) were displayed at the top of the screen, counterbalanced (left/right) between participants. Once the participant clicked on a response, the four rating questions were presented one at a time on separate screens, after which a new trial started (Fig. 2 illustrates an example trial). The Scene-First Order condition was created simply by presenting the scene for 1 second prior to the reading of the sentence.

Fig. 2

An example of a trial run. A target circle is shown at the beginning of every trial. The target ensures that all participants are calibrated to the same starting position. The target is then clicked to display the sentence one word at a time. When the last word is reached, it triggers the presentation of the scene, which is displayed for 1000 ms. After the display, the yes/no verification buttons, equally spaced from the center of the screen, are shown at the top of the screen. After the decision is made, the participant is asked to answer four questions on a Likert scale gauging: 1) the plausibility of the scene, 2) the visual saliency of the target object, 3) the congruency between the scene and the sentence, and 4) the grammaticality of the sentence. For all ratings, the previously viewed scene and sentence remain visible, which removes the need to recall the stimuli from memory

At the beginning of the session, participants performed three practice trials to familiarize themselves with the task. Each participant then completed 100 randomized trials, and the full session took between 45 and 60 minutes.

Analysis

The dataset we analyzed contained a total of 5,851 unique trials, after removing 8% (549 trials) from the full dataset because verification times were greater than 4 standard deviations from the mean or because of machine error.
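A minimal sketch of this exclusion step, assuming a data frame with one row per trial and a hypothetical `rt` column holding verification times:

```r
# Drop trials whose verification time exceeds 4 SDs above the mean;
# column names are illustrative, not those of the original scripts.
exclude_slow_trials <- function(trials, sd_cutoff = 4) {
  cutoff <- mean(trials$rt, na.rm = TRUE) + sd_cutoff * sd(trials$rt, na.rm = TRUE)
  trials[!is.na(trials$rt) & trials$rt <= cutoff, ]
}
```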

We first analyzed the accuracy of performance during the verification task. This resulted in 5,085 accurate trials (≈87%). Based on this set, we analyzed the time taken to make a decision and the dynamics of the decision process itself (i.e., the associated mouse trajectory). In particular, from the verification trajectory we extracted the following measures: (a) initial degree (the degree of deviation from vertical after the mouse trajectory leaves a 50-pixel radius around the starting point, as in Buetti and Kerzel (2009)), where positive values indicate angles towards the incorrect target; (b) latency (the time taken to move outside the initial 50-pixel region around the starting point); (c) x-flips (the number of directional changes along the x-axis); and (d) area under the curve (AUC; the trapezoidal area between the trajectory and an imaginary straight line drawn from the calibration button to the correct response button). Each measure captures complementary information about the decision process. Initial degree captures the earliest point in the response at which conflict can be observed. Latency reflects the initial hesitancy to commit to a decision, whereas x-flips gauge uncertainty and changes of mind as the decision unfolds. Finally, AUC is a summary measure of the overall strength of competition toward the incorrect response, where a greater area suggests stronger competition. These measures have been detailed in previously published research (e.g., Buetti & Kerzel, 2009; Freeman & Ambady, 2010; Dale & Duran, 2011).
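To make these definitions concrete, the sketch below computes the four measures for a single trial, assuming vectors of x/y cursor samples recorded at 60 Hz with the origin at the calibration point. All names and conventions are illustrative, not the original analysis code, and the sign of the initial angle would still need to be flipped according to the counterbalanced side of the incorrect response.

```r
# Sketch of the four trajectory measures for one trial (illustrative only).
# xs, ys: cursor coordinates relative to the calibration point; hz: sampling rate.
trajectory_measures <- function(xs, ys, hz = 60, radius = 50) {
  dist <- sqrt(xs^2 + ys^2)
  out  <- which(dist > radius)[1]          # first sample outside the 50-px region

  # (a) initial degree: deviation from vertical when leaving the 50-px radius
  #     (sign relative to the incorrect response depends on counterbalancing)
  initial_degree <- atan2(xs[out], ys[out]) * 180 / pi

  # (b) latency: time (in seconds) taken to leave the 50-px region
  latency <- (out - 1) / hz

  # (c) x-flips: number of sign changes in horizontal movement
  dx      <- diff(xs)
  dx      <- dx[dx != 0]
  x_flips <- sum(diff(sign(dx)) != 0)

  # (d) AUC: trapezoidal area between the trajectory and the straight line
  #     from the starting point (0, 0) to the trajectory endpoint
  n      <- length(xs)
  line_x <- ys * xs[n] / ys[n]
  dev    <- xs - line_x
  auc    <- abs(sum(diff(ys) * (dev[-n] + dev[-1]) / 2))

  c(initial_degree = initial_degree, latency = latency,
    x_flips = x_flips, auc = auc)
}
```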

To perform our analysis, we employed linear mixed-effects models implemented in the R package lme4 (Bates et al., 2011). We built full models with all main effects and possible interactions, and with a maximal random structure, where each random variable of the design (e.g., Participants) is introduced as an intercept and as uncorrelated slopes on the predictors of interest (e.g., Plausibility); see Barr et al. (2013). We adopted this approach also to account for the large variance observed across participants in mouse-tracking experiments (Bruhn et al., 2014).
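As a sketch, such a model for one dependent measure could be written in lme4 syntax roughly as follows; the variable and data-frame names are illustrative, and the exact set of slopes retained in the paper's models may differ.

```r
library(lme4)

# Maximal-style model with uncorrelated random intercepts and slopes;
# (0 + X | g) adds a slope for X on grouping factor g without the
# intercept-slope correlation.
m_auc <- lmer(
  AUC ~ Congruency * Plausibility * Order +
    (1 | Participant) +
    (0 + Congruency   | Participant) +
    (0 + Plausibility | Participant) +
    (1 | Scene) +
    (1 | Counterbalancing),
  data = trials
)
summary(m_auc)
```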

The dependent measures examined are: Accuracy (Footnote 2; probability), Response Time (in seconds), Initial Degree (in degrees), Latency (in seconds), x-flips (count), and AUC. The fixed variables of our design are Congruency (Congruent, Incongruent), Plausibility (Plausible, Implausible), and Order (Scene-First, Sentence-First). The random variables of our design are Participants (64), Scenes (450, as we have 225 scenes in the two conditions of Plausibility), and Counterbalancing (2, left and right). We report tables with the coefficients of the predictors, their t-values, and their p-value significance.
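For illustration, the centred contrast coding reported in Table 2, together with a binomial mixed model for the binary accuracy measure, might look roughly as follows; the data frame, column names, and the choice of a logistic link for accuracy are assumptions, not the paper's exact specification.

```r
library(lme4)

# Centred contrast coding as in Table 2 (column names are illustrative)
trials$Congruency   <- ifelse(trials$congruency   == "Congruent",   0.5, -0.5)
trials$Plausibility <- ifelse(trials$plausibility == "Plausible",   0.5, -0.5)
trials$Order        <- ifelse(trials$order        == "Scene-First", 0.5, -0.5)

# One plausible way to model trial accuracy (0/1): a logistic mixed model
m_acc <- glmer(
  accurate ~ Congruency * Plausibility * Order +
    (1 | Participant) + (0 + Plausibility | Participant) +
    (1 | Scene) + (1 | Counterbalancing),
  data = trials, family = binomial
)
```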

Results

We start by examining performance accuracy and response time across all experimental conditions. We then examine how the response conflict develops along the decision trajectory by looking at the measures of initial degree, latency, x-flips, and area under the curve, which characterize its underlying dynamics. We report both the raw observed data (means and SDs) and the coefficients of the maximal mixed-effects model.

Accuracy and reaction time

Looking at response accuracy, we find it to be quite high overall (≈87%), indicating that participants are able to perform the verification task correctly (refer to Table 1). Accuracy is strongly modulated by Plausibility and Congruency, as well as by the Order in which the sentence/scene pairs are presented. In particular, participants are more accurate when plausible stimuli are presented, and when the sentence is presented prior to the scene (refer to Table 2 for the model coefficients and their significance). Congruency is not significant as a main effect, but only in interaction with Plausibility, whereby implausible and congruently matching stimuli are verified less accurately, independently of the Order of presentation.

Table 1 Observed data (mean and standard deviation) of all summary measures reported in the study, which are organized in the table by order of presentation (Sentence-First, Scene-First), Congruency (Congruent, Incongruent) and Plausibility (Plausible, Implausible)
Table 2 Coefficients of mixed-effects models with maximal random structure (intercept and slopes on Participants, Scenes and Counterbalancing). Each dependent measure, organised across columns, is modelled as a function of the centred and contrast coded predictors: Congruency (Congruent = 0.5, Incongruent = -0.5), Plausibility (Plausible = 0.5, Implausible = -0.5), Order (Sentence-First = -0.5, Scene-First = 0.5). We report the β with the associated p-value, and the t-value from which it was derived

For the response times to reach a decision (correct trials only), we again observe a main effect of Plausibility, whereby implausible information is verified more slowly than plausible information. Moreover, we confirm the interaction between Plausibility and Congruency, whereby implausible but congruent stimuli take longer to verify.

Initial degree, latency, x-flip, and area under the curve

Table 2 reports the coefficients for the mixed-model analyses of the initial degree, the latency to start the movement, the flips along the x-axis during the trajectory, and the area subtended by it.

From the very first moments of the trajectory (initial degree), participants display a larger conflict in their response when the stimuli are implausible and congruently matched (two-way interaction between Plausibility and Congruency). This can also be seen in the observed data in Table 1, where we find a more positive initial angle (positive values indexing angular movement towards the incorrect response) when implausible stimuli are congruent. Conversely, the least deviated initial angle is observed when implausible stimuli are incongruent. No other main effects or interactions are observed for the initial angle.

For latency, participants hesitate more before starting their verification when the stimuli are implausible, a pattern similar to that observed for overall response time. Interesting results for latency are obtained in interaction with Order of presentation. We find that when the scene is presented first and the stimuli are implausible, participants are faster to initiate the movement. We also find a significant interaction between Congruency and Order, such that when the scene is presented first and the stimuli are incongruent, participants take longer to initiate the movement.

The measures of x-flips and area under the curve converge on the same significant result: implausible but congruent pairs generate more complex trajectories (interaction between Plausibility and Congruency). These results corroborate all the other analyses reported above: greater response conflicts are generated when implausible stimuli have to be accepted as congruent.

In the Appendix, we also model the angular trajectories using growth-curve analysis, corroborating the results reported above at an even finer-grained resolution. Furthermore, in the Supplementary Material, we re-analyze all summary measures presented in this study, but instead of using the two-level categorical distinctions for Plausibility (Plausible, Implausible) and Congruency (Congruent, Incongruent) as predictors, we use the ratings provided by the participants during question answering (refer to Fig. 2 for an example of an experimental trial with these ratings). This additional analysis largely confirms the results presented in the main paper, and serves to demonstrate the ecological validity of our experimental manipulations and verification paradigm.
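For readers unfamiliar with the technique, a growth-curve analysis of the angular trajectories could be sketched roughly as below, regressing the angle measured in normalized time bins on orthogonal polynomial time terms; this is a generic sketch with assumed variable names, not the exact specification used in the Appendix.

```r
library(lme4)

# Generic growth-curve sketch: 'traj' holds one row per time bin per trial,
# with a time_bin index and the trajectory angle at that bin.
ot <- poly(traj$time_bin, 2)      # orthogonal linear and quadratic time terms
traj$ot1 <- ot[, 1]
traj$ot2 <- ot[, 2]

m_gca <- lmer(
  angle ~ (ot1 + ot2) * Congruency * Plausibility +
    (1 + ot1 + ot2 | Participant),
  data = traj
)
```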

Discussion

There has been a recent and growing interest in understanding the cognitive system along central principles of predictive processing (e.g., Clark, 2013; Pickering & Clark, 2014; Lupyan & Clark, 2015). The current study adds to these attempts by employing a task that required participants to verify whether two sequentially presented stimuli, which also varied in the plausibility of their content, conveyed the same message. In doing so, two sources of expectancy (stimulus congruency and stimulus plausibility) were allowed to converge and compete during verification. Moreover, in a departure from the majority of previous research, we examined this competition in the continuous updating of the motor system using an action dynamics approach. This allowed a novel view into the temporally extended activation and resolution of competition. Whereas reaction time primarily indexes the time it takes to make a response, and EEG the neural resources recruited at a specific time, we were able to examine competition from the earliest moments of processing and throughout the decision process.

Our analysis of the motor movements focused on key summary measures, including initial degree of movement, latency of movement, x-flips, and area under the curve (AUC). We interpreted greater latencies, increased complexity, and deviations toward an incorrect response as signaling greater processing costs and the application of error-signal updating. All measures consistently point to the same core result: implausible but congruently matching stimuli generated greater response competition than incongruent and implausible stimuli. A match between expectations elicited by an initial stimulus and incoming sensory information from the second stimulus would normally facilitate processing. However, in our scenario, such facilitation was actually disrupted because the incoming information violated other expectations based on prior knowledge.

As evidenced by the computer mouse movements, when the implausible stimulus is encountered, there is a very early hesitation and bias to respond “no” that is activated and persists during verification, competing with the congruency expectation to respond “yes.”

This result is in line with current proposals of predictive processing, underscoring how the cognitive system actively engages generative mechanisms of expectation (i.e., predictive coding) and error correction (e.g., Rao & Ballard, 1999; Bar, 2007; Hinton, 2007; Kok et al., 2012; Wacongne et al., 2012), drawing on multiple sources of expectation from the local context (congruency expectations) and from prior knowledge (plausibility expectations). Moreover, this research helps refine current proposals of predictive processing by showing that congruency expectations can be immediately disrupted and temporarily overridden when incoming stimuli violate prior knowledge.

The cross-modal paradigm also tested whether the directionality of the response costs is sensitive to the modality order of presentation (i.e., sentence-first vs. scene-first) (e.g., Clark & Chase, 1972; Federmeier & Kutas, 2001; Pecher et al., 2003). We found hints of an asymmetry in the latency to start the movement, and especially in the angular trajectory (see Appendix), whereby stronger conflicts are observed when an implausible sentence is reinforced by a subsequent congruently matching scene (the sentence-first scenario). Based on the literature on simulation models in sentence-picture verification tasks (e.g., Zwaan et al., 2002; Ferguson et al., 2013), we speculate that when a sentence is presented first, the resulting simulated imagery, as opposed to the mere viewing of a picture, relies on deeper connections to prior knowledge to generate its content, and thus this condition shows a greater sensitivity to stimulus plausibility. Future research will be needed to better assess these asymmetries triggered by modality order, for example by providing greater initial linguistic context to make the simulations even richer.

Another natural avenue for investigating the effects of plausibility and congruency on predictive processing is to combine a fully crossed design, as in the current study, with EEG measures. Intriguing results about the impact of these sources on neural responses have been reported, for example, by Dikker and Pylkkanen (2011), who used a picture-name matching task and observed a very early signature of prediction violation, at around 100 ms, when words did not accurately describe the pictures. This result has recently been corroborated by Coco et al. (2015), where congruency and plausibility were found to be associated with distinct temporal latencies, such that incongruent pairs generated revision costs as early as 100 ms after stimulus onset, whereas implausibility effects began at around 250 ms (the same latency reported by Mudrik et al. (2010), and more recently by Mudrik et al. (2014)). Of great interest would be to compare neural responses with the behavioral patterns reported here and to draw joint conclusions about on-line stimulus processing and later integrative verification dynamics. Other key questions relate to whether participants adapt to implausible verification, and to the amount of exposure needed to modify prior knowledge so that implausible information is accepted without additional costs.

In conclusion, the current study provides new evidence for how response conflicts arise in the motor system as decision responses are acted out. We take this as demonstrating that the plausibility of stimuli mediates congruency expectations in a verification task. More generally, these results provide compelling evidence in support of a predictive processing account by showing that the matching of congruency expectations is not sufficient for facilitating cognitive processing, but depends greatly on the plausibility of information that is being integrated.