Significance statement

Misinformation often has an ongoing effect on people’s reasoning even after they receive corrections. Therefore, to reduce the impact of misinformation, it is important to design corrections that are as effective as possible. One suggestion often made by front-line communicators is to use stories to convey complex information. The rationale is that humans are uniquely “tuned” to stories, such that the narrative format facilitates understanding and retention of complex information. Some scholars have also suggested that a story format may help overcome resistance to corrections that threaten a worldview-consistent misconception. It is therefore possible that misinformation corrections are more effective if they are presented in a narrative rather than a non-narrative, more fact-oriented format. The present study tests this possibility. We designed narrative and non-narrative corrections that differ in format while conveying the same relevant information. In Experiment 1, corrections targeted misinformation contained in fictional event reports. In Experiment 2, the corrections targeted false claims commonly encountered in the real world. Experiment 3 used real-world claims that are controversial, in order to test the notion that a narrative format may facilitate corrective updating primarily when it serves to reduce resistance to correction. In all experiments, we also manipulated test delay, as any benefit of the narrative format may arise only in the short term (if the story format aids primarily with initial understanding) or only after a delay (if the story format aids primarily with later memory for the correction). Across all experiments, narrative corrections were no more effective than non-narrative corrections. Therefore, while stories and anecdotes can be powerful, there is no fundamental benefit of using a narrative format when debunking misinformation. Front-line communicators are advised to focus primarily on correction content: while there will be cases where a narrative frame naturally lends itself to a particular debunking situation, this study suggests that a narrative approach to debunking will not generally be superior.

Introduction

The contemporary media landscape is awash with false information (Lazer et al. 2018; Southwell and Thorson 2015; Vargo et al. 2018). Misinformation featured in the media ranges from preliminary accounts of newsworthy events that are superseded by more accurate accounts as evidence accrues (e.g., a wildfire is initially believed to be arson-related but is later found to have been caused by a fallen power pole), to commonly encountered “myths” about causal relations (e.g., alleged links between childhood vaccinations and various negative health outcomes), to strategically disseminated disinformation that intends to deceive, confuse, and sow social division (e.g., doctored stories intended to discredit or denigrate a political opponent during an election campaign; see Lewandowsky et al. 2017).

From a psychological perspective, an insidious aspect of misinformation is that it often continues to influence people’s reasoning after a clear correction has been provided, even when there are no motivational reasons to dismiss the correction; this is known as the continued influence effect (CIE; Johnson and Seifert 1994; Rapp and Salovich 2018; Rich and Zaragoza 2016; Thorson 2016; for reviews see Chan et al. 2017; Lewandowsky et al. 2012; Walter and Tukachinsky 2020). Theoretically, the CIE is thought to arise either from failure to integrate the corrective information into the mental model of the respective event or causal relationship or from selective retrieval of the misinformation (e.g., familiarity-driven retrieval of the misinformation accompanied by failure to recollect the correction; see Ecker et al. 2010; Gordon et al. 2017, 2019; Rich and Zaragoza 2016; Walter and Tukachinsky 2020).

Given the omnipresence of misinformation, it is of great importance to investigate the factors that make corrections more effective. For example, corrections are more effective if they come from a more credible source (Ecker and Antonio 2020; Guillory and Geraci 2013; Vraga et al. 2020), contain greater detail (Chan et al. 2017; Swire et al. 2017), or provide a greater number of counterarguments (Ecker et al. 2019). However, even optimized debunking messages typically cannot eliminate the continued influence of misinformation, not even if reasoning is tested immediately after a correction is provided, let alone after a delay (see Ecker et al. 2010, 2020a; Paynter et al. 2019; Rich and Zaragoza 2016; Swire et al. 2017; Walter and Tukachinsky 2020). Thus, additional factors to enhance the effectiveness of corrections need to be identified. The present paper is therefore concerned with one particular avenue that might make corrections more effective; this is important because greater correction effects mean smaller continued influence effects.

Specifically, one piece of advice often given by educators and science communicators regarding the communication of complex information, such as misinformation corrections, is to use stories (e.g., Brewer et al. 2017; Caulfield et al. 2019; Dahlstrom 2014; Klassen 2010; Marsh et al. 2012; Shelby and Ernst 2013). For example, Shelby and Ernst (2013) argued that part of the reason why some misconceptions are common among the public is that disinformants use the power of storytelling, while fact-checkers often rely exclusively on facts and evidence. Indeed, people seem to be influenced by anecdotes and stories more so than stated facts or statistical evidence in their medical decision-making (Bakker et al. 2019; Fagerlin et al. 2005), risk perceptions (Betsch et al. 2013; de Wit et al. 2008; Haase et al. 2015), behavioral intentions and choices (Borgida and Nisbett 1977; Dillard et al. 2018), and attitudes (Lee and Leets 2002).

Despite some fragmentation in defining what constitutes a story, researchers generally agree that stories are defined by their chronology and causality: they depict characters pursuing goals over time, and may feature access to characters’ thoughts and emotions (Brewer and Lichtenstein 1982; Bruner 1986; Pennington and Hastie 1988; Shen et al. 2014; van Krieken and Sanders 2019). Research on narrative processing often contrasts narrative messages with non-narrative formats (such as those that feature statistics or facts, descriptive passages, or texts that use a list-based, informative format; sometimes these are also called “expository” or “informational” texts; Ratcliff and Sun 2020; Reinhart 2006; Shen et al. 2014; Zebregs et al. 2015b). Though non-narrative formats may differ in form and substance, they often share an abstract, logic-based, decontextualized message style (relative to narratives), and tend to evoke analytical processing. Research from advertising and consumer psychology suggests that even short messages featuring several lines of text can evoke narrative or analytical processing styles, based on their content (Chang 2009; Escalas 2007; Kim et al. 2017).

Stories can impact reasoning and decision making through several mechanisms (see Hamby et al. 2018; Shaffer et al. 2018). Compared to processing of non-narrative messages, narrative processing is usually associated with greater emotional involvement in the message (Busselle and Bilandzic 2008; Golke et al. 2019; Green and Brock 2000; Ratcliff and Sun 2020). While narrative and non-narrative messages can be cognitively engaging, the nature of engagement differs. Readers of narratives apply more imagery and visualization and may even report feelings of transportation into the world of the story, in which they experience story events as though they were happening to them personally (Bower and Morrow 1990; Green and Brock 2000; Hamby et al. 2018; Mar and Oatley 2008). Additionally, narrative processing tends to reduce resistance to message content; not only are narratives usually less overtly persuasive than their non-narrative counterparts, but audiences are often less motivated to generate counterarguments when processing narratives, as this would disrupt the enjoyable experience of immersion in the story (Green and Brock 2000; Krakow et al. 2018; Slater and Rouner 1996). Stories may thus lead to stronger encoding and comprehension of information embedded within because of the cognitive and emotional involvement they tend to evoke (Browning and Hohenstein 2015; Romero et al. 2005; Zabrucky and Moore 1999).

In addition, a story format may facilitate information retrieval (Bower and Clark 1969; Graesser et al. 1980). This may arise from the aforementioned enhanced processing at encoding, to the extent that enhanced encoding results in a more vivid and coherently integrated memory representation (Graesser and McNamara 2011). Bruner (1986) argued that the story format provides the most fundamental means by which people construct reality, and enhanced retrieval of information presented in story format may therefore also result from the fact that stories typically offer a structured series of retrieval cues (e.g., markers of spatiotemporal context or characters’ emotional states or introspections) that are consistent with the way in which people generally think. In the context of misinformation processing, a correction that is more easily retrieved during a subsequent reasoning task will naturally promote use of correct information and reduce reliance on the corrected misinformation (see Ecker et al. 2011).

However, the evidence regarding the persuasive superiority of the story format over non-narrative text is not entirely consistent. Some studies contrasting narrative and non-narrative formats of health-related messages found both formats equally able to effect changes to attitudes and behavioral intentions (Dunlop et al. 2010; Zebregs et al. 2015a). Greene and Brinn (2003) even reported that narratives were inferior to non-narrative texts in reducing use of tanning beds. Early meta-analyses found that narrative information is either less persuasive than statistical information (Allen and Preiss 1997) or that there is no clear difference in favor of either approach (Reinhart 2006). More recent meta-analyses, however, found stronger support for the narrative approach (e.g., Ratcliff and Sun 2020), while also highlighting that communication effectiveness depends on persuasion context: While Zebregs et al.’s (2015b) analysis found that narrative information was superior to statistical information when it comes to changing behavioral intentions, they found that statistical evidence had stronger effects on attitudes and beliefs. Shen, Sheer, and Li (2015) found that narratives were more effective than non-narrative communications when it came to fostering prevention but not cessation behaviors.

Similar to the approach taken in the present study, Golke et al. (2019) contrasted standard non-narrative texts with so-called informative narratives—enhanced fact-based texts that present essentially the same information as the standard non-narrative fact-based text, but in a storyline format. They found that the narrative format did not enhance reading comprehension and even reduced comprehension in two of their three experiments. Wolfe and Mienko (2007) found no retrieval benefit for informative narratives, and Wolfe and Woodwyk (2010) reported that readers showed enhanced integration of new information with existing knowledge when reading non-narrative texts compared to informative narratives. In the context of misinformation corrections, this may suggest that narrative elements may distract the reader from the core correction and/or that non-narrative corrections may facilitate integration of the correction into the reader’s mental model, which may render them more effective than informative-narrative corrections (see Kendeou et al. 2014).

In sum, while there may be some rationale in using a story format to correct misinformation, the question of whether corrections are more effective when they are given in a story format rather than a non-narrative format remains to be empirically tested. To the best of our knowledge, only one study has investigated the effectiveness of narrative corrections. Sangalang et al. (2019) explored whether narrative corrections could reduce smokers’ misinformed beliefs about tobacco. Results were inconclusive, as a narrative correction was found to reduce misconceptions in only one of the two experiments reported. Importantly, this study did not contrast narrative and non-narrative corrections. This was the aim of the present study.

In three experiments, we contrasted corrections that focus on factual evidence with corrections designed to present the same amount of relevant corrective information, but in a narrative format. In designing these corrections, we took inspiration from the broader literature on narrative persuasion reviewed above (in particular, Shen et al. 2014; van Krieken and Sanders 2019) to ensure narrative and non-narrative corrections differed on relevant dimensions. Narrative corrections featured characters’ experiences and points of view, quotes, chronological structure, and/or some form of complication or climax, whereas non-narrative corrections focused more on the specific facts and pieces of evidence, had a less engaging and emotive writing style, and adhered more closely to an inverted-pyramid format (essential facts followed by supportive evidence and more general background information).

In order to investigate the robustness of potential narrative effects, we aimed to correct both fictional event misinformation and real-world misconceptions: Experiment 1 used fictional event reports of the type used in most research on the continued influence effect (e.g., Ecker et al. 2017). The reports first introduced a piece of critical information that related to the cause of the event, while the correction refuted that piece of critical information. Participants’ inferential reasoning regarding the event, in particular their reliance on the critical information, was then measured via questionnaire. Experiment 2 corrected some common real-world “myths” while affirming some obscure facts (as in Swire et al. 2017). We measured change in participants’ beliefs, as well as their posttreatment inferential reasoning relating to the false claims. Experiment 3 examined the effect of correction format in the context of more controversial, real-world claims. To the extent that a narrative advantage arises from reduced resistance to the corrective message (see Green and Brock 2000; Krakow et al. 2018; Slater and Rouner 1996), it should become particularly apparent with corrections of worldview-consistent misconceptions. We hypothesized that narrative corrections will generally be more effective at reducing misinformation-congruent reasoning and beliefs.

In all experiments, we additionally manipulated retention interval (i.e., study-test delay). The rationale for this is as follows: Any potential story benefit might arise immediately—to the extent that the narrative format boosts engagement with and comprehension of the correction, and thus facilitates its mental-model integration. However, a story benefit may only arise after a delay, to the extent that the narrative format facilitates correction retrieval at test, which will be more relevant after some delay-related forgetting has occurred. In other words, if the narrative format is beneficial for retrieval, this benefit may not become apparent in an immediate test because participants are likely to remember both the narrative and the non-narrative correction just minutes after encoding; however, a story benefit may emerge with a delay, when the corrections are no longer “fresh” in one’s memory (see Ecker et al. 2020a; Swire et al. 2017).

Experiment 1

Method

Experiment 1 presented fictional event reports in four conditions. There were two control conditions: One featured no misinformation (noMI condition), another featured a piece of misinformation that was not corrected (noC condition). The two experimental conditions corrected the initially-provided misinformation using either a non-narrative (NN) or narrative (N) correction. The test phase followed the study phase either immediately or after a 2-day delay. The experiment thus used a mixed within-between design, with the within-subjects factor of condition (noMI; NN; N; noC), and the between-subjects factor of test delay (immediate; delayed).

Participants

Participants were US-based adults recruited via the platform Prolific.Footnote 1 An a priori power analysis (using G*Power 3; Faul et al. 2007) suggested a minimum sample size of N = 352 to detect a small difference between the two within-subjects experimental conditions (i.e., NN vs. N; effect size f = 0.15; α = 0.05, 1 − β = 0.8). As the core planned analyses tested for effects in each delay condition separately, and to achieve an adequate sample size post-exclusions, it was thus decided to aim for a total of N = 800 participants pre-exclusions (n = 400 per delay condition). Due to inevitable dropout in the delayed condition (estimated at 20%), this condition was oversampled by a factor of 1.25 (i.e., 500 participants completed the study phase).
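For transparency, the sample-size target can be approximated outside of G*Power. The following is an illustrative sketch only, not the original G*Power calculation: it assumes the planned NN-versus-N comparison is treated as a single-df fixed-effects F contrast with noncentrality λ = f² × N and error degrees of freedom of N − 4, which under these assumptions yields a minimum N close to the reported value.

```python
# Illustrative re-computation of the a priori power analysis (assumptions noted above;
# this is not the original G*Power run).
from scipy import stats

def contrast_power(n, f=0.15, alpha=0.05, k=4):
    """Power of a single-df F contrast for total sample size n."""
    lam = (f ** 2) * n                            # noncentrality parameter
    df_error = n - k                              # error degrees of freedom
    f_crit = stats.f.ppf(1 - alpha, 1, df_error)  # critical F value
    return stats.ncf.sf(f_crit, 1, df_error, lam) # power = P(F > F_crit)

n = 10
while contrast_power(n) < 0.80:
    n += 1
print(n)  # should land close to the reported minimum of N = 352
```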

A total of 844 participants completed Experiment 1. Retention of participants in the delayed condition was slightly greater than expected (approx. 89%). After applying preregistered exclusions (described in “Results” section), the final sample size for analysis was N = 770 (n = 357 and n = 413 in the immediate and delayed conditions, respectively); the sample comprised 383 men, 379 women, and 8 participants of undisclosed gender; mean age was M = 34.01 years (SD = 11.56, age range 18–89).

Materials

Experiment 1 used four fictitious event reports detailing four different newsworthy events (e.g., a wildfire); each report comprised two articles. In the study phase, participants were presented with all four reports in the four different conditions. In three of the conditions, the report’s first article contained a piece of misinformation (e.g., the wildfire was caused by arson; this was simply omitted from the report in the no-misinformation condition); in these conditions, the report’s second article either contained or did not contain a correction. If a correction was provided, it was given in either a non-narrative format (e.g., explaining that an investigation had found that a rotten power pole had fallen and the power line had melted on the ground, starting the fire) or a narrative format (e.g., explaining that a fire chief inspected the scene, found the power pole, noticed the rot, and discovered that the power line had melted on the ground, concluding it had started the fire). Narrative and non-narrative corrections thus presented the same critical corrective information, but differed in the way it was presented: Narrative corrections featured specific characters and a causally ordered description sequence; non-narrative corrections featured objective, generalized descriptions of the events (per our definition of narrative and non-narrative format; Brewer and Lichtenstein 1982; Bruner 1986; Pennington and Hastie 1988; Shen et al. 2014; van Krieken and Sanders 2019). All reports thus existed in four versions (matching the conditions; all report versions are provided in “Appendix”). We aimed to keep non-narrative and narrative reports as equivalent as possible in terms of informativeness, length, and reading difficulty. A pilot study confirmed that our narrative corrections were perceived as more “story-like” than the non-narrative corrections, and also as more vivid and more easily allowing the events to be imagined. By contrast, the two correction versions were rated as relatively comparable on informativeness and comprehensibility (for details, see “Appendix”). Assignment of event reports to experimental conditions, as well as condition and event order, was counterbalanced across participants using four different presentation sequences in a Latin-square design, as shown in Table 1.

Table 1 Presentation sequences (S1–4) used in experiment 1

The test comprised a memory question and six inference questions per report. The memory questions were four-alternative-choice questions targeting an arbitrary detail provided twice in the report (once in each article; e.g., “The fire came close to the town of Cranbrook/Kimberley/Lumberton/Bull River”). The sole purpose of the memory questions was to ensure adequate encoding; data from participants who did not demonstrate adequate encoding were excluded from analysis (see exclusion criteria below). The inference questions were designed to measure misinformation-congruent inferential reasoning, following previous CIE research (e.g., Ecker et al. 2017). Five of the six inference questions per report were rating scales asking participants to rate their agreement with a misinformation-related statement on a 0–10 Likert scale (e.g., “Devastating wildfire intentionally lit” would be an appropriate headline for the report). One inference question was a four-alternative-choice question targeting the misinformation directly (e.g., “What do you think caused the wildfire? Arson/Lightning/Power line/None of the above”). Such measures have been found appropriate for online CIE studies (Connor Desai and Reimers 2019). All questions are provided in “Appendix”.

All materials were presented via experimental surveys designed and administered via Qualtrics (Qualtrics, Provo, UT). The survey file, including all materials, is available on the Open Science Framework (https://osf.io/gtm9z/). Surveys with immediate and delayed tests were necessarily run separately due to the need for different signup instructions (the immediate survey was run at the same time as the delayed test). Participants in the delayed condition were reminded via e-mail to complete the test phase 48 h after launch of the study phase; they had 48 h to complete from launch of the test phase but were encouraged to complete within 24 h.

The experiment took approximately 12 min. Participants in the immediate condition were reimbursed GBP1.50 (approx. US$1.95) via Prolific; participants in the delayed condition were reimbursed GBP0.70 (approx. US$0.90) for the study phase and GBP0.80 (approx. US$1.05) for the test phase.

Procedure

Initially, participants were provided with an ethics-approved information sheet. Participants were asked to rate their English proficiency (1: excellent to 5: poor) and to provide their gender, age, and country of residence. The four reports were then presented, with each article shown on a separate screen for a fixed minimum time (set at approx. 150 ms per word).

The test followed after a short (1-min, filled with a word puzzle) or long (2 days) retention interval. Participants were presented with a questionnaire for each report, each comprising the memory question and the six inference questions. The order of questionnaires followed the order of the reports in the study phase; the order of questions in each questionnaire was fixed (see “Appendix”).

Following the test phase, participants were given a “data use” question asking them to provide honest feedback on whether or not their data should be included in our analysis (“In your honest opinion should we use your data in our analysis? This is not related to how well you think you performed, but whether you put in a reasonable effort.”). This question could be answered with “Yes, I put in reasonable effort (1)”; “Maybe, I was a little distracted (2)”; or “No, I really wasn’t paying any attention (3)”.

Results

Data analysis was preregistered at https://osf.io/svy6f; the data are available at https://osf.io/gtm9z/. Analysis adhered to the following procedure: First, exclusion criteria were applied. We excluded data from participants who (a) indicated they do not reside in the USA (n = 0); (b) indicated their English proficiency is only “fair” or “poor” (n = 3); (c) responded to the “data use” question with “No [do not use my data], I really wasn’t paying any attention” (n = 5); (d) failed three or more memory questions in the immediate test (n = 28), or all four in the delayed test (n = 15)Footnote 2; (e) responded in a “cynical” manner by selecting the “none of the above” response option for all four multiple-choice inference questions (n = 1); and (f) responded uniformly (a response SD across all 20 raw rating-scale inference-question responses < 0.5; n = 22). Finally, to identify inconsistent, erratic responding, we calculated response SD for each set of five inference questions and then calculated mean SD across the four sets. We (g) excluded outliers on this measure, using the interquartile rule with a 2.2 multiplier (i.e., cutoff = Q3 + 2.2 × IQR; Hoaglin and Iglewicz 1987; n = 0).
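For illustration, the erratic-responding cutoff in criterion (g) can be expressed in a few lines of code. The sketch below assumes one mean response SD per participant (the mean of the four per-report SDs) and simply applies the interquartile rule with the 2.2 multiplier.

```python
import numpy as np

def erratic_cutoff(mean_sds, multiplier=2.2):
    """Upper cutoff for mean within-report response SDs (interquartile rule)."""
    q1, q3 = np.percentile(mean_sds, [25, 75])
    return q3 + multiplier * (q3 - q1)

# mean_sds: one value per participant, i.e., the mean of the four per-report SDs
# computed over each report's five rating-scale inference questions;
# participants whose value exceeds the cutoff are flagged as outliers.
```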

We coded the multiple-choice inference-question responses as either 10 (misinformation option) or 0 (non-misinformation options). We then calculated four mean inference scores for the noC, NN, N, and noMI conditions; this was the main dependent variable, with greater scores reflecting greater misinformation reliance. We ran a two-way mixed ANOVA with factors condition (within-subjects) and delay (between-subjects) on inference scores (see Fig. 1). This yielded significant main effects of condition, F(3,2304) = 250.94, MSE = 4.79, ηp² = .246, p < .001, and delay, F(1,768) = 11.33, MSE = 15.77, ηp² = .015, p ≤ .001, which were qualified by a significant interaction, F(3,2304) = 10.75, ηp² = .014, p < .001, such that inference scores were higher after delay in all conditions but the no-correction condition. We tested the core hypothesis with planned contrasts, assessing the difference between NN and N conditions (planned contrast: NN > N; i.e., narrative correction more effective at reducing reliance on misinformation than non-narrative correction) in each delay condition; both contrasts were nonsignificant, Fs < 1. There was thus no difference between non-narrative and narrative corrections.
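As an illustration of this scoring and the omnibus analysis, the following sketch uses the pingouin package; the data frame and column names are hypothetical, not taken from the study materials.

```python
import pandas as pd
import pingouin as pg

# long-format data, one row per participant x report (hypothetical column names):
#   'id', 'delay' ('immediate'/'delayed'), 'condition' ('noMI'/'NN'/'N'/'noC'),
#   'mc_response' (chosen multiple-choice option), 'q1'-'q5' (rating-scale responses)
long["mc_score"] = (long["mc_response"] == "misinformation").astype(int) * 10
long["inference"] = long[["q1", "q2", "q3", "q4", "q5", "mc_score"]].mean(axis=1)

# 4 (condition, within-subjects) x 2 (delay, between-subjects) mixed ANOVA
aov = pg.mixed_anova(data=long, dv="inference", within="condition",
                     between="delay", subject="id")
print(aov)
```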

Fig. 1

Mean inference scores across conditions in Experiment 1. noMI, no-misinformation; noC, no correction; NN, non-narrative; N, narrative. Greater values indicate greater misinformation reliance. Error bars indicate within-subjects standard error of the mean (Morey 2008)

We also tested the interaction contrast of NN versus N × immediate versus delayed. The direction of a potential interaction was not prespecified: We speculated that a potential narrative benefit may only emerge after a delay if the effect reflects retrieval facilitation, or may emerge immediately if it reflects stronger correction encoding or integration into the mental event model. However, the contrast was nonsignificant, F < 1.

To complement this frequentist analysis (and to quantify evidence in favor of the null), we ran Bayesian t-tests comparing NN and N in both delay conditions. In the immediate condition, this returned a Bayes Factor of BF01 = 12.26; in the delayed condition, we found BF01 = 17.76. This means that the data are approx. 12–18 times more likely under the null hypothesis of no difference between narrative conditions. This constitutes strong evidence in favor of the null (Wagenmakers et al. 2018).
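Building on the hypothetical data frame from the sketch above, the Bayesian comparisons can be reproduced with pingouin's default JZS Bayes factor for a paired t-test, inverting BF10 to obtain BF01; this is an illustrative sketch rather than the analysis pipeline actually used.

```python
import pingouin as pg

for delay_group, d in long.groupby("delay"):
    wide = d.pivot(index="id", columns="condition", values="inference")
    res = pg.ttest(wide["NN"], wide["N"], paired=True)  # reports BF10 (default JZS prior)
    bf01 = 1.0 / float(res["BF10"].iloc[0])             # evidence in favor of the null
    print(delay_group, round(bf01, 2))
```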

Finally, for the sake of completeness, we ran an additional series of five secondary planned contrasts for each delay condition (see Table 2). Statistical significance was established using the Holm-Bonferroni correction, applied separately to each set of contrasts. These contrasts demonstrated that uncorrected misinformation increased reliance on the misinformation relative to the no-misinformation baseline and that corrections were very effective, strongly reducing misinformation reliance, albeit not quite down to baseline, which demonstrates the presence of a small continued influence effect.
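For reference, the Holm-Bonferroni step-down correction applied to each set of secondary contrasts corresponds to the following sketch; the p-values shown are placeholders, not the values reported in Table 2.

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.004, 0.020, 0.300, 0.650]  # placeholder p-values for one set of five contrasts
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print(reject, p_adj)
```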

Table 2 Secondary contrasts run in Experiment 1

We performed two additional analyses that were not preregistered. First, we tested whether correction effects were reduced after a delay, as would be expected based on previous research (e.g., Paynter et al. 2019; Swire et al. 2017). To this end, we tested the interaction contrast of immediate versus delayed test × no-correction versus (pooled) correction conditions. This yielded a significant result, F(1,768) = 20.49, MSE = 6.62, ηp² = .026, p < .001, confirming the expectation. Second, we tested for the effect of delay on memory performance, finding that, as expected, memory was better in the immediate test (M = .81, SE = .013) compared to the delayed test (M = .62, SE = .013), F(1,808) = 106.23, MSE = .07, ηp² = .116, p < .001 (this analysis included participants who failed exclusion criterion (d) related to memory performance).

Discussion

Experiment 1 investigated whether corrections of event-related misinformation are more effective if presented in a narrative format. In line with much previous research (e.g., Chan et al. 2017; Walter and Tukachinsky 2020), we found a continued influence effect, in that corrected misinformation had a small but reliable effect on inferential reasoning. Also congruent with previous work, we found reduced memory and correction impact after a delay, which are both easily explained through standard forgetting of materials (see Paynter et al. 2019; Swire et al. 2017). However, results did not support the core hypothesis: narrative and non-narrative corrections were equally effective at reducing the effects of the misinformation. This suggests that the narrative format did not substantially facilitate comprehension of the corrective information, its integration into the event model, or its later retrieval during reasoning. It is possible, however, that no narrative advantage was observed because the event reports provided sufficient narrative scaffolding in both conditions. In other words, to the extent that the events were already processed as narratives, it may have been easy to integrate the correction in either condition, and as such the format of the correction itself may not have provided additional benefit. It is, therefore, possible that a narrative advantage may only arise with misinformation that is not part of an event report. To test this, Experiment 2 used false real-world claims.

Experiment 2

To examine the robustness and generality of the results of Experiment 1, Experiment 2 examined the effect of narrative versus non-narrative corrections on real-world beliefs.

Method

Experiment 2 presented claims encountered in the real world, including both true “facts” and common misconceptions, henceforth referred to as “myths”. Claims were followed by explanations that affirmed the facts and corrected the myths. Corrections were either in a non-narrative (NN) or narrative (N) form, and the test was again either immediate or delayed. Thus, Experiment 2 had a 2 × 2 mixed within-between design, with the within-subjects factor of correction type (NN; N) and the between-subjects factor of test delay (immediate; delayed). Fact-affirmation trials acted as fillers outside of this design (although basic affirmation effects are reported).

Participants

Experiment 2 used the same recruitment procedures as Experiment 1. Sample size was increased by 10% to allow for the exclusion of participants with more than one initial myth-belief rating of zero (see below).Footnote 3 Participants who had taken part in Experiment 1 were not allowed to participate in Experiment 2.

A total of 906 participants completed Experiment 2. Retention of participants in the delayed condition was approx. 85%. After applying preregistered exclusion criteria (described in “Results” section), the final sample size for analysis was N = 776 (n = 385 and n = 391 in the immediate and delayed conditions, respectively); the sample comprised 375 men, 393 women, seven nonbinary participants, and one participant of undisclosed gender; mean age was M = 33.47 years (SD = 11.44, age range 18–78).

Materials

Experiment 2 used eight claims (four myths; four facts). An example myth is “Gastritis and stomach ulcers are caused by excessive stress.” The non-narrative corrections explained the evidence against the claim (e.g., that there is evidence that gastritis and stomach ulcers are primarily caused by the bacterium Helicobacter pylori and that this discovery earned the scientists involved a Nobel Prize); the narrative correction detailed the story behind this discovery (e.g., that a scientist drank a broth contaminated with the bacterium to prove his hypothesis, which earned him and his colleague a Nobel Prize). Again, a pilot study confirmed that the narrative corrections were perceived as more story-like and vivid than the non-narrative correction, while being relatively comparable on informativeness and comprehensibility dimensions (see “Appendix” for details). Fact affirmations were of an expository nature similar to the non-narrative corrections. All claims and explanations are provided in “Appendix”.

Each participant received two NN and two N corrections. Assignment of claims (myths MA-D) to correction type was counterbalanced, using all six possible combinations (presentation versions V1-6 shown in Table 3); the presentation order of the eight claims (and thus the order of corrections/affirmations as well as narrative conditions) was randomized.
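The six presentation versions simply correspond to all possible ways of assigning two of the four myths to the narrative condition, with the remaining two receiving non-narrative corrections; a minimal illustrative sketch:

```python
from itertools import combinations

myths = ["MA", "MB", "MC", "MD"]
# each version assigns two myths to narrative (N) corrections and the other two
# to non-narrative (NN) corrections
versions = [{"N": pair, "NN": tuple(m for m in myths if m not in pair)}
            for pair in combinations(myths, 2)]
print(len(versions))  # 6 versions (V1-6)
```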

Table 3 Presentation versions used in Experiment 2

Participants rated their belief in each claim on a 0–10 Likert scale immediately after its initial presentation in the study phase (pre-explanation), and again at test (post-explanation). In addition to the second belief rating, the test comprised three inference questions per claim, each requiring a rating of agreement with a statement on a 0–10 Likert scale. The inference questions were designed to measure claim-congruent inferential reasoning (e.g., “Patients with stomach ulcers should avoid any type of stress”). All questions are provided in “Appendix”.

Administration of the survey proceeded as in Experiment 1; the survey file is available at https://osf.io/gtm9z/. The experiment took approximately 10 min. Participants in the immediate condition were reimbursed GBP1.25 (approx. US$1.60) via Prolific; participants in the delayed condition were reimbursed GBP0.60 (US$0.77) for the study phase and GBP0.65 (US$0.83) for the test phase.

Procedure

The initial part of the survey was similar to Experiment 1. In the study phase, participants were presented with all eight claims and rated their belief in each. Each rating was followed by an affirmation or a non-narrative or narrative correction. Materials were again presented for fixed minimum times, and the test phase was immediate or delayed (retention interval 1 min vs. 2 days). In the test phase, participants were first presented with the questionnaires of three inference questions per claim. The order of questionnaires was randomized; the order of questions in each questionnaire was fixed (see “Appendix”). Subsequently, participants rated their belief in all claims for a second time. Following the test phase, participants were presented with a “data use” question as in Experiment 1.

Results

Data analysis was preregistered at https://osf.io/akugv; the data are available at https://osf.io/gtm9z/. Analysis adhered to the following procedure: First, exclusion criteria were applied. We excluded data from participants who (a) indicated they do not reside in the USA (n = 2); (b) indicated their English proficiency is “fair” or “poor” (n = 2); (c) responded to the “data use” question with “No [do not use my data], I really wasn’t paying any attention” (n = 1); or (d) responded uniformly (a response SD across all 24 raw rating-scale inference-question responses < 0.5; n = 17). To identify inconsistent, erratic responding, we calculated response SD for each set of four test-phase questions and then calculated mean SD across the eight sets. We (e) excluded outliers on this measure, using the interquartile rule with a 2.2 multiplier (i.e., cutoff = Q3 + 2.2 × IQR; n = 4). Finally, we excluded participants who (f) had more than one initial myth-belief rating of zero (n = 104).

We calculated four dependent variables relating to myth corrections and fact affirmations, respectively: mean belief-rating change (belief-rating 2 − belief-rating 1) for the NN and N conditions, and mean inference scores for the NN and N conditions. We first ran a two-way mixed ANOVA with factors condition (within-subjects) and delay (between-subjects) on myth-belief-change scores (see Fig. 2). This yielded a significant main effect of delay, F(1,774) = 10.78, MSE = 10.90, ηp² = .014, p = .001, indicating greater belief change in the immediate test. Both the main effect of condition and the interaction were nonsignificant, F < 1. The planned contrasts of NN versus N conditions at either delay were also nonsignificant, F < 1. Mean belief change for facts was M = 3.66 (SD = 2.39) in the immediate test and M = 3.87 (SD = 2.35) in the delayed test. Both values differed significantly from zero, t(384/390) > 30.05, p < .001, but not from each other, F(1,774) = 1.47, MSE = 5.62, ηp² = .002, p = .225.

Fig. 2

Mean myth-belief-change scores across conditions in Experiment 2; theoretically possible range was +10 to −10. Error bars indicate within-subjects standard error of the mean (Morey 2008)

We then ran the same two-way mixed ANOVA on inference scores (see Fig. 3). This yielded a significant main effect of delay, F(1,774) = 8.52, MSE = 10.44, ηp² = .011, p = .004, indicating lower scores in the immediate test. There was also a marginal main effect of condition, F(1,774) = 3.98, MSE = 2.65, ηp² = .005, p = .046, suggesting lower scores in the narrative condition (F < 1 for the interaction). However, the core planned NN versus N contrast was nonsignificant in both the immediate test, F(1,774) = 2.90, ηp² = .004, p = .089, and the delayed test, F(1,774) = 1.25, ηp² = .002, p = .264. Mean inference scores for facts were M = 7.77 (SD = 1.18) in the immediate test and M = 7.65 (SD = 1.26) in the delayed test; this was not a significant difference, F(1,774) = 1.95, MSE = 1.49, ηp² = .003, p = .163.

Fig. 3

Mean myth inference scores across conditions in Experiment 2. Greater values indicate greater misinformation reliance. Error bars indicate within-subjects standard error of the mean (Morey 2008)

To complement this frequentist analysis (and to quantify evidence in favor of the null), we ran Bayesian t tests comparing NN and N in both delay conditions. We first did this with belief-change scores: In the immediate condition, this returned a Bayes Factor of BF01 = 17.37; in the delayed condition, we found BF01 = 17.55. This means that the data are approx. 17 times more likely under the null hypothesis of no difference between narrative conditions, which is strong evidence in favor of the null (Wagenmakers et al. 2018). We then tested inference scores: In the immediate condition, this returned BF01 = 3.70; in the delayed condition, we found BF01 = 9.92. This means that the data are approx. 4–10 times more likely under the null hypothesis of no difference between narrative conditions; this constitutes moderate evidence in favor of the null (Wagenmakers et al. 2018).

Furthermore, to take initial belief levels into account more generally, we ran linear mixed-effects models. Presentation version and participant ID (nested in presentation version) were included as random effects, and experimental condition, delay, their interaction, and initial belief were fixed effects, predicting test-phase myth-belief ratings and inference scores. As with the ANOVAs, we did this for the full 2 × 2 design, but also separately for each delay condition, thus with only condition and initial belief as fixed effects. Results are provided in Table 4. In the full design, myth belief at test (belief rating 2) was predicted significantly by delay and the initial belief rating 1. Inference scores were likewise predicted significantly by delay and belief rating 1. In both cases, experimental condition was not a significant predictor. When analyses were restricted to the immediate and delayed conditions, respectively, the results were comparable: only initial belief was a significant predictor of test-phase belief, and experimental condition was not a significant predictor.
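The software used to fit these models is not detailed here; as an illustration, an equivalent specification in Python's statsmodels, with hypothetical column names and participants nested within presentation version as a variance component, could look as follows.

```python
import statsmodels.formula.api as smf

# df: one row per participant x corrected myth, with hypothetical columns
#   'belief2' (test-phase belief), 'belief1' (initial belief),
#   'condition' (NN/N), 'delay' (immediate/delayed), 'version', 'pid'
model = smf.mixedlm(
    "belief2 ~ C(condition) * C(delay) + belief1",  # fixed effects
    data=df,
    groups="version",                    # random intercept for presentation version
    vc_formula={"pid": "0 + C(pid)"},    # participant ID nested within version
)
result = model.fit()
print(result.summary())
```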

Table 4 Linear mixed-effects modeling results in Experiment 2

Discussion

Experiment 2 tested whether corrections targeting real-world misconceptions are more effective if they are provided in a narrative versus non-narrative format. The results were clear-cut: While corrections effected substantial belief change, which was only moderately reduced by a 2-day delay, there was no difference between narrative and non-narrative conditions. When assessing myth beliefs through more indirect post-correction inference questions, there was likewise little evidence of a narrative benefit: While the main effect of condition was marginally significant in the omnibus analysis, the core contrasts of narrative and non-narrative conditions at each delay were nonsignificant. Moreover, the Bayesian analyses consistently provided support in favor of the null hypothesis of no difference between narrative and non-narrative conditions.

Experiments 1 and 2 therefore provide evidence that narrative corrections do not promote more event-memory updating or knowledge revision than non-narrative corrections. These results suggest that the narrative format does not facilitate comprehension, integration, or retrieval of the correction. However, it is possible that the narrative format produces a corrective benefit in situations where there might be some opposition to the content of the correction, given past work showing that narratives reduce resistance to persuasive messages relative to non-narrative counterparts (see Green and Brock 2000; Krakow et al. 2018; Slater and Rouner 1996). Experiment 3 tested this possibility.

Experiment 3

Narratives reduce counterarguing relative to non-narrative messages (Green and Brock 2000; Slater and Rouner 1996). One might, therefore, suggest that narrative-format corrections should be particularly effective (relative to non-narrative corrections) when the content of a message challenges a person’s worldview. Experiment 3 examined the effect of messages addressing more controversial, real-world claims, where a correction can be expected to be worldview-inconsistent for the majority of participants. It therefore enabled a more focused test of underlying process, as well as an examination of the effect of corrective message format in a context of practical significance. Specifically, two myths expected to resonate with more conservative participants were used, and only people who identified as conservative were recruited as participants.

Method

Experiment 3 presented claims encountered in the real world, including both facts and myths, which were followed by affirmations and corrections. Corrections were again either non-narrative (NN) or narrative (N), and the test was immediate or delayed. Thus, Experiment 3 had a 2 × 2 mixed within-between design, with the within-subjects factor of correction type (NN; N) and the between-subjects factor of test delay (immediate; delayed). Fact-affirmation trials acted as fillers outside of this design (although basic affirmation effects will be reported).

Participants

Target sample size was the same as in Experiment 2, but we used a sample of adult US residents who indicated that they identify as politically conservative, recruited via Prolific.Footnote 4 Participants who participated in Experiment 1 or 2 were not allowed to participate in Experiment 3. Similar to Experiment 2, oversampling (again, by 10%) was applied to account for exclusions of participants with low initial myth-belief ratings. Due to a large number of exclusions based on preregistered criteria, minor resampling was used to achieve the required sample size, as per the preregistered plan.

Initially, a total of 953 participants completed Experiment 3. Retention of participants in the delayed condition was greater than expected (approx. 93%). After applying preregistered exclusion criteria (described in “Results” section), 725 participants remained, with n = 345 in the immediate condition and n = 380 in the delayed condition. As the number of participants in the immediate condition dropped below the minimum prespecified cell size of n = 352, we resampled, following the preregistered plan, obtaining an additional eight participants in the immediate condition. The final sample size for analysis was N = 733 (n = 353 and n = 380 in the immediate and delayed conditions, respectively); the sample comprised 435 men, 297 women, and one participant of undisclosed gender; mean age was M = 38.47 years (SD = 14.22, age range 18–84).

Materials

Experiment 3 used four claims (two myths; two facts). One myth was “Humans are made to eat red meat; it should be part of every person’s diet.” The other was “Children of homosexual parents have more mental health issues.”Footnote 5 The non-narrative corrections explained the evidence suggesting that the claim is false (e.g., evidence that eating red meat on a regular basis will shorten people’s lifespans and that replacing it with other foods could lower mortality risk by 7 to 19%); the narrative corrections contained the same facts but were presented as a quote from someone to whom the claim is directly relevant (e.g., a meat-lover explaining how their daughter pleaded with them to eat less red meat and rotate in other foods). Again, a pilot study confirmed that the narrative corrections were perceived as more story-like and vivid than the non-narrative correction, while being relatively comparable on informativeness and comprehensibility dimensions (see “Appendix” for details).Footnote 6 Fact affirmations were expository in nature, similar to the non-narrative corrections. All claims and explanations are provided in “Appendix”. Each participant received one NN and one N correction. The correction type applied to each myth was counterbalanced, and presentation order of the claims was randomized. Measures were implemented as in Experiment 2 (an example inference question is “To maintain a healthy diet, people should regularly consume red meat”). All questions are provided in “Appendix”.

Administration of the survey proceeded as in Experiment 2; the survey file is available at https://osf.io/gtm9z/. The experiment took approximately 8 min. Participants in the immediate condition were reimbursed GBP1 (approx. US$1.30) via Prolific; participants in the delayed condition were reimbursed GBP0.45 (US$0.60) for the study phase and GBP0.55 (US$0.70) for the test phase.

Procedure

The procedure was identical to Experiment 2 (with the exception that participants viewed only four claims).

Results

Data analysis was preregistered at https://osf.io/5yxse. Analysis adhered to the same procedure as Experiment 2: First, exclusion criteria were applied. We excluded data from participants who (a) indicated they do not reside in the USA (n = 2); (b) indicated their English proficiency is “fair” or “poor” (n = 0); (c) responded to the “data use” question with “No [do not use my data], I really wasn’t paying any attention” (n = 1); or (d) responded uniformly (a response SD across all 12 raw rating-scale inference-question responses < 0.5; n = 24). To identify inconsistent, erratic responding, we calculated response SD for each set of four test-phase questions, then calculated mean SD across the four sets. We (e) excluded outliers on this measure, using the interquartile rule (i.e., cutoff = Q3 + 2.2 × IQR; n = 6). Finally, we excluded participants with any initial myth-belief rating < 1, or both initial myth-belief ratings < 2 (n = 195).Footnote 7

We calculated mean belief-rating change (belief-rating 2 − belief-rating 1) for the NN and N conditions, and mean inference scores for the NN and N conditions. We first ran a two-way mixed ANOVA with factors condition (within-subjects) and delay (between-subjects) on myth-belief-change scores (see Fig. 4). This yielded a significant main effect of delay, F(1,731) = 16.23, MSE = 9.71, ηp² = .022, p < .001, indicating greater belief change in the immediate test. Both the main effect of condition and the interaction were nonsignificant, F ≤ 1.06. The planned contrasts of NN versus N conditions at either delay were also nonsignificant, F ≤ 1.16. Mean belief change for facts was M = 1.80 (SD = 1.86) in the immediate test and M = 1.46 (SD = 1.93) in the delayed test. Both values differed significantly from zero, t(352/379) > 14.71, p < .001, and also from each other, F(1,731) = 5.90, MSE = 3.61, ηp² = .008, p = .015.

Fig. 4

Mean myth-belief-change scores across conditions in Experiment 3; theoretically possible range was +10 to −10. Error bars indicate within-subjects standard error of the mean (Morey 2008)

We then ran the same two-way mixed ANOVA on inference scores (see Fig. 5). This yielded a significant main effect of delay, F(1,731) = 9.49, MSE = 10.62, ηp² = .013, p = .002, indicating lower scores in the immediate test. There was no main effect of condition, F < 1, but a significant delay × condition interaction, F(1,731) = 5.78, MSE = 4.68, ηp² = .008, p = .016. The core planned NN versus N contrast was nonsignificant in the immediate test, F(1,731) = 1.73, ηp² = .002, p = .188. The contrast was significant in the delayed test, F(1,731) = 4.40, ηp² = .006, p = .036; however, this effect was in the direction opposite to that predicted, with lower inference scores in the non-narrative condition. Mean inference scores for facts were M = 7.87 (SD = 1.53) in the immediate test and M = 7.92 (SD = 1.46) in the delayed test; this difference was not significant, F < 1.

Fig. 5

Mean myth inference scores across conditions in Experiment 3. Greater values indicate greater misinformation reliance. Error bars indicate within-subjects standard error of the mean (Morey 2008)

As in Experiment 2, we ran complementary Bayesian t tests comparing the effect of correction format in both delay conditions, separately. We first examined the effect on belief-change scores: In the immediate condition, this returned a Bayes Factor of BF01 = 9.39; in the delayed condition, we found BF01 = 16.25. These results provide moderate to strong evidence in favor of the null. We then tested the effect on inference scores: In the immediate condition, this returned BF01 = 7.03, providing moderate evidence in favor of the null; in the delayed condition, we found BF01 = 2.03, which provides only anecdotal evidence, but also in favor of the null (Wagenmakers et al. 2018).Footnote 8

As in Experiment 2, we ran linear mixed-effects models to take initial myth belief into account. Results are provided in Table 5. In the full design, delay and the initial belief rating 1 predicted test-phase myth belief (belief rating 2). Inference scores were predicted only by belief rating 1. In both cases, experimental condition was not a significant predictor. Analyses restricted to the immediate and delayed conditions, respectively, yielded comparable results: Initial myth belief was a significant predictor of test-phase belief and experimental condition was not.

Table 5 Linear mixed-effects modeling results in Experiment 3

Discussion

Experiment 3 tested whether narrative corrections would be more effective than non-narrative corrections when debunking worldview-consistent misconceptions. It has been argued that efforts to correct such worldview-supported beliefs are potentially less effective (Lewandowsky et al. 2012; Nyhan and Reifler 2010; but see Ecker et al. 2020; Swire-Thompson et al. 2020; Wood and Porter 2019). Therefore, identifying ways to successfully reduce belief in worldview-consistent misinformation may be particularly valuable. The corrections applied in this study did not change beliefs as much as in Experiment 2, presumably due to the effect of worldview. More importantly, narrative corrections were not more effective in reducing beliefs than non-narrative corrections. While there was a small effect of correction format on inference scores in the delayed condition, this effect indicated more misinformation reliance in the narrative condition compared to the non-narrative condition. However, we do not interpret this finding as suggesting that narrative corrections are inferior, given that in the pilot study the non-narrative corrections in Experiment 3 were rated as slightly more informative than the narrative corrections.

General discussion

In three experiments, we tested the hypothesis that narrative corrections are more effective than non-narrative corrections at reducing misinformation belief and reliance. We observed a range of findings that conform to previous research: We found a small continued influence effect in Experiment 1; correction effects were generally larger in the immediate versus delayed tests; and post-correction belief ratings and inference scores were predicted by test-phase delay and initial belief ratings in the mixed-effects modeling. However, with regard to the core hypothesis of a narrative benefit, results were clear-cut: The narrative versus non-narrative format of the correction had no impact on the correction’s effectiveness, in terms of either misinformation belief change or inferential reasoning scores.

Theoretically, we proposed that narrative corrections might be more effective due to (1) enhanced processing of the correction, as stories tend to result in stronger emotional involvement and transportation (e.g., Green and Brock 2000; Hamby et al. 2018); (2) suppression of counterargument generation, caused by immersion in the narrative (e.g., Green and Brock 2000; Slater and Rouner 1996); or (3) enhanced retrieval, resulting either from a more vivid memory representation or the availability of potent retrieval cues relating to the narrative structure (e.g., Bruner 1986; Graesser and McNamara 2011). Our results provided no support for these proposals. Instead, results suggest that the narrative versus non-narrative format does not matter for misinformation debunking, as long as corrections are easy to comprehend and contain useful, relevant, and credible information (see Lewandowsky et al. 2020; Paynter et al. 2019). An alternative interpretation is that a narrative format potentially does have benefits, but that these were offset in our study by the narrative elements distracting from the correction’s core message. However, given that the null effect of correction format was replicated across three experiments with substantial differences in materials, we prefer the simpler interpretation that the format of a correction (narrative or non-narrative) has little effect on a corrective message’s efficacy.

This, in turn, suggests that anecdotal evidence for the superiority of narrative corrections may have arisen from confounds between the narrative versus non-narrative correction format and other elements such as the amount, quality (i.e., persuasiveness), or novelty of information provided. For example, past work shows that effective corrections contain greater detail (e.g., Chan et al. 2017; Swire et al. 2017) or feature a causal alternative explanation (e.g., Ecker et al. 2010; Johnson and Seifert 1994). In the current work, we held constant not only the amount but also the type of corrective details (i.e., causal explanations) included in each correction.

The present study contributes broadly to the substantial body of research comparing the persuasive efficacy of different message formats, which has yielded conflicting results: While some work shows that narratives and non-narratives are equally persuasive (Dunlop et al. 2010), other findings suggest that one format is superior to the other (Greene and Brinn 2003; Ratcliff and Sun 2020; Zebregs et al. 2015a). These diverging results suggest that a line of inquiry directed toward identifying when message format makes a difference in both initial and corrective persuasion may be fruitful. For instance, the claim and corrective contexts examined in the current work generally mirrored those that are encountered in news media. A recent meta-analysis (Freling et al. 2020) identified message content as a determinant of the persuasive efficacy of message format, such that narrative-based messages are more persuasive when emotional engagement is high (as when focal content involves a severe threat to health or oneself). It is similarly possible that the format of a corrective message may matter when the topic is emotionally engaging, but not in more generally informative scenarios such as those examined in the present work. In support of this position, it has been suggested that personal experiences of people affected by COVID-19 can serve to reduce misconceptions about the pandemic (Mheidly and Fares 2020).

A challenge in comparing the persuasive (or corrective) efficacy of narrative versus non-narrative messages lies in operationalizing message format in a way that is true to their conceptual definition but that does not also introduce confounds (van Krieken and Sanders 2019). While we carefully attempted to minimize confounds in the present work, there are several limitations. In fact, our efforts to make narrative and non-narrative messages as equivalent as possible on the dimensions of length and featured content may obscure differences on these dimensions that occur naturally. Further, while steps were taken to enhance external validity in the current work, participants in online experiments are not representative of the public at large, and engagement with the materials in such experiments is always somewhat contrived. Specifically, experimental procedures involving corrections are subject to demand characteristics, and participants are incentivized to pay attention to all presented information. Part of stories’ persuasive potential lies in their ability to attract and retain attention, which is particularly important in the modern media environment. Thus, future work examining the effect of message format on debunking efforts in a field context is warranted. Stories that are cocreated with the audience may be useful in addressing misinformation, particularly in contexts characterized by limited access to or engagement with high-quality, fact-oriented information sources. Moreover, approaches that jointly present evidence and narrative elements, such as narrative data visualization (e.g., Dove and Jones 2012), might provide a particularly promising approach for future interventions. What we can conclude from the present study, however, is that the narrative format, in itself, does not generally (i.e., under all conditions) produce an advantage when it comes to misinformation debunking.