Evaluating everyday explanations

Zemla, Jeffrey C.; Sloman, Steven; Bechlivanidis, Christos; Lagnado, David A.

doi:10.3758/s13423-017-1258-z

Brief Report
Published: 08 March 2017

Volume 24, pages 1488–1500, (2017)

Psychonomic Bulletin & Review

Abstract

People frequently rely on explanations provided by others to understand complex phenomena. A fair amount of attention has been devoted to the study of scientific explanation, but less to understanding how people evaluate naturalistic, everyday explanations. Using a corpus of diverse explanations from Reddit’s “Explain Like I’m Five” and other online sources, we assessed how well a variety of explanatory criteria predict judgments of explanation quality. We find that while some criteria previously identified as explanatory virtues do predict explanation quality in naturalistic settings, other criteria, such as simplicity, do not. Notably, we find that people have a preference for complex explanations that invoke more causal mechanisms to explain an effect. We propose that this preference for complexity is driven by a desire to identify enough causes to make the effect seem inevitable.

People are explanatory creatures. We often seek to generate explanations based on our own knowledge of how the world works. However, our ability to generate complete explanations on our own is frequently inadequate. We may not have all of the evidence or the expertise to be able to form accurate models of complex phenomena. So we use the knowledge of experts, friends, and communities to piece together explanations. Our beliefs about science are not limited to intuitive preconceptions, but are also derived from scientists who inform us of how things work. Our beliefs about the economy are affected not only by our own experiences, but also by what economists and politicians tell us about large-scale financial systems. We rely on the explanations of others to form our own beliefs. How, then, do we evaluate the explanations of others?

Explanatory criteria

A common view has emerged that the quality or value of an explanation can be determined by how well it satisfies a set of criteria known as explanatory virtues (Lipton, 2004; Thagard, 1978; Harman, 1965; Mackonis, 2013; Glymour, 2014; Lombrozo, 2011). However, there is disagreement about what counts as an explanatory virtue, how these virtues are defined and measured, and how they are weighted when we evaluate an explanation. Two commonly proposed virtues are simplicity and coherence. For example, a good explanation should be simple, requiring the fewest number of causes to explain a phenomenon (e.g., Lombrozo, 2007). A good explanation should also be coherent; it should be compatible with our existing beliefs, and consistent with the evidence and with itself (e.g., Thagard, 1989).

We may also evaluate an explanation using other criteria, such as the credibility of the explainer, or how well the explanation is articulated, that do not reflect the intrinsic value of an explanation. These criteria are useful in satisfying goals beyond identifying the information inherent to an explanation (Patterson, Operskalski, & Barbey, 2015). For instance, a well-articulated explanation can be useful for pedagogical reasons. The perceived credibility of an explainer may affect whether or not one believes the explanation, regardless of its intrinsic merit. While these criteria do not affect the inherent quality of an explanation, they may still serve important pragmatic functions and can be useful indicators of explanation quality.

Everyday explanations

Philosophers have examined features central to scientific explanation that may improve our understanding. Does an explanation need to appeal to general laws (Hempel, 1965)? Should an explanation aim to unify the widest range of phenomena (Kitcher, 1989)? A common method in philosophical inquiry is to analyze existing scientific explanations: what types of explanations do scientists provide, and what makes them good or bad explanations?

However, many of the criteria used for evaluating explanations in a scientific context may differ from the criteria that are important for explaining everyday events. Explanations in non-scientific domains may require a different set of explanatory criteria because they are structured differently. For example, historical explanations are more likely to appeal to a narrative, and less likely to invoke general laws (Dray, 2000).

Although some philosophical theories suggest that abstract explanations are desirable (e.g., Strevens, 2007), people sometimes prefer explanations that are more concrete (Bechlivanidis, Lagnado, Zemla, & Sloman, 2017) and less generalizable (Khemlani, Sussman, & Oppenheimer, 2011). Despite philosophical claims that explanations should be simple (e.g., Thagard, 1978), people tend to explain inconsistencies by positing additional causes rather than disputing a premise, resulting in a more complex causal structure (Khemlani & Johnson-Laird, 2011; Johnson-Laird, Girotto, & Legrenzi, 2004). Non-scientific explanations may also serve different explanatory goals. For instance, Newtonian mechanics is a source of good explanations for pedagogical and most practical purposes, even though Einstein’s relativistic mechanics provides a more faithful explanation of how the world works.

In contrast to the philosophical literature on explanations, psychologists have tended to study short and simple explanations (e.g., Kelemen & Rosset, 2009; Weisberg, Keil, Goodstein, Rawson, & Gray, 2008; Cimpian & Salomon, 2014). These explanations have minimal causal structure, often only a single causal relation. Some experiments of this type rely on causal inference (e.g., Lombrozo, 2007; Khemlani et al., 2011); they ask participants to identify the cause or causes that best explain the observed effects, often holding constant the probability of an effect given its cause. We intend to test whether results obtained with these paradigms also apply to explanations that are more naturalistic.

Explanations that do consist of multiple causal relations require people to consider additional criteria, such as whether there are gaps in the causal structure (Keil, 2006). For example, it is undoubtedly true that leaves change color in autumn because chlorophyll in the leaves breaks down. However, this explanation omits parts of the causal model, such as why chlorophyll causes leaves to be green, and what causes chlorophyll to break down. People may be sensitive to this omission, leading them to evaluate the explanation negatively even if they agree on the primary cause.

In more natural settings, we sometimes construct complex causal explanations in order to explain many pieces of evidence. Pennington and Hastie (1986, 1988) found that people explain complex events by constructing stories around the evidence, and that these stories can differ depending on the order that evidence is presented. These stories can be evaluated by how well they cohere with the available evidence (Byrne, 1995) using a set of coherence principles (Thagard, 1989). It is generally taken for granted that these principles are desirable, and subsequent work has provided some empirical support for these principles (Read & Marcus-Newhall, 1993; Schank & Ranney, 1992).

We should also consider how an explanation fits with our broader knowledge of the world. When evaluating a single explanation, we should consider possible alternative explanations (Fernbach, Darlow, & Sloman, 2010) and counterfactuals (Woodward & Hitchcock, 2003). When explanations provide evidence in support of a causal mechanism (Sloman, 2005), that evidence should be evaluated independently to determine whether it is credible and relevant (Kuhn, 1991).

Real-world explanations are typically more nuanced than experimental stimuli, and thus provide a more ecologically valid way of understanding the explanatory criteria people use to evaluate explanations. Experimental stimuli used to test explanatory criteria are often focused on a narrow subset of explanation types—for instance, explanations that explain token events or explanations that explain classes of events (types), but not both. Though many explanatory criteria have been established for evaluating scientific explanations, we test whether those same criteria are seen as virtues in everyday contexts. In addition, evaluating explanations can require us to engage in a number of processes simultaneously, including dialectical reasoning (resolving inconsistencies), probabilistic reasoning (finding the most likely causes, or the causes that make the effect most likely), and didactic methods (educating the reader). We observe whether previously touted explanatory virtues endure in the face of these multiple goals.

Experiment 1

To investigate how people evaluate everyday explanations, we compiled a small corpus of explanations that were generated in a non-scientific and non-experimental context. Specifically, we gathered explanations from Reddit’s Explain Like I’m Five (ELI5; www.reddit.com/r/explainlikeimfive), an Internet community that receives roughly 7 million unique visitors per month. The explanations in our corpus were rated by participants on a host of explanatory criteria that have been proposed in prior literature.

Method

Participants

Two hundred and forty participants located in the United States were recruited using Amazon’s Mechanical Turk (Paolacci, Chandler, & Ipeirotis, 2010). Five participants were removed from the data set prior to analyses for failing an attention check question^{Footnote 1} (Oppenheimer, Meyvis, & Davidenko, 2009). Of the remaining 235 participants, 131 were male and 104 were female, aged 18–69 years (median age of 34 years).

Materials

Eight explananda^{Footnote 2} (see Table 1) were selected from ELI5 with three explanations for each, for a total of 24 explanations. The explananda were selected to fit into one of four categories: historical, public health, legal, and social policy. These categories were chosen to contrast with scientific explanation, and also reflect topics of interest to the general public. By selecting explanations from several categories, we sought to identify whether explanatory criteria are domain-general rather than apply only in certain domains. The explanations also varied in style, including a mixture of token and type explanations, as well as teleological and mechanistic explanations. We selected explananda from ELI5 that had a high level of engagement (i.e., many unique explanations and many “votes” from the site’s users). For each explanandum, we chose three different explanations that proposed distinct mechanisms or offered different evidence in support of a given mechanism. In addition, the specific explanations were chosen because they varied prima facie on several explanatory criteria, such as appeals to expertise and evidence, complexity, and generality. An example explanation is shown in Table 2, and all of the explanations are provided in the Supplementary Material.

Table 1 Explananda used in Experiment 1

Table 2 An example explanation used in Experiment 1 to explain “If Ebola is so difficult to transmit (direct contact with bodily fluids), how do trained medical professionals with modern safety equipment contract the disease?”

Procedure

Each participant was shown one explanandum with one corresponding explanation. After reading the explanation in full, participants assessed the quality of the explanation by rating whether the text constitutes a “good explanation.” Afterwards, participants rated the remainder of the attributes (see Table 3) in a randomized order. To prevent participants from referring to their previous ratings, each attribute was rated on its own page, with two exceptions. Generality was rated on the same page as principle consensus because the latter question refers to the former. Evidence credibility and evidence relevance were rated last and on the same page, after participants were asked to highlight any evidence in the explanation. All attributes were rated using a 7-point Likert scale ranging from Strongly Disagree to Strongly Agree.

Table 3 List of attributes rated in Experiment 1

Results

Overview

We first examined the relation between explanation quality and each attribute without controlling for the other attributes. A mean score was computed for each attribute for each of the 24 explanations. Partial correlations were computed using a mixed-effects model for each attribute, in each case treating quality as the dependent variable, one of the 20 remaining attributes as a fixed effect, and explanandum as a random effect. A partial correlation was used in place of a simple Pearson correlation because the 24 data points are not truly independent (there are three explanations for each explanandum). All subsequent correlations reported for Experiment 1 reflect a partial R after controlling for explanandum as a random effect.

Of the 20 attributes, 14 significantly predicted explanation quality, as shown in Table 4. The six that did not were: the desired complexity of an explanation, whether the evidence was credible (evidence credibility), whether the evidence was relevant (evidence relevance), whether the explanation referred to an expert, whether the participant had a lot of prior knowledge in the domain, and whether the explanandum required an explanation (requires explanation). To aid in interpretation, we also corrected for multiple comparisons using a full Bonferroni correction, though it is likely that this correction is overly conservative: all tests were planned a priori, and the tested hypotheses are often complementary rather than orthogonal. Nonetheless, six attributes survived the multiple comparison correction, suggesting that their relation with explanation quality may be particularly strong: whether the explanation had a number of possible alternatives, the articulation of the explanation, whether there were gaps in the explanation (incompleteness), whether the parts of the explanation fit together (internal coherence), whether the explanation was regarded as true (perceived truth), and whether most people agree with the general rule provided in the explanation (principle consensus).
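As an illustration of the correction described above, the following minimal sketch applies a full Bonferroni adjustment: each raw p-value is multiplied by the number of tests and capped at 1. The function name and the p-values are hypothetical and are not taken from Table 4.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values for 20 planned tests (NOT the paper's data).
raw = [0.001, 0.002, 0.03, 0.2] + [0.5] * 16
adjusted = bonferroni_adjust(raw)

# Tests that remain below the conventional .05 threshold after adjustment.
survivors = [p for p in adjusted if p < 0.05]
print(len(survivors), "of", len(raw), "tests survive the correction")
```

With 20 tests, a raw p-value must fall below .0025 to survive, which is why a result such as p = .03 is significant before correction but not after.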

Table 4 Means and SDs for each attribute as well as partial correlations between each attribute and explanation quality, controlling for explanandum as a random effect. Adjusted p-values are computed using a full Bonferroni correction for multiple comparisons

Though many of the attributes were able to predict explanation quality, we also observed substantial covariance between the attributes. The attribute correlation matrix (Fig. 1) depicts the magnitude of the correlation between all attributes pairwise, including explanation quality. It is likely that many of these attributes are not independent predictors of explanation quality, but instead reflect a smaller number of latent factors. For further discussion of how these attributes group together, see the Supplementary Material.

Expertise

When evaluating an explanation, it can be helpful to assess the credibility of the explainer. Our knowledge about an individual can be used to predict what else that person is likely to know (Keil, Stein, Webb, Billings, & Rozenblit, 2008), which could play a role in judging whether the premises of an explanation are true. However, it is not always clear what cues are used to judge expertise. For example, while we might expect experts to use more technical language, using long words needlessly can make an author appear less intelligent (Oppenheimer, 2006). Similarly, scientific jargon does not always affect ratings of explanation quality (Weisberg, Taylor, & Hopkins, 2015; though see Eriksson, 2012).

Additionally, classifying the explainer as an expert can make the explanation more credible, but a good explanation does not have to be constructed by an expert. If two explanations are identical except for their source, they are presumably equivalent in their explanatory power even if they are not assigned the same degree of belief.

We assessed expertise using two dependent measures: whether the explanation referred to an expert (expert), and whether the participant believed the explanation was written by an expert (perceived expertise). The two factors are not identical, though they are related, R = .61, p = .002. An explanation can refer to an expert by self-identifying the explainer as an expert, or by citing an authoritative source. In contrast, someone may judge an explanation to be written by an expert through the quality of the language and level of technical sophistication. Both factors positively predict explanation quality; however, perceived expertise is a stronger predictor (see Table 4). One possibility is that identifying an expert primarily serves to increase the perceived expertise of the explainer. A mediation model lends support to this hypothesis: although both expert and perceived expertise are positive predictors of quality (and each other), only perceived expertise is a significant predictor of quality when using multiple regression (Fig. 2). Sobel’s test (Sobel, 1982) offers marginal support for the claim that perceived expertise mediates the relation between expert and quality, z(24) = 1.87, p = .06.
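For readers unfamiliar with Sobel's test, the sketch below shows the standard first-order form of the statistic. Here a is the coefficient for the predictor's effect on the mediator, b is the mediator's effect on the outcome controlling for the predictor, and se_a and se_b are their standard errors. The function names and the coefficient values are hypothetical; the paper does not report the underlying coefficients, so this is not a reproduction of its analysis.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def sobel_test(a, se_a, b, se_b):
    """First-order Sobel test: z = a*b / sqrt(b^2 * se_a^2 + a^2 * se_b^2)."""
    se_ab = math.sqrt(b ** 2 * se_a ** 2 + a ** 2 * se_b ** 2)
    z = (a * b) / se_ab
    return z, two_sided_p(z)


# Hypothetical path coefficients and standard errors (NOT the paper's values).
z, p = sobel_test(a=0.5, se_a=0.1, b=0.4, se_b=0.2)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Under the same normal approximation, a z of 1.87 corresponds to a two-sided p of roughly .06, which matches the value reported above.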

Coherence

One of the most often cited explanatory virtues is coherence. Despite having received much attention in the literature, the term has been defined in several different and contradictory ways. While some authors use coherence to refer to consistency with prior knowledge and beliefs (Murphy & Medin, 1985; Mackonis, 2013), other authors use it to refer to whether the components of an explanation are compatible or complement each other (Thagard, 1989; Bovens & Olsson, 2000; Keil, 2006).

We distinguish between internal and external coherence. External coherence refers to how much of the explanation overlaps or “fits” with what the reader already knows. Internal coherence refers to “how well the parts of the explanation fit together.” We found that internal coherence is nearly twice as predictive as external coherence (R_int = .82, R_ext = .47, see Table 4). Using multiple regression, after accounting for internal coherence, external coherence did not significantly correlate with quality judgments (R_int = .73, p_int < .001, R_ext = .21, p_ext = .33). Previous research has suggested that people may not spontaneously generate or consider possible alternatives when evaluating an explanation (Hirt & Markman, 1995). This failure to take an outside view when reasoning (Sloman & Lagnado, 2015) may explain why internal coherence takes precedence over fit with background knowledge.

Articulation

Despite providing no epistemic value, the articulation of an explanation was a strong predictor of perceived explanation quality (R = .79). We examined several linguistic markers to determine if surface features could explain perceived articulation and, by extension, predict explanation quality.

Articulation was correlated with a multitude of surface features, such as the number of words in an explanation (R = .64, p = .002), the median word frequency^{Footnote 3} in an explanation (R = –.54, p = .02), and the average word length (R = .45, p = .056). Perceived articulation also correlated with two related well-known readability metrics (Flesch, 1948; Kincaid, Fishburne, Rogers, & Chissom, 1975), Flesch-Reading Ease (R = –.54, p = .018), and Flesch-Kincaid Grade Level (R = .63, p = .003). Additionally, the proportion of nouns in an explanation predicted articulation (R = .54, p = .016).
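The two readability metrics mentioned above are simple functions of word, sentence, and syllable counts. The sketch below uses the published formulas; the function names and the counts are hypothetical, and counting syllables in real text requires a dictionary or heuristic that is not shown here.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: higher scores indicate harder text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one explanation (NOT taken from the corpus).
ease = flesch_reading_ease(words=100, sentences=5, syllables=150)
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(round(ease, 3), round(grade, 2))  # 59.635 9.91
```

Because the two metrics run in opposite directions, longer sentences and longer words lower Reading Ease but raise Grade Level. This is consistent with the opposite signs of the two correlations with articulation reported above.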

Oddly, none of these metrics were significantly correlated with judgments of explanation quality (all R < .31, all p > .17), with the exception of word count (R = .60, p = .003). This finding is peculiar, given that articulation was highly correlated with explanation quality. As such, it is not entirely clear whether explanations are rated highly because they are articulate, or whether this correlation is the result of a third variable. For instance, an intelligent person might be skilled at both writing and explaining (identifying the causal structure), even if one does not directly impact the other.

Simplicity

A guiding principle in explanatory reasoning is that of Occam’s Razor: All things being equal, the simplest hypothesis should be preferred. Thus, we initially predicted a negative correlation between subjective complexity and explanation quality. Surprisingly, we observed a positive correlation, with explanations that were rated as more complex also rated as better explanations (R = .49, p = .03).

To further investigate this relationship, we examined other measures of complexity. Explanations may be deemed complex for many reasons, and it is not immediately clear what aspect of complexity our subjective measure is capturing. One possibility is that an explanation may be complex because it appeals to a large number of mechanisms. That is, the explanation suggests the explanandum occurred as a result of many causal pathways. Alternatively, an explanation may be complex because it is very detailed. An explainer may go into great detail about even a single mechanism. We test both of these hypotheses.

Causal pathways

One reason an explanation may be judged complex is that its underlying causal structure appeals to a large number of mechanisms. The four authors jointly identified the causal model for each of the 24 explanations in our corpus (see Fig. 3 for an example; all of the causal models are provided in the Supplementary Material) by identifying causal language in the explanation (Sloman, 2005). Each node in a causal model represents a cause or effect, or both. Node labels were used as shorthand to represent the underlying cause or effect. A directed link from node A to node B indicates that A is a cause of B, though not necessarily a sufficient cause. Non-causal information, such as the credibility of the speaker or flowery language, was not included in the causal model. Simple facts that are not causally related to the rest of the explanation were also excluded. Specific anecdotes and evidence used in the explanation were represented in the causal model by virtue of the fact that causal relations were distilled from more concrete examples. The causal models were constructed to be acyclic, consistent with Bayesian graphical models used elsewhere (e.g., Sloman, 2005). Though this process is somewhat subjective, we converged on a single causal model for each explanation.

We estimated the number of causal mechanisms in an explanation by counting the number of root causes in the model (nodes without a parent node and connected by some pathway to the explanandum). This measure is consistent with previous research that suggests the number of unexplained causes, rather than the absolute number of causes, has an impact on explanation judgments (Lombrozo & Vasilyeva, 2017). As predicted, the number of root causes significantly predicts explanation quality, R = .64, p = .005. A reasonable objection might be that explanations that appeal to more causes also explain more effects. However, the correlation remains significant even when we control for the number of final effect nodes in the model and a subjective measure of how much the explanation explains (scope), R = .63, p = .015.
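The root-cause count described above can be sketched as a small graph computation. The example assumes the causal model is stored as a list of (cause, effect) pairs; that representation, the function name, and the node labels are illustrative assumptions, since the authors identified the models by hand.

```python
def root_causes(edges, explanandum):
    """Return nodes with no parent that reach the explanandum by some path.

    `edges` is an iterable of (cause, effect) pairs from an acyclic causal model.
    """
    children = {}
    has_parent = set()
    nodes = set()
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)
        has_parent.add(effect)
        nodes.update((cause, effect))

    def reaches(node, seen):
        # Depth-first search for a directed path from node to the explanandum.
        if node == explanandum:
            return True
        seen.add(node)
        return any(
            reaches(child, seen)
            for child in children.get(node, [])
            if child not in seen
        )

    return sorted(
        n for n in nodes
        if n != explanandum and n not in has_parent and reaches(n, set())
    )


# Hypothetical causal model: A and B cause C, C and D cause the explanandum E.
# X -> Y is causal language that never connects to E, so X is not counted.
example_edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "E"), ("X", "Y")]
roots = root_causes(example_edges, "E")
print(roots, len(roots))  # ['A', 'B', 'D'] 3
```

Note that C is not counted even though it is a cause of E, because it is itself explained by A and B. This reflects the emphasis on unexplained causes rather than the absolute number of causes.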

Explanation length

Another reason an explanation may be complex is that it contains a lot of details. One way to operationalize this is to simply count the number of words in an explanation. Those explanations that use more words to describe the causal system can be seen as more detailed. Indeed, we found that as the length of an explanation increased, so did its perceived quality, R = .60, p = .004. Furthermore, it appears that the number of causal mechanisms and explanation length are independent predictors of explanation quality, as both are significant predictors in a multiple regression, R_words = .52 (p = .02), R_{root_nodes} = .51 (p = .03). One caveat is that explanation length is also correlated with articulation, as reported earlier. When subjective articulation was included in the regression analysis, explanation length no longer predicted perceived quality, R_words = .26 (p = .27), R_{root_nodes} = .43 (p = .08), R_articulation = .44 (p = .04), indicating shared variance among the three attributes.

Complexity and expertise

Complexity may also have indirect effects on ratings of explanation quality. For example, it is possible that a complex explanation may make the explainer seem knowledgeable, which in turn increases the quality of an explanation.

Explanation quality is significantly correlated with both perceived expertise and judgments of complexity (see Table 4). In addition, subjective complexity is strongly correlated with perceived expertise, R = .55, p = .008. We tested the hypothesis that perceived expertise mediates the relationship between complexity and explanation quality. However, Sobel’s test for mediation (Fig. 4) does not reach significance (z = 1.7, p = .088).

We also conducted a multiple regression analysis to see if perceived expertise could explain variance in explanation quality ratings independent of other measures of complexity (subjective complexity, explanation length, and number of root causes). Although subjective complexity is no longer significant in this analysis (see Table 5), the other predictors remain significant. This finding suggests that although all of the factors in Table 5 reflect measures of complexity, these factors are not interchangeable.

Table 5 Using multiple regression analysis, explanation length, number of root causes, and perceived expertise each significantly predict explanation quality ratings

Incompleteness

We expected that explanations containing gaps in the proposed causal mechanisms would be rated lower than explanations that did not contain any gaps. That is, if an explanation suggests that A causes B, but it is not immediately clear how A causes B, participants will be sensitive to this omission. In support of this, we found that ratings of incompleteness (whether “there are gaps in the explanation”) significantly correlated with explanation quality (R = –.65, p < .001).

We explored this further by examining the average path length in each of the causal models, measuring the average number of steps from a root cause to the explanandum. Pathways that contain more steps are likely to contain fewer gaps, and could be rated higher. However, this was not the case, R = .08, p = .74.
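One way to compute an average path length of this kind is sketched below. The text does not say how multiple paths from the same root cause were handled, so the sketch assumes that every distinct directed path from a root cause to the explanandum is enumerated and that the number of links is averaged across all of them. The function names and the example model are hypothetical.

```python
def path_lengths(children, node, target):
    """Lengths (in links) of all directed paths from node to target in a DAG."""
    if node == target:
        return [0]
    lengths = []
    for child in children.get(node, []):
        lengths.extend(1 + n for n in path_lengths(children, child, target))
    return lengths


def average_path_length(edges, explanandum):
    """Mean number of steps from root causes to the explanandum.

    Assumption: all distinct root-to-explanandum paths are weighted equally.
    """
    children, has_parent = {}, set()
    for cause, effect in edges:
        children.setdefault(cause, []).append(effect)
        has_parent.add(effect)
    roots = [n for n in children if n not in has_parent]
    lengths = [n for r in roots for n in path_lengths(children, r, explanandum)]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical causal model: A -> C -> E, B -> C -> E, and D -> E.
example_edges = [("A", "C"), ("B", "C"), ("C", "E"), ("D", "E")]
avg = average_path_length(example_edges, "E")
print(round(avg, 3))  # 1.667
```

Other conventions, such as using only the shortest path from each root cause, would give different values for models with branching paths.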

Discussion

These findings suggest that the explanatory criteria used to evaluate everyday explanations may differ from those previously identified. The biggest departure from existing theories is the finding that people prefer complex explanations, specifically explanations that appeal to multiple causal mechanisms (though see Ahn & Bailenson, 1996). One limitation of the study, however, is its reliance on correlational analyses. In addition, by using naturalistic explanations that were not modified extensively, the explanations vary in many respects other than complexity. To address these concerns, we conducted an additional experiment that manipulated the number of mechanisms present in each explanation.

Experiment 2

We conducted a follow-up study using controlled stimuli to examine whether explanations with multiple independent causal pathways are preferred. We expected that people would prefer explanations that appeal to multiple causal mechanisms, even when a single mechanism is sufficient.