How good is an explanation?

How good is an explanation and when is one explanation better than another? In this paper, I address these questions by exploring probabilistic measures of explanatory power in order to defend a particular Bayesian account of explanatory goodness. Critical to this discussion is a distinction between weak and strong measures of explanatory power due to Good (Br J Philos Sci 19:123–143, 1968). In particular, I argue that if one is interested in the overall goodness of an explanation, an appropriate balance needs to be struck between the weak explanatory power and the complexity of a hypothesis. In light of this, I provide a new defence of a strong measure proposed by Good by providing new derivations of it, comparing it with other measures and exploring its connection with information, confirmation and explanatory virtues. Furthermore, Good really presented a family of strong measures, whereas I draw on a complexity criterion that favours a specific measure and hence provides a more precise way to quantify explanatory goodness.


Introduction
It would be difficult to overstate the interest among philosophers of science in the topic of explanation. Much of this has focussed on the nature of explanation (Woodward, 2017), with modern discussions stemming from the deductive-nomological model of Hempel and Oppenheim (1948) and going on to consider other models such as statistical relevance (Salmon, 1971), unification (Friedman, 1974; Kitcher, 1989), causal-mechanical (Salmon, 1984), causal (Woodward, 2003) and pragmatic accounts (van Fraassen, 1980). Although it has not received as much attention, there has also been interest in quantifying or comparing explanations using probability (Popper, 1959; Good, 1960, 1968), including a number of recent proposals (Schupbach and Sprenger, 2011; Crupi and Tentori, 2012; Glass, 2007, 2021).
It might be thought that an answer to the question 'what is an explanation?' would be needed before attempting to answer questions such as 'how good is an explanation?' or 'when is one explanation better than another?', but that does not seem to be the case. Discussions about the nature of explanation typically involve pre-theoretical intuitions about explanation and often these extend to intuitions about the goodness of explanations, particularly comparative judgments. One can hold that quantum mechanics provides a very good explanation of blackbody radiation or that thermodynamics provides a better explanation of heat than the caloric theory without settling the question of what exactly constitutes an explanation. Similarly, in the area of medical diagnosis, one might try to formalize comparative judgments about which condition best explains the symptoms without taking a view on the metaphysics of explanation.
Explanatory goodness is clearly very important in the context of inference to the best explanation (IBE) (Lipton, 2004; Douven, 2017). While defenders of IBE need not be committed to a particular view on the nature of explanation, they nevertheless need to be able to give an account of the goodness of an explanation and more specifically how to compare explanations. In this context, discussion of explanatory virtues is important, with explanations that do better according to a range of virtues being judged better (Mackonis, 2013). Of particular relevance here are approaches to IBE that seek to evaluate explanations using probability theory (Douven, 1999, 2013; Schupbach, 2018; Glass, 2012, 2021), though one might expect that such approaches should also capture at least some of the explanatory virtues. However, questions about explanatory goodness and comparative judgments, which arise both within science and outside it, are still important irrespective of whether one is committed to IBE as a legitimate mode of scientific inference. In this paper, I will focus on measures of explanatory power to address the central question. A variety of probabilistic measures of explanatory power in this sense have been proposed in the literature (Good, 1968; Schupbach and Sprenger, 2011; Schupbach, 2011, 2018; Crupi and Tentori, 2012). Arguably, these measures can be seen as attempts to quantify how well a hypothesis would explain a given explanandum if the hypothesis were true. According to a distinction due to Good (1968), which is central to the current paper, these are measures of weak explanatory power whereas strong measures take into account not only how well the hypothesis would account for the explanandum if it were true, but also how likely it is to be true in the first place.

Measures of explanatory power
One way to approach the question 'how good is an explanation?' and the related question 'when is one explanation better than another?' would be to appeal to various probabilistic measures of explanatory power. At the outset, it is worth noting that these measures are not intended to define what constitutes an explanation, but only to measure the explanatory power of a hypothesis that has been determined on other grounds to provide an explanation. With that in mind, considering an explanandum e and an explanans or explanatory hypothesis h, these measures quantify, in a sense to be discussed, the extent to which h explains e. For example, after identifying seven adequacy conditions to quantify explanatory power, Schupbach and Sprenger (2011) show that the only measure satisfying their conditions is:
$$E_1(e, h) = \frac{P(h|e) - P(h|\neg e)}{P(h|e) + P(h|\neg e)} \tag{1}$$
Here and elsewhere, P represents a probability function, which is assumed to be regular (for any contingent proposition q, 0 < P(q) < 1), e represents the explanandum and h a hypothesis that provides at least a potential explanation of e. Probabilities are taken to represent degrees of belief relative to background knowledge, which is omitted in the notation for convenience. Hence, the current approach should be thought of in Bayesian terms. Crupi and Tentori (2012) present an axiomatic representation for measures ordinally equivalent to E 1 and then, after offering criticisms of some aspects of Schupbach and Sprenger's approach, they present an alternative axiomatization for measures ordinally equivalent to their preferred measure:
$$E_2(e, h) = \begin{cases} \dfrac{P(e|h) - P(e)}{1 - P(e)} & \text{if } P(e|h) \geq P(e) \\[2ex] \dfrac{P(e|h) - P(e)}{P(e)} & \text{if } P(e|h) < P(e) \end{cases} \tag{2}$$
Cohen (2016) draws attention to another measure that had been proposed by Good (1960), who provided an axiomatization for it, and also discussed by McGrew (2003):
$$E_3(e, h) = \log\left[\frac{P(e|h)}{P(e)}\right] \tag{3}$$
Cohen shows how measures ordinally equivalent to E 3 could also be given a much simpler axiomatic representation by drawing on a result from .
Another measure of explanatory power was proposed by Popper (1959) as follows:
$$E_4(e, h) = \frac{P(e|h) - P(e)}{P(e|h) + P(e)} \tag{4}$$
though he also considered E 3 to provide an adequate definition of explanatory power. In fact, E 4 is ordinally equivalent to E 3 so both of these measures produce the same comparative explanatory judgments.
What exactly are these measures intended to quantify? According to Schupbach and Sprenger, the conception of explanatory power they have in mind is that of a 'hypothesis's ability to decrease the degree to which we find the explanandum surprising' (2011, p. 108) and similarly Crupi and Tentori claim that their account captures 'how the background surprisingness/expectedness of explanandum e is reduced by assuming candidate explanans h' (2012, p. 375). Plausibly all these measures can be understood as attempts to capture explanatory power in the sense of 'h reducing surprise in e' or perhaps 'h increasing expectedness of e' (though see Sect. 4.3). How surprising e is will differ from one case to another, but the key factor is the reduction of surprise. We can think of this by comparing P(e|h), the probability of e given h, with P(e), the probability of e given only background knowledge. A low value of P(e) would represent a case where e is surprising and the lower the value of P(e) the greater the extent to which e is surprising. If P(e|h) is greater than P(e), this would represent the situation where h reduces the surprise in e and the greater P(e|h) the greater the reduction. Hence, if one is comparing two hypotheses for a given explanandum, it is the one that increases its probability most that has greater explanatory power (see Sect. 2.3). More importantly, what these measures have in common is that they attempt to quantify how well h would explain e (in the sense just noted) if h were true. That is, they are intended to capture something about the relationship between h and e under the assumption that h is true.
In Good's (1968) terminology, they are all measures of weak explanatory power. I will return to his distinction between weak and strong explanatory power in Sect. 3.
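To make the comparison concrete, the following Python sketch computes the four weak measures from P(e|h), P(h) and P(e). It assumes the forms in which these measures are usually stated in the cited literature (given in the code comments); the function name `weak_measures` and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math

def weak_measures(p_e_given_h: float, p_h: float, p_e: float) -> dict:
    """Return E1-E4 for a hypothesis h and explanandum e.

    Assumed forms (as usually stated in the literature):
      E1 = [P(h|e) - P(h|not-e)] / [P(h|e) + P(h|not-e)]
      E2 = [P(e|h) - P(e)] / [1 - P(e)]  if P(e|h) >= P(e)
           [P(e|h) - P(e)] / P(e)        otherwise
      E3 = log[P(e|h) / P(e)]
      E4 = [P(e|h) - P(e)] / [P(e|h) + P(e)]
    Inputs are assumed to come from a regular probability function.
    """
    p_h_given_e = p_e_given_h * p_h / p_e                  # Bayes' theorem
    p_h_given_not_e = (1 - p_e_given_h) * p_h / (1 - p_e)  # P(h|not-e)
    e1 = (p_h_given_e - p_h_given_not_e) / (p_h_given_e + p_h_given_not_e)
    if p_e_given_h >= p_e:
        e2 = (p_e_given_h - p_e) / (1 - p_e)
    else:
        e2 = (p_e_given_h - p_e) / p_e
    e3 = math.log(p_e_given_h / p_e)
    e4 = (p_e_given_h - p_e) / (p_e_given_h + p_e)
    return {"E1": e1, "E2": e2, "E3": e3, "E4": e4}

# Hypothetical inputs: one explanandum with P(e) = 0.2 and three hypotheses
# with the same prior but different likelihoods.
P_E = 0.2
high_likelihood = weak_measures(0.9, 0.1, P_E)
low_likelihood = weak_measures(0.5, 0.1, P_E)
independent = weak_measures(P_E, 0.1, P_E)  # h is irrelevant to e

for name in ("E1", "E2", "E3", "E4"):
    print(name, high_likelihood[name], low_likelihood[name], independent[name])
```

On these inputs all four measures rank the hypothesis with the higher likelihood first and assign (up to rounding) a neutral value of zero to the independent hypothesis, which is the sense in which they track reduction of surprise.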
Before considering the suitability of these measures, it is worth commenting on some concerns about the general approach. If measures of explanatory power are concerned with reduction of surprise in the sense noted above, the problem of old evidence that is posed for Bayesianism is relevant (Glymour, 1980). Essentially, the problem is that if e is old evidence it is included in background knowledge so that P(e) = 1, which also means that P(e|h) = 1 and hence P(h|e) = P(h), so that e cannot confirm h. Some Bayesians have appealed to variants of Garber's (1983) approach, which sought to show that what confirms h is not e but the discovery that h entails e. However, this strategy does not help in the current context since if it is accepted that P(e|h) = P(e) = 1, then there can be no reduction in surprise. Alternatively, in the counterfactual approach to the problem, the idea is to suppose that e is not known to be true, removing it from background knowledge, and then to consider the impact that learning e would have on h. This approach was defended by Howson (1991), but he later rejected it because of difficulties involved in extracting e from background knowledge and inconsistency with a subjective Bayesian approach. Instead, he argued that 'a minimalist version of Objective Bayesianism does straightforwardly solve the problem' (Howson, 2017) and based his approach on earlier work by Rosenkrantz (1983). An important aspect of this approach is that a probability less than one can be legitimately assigned to e in the case of old evidence. If a solution along these lines is viable or if the counterfactual approach can be defended against criticisms, this would undermine the possible concern about the general approach adopted here. Another concern is that it might reasonably be doubted whether explanation could be fully analyzed in probabilistic terms.
In response, it can be noted that these measures are not intended to define what constitutes an explanation, but only to measure the explanatory power of a hypothesis that has been determined on other grounds to provide an explanation. However, further objections might relate specifically to using probability to measure explanatory goodness. For example, a number of philosophers have argued that explanation is intimately tied to understanding (see, for example, Friedman, 1974;Kitcher, 1989) and, if correct, it might seem questionable whether this could be fully analyzed probabilistically. While this can be acknowledged as a potential limitation, the proposed strategy is to explore the probabilistic approach to see how far it goes. The measures of explanatory power discussed above have had some success in this regard and the hope is to extend that success further. Arguably, the suggestion that the explanatory measures described above capture 'reduction in surprise' and the argument that the account proposed here does justice to a number of explanatory virtues (see Sect. 4.5) might go some way to addressing this concern.
A further concern is that in the context of probabilistic explanation some have argued that low probabilities explain just as well as high probabilities (see Jeffrey, 1969; Salmon, 1971; Railton, 1981), a viewpoint known as egalitarianism. Yet according to all the measures discussed above, explanatory power is greater for a hypothesis that confers a higher probability on a given explanandum (see Sect. 2.3). This is consistent with 'moderate elitism', a view defended by Strevens (2000, 2014), which does not deny that low probability events can be explained, but maintains that conferring a high probability is better. Consider, for example, a polarizer oriented at angle θ to the vertical. According to quantum theory, there is a probability of cos²(θ) that an incoming, vertically polarized photon will be transmitted. On an egalitarian view, the transmission of the photon is equally well explained irrespective of whether θ is small or large, and hence the probability of transmission large or small (though not zero) respectively. A motivation for the egalitarian view is that in the low probability scenario, there are no further relevant factors that could be cited. However, there also seems to be a clear motivation for saying that the transmission is better explained by small θ (and hence high probability). Suppose we know that θ was either small (oriented very close to the vertical) or large (very close to the horizontal) and that both hypotheses are equally plausible in light of background knowledge. The transmission of the vertically polarized photon would be surprising given large θ, but much less surprising given small θ. Furthermore, the reduction of surprise would be greater given small θ if multiple vertically polarized photons were all transmitted.
Hence, thinking about explanation in terms of reduction of surprise (as well as in the context of IBE, see introduction) gives some reason for thinking that the small θ hypothesis provides a better explanation in this case. A detailed discussion of these issues is beyond the scope of this paper, but these points suggest that in at least some cases there is justification for pursuing the current approach.
Could these measures be used to make judgments about explanatory goodness? According to Schupbach and Sprenger (2011), their goal is to propose a measure of explanatory power that 'would clarify the conditions under which hypotheses are judged to provide strong versus weak explanations' (p. 106). They further claim that an appropriate analysis of explanatory power 'would also clarify the meaning of comparative explanatory judgments such as "hypothesis A provides a better explanation of this fact than does hypothesis B"' (p. 106). However, they also point out that they 'take no position on whether our analysis captures the notion of explanatory power generally; it is consistent with our account that there be other concepts that go by this name but which do not fit our measure' (p. 106). According to Schupbach (private communication), their measure E 1 is appropriate for making judgments about explanatory goodness in some cases, such as those where priors are not accessible or whenever agents have knowingly ungrounded subjective priors, but in other cases judgments of explanatory goodness may require other factors to be taken into account. In particular, they may require a trade-off between explanatory power in their sense, which corresponds to Good's notion of weak explanatory power, and the improbability of the hypothesis since a hypothesis with high explanatory power might not rank so well in terms of overall explanatory goodness if it has a low prior probability.
Following Good (1968), I will assume that the probability and complexity of a hypothesis are inversely related, so the more improbable a hypothesis, the greater its complexity. Achieving an appropriate trade-off between weak explanatory power and improbability/complexity is the focus of the current paper.
In the rest of this section, I will highlight the need for such a trade-off in a wide range of cases and to that end I will focus on what the four measures identified so far have in common.

Entailment
E 1 and E 2 are maximal in cases where h entails e, while E 3 and E 4 take on their greatest values for a given e in cases where h entails e. Although this is appropriate for the specific concept of explanatory power these measures attempt to explicate (reduction of surprise), it seems to be a distinct weakness if one is trying to evaluate the overall goodness of an explanation or to compare explanations with each other. We can often distinguish between how well two hypotheses explain the evidence in cases where both of them entail the evidence. For example, explanationists typically cite simplicity as an explanatory virtue that could discriminate in such cases. If a conspiracy theory is deliberately constructed in such a way that if it were true, it would entail the explanandum in question, it would still be reasonable to think that it is a very poor explanation if it is very unlikely to be true in the first place.

Equal likelihoods and irrelevant conjunction
Closely related to the case of entailment, it turns out that all four measures satisfy the following condition for a given e: E(e, h 1 ) = E(e, h 2 ) if and only if P(e|h 1 ) = P(e|h 2 ). In fact, this condition is enshrined in the principle of positive relevance, which is used in the axiomatization of these measures (Cohen, 2016): Positive relevance. E(e, h 1 ) > E(e, h 2 ) if and only if P(e|h 1 ) > P(e|h 2 ). An application of positive relevance gives rise to another important feature of all four measures known as irrelevant conjunction. It says that conjoining an irrelevant hypothesis, h 2 , to a given hypothesis, h 1 , has no effect on h 1 's (weak) explanatory power: Irrelevant conjunction. If h 2 is probabilistically independent of e, h 1 and their conjunction, then E(e, h 1 ∧ h 2 ) = E(e, h 1 ). It is easy to see that this follows from positive relevance since P(e|h 1 ∧ h 2 ) = P(e|h 1 ) when h 2 is probabilistically independent of e and h 1 , and hence given positive relevance that E(e, h 1 ∧ h 2 ) = E(e, h 1 ). Schupbach and Sprenger argue for this condition on the grounds that 'h 1 ∧ h 2 will not make e any more or less surprising than h 1 by itself already does' (2011, p. 110) and hence has no effect on explanatory power. Crupi and Tentori agree, noting that 'it does not alter the degree to which e is explained' (2012, p. 367). While this is appropriate for measures of explanatory power that seek to explicate how well a hypothesis would explain the explanandum if the hypothesis were true, it seems clear that adding an irrelevant hypothesis results in a worse explanation overall. Why? Because once again considerations of simplicity and plausibility come into play. Let e be a description of the bending of light from a distant source by the sun and h 1 an explanation of this by Einstein's theory of general relativity. Let h 2 be the hypothesis that I have an identical twin elsewhere in the universe.
All four measures judge that Einstein's theory explains the bending of light to the same extent that the conjunction of Einstein's theory and the hypothesis about my identical twin explains it. A plausible measure of explanatory goodness should show that this conjunction provides a worse explanation.
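The point about irrelevant conjunction can be checked numerically. The sketch below builds a small joint distribution in which h2 is independent of e, h1 and their conjunction, and compares the hypothesis h1 with the conjunction h1 ∧ h2. It uses the log-ratio form log[P(e|h)/P(e)] as a representative weak measure, which is an assumption about the form of Good's measure, and every probability value and helper name (`joint`, `prob`, `log_ratio_power`) is hypothetical.

```python
import math
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_H1 = 0.3               # prior probability of h1
P_E_GIVEN_H1 = 0.95      # likelihood of e if h1 is true
P_E_GIVEN_NOT_H1 = 0.05  # likelihood of e if h1 is false
P_H2 = 0.001             # prior probability of the irrelevant hypothesis h2

def joint(e: bool, h1: bool, h2: bool) -> float:
    """Joint probability in which h2 is independent of e, h1 and e-and-h1."""
    p_h1 = P_H1 if h1 else 1 - P_H1
    p_e_true = P_E_GIVEN_H1 if h1 else P_E_GIVEN_NOT_H1
    p_e = p_e_true if e else 1 - p_e_true
    p_h2 = P_H2 if h2 else 1 - P_H2
    return p_h1 * p_e * p_h2

def prob(event) -> float:
    """Probability of an event given as a predicate over (e, h1, h2)."""
    worlds = product([True, False], repeat=3)
    return sum(joint(*w) for w in worlds if event(*w))

def log_ratio_power(hyp) -> float:
    """log[P(e|h) / P(e)], with h given as a predicate over (h1, h2)."""
    p_e = prob(lambda e, h1, h2: e)
    p_h = prob(lambda e, h1, h2: hyp(h1, h2))
    p_e_and_h = prob(lambda e, h1, h2: e and hyp(h1, h2))
    return math.log((p_e_and_h / p_h) / p_e)

weak_h1 = log_ratio_power(lambda h1, h2: h1)
weak_conj = log_ratio_power(lambda h1, h2: h1 and h2)
prior_h1 = prob(lambda e, h1, h2: h1)
prior_conj = prob(lambda e, h1, h2: h1 and h2)

print("weak power:", weak_h1, weak_conj)   # equal up to rounding
print("priors:    ", prior_h1, prior_conj)  # the conjunction is far less probable
```

The weak measure cannot tell the two hypotheses apart, while their prior probabilities differ by a factor of P(h2); it is this difference that a measure of overall explanatory goodness would need to register.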
The foregoing discussion suggests a satisfactory measure of explanatory goodness should capture the idea that in the case of irrelevant conjunction the more concise explanation is better: Concise explanation. If h 2 is probabilistically independent of e, h 1 and their conjunction, then E(e, h 1 ∧ h 2 ) ≤ E(e, h 1 ), with equality only in the case where P(h 2 |h 1 ) = 1.
However, the more general point that applies to all cases where two hypotheses have equal likelihoods is that explanatory factors such as simplicity or plausibility can discriminate between them. This motivates the following adequacy condition for a measure of explanatory goodness based on the relevance of the initial plausibility of the hypotheses as measured by their prior probabilities given only background knowledge: Initial plausibility. If P(e|h 1 ) = P(e|h 2 ), then E(e, h 1 ) > E(e, h 2 ) if and only if P(h 1 ) > P(h 2 ). Note that the concise explanation condition follows from initial plausibility. I will explore the role of prior probabilities further below.

Probabilistic relevance
Could a hypothesis which is negatively relevant to e provide a better explanation than one which is positively relevant to e? Consider a bag consisting of 99 fair coins and one coin with a bias towards heads such that its objective chance of landing heads is 0.51. A coin is selected at random and, on being tossed, lands heads. Consider the hypotheses: 'the selected coin is fair' (h 1 ) and 'the selected coin is biased' (h 2 ). Note that h 1 is negatively related to the observation since P(e|h 1 ) = 0.5 < 0.5001 = P(e) while h 2 is positively related to it. Thinking of explanation in terms of how well the hypotheses would account for the explanandum if they were true, which is what the four measures specified earlier seem to explicate, h 2 provides the better explanation. However, this does not take into account the prior improbability of h 2 , which is relevant if we are assessing the overall goodness of the explanations. In this sense, given the very small difference in the likelihoods and the much greater prior probability of h 1 , it is plausible to think that h 1 provides a much better explanation overall. Arguably, a trade-off needs to be made between probabilistic relevance and complexity (in the sense of lower probability) when evaluating an explanation, though how exactly that trade-off should be made is not immediately obvious. I explore this matter in Sects. 2.5 and 3. Even though h 1 seems to provide a better overall explanation than h 2 , there is also something deficient about h 1 as an explanation due to its negative relevance to the explanandum and this should feature in any plausible account of explanatory goodness. I will return to this point in Sect. 4.4.
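The arithmetic behind the coin example can be laid out as a short Python sketch. The probabilities come from the example itself; the use of log[P(e|h)/P(e)] as the weak measure is an assumption made for illustration, and the variable names are hypothetical.

```python
import math

# Values taken from the coin example: 99 fair coins, one coin with chance 0.51.
p_h1, p_h2 = 0.99, 0.01            # priors: fair coin, biased coin
p_e_given_h1, p_e_given_h2 = 0.5, 0.51

# Law of total probability for e = 'the selected coin lands heads'.
p_e = p_h1 * p_e_given_h1 + p_h2 * p_e_given_h2

# Weak explanatory power in the assumed log-ratio form.
weak_h1 = math.log(p_e_given_h1 / p_e)  # negative: h1 slightly lowers P(e)
weak_h2 = math.log(p_e_given_h2 / p_e)  # positive: h2 slightly raises P(e)

# Posterior probabilities via Bayes' theorem.
post_h1 = p_e_given_h1 * p_h1 / p_e
post_h2 = p_e_given_h2 * p_h2 / p_e

print(f"P(e) = {p_e:.4f}")
print(f"weak power: h1 = {weak_h1:.5f}, h2 = {weak_h2:.5f}")
print(f"posteriors: h1 = {post_h1:.4f}, h2 = {post_h2:.4f}")
```

The weak measure favours h2 by a small margin, whereas the priors and posteriors overwhelmingly favour h1, which is the tension that motivates a trade-off.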

Striking the balance
While it is perfectly reasonable to consider explanatory power in the sense explicated by measures E 1 − E 4 (weak explanatory power) as a factor in explanatory goodness, the focus in this section has been on the need for a trade-off between weak explanatory power and complexity. Or to put it another way, a measure of explanatory goodness should combine weak explanatory power and prior probability in an appropriate manner. The initial plausibility condition specifies how the priors can play a role when the likelihoods are equal, but how should they be taken into account more generally?
It might be thought that Bayes' theorem provides an answer since it essentially combines E 3 with the prior probability. In that case, the goodness of an explanation would be identified with its posterior probability. But there are good reasons to reject this approach. First, while priors are relevant to explanatory goodness, arguably this approach gives too much weight to priors via Bayes' theorem. While explanationists would like to think that the best explanation would often turn out to be the most probable hypothesis, it certainly seems possible that in at least some scenarios this might fail to be the case. Second, it also seems that in some cases a conjunctive explanation that combines two compatible hypotheses, h 1 ∧ h 2 say, could turn out to be a better explanation than either h 1 or h 2 , yet this is ruled out if explanatory goodness is identified with posterior probability.
One way of putting this is as follows. If h 1 and h 2 have equal posteriors then since P(e|h 1 ) · P(h 1 ) = P(e|h 2 ) · P(h 2 ), if we were to treat them as equal in terms of explanatory goodness, we would essentially be giving the priors as much importance as likelihoods. While I have argued that excluding priors is too extreme in one direction, giving them this much of a role is arguably too extreme in the other direction; a better balance is needed. In light of these considerations, it seems reasonable to use likelihoods to discriminate between hypotheses with equal posterior probabilities. Hence, despite my concerns about the positive relevance condition, it does seem appropriate to apply it in cases where the priors or the posteriors of the hypotheses are equal. This suggests the following restricted version of the positive relevance condition: Restricted positive relevance. If P(h 1 ) = P(h 2 ) or P(h 1 |e) = P(h 2 |e), then E(e, h 1 ) > E(e, h 2 ) if and only if P(e|h 1 ) > P(e|h 2 ).
Now we are in a position to consider how to make the appropriate trade-off between weak explanatory power and improbability/complexity.

A Good approach to good explanation
The mathematician and World War II cryptologist I. J. Good made significant contributions to this topic. I have already drawn attention to his measure in Eq. (3), which is probably the best known measure of explanatory power. Based on the desiderata he set out in his 1960 paper, he argued that this measure was 'essentially the only possible explicatum for explanatory power' (Good, 1960, p. 320). However, in another paper in 1968 he distinguished between explanatory power in the weak sense (weak explanatory power) and the strong sense (strong explanatory power) and noted that 'the double meaning of "explanatory power" has previously been overlooked' (Good, 1968, p. 124). By weak explanatory power, he meant that the explanatory power of a hypothesis h is 'unaffected by cluttering up [h] with irrelevancies', while strong explanatory power 'is affected by the cluttering' (Good, 1968, p. 123).
When is a hypothesis 'cluttered up with irrelevancies'? One of Good's desiderata (axiom 10 in the 1968 paper) provides the answer and hence the key distinction between weak and strong measures of explanatory power. This desideratum is essentially the irrelevant conjunction condition specified earlier. So irrelevant conjunction must be satisfied by a weak measure of explanatory power since it is unaffected by the inclusion of an irrelevant hypothesis (clutter). However, strong measures do not satisfy irrelevant conjunction, but instead take the prior probability into account to penalize the inclusion of an irrelevant hypothesis.
Note that strong explanatory power is intended to penalize not just the addition of irrelevant hypotheses, but also improbable/complex hypotheses more generally. An analogy with model selection might help to motivate this approach. By adopting a sufficiently complex model, it is possible to obtain an excellent fit to the data, but in doing so one is likely to over-fit the model to noise in the data. Hence, a trade-off between how well the model fits the data and the complexity of the model is sought and this can be achieved by penalizing models for their complexity. In Bayesian model selection, more complex models can be assigned lower probabilities so they are penalized more. This trade-off is closely related to that needed here. For example, in many cases it is possible to come up with an ad hoc hypothesis or conspiracy theory that has been deliberately constructed to entail the explanandum (see Sect. 2.2) even though the hypothesis itself is very improbable. To avoid this, hypotheses need to be penalized for their improbability/complexity. In model selection, this is often expressed in terms of Ockham's razor and as we will see this is also how Good refers to his approach.
The strong measure advocated by Good is:
$$E_5(e, h) = \log\left[\frac{P(e|h)\,P(h)^{\gamma}}{P(e)}\right] = \log\left[\frac{P(e|h)}{P(e)}\right] + \gamma \log P(h) \tag{5}$$
where 0 < γ < 1 is a constant and so (5) provides a continuum of measures of strong explanatory power. According to Good, 'the constant γ measures the degree to which the simplicity of the hypothesis is regarded as desirable … as compared with its weak explanatory power' (Good, 1968, p. 130). If γ = 0 were permitted then E 5 would just be Good's weak measure, E 3 , so weak explanatory power can be seen as a limiting case of strong explanatory power. Furthermore, requiring γ > 0 means that E 5 satisfies the concise explanation condition. Also, if γ = 1 were permitted then E 5 would just be the log of posterior probability and so requiring γ < 1 means that E 5 satisfies the restricted positive relevance condition. Relating this to the discussion in Sect. 2, we can see that the positive relevance condition is closely related to weak explanatory power since it entails the irrelevant conjunction condition. By contrast, the initial plausibility condition is closely related to strong explanatory power since it entails the concise explanation condition. And while a strong measure should not satisfy the positive relevance condition, it should nevertheless satisfy the restricted positive relevance condition.
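The role of γ can be illustrated with a brief Python sketch. It assumes that Good's strong measure takes the form log[P(e|h)/P(e)] + γ log P(h), which agrees with the two limiting cases just described; the function name `good_strong` is hypothetical, and the numbers are those of the coin example in Sect. 2. The endpoints γ = 0 and γ = 1 are evaluated only to exhibit the limits and lie outside the range Good allows.

```python
import math

def good_strong(p_e_given_h: float, p_h: float, p_e: float, gamma: float) -> float:
    """Assumed form of Good's strong measure: weak power plus gamma * log prior.

    Good requires 0 < gamma < 1; the endpoints are accepted here only so
    that the limiting cases can be inspected.
    """
    return math.log(p_e_given_h / p_e) + gamma * math.log(p_h)

# Coin example: h1 = fair coin (prior 0.99), h2 = biased coin (prior 0.01).
p_e = 0.99 * 0.5 + 0.01 * 0.51
fair = (0.5, 0.99, p_e)
biased = (0.51, 0.01, p_e)

# gamma = 0 recovers the weak measure; gamma = 1 gives the log posterior.
weak_fair = good_strong(*fair, gamma=0.0)
log_post_fair = good_strong(*fair, gamma=1.0)

# An intermediate value trades weak power off against the prior.
strong_fair = good_strong(*fair, gamma=0.5)
strong_biased = good_strong(*biased, gamma=0.5)

print("gamma = 0.0:", weak_fair, good_strong(*biased, gamma=0.0))
print("gamma = 0.5:", strong_fair, strong_biased)
print("gamma = 1.0:", log_post_fair, good_strong(*biased, gamma=1.0))
```

With γ = 0 the biased-coin hypothesis scores higher, but already at γ = 0.5 the penalty on its low prior makes the fair-coin hypothesis the better explanation, matching the verdict argued for in Sect. 2.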
The four measures discussed in Sect. 2 are appropriate if one is making judgments about weak explanatory power and so debates about their relative merits are to be understood in that light. However, if instead one is interested in when one hypothesis provides a better overall explanation of a given explanandum than another hypothesis does, then it seems that something along the lines of Good's strong sense of explanatory power is needed. In fact, Good proposes what he calls a 'sharpened version of "Ockham's razor" which is that if our primary purpose is explanation we should select the hypothesis (among those we know) which has the maximum strong explanatory power' (1968, p. 123).
A measure motivated by considerations of coherence provides another example of a measure of strong explanatory power (Glass, 2021), denoted here by E 6 . Strictly speaking, E 6 was not proposed as a measure of explanatory power as such, but rather as a measure for ranking hypotheses as explanations of an explanandum e. It is easy to show that E 6 satisfies the initial plausibility, concise explanation and restricted positive relevance conditions. Furthermore, when comparing h 1 and h 2 as explanations of e it judges h 1 to be better than h 2 if and only if:
$$P(e|h_1)\,P(h_1)^{1/2} > P(e|h_2)\,P(h_2)^{1/2}$$
and hence it provides the same ordering of explanations as Good's strong measure, E 5 , if we set γ = 1/2, which Good describes as the simplest explicatum. I will return to this point in Sect. 4.4. The overlap coherence measure was also proposed for ranking explanations. For a hypothesis h and explanandum e the overlap coherence is given by (Glass, 2002; Olsson, 2002):
$$\frac{P(h \wedge e)}{P(h \vee e)}$$
Like E 5 and E 6 , it also satisfies the initial plausibility, concise explanation and restricted positive relevance conditions and so can be considered as another strong measure of explanatory power. So Good's strong measure is not the only measure of strong explanatory power and hence further reasons need to be given in its defence. I now turn to that task.
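As a rough illustration of how a coherence-based measure responds to priors, the sketch below computes overlap coherence, assuming its usual definition as P(h ∧ e)/P(h ∨ e). The function name `overlap_coherence` and the numerical inputs are hypothetical; the example only checks the behaviour required by the initial plausibility condition on one pair of hypotheses and is not a general proof.

```python
def overlap_coherence(p_e_given_h: float, p_h: float, p_e: float) -> float:
    """Overlap coherence P(h and e) / P(h or e), assuming the usual definition."""
    p_h_and_e = p_e_given_h * p_h
    p_h_or_e = p_h + p_e - p_h_and_e  # inclusion-exclusion
    return p_h_and_e / p_h_or_e

# Hypothetical case: equal likelihoods, different priors, P(e) = 0.3.
P_E = 0.3
more_plausible = overlap_coherence(0.8, 0.20, P_E)
less_plausible = overlap_coherence(0.8, 0.05, P_E)

# Coin example from Sect. 2: fair coin (h1) versus biased coin (h2).
p_e_coin = 0.99 * 0.5 + 0.01 * 0.51
coin_fair = overlap_coherence(0.5, 0.99, p_e_coin)
coin_biased = overlap_coherence(0.51, 0.01, p_e_coin)

print("equal likelihoods:", more_plausible, less_plausible)
print("coin example:     ", coin_fair, coin_biased)
```

In both comparisons the hypothesis with the higher prior receives the higher score, so on these inputs the measure behaves like a strong rather than a weak measure.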

Deriving Good's strong measure
In his 1968 paper, Good adopted a two stage strategy to show that a measure of strong explanatory power must be a monotonically increasing function of his E 5 measure. First, he drew on his 1960 paper where he showed that a weak measure of explanatory power must be a monotonically increasing function of his E 3 measure based on ten axioms or desiderata for such a measure. Then he made some assumptions about a strong measure and its relation to his weak measure in order to derive his result concerning E 5 .
Here I want to present two new derivations that relate more closely to some of the desiderata for measures of explanatory power found in the recent literature. The first approach does not require establishing E 3 as a measure of weak explanatory power, but rather a property of it, which is sufficient to establish E 5 . The second follows Good's strategy of first establishing E 3 , in this case drawing on a result by , before using Good's result to establish E 5 as a measure of strong explanatory power.
The first condition is based on Crupi and Tentori (2012). It is a formal assumption about measures of weak and strong explanatory power, which I will denote as E W and E S respectively.
(A1) Let L be a propositional language and L c the contingent formulas in L. Let 𝒫 be the set of regular probability functions that can be defined over L and let E W : L c × L c × 𝒫 → ℝ and E S : L c × L c × 𝒫 → ℝ. There exist continuous, differentiable functions w and s such that, for any e, h ∈ L c and any P ∈ 𝒫, E W (e, h) = w[P(e ∧ h), P(h), P(e)] and E S (e, h) = s[P(e ∧ h), P(h), P(e)].
In terms of the dependence on P(e ∧ h), P(h) and P(e), A1 just says that E W and E S are functions of absolute and conditional probabilities of logical combinations of h and e since all of these probabilities are determined by P(e ∧ h), P(h) and P(e). The requirement of continuity and differentiability enables us to take advantage of part of Good's proof and ensures that the functions are well-behaved.
The second condition requires that E S depend only on E W and P(h). This is motivated by the distinction between a weak and strong measure of explanatory power since the latter should take into account the simplicity/complexity of the hypothesis in addition to its weak explanatory power.

(A2) E S can be expressed as a function of E W and P(h) so that E S (e, h) = s[P(e ∧ h), P(h), P(e)] = s W [E W (e, h), P(h)].
A possible objection to this condition is that while it might be accepted that E S should depend on E W and P(h), it might be questioned whether it should only depend on these two factors. However, we need to distinguish conceptually between a measure of strong explanatory power and a measure of overall explanatory goodness. Given Good's account of strong explanatory power, this condition seems unobjectionable. Whether a strong measure will turn out to provide a plausible measure of overall explanatory goodness will depend on how well it captures various explanatory virtues (see Sect. 4.5).
The third condition says that a weak measure of explanatory power should treat probabilistic independence between e and h as a special case by assigning it a fixed, neutral value. This clearly holds for E 1 − E 4 since they are measures of probabilistic relevance.
(A3) E W has a fixed, neutral point α such that E W (e, h) = α if and only if h and e are probabilistically independent.
Suppose that $h_1$ provides an explanation of $e_1$ and $h_2$ provides an explanation of $e_2$, but that $h_2$ and $e_2$ are irrelevant to $h_1$ and $e_1$. The fourth condition says that the degree to which $h_1 \wedge h_2$ explains $e_1 \wedge e_2$ is a function of the degree to which $h_1$ explains $e_1$ and the degree to which $h_2$ explains $e_2$, and that this applies to both weak and strong measures of explanatory power. Such a condition is discussed by Good (1968) in the context of strong explanatory power and by Cohen (2016), who presents a generalized version of this condition for an arbitrary number of explanandum-explanans pairs. Formally, it can be stated as follows:

(A4) If $h_2$ and $e_2$ are each probabilistically independent of $h_1$, $e_1$ and their conjunction, then $E_W(e_1 \wedge e_2, h_1 \wedge h_2)$ can be expressed as a function, $w_c$, of $E_W(e_1, h_1)$ and $E_W(e_2, h_2)$, where $w_c$ is strictly increasing in each argument when the other argument is fixed and non-extreme (i.e. neither its maximum nor its minimum value) and non-decreasing otherwise. Similarly, there is a corresponding function, $s_c$, for $E_S$.

In some cases, it seems very appropriate to combine independent explanations in this way. Cohen (2016), for example, highlights its relevance to sets of experiments where each is carried out in a different laboratory and has a separate hypothesis. However, Cohen does not propose this property as a necessary requirement for measures of explanatory power since he sees its virtue as being one of convenience. Certainly, it can be very convenient if a measure decomposes into products or sums, as is the case for Good's measures. For Good's weak measure we have:

$$E_3(e_1 \wedge e_2, h_1 \wedge h_2) = E_3(e_1, h_1) + E_3(e_2, h_2)$$

when the appropriate independence relationships hold, and it is easy to show that the corresponding result holds for his strong measure as well.

While such a decomposition is convenient, there are good reasons to think that A4 should indeed be a necessary condition for measures of explanatory power. Suppose a patient reports two symptoms, $e_1$ and $e_2$.
Whatever the patient might think, suppose the doctor has good reason to believe that there is no dependence between these symptoms and is able to explain them by conditions h 1 and h 2 respectively, which again are independent of each other and of the evidence they do not explain. In such a case, it is reasonable to combine these independent hypotheses to explain the symptoms to the patient. Furthermore, how well they explain the symptoms is very plausibly taken to be an increasing function of each explanation. For example, suppose the doctor had two potential explanations, h 2 and h 3 , for e 2 and that both satisfied the relevant independence conditions with h 1 and e 1 . It seems clear that if h 2 provides a better explanation of e 2 than h 3 does, then the combined explanatory power of h 1 and h 2 would be greater than that of h 1 and h 3 .
So the importance of A4 lies not merely in its convenience, but rather in the plausibility of requiring that, when explanations are combined and the relevant independence conditions are met, explanatory power should be an increasing function of each explanation. This becomes clear when we see scenarios where measures such as $E_1$ and $E_2$ violate A4.

Example 1 Suppose that $(e_1, h_1)$, $(e_2, h_2)$ and $(e_2', h_2')$ are three explanandum-explanans pairs satisfying the relevant independence conditions for A4. These can be thought of as three pairs of symptoms and corresponding conditions that explain them, with each of the three pairs being irrelevant to the other pairs. Suppose $P(e_1|h_1) = 1$, $P(e_1) = 0.5$, $P(e_2|h_2) = 0.4$, $P(e_2) = 0.2$, $P(e_2'|h_2') = 0.8$ and $P(e_2') = 0.75$. Clearly, $E_1(e_1, h_1)$ [...]

So although (a) $h_1$ explains $e_1$ and (b) $h_2$ explains $e_2$ better than $h_2'$ explains $e_2'$, $E_1$ and $E_2$ counterintuitively give the result that (c) $h_1 \wedge h_2$ explains $e_1 \wedge e_2$ less well than $h_1 \wedge h_2'$ explains $e_1 \wedge e_2'$. In fact, matters are worse than this since, according to $E_1$ and $E_2$, the combination of two poorer explanations can be better than the combination of two better explanations. These results provide good reasons for adopting A4 as a necessary requirement for both weak and strong measures of explanatory power.
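The comparison in Example 1 can be checked numerically. The following is a minimal sketch, not the paper's own calculation. It assumes the standard forms of the Schupbach and Sprenger measure ($E_1$) and the Crupi and Tentori measure ($E_2$), which are defined outside this section, takes logarithms to base 10 for $E_3$, and assumes that under the A4 independence conditions both $P(e|h)$ and $P(e)$ factorise for the conjoined pairs. The function names are hypothetical.

```python
import math

def e1(p_e_given_h, p_e):
    """Assumed Schupbach-Sprenger form [P(h|e) - P(h|~e)] / [P(h|e) + P(h|~e)],
    rewritten via Bayes' theorem so that it depends only on P(e|h) and P(e)."""
    x, y = p_e_given_h, p_e
    return (x - y) / (x + y - 2 * x * y)

def e2(p_e_given_h, p_e):
    """Assumed Crupi-Tentori form: P(e|h) - P(e), normalised by 1 - P(e)
    for positive relevance and by P(e) otherwise."""
    x, y = p_e_given_h, p_e
    return (x - y) / (1 - y) if x >= y else (x - y) / y

def e3(p_e_given_h, p_e):
    """Good's weak measure log[P(e|h) / P(e)] (base 10 is an assumption)."""
    return math.log10(p_e_given_h / p_e)

def conjoin(pair_a, pair_b):
    """(P(e|h), P(e)) for the conjoined pair, assuming the A4 independence
    conditions so that both probabilities factorise."""
    return (pair_a[0] * pair_b[0], pair_a[1] * pair_b[1])

pair_1 = (1.0, 0.5)          # P(e1|h1), P(e1)
pair_2 = (0.4, 0.2)          # P(e2|h2), P(e2)
pair_2_prime = (0.8, 0.75)   # P(e2'|h2'), P(e2')

results = {}
for name, measure in (("E1", e1), ("E2", e2), ("E3", e3)):
    results[name] = {
        "pair_2": measure(*pair_2),
        "pair_2_prime": measure(*pair_2_prime),
        "with_2": measure(*conjoin(pair_1, pair_2)),
        "with_2_prime": measure(*conjoin(pair_1, pair_2_prime)),
    }
    print(name, {key: round(value, 4) for key, value in results[name].items()})
```

Under these assumptions, all three measures rank $(e_2, h_2)$ above $(e_2', h_2')$, but $E_1$ and $E_2$ reverse the ordering once each pair is conjoined with $(e_1, h_1)$ (roughly 0.71 against 0.74 for $E_1$, and 0.33 against 0.68 for $E_2$), whereas $E_3$ preserves it and is additive across the independent pairs.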
In the discussion so far, I have argued for A4 by appealing to examples that are intended to highlight its plausibility. However, since E 1 and E 2 violate A4, these examples provide counterexamples to these measures. A possible response is to say that even in terms of weak explanatory power E 1 and E 2 are better thought of as explications of a different concept from the one being proposed here and hence from E 3 . I think this is a reasonable response and will return to it in Sect. 4.2 where I discuss the fact that A4 leads to a property of E 3 and E 5 that has been criticized in the literature.
The final two conditions were discussed earlier: (A5) E S satisfies initial plausibility (see Sect. 2.3).
Based on these assumptions and recalling that $E_S$ is a function of $E_W$ and $P(h)$ as expressed in A2, we then get the following theorem for Good's strong measure of explanatory power, $E_5$.

Theorem 1 If $E_W$ and $E_S$ are weak and strong measures of explanatory power respectively that satisfy A1–A6, then $E_S$ is a monotonically increasing function of Good's strong measure, $E_5$.

For an alternative way to derive Good's measure, consider the following conditions for a weak measure of explanatory power:

(A7) For any $e, h_1, h_2 \in L_c$ and $P \in \mathcal{P}$, $E_W(e, h_1) \gtreqless E_W(e, h_2)$ if and only if $P(e|h_1) \gtreqless P(e|h_2)$, i.e. $E_W$ satisfies positive relevance (see Sect. 2.3).

(A8) For any $e_1, e_2, h \in L_c$ and $P \in \mathcal{P}$, $E_W(e_1, h) \gtreqless E_W(e_2, h)$ if and only if $P(h|e_1) \gtreqless P(h|e_2)$.

A7 (positive relevance) seems like a very plausible condition for weak explanatory power. Of course, positive relevance was criticized in Sect. 2.3 in the context of overall explanatory goodness, but it is appropriate to retain it as a condition for weak explanatory power. Furthermore, all of the weak measures considered in this paper, $E_1$ to $E_4$, satisfy A7.
What about A8? In some particular cases, there is clear justification for A8 if we think of weak explanatory power in terms of reducing surprise. If $P(e_1) = P(e_2)$, then $E_W(e_1, h) > E_W(e_2, h)$ seems to require that $P(e_1|h) > P(e_2|h)$, from which it follows that $P(h|e_1) > P(h|e_2)$. Similarly, if $P(e_1|h) = P(e_2|h)$, then $E_W(e_1, h) > E_W(e_2, h)$ seems to require that $P(e_1) < P(e_2)$, from which it again follows that $P(h|e_1) > P(h|e_2)$. More generally, however, A8 is widely accepted as a necessary requirement for measures of the degree to which $e$ confirms $h$. Indeed, so central is it that Crupi and Tentori (2014) include it as part of their definition of confirmation, calling it final probability. In the present context, A8 can then be understood as stating a fundamental relationship between explanation and confirmation. It ensures that if $h$ provides explanations of $e_1$ and $e_2$, then it weakly explains (reduces the surprise of) $e_1$ better than $e_2$ exactly when $e_1$ provides greater confirmation of $h$ than does $e_2$. This seems very plausible indeed since it is precisely the ability to explain otherwise very surprising phenomena that can provide strong confirmation of a hypothesis.

Using A7 and A8 we then get the following theorem for Good's strong measure of explanatory power, $E_5$.

Theorem 2 If $E_W$ and $E_S$ are weak and strong measures of explanatory power respectively that satisfy A1, A2 and A4–A8, then $E_W$ is a monotonically increasing function of Good's weak measure, $E_3$, and $E_S$ is a monotonically increasing function of Good's strong measure, $E_5$.

Irrelevant evidence
The most significant objection to Good's weak measure of explanatory power, which applies equally to his strong measure, is the problem of irrelevant evidence due to Schupbach and Sprenger (2011). Let $e$ be a general description of Brownian motion and $h$ be Einstein's atomic explanation of it. Assuming $P(e|h)/P(e) \gg 1$, Good's weak measure correctly judges this to be a good explanation. However, let $e'$ be the irrelevant proposition that the mating season for an American green tree frog takes place from mid-April to mid-August. According to Good's measures (weak and strong), this has no bearing on the explanatory power of Einstein's account since $P(e \wedge e'|h)/P(e \wedge e') = P(e|h)/P(e)$. By contrast, according to the measures $E_1$ and $E_2$, the addition of $e'$ reduces the explanatory power.
Is this consequence of Good's measures as counterintuitive as Schupbach and Sprenger claim? I will respond by trying to show that there is a very plausible way to make sense of the alleged counterexample. Unlike the measures $E_1$ and $E_2$, Good's measures place no upper bound on the degree of explanatory power. If there are two explananda, $e_1$ and $e_2$, Good's weak measure can be expressed as

$$E_3(e_1 \wedge e_2, h) = \log \frac{P(e_1 \wedge e_2|h)}{P(e_1 \wedge e_2)} = \log \frac{P(e_2|h, e_1)}{P(e_2|e_1)} + \log \frac{P(e_1|h)}{P(e_1)} = E_3(e_2, h|e_1) + E_3(e_1, h), \tag{10}$$

where $E_3(e_2, h|e_1)$ represents the conditional weak explanatory power, i.e. the degree to which $h$ weakly explains $e_2$ after conditioning on $e_1$. Hence, the weak explanatory power of $h$ for $e_1 \wedge e_2$ is obtained by adding the degree to which it weakly explains $e_2$ conditional on $e_1$ to the degree to which it weakly explains $e_1$. Good (1960) refers to this as strict additivity of the first kind. Clearly, there is always scope for the explanatory power to be greater when $e_2$ is included than it was in the case of just $e_1$. If the degree to which $h$ explains $e_2$ given $e_1$ is positive, then the explanatory power increases; if it is negative, explanatory power decreases; and if it is zero, explanatory power remains unchanged. Even if $h$ entails $e_1$, the explanatory power could increase further. For example, if $h$ also entailed $e_2$ then it would increase further (provided $e_1$ did not entail $e_2$).

Returning to the earlier example, while Einstein's atomic account, $h$, provides an excellent explanation of $e$, which gives a general description of Brownian motion, its explanatory power would be increased further if it could explain additional relevant evidence.
For example, Einstein's account explained not only the general phenomenon of Brownian motion, but also the much more specific results of Perrin's 1908 experiments to determine the mean square displacement of particles undergoing Brownian motion and its relation to Avogadro's number, which further confirmed the atomic theory. By contrast, the explanatory power of Einstein's account would have decreased had there been additional evidence, e 2 say, for which Einstein's account had negative explanatory power (given e). So conjoining the original evidence e with additional positively relevant evidence explained by h would increase the explanatory power of h, while conjoining e with additional negatively relevant evidence such as e 2 would decrease the explanatory power of h. What effect should conjoining e with a proposition about American green tree frogs, which is completely irrelevant to h, have on the explanatory power of h? According to E 3 , it has no effect whatsoever, which seems very reasonable.
However, measures E 1 and E 2 are not additive and so give very different results. In fact, this gives rise to a counterintuitive feature of these measures relating to entailment. If a hypothesis, h, entails evidence, e, then conjoining e with further evidence cannot increase the explanatory power of h, no matter how well h explains this further evidence. And so if Einstein's account entails a general description of Brownian motion, then its explanatory power would not be increased by conjoining this evidence with Perrin's findings relating to Avogadro's number.
Furthermore, if a hypothesis entails an explanandum that is not at all surprising because it has a high prior probability in light of background knowledge, then its explanatory power cannot be enhanced by entailing a further surprising explanandum. Planetary orbits (e 1 ) that could be derived from Newton's theory could also be derived from Einstein's theory (h) and so the explanatory power of Einstein's theory would be one according to E 1 and E 2 . However, the perihelion of Mercury (e 2 ) could also be derived from Einstein's theory, but according to E 1 and E 2 its explanatory power for e 1 ∧ e 2 would not be any greater than it was for e 1 . We might call this the problem of relevant evidence.
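Both the irrelevant evidence case and the problem of relevant evidence can be illustrated with a small numerical sketch. All probabilities below are hypothetical values chosen for illustration, and the forms used for E 1 (Schupbach and Sprenger) and E 2 (Crupi and Tentori) are the standard ones, assumed here because they are defined outside this section. Logarithms are taken to base 10.

```python
import math

def e1(p_e_given_h, p_e):
    """Assumed Schupbach-Sprenger measure, written in terms of P(e|h) and P(e)."""
    x, y = p_e_given_h, p_e
    return (x - y) / (x + y - 2 * x * y)

def e2(p_e_given_h, p_e):
    """Assumed Crupi-Tentori measure."""
    x, y = p_e_given_h, p_e
    return (x - y) / (1 - y) if x >= y else (x - y) / y

def e3(p_e_given_h, p_e):
    """Good's weak measure log[P(e|h) / P(e)]."""
    return math.log10(p_e_given_h / p_e)

# Irrelevant evidence: e' is independent of both h and e (hypothetical numbers),
# so P(e & e'|h) = P(e|h) P(e') and P(e & e') = P(e) P(e').
p_e_given_h, p_e, p_irrelevant = 0.9, 0.3, 0.33
before = (p_e_given_h, p_e)
after = (p_e_given_h * p_irrelevant, p_e * p_irrelevant)

# Relevant evidence: h entails an unsurprising e1 and also a surprising e2.
entails_e1 = (1.0, 0.9)      # P(e1|h), P(e1)
entails_both = (1.0, 0.05)   # P(e1 & e2|h), P(e1 & e2)

for name, measure in (("E1", e1), ("E2", e2), ("E3", e3)):
    print(name,
          "irrelevant:", round(measure(*before), 4), "->", round(measure(*after), 4),
          "| relevant:", round(measure(*entails_e1), 4), "->", round(measure(*entails_both), 4))
```

With these numbers, conjoining the irrelevant proposition leaves E 3 unchanged while lowering E 1 and E 2 , and adding the surprising entailed explanandum raises E 3 while E 1 and E 2 remain at their maximum of one.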
In summary, there is a very plausible way to make sense of the irrelevant evidence issue from the perspective of Good's weak measure (and hence his strong measure too). Furthermore, I have argued that the non-additive nature of measures such as $E_1$ and $E_2$ can give rise to counterintuitive judgments about explanatory power and, in particular, the problem of relevant evidence. However, maybe there is another way to view these differences. Although $E_1$, $E_2$ and Good's measure, $E_3$, as well as $E_4$, are all weak measures of explanatory power, they may nevertheless be explicating different concepts. Arguably, measures such as $E_1$ and $E_2$ are better understood as explications of the degree to which $h$ entails $e$. [11] However, if one wants a weak measure of explanatory power that does justice to explanatory scope and so increases appropriately as it explains more evidence, then an additive measure such as Good's weak measure, $E_3$, is suitable since it satisfies Eq. (10) as well as A4. I will also argue below that $E_3$ has advantages in terms of explicating reduction of surprise. [12]

[11] Both $E_1$ and $E_2$ are well-known measures of the degree to which $h$ confirms $e$ and can very plausibly be considered as measures of confirmation in the sense of partial entailment since they are maximal when $h$ entails $e$ and minimal when $h$ entails the negation of $e$. For further discussion, see Fitelson (2006).

[12] $E_6$ and $E_7$ do not face the problems of irrelevant evidence, irrelevant conjunction or relevant evidence. $E_6$ also satisfies A4, but $E_7$ does not.

Explanatory power and information
Good (1968) considers how his measures of weak and strong explanatory power relate to semantic information. According to one very widely used account (Bar-Hillel and Carnap, 1953), the semantic information or information content of $h$ is given by

$$\mathrm{Inf}(h) = -\log P(h)$$

for a probability distribution $P$, while the information content of $h$ given $e$ is

$$\mathrm{Inf}(h|e) = -\log P(h|e).$$

The information concerning $h$ provided by $e$ is given by

$$\mathrm{Inf}(h, e) = \mathrm{Inf}(h) - \mathrm{Inf}(h|e) = \log \frac{P(h|e)}{P(h)},$$
which Good also calls the mutual information between h and e since it is symmetric in h and e (Good, 1966, 1968). Hence, Good identifies the degree to which h weakly explains e [see Eq. (3)] with the information concerning h provided by e or equivalently, and perhaps more appropriately, the information concerning e provided by h.
Since $\mathrm{Inf}(e)$ is a decreasing function of $P(e)$, it could be taken to represent the degree to which $e$ is surprising, in which case $\mathrm{Inf}(e|h)$ would represent the degree to which $e$ is surprising given $h$. Good's weak measure of explanatory power can then be understood as representing how well $h$ reduces the degree to which $e$ is found to be surprising since it can be expressed as follows:

$$E_3(e, h) = \mathrm{Inf}(e) - \mathrm{Inf}(e|h).$$

Schupbach and Sprenger (2011) also interpret their measure of explanatory power in terms of reducing surprise, but there are a couple of advantages to Good's measure in this respect. First, as we have just seen, Good's weak measure can be formulated very straightforwardly in terms of semantic information. Second, Schupbach and Sprenger's measure, $E_1$, fails to discriminate appropriately in terms of reduction of surprise for different explananda which are entailed by a hypothesis, and the same is true of Crupi and Tentori's measure, $E_2$, since both give the maximum value of one in such cases. Suppose that $e_1$ is very surprising in light of background knowledge, while $e_2$ is not surprising at all. Further suppose that $h$ entails $e_1$ and also $e_2$. While $E_1$ and $E_2$ quantify the degree to which $h$ explains $e_1$ to be the same as the degree to which it explains $e_2$, according to Good's measure, $E_3$, $h$ provides a much better weak explanation of $e_1$ than it does of $e_2$. In fact, since $\mathrm{Inf}(e_1|h) = \mathrm{Inf}(e_2|h) = 0$, the degree to which $h$ explains $e_1$ is just $\mathrm{Inf}(e_1)$ and similarly the degree to which $h$ explains $e_2$ is $\mathrm{Inf}(e_2)$ according to $E_3$. Since $\mathrm{Inf}(e_1)$ can be thought of as the degree to which $e_1$ is surprising in light of background knowledge only, it is clearly much greater than $\mathrm{Inf}(e_2)$. As noted earlier, $E_1$ and $E_2$ are better thought of as measures of the degree to which $h$ entails $e$.
Good's strong measure of explanatory power, $E_5$ [see Eq. (5)], can be expressed in terms of semantic information as follows:

$$E_5(e, h) = \mathrm{Inf}(e) - \mathrm{Inf}(e|h) - \gamma\,\mathrm{Inf}(h).$$

In light of our discussion, we can then say that strong explanatory power measures how well $h$ reduces the degree to which $e$ is found surprising together with the inclusion of a penalty for the complexity of $h$.

Making Good's measure precise
Recall that Good's measure, $E_5$, has a parameter, $\gamma$, which is required to be in the interval $(0, 1)$. Can a particular value for $\gamma$ be defended? As Good pointed out, $E_5$ can be expressed as follows:

$$E_5(e, h) = (1 - \gamma)\,\mathrm{Inf}(h, e) - \gamma\,\mathrm{Inf}(h|e). \tag{16}$$

On the basis of this expression, he suggested $\gamma = 1/2$ as the simplest explicatum of $E_5$ since it gives equal weighting to (weak) explanatory power and the term $-\mathrm{Inf}(h|e)$, which he associates with 'the avoidance of "clutter"'. However, while Good's suggestion is not implausible, a more convincing justification is needed.
To address this point, we can draw on a complexity criterion proposed for explanatory goodness (Glass, 2023). The criterion requires that for an explanation $h$ of explanandum $e$ to be a good one, the reduction in complexity of $e$ brought about by $h$ must be greater than the complexity introduced by $h$ in the context of $e$, where the first of these quantities is represented by $\mathrm{Inf}(h, e)$ and the second by $\mathrm{Inf}(h|e)$. Expressed in terms of strong explanatory power, it is:

Complexity criterion for strong explanatory power. If $E_S(e, h)$ is a measure of strong explanatory power of $h$ for $e$ then:

$$E_S(e, h) > 0 \ \text{ if and only if } \ \mathrm{Inf}(h, e) > \mathrm{Inf}(h|e).$$

Note that since $\mathrm{Inf}(e, h)$ is Good's weak measure and $\mathrm{Inf}(h|e) \geq 0$, this means that a positive value of weak explanatory power is necessary, but not sufficient, for an explanation to be a good one. In light of (16), $E_5(e, h) > 0$ if and only if $(1 - \gamma)\,\mathrm{Inf}(h, e) > \gamma\,\mathrm{Inf}(h|e)$ and hence if $E_5$ is to satisfy the complexity criterion, $\gamma$ must be $1/2$. This provides a strong justification for adopting this specific version of Good's measure and, as noted earlier, for a given explanandum $e$ this will give the same ordering of hypotheses as measure $E_6$.
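The step from the condition (1 − γ)Inf(h, e) > γ Inf(h|e) to γ = 1/2 can be made concrete with a small numerical sketch. The two probability assignments below are hypothetical and were chosen so that Inf(h, e) and Inf(h|e) are close to one another; base-10 logarithms and the helper names are likewise assumptions made for illustration.

```python
import math

def inf(p):
    """Information content -log10(p)."""
    return -math.log10(p)

def quantities(p_h, p_e_given_h, p_e_given_not_h):
    """Return (Inf(h, e), Inf(h|e)) for a hypothesis h and explanandum e."""
    p_e = p_h * p_e_given_h + (1 - p_h) * p_e_given_not_h
    p_h_given_e = p_h * p_e_given_h / p_e
    weak = inf(p_h) - inf(p_h_given_e)   # Inf(h, e), i.e. Good's weak measure
    clutter = inf(p_h_given_e)           # Inf(h|e)
    return weak, clutter

def e5(weak, clutter, gamma):
    """Good's strong measure in the form (1 - gamma) Inf(h, e) - gamma Inf(h|e)."""
    return (1 - gamma) * weak - gamma * clutter

# Hypothetical cases: (P(h), P(e|h), P(e|not-h)).
case_a = quantities(0.1, 1.0, 0.3)   # Inf(h, e) < Inf(h|e): criterion not met
case_b = quantities(0.2, 1.0, 0.3)   # Inf(h, e) > Inf(h|e): criterion met

for label, (weak, clutter) in (("A", case_a), ("B", case_b)):
    for gamma in (0.3, 0.5, 0.7):
        print(label, gamma, "criterion met:", weak > clutter,
              "E5 > 0:", e5(weak, clutter, gamma) > 0)
```

In this sketch, γ = 0.3 assigns a positive value in case A even though the criterion is not met, and γ = 0.7 assigns a negative value in case B even though it is met, while γ = 1/2 agrees with the criterion in both cases.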
Let us now return to the example from Sect. 2.4 about a bag containing 99 fair coins and one with an objective chance of 0.51 of landing heads. On being tossed, a randomly selected coin lands heads ($e$) and we considered the hypotheses 'the selected coin is fair' ($h_1$) and 'the selected coin is biased' ($h_2$). Using Good's measure with $\gamma = 1/2$, we find that $E_5(e, h_1) \approx \log(0.9998) + \log(0.9950) \approx -0.0023$, which is greater than $E_5(e, h_2) \approx \log(1.0198) + \log(0.1) \approx -0.9915$. So $h_1$ is indeed the better explanation according to $E_5$, whereas $h_2$ would be judged better by weak measures since it is positively relevant to $e$ while $h_1$ is not. According to $E_5$, $h_2$ does not sufficiently reduce the complexity of $e$ to compensate for the complexity introduced by $h_2$. Notice, however, that even though $E_5$ judges $h_1$ to be better, $h_1$ is clearly deficient in the sense that it has a negative degree of explanatory power, so it might be more accurate to say that $h_1$ is not as bad an explanation as $h_2$.
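The figures in the coin example can be reproduced with a few lines of code. This is a minimal sketch that assumes base-10 logarithms (which is what the quoted values imply) and writes Good's strong measure as log[P(e|h)/P(e)] + γ log P(h); the function names are hypothetical.

```python
import math

def good_e3(p_e_given_h, p_e):
    """Good's weak measure: log10[P(e|h) / P(e)]."""
    return math.log10(p_e_given_h / p_e)

def good_e5(p_e_given_h, p_e, p_h, gamma=0.5):
    """Good's strong measure: weak measure plus gamma * log10 P(h)."""
    return good_e3(p_e_given_h, p_e) + gamma * math.log10(p_h)

# Bag of 99 fair coins and one coin with chance 0.51 of heads.
p_h1, p_h2 = 0.99, 0.01          # priors: fair, biased
p_e_h1, p_e_h2 = 0.5, 0.51       # chance of heads under each hypothesis
p_e = p_h1 * p_e_h1 + p_h2 * p_e_h2

weak_h1, weak_h2 = good_e3(p_e_h1, p_e), good_e3(p_e_h2, p_e)
strong_h1 = good_e5(p_e_h1, p_e, p_h1)
strong_h2 = good_e5(p_e_h2, p_e, p_h2)

print("weak:  ", round(weak_h1, 4), round(weak_h2, 4))
print("strong:", round(strong_h1, 4), round(strong_h2, 4))
```

The weak measure favours the biased-coin hypothesis, since only it is positively relevant to the outcome, while the strong measure with γ = 1/2 favours the fair-coin hypothesis, with both strong values negative.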

Explanatory virtues and inference to the best explanation
We have already seen that Good relates his strong measure of explanatory power to the explanatory virtue of simplicity. According to his version of Ockham's razor, if two hypotheses have equal likelihoods with respect to the explanandum we should prefer the simpler of the two, which he says is 'equivalent to the choice of the more probable hypothesis' (1968, p. 139). Given the discussion in Sects. 2.2 and 2.3, Good's measure does indeed accommodate simplicity in a way that weak measures do not.
Good's measure is also able to do justice to other explanatory virtues such as scope and unification. More specifically, it is his weak measure, which is a component of his strong measure, that is able to capture these virtues. In terms of explanatory scope, we have already seen from Eq. (10) that the explanatory power of a hypothesis increases as it explains more evidence. In terms of unification, Myrvold (2003) develops an account in terms of informational relevance. Expressing a result of Myrvold's in terms of Good's measure of weak explanatory power gives:

$$E_3(e_1 \wedge e_2, h) = E_3(e_1, h) + E_3(e_2, h) + U(e_1, e_2; h),$$

where $U(e_1, e_2; h)$ is the degree to which $h$ unifies $e_1$ and $e_2$ and is given by

$$U(e_1, e_2; h) = \log \frac{P(e_1 \wedge e_2|h)}{P(e_1|h)\,P(e_2|h)} - \log \frac{P(e_1 \wedge e_2)}{P(e_1)\,P(e_2)}.$$

Hence $h$ weakly explains $e_1 \wedge e_2$ to a degree that is the sum of how well it weakly explains $e_1$ and $e_2$ separately plus the degree to which it unifies them. It follows that if the sum of the weak explanatory power for $e_1$ and $e_2$ is the same for two hypotheses, then the one that unifies $e_1$ and $e_2$ more will have greater weak explanatory power. If they also have the same priors, then the hypothesis that unifies $e_1$ and $e_2$ more will have greater strong explanatory power as well. A similar conclusion can be reached concerning Whewell's (1847) 'consilience of inductions' in terms of the value of diverse evidence (see McGrew, 2016).

Does this mean that Good's strong measure fully captures explanatory goodness? The various weak measures may well capture an aspect of explanatory goodness, but since they fail to accommodate simplicity, I have argued that they are not plausible candidates for measures of explanatory goodness in a general sense. Since Good's measure incorporates simplicity as well as the other virtues described above, it is a much more plausible candidate. Whether it fully captures explanatory goodness, however, is another matter. As acknowledged in Sect.
2.1, there may be some limitations to what can be captured probabilistically and this could include limitations arising from the fact that the account does not attempt to capture what constitutes an explanation. Also, the current approach does not take into account the potential relevance of manipulations to explanatory goodness. Eva and Stern (2019) have shown how this can be done for Schupbach and Sprenger's measure of explanatory power, so it would be interesting to explore whether a similar approach might be appropriate for the current measure. Nevertheless, as it stands, Good's measure does seem to go a long way to capturing key aspects of explanatory goodness.
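The decomposition of weak explanatory power into separate explanatory contributions plus a unification term, discussed above in connection with Myrvold (2003), can be checked on a concrete distribution. The joint distribution below is hypothetical and chosen only for illustration, the unification term is computed as the informational relevance of the two explananda given the hypothesis minus their unconditional informational relevance, and base-10 logarithms are an assumption.

```python
import math

# Hypothetical joint distribution over truth values of (h, e1, e2); 1 = true.
joint = {
    (1, 1, 1): 0.20, (1, 1, 0): 0.05, (1, 0, 1): 0.03, (1, 0, 0): 0.02,
    (0, 1, 1): 0.10, (0, 1, 0): 0.15, (0, 0, 1): 0.15, (0, 0, 0): 0.30,
}

def prob(event):
    """Probability of an event given as a predicate on (h, e1, e2)."""
    return sum(p for world, p in joint.items() if event(*world))

def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda h, a, b: event(h, a, b) and given(h, a, b)) / prob(given)

def is_h(h, a, b): return h == 1
def is_e1(h, a, b): return a == 1
def is_e2(h, a, b): return b == 1
def is_both(h, a, b): return a == 1 and b == 1

def e3(evidence, hypothesis):
    """Good's weak measure log10[P(e|h) / P(e)]."""
    return math.log10(cond(evidence, hypothesis) / prob(evidence))

# Degree to which h unifies e1 and e2.
unification = (
    math.log10(cond(is_both, is_h) / (cond(is_e1, is_h) * cond(is_e2, is_h)))
    - math.log10(prob(is_both) / (prob(is_e1) * prob(is_e2)))
)

lhs = e3(is_both, is_h)
rhs = e3(is_e1, is_h) + e3(is_e2, is_h) + unification
print(round(lhs, 6), round(rhs, 6), round(unification, 6))
```

The two sides agree for any regular joint distribution, since the identity follows directly from the definitions; the particular numbers only serve to make the terms visible.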
A related topic concerns the relevance of Good's strong measure to IBE. Recent work has demonstrated the merits of E 6 [Eq. (6)] in this regard (Glass, 2021) and, as we have seen, it produces the same ranking as Good's strong measure when γ is 1 /2. Results showed that using this measure for IBE finds the actual or true hypothesis much more frequently than versions of IBE based on weak measures. There is a lot more that could be said about explanatory virtues and IBE, but this brief discussion suggests that Good's strong measure does well on both fronts.

Conclusion
Strong measures of explanatory power attempt to strike a balance between how well a hypothesis accounts for the explanandum (weak explanatory power) and the improbability/complexity of the hypothesis. As such, they can be viewed as ways of making Ockham's razor precise. While weak measures seek to capture an important aspect of explanation, I have argued that strong measures are better for quantifying explanatory goodness. In defence of Good's strong measure, I have presented two new derivations of it, explored its connection with information theory and explanatory virtues, shown how it can be made precise, and addressed objections to it. Since Good's strong measure depends on his weak measure, I have also presented several reasons for preferring his weak measure to the other weak measures. In particular, his weak measure is able to differentiate between explanatory power in cases where a given hypothesis entails two explananda where one is more surprising than the other.
There are various directions for further work. As noted above, it would be interesting to explore the potential relevance of manipulations to explanatory goodness in the context of Good's measure. Also, in debates about IBE and Bayesianism, it is usually assumed that a Bayesian approach requires selecting the hypothesis with the highest posterior probability. However, this is not the case for the Bayesian approach to IBE based on Good's measure. This is particularly relevant in cases where there are multiple compatible hypotheses. Strong measures of explanatory power should shed light on when it is appropriate to accept conjunctive explanations involving two or more hypotheses rather than just a single hypothesis.

that for any (x, y, z 1 ) and (x, y, z 2 ) ∈ D w 1 , w 1 (x, y, z 1 ) = w 1 (x, y, z 2 ). Hence, lemma A.1 requires that there must exist w 2 such that, for any e, h ∈ L c and P ∈ P, E W (e, h) = w 2 [P(h|e), P(h)] and w 2 (x, y) = w 1 (x, y, z). Hence it follows from (A2) that there is a differentiable function s 2 such that, for any e, h ∈ L c and P ∈ P, E S (e, h) = s 2 [P(h|e), P(h)] since a differentiable function of differentiable functions is itself differentiable. This establishes lemma A.3.
Given lemma A.3 and (A4), Good shows in his 1968 paper that, up to a differentiable monotonic transformation, $E_S(e, h)$ is given by

$$\log P(h|e) - (1 - \gamma) \log P(h),$$

where $\gamma$ is a constant, or alternatively,

$$\log \frac{P(e|h)}{P(e)} + \gamma \log P(h).$$

These conditions also require that any monotonic transformation of this function must be increasing. Suppose that $E_S(e, h)$ were a decreasing function of this expression. Suppose also that $P(e|h_1) = P(e|h_2)$. Then, if $\gamma \log P(h_1) > \gamma \log P(h_2)$, and hence $P(h_1) > P(h_2)$, it would follow that $E_S(e, h_1) \leq E_S(e, h_2)$, which contradicts (A5). This establishes theorem 1.

Proof of theorem 2
The result for E W follows from (A1), (A7) and (A8) as demonstrated by theorem 3 of Cohen (2016), which was in turn proved in the context of E 3 as a confirmation measure by . Lemma A.1 follows trivially given the result for E W and lemma A.3 then follows straightforwardly from lemma A.1 and (A2). The result for E S can then be established from the relevant part of the proof for theorem 1 based on lemma A.3, (A4), (A5) and (A6).