Abstract
In the general problem of verisimilitude, we try to define the distance of a statement from a target, which is an informative truth about some domain of investigation. For example, the target can be a state description, a structure description, or a constituent of a first-order language (Sect. 1). In the problem of legisimilitude, the target is a deterministic or universal law, which can be expressed by a nomic constituent or a quantitative function involving the operators of physical necessity and possibility (Sect. 2). The special case of legisimilitude, where the target is a probabilistic law (Sect. 3), has been discussed by Roger Rosenkrantz (Synthese, 1980) and Ilkka Niiniluoto (Truthlikeness, 1987, Ch. 11.5). Their basic proposal is to measure the distance between two probabilistic laws by the Kullback–Leibler notion of divergence, which is a semimetric on the space of probability measures. This idea can be applied to probabilistic laws of coexistence and laws of succession, and the examples may involve discrete or continuous state spaces (Sect. 3). In this paper, these earlier studies are elaborated in four directions (Sect. 4). First, even though deterministic laws are limiting cases of probabilistic laws, the target-sensitivity of truthlikeness measures implies that the legisimilitude of probabilistic laws is not easily reducible to the deterministic case. Secondly, the Jensen-Shannon divergence is applied to mixed probabilistic laws which entail some universal laws. Thirdly, a new class of distance measures between probability distributions is proposed, so that their horizontal differences are taken into account in addition to vertical ones (Sect. 5). Fourthly, a solution is given for the epistemic problem of estimating degrees of probabilistic legisimilitude on the basis of empirical evidence (Sect. 6).
1 The similarity approach to truthlikeness
Karl Popper’s (1963) original work on truthlikeness was based on the concepts of truth value (true or false) and logical deduction (entailment). Theories were represented as deductively closed sets of sentences in some language L, and the comparative notion “more truthlike” was characterized by set-theoretical comparisons of the truth content and falsity content of rival theories. The main lesson from the dramatic failure of Popper’s definition in 1974 was the need to add the notion of similarity or resemblance to the logical toolbox. In the first formulations of the similarity approach, Risto Hilpinen (1976) represented theories as classes of possible worlds and employed spheres of similarity from David Lewis’ approach to counterfactuals, while Pavel Tichý defined theories as disjunctions of propositional constituents and Ilkka Niiniluoto as disjunctions of monadic constituents, adding a function to measure the distance between constituents. Soon these notions were extended to full first-order languages by Tichý, Niiniluoto, Raimo Tuomela and Graham Oddie, with systematic summaries in Oddie’s Likeness to Truth (1986), Niiniluoto’s Truthlikeness (1987), and Theo Kuipers’ edited volume What Is Closer-to-the-Truth? (1987).
In Hilpinen’s treatment, the truthlikeness of a theory depends on its maximum and minimum distances from the actual world. Tichý and Oddie favor the average distance, while Niiniluoto combines the minimum distance with the normalized sum of all distances. In the linguistic formulations, the degree of truthlikeness Tr(H,C*) of a theory H in language L depends on the similarity of the disjuncts of H with the true constituent C* of L. Here the target C* is the most informative truth expressible in the conceptual framework L, and Tr(H,C*) is maximal when H is identical with this complete truth C*.Footnote 1
A successful theory H should give a full and correct description of a domain of investigation by its conceptual resources in language L. In other words, H should specify the L-structure of the actual world with respect to L, and strongest theories are able to do this up to isomorphism. Here L may include qualitative or quantitative concepts. But the choice of the logical complexity of language L allows a finer discrimination: the target can be chosen for the purposes of the relevant cognitive problem, so that it may be a propositional constituent, a state description, a structure description, a monadic constituent, a polyadic constituent of depth-d, or a complete first-order theory.Footnote 2 For each of these choices, the task is to define the distances of statements in L from the given target.
Following Popper, one should distinguish here two problems. In the logical problem of truthlikeness, we are given the true target C*, and we ask what it means to say that a theory is close to C* or closer to C* than another theory. In the epistemic problem of truthlikeness, the true target C* is unknown, and we ask how we can rationally claim or estimate on available evidence E that one theory is close to C* or closer to C* than another theory.
To illustrate the similarity approach to these two problems, let L be a monadic first-order language with k one-place predicates, and let \({\mathbf{Q}} = \left\{ {{\text{Q}}_{1} , \ldots ,{\text{Q}}_{{\text{K}}} } \right\}\) be the Q-predicates of L (\({\text{K}} = 2^{{\text{k}}}\)). The Q-predicates can be defined by the conjunction of negated or unnegated primitive predicates of L, so that there is a natural distance \({\uprho }_{{{\text{uv}}}} = {\text{ d}}\left( {{\text{Q}}_{{\text{u}}} ,{\text{Q}}_{{\text{v}}} } \right)\) between them.Footnote 3 The Q-predicates are the strongest predicates expressible in L, and they constitute a classification system of individuals in the domain of L. A state description in L locates each individual in one and only one “cell” defined by a Q-predicate, while a structure description specifies the proportions of individuals in these cells. A monadic constituent Ci of L specifies which Q-predicates are empty and which are non-empty:
\((+/-)(\exists {\text{x}}){\text{Q}}_{1} ({\text{x}})\;\& \; \ldots \;\& \;(+/-)(\exists {\text{x}}){\text{Q}}_{{\text{K}}} ({\text{x}})\quad (1)\)
where (+ / −) is replaced by negation or nothing. As an empty universe is excluded, the number of constituents is \({\text{q}} = 2^{{\text{K}}} - 1\). If CTi is the class of occupied cells by Ci, then (1) can be rewritten in the following form:
If for the true constituent C* there are no empty cells, so that CT* = Q, the world is atomistic in the sense that there are no true universal generalizations. For example, the truth of the generalization \(\left( {\text{x}} \right)\left( {{\text{Fx}} \to {\text{Gx}}} \right)\) means that the cell F & ~G is empty. A simple distance between monadic constituents is the Clifford-measure:
\(\Delta _{{\text{C}}} \left( {{\text{C}}_{{\text{i}}} ,{\text{C}}_{{\text{j}}} } \right) = |{\text{CT}}_{{\text{i}}} \Delta {\text{CT}}_{{\text{j}}} |/{\text{K}}\quad (3)\)
where Δ is the symmetric difference (see Fig. 1). Variants of (3), which take into account distances between Q-predicates, have been considered by Tichý, Oddie, and Niiniluoto.Footnote 4 Then the degree of truthlikeness Tr(H,C*) of a generalization H in L depends on the Clifford-distances (or their variants) of the disjuncts of H from the true constituent C*. A comparative notion “H1 is closer to the truth than H2” is explicated by the condition Tr(H1,C*) > Tr(H2,C*).
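As a minimal computational sketch of the Clifford-measure (3), a constituent can be represented by its set CT of cells claimed to be non-empty. The function name `clifford_distance` and the integer labels for the Q-predicates are illustrative assumptions, not notation from the text.

```python
def clifford_distance(ct_1, ct_2, K):
    """Normalized size of the symmetric difference of two sets of cells."""
    return len(set(ct_1) ^ set(ct_2)) / K


# Hypothetical example with K = 4 Q-predicates labelled 1..4.
ct_true = {1, 2, 3, 4}  # atomistic world: every cell is occupied
ct_h = {1, 3, 4}        # a constituent that wrongly declares cell 2 empty

# One mistaken claim out of four cells.
distance = clifford_distance(ct_h, ct_true, K=4)
```

The variants mentioned above would replace the plain count of mistaken cells by weights derived from the distances between Q-predicates.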
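If C* is unknown, but a rational epistemic probability measure P is defined over the class of constituents of L, then the unknown degree Tr(H,C*) can be estimated by its expected value on the basis of evidence E:
\({\text{ver}}\left( {{\text{H}}/{\text{E}}} \right) = \Sigma {\text{P}}\left( {{\text{C}}_{{\text{i}}} /{\text{E}}} \right){\text{Tr}}\left( {{\text{H}},{\text{C}}_{{\text{i}}} } \right)\quad (4)\)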
where the sum goes over i = 1, …, q.Footnote 5 For monadic languages, the relevant posterior probabilities P(Ci/E) of constituents (i.e., degrees of belief on the truth of Ci given E) are given by Jaakko Hintikka’s system of inductive logic.Footnote 6
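The expected-value estimate can be sketched as follows. Two simplifications are assumptions made only for illustration: the truthlikeness of a single constituent is taken to be 1 minus its Clifford distance from the target (not the min-sum measure), and the posterior is a uniform placeholder rather than a distribution from Hintikka's system. All function names are hypothetical.

```python
from itertools import combinations


def constituents(K):
    """All non-empty sets of cells 0..K-1; there are 2**K - 1 of them."""
    cells = range(K)
    return [frozenset(c) for r in range(1, K + 1) for c in combinations(cells, r)]


def tr(ct_h, ct_target, K):
    """Stand-in truthlikeness: 1 minus the Clifford distance (an assumption)."""
    return 1.0 - len(ct_h ^ ct_target) / K


def expected_tr(ct_h, posterior, K):
    """Sum over constituents C_i of P(C_i/E) * Tr(H, C_i)."""
    return sum(prob * tr(ct_h, ct_i, K) for ct_i, prob in posterior.items())


K = 4
all_constituents = constituents(K)
# Placeholder posterior: uniform over all constituents.
posterior = {c: 1.0 / len(all_constituents) for c in all_constituents}
# Estimated truthlikeness of the atomistic constituent (all cells occupied).
estimate = expected_tr(frozenset(range(K)), posterior, K)
```

With a realistic posterior concentrated on a few constituents, the same sum would be dominated by the truthlikeness of H relative to those constituents.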
2 Verisimilitude vs. Legisimilitude
Following the difference between accidental and lawlike generalizations, a distinction between verisimilitude and legisimilitude has been proposed by L. J. Cohen. In the logical problem of legisimilitude, the target is not just the strongest true statement about the world (in a given language), but a genuine law of nature. A solution of this problem for universal or deterministic laws can be based on S. Uchii’s notion of nomic constituent.Footnote 7 Let L(□) be a modal monadic language with the operators of nomic necessity □ and nomic possibility ◊, satisfying the system S5. Then a nomic constituent tells which Q-predicates are possible and which are impossible:
\((+/-)\Diamond (\exists {\text{x}}){\text{Q}}_{1} ({\text{x}})\;\& \; \ldots \;\& \;(+/-)\Diamond (\exists {\text{x}}){\text{Q}}_{{\text{K}}} ({\text{x}})\quad (5)\)
The number of nomic constituents in L(□) is again \({\text{q}} = 2^{{\text{K}}} - 1\). As actuality implies possibility, and impossibility implies non-actuality, nomic constituent (5) is partly weaker and partly stronger than constituent (1). Laws of nature are disjunctions of nomic constituents. For example, the law \(\square \left( {\text{x}} \right)\left( {{\text{Fx}} \to {\text{Gx}}} \right)\) is equivalent to the disjunction of all nomic constituents which state that the cell F & ~G is physically impossible. The distance \(\Delta \left( {{\text{B}}_{1} ,{\text{B}}_{2} } \right)\) between nomic constituents B1 and B2 can be defined by the Clifford-measure \(|{\text{CT}}_{1} \Delta {\text{CT}}_{2} |/{\text{K}}\) or its variants. The degree of legisimilitude of a law of nature H depends on its distance to the true nomic constituent B*:
Alternatively, if the cognitive aim is to combine verisimilitude and legisimilitude, the target could be the conjunction B* & C*.Footnote 8 Estimation of legisimilitude can again employ expected values based on inductive probabilities.Footnote 9
Nomic constituents (5) represent laws of coexistence, i.e., lawlike connections between attributes or properties. To define laws of succession, introduce a discrete temporal index t to Q-predicates:
Here T lists all possible transitions between successive states. For deterministic laws, for each i there is only one j such that \(\langle {\text{i}},{\text{j}}\rangle \in {\text{T}}\). Again the Clifford-measure can be applied to measure the distance between laws of succession: \(|{\text{T}}_{1} \Delta {\text{T}}_{2} |/{\text{K}}^{2}\).Footnote 10
The class of Q-predicates of a monadic language can be generalized to a quantitative state space \({\mathbf{Q}} \subseteq {\mathbf{R}}^{k}\) generated by real-valued quantities \({\text{h}}_{1} , \ldots ,{\text{h}}_{{\text{k}}}\).Footnote 11 In the simplest case, Q is the real line or its part with the geometrical distance between points \({\uprho }\left( {{\text{x}},{\text{x}^\prime}} \right) = \left| {{\text{x}}{-}{\text{x}^\prime}} \right|\). More generally, Q is a k-dimensional metric space with the Euclidean metric. Here laws of coexistence specify regions of nomically possible states
The Clifford-distance between two such laws F1 and F2 is defined by
An alternative approach to quantitative laws expresses how a function hk necessarily depends on \({\text{h}}_{1} , \ldots ,{\text{h}}_{{{\text{k}} - 1}} :\)
The distances between two such real-valued functions can be defined by the Minkowski or Lp-metrics for functions:
Here p = 1 is the Manhattan metric, p = 2 the Euclidean metric, and p = ∞ the Tchebycheff metric sup \({\mid }{\text{f}}\left( {\text{x}} \right){-}{\text{g}}\left( {\text{x}} \right){\mid }\).Footnote 12 The degree of legisimilitude of the law f then depends on its distance to the true law f*:
Further, f is closer to the truth than g if and only if \({\text{leg}}\left( {{\text{f}},{\text{f}}^*} \right) > {\text{leg}}\left( {{\text{g}},{\text{f}}^*} \right)\).
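For functions of a single argument, the Minkowski distances can be approximated numerically. The sketch below assumes the standard form of the Lp-metric, the p-th root of the integral of |f(x) − g(x)|^p, evaluated by the midpoint rule on an interval [a, b]; the function name `lp_distance`, the interval, and the two example laws are hypothetical.

```python
import math


def lp_distance(f, g, a, b, p, n=10_000):
    """Approximate the L_p distance between f and g on [a, b] (midpoint rule)."""
    width = (b - a) / n
    gaps = [abs(f(a + (i + 0.5) * width) - g(a + (i + 0.5) * width))
            for i in range(n)]
    if math.isinf(p):
        # Tchebycheff metric: largest gap found on the grid
        # (a lower bound for the true supremum).
        return max(gaps)
    return (sum(gap ** p for gap in gaps) * width) ** (1.0 / p)


def true_law(x):
    return x


def rival_law(x):
    return x * x


d_manhattan = lp_distance(rival_law, true_law, 0.0, 1.0, p=1)
d_euclidean = lp_distance(rival_law, true_law, 0.0, 1.0, p=2)
d_tchebycheff = lp_distance(rival_law, true_law, 0.0, 1.0, p=math.inf)
```

For these two laws the exact values are 1/6, the square root of 1/30, and 1/4, so the three metrics agree on the ordering of rivals only when one rival is uniformly closer to the target than the other.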
Quantitative laws of succession can be formulated by relativizing the state x with a time: h(x,t) = state of x at time t. Then deterministic dynamical laws tell how the state depends on time t and some initial state at time \({\text{t}}_{0}\):
A law of succession specifies nomically possible trajectories \({\text{F}}:{\mathbf{R}} \times {\mathbf{Q}} \to {\mathbf{Q}}\). The distance between such laws can be defined by taking for each \({\text{Q}} \in {\mathbf{Q}}\) the Minkowski distance between the trajectories F1(t,Q) and F2(t,Q), for \({\text{t}} \in {\mathbf{R}}\), and then summing over all possible initial states \({\text{Q}} \in {\mathbf{Q}}\).Footnote 13
3 Probabilistic laws
The notion of a universal or deterministic law introduced in Sect. 2 can be generalized to probabilistic laws, if an objective physical probability measure is available.Footnote 14 Following Leibniz, such physical probabilities express “degrees of possibility”. This can be understood in terms of single-case propensities: P(G/F) = r means that a physical set-up has a numerical disposition of strength r to produce an outcome of type G in each trial of type F. Thus, probability statements involve a dispositional modal operator, so that they differ from extensional statistical statements about actual relative frequencies of attributes in reference classes (i.e., structure descriptions). Universal laws of coexistence and deterministic laws of succession are limiting special cases of probabilistic laws (with propensities 0 and 1).Footnote 15 Genuine probabilistic laws presuppose that the world is indeterministic, but in statistical modelling one may assign in some sense objective probabilities to random phenomena (e.g., coin tossing, roulette) even when the underlying reality is deterministic. For the task of defining approximation to such probabilities, the philosophical issue of indeterminism and determinism can be left open.
To define probabilistic constituents, replace in a nomic constituent (5) the operator of physical possibility ◊ with a probability measure P over the discrete state space Q of Q-predicates, now understood as the “sample space” or the class of outcomes of a trial x:
Probabilities (8) over Q define a multinomial context. Then \({\text{p}}_{{\text{i}}} = {\text{P}}\left( {{\text{Q}}_{{\text{i}}} \left( {\text{x}} \right)} \right) > 0\) if and only if Qi is physically possible, for all \({\text{i}} = 1, \ldots ,{\text{ K}}\), so that here P is applied to the open formula Qi(x) instead of the existential statement in (5). Now a probabilistic constituent (8) is compatible with the nomic constituent Bi with CTi if and only if it assigns a positive probability to the Q-predicates in CTi and zero probability to other Q-predicates. This means that typically a nomic constituent is an infinite disjunction of probabilistic constituents.
In many statistical applications, the trial x counts the number of successes in a repeated experiment (e.g., binomial and Poisson distributions), so that the state space Q is a subclass of the set N of natural numbers. Then the distance \({\uprho }\left( {{\text{x}},{\text{x}^{\prime}}} \right)\) between points in Q is their normalized arithmetical difference.
An important distinction can be made between pure and mixed probabilistic laws. A probabilistic constituent, where CTi is a proper subset of Q, is a mixed law in the sense that it entails a universal law (cells in Q–CTi are necessarily empty). Pure probabilistic laws have no such entailments: the world is atomistic in the sense that all Q-predicates in Q are nomically possible (so that no universal laws hold), and a positive probability is assigned to all Q-predicates.
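The distinction between pure and mixed laws depends only on the support of the probability distribution. A minimal sketch, in which a probabilistic constituent is a list of probabilities over cells 0, …, K − 1 and the helper names are hypothetical:

```python
def support(p):
    """Cells that receive positive probability (the set CT of the law)."""
    return {i for i, p_i in enumerate(p) if p_i > 0}


def is_pure(p):
    """A pure law assigns positive probability to every cell."""
    return len(support(p)) == len(p)


def compatible(p, ct):
    """True if p is one of the disjuncts of the nomic constituent with cells ct."""
    return support(p) == set(ct)


mixed_law = [0.6, 0.4, 0.0, 0.0]  # entails that cells 2 and 3 are impossible
pure_law = [0.4, 0.3, 0.2, 0.1]   # entails no universal law
```

Any other distribution with the same support as `mixed_law` is compatible with the same nomic constituent, which is why a nomic constituent typically corresponds to infinitely many probabilistic constituents.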
To define probabilistic laws of succession, for a discrete space Q the set T of possible transitions between states is replaced by a matrix of transition probabilities
where \({\text{p}}_{{1/{\text{i}}}} + \ldots + {\text{p}}_{{{\text{K}}/{\text{i}}}} = {\text{ }}1\) for each i. This definition involves the Markov condition, i.e., the next state depends only on the present state. If transition probabilities are 0 or 1, this law reduces to the deterministic law (6).Footnote 16 Equation (9) determines the n-step transition probabilities
and for an irreducible stationary Markov chain the limits of pj/i(n), for n → ∞, give a long-run probability distribution. These notions can be generalized to Markov processes with a continuous time.Footnote 17 Special cases of probabilistic laws of succession can be formulated by quantitative dynamic laws like the law of radioactive decay
where Q(x,t) states that atom x decays within the time-interval [0,t] and λ is a constant. Finally, for a quantitative state space Q a probability measure on Q∞ (i.e., infinite sequences of successive states) assigns a physical probability to possible trajectories of a time-continuous stochastic process.
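For a discrete state space, the n-step transition probabilities are the entries of the n-th power of the transition matrix. The sketch below uses a hypothetical two-state chain, with row i holding the probabilities p_{j/i}; the matrix entries and function names are assumptions chosen for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def n_step(p, n):
    """n-step transition probabilities: the n-th power of the one-step matrix."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


# Hypothetical irreducible chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

P2 = n_step(P, 2)    # two-step probabilities
P50 = n_step(P, 50)  # rows are numerically indistinguishable from the long-run distribution
```

For this chain the long-run distribution is (5/6, 1/6), and both rows of `P50` agree with it regardless of the initial state.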
4 Distance between probabilities
The problem of legisimilitude for probabilistic laws has not yet received much attention. The main focus in the literature has been on cases where the target is a universal or deterministic law, either qualitative or quantitative. The only detailed proposals have been given by Rosenkrantz (1980) and Niiniluoto (1987, pp. 403–405), who apply the Kullback–Leibler notion of divergence as a measure of distance from a probabilistic truth.
Mathematicians have suggested a great number of measures for distances between probability distributions. In a comprehensive survey, Cha (2007) lists 45 different measures,Footnote 18 which have been used for various purposes. For example, the central limit theorem (the sum of n independent random variables approximates in the limit the normal distribution) and laws of large numbers (observed relative frequencies and predictive probabilities approach almost surely objective probabilities in a multinomial Bernoulli process) express distances between epistemic probabilities q and objective probabilities p by their geometrical distance |q–p|.Footnote 19 This amounts to the Manhattan metric
For discrete probabilities, the squared Euclidean or quadratic metric
or its variant \(\chi^{2}\), is a standard way of measuring the fit between two distributions or structure descriptions.Footnote 20 In the special case of scoring, where qi are probabilistic estimates of the truth values pi of n rival exclusive hypotheses (pj = 1 for the true hypothesis, otherwise 0), Glenn Brier’s 1950 measure of inaccuracy is quadratic, i.e., \({\text{d}}\left( {1,{\text{q}}} \right) = \left( {1 - {\text{q}}} \right)^{2}\) and \({\text{d}}\left( {0,{\text{q}}} \right) = {\text{q}}^{2}\),Footnote 21 while I. J. Good in 1952 favored the logarithmic measure \({\text{d}}\left( {1,{\text{q}}} \right) = - \ln {\text{q}}\), \({\text{d}}\left( {0,{\text{q}}} \right) = - \ln \left( {1 - {\text{q}}} \right)\).Footnote 22 From these local measures the total scoring measure is obtained by summing the inaccuracies of all qi.
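The two scoring rules can be compared on a small example. The sketch sums the local inaccuracies over all hypotheses, as described above; the function names and the estimate vector are hypothetical.

```python
import math


def brier_inaccuracy(q, true_index):
    """Sum of (1 - q_i)^2 for the true hypothesis and q_i^2 for the false ones."""
    return sum((1 - q_i) ** 2 if i == true_index else q_i ** 2
               for i, q_i in enumerate(q))


def log_inaccuracy(q, true_index):
    """Sum of -ln q_i for the true hypothesis and -ln(1 - q_i) for the false ones.

    Requires q_i > 0 for the true hypothesis and q_i < 1 for the false ones.
    """
    return sum(-math.log(q_i) if i == true_index else -math.log(1 - q_i)
               for i, q_i in enumerate(q))


estimates = [0.7, 0.2, 0.1]  # hypothetical probabilities of three rival hypotheses
brier = brier_inaccuracy(estimates, true_index=0)
logarithmic = log_inaccuracy(estimates, true_index=0)
```

The logarithmic measure grows without bound as the probability assigned to the true hypothesis approaches zero, whereas the quadratic measure stays bounded.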
Some measures are based on the inner products piqi. Hellinger’s 1909 proposal was modified in 1946 in Bhattacharyya’s dissimilarity coefficient:
The directed divergence of a discrete random variable p from another q was defined by Solomon Kullback and Richard Leibler in 1951 as the expected logarithmic difference between p and q with respect to p:
\({\text{div}}\left( {{\text{p}},{\text{q}}} \right) = \Sigma {\text{p}}_{{\text{i}}} \log \left( {{\text{p}}_{{\text{i}}} /{\text{q}}_{{\text{i}}} } \right)\quad (11)\)
Here log can be taken to have the binary base 2, so that log 2 = 1. Formula (11) is also called the relative entropy of p with respect to q. This measure is only a semimetric: non-negative, \({\text{div}}\left( {{\text{p}},{\text{p}}} \right) = 0,{\text{ div}}\left( {{\text{p}},{\text{q}}} \right) = 0\) if and only if p = q, but non-symmetric (usually \({\text{div}}\left( {{\text{p}},{\text{q}}} \right) \ne {\text{div}}\left( {{\text{q}},{\text{p}}} \right))\), and the triangle inequality is not satisfied. For continuous probability densities f and g on R, Eq. (11) is replaced by
\({\text{div}}\left( {{\text{f}},{\text{g}}} \right) = \int {{\text{f}}\left( {\text{x}} \right)\log \left( {{\text{f}}\left( {\text{x}} \right)/{\text{g}}\left( {\text{x}} \right)} \right){\text{dx}}}\)
The symmetric divergence between p and q is defined by
Other variants include the λ-divergence
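For λ = ½, it gives the Jensen-Shannon divergence
\({\text{div}}_{{{\text{JS}}}} \left( {{\text{p}},{\text{q}}} \right) = {\text{H}}\left( {\tfrac{1}{2}\left( {{\text{p}} + {\text{q}}} \right)} \right) - \tfrac{1}{2}{\text{H}}\left( {\text{p}} \right) - \tfrac{1}{2}{\text{H}}\left( {\text{q}} \right)\quad (12)\)
where \({\text{H}}\left( {\text{p}} \right) = - \Sigma {\text{p}}_{{\text{i}}} \log {\text{p}}_{{\text{i}}}\) is the Shannon entropy of p. Formula (12) is non-negative and symmetric, and its square root is a metric. Rényi divergence is defined by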
For α = 1, it gives in the limit the Kullback–Leibler divergence, and for α = ½ twice the Bhattacharyya distance.
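The relations among these measures can be checked numerically. The sketch assumes the standard textbook forms: the Kullback–Leibler divergence as the sum of p_i log(p_i/q_i), the Bhattacharyya distance as −log Σ√(p_i q_i), and the Rényi divergence as log Σ p_i^α q_i^(1−α) divided by α − 1. The function names and the two distributions are hypothetical, and both distributions are assumed to be strictly positive.

```python
import math


def kl_divergence(p, q, base=2.0):
    """Directed divergence of p from q; terms with p_i = 0 are dropped."""
    return sum(p_i * math.log(p_i / q_i, base)
               for p_i, q_i in zip(p, q) if p_i > 0)


def bhattacharyya_distance(p, q, base=2.0):
    """Minus the logarithm of the sum of sqrt(p_i * q_i)."""
    return -math.log(sum(math.sqrt(p_i * q_i) for p_i, q_i in zip(p, q)), base)


def renyi_divergence(p, q, alpha, base=2.0):
    """Renyi divergence of order alpha (alpha != 1), for positive p and q."""
    total = sum(p_i ** alpha * q_i ** (1 - alpha) for p_i, q_i in zip(p, q))
    return math.log(total, base) / (alpha - 1)


p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = kl_divergence(p, q)    # divergence of p from q
backward = kl_divergence(q, p)   # differs from `forward`: no symmetry
near_one = renyi_divergence(p, q, 1 + 1e-6)  # approximates `forward`
half = renyi_divergence(p, q, 0.5)           # twice the Bhattacharyya distance
```

Summing `forward` and `backward` gives one common symmetrized variant; whether the symmetric divergence above includes a normalizing factor is not assumed here.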
Divergence was originally intended as a tool in information theory.Footnote 23 In Bayesian statistics it has been used to measure the difference between prior and posterior distributions. It can be also used for assessing the similarity of a probabilistic model with some aspect of reality.Footnote 24 The first connection to the studies in truthlikeness was developed by Roger Rosenkrantz (1980). Inspired by I. J. Good’s notion of the weight of evidence, Rosenkrantz suggested that for a random experiment x and truth h*, hypothesis h is more truthlike than hypothesis h′ if
This idea was connected to the similarity approach to truthlikeness by Niiniluoto (1987): the distance of a probabilistic hypothesis h from the probabilistic target h* is measured by div(h*,h). Thus, hypothesis h is more truthlike than h′ if and only if \({\text{div}}\left( {{\text{h}}^*,{\text{h}}} \right) < {\text{div}}\left( {{\text{h}}^*,{\text{h}^\prime}} \right)\). When the relevant hypotheses h, h′, and h* are specified by probabilities pi, qi, and pi*, this comparative condition holds if and only if \(\Sigma {\text{p}}_{{\text{i}}}^*{\text{log}}\left( {{\text{q}}_{{\text{i}}} /{\text{p}}_{{\text{i}}} } \right) < 0\). Generalization to probability density functions is immediate.
More generally, Niiniluoto (1987) recommends divergence div as a solution to the problem of probabilistic legisimilitude:
- for probabilistic laws of coexistence (8) in qualitative conceptual spaces, the distance to the true probabilistic constituent;
- for probabilistic laws of succession (9) in qualitative languages, the distance to the matrix of true probability transitions;
- for probabilistic laws of succession in the quantitative space Q∞, the distance to the true probability on Q∞.
Alternative solutions could replace div by some other distance measure, e.g., Manhattan, Euclidean, or Bhattacharyya.
The Kullback–Leibler divergence div(p,q) has a limitation which is not noted in Niiniluoto (1987). Its definition (11) presupposes that p is absolutely continuous with respect to q, i.e., if qi = 0, then pi = 0. Further, when pi = 0, the term 0 log 0 in the sum is taken to vanish. The same condition is required for probability densities: if g(x) = 0, then f(x) = 0. This means that the KL-divergence can be applied only to pure probabilistic laws, since for mixed probabilistic constituents mistakes in the empty cells (or zero points in Q) in the target and hypothesis would not count at all. The same problem is faced by the Bhattacharyya distance, whose factors vanish as soon as pi or qi is 0, but not by the Minkowski metrics.
This problem with divergence is observed by Rosenkrantz (1980), who suggests that in the evaluation of \({\text{div}}\left( {{\text{P}}\left( {{\text{x}},{\text{h}}} \right),{\text{P}}\left( {{\text{x}},{\text{h}}^*} \right)} \right)\) zero probabilities are replaced by a slightly positive possibility of misclassification, but this ad hoc move is unsatisfactory. As a better solution one can recommend the use of the Kullback–Leibler directed divergence div for pure probabilistic laws, and the Jensen-Shannon divergence divJS (instead of div) to measure the distance between mixed probability distributions over cells Q or in the transition matrix. The JS-divergence shares the good properties of the KL-divergence, but it has a finite value in all cases even when some of the probabilities are zero.Footnote 25
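The contrast between the two divergences on mixed laws can be illustrated numerically. In this sketch the Jensen-Shannon divergence is computed in its entropy form with binary logarithms, and the Kullback–Leibler divergence is reported as infinite when absolute continuity fails (a common convention, assumed here). The target and hypothesis distributions are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)


def js_divergence(p, q):
    """Entropy of the midpoint distribution minus the mean entropy of p and q."""
    midpoint = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return entropy(midpoint) - 0.5 * entropy(p) - 0.5 * entropy(q)


def kl_divergence(p, q):
    """Directed divergence in bits; infinite if q_i = 0 while p_i > 0."""
    if any(p_i > 0 and q_i == 0 for p_i, q_i in zip(p, q)):
        return math.inf
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


target = [0.5, 0.25, 0.25, 0.0]   # hypothetical true law with one impossible cell
hypothesis = [0.5, 0.5, 0.0, 0.0]  # wrongly excludes the third cell

kl_from_target = kl_divergence(target, hypothesis)      # not finite
kl_from_hypothesis = kl_divergence(hypothesis, target)  # finite in this direction
js = js_divergence(target, hypothesis)                  # finite and symmetric

# Two laws with disjoint supports are maximally distant: 1 bit.
js_disjoint = js_divergence([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])
```

With binary logarithms the Jensen-Shannon divergence always lies between 0 and 1, so it can be normalized into a degree of closeness without special treatment of empty cells.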
The following examples illustrate various possibilities in analyzing approach to probabilistic laws.
Example 1.
Let L be a monadic language with two primitive predicates Fx = x is a swan and Gx = x is white. Then there are four Q-predicates in L:
Then the true constituent C* in L states that all Q-predicates are instantiated. Let H be the false universal generalization “All swans are white”. H states that Q2 is empty and leaves other cells as question marks. Applying the min-sum definition with weights γ and γ′ for the min and sum factors, respectively, the degree of truthlikeness of H is Footnote 26
Choosing γ = 2/3 and γ′ = 1/3, this is equal to 5/8. The degree of truthlikeness of the false constituent C1 with CT1 = {Q1, Q3, Q4} is Footnote 27
The same numerical results hold for the nomic versions of C*, H, and C1. In the probabilistic framework, H corresponds to the law P(Gx/Fx) = 1, but now the target is the true probability distribution P* over the cells Q1,…,Q4, and H is a disjunction of probabilistic constituents with probability 0 for Q2. As the number of black swans is small in comparison to white swans, the true probabilistic law is something like P(Gx/Fx) = 0.95. It follows, for any reasonable distance measure, that H has a relatively high degree of truthlikeness, and in any case higher than that of the law \({\text{P}}\left( {{\text{Gx}}/{\text{Fx}}} \right) = 0.5\). But if the cognitive interest of the investigator is to know both the nomic and actual features of birds, so that the target is the conjunction P* & C*, then H’s overall truthlikeness is reduced, since it mistakenly excludes the cell Q2, while laws of the form \({\text{P}}\left( {{\text{Gx}}/{\text{Fx}}} \right) = {\text{r}} < 1\) allow for the actual existence of black swans.
Example 2.
Already Example 1 illustrates the fact that the comparison of ordinary, nomic and probabilistic constituents is a complicated matter, as they involve different targets. For example, a probabilistic constituent P is equivalent to a single nomic constituent B only in the special case where just one cell Qi is physically possible – and, hence, has probability 1. In other cases, the true nomic constituent B* is an infinite disjunction of probabilistic constituents, and the target-sensitivity does not allow a direct comparison of the degrees of truthlikeness of these different types of hypotheses. In particular, the atomistic nomic constituent, which states the possibility of all Q-predicates, is the disjunction of all pure probabilistic laws. This means that there is no connection between divergence and Clifford-distance for pure probabilistic laws. To see this, assume that P1 and P2 are two different laws, and B1 and B2 are the nomic constituents entailed by P1 and P2. If P1 and P2 are pure laws, then CT1 = CT2 = Q and CT1ΔCT2 = ∅, so that div(P1,P2) > 0 but the Clifford-distance ΔC(B1,B2) = 0.Footnote 28 But some simple comparisons can be made for the special case of uniform mixed laws. Thus, suppose C1 and C2 are monadic nomic constituents in a language with K Q-predicates with \(|{\text{CT}}_{1} - {\text{CT}}_{2} | = {\text{A}}\), \(|{\text{CT}}_{2} - {\text{CT}}_{1} | = {\text{B}}\), and \(|{\text{CT}}_{1} \cap {\text{CT}}_{2} | = {\text{D}}\), so that the Clifford distance ΔC between C1 and C2 is \(\left( {{\text{A}} + {\text{B}}} \right)/{\text{K}}\) (see (3)). Let P1 and P2 be probabilistic constituents which allocate probability uniformly to CT1 and CT2 (1/c and 1/c′, respectively), where \({\text{c}^\prime} \ge {\text{c}}\). Now \({\text{D}} = {\text{c}}{-}{\text{A}} = {\text{c}^\prime} - {\text{B}}\). Then the Manhattan distance satisfies
If \({\text{c}} = {\text{c}^\prime}\), this value equals \({\text{K}}\Delta _{{\text{C}}} \left( {{\text{C}}_{1} ,{\text{C}}_{2} } \right)/{\text{c}}\). For the Euclidean distance with \({\text{c}} = {\text{c}^\prime}\) we have
A similar connection to the Clifford measure holds for the Jensen-Shannon divergence:
However, such connections fail for non-uniform laws. For example, if nomic constituents B1 and B2 are otherwise almost equal, but B1 makes correct possibility claims about cells Qi with high true probability \({p}_{i}^{*}\) while B2 makes such claims about cells Qj with low probability \({p}_{j}^{*}\), then it may happen that truthlikeness ordering is reversed when the target changes from B* to P*: \(\Delta _{{\text{C}}} \left( {{\text{B}}_{1} ,{\text{B}}^*} \right) > \Delta _{{\text{C}}} \left( {{\text{B}}_{2} ,{\text{B}}^*} \right)\), but \({\text{d}}\left( {{\text{B}}_{1} ,{\text{P}}^*} \right) < {\text{d}}\left( {{\text{B}}_{2} ,{\text{P}}^*} \right).\)Footnote 29
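The relation between uniform mixed laws and the Clifford measure can be verified with exact arithmetic. In the sketch below the equal-size identity (Manhattan distance equal to KΔC/c) is the one stated above; the squared Euclidean value (A + B)/c² and the general Manhattan value A/c + B/c′ + D(1/c − 1/c′) are computed directly from the definitions as a check and are not quoted from the text. All names and cell sets are hypothetical.

```python
from fractions import Fraction


def uniform_law(ct, K):
    """Uniform probability 1/|ct| on the cells in ct, zero elsewhere."""
    return [Fraction(1, len(ct)) if i in ct else Fraction(0) for i in range(K)]


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def squared_euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def clifford(ct_1, ct_2, K):
    return Fraction(len(ct_1 ^ ct_2), K)


K = 8

# Equal sizes: c = c' = 4, A = B = D = 2.
ct_1, ct_2 = {0, 1, 2, 3}, {2, 3, 4, 5}
c = len(ct_1)
p_1, p_2 = uniform_law(ct_1, K), uniform_law(ct_2, K)
equal_manhattan = manhattan(p_1, p_2)
equal_sq_euclidean = squared_euclidean(p_1, p_2)

# Unequal sizes: c = 2, c' = 3, A = 1, B = 2, D = 1.
ct_3, ct_4 = {0, 1}, {1, 2, 3}
A, B, D = len(ct_3 - ct_4), len(ct_4 - ct_3), len(ct_3 & ct_4)
c_3, c_4 = len(ct_3), len(ct_4)
unequal_manhattan = manhattan(uniform_law(ct_3, K), uniform_law(ct_4, K))
derived = Fraction(A, c_3) + Fraction(B, c_4) + D * (Fraction(1, c_3) - Fraction(1, c_4))
```

Using `Fraction` avoids rounding, so the identities can be tested with exact equality.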
Example 3
If p and q are disjoint mixed probabilistic laws (i.e., \({\text{CT}}_{{\text{p}}} \cap {\text{CT}}_{{\text{q}}} = \varnothing\)), then
Example 4
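The Poisson distribution \({\text{p}}\left( {\text{i}} \right),{\text{ i}} = 0,{\text{ }}1,{\text{ }}2, \ldots ,\) for the number of occurrences of a randomly occurring rare event with a constant mean λ is defined by
\({\text{p}}\left( {\text{i}} \right) = \lambda ^{{\text{i}}} {\text{e}}^{ - \lambda } /{\text{i}}!\)
The KL-divergence between two Poisson distributions with rates λ and λ′ (where λ′ > λ) is
The proof uses the Taylor series \({\text{e}}^{\lambda } = \Sigma \lambda ^{{\text{i}}} /{\text{i}}!\), where the sum goes over i = 0, 1, 2, …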
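The divergence between two Poisson laws can be checked numerically against a closed form. The sketch assumes the standard result that, with natural logarithms, the divergence of the law with rate λ from the law with rate λ′ is λ′ − λ + λ ln(λ/λ′); the direction of the divergence, the rates, and the truncation at 100 terms are assumptions made for illustration.

```python
import math


def poisson_pmf(rate, n_terms):
    """Probabilities p(0), ..., p(n_terms - 1), built up term by term."""
    probs = [math.exp(-rate)]
    for i in range(1, n_terms):
        probs.append(probs[-1] * rate / i)
    return probs


def kl_nats(p, q):
    """Directed divergence of p from q with natural logarithms."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


rate, rate_prime = 2.0, 3.0
numeric = kl_nats(poisson_pmf(rate, 100), poisson_pmf(rate_prime, 100))
closed_form = rate_prime - rate + rate * math.log(rate / rate_prime)
```

The truncated sum and the closed form agree to many decimal places because the neglected tail of the distribution is negligibly small for these rates.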
Example 5
The Manhattan difference between two exponential laws (10) with decay rates λ and λ′ (where λ′ > λ) is
Example 6
Let p and q be deterministic laws of succession such that \({\text{p}}_{{2/1}} = 1,{\text{ p}}_{{1/1}} = 0\;{\text{ and}}\;{\text{ q}}_{{2/1}} = 0,{\text{ q}}_{{1/1}} = 1\). Then
For indeterministic r with \({\text{r}}_{{2/1}} = {\text{r}}_{{1/1}} = \raise.5ex\hbox{$\scriptstyle 1$}\kern-.1em/ \kern-.15em\lower.25ex\hbox{$\scriptstyle 2$} ,\)
These examples illustrate that several alternative distance measures give fairly similar comparative results. For uniform nomic constituents their results are related to the Clifford-measure between ordinary constituents, but this relation is not straightforward for non-uniform constituents and disappears for pure probabilistic laws. When it comes to measuring the distance between particular probability values, geometrical and quadratic differences seem simple and useful, but for the distance between whole probability distributions or densities divergence is a convenient choice. The applicability of the Kullback–Leibler divergence is restricted to pure probabilistic laws, so that the Jensen-Shannon divergence turns out to be a valuable complement which can be applied to mixed laws which assign zero probabilities to some Q-predicates or sample points.
5 Vertical versus horizontal distance measures
An important debate about the explication of truthlikeness for monadic languages concerned the question whether the distance between constituents should reflect distances between Q-predicates. The Clifford-measure \(\Delta _{{\text{C}}} \left( {{\text{C}}_{1} ,{\text{C}}^*} \right)\) counts all errors of C1 about the Q-predicates equally: mistaken existence claims in \({\text{CT}}_{1} {-}{\text{CT}}^*\) and mistaken non-existence claims in \({\text{CT}}^* - \;{\text{CT}}_{1}\) have the same weight 1/K in (3). It is natural to consider also situations where errors in a false constituent are weighted according to their cognitive seriousness, so that the distance from the truth is not simply the cardinality of the symmetric difference. Niiniluoto proposed in 1976 two modifications of the Clifford-measure on the basis of the ρ-measure between Q-predicates.Footnote 30 In the Jyväskylä measure dJ false existence claims are weighted by their distance to the nearest non-empty cell, and false non-existence claims by their distance to the nearest empty cell, while in the weighted symmetric difference dw the first condition holds, but false non-existence claims are weighted by the minimum distance to a really non-empty cell. Then ΔC and dw (unlike dJ) are symmetric, and ΔC and dJ (unlike dw) are specular, where a specular distance (in the sense of Festa, 1993) satisfies the condition that the maximally distant constituent from Ci is its photographic negative (i.e., all positive claims are replaced by negative ones and vice versa).Footnote 31 If the ρ-measure reflects resemblances between predicates in a family (e.g., colors), then for the Jyväskylä measure the generalization “All ravens are grey” is closer to the truth than “All ravens are white”.Footnote 32
Tichý’s (1976) general definition of truthlikeness implies for the monadic case a distance measure between constituents which differs from the Clifford-measure ΔC and its modifications dJ and dw.Footnote 33 A linkage η between sets CTi and CTj is a surjective mapping from the larger of the sets to the smaller one. The cardinality card(η) of η is then \({\text{max}}\left\{ {\left| {{\text{CT}}_{{\text{i}}} } \right|,\;|{\text{CT}}_{{\text{j}}} |} \right\}\), and the breadth of η is the average distance between the linked predicates:
The distance \({\text{d}}_{{\text{T}}} \left( {{\text{C}}_{{\text{i}}} ,{\text{C}}_{{\text{j}}} } \right)\) between constituents Ci and Cj is then defined as the breadth of the narrowest linkage between CTi and CTj.
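The definition of the narrowest linkage can be illustrated by a brute-force sketch. It assumes that Q-predicates are coded as tuples of 0s and 1s, that ρ is the normalized Hamming distance between them, and that both sets are non-empty; the names `rho` and `tichy_distance` are hypothetical. The search enumerates all mappings from the larger set to the smaller one, so it is feasible only for very small examples.

```python
from itertools import product


def rho(u, v):
    """Normalized Hamming distance between two Q-predicates given as 0/1 tuples."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def tichy_distance(ct_i, ct_j):
    """Breadth of the narrowest linkage between two non-empty sets of Q-predicates."""
    larger, smaller = sorted((list(ct_i), list(ct_j)), key=len, reverse=True)
    best = None
    # Enumerate every mapping from the larger set to the smaller one.
    for images in product(smaller, repeat=len(larger)):
        if set(images) != set(smaller):
            continue  # not surjective, hence not a linkage
        # Breadth = average distance between linked predicates.
        breadth = sum(rho(u, v) for u, v in zip(larger, images)) / len(larger)
        if best is None or breadth < best:
            best = breadth
    return best


# Hypothetical example with two primitive predicates (four Q-predicates).
ct_true = {(1, 1)}
print(tichy_distance({(1, 0)}, ct_true))          # 0.5
print(tichy_distance({(1, 1), (1, 0)}, ct_true))  # 0.25
print(tichy_distance({(0, 0)}, ct_true))          # 1.0
```

In the second call the only linkage sends both predicates of the larger set to (1, 1), so the average of the distances 0 and 1/2 is 1/4.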
Niiniluoto (1987) rejects Tichý’s proposal for several reasons. The use of the average in (13) leads to unintuitive examples, and constituents should not be treated as if they consisted only of existence claims. Indeed, dT is not specular and does not reflect the cognitive goal of finding true universal generalizations. The most fundamental objection is that \({\text{d}}_{{\text{T}}} \left( {{\text{C}}_{{\text{i}}} ,{\text{C}}^*} \right)\) can be derived as the minimum distance between two state descriptions s and s′, where s entails the uniformly distributed infinite structure description entailing Ci and s′ entails C*. Thus, Tichý is not defining the distance between Ci and C* in terms of the counted or weighted differences in claims about the Q-predicates (and thereby the ability of Ci to express true generalizations), but rather in terms of putting an infinite number of individuals in their right places in a classification system.Footnote 34 The latter problem should be solved by choosing the true state description as the target and by replacing constituents (in a non-ad hoc way) with disjunctions of state descriptions.
In spite of this criticism, Tichý’s basic idea is interesting, since the notion of a linkage resembles metrics defined for trees in terms of the number of transformations needed to change one tree to another.Footnote 35 A linkage takes seriously (but perhaps in a wrong way) the demand that “horizontal” distances between Q-predicates are relevant. The goal of distributing an infinite number of individuals to their right places could be viewed as analogous to the task of distributing a probability mass (of measure 1) to its right place. Indeed, a discrete probabilistic constituent (8) allocates the probabilities to a finite number of points in the space Q of Q-predicates, and a continuous probability density f on a state space \({\mathbf{Q}} \subset {\text{R}}^{{\text{n}}}\) does the corresponding assignment to an infinite number of points. This can be illustrated by the simple case where Q is a subset of the real line R and f: \({\mathbf{Q}} \to {\text{R}}^{ + }\). If we denote by Df the region between the curve f(x) ≥ 0 and the real axis, i.e.,
\({\text{D}}_{{\text{f}}} = \{ \langle {\text{x}},{\text{y}}\rangle \mid {\text{x}} \in {\mathbf{Q}},\;0 \le {\text{y}} \le {\text{f}}({\text{x}})\}\)
then the density f gives the probability measure 1 to Df. For two probability densities f and g, the symmetric difference \({\text{D}}_{{\text{f}}} \Delta {\text{D}}_{{\text{g}}}\) covers the region between the functions f(x) and g(x) (see Fig. 2). The Manhattan distance is simply the area of this region:
\(\Delta _{1} ({\text{f}},{\text{g}}) = \int_{{\mathbf{Q}}} {|{\text{f}}({\text{x}}) - {\text{g}}({\text{x}})|\,{\text{dx}}}\)
which is a direct analogue of the Clifford-distance (3) between constituents.
The Manhattan measure and its Euclidean and divergence alternatives are in an obvious sense vertical, since they measure the distance between probability distributions by the absolute, quadratic, or logarithmic differences between the values of f(x) and g(x), without considering the horizontal distances between points in the sample space Q. This verticality is seen dramatically in Example 3, where the distances between the values of two disjoint discrete probabilistic constituents p and q are maximal, and the distances Δ1(p,q) and divJS(p,q) have their maximal values quite independently of the location of p and q with respect to the space Q. A counterpart of this result for probability densities is the following observation: if f1 and f2 are geometrical distributions with the same shape but disjoint domains, then Δ1(f1,f2), Δ2(f1,f2), and divJS(f1,f2) have their maximal values quite independently of the geometrical distance a between these densities (see Fig. 3). In fact, all distance measures surveyed by Cha (2007), which are applicable to mixed probabilistic laws, share this feature of verticality.
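The verticality of these measures can be checked numerically. The sketch below assumes the unnormalized Manhattan distance and the standard Jensen-Shannon divergence with base-2 logarithms, which may differ from the paper's own normalization by a constant factor; the distributions are hypothetical. Two pairs of disjoint distributions, one pair on neighbouring points and one on remote points, receive exactly the same values.

```python
import math


def manhattan(p, q):
    """Sum of absolute differences between the probabilities of p and q."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kl(p, m):
    """Kullback-Leibler divergence in bits, skipping points where p is zero."""
    return sum(a * math.log2(a / b) for a, b in zip(p, m) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average divergence of p and q from their mixture."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical distributions on four ordered sample points.
p = [1.0, 0.0, 0.0, 0.0]
q_near = [0.0, 1.0, 0.0, 0.0]  # all mass on the neighbouring point
q_far = [0.0, 0.0, 0.0, 1.0]   # all mass on the most remote point

print(manhattan(p, q_near), manhattan(p, q_far))          # 2.0 2.0
print(js_divergence(p, q_near), js_divergence(p, q_far))  # 1.0 1.0
```

Neither measure registers that `q_near` lies closer to `p` in the sample space than `q_far` does.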
The observations above motivate the search for measures which take into account the horizontal distances between probability distributions in addition to their vertical ones. A modification of Tichý’s linkages might then be fruitful. The detailed development of this suggestion has to be left for another occasion, but a simple illustration of the idea can be given here. Consider again real-valued probability densities f which define regions Df in a subspace S of \({\text{R}}^{2}\). Let \({\upbeta }:{\text{ S}} \to {\text{S}}\) be an area-preserving function, so that β[A] has the same area as A for all subregions A of S, and suppose that β maps Df onto Dg, thus moving the whole probability mass from Df to Dg.Footnote 36 The length of the vector from ⟨x,y⟩ to β(x,y) is defined by the metric of S, and the breadth of β is defined as the sum (integral) of all these lengths for points ⟨x,y⟩ in Df. Then the distance between probabilities f and g is the breadth of the narrowest transformation β between Df and Dg. For example, in Fig. 3 the mapping \({\upbeta }\left( {{\text{x}},{\text{y}}} \right) = \left( {{\text{x}} - {\text{a}},{\text{ y}}} \right)\), i.e., linear shift to the left,Footnote 37 gives a linkage between f2 and f1 whose breadth is a, since every point of the region is moved by the distance a and the region has area 1:
\(\iint_{{{\text{D}}_{{{\text{f}}_{2} }} }} {{\text{a}}\,{\text{dx}}\,{\text{dy}}} = {\text{a}} \cdot {\text{area}}({\text{D}}_{{{\text{f}}_{2} }} ) = {\text{a}}\)
For probabilities on the discrete sample space Q, which in effect define columns on the points of Q with the total length one, the corresponding idea is to measure the distance between p and q by looking for the shortest length-preserving transformation between p and q. Such a transformation divides the columns of q into pieces and moves them in order to reach a fit with p. If a part of a column qi is moved to Qj, then the length of this part is multiplied with the distance ρ(Qi,Qj). For example, let \({\mathbf{Q}} = \left\{ {0,1,2} \right\},{\text{ }}{\uprho }\left( {0,1} \right) = {\uprho }\left( {1,2} \right) = 1/2,{\text{ }}{\uprho }\left( {0,2} \right) = 1\). Then p and q have the maximal distance 1, if p gives all probability to 0 and q to 2. If
then the distance between p and q is
But if
then the distance between p and q is ¼. These measures, which combine vertical and horizontal aspects, are applicable to both pure and mixed probabilistic laws.
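The idea of the shortest length-preserving transformation can be sketched for the three-point space above. The sketch rests on an assumption that is not made in the text: since ρ(0,2) = ρ(0,1) + ρ(1,2), the points can be placed on a line at positions 0, 1/2 and 1, and for points on a line the cheapest way of moving mass is given by the one-dimensional earth mover's (Wasserstein-1) formula, which sums the differences of the cumulative distributions across each gap. The function name and the distributions other than the extreme pair are hypothetical and are not the ones elided in the text.

```python
from itertools import accumulate


def transport_distance(p, q, positions):
    """Minimal cost of moving the mass of q onto p for sample points on a line.

    positions must be sorted; rho between two points is assumed to be the
    difference of their positions (true for the three-point example).
    """
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    total = 0.0
    for i in range(len(positions) - 1):
        gap = positions[i + 1] - positions[i]
        # |cdf_p - cdf_q| is the mass that must cross the gap after point i.
        total += abs(cdf_p[i] - cdf_q[i]) * gap
    return total


positions = [0.0, 0.5, 1.0]  # gives rho(0,1) = rho(1,2) = 1/2 and rho(0,2) = 1

print(transport_distance([1, 0, 0], [0, 0, 1], positions))      # 1.0
print(transport_distance([1, 0, 0], [0, 1, 0], positions))      # 0.5
print(transport_distance([0.5, 0, 0.5], [0, 1, 0], positions))  # 0.5
```

The first call reproduces the maximal distance 1 mentioned in the text. Unlike the vertical measures, the second call assigns a smaller distance to the pair of disjoint distributions that sit on neighbouring points. For a general ρ that is not induced by positions on a line, a linear-programming formulation would be needed instead.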
6 Estimating distance from probabilistic truth
According to the similarity approach to the epistemic problem of truthlikeness, unknown degrees of truthlikeness can be estimated by their expected value (4) using a posterior probability distribution over constituents. The same idea can be applied for the estimation of unknown degrees of divergence, which measure distance from the true probabilistic law.
Example 7
Footnote 38 If p is the true probability of success in a binomial model
and q is our guessed value, then the divergence of q from p in a single trial is
As divergence is additive for independent distributions, this divergence in n trials is
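If the prior distribution g(p) of p is the uniform Beta(1,1), i.e., g(p) = 1 for 0 ≤ p ≤ 1, by Bayes’ theorem the posterior distribution g(p/s) of p with s successes and n − s failures is Beta(s + 1, n − s + 1), i.e.,
\({\text{g}}({\text{p}}/{\text{s}}) = \frac{{\Gamma ({\text{n}} + 2)}}{{\Gamma ({\text{s}} + 1)\,\Gamma ({\text{n}} - {\text{s}} + 1)}}\,{\text{p}}^{{\text{s}}} (1 - {\text{p}})^{{{\text{n}} - {\text{s}}}}\)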
whose mean is (s + 1)/(n + 2).Footnote 39 Then the estimated distance of q from p can be calculated by
It follows that, given s successes in n trials, guess q′ is estimated to be closer to the truth than q if and only if
Note that for the deterministic hypothesis q = 1 the value of div(q,p) is not well defined, but for the Jensen-Shannon distance
Hence,
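The estimation procedure of Example 7 can be sketched numerically. The sketch assumes that the single-trial divergence of a guess q from the truth p is the directed divergence p ln(p/q) + (1 − p) ln((1 − p)/(1 − q)) with natural logarithms, and it approximates the posterior expectation by a midpoint rule; the function names and the data (7 successes in 10 trials) are hypothetical. The value for n trials is n times the single-trial value, so the ranking of guesses is unaffected.

```python
import math


def beta_posterior_pdf(p, s, n):
    """Density of Beta(s + 1, n - s + 1), the posterior under a uniform prior."""
    log_norm = math.lgamma(n + 2) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
    return math.exp(log_norm + s * math.log(p) + (n - s) * math.log(1 - p))


def bernoulli_divergence(p, q):
    """Assumed directed divergence of guess q from truth p in a single trial."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def expected_divergence(q, s, n, steps=5000):
    """Midpoint-rule approximation of the posterior expectation of the divergence."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        p = (k + 0.5) * h  # midpoints stay strictly inside (0, 1)
        total += bernoulli_divergence(p, q) * beta_posterior_pdf(p, s, n) * h
    return total


s, n = 7, 10                       # hypothetical evidence
posterior_mean = (s + 1) / (n + 2)
for guess in (0.5, 0.7, posterior_mean):
    print(round(guess, 4), expected_divergence(guess, s, n))
```

Under these assumptions the guess equal to the posterior mean (s + 1)/(n + 2) receives the smallest estimated divergence, and 0.7 is estimated to be closer to the truth than 0.5.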
Example 8
Footnote 40 Let x1,…, xn be independent measurements of an unknown real-valued quantity θ with a normal distribution \({\text{N}}({{\uptheta }},\sigma ^{2} )\):
\({\text{f}}({\text{x}}/{{\uptheta }}) = \frac{1}{{\sigma \sqrt {2\pi } }}\,{\text{e}}^{{ - ({\text{x}} - {{\uptheta }})^{2} /2\sigma ^{2} }}\)
Then their mean value \({\text{y}} = \left( {{\text{x}}_{1} + \cdots + {\text{x}}_{{\text{n}}} } \right)/{\text{n}}\) is normally distributed N(θ, σ²/n). If the prior probability of θ is a sufficiently flat normal distribution, then the posterior distribution g(θ/y) of θ is approximately N(y, σ²/n), where y is the observed mean. If f(x/θ) is the true distribution, and f(x/θ₀) is our guess, then their estimated directed divergence is
Here the mean y as the best estimate agrees with the result of the Bayes-rule of minimizing expected quadratic loss.
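The estimate in Example 8 can be sketched under two stated assumptions: that the directed divergence between two normal densities with the same variance σ² and means θ and θ₀ is (θ − θ₀)²/(2σ²) with natural logarithms, and that the posterior of θ is exactly N(y, σ²/n). The function names and the numerical values are hypothetical; the Monte Carlo routine only serves as a check on the closed form.

```python
import random


def expected_normal_divergence(theta_guess, y, sigma, n):
    """Posterior expectation of (theta - theta_guess)^2 / (2 sigma^2), theta ~ N(y, sigma^2/n)."""
    posterior_var = sigma ** 2 / n
    # E[(theta - guess)^2] = (posterior mean - guess)^2 + posterior variance.
    return ((y - theta_guess) ** 2 + posterior_var) / (2 * sigma ** 2)


def monte_carlo_check(theta_guess, y, sigma, n, draws=100_000, seed=0):
    """Same expectation estimated by sampling theta from the posterior."""
    rng = random.Random(seed)
    posterior_sd = sigma / n ** 0.5
    total = 0.0
    for _ in range(draws):
        theta = rng.gauss(y, posterior_sd)
        total += (theta - theta_guess) ** 2 / (2 * sigma ** 2)
    return total / draws


y, sigma, n = 10.0, 2.0, 25  # hypothetical observed mean, known sigma, sample size
for guess in (9.0, 10.0, 11.0):
    print(guess, expected_normal_divergence(guess, y, sigma, n))
print(monte_carlo_check(11.0, y, sigma, n))
```

The closed form is a quadratic function of the guess with its minimum at the observed mean y, which is the sense in which the estimate agrees with minimizing expected quadratic loss.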
7 Conclusion
We have seen in this paper that the basic idea of the similarity approach to truthlikeness can be extended from qualitative and quantitative first-order languages to cases where probabilistic statements (and their disjunctions) are compared with probabilistic targets. Sections 2 and 3 show how one can naturally proceed from universal and deterministic laws to probabilistic laws. Section 4 argues that the Kullback–Leibler divergence has to be supplemented by the Jensen-Shannon divergence as a measure between mixed probabilistic laws, i.e., laws which assign zero probabilities to some sample points and thereby entail some universal laws. Section 5 formulates a research program for studying a new class of measures which account for the horizontal differences between probability densities, based on distances between sample points. In this way the theory of probabilistic truth approximation not only borrows tools from probability calculus but may also suggest novel kinds of problems for mathematicians. Finally, Sect. 6 gives examples to show that the method of estimating degrees of legisimilitude by their expected value can be generalized from the case of deterministic laws to probabilistic laws.
Notes
This is the Manhattan or Hamming distance between Qu and Qv (i.e., the number of disagreements in their definition), but normalized to take values between 0 and 1. Q-predicates can also be defined by Carnapian families of predicates, and their distance by the Euclidean metric. See Niiniluoto (1987), pp. 44–47.
This solution to the epistemic problem was proposed by Niiniluoto in 1977 (cf. Niiniluoto, 1987, Ch. 7).
For a survey, see Niiniluoto (2011).
See Niiniluoto (1987), p. 377.
For monadic nomic constituents, the relevant posterior probabilities are again obtained from Hintikka’s inductive logic (see Niiniluoto, 1987, pp. 98–102).
See Niiniluoto (1987), p. 379.
See Niiniluoto (1987), Ch. 3.
See Niiniluoto (1987), p. 385. The metric (7) is based on the differences between the values of two functions, but it does not reflect the similarity of their mathematical form (see Niiniluoto, 2019, p. 131). For a proposal to measure the distance between quantitative laws as a combination of accuracy and nomicity, see Garcia Lapeña (2021).
See Niiniluoto (1987), p. 393.
See Niiniluoto (1987), pp. 118–121.
For probabilistic laws with single-case propensities, see Fetzer (1981).
Discrete state systems, both deterministic and probabilistic, with a Markov condition have been studied by Rescher (1970).
See Parzen (1962), pp. 248, 277.
Cf. Niiniluoto (1987), pp. 7–8.
See Festa (1993).
See Niiniluoto (1987), pp. 15–16, pp. 302–303, pp. 321–322.
See McCutcheon (2019).
See Kullback (1959).
For example, Sober (2002) follows the statistician H. Akaike in measuring the distance between a fitted model (with fixed parameter values) and the truth by the Kullback–Leibler distance.
divJS is absolutely continuous, since in (12) (pi + qi)/2 = 0 implies pi = 0 and qi = 0.
See formula (9.21) with b = b´= 1 and q = 4 in Niiniluoto (1987), p. 338.
See formula (6.88) with |I |= 24 and av(*,B) = ½ in Niiniluoto (1987), p. 229.
Festa (2007) has proposed a way of measuring the distance between a monadic generalization and “the statistical truth” (i.e., true probabilistic constituent). The idea, roughly speaking, is to divide Q-predicates into “statistically common” and “rare” ones, and then demand that a truthlike generalization should make true existential claims about common predicates and false exclusion claims only about rare predicates.
See Niiniluoto (1987), p. 319.
See Niiniluoto (1987), p. 340.
See also Oddie (1986), pp. 91–99, who applies this method to depth-d constituents.
See Niiniluoto (1987), pp. 328–330.
See the Boorman-Olivier and Fu metrics for trees in Niiniluoto (1987), pp. 11–14.
A transformation β(x,y) = ⟨f(x,y), g(x,y)⟩, where f and g are linear functions, is area-preserving if the absolute value of its Jacobian determinant is one at every point. The Jacobian is composed of the partial derivatives of f and g:
\(\begin{vmatrix} \partial {\text{f}}({\text{x}},{\text{y}})/\partial {\text{x}} & \partial {\text{f}}({\text{x}},{\text{y}})/\partial {\text{y}} \\ \partial {\text{g}}({\text{x}},{\text{y}})/\partial {\text{x}} & \partial {\text{g}}({\text{x}},{\text{y}})/\partial {\text{y}} \end{vmatrix}\)
Note that the Jacobian determinant of this transformation is
\(\begin{vmatrix} 1 & 0 \\ 0 & 1 \end{vmatrix}\)
so that its value is 1 × 1 − 0 × 0 = 1.
Cf. Niiniluoto (1987), p. 404.
See Festa (1993), pp. 60–61. Here Γ is the gamma function, with Γ(n) = (n − 1)! for positive integers n.
Cf. Niiniluoto (1987), pp. 281–283, 428.
References
Cha, S.-H. (2007). Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1, 300–307.
Festa, R. (1993). Optimum inductive methods. Kluwer.
Festa, R. (2007). Verisimilitude, qualitative theories, and statistical inferences. In S. Pihlström, P. Raatikainen, & M. Sintonen (Eds.), Approaching truth: Essays in honour of Ilkka Niiniluoto (pp. 143–178). College Publications.
Fetzer, J. (1981). Scientific knowledge: Causation, explanation, and corroboration. D. Reidel.
Garcia Lapeña, A. (2021). Truthlikeness for quantitative deterministic laws. The British Journal for the Philosophy of Science.
Hilpinen, R. (1976). Approximate truth and truthlikeness. In M. Przelecki, K. Szaniawski, & R. Wojcicki (Eds.), Formal methods in the methodology of empirical sciences (pp. 19–42). D. Reidel.
Kuipers, T. (1982). Approaching descriptive and theoretical truth. Erkenntnis, 18, 343–387.
Kuipers, T. (Ed.). (1987). What is closer-to-the-truth? Rodopi.
Kuipers, T. (2019). Nomic truth approximation revisited. Springer.
Kullback, S. (1959). Information theory and statistics. Wiley.
McCutcheon, R. (2019). In favor of logarithmic scoring. Philosophy of Science, 86, 286–303.
Niiniluoto, I. (1978). Truthlikeness in first-order languages. In J. Hintikka, I. Niiniluoto, & E. Saarinen (Eds.), Essays on mathematical and philosophical logic (pp. 437–458). D. Reidel.
Niiniluoto, I. (1987). Truthlikeness. D. Reidel.
Niiniluoto, I. (1998). Verisimilitude: The third period. The British Journal for the Philosophy of Science, 49, 1–29.
Niiniluoto, I. (2011). The development of the Hintikka program. In D. Gabbay, S. Hartmann, & J. Woods (Eds.), Handbook of the history of logic (Vol. 10, pp. 311–356). North-Holland.
Niiniluoto, I. (2019). Truth-seeking by abduction. Springer.
Oddie, G. (1986). Likeness to truth. D. Reidel.
Oddie, G. (2016). Truthlikeness. In E. Zalta (Ed.), The Stanford encyclopedia of philosophy, https://plato.stanford.edu/archives/win2016/entries/truthlikeness/. Read Oct. 20, 2020.
Oddie, G. (2019). What accuracy could not be. The British Journal for the Philosophy of Science, 70, 551–580.
Parzen, E. (1962). Stochastic processes. Holden-Day.
Pettigrew, R. (2015). Accuracy and laws of credence. Oxford University Press.
Popper, K. (1963). Conjectures and refutations. Routledge and Kegan Paul.
Rescher, N. (1970). Scientific explanation. The Free Press.
Rosenkrantz, R. (1980). Measuring truthlikeness. Synthese, 45, 463–488.
Sober, E. (2002). Instrumentalism, parsimony, and the Akaike framework. Philosophy of Science, 69(S3), S112–S123.
Tichý, P. (1976). Verisimilitude redefined. The British Journal for the Philosophy of Science, 27, 25–42.
Acknowledgements
I am grateful to Gustavo Cevolani and Theo Kuipers for useful comments on probabilistic truth approximation in general and my paper in particular.
Funding
Open access funding provided by University of Helsinki including Helsinki University Central Hospital.
Additional information
This article belongs to the topical collection "Approaching Probabilistic Truths", edited by Theo Kuipers, Ilkka Niiniluoto, and Gustavo Cevolani.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Niiniluoto, I. Approaching probabilistic laws. Synthese 199, 10499–10519 (2021). https://doi.org/10.1007/s11229-021-03256-8
DOI: https://doi.org/10.1007/s11229-021-03256-8