1 The similarity approach to truthlikeness

Karl Popper’s (1963) original work on truthlikeness was based on the concepts of truth value (true or false) and logical deduction (entailment). Theories were represented as deductively closed sets of sentences in some language L, and the comparative notion “more truthlike” was characterized by set-theoretical comparisons of the truth content and falsity content of rival theories. The main lesson from the dramatic failure of Popper’s definition in 1974 was the need to add the notion of similarity or resemblance to the logical toolbox. In the first formulations of the similarity approach, Risto Hilpinen (1976) represented theories as classes of possible worlds and employed spheres of similarity from David Lewis’ approach to counterfactuals, while Pavel Tichý defined theories as disjunctions of propositional constituents and Ilkka Niiniluoto as disjunctions of monadic constituents, adding a function to measure the distance between constituents. Soon these notions were extended to full first-order languages by Tichý, Niiniluoto, Raimo Tuomela and Graham Oddie, with systematic summaries in Oddie’s Likeness to Truth (1986), Niiniluoto’s Truthlikeness (1987), and Theo Kuipers’ edited volume What Is Closer-to-the-Truth? (1987).

In Hilpinen’s treatment, the truthlikeness of a theory depends on its maximum and minimum distances from the actual world. Tichý and Oddie favor the average distance, while Niiniluoto combines the minimum distance with the normalized sum of all distances. In the linguistic formulations, the degree of truthlikeness Tr(H,C*) of a theory H in language L depends on the similarity of the disjuncts of H with the true constituent C* of L. Here the target C* is the most informative truth expressible in the conceptual framework L, and Tr(H,C*) is maximal when H is identical with this complete truth C*.Footnote 1

A successful theory H should give a full and correct description of a domain of investigation by its conceptual resources in language L. In other words, H should specify the L-structure of the actual world with respect to L, and strongest theories are able to do this up to isomorphism. Here L may include qualitative or quantitative concepts. But the choice of the logical complexity of language L allows a finer discrimination: the target can be chosen for the purposes of the relevant cognitive problem, so that it may be a propositional constituent, a state description, a structure description, a monadic constituent, a polyadic constituent of depth-d, or a complete first-order theory.Footnote 2 For each of these choices, the task is to define the distances of statements in L from the given target.

Following Popper, one should distinguish here two problems. In the logical problem of truthlikeness, we are given the true target C*, and we ask what it means to say that a theory is close to C* or closer to C* than another theory. In the epistemic problem of truthlikeness, the true target C* is unknown, and we ask how we can rationally claim or estimate on available evidence E that one theory is close to C* or closer to C* than another theory.

To illustrate the similarity approach to these two problems, let L be a monadic first-order language with k one-place predicates, and let \(\mathbf{Q} = \{Q_1, \ldots, Q_K\}\) be the Q-predicates of L (\(K = 2^k\)). The Q-predicates are defined as conjunctions of negated or unnegated primitive predicates of L, so that there is a natural distance \(\rho_{uv} = d(Q_u, Q_v)\) between them.Footnote 3 The Q-predicates are the strongest predicates expressible in L, and they constitute a classification system for the individuals in the domain of L. A state description in L locates each individual in one and only one “cell” defined by a Q-predicate, while a structure description specifies the proportions of individuals in these cells. A monadic constituent Ci of L specifies which Q-predicates are empty and which are non-empty:

$$C_i = (+/-)(Ex)Q_1(x)\ \&\ \ldots\ \&\ (+/-)(Ex)Q_K(x)$$
(1)

where (+/−) is replaced by negation or nothing. As an empty universe is excluded, the number of constituents is \(q = 2^K - 1\). If CTi is the class of cells claimed to be occupied by Ci, then (1) can be rewritten in the following form:

$$C_i = \prod_{j \in CT_i} (Ex)Q_j(x)\ \&\ (x)\Big[\bigvee_{j \in CT_i} Q_j(x)\Big].$$
(2)

If for the true constituent C* there are no empty cells, so that CT* = Q, the world is atomistic in the sense that there are no true universal generalizations. For example, the truth of the generalization \((x)(Fx \to Gx)\) means that the cell F & ~G is empty. A simple distance between monadic constituents is the Clifford-measure:

$$\Delta_C(C_i, C_j) = |CT_i\,\Delta\,CT_j|/K = \text{the relative number of disagreements between } C_i \text{ and } C_j,$$
(3)

where Δ is the symmetric difference (see Fig. 1). Variants of (3), which take into account distances between Q-predicates, have been considered by Tichý, Oddie, and Niiniluoto.Footnote 4 Then the degree of truthlikeness Tr(H,C*) of a generalization H in L depends on the Clifford-distances (or their variants) of the disjuncts of H from the true constituent C*. A comparative notion “H1 is closer to the truth than H2” is explicated by the condition Tr(H1,C*) > Tr(H2,C*).
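
To make the Clifford-measure concrete, here is a minimal computational sketch in which constituents are represented simply as sets of indices of the Q-predicates they claim to be non-empty; the toy language, the example constituents, and the helper names are illustrative assumptions, not notation taken from the text above.

```python
def clifford_distance(ct_i, ct_j, K):
    """Normalized Clifford distance (3): |CT_i symmetric-difference CT_j| / K."""
    return len(set(ct_i) ^ set(ct_j)) / K

# Toy monadic language with k = 2 primitive predicates, hence K = 4 Q-predicates.
K = 4
C_star = {1, 2, 3, 4}        # true constituent: every cell occupied
C_1 = {1, 3, 4}              # false constituent: claims cell 2 to be empty

print(clifford_distance(C_1, C_star, K))          # 0.25

# A theory H is a disjunction of constituents; the minimum and the sum of the
# Clifford distances of its disjuncts from C* are the raw ingredients that
# similarity measures of truthlikeness operate on.
H = [{1, 3, 4}, {3, 4}, {1, 2, 3, 4}]
dists = [clifford_distance(C, C_star, K) for C in H]
print(min(dists), sum(dists))                     # 0.0 0.75
```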

Fig. 1 Clifford-distance between monadic constituents

If C* is unknown, but a rational epistemic probability measure P is defined over the class of constituents of L, then the unknown degree Tr(H,C*) can be estimated by its expected value on the basis of evidence E:

$$\mathrm{ver}(H/E) = \sum P(C_i/E)\,\mathrm{Tr}(H, C_i),$$
(4)

where the sum goes over i = 1, …, q.Footnote 5 For monadic languages, the relevant posterior probabilities P(Ci/E) of constituents (i.e., degrees of belief in the truth of Ci given E) are given by Jaakko Hintikka’s system of inductive logic.Footnote 6
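
As a schematic illustration of formula (4), the expected degree of truthlikeness can be computed from any posterior distribution over constituents together with the corresponding Tr values; the numbers below are purely hypothetical stand-ins, not values derived from Hintikka’s inductive logic.

```python
def expected_verisimilitude(posterior, tr):
    """ver(H/E) = sum_i P(C_i/E) * Tr(H, C_i), formula (4).

    posterior: dict mapping each constituent C_i to P(C_i/E);
    tr: function giving Tr(H, C_i) for the fixed hypothesis H.
    """
    return sum(prob * tr(c) for c, prob in posterior.items())

# Hypothetical posterior over three constituents (summing to 1) and
# stand-in truthlikeness values of a fixed hypothesis H relative to each.
posterior = {"C1": 0.6, "C2": 0.3, "C3": 0.1}
tr_values = {"C1": 0.9, "C2": 0.5, "C3": 0.2}
print(expected_verisimilitude(posterior, tr_values.get))   # 0.71
```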

2 Verisimilitude vs. Legisimilitude

Following the difference between accidental and lawlike generalizations, a distinction between verisimilitude and legisimilitude has been proposed by L. J. Cohen. In the logical problem of legisimilitude, the target is not just the strongest true statement about the world (in a given language), but a genuine law of nature. A solution of this problem for universal or deterministic laws can be based on S. Uchii’s notion of nomic constituent.Footnote 7 Let L(□) be a modal monadic language with the operators of nomic necessity □ and nomic possibility ◊, satisfying the system S5. Then a nomic constituent tells which Q-predicates are possible and which are impossible:

$$B_i = \prod_{j \in CT_i} \Diamond(Ex)Q_j(x)\ \&\ \Box(x)\Big[\bigvee_{j \in CT_i} Q_j(x)\Big].$$
(5)

The number of nomic constituents in L(□) is again \(q = 2^K - 1\). As actuality implies possibility, and impossibility implies non-actuality, the nomic constituent (5) is partly weaker and partly stronger than the corresponding constituent (2). Laws of nature are disjunctions of nomic constituents. For example, the law \(\Box(x)(Fx \to Gx)\) is equivalent to the disjunction of all nomic constituents which state that the cell F & ~G is physically impossible. The distance \(\Delta(B_1, B_2)\) between nomic constituents B1 and B2 can be defined by the Clifford-measure \(|CT_1\,\Delta\,CT_2|/K\) or its variants. The degree of legisimilitude of a law of nature H depends on its distance from the true nomic constituent B*:

$$\mathrm{leg}(H, B^*) = 1 - \Delta(H, B^*).$$

Alternatively, if the cognitive aim is to combine verisimilitude and legisimilitude, the target could be the conjunction B* & C*.Footnote 8 Estimation of legisimilitude can again employ expected values based on inductive probabilities.Footnote 9

Nomic constituents (5) represent laws of coexistence, i.e., lawlike connections between attributes or properties. To define laws of succession, introduce a discrete temporal index t to Q-predicates:

$$\prod_{\langle i,j\rangle \in T} \Diamond(Ex)\big(Q_i^t(x)\ \&\ Q_j^{t+1}(x)\big)\ \&\ \Box(x)\Big[\bigvee_{\langle i,j\rangle \in T} \big(Q_i^t(x)\ \&\ Q_j^{t+1}(x)\big)\Big].$$
(6)

Here T lists all possible transitions between successive states. For deterministic laws, for each i there is only one j such that \(\langle i,j\rangle \in T\). Again the Clifford-measure can be applied to measure the distance between laws of succession: \(|T_1\,\Delta\,T_2|/K^2\).Footnote 10

The class of Q-predicates of a monadic language can be generalized to a quantitative state space \(\mathbf{Q} \subseteq \mathbf{R}^k\) generated by real-valued quantities \(h_1, \ldots, h_k\).Footnote 11 In the simplest case, Q is the real line or a part of it, with the geometrical distance \(\rho(x, x') = |x - x'|\) between points. More generally, Q is a k-dimensional metric space with the Euclidean metric. Here laws of coexistence specify regions of nomically possible states

$$F = \{x \in \mathbf{Q} \mid f(h_1(x), \ldots, h_k(x)) = 0\}.$$

The Clifford-distance between two such laws F1 and F2 is defined by

$$\int_{F_1 \Delta F_2} dQ.$$

An alternative approach to quantitative laws expresses how a function hk necessarily depends on \({\text{h}}_{1} , \ldots ,{\text{h}}_{{{\text{k}} - 1}} :\)

$$h_k(x) = g(h_1(x), \ldots, h_{k-1}(x)).$$

The distances between two such real-valued functions can be defined by the Minkowski or Lp-metrics for functions:

$$\Delta_p(f, g) = \Big(\int |f(x) - g(x)|^p\,dx\Big)^{1/p}.$$
(7)

Here p = 1 is the Manhattan metric, p = 2 the Euclidean metric, and p = ∞ the Tchebycheff metric sup \({\mid }{\text{f}}\left( {\text{x}} \right){-}{\text{g}}\left( {\text{x}} \right){\mid }\).Footnote 12 The degree of legisimilitude of the law f then depends on its distance to the true law f*:

$$\mathrm{leg}(f, f^*) = \frac{1}{1 + \Delta(f, f^*)}.$$

Further, f is closer to the truth than g if and only if \(\mathrm{leg}(f, f^*) > \mathrm{leg}(g, f^*)\).
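
A rough numerical sketch of (7) and of the resulting degrees of legisimilitude follows; the grid, the illustrative “true” law, and the rival laws are invented for the example, and the integral is approximated by a simple Riemann sum.

```python
import numpy as np

def minkowski_distance(f, g, grid, p=2):
    """Delta_p(f, g) = (integral |f - g|^p dx)^(1/p), approximated on a grid (formula (7))."""
    dx = grid[1] - grid[0]
    return (np.sum(np.abs(f(grid) - g(grid)) ** p) * dx) ** (1.0 / p)

def legisimilitude(f, f_true, grid, p=2):
    """leg(f, f*) = 1 / (1 + Delta_p(f, f*))."""
    return 1.0 / (1.0 + minkowski_distance(f, f_true, grid, p))

grid = np.linspace(0.0, 1.0, 10_001)
f_true = lambda x: 2.0 * x       # hypothetical true law h_k = g(h_1)
f1 = lambda x: 2.1 * x           # close rival
f2 = lambda x: x ** 2            # more distant rival

print(legisimilitude(f1, f_true, grid))   # about 0.95
print(legisimilitude(f2, f_true, grid))   # about 0.58
```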

Quantitative laws of succession can be formulated by relativizing the state x to a time t: h(x,t) = the state of x at time t. Then deterministic dynamical laws tell how the state depends on the time t and some initial state at time \(t_0\):

$$(h_1(x,t), \ldots, h_k(x,t)) = F(t, h_1(x,t_0), \ldots, h_k(x,t_0)).$$

A law of succession specifies nomically possible trajectories \(F: \mathbf{R} \times \mathbf{Q} \to \mathbf{Q}\). The distance between two such laws can be defined by taking for each \(Q \in \mathbf{Q}\) the Minkowski distance between the trajectories \(F_1(t,Q)\) and \(F_2(t,Q)\), for \(t \in \mathbf{R}\), and then summing over all possible initial states \(Q \in \mathbf{Q}\).Footnote 13

3 Probabilistic laws

The notion of a universal or deterministic law introduced in Sect. 2 can be generalized to probabilistic laws, if an objective physical probability measure is available.Footnote 14 Following Leibniz, such physical probabilities express “degrees of possibility”. This can be understood in terms of single-case propensities: P(G/F) = r means that a physical set-up has a numerical disposition of strength r to produce an outcome of type G in each trial of type F. Thus, probability statements involve a dispositional modal operator, so that they differ from extensional statistical statements about actual relative frequencies of attributes in reference classes (i.e., structure descriptions). Universal laws of coexistence and deterministic laws of succession are limiting special cases of probabilistic laws (with propensities 0 and 1).Footnote 15 Genuine probabilistic laws presuppose that the world is indeterministic, but in statistical modelling one may assign in some sense objective probabilities to random phenomena (e.g., coin tossing, roulette) even when the underlying reality is deterministic. For the task of defining approximation to such probabilities, the philosophical issue of indeterminism and determinism can be left open.

To define probabilistic constituents, replace in a nomic constituent (5) the operator of physical possibility ◊ with a probability measure P over the discrete state space Q of Q-predicates, now understood as the “sample space” or the class of outcomes of a trial x:

$$\prod_{Q_i \in \mathbf{Q}} [P(Q_i(x)) = p_i].$$
(8)

Probabilities (8) over Q define a multinomial context. Then \({\text{p}}_{{\text{i}}} = {\text{P}}\left( {{\text{Q}}_{{\text{i}}} \left( {\text{x}} \right)} \right) > 0\) if and only if Qi is physically possible, for all \({\text{i}} = 1, \ldots ,{\text{ K}}\), so that here P is applied to the open formula Qi(x) instead of the existential statement in (5). Now a probabilistic constituent (8) is compatible with the nomic constituent Bi with CTi if and only if it assigns a positive probability to the Q-predicates in CTi and zero probability to other Q-predicates. This means that typically a nomic constituent is an infinite disjunction of probabilistic constituents.

In many statistical applications, the trial x counts the number of successes in a repeated experiment (e.g., binomial and Poisson distributions), so that the state space Q is a subclass of the set N of natural numbers. Then the distance \({\uprho }\left( {{\text{x}},{\text{x}^{\prime}}} \right)\) between points in Q is their normalized arithmetical difference.

An important distinction can be made between pure and mixed probabilistic laws. A probabilistic constituent, where CTi is a proper subset of Q, is a mixed law in the sense that it entails a universal law (cells in Q–CTi are necessarily empty). Pure probabilistic laws have no such entailments: the world is atomistic in the sense that all Q-predicates in Q are nomically possible (so that no universal laws hold), and a positive probability is assigned to all Q-predicates.

To define probabilistic laws of succession, for a discrete space Q the set T of possible transitions between states is replaced by a matrix of transition probabilities

$$p_{j/i} = P\big(Q_j^{t+1}(x)/Q_i^t(x)\big),$$
(9)

where \({\text{p}}_{{1/{\text{i}}}} + \ldots + {\text{p}}_{{{\text{K}}/{\text{i}}}} = {\text{ }}1\) for each i. This definition involves the Markov condition, i.e., the next state depends only on the present state. If transition probabilities are 0 or 1, this law reduces to the deterministic law (6).Footnote 16 Equation (9) determines the n-step transition probabilities

$$p_{j/i}(n) = P\big(Q_j^{t+n}(x)/Q_i^t(x)\big),$$

and for an irreducible stationary Markov chain the limits of pj/i(n), for n → ∞, give a long-run probability distribution. These notions can be generalized to Markov processes with a continuous time.Footnote 17 Special cases of probabilistic laws of succession can be formulated by quantitative dynamic laws like the law of radioactive decay

$$P(Q(x,t)) = 1 - e^{-\lambda t},$$
(10)

where Q(x,t) states that atom x decays within the time-interval [0,t] and λ is a constant. Finally, for a quantitative state space Q, a probability measure on the infinite sequences of successive states in Q assigns a physical probability to the possible trajectories of a time-continuous stochastic process.
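
For the discrete laws of succession of (9), the n-step probabilities and the long-run distribution are easy to compute with matrix powers; the transition matrix below is an arbitrary illustration, not an example from the text.

```python
import numpy as np

# Hypothetical transition matrix over K = 3 Q-predicates: P[i, j] = p_{j/i}
# as in (9); each row sums to 1.
P = np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
])

# n-step transition probabilities p_{j/i}(n) are the entries of the matrix power P^n.
P5 = np.linalg.matrix_power(P, 5)
print(P5[0])            # distribution over states after 5 steps, starting from state 0

# For an irreducible aperiodic chain the rows of P^n converge to the long-run
# (stationary) distribution; a large power approximates it.
print(np.linalg.matrix_power(P, 200)[0])
```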

4 Distance between probabilities

The problem of legisimilitude for probabilistic laws has not yet received much attention. The main focus in the literature has been on cases where the target is a universal or deterministic law, either qualitative or quantitative. The only detailed proposals have been given by Rosenkrantz (1980) and Niiniluoto (1987, 403–405), who apply the Kullback–Leibler notion of divergence as a measure of distance from a probabilistic truth.

Mathematicians have suggested a great number of measures for distances between probability distributions. In a comprehensive survey, Cha (2007) lists 45 different measures,Footnote 18 which have been used for various purposes. For example, the central limit theorem (the sum of n independent random variables approximates in the limit the normal distribution) and laws of large numbers (observed relative frequencies and predictive probabilities approach almost surely objective probabilities in a multinomial Bernoulli process) express distances between epistemic probabilities q and objective probabilities p by their geometrical distance |q–p|.Footnote 19 This amounts to the Manhattan metric

$$\Delta_1(p, q) = \sum |p_i - q_i|.$$

For discrete probabilities, the squared Euclidean or quadratic metric

$$\Delta_2(p, q) = \sum (p_i - q_i)^2,$$

or its variant χ2, is a standard way of measuring the fit between two distributions or structure descriptions.Footnote 20 In the special case of scoring, where the qi are probabilistic estimates of the truth values pi of n rival exclusive hypotheses (pj = 1 for the true hypothesis, otherwise pi = 0), Glenn Brier’s 1950 measure of inaccuracy is quadratic, i.e., \(d(1,q) = (1 - q)^2\) and \(d(0,q) = q^2\),Footnote 21 while I. J. Good in 1952 favored the logarithmic measure \(d(1,q) = -\ln q\), \(d(0,q) = -\ln(1 - q)\).Footnote 22 From these local measures the total scoring measure is obtained by summing the inaccuracies over all qi.
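
These elementary measures are straightforward to compute; the sketch below implements the Manhattan and quadratic distances together with the total Brier and Good scores for a toy case of three exclusive hypotheses (an invented example).

```python
import math

def manhattan(p, q):
    """Delta_1(p, q) = sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def squared_euclidean(p, q):
    """Delta_2(p, q) = sum_i (p_i - q_i)^2."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def brier_score(truth, q):
    """Total quadratic (Brier) inaccuracy of estimates q for 0/1 truth values."""
    return sum((t - qi) ** 2 for t, qi in zip(truth, q))

def log_score(truth, q):
    """Total logarithmic (Good) inaccuracy: -ln q_i if true, -ln(1 - q_i) if false."""
    return sum(-math.log(qi) if t == 1 else -math.log(1.0 - qi)
               for t, qi in zip(truth, q))

# Three exclusive hypotheses, the second of which is true.
truth = [0, 1, 0]
q = [0.2, 0.7, 0.1]        # hypothetical probabilistic estimates
print(manhattan(truth, q), squared_euclidean(truth, q))
print(brier_score(truth, q), log_score(truth, q))
```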

Some measures are based on the products piqi. Hellinger’s 1909 proposal was modified in 1946 into Bhattacharyya’s dissimilarity coefficient:

$$-\log \sum \sqrt{p_i q_i}.$$

The directed divergence of a discrete probability distribution p from another distribution q was defined by Solomon Kullback and Richard Leibler in 1951 as the expected logarithmic difference between p and q with respect to p:

$$\mathrm{div}(p, q) = \sum p_i \log(p_i/q_i).$$
(11)

Here log can be taken to have the binary base 2, so that log 2 = 1. Formula (11) is also called the relative entropy of p with respect to q. This measure is only a semimetric: it is non-negative, \(\mathrm{div}(p,p) = 0\), and \(\mathrm{div}(p,q) = 0\) if and only if p = q, but it is non-symmetric (usually \(\mathrm{div}(p,q) \ne \mathrm{div}(q,p)\)), and the triangle inequality is not satisfied. For continuous probability densities f and g on R, Eq. (11) is replaced by

$$\mathrm{div}(f(x), g(x)) = \int_{-\infty}^{+\infty} f(x)\log\Big[\frac{f(x)}{g(x)}\Big]dx.$$

The symmetric divergence between p and q is defined by

$$\mathrm{div}_s(p, q) = \mathrm{div}(p, q) + \mathrm{div}(q, p) = \sum (p_i - q_i)\log(p_i/q_i).$$

Other variants include the λ-divergence

$$\mathrm{div}_\lambda(p, q) = \lambda\,\mathrm{div}(p, \lambda p + (1 - \lambda)q) + (1 - \lambda)\,\mathrm{div}(q, \lambda p + (1 - \lambda)q).$$

For λ = ½, it gives the Jensen-Shannon divergence

$$\begin{aligned} \mathrm{div}_{JS}(p, q) &= \frac{1}{2}\,\mathrm{div}\Big(p, \frac{p+q}{2}\Big) + \frac{1}{2}\,\mathrm{div}\Big(q, \frac{p+q}{2}\Big) = \mathrm{H}\Big(\frac{p+q}{2}\Big) - \frac{\mathrm{H}(p) + \mathrm{H}(q)}{2} \\ &= \frac{1}{2}\sum p_i \log p_i + \frac{1}{2}\sum q_i \log q_i - \frac{1}{2}\sum (p_i + q_i)\log[(p_i + q_i)/2] \\ &= \frac{1}{2}\sum p_i \log[2p_i/(p_i + q_i)] + \frac{1}{2}\sum q_i \log[2q_i/(p_i + q_i)], \end{aligned}$$
(12)

where \({\text{H}}\left( {\text{p}} \right) = - \Sigma {\text{p}}_{{\text{i}}} {\text{logp}}_{{\text{i}}}\) is the Shannon entropy of p. Formula (12) is non-negative and symmetric, and its square root is a metric. Renyi divergence is defined by

$$D_\alpha(p, q) = \frac{1}{\alpha - 1}\log \sum \big(p_i^{\alpha}/q_i^{\alpha - 1}\big).$$

For α → 1, it gives in the limit the Kullback–Leibler divergence, and for α = ½ twice the Bhattacharyya distance.
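
For discrete distributions, the divergences just surveyed can be computed directly from their defining sums; a small sketch follows (using base-2 logarithms, so that log 2 = 1, and an arbitrary pair of distributions), including a check that D_{1/2} is twice the Bhattacharyya distance.

```python
import math

def kl_divergence(p, q):
    """div(p, q) = sum_i p_i log2(p_i / q_i), formula (11); requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """div_JS(p, q) = (1/2) div(p, m) + (1/2) div(q, m) with m = (p + q)/2, formula (12)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def renyi_divergence(p, q, alpha):
    """D_alpha(p, q) = (1/(alpha - 1)) log2 sum_i p_i^alpha / q_i^(alpha - 1), alpha != 1."""
    s = sum(pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), js_divergence(p, q))
print(renyi_divergence(p, q, 0.999))          # close to the KL value as alpha -> 1
bhattacharyya = -math.log2(sum(math.sqrt(a * b) for a, b in zip(p, q)))
print(renyi_divergence(p, q, 0.5), 2 * bhattacharyya)   # equal
```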

Divergence was originally intended as a tool in information theory.Footnote 23 In Bayesian statistics it has been used to measure the difference between prior and posterior distributions. It can also be used for assessing the similarity of a probabilistic model to some aspect of reality.Footnote 24 The first connection to studies of truthlikeness was developed by Roger Rosenkrantz (1980). Inspired by I. J. Good’s notion of the weight of evidence, Rosenkrantz suggested that for a random experiment x and truth h*, hypothesis h is more truthlike than hypothesis h′ if

$$\mathrm{div}\big(P(x/h), P(x/h^*)\big) < \mathrm{div}\big(P(x/h'), P(x/h^*)\big).$$

This idea was connected to the similarity approach to truthlikeness by Niiniluoto (1987): the distance of a probabilistic hypothesis h from the probabilistic target h* is measured by div(h*,h). Thus, hypothesis h is more truthlike than h′ if and only if \(\mathrm{div}(h^*, h) < \mathrm{div}(h^*, h')\). When the relevant hypotheses h, h′, and h* are specified by probabilities pi, qi, and pi*, this comparative condition holds if and only if \(\sum p_i^*\log(q_i/p_i) < 0\). The generalization to probability density functions is immediate.

More generally, Niiniluoto (1987) recommends divergence div as a solution to the problem of probabilistic legisimilitude:

  • for probabilistic laws of coexistence (8) in qualitative conceptual spaces, the distance to the true probabilistic constituent

  • for probabilistic laws of succession (9) in qualitative languages, the distance to the matrix of true probability transitions

  • for probabilistic laws of succession in the quantitative space Q, the distance to the true probability on Q.

Alternative solutions could replace div by some other distance measure, e.g., Manhattan, Euclidean, or Bhattacharyya.

The Kullback–Leibler divergence div(p,q) has a limitation which is not noted in Niiniluoto (1987). Its definition (11) presupposes that p is absolutely continuous with respect to q, i.e., if qi = 0, then pi = 0. Further, when pi = 0, the factor 0log0 in the sum vanishes. The same condition is required for probability densities: if g(x) = 0, then f(x) = 0. This means that the KL-divergence can be applied only to pure probabilistic laws, since for mixed probabilistic constituents mistakes in the empty cells (or zero points in Q) in the target and hypothesis would not count at all. The same problem is faced by the Bhattacharyya distance, whose factors vanish as soon as pi or qi is 0, but not by the Minkowski metrics.

This problem with divergence is observed by Rosenkrantz (1980), who suggests that in the evaluation of \(\mathrm{div}(P(x/h), P(x/h^*))\) zero probabilities are replaced by a slightly positive probability of misclassification, but this ad hoc move is unsatisfactory. As a better solution one can recommend the use of the Kullback–Leibler directed divergence div for pure probabilistic laws, and the Jensen-Shannon divergence divJS (instead of div) to measure the distance between mixed probability distributions over the cells in Q or in the transition matrix. The JS-divergence shares the good properties of the KL-divergence, but it has a finite value in all cases, even when some of the probabilities are zero.Footnote 25
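
The contrast can be seen in a few lines of code: with mixed laws that put zero probability on cells the other law treats as possible, the KL-divergence blows up while the JS-divergence stays finite. The distributions are invented, and this KL implementation is written to return infinity (rather than raise an error) when absolute continuity fails.

```python
import math

def kl_divergence(p, q):
    """div(p, q) = sum p_i log2(p_i/q_i); infinite when some p_i > 0 has q_i = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue                      # the term 0 * log 0 is taken to vanish
        if qi == 0:
            return math.inf               # p not absolutely continuous w.r.t. q
        total += pi * math.log2(pi / qi)
    return total

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two mixed probabilistic laws over four cells: each assigns zero probability
# to a cell that the other treats as possible.
p = [0.5, 0.5, 0.0, 0.0]
q = [0.5, 0.0, 0.5, 0.0]
print(kl_divergence(p, q))   # inf -- KL cannot score this comparison
print(js_divergence(p, q))   # 0.5 -- finite, as required for mixed laws
```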

The following examples illustrate various possibilities in analyzing the approach to probabilistic laws.

Example 1.

Let L be a monadic language with two primitive predicates Fx = x is a swan and Gx = x is white. Then there are four Q-predicates in L:

$$\begin{aligned} Q_1x &= Fx\ \&\ Gx \\ Q_2x &= Fx\ \&\ {\sim}Gx \\ Q_3x &= {\sim}Fx\ \&\ Gx \\ Q_4x &= {\sim}Fx\ \&\ {\sim}Gx. \end{aligned}$$

Then the true constituent C* in L states that all Q-predicates are instantiated. Let H be the false universal generalization “All swans are white”. H states that Q2 is empty and leaves the other cells as question marks. Applying the min-sum definition with weights γ and γ′ for the min and sum factors, respectively, the degree of truthlikeness of H isFootnote 26

$$\mathrm{Tr}(H, C^*) = 1 - \gamma/4 - 5\gamma'/8.$$

Choosing γ = 2/3 and γ′ = 1/3, this is equal to 5/8. The degree of truthlikeness of the false constituent C1 with CT1 = {Q1, Q3, Q4} isFootnote 27

$$\mathrm{Tr}(C_1, C^*) = 1 - \gamma/4 - \gamma'/32 = 79/96 > 5/8.$$

The same numerical results hold for the nomic versions of C*, H, and C1. In the probabilistic framework, H corresponds to the law P(Gx/Fx) = 1, but now the target is the true probability distribution P* over the cells Q1,…,Q4, and H is a disjunction of probabilistic constituents with probability 0 for Q2. As the number of black swans is small in comparison to white swans, the true probabilistic law is something like P(Gx/Fx) = 0.95. It follows, for any reasonable distance measure, that H has a relatively high degree of truthlikeness, and in any case higher than that of the law \({\text{P}}\left( {{\text{Gx}}/{\text{Fx}}} \right) = 0.5\). But if the cognitive interest of the investigator is to know both the nomic and actual features of birds, so that the target is the conjunction P* & C*, then H’s overall truthlikeness is reduced, since it mistakenly excludes the cell Q2, while laws of the form \({\text{P}}\left( {{\text{Gx}}/{\text{Fx}}} \right) = {\text{r}} < 1\) allow for the actual existence of black swans.

Example 2.

Already Example 1 illustrates the fact that the comparison of ordinary, nomic and probabilistic constituents is a complicated matter, as they involve different targets. For example, a probabilistic constituent P is equivalent to a single nomic constituent B only in the special case where just one cell Qi is physically possible – and, hence, has probability 1. In other cases, the true nomic constituent B* is an infinite disjunction of probabilistic constituents, and the target-sensitivity does not allow a direct comparison of the degrees of truthlikeness of these different types of hypotheses. In particular, the atomistic nomic constituent, which states the possibility of all Q-predicates, is the disjunction of all pure probabilistic laws. This means that there is no connection between divergence and Clifford-distance for pure probabilistic laws. To see this, assume that P1 and P2 are two different laws, and B1 and B2 are the nomic constituents entailed by P1 and P2. If P1 and P2 are pure laws, then CT1 = CT2 = Q and CT1ΔCT2 = ∅, so that div(P1,P2) > 0 but the Clifford-distance ΔC(B1,B2) = 0.Footnote 28 But some simple comparisons can be made for the special case of uniform mixed laws. Thus, suppose C1 and C2 are monadic nomic constituents in a language with K Q-predicates such that \(|CT_1 - CT_2| = A\), \(|CT_2 - CT_1| = B\), and \(|CT_1 \cap CT_2| = D\), so that the Clifford-distance ΔC between C1 and C2 is (A + B)/K (see (3)). Let P1 and P2 be probabilistic constituents which allocate probability uniformly to CT1 and CT2 (1/c and 1/c′, respectively, where c = |CT1| and c′ = |CT2|), and assume \(c' \ge c\). Now D = c − A = c′ − B. Then the Manhattan distance satisfies

$$\Delta_1(P_1, P_2) = \frac{A}{c} + \frac{B}{c'} + \Big(\frac{1}{c} - \frac{1}{c'}\Big)D = \frac{1}{c'}(A + B) - \frac{c}{c'} + 1 = \frac{K}{c'}\,\Delta_C(C_1, C_2) + \frac{c' - c}{c'}.$$

If \({\text{c}} = {\text{c}^\prime}\), this value equals \({\text{K}}\Delta _{{\text{C}}} \left( {{\text{C}}_{1} ,{\text{C}}_{2} } \right)/{\text{c}}\). For the Euclidean distance with \({\text{c}} = {\text{c}^\prime}\) we have

$$\Delta_2(P_1, P_2) = (A + B)/c^2 = K\,\Delta_C(C_1, C_2)/c^2.$$

A similar connection to the Clifford-measure holds for the Jensen-Shannon divergence (again with c = c′, so that A = B):

$$\begin{aligned} \mathrm{div}_{JS}(P_1, P_2) &= \frac{\log 2c}{2c}\,(A + B) + \frac{\log c}{c}\,D - \log c \\ &= \frac{\log 2c}{2c}\,(A + B) - \Big(1 - \frac{D}{c}\Big)\log c \\ &= \frac{\log 2c}{2c}\,(A + B) - \frac{\log c}{c}\,A \\ &= \frac{\log 2}{c}\,A \\ &= \frac{K}{2c}\,\Delta_C(C_1, C_2). \end{aligned}$$
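
These identities for the uniform case can be verified numerically. The sketch below picks an arbitrary language with K = 8 Q-predicates and two uniform probabilistic constituents with A = B = 2 and D = 2; base-2 logarithms are used throughout.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

K = 8
CT1, CT2 = {0, 1, 2, 3}, {2, 3, 4, 5}                  # c = c' = 4, A = B = 2, D = 2
P1 = [1 / len(CT1) if i in CT1 else 0.0 for i in range(K)]
P2 = [1 / len(CT2) if i in CT2 else 0.0 for i in range(K)]

clifford = len(CT1 ^ CT2) / K                          # Delta_C(C1, C2) = 0.5
manhattan = sum(abs(a - b) for a, b in zip(P1, P2))
sq_euclid = sum((a - b) ** 2 for a, b in zip(P1, P2))
c = len(CT1)

print(manhattan, K * clifford / c)            # both 1.0
print(sq_euclid, K * clifford / c ** 2)       # both 0.25
print(js(P1, P2), K * clifford / (2 * c))     # both 0.5
```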

However, such connections fail for non-uniform laws. For example, if the nomic constituents B1 and B2 are otherwise almost equal, but B1 makes correct possibility claims about cells Qi with high true probability \(p_i^*\) while B2 makes such claims about cells Qj with low probability \(p_j^*\), then it may happen that the truthlikeness ordering is reversed when the target changes from B* to P*: \(\Delta_C(B_1, B^*) > \Delta_C(B_2, B^*)\), but \(d(B_1, P^*) < d(B_2, P^*)\).Footnote 29

Example 3

If p and q are disjoint mixed probabilistic laws (i.e.,\({\text{CT}}_{{\text{p}}} \cap {\text{CT}}_{{\text{q}}} ={\oslash}\)), then

$$\Delta_1(p, q) = \sum p_i + \sum q_i = 1 + 1 = 2,$$
$$\mathrm{div}_{JS}(p, q) = -\sum \frac{p_i}{2}\log\frac{p_i}{2} - \sum \frac{q_i}{2}\log\frac{q_i}{2} - \frac{1}{2}\mathrm{H}(p) - \frac{1}{2}\mathrm{H}(q) = 1.$$

Example 4

The Poisson distribution for a randomly occurring rare event, \(p(i)\) for \(i = 0, 1, 2, \ldots\), with a constant mean λ is defined by

$$p(i) = \frac{\lambda^i}{i!}\,e^{-\lambda}.$$

The KL-divergence between two Poisson distributions with rates λ and λ′ (where λ′ > λ) is

$$\lambda' - \lambda - \lambda\log\frac{\lambda'}{\lambda}.$$

The proof uses the Taylor series

$$e^{\lambda} = \sum_{i=0}^{\infty} \frac{\lambda^i}{i!}.$$
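
A quick numerical check of this closed form (with natural logarithms and illustrative rates) can be done by truncating the defining sum; the log-space computation below avoids overflow in the factorials.

```python
import math

def kl_poisson_numeric(lam, lam2, n_terms=200):
    """Numerical KL divergence (natural log) of Poisson(lam) from Poisson(lam2)."""
    total = 0.0
    for i in range(n_terms):
        log_pmf = i * math.log(lam) - math.lgamma(i + 1) - lam       # log p(i)
        log_ratio = i * math.log(lam / lam2) + (lam2 - lam)          # log [p(i)/p'(i)]
        total += math.exp(log_pmf) * log_ratio
    return total

lam, lam2 = 2.0, 3.5                                # illustrative rates, lam2 > lam
closed_form = lam2 - lam - lam * math.log(lam2 / lam)
print(kl_poisson_numeric(lam, lam2), closed_form)   # both about 0.381
```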

Example 5

The Manhattan difference between two exponential laws (10) with decay rates λ and λ′ (where λ′ > λ) is

$$\Delta_1\big(1 - e^{-\lambda t},\, 1 - e^{-\lambda' t}\big) = \int_0^{\infty} \big(e^{-\lambda t} - e^{-\lambda' t}\big)\,dt = \frac{\lambda' - \lambda}{\lambda\lambda'}.$$

Example 6

Let p and q be deterministic laws of succession such that \({\text{p}}_{{2/1}} = 1,{\text{ p}}_{{1/1}} = 0\;{\text{ and}}\;{\text{ q}}_{{2/1}} = 0,{\text{ q}}_{{1/1}} = 1\). Then

$$\mathrm{div}_{JS}(p, q) = -\Big(\frac{1}{2}\log\frac{1}{2} + \frac{1}{2}\log\frac{1}{2}\Big) - (0 + 0)/2 = -\log\frac{1}{2} = \log 2 = 1.$$

For an indeterministic r with \(r_{2/1} = r_{1/1} = 1/2\),

$$\mathrm{div}_{JS}(p, r) = -\frac{3}{4}\log\frac{3}{4} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{2}\Big(-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2}\Big) = \frac{3}{2}\log 2 - \frac{3}{4}\log 3 \approx 0.311.$$
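
These base-2 values can be confirmed directly from definition (12); the short check below treats the transition distributions from state 1 as ordinary two-cell probability vectors.

```python
import math

def js(p, q):
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Transition distributions from state 1 (over the two possible next states).
p = (0.0, 1.0)      # deterministic: always move to state 2
q = (1.0, 0.0)      # deterministic: always stay in state 1
r = (0.5, 0.5)      # indeterministic

print(js(p, q))     # 1.0
print(js(p, r))     # about 0.311
```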

These examples illustrate that several alternative distance measures give fairly similar comparative results. For uniform probabilistic constituents, their results are related to the Clifford-measure between the corresponding nomic constituents, but this relation is not straightforward for non-uniform constituents and disappears for pure probabilistic laws. When it comes to measuring the distance between particular probability values, geometrical and quadratic differences seem simple and useful, but for the distance between whole probability distributions or densities divergence is a convenient choice. The applicability of the Kullback–Leibler divergence is restricted to pure probabilistic laws, so that the Jensen-Shannon divergence turns out to be a valuable complement which can be applied to mixed laws that assign zero probabilities to some Q-predicates or sample points.

5 Vertical versus horizontal distance measures

An important debate about the explication of truthlikeness for monadic languages concerned the question whether the distance between constituents should reflect distances between Q-predicates. The Clifford-measure \(\Delta_C(C_1, C^*)\) counts all errors of C1 about the Q-predicates equally: mistaken existence claims in \(CT_1 - CT^*\) and mistaken non-existence claims in \(CT^* - CT_1\) have the same weight 1/K in (3). It is natural to consider also situations where the cognitive seriousness of the errors in a false constituent is treated differently, so that the distance from the truth is not simply the cardinality of the symmetric difference. Niiniluoto proposed in 1976 two modifications of the Clifford-measure on the basis of the ρ-measure between Q-predicates.Footnote 30 In the Jyväskylä measure dJ false existence claims are weighted by their distance to the nearest non-empty cell, and false non-existence claims by their distance to the nearest empty cell, while in the weighted symmetric difference dw the first condition holds, but false non-existence claims are weighted by the minimum distance to a really non-empty cell. Then ΔC and dw (unlike dJ) are symmetric, and ΔC and dJ (unlike dw) are specular, where a specular distance (in the sense of Festa, 1993) satisfies the condition that the maximally distant constituent from Ci is its photographic negative (i.e., all positive claims are replaced by negative ones and vice versa).Footnote 31 If the ρ-measure reflects resemblances between predicates in a family (e.g., colors), then for the Jyväskylä measure the generalization “All ravens are grey” is closer to the truth than “All ravens are white”.Footnote 32

Tichý’s (1976) general definition of truthlikeness implies for the monadic case a distance measure between constituents which differs from the Clifford-measure ΔC and its modifications dJ and dw.Footnote 33 A linkage η between sets CTi and CTj is a surjective mapping from the larger of the sets to the smaller one. The cardinality card(η) of η is then \({\text{max}}\left\{ {\left| {{\text{CT}}_{{\text{i}}} } \right|,\;|{\text{CT}}_{{\text{j}}} |} \right\}\), and the breadth of η is the average distance between the linked predicates:

$$B(\eta) = \frac{1}{card(\eta)}\sum_{\langle Q_u, Q_v\rangle \in \eta} \rho(Q_u, Q_v).$$
(13)

The distance \({\text{d}}_{{\text{T}}} \left( {{\text{C}}_{{\text{i}}} ,{\text{C}}_{{\text{j}}} } \right)\) between constituents Ci and Cj is then defined as the breadth of the narrowest linkage between CTi and CTj.

Niiniluoto (1987) rejects Tichý’s proposal for several reasons. The use of the average in (13) leads to unintuitive examples, and constituents should not be treated as if they consisted only of existence claims. Indeed, dT is not specular and does not reflect the cognitive goal of finding true universal generalizations. The most fundamental objection is that \(d_T(C_i, C^*)\) can be derived as the minimum distance between two state descriptions s and s′, where s entails the uniformly distributed infinite structure description entailing Ci and s′ entails C*. Thus, Tichý is not defining the distance between Ci and C* in terms of the counted or weighted differences in claims about the Q-predicates (and thereby the ability of Ci to express true generalizations), but rather in terms of putting an infinite number of individuals in their right places in a classification system.Footnote 34 The latter problem should be solved by choosing the target as the true state description and by representing constituents (in a non-ad hoc way) as disjunctions of state descriptions.

In spite of this criticism, Tichý’s basic idea is interesting, since the notion of a linkage resembles metrics defined for trees in terms of the number of transformations needed to change one tree to another.Footnote 35 A linkage takes seriously (but perhaps in a wrong way) the demand that “horizontal” distances between Q-predicates are relevant. The goal of distributing an infinite number of individuals to their right places could be viewed as analogous to the task of distributing a probability mass (of measure 1) to its right place. Indeed, a discrete probabilistic constituent (8) allocates the probabilities to a finite number of points in the space Q of Q-predicates, and a continuous probability density f on a state space \({\mathbf{Q}} \subset {\text{R}}^{{\text{n}}}\) does the corresponding assignment to an infinite number of points. This can be illustrated by the simple case where Q is a subset of the real line R and f: \({\mathbf{Q}} \to {\text{R}}^{ + }\). If we denote by Df the region between the curve f(x) ≥ 0 and the real axis, i.e.,

$$D_f = \{\langle x, y\rangle \mid x \in \mathbf{Q},\ 0 \le y \le f(x)\},$$

then the density f gives the probability measure 1 to Df. For two probability densities f and g, the symmetric difference \({\text{D}}_{{\text{f}}} \Delta {\text{D}}_{{\text{g}}}\) covers the region between the functions f(x) and g(x) (see Fig. 2). The Manhattan distance is simply the area of this region:

$$\Delta_1(f, g) = |D_f\,\Delta\,D_g|,$$

which is a direct analogue of the Clifford-distance (3) between constituents.

Fig. 2 Distance between probability densities f and g

The Manhattan measure, as well as its Euclidean and divergence alternatives, is in an obvious sense vertical, since these measures assess the distance between probability distributions by the absolute, quadratic, or logarithmic differences between the values of f(x) and g(x), without consideration of horizontal distances between points in the sample space Q. This verticality is dramatically seen in Example 3, where the distances between the values of two disjoint discrete probabilistic constituents p and q are maximal, and the distances Δ1(p,q) and divJS(p,q) have their maximal values quite independently of the location of p and q with respect to the space Q. A counterpart of this result for probability densities is the following observation: if f1 and f2 are geometrical distributions with the same shape but disjoint domains, then Δ1(f1,f2), Δ2(f1,f2), and divJS(f1,f2) have their maximal values quite independently of the geometrical distance a between these densities (see Fig. 3). In fact, all distance measures surveyed by Cha (2007) that are applicable to mixed probabilistic laws share this feature of verticality.

Fig. 3 Disjoint geometrical distributions

The observations above motivate the idea that one could try to find measures which in some way take into account the horizontal distances between probability distributions (in addition to their vertical ones). Then a modification of Tichý’s linkages might be fruitful. The detailed development of this suggestion has to be left for another occasion, but a simple illustration of the idea can be given here. Consider again real-valued probability densities f which define regions Df in a subspace S of R2. Let \({\upbeta }:{\text{ S}} \to {\text{S}}\) be an area-preserving function, so that β[A] has the same area as A for all subregions A of S. Thus, β maps Df onto Dg by moving the whole probability mass from Df to Dg.Footnote 36 The length of the vector (< x,y > ,β(x,y)) is defined by the metric of S, and the breadth of β is defined as the sum (integral) of all these lengths for points < x,y > in Df. Then the distance between probabilities f and g is the breadth of the narrowest transformation β between Df and Dg. For example, in Fig. 3 the mapping \({\upbeta }\left( {{\text{x}},{\text{y}}} \right) = \left( {{\text{x}} - {\text{a}},{\text{ y}}} \right)\), i.e., linear shift to the left,Footnote 37 gives a linkage between f2 and f1 whose breadth is a, since

$$\int_{D_{f_2}} |\langle x, y\rangle, \langle x - a, y\rangle|\,dx\,dy = \int_{D_{f_2}} a\,dx\,dy = a.$$

For probabilities on the discrete sample space Q, which in effect define columns on the points of Q with total length one, the corresponding idea is to measure the distance between p and q by looking for the shortest length-preserving transformation between p and q. Such a transformation divides the columns of q into pieces and moves them in order to reach a fit with p. If a part of a column qi is moved to Qj, then the length of this part is multiplied by the distance ρ(Qi,Qj). For example, let Q = {0, 1, 2} with ρ(0,1) = ρ(1,2) = 1/2 and ρ(0,2) = 1. Then p and q have the maximal distance 1 if p gives all probability to 0 and q to 2. If

$$\begin{aligned} &p_1 = p_2 = p_3 = 1/3 \\ &q_1 = 1/6,\quad q_2 = 2/3,\quad q_3 = 1/6, \end{aligned}$$

then the distance between p and q is

$$\frac{1}{6}\cdot\frac{1}{2} + \frac{1}{6}\cdot\frac{1}{2} = \frac{1}{6}.$$

But if

$$q_1 = 1/6,\quad q_2 = 1/6,\quad q_3 = 2/3,$$

then the distance between p and q is ¼. These measures, which combine vertical and horizontal aspects, are applicable to both pure and mixed probabilistic laws.
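
In this one-dimensional discrete setting the transformation-based distance just described appears to coincide with the classical earth mover’s (Wasserstein) distance, which scipy can compute once the Q-predicates are placed at positions reproducing the assumed ρ-distances; the encoding below is an illustrative choice, not part of the proposal itself.

```python
from scipy.stats import wasserstein_distance

# Place the points of Q so that rho(0,1) = rho(1,2) = 1/2 and rho(0,2) = 1.
positions = [0.0, 0.5, 1.0]

p  = [1/3, 1/3, 1/3]
q1 = [1/6, 2/3, 1/6]
q2 = [1/6, 1/6, 2/3]

print(wasserstein_distance(positions, positions, p, q1))   # 1/6
print(wasserstein_distance(positions, positions, p, q2))   # 1/4
```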

6 Estimating distance from probabilistic truth

According to the similarity approach to the epistemic problem of truthlikeness, unknown degrees of truthlikeness can be estimated by their expected value (4) using a posterior probability distribution over constituents. The same idea can be applied to the estimation of unknown degrees of divergence, which measure the distance from the true probabilistic law.

Example 7

Footnote 38 If p is the true probability of success in a binomial model

$$B(p, s) = \binom{n}{s} p^s (1 - p)^{n - s}$$

and q is our guessed value, then the divergence of q from p in a single trial is

$$\mathrm{div}(q, p) = p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}.$$

As divergence is additive for independent distributions, this divergence in n trials is

$$n\Big[p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}\Big].$$

If the prior distribution g(p) of p is the uniform Beta(1,1), i.e., g(p) = 1 for 0 ≤ p ≤ 1, then by Bayes’ theorem the posterior distribution g(p/s) of p, given s successes and n − s failures, is Beta(s + 1, n − s + 1), i.e.,

$$g(p/s) = \frac{\Gamma(n + 2)}{\Gamma(s + 1)\,\Gamma(n - s + 1)}\,p^s (1 - p)^{n - s},$$

whose mean is (s + 1)/(n + 2).Footnote 39 Then the estimated distance of q from p can be calculated by

$$\int_0^1 g(p/s)\,\mathrm{div}(q, p)\,dp.$$

It follows that, given s successes in n trials, the guess q′ is estimated to be closer to the truth than q if and only if

$$\begin{aligned} &\log\frac{q'}{q}\int_0^1 g(p/s)\,p\,dp \,>\, \log\frac{1-q}{1-q'}\int_0^1 g(p/s)(1-p)\,dp \\ &\text{iff}\quad \log\frac{q'}{q}\cdot\frac{s+1}{n+2} \,>\, \log\frac{1-q}{1-q'}\cdot\Big[1 - \frac{s+1}{n+2}\Big] \\ &\text{iff}\quad \log\frac{q'}{q}\Big/\log\frac{1-q}{1-q'} \,>\, \frac{n-s+1}{s+1}. \end{aligned}$$

Note that for the deterministic hypothesis q = 1 the value of div(q,p) is not well defined, but for the Jensen-Shannon distance

$$\mathrm{div}_{JS}(1, p) = \frac{1}{2}\,p\log\frac{p}{2} - \frac{1}{2}(p + 1)\log\frac{p + 1}{2} + \frac{1}{2}\log 2.$$

Hence,

$$\mathrm{div}_{JS}(1, 0) = \frac{1}{2}\log 2 - \frac{1}{2}\log\frac{1}{2} = \log 2 = 1.$$
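
Example 7 can also be checked numerically: the sketch below computes the estimated single-trial divergence under the Beta(s+1, n−s+1) posterior for two rival guesses and compares the outcome with the derived condition. The evidence figures and the guesses are invented, and the condition in its ratio form presupposes q′ > q.

```python
import math
from scipy import integrate
from scipy.stats import beta

def estimated_divergence(q, s, n):
    """Expected single-trial divergence of guess q from the true p,
    under the Beta(s+1, n-s+1) posterior for p."""
    posterior = beta(s + 1, n - s + 1)
    def integrand(p):
        return posterior.pdf(p) * (p * math.log(p / q) +
                                   (1 - p) * math.log((1 - p) / (1 - q)))
    value, _ = integrate.quad(integrand, 1e-12, 1 - 1e-12)
    return value

s, n = 7, 10                      # illustrative evidence: 7 successes in 10 trials
q, q_prime = 0.5, 0.7
print(estimated_divergence(q, s, n), estimated_divergence(q_prime, s, n))
# q' = 0.7 comes out with the smaller estimated divergence here.

# The comparative condition derived above (for q' > q):
lhs = math.log(q_prime / q) / math.log((1 - q) / (1 - q_prime))
rhs = (n - s + 1) / (s + 1)
print(lhs > rhs)                  # True, matching the direct comparison
```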

Example 8

Footnote 40 Let x1,…, xn be independent measurements of an unknown real-valued quantity θ with a normal distribution \({\text{N}}({{\uptheta }},\sigma ^{2} )\):

$$f(x/\theta) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \theta)^2/2\sigma^2}.$$

Then their mean value \({\text{y}} = \left( {{\text{x}}_{1} + \cdots + {\text{x}}_{{\text{n}}} } \right)/{\text{n}}\) is normally distributed N(θ,σ2/n). If the prior probability of θ is sufficiently flat normal, then the posterior distribution g(θ/y) of θ is approximately N(y,σ2/n), where y is the observed mean. If f(x/θ) is the true distribution, and f(x/θo) is our guess, then their estimated directed divergence is

$$\begin{aligned} &\int g(\theta/y)\int f(x/\theta)\log[f(x/\theta)/f(x/\theta_0)]\,dx\,d\theta \\ &= \int g(\theta/y)\,\big[(\theta - \theta_0)^2/2\sigma^2\big]\,d\theta \\ &= \frac{1}{2\sigma^2}\Big(\mathrm{Var}(\theta/y) + (y - \theta_0)^2\Big) \\ &= \frac{\sigma^2/n + (y - \theta_0)^2}{2\sigma^2}. \end{aligned}$$

Here the mean y as the best estimate (the value of θ0 that minimizes the estimated divergence) agrees with the result of the Bayes rule of minimizing expected quadratic loss.

7 Conclusion

We have seen in this paper that the basic idea of the similarity approach to truthlikeness can be extended from qualitative and quantitative first-order languages to cases where probabilistic statements (and their disjunctions) are compared with probabilistic targets. Sections 2 and 3 show how one can naturally proceed from universal and deterministic laws to probabilistic laws. Section 4 argues that the Kullback–Leibler divergence has to be supplemented by the Jensen-Shannon divergence as a measure between mixed probabilistic laws, i.e., laws which assign zero probabilities to some sample points and thereby entail some universal laws. Section 5 formulates a research program for studying a new class of measures which account for the horizontal differences between probability densities, based on distances between sample points. In this way the theory of probabilistic truth approximation not only borrows tools from the probability calculus but may also suggest novel kinds of problems for mathematicians. Finally, Sect. 6 gives examples to show that the method of estimating degrees of legisimilitude by their expected value can be generalized from the case of deterministic laws to probabilistic laws.