1 The similarity approach to truthlikeness

Karl Popper’s (1963) original work on truthlikeness was based on the concepts of truth value (true or false) and logical deduction (entailment). Theories were represented as deductively closed sets of sentences in some language L, and the comparative notion “more truthlike” was characterized by set-theoretical comparisons of the truth content and falsity content of rival theories. The main lesson from the dramatic failure of Popper’s definition in 1974 was the need to add the notion of similarity or resemblance to the logical toolbox. In the first formulations of the similarity approach, Risto Hilpinen (1976) represented theories as classes of possible worlds and employed spheres of similarity from David Lewis’ approach to counterfactuals, while Pavel Tichý defined theories as disjunctions of propositional constituents and Ilkka Niiniluoto as disjunctions of monadic constituents, adding a function to measure the distance between constituents. Soon these notions were extended to full first-order languages by Tichý, Niiniluoto, Raimo Tuomela and Graham Oddie, with systematic summaries in Oddie’s Likeness to Truth (1986), Niiniluoto’s Truthlikeness (1987), and Theo Kuipers’ edited volume What Is Closer-to-the-Truth? (1987).

In Hilpinen’s treatment, the truthlikeness of a theory depends on its maximum and minimum distances from the actual world. Tichý and Oddie favor the average distance, while Niiniluoto combines the minimum distance with the normalized sum of all distances. In the linguistic formulations, the degree of truthlikeness Tr(H,C*) of a theory H in language L depends on the similarity of the disjuncts of H with the true constituent C* of L. Here the target C* is the most informative truth expressible in the conceptual framework L, and Tr(H,C*) is maximal when H is identical with this complete truth C*.Footnote 1

A successful theory H should give a full and correct description of a domain of investigation by its conceptual resources in language L. In other words, H should specify the L-structure of the actual world with respect to L, and strongest theories are able to do this up to isomorphism. Here L may include qualitative or quantitative concepts. But the choice of the logical complexity of language L allows a finer discrimination: the target can be chosen for the purposes of the relevant cognitive problem, so that it may be a propositional constituent, a state description, a structure description, a monadic constituent, a polyadic constituent of depth-d, or a complete first-order theory.Footnote 2 For each of these choices, the task is to define the distances of statements in L from the given target.

Following Popper, one should distinguish here two problems. In the logical problem of truthlikeness, we are given the true target C*, and we ask what it means to say that a theory is close to C* or closer to C* than another theory. In the epistemic problem of truthlikeness, the true target C* is unknown, and we ask how we can rationally claim or estimate on available evidence E that one theory is close to C* or closer to C* than another theory.

To illustrate the similarity approach to these two problems, let L be a monadic first-order language with k one-place predicates, and let \(\mathbf{Q} = \{Q_1, \ldots, Q_K\}\) be the Q-predicates of L (\(K = 2^k\)). The Q-predicates are defined as conjunctions of negated or unnegated primitive predicates of L, so that there is a natural distance \(\rho_{uv} = d(Q_u, Q_v)\) between them.Footnote 3 The Q-predicates are the strongest predicates expressible in L, and they constitute a classification system for the individuals in the domain of L. A state description in L locates each individual in one and only one “cell” defined by a Q-predicate, while a structure description specifies the proportions of individuals in these cells. A monadic constituent Ci of L specifies which Q-predicates are empty and which are non-empty:

$$C_i = (+/-)(Ex)Q_1(x)\ \&\ \ldots\ \&\ (+/-)(Ex)Q_K(x)$$
(1)

where (+/−) is replaced by negation or nothing. As an empty universe is excluded, the number of constituents is \(q = 2^K - 1\). If CTi is the class of cells claimed to be occupied by Ci, then (1) can be rewritten in the following form:

$$C_i = \prod_{j \in CT_i} (Ex)Q_j(x)\ \&\ (x)\Big[\bigvee_{j \in CT_i} Q_j(x)\Big].$$
(2)

If for the true constituent C* there are no empty cells, so that CT* = Q, the world is atomistic in the sense that there are no true universal generalizations. For example, the truth of the generalization \((x)(Fx \to Gx)\) means that the cell F & ~G is empty. A simple distance between monadic constituents is the Clifford-measure:

$$\Delta_C(C_i, C_j) = |CT_i\,\Delta\,CT_j|/K = \text{the relative number of disagreements between } C_i \text{ and } C_j,$$
(3)

where Δ is the symmetric difference (see Fig. 1). Variants of (3), which take into account distances between Q-predicates, have been considered by Tichý, Oddie, and Niiniluoto.Footnote 4 Then the degree of truthlikeness Tr(H,C*) of a generalization H in L depends on the Clifford-distances (or their variants) of the disjuncts of H from the true constituent C*. A comparative notion “H1 is closer to the truth than H2” is explicated by the condition Tr(H1,C*) > Tr(H2,C*).
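
To make the Clifford-measure concrete, here is a minimal computational sketch in which constituents are represented simply as sets of indices of the Q-predicates they claim to be non-empty; the toy language, the example constituents, and the helper names are illustrative assumptions, not notation taken from the text above.

```python
def clifford_distance(ct_i, ct_j, K):
    """Normalized Clifford distance (3): |CT_i symmetric-difference CT_j| / K."""
    return len(set(ct_i) ^ set(ct_j)) / K

# Toy monadic language with k = 2 primitive predicates, hence K = 4 Q-predicates.
K = 4
C_star = {1, 2, 3, 4}        # true constituent: every cell occupied
C_1 = {1, 3, 4}              # false constituent: claims cell 2 to be empty

print(clifford_distance(C_1, C_star, K))          # 0.25

# A theory H is a disjunction of constituents; the minimum and the sum of the
# Clifford distances of its disjuncts from C* are the raw ingredients that
# similarity measures of truthlikeness operate on.
H = [{1, 3, 4}, {3, 4}, {1, 2, 3, 4}]
dists = [clifford_distance(C, C_star, K) for C in H]
print(min(dists), sum(dists))                     # 0.0 0.75
```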

Fig. 1 Clifford-distance between monadic constituents

If C* is unknown, but a rational epistemic probability measure P is defined over the class of constituents of L, then the unknown degree Tr(H,C*) can be estimated by its expected value on the basis of evidence E:

$$\mathrm{ver}(H/E) = \sum P(C_i/E)\,\mathrm{Tr}(H, C_i),$$
(4)

where the sum goes over i = 1, …, q.Footnote 5 For monadic languages, the relevant posterior probabilities P(Ci/E) of constituents (i.e., degrees of belief in the truth of Ci given E) are given by Jaakko Hintikka’s system of inductive logic.Footnote 6
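
As a schematic illustration of formula (4), the expected degree of truthlikeness can be computed from any posterior distribution over constituents together with the corresponding Tr values; the numbers below are purely hypothetical stand-ins, not values derived from Hintikka’s inductive logic.

```python
def expected_verisimilitude(posterior, tr):
    """ver(H/E) = sum_i P(C_i/E) * Tr(H, C_i), formula (4).

    posterior: dict mapping each constituent C_i to P(C_i/E);
    tr: function giving Tr(H, C_i) for the fixed hypothesis H.
    """
    return sum(prob * tr(c) for c, prob in posterior.items())

# Hypothetical posterior over three constituents (summing to 1) and
# stand-in truthlikeness values of a fixed hypothesis H relative to each.
posterior = {"C1": 0.6, "C2": 0.3, "C3": 0.1}
tr_values = {"C1": 0.9, "C2": 0.5, "C3": 0.2}
print(expected_verisimilitude(posterior, tr_values.get))   # 0.71
```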

2 Verisimilitude vs. Legisimilitude

Following the difference between accidental and lawlike generalizations, a distinction between verisimilitude and legisimilitude has been proposed by L. J. Cohen. In the logical problem of legisimilitude, the target is not just the strongest true statement about the world (in a given language), but a genuine law of nature. A solution of this problem for universal or deterministic laws can be based on S. Uchii’s notion of nomic constituent.Footnote 7 Let L(□) be a modal monadic language with the operators of nomic necessity □ and nomic possibility ◊, satisfying the system S5. Then a nomic constituent tells which Q-predicates are possible and which are impossible:

$$B_i = \prod_{j \in CT_i} \Diamond(Ex)Q_j(x)\ \&\ \Box(x)\Big[\bigvee_{j \in CT_i} Q_j(x)\Big].$$
(5)

The number of nomic constituents in L(□) is again \(q = 2^K - 1\). As actuality implies possibility, and impossibility implies non-actuality, the nomic constituent (5) is partly weaker and partly stronger than the corresponding constituent (2). Laws of nature are disjunctions of nomic constituents. For example, the law \(\Box(x)(Fx \to Gx)\) is equivalent to the disjunction of all nomic constituents which state that the cell F & ~G is physically impossible. The distance \(\Delta(B_1, B_2)\) between nomic constituents B1 and B2 can be defined by the Clifford-measure \(|CT_1\,\Delta\,CT_2|/K\) or its variants. The degree of legisimilitude of a law of nature H depends on its distance from the true nomic constituent B*:

$$\mathrm{leg}(H, B^*) = 1 - \Delta(H, B^*).$$

Alternatively, if the cognitive aim is to combine verisimilitude and legisimilitude, the target could be the conjunction B* & C*.Footnote 8 Estimation of legisimilitude can again employ expected values based on inductive probabilities.Footnote 9

Nomic constituents (5) represent laws of coexistence, i.e., lawlike connections between attributes or properties. To define laws of succession, introduce a discrete temporal index t to Q-predicates:

$$\prod_{\langle i,j\rangle \in T} \Diamond(Ex)\big(Q_i^t(x)\ \&\ Q_j^{t+1}(x)\big)\ \&\ \Box(x)\Big[\bigvee_{\langle i,j\rangle \in T} \big(Q_i^t(x)\ \&\ Q_j^{t+1}(x)\big)\Big].$$
(6)

Here T lists all possible transitions between successive states. For deterministic laws, for each i there is only one j such that \(\langle i,j\rangle \in T\). Again the Clifford-measure can be applied to measure the distance between laws of succession: \(|T_1\,\Delta\,T_2|/K^2\).Footnote 10

The class of Q-predicates of a monadic language can be generalized to a quantitative state space \(\mathbf{Q} \subseteq \mathbf{R}^k\) generated by real-valued quantities \(h_1, \ldots, h_k\).Footnote 11 In the simplest case, Q is the real line or a part of it, with the geometrical distance \(\rho(x, x') = |x - x'|\) between points. More generally, Q is a k-dimensional metric space with the Euclidean metric. Here laws of coexistence specify regions of nomically possible states

$$F = \{x \in \mathbf{Q} \mid f(h_1(x), \ldots, h_k(x)) = 0\}.$$

The Clifford-distance between two such laws F1 and F2 is defined by

$$\int_{F_1 \Delta F_2} dQ.$$

An alternative approach to quantitative laws expresses how a function hk necessarily depends on \({\text{h}}_{1} , \ldots ,{\text{h}}_{{{\text{k}} - 1}} :\)

$$h_k(x) = g(h_1(x), \ldots, h_{k-1}(x)).$$

The distances between two such real-valued functions can be defined by the Minkowski or Lp-metrics for functions:

$$\Delta_p(f, g) = \Big(\int |f(x) - g(x)|^p\,dx\Big)^{1/p}.$$
(7)

Here p = 1 is the Manhattan metric, p = 2 the Euclidean metric, and p = ∞ the Tchebycheff metric sup \({\mid }{\text{f}}\left( {\text{x}} \right){-}{\text{g}}\left( {\text{x}} \right){\mid }\).Footnote 12 The degree of legisimilitude of the law f then depends on its distance to the true law f*:

$$\mathrm{leg}(f, f^*) = \frac{1}{1 + \Delta(f, f^*)}.$$

Further, f is closer to the truth than g if and only if \(\mathrm{leg}(f, f^*) > \mathrm{leg}(g, f^*)\).
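
A rough numerical sketch of (7) and of the resulting degrees of legisimilitude follows; the grid, the illustrative “true” law, and the rival laws are invented for the example, and the integral is approximated by a simple Riemann sum.

```python
import numpy as np

def minkowski_distance(f, g, grid, p=2):
    """Delta_p(f, g) = (integral |f - g|^p dx)^(1/p), approximated on a grid (formula (7))."""
    dx = grid[1] - grid[0]
    return (np.sum(np.abs(f(grid) - g(grid)) ** p) * dx) ** (1.0 / p)

def legisimilitude(f, f_true, grid, p=2):
    """leg(f, f*) = 1 / (1 + Delta_p(f, f*))."""
    return 1.0 / (1.0 + minkowski_distance(f, f_true, grid, p))

grid = np.linspace(0.0, 1.0, 10_001)
f_true = lambda x: 2.0 * x       # hypothetical true law h_k = g(h_1)
f1 = lambda x: 2.1 * x           # close rival
f2 = lambda x: x ** 2            # more distant rival

print(legisimilitude(f1, f_true, grid))   # about 0.95
print(legisimilitude(f2, f_true, grid))   # about 0.58
```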

Quantitative laws of succession can be formulated by relativizing the state x to a time t: h(x,t) = the state of x at time t. Then deterministic dynamical laws tell how the state depends on the time t and some initial state at time \(t_0\):

$$(h_1(x,t), \ldots, h_k(x,t)) = F(t, h_1(x,t_0), \ldots, h_k(x,t_0)).$$

A law of succession specifies nomically possible trajectories \(F: \mathbf{R} \times \mathbf{Q} \to \mathbf{Q}\). The distance between two such laws can be defined by taking for each \(Q \in \mathbf{Q}\) the Minkowski distance between the trajectories \(F_1(t,Q)\) and \(F_2(t,Q)\), for \(t \in \mathbf{R}\), and then summing over all possible initial states \(Q \in \mathbf{Q}\).Footnote 13

3 Probabilistic laws

The notion of a universal or deterministic law introduced in Sect. 2 can be generalized to probabilistic laws, if an objective physical probability measure is available.Footnote 14 Following Leibniz, such physical probabilities express “degrees of possibility”. This can be understood in terms of single-case propensities: P(G/F) = r means that a physical set-up has a numerical disposition of strength r to produce an outcome of type G in each trial of type F. Thus, probability statements involve a dispositional modal operator, so that they differ from extensional statistical statements about actual relative frequencies of attributes in reference classes (i.e., structure descriptions). Universal laws of coexistence and deterministic laws of succession are limiting special cases of probabilistic laws (with propensities 0 and 1).Footnote 15 Genuine probabilistic laws presuppose that the world is indeterministic, but in statistical modelling one may assign in some sense objective probabilities to random phenomena (e.g., coin tossing, roulette) even when the underlying reality is deterministic. For the task of defining approximation to such probabilities, the philosophical issue of indeterminism and determinism can be left open.

To define probabilistic constituents, replace in a nomic constituent (5) the operator of physical possibility ◊ with a probability measure P over the discrete state space Q of Q-predicates, now understood as the “sample space” or the class of outcomes of a trial x:

$$\prod_{Q_i \in \mathbf{Q}} [P(Q_i(x)) = p_i].$$
(8)

Probabilities (8) over Q define a multinomial context. Then \({\text{p}}_{{\text{i}}} = {\text{P}}\left( {{\text{Q}}_{{\text{i}}} \left( {\text{x}} \right)} \right) > 0\) if and only if Qi is physically possible, for all \({\text{i}} = 1, \ldots ,{\text{ K}}\), so that here P is applied to the open formula Qi(x) instead of the existential statement in (5). Now a probabilistic constituent (8) is compatible with the nomic constituent Bi with CTi if and only if it assigns a positive probability to the Q-predicates in CTi and zero probability to other Q-predicates. This means that typically a nomic constituent is an infinite disjunction of probabilistic constituents.

In many statistical applications, the trial x counts the number of successes in a repeated experiment (e.g., binomial and Poisson distributions), so that the state space Q is a subclass of the set N of natural numbers. Then the distance \({\uprho }\left( {{\text{x}},{\text{x}^{\prime}}} \right)\) between points in Q is their normalized arithmetical difference.

An important distinction can be made between pure and mixed probabilistic laws. A probabilistic constituent, where CTi is a proper subset of Q, is a mixed law in the sense that it entails a universal law (cells in Q–CTi are necessarily empty). Pure probabilistic laws have no such entailments: the world is atomistic in the sense that all Q-predicates in Q are nomically possible (so that no universal laws hold), and a positive probability is assigned to all Q-predicates.

To define probabilistic laws of succession, for a discrete space Q the set T of possible transitions between states is replaced by a matrix of transition probabilities

$$p_{j/i} = P\big(Q_j^{t+1}(x)/Q_i^t(x)\big),$$
(9)

where \({\text{p}}_{{1/{\text{i}}}} + \ldots + {\text{p}}_{{{\text{K}}/{\text{i}}}} = {\text{ }}1\) for each i. This definition involves the Markov condition, i.e., the next state depends only on the present state. If transition probabilities are 0 or 1, this law reduces to the deterministic law (6).Footnote 16 Equation (9) determines the n-step transition probabilities

$$p_{j/i}(n) = P\big(Q_j^{t+n}(x)/Q_i^t(x)\big),$$

and for an irreducible stationary Markov chain the limits of pj/i(n), for n → ∞, give a long-run probability distribution. These notions can be generalized to Markov processes with a continuous time.Footnote 17 Special cases of probabilistic laws of succession can be formulated by quantitative dynamic laws like the law of radioactive decay

$$P(Q(x,t)) = 1 - e^{-\lambda t},$$
(10)

where Q(x,t) states that atom x decays within the time-interval [0,t] and λ is a constant. Finally, for a quantitative state space Q, a probability measure on the infinite sequences of successive states in Q assigns a physical probability to the possible trajectories of a time-continuous stochastic process.
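
For the discrete laws of succession of (9), the n-step probabilities and the long-run distribution are easy to compute with matrix powers; the transition matrix below is an arbitrary illustration, not an example from the text.

```python
import numpy as np

# Hypothetical transition matrix over K = 3 Q-predicates: P[i, j] = p_{j/i}
# as in (9); each row sums to 1.
P = np.array([
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.3, 0.7],
])

# n-step transition probabilities p_{j/i}(n) are the entries of the matrix power P^n.
P5 = np.linalg.matrix_power(P, 5)
print(P5[0])            # distribution over states after 5 steps, starting from state 0

# For an irreducible aperiodic chain the rows of P^n converge to the long-run
# (stationary) distribution; a large power approximates it.
print(np.linalg.matrix_power(P, 200)[0])
```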

4 Distance between probabilities

The problem of legisimilitude for probabilistic laws has not yet received much attention. The main focus in the literature has been on cases where the target is a universal or deterministic law, either qualitative or quantitative. The only detailed proposals have been given by Rosenkrantz (1980) and Niiniluoto (1987, 403–405), who apply the Kullback–Leibler notion of divergence as a measure of distance from a probabilistic truth.

Mathematicians have suggested a great number of measures for distances between probability distributions. In a comprehensive survey, Cha (2007) lists 45 different measures,Footnote 18 which have been used for various purposes. For example, the central limit theorem (the sum of n independent random variables approximates in the limit the normal distribution) and laws of large numbers (observed relative frequencies and predictive probabilities approach almost surely objective probabilities in a multinomial Bernoulli process) express distances between epistemic probabilities q and objective probabilities p by their geometrical distance |q–p|.Footnote 19 This amounts to the Manhattan metric

$$\Delta_1(p, q) = \sum |p_i - q_i|.$$

For discrete probabilities, the squared Euclidean or quadratic metric

$$\Delta_2(p, q) = \sum (p_i - q_i)^2,$$

or its variant χ2, is a standard way of measuring the fit between two distributions or structure descriptions.Footnote 20 In the special case of scoring, where the qi are probabilistic estimates of the truth values pi of n rival exclusive hypotheses (pj = 1 for the true hypothesis, otherwise pi = 0), Glenn Brier’s 1950 measure of inaccuracy is quadratic, i.e., \(d(1,q) = (1 - q)^2\) and \(d(0,q) = q^2\),Footnote 21 while I. J. Good in 1952 favored the logarithmic measure \(d(1,q) = -\ln q\), \(d(0,q) = -\ln(1 - q)\).Footnote 22 From these local measures the total scoring measure is obtained by summing the inaccuracies over all qi.
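
These elementary measures are straightforward to compute; the sketch below implements the Manhattan and quadratic distances together with the total Brier and Good scores for a toy case of three exclusive hypotheses (an invented example).

```python
import math

def manhattan(p, q):
    """Delta_1(p, q) = sum_i |p_i - q_i|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def squared_euclidean(p, q):
    """Delta_2(p, q) = sum_i (p_i - q_i)^2."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def brier_score(truth, q):
    """Total quadratic (Brier) inaccuracy of estimates q for 0/1 truth values."""
    return sum((t - qi) ** 2 for t, qi in zip(truth, q))

def log_score(truth, q):
    """Total logarithmic (Good) inaccuracy: -ln q_i if true, -ln(1 - q_i) if false."""
    return sum(-math.log(qi) if t == 1 else -math.log(1.0 - qi)
               for t, qi in zip(truth, q))

# Three exclusive hypotheses, the second of which is true.
truth = [0, 1, 0]
q = [0.2, 0.7, 0.1]        # hypothetical probabilistic estimates
print(manhattan(truth, q), squared_euclidean(truth, q))
print(brier_score(truth, q), log_score(truth, q))
```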

Some measures are based on the products piqi. Hellinger’s 1909 proposal was modified in 1946 into Bhattacharyya’s dissimilarity coefficient:

$$-\log \sum \sqrt{p_i q_i}.$$

The directed divergence of a discrete probability distribution p from another distribution q was defined by Solomon Kullback and Richard Leibler in 1951 as the expected logarithmic difference between p and q with respect to p:

$$\mathrm{div}(p, q) = \sum p_i \log(p_i/q_i).$$
(11)

Here log can be taken to have the binary base 2, so that log 2 = 1. Formula (11) is also called the relative entropy of p with respect to q. This measure is only a semimetric: it is non-negative, \(\mathrm{div}(p,p) = 0\), and \(\mathrm{div}(p,q) = 0\) if and only if p = q, but it is non-symmetric (usually \(\mathrm{div}(p,q) \ne \mathrm{div}(q,p)\)), and the triangle inequality is not satisfied. For continuous probability densities f and g on R, Eq. (11) is replaced by

$$\mathrm{div}(f(x), g(x)) = \int_{-\infty}^{+\infty} f(x)\log\Big[\frac{f(x)}{g(x)}\Big]dx.$$

The symmetric divergence between p and q is defined by

$$\mathrm{div}_s(p, q) = \mathrm{div}(p, q) + \mathrm{div}(q, p) = \sum (p_i - q_i)\log(p_i/q_i).$$

Other variants include the λ-divergence

$$\mathrm{div}_\lambda(p, q) = \lambda\,\mathrm{div}(p, \lambda p + (1 - \lambda)q) + (1 - \lambda)\,\mathrm{div}(q, \lambda p + (1 - \lambda)q).$$

For λ = ½, it gives the Jensen-Shannon divergence

$$\begin{aligned} \mathrm{div}_{JS}(p, q) &= \frac{1}{2}\,\mathrm{div}\Big(p, \frac{p+q}{2}\Big) + \frac{1}{2}\,\mathrm{div}\Big(q, \frac{p+q}{2}\Big) = \mathrm{H}\Big(\frac{p+q}{2}\Big) - \frac{\mathrm{H}(p) + \mathrm{H}(q)}{2} \\ &= \frac{1}{2}\sum p_i \log p_i + \frac{1}{2}\sum q_i \log q_i - \frac{1}{2}\sum (p_i + q_i)\log[(p_i + q_i)/2] \\ &= \frac{1}{2}\sum p_i \log[2p_i/(p_i + q_i)] + \frac{1}{2}\sum q_i \log[2q_i/(p_i + q_i)], \end{aligned}$$
(12)

where \({\text{H}}\left( {\text{p}} \right) = - \Sigma {\text{p}}_{{\text{i}}} {\text{logp}}_{{\text{i}}}\) is the Shannon entropy of p. Formula (12) is non-negative and symmetric, and its square root is a metric. Renyi divergence is defined by

$$D_\alpha(p, q) = \frac{1}{\alpha - 1}\log \sum \big(p_i^{\alpha}/q_i^{\alpha - 1}\big).$$

For α → 1, it gives in the limit the Kullback–Leibler divergence, and for α = ½ twice the Bhattacharyya distance.
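
For discrete distributions, the divergences just surveyed can be computed directly from their defining sums; a small sketch follows (using base-2 logarithms, so that log 2 = 1, and an arbitrary pair of distributions), including a check that D_{1/2} is twice the Bhattacharyya distance.

```python
import math

def kl_divergence(p, q):
    """div(p, q) = sum_i p_i log2(p_i / q_i), formula (11); requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """div_JS(p, q) = (1/2) div(p, m) + (1/2) div(q, m) with m = (p + q)/2, formula (12)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def renyi_divergence(p, q, alpha):
    """D_alpha(p, q) = (1/(alpha - 1)) log2 sum_i p_i^alpha / q_i^(alpha - 1), alpha != 1."""
    s = sum(pi ** alpha / qi ** (alpha - 1) for pi, qi in zip(p, q) if pi > 0)
    return math.log2(s) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q), js_divergence(p, q))
print(renyi_divergence(p, q, 0.999))          # close to the KL value as alpha -> 1
bhattacharyya = -math.log2(sum(math.sqrt(a * b) for a, b in zip(p, q)))
print(renyi_divergence(p, q, 0.5), 2 * bhattacharyya)   # equal
```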

Divergence was originally intended as a tool in information theory.Footnote 23 In Bayesian statistics it has been used to measure the difference between prior and posterior distributions. It can also be used for assessing the similarity of a probabilistic model to some aspect of reality.Footnote 24 The first connection to studies of truthlikeness was developed by Roger Rosenkrantz (1980). Inspired by I. J. Good’s notion of the weight of evidence, Rosenkrantz suggested that for a random experiment x and truth h*, hypothesis h is more truthlike than hypothesis h′ if

$$\mathrm{div}\big(P(x/h), P(x/h^*)\big) < \mathrm{div}\big(P(x/h'), P(x/h^*)\big).$$

This idea was connected to the similarity approach to truthlikeness by Niiniluoto (1987): the distance of a probabilistic hypothesis h from the probabilistic target h* is measured by div(h*,h). Thus, hypothesis h is more truthlike than h′ if and only if \(\mathrm{div}(h^*, h) < \mathrm{div}(h^*, h')\). When the relevant hypotheses h, h′, and h* are specified by probabilities pi, qi, and pi*, this comparative condition holds if and only if \(\sum p_i^*\log(q_i/p_i) < 0\). The generalization to probability density functions is immediate.

More generally, Niiniluoto (1987) recommends divergence div as a solution to the problem of probabilistic legisimilitude:

  • for probabilistic laws of coexistence (8) in qualitative conceptual spaces, the distance to the true probabilistic constituent

  • for probabilistic laws of succession (9) in qualitative languages, the distance to the matrix of true probability transitions

  • for probabilistic laws of succession in the quantitative space Q, the distance to the true probability on Q.

Alternative solutions could replace div by some other distance measure, e.g., Manhattan, Euclidean, or Bhattacharyya.

The Kullback–Leibler divergence div(p,q) has a limitation which is not noted in Niiniluoto (1987). Its definition (11) presupposes that p is absolutely continuous with respect to q, i.e., if qi = 0, then pi = 0. Further, when pi = 0, the factor 0log0 in the sum vanishes. The same condition is required for probability densities: if g(x) = 0, then f(x) = 0. This means that the KL-divergence can be applied only to pure probabilistic laws, since for mixed probabilistic constituents mistakes in the empty cells (or zero points in Q) in the target and hypothesis would not count at all. The same problem is faced by the Bhattacharyya distance, whose factors vanish as soon as pi or qi is 0, but not by the Minkowski metrics.

This problem with divergence is observed by Rosenkrantz (1980), who suggests that in the evaluation of \(\mathrm{div}(P(x/h), P(x/h^*))\) zero probabilities are replaced by a slightly positive probability of misclassification, but this ad hoc move is unsatisfactory. As a better solution one can recommend the use of the Kullback–Leibler directed divergence div for pure probabilistic laws, and the Jensen-Shannon divergence divJS (instead of div) to measure the distance between mixed probability distributions over the cells in Q or in the transition matrix. The JS-divergence shares the good properties of the KL-divergence, but it has a finite value in all cases, even when some of the probabilities are zero.Footnote 25
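
The contrast can be seen in a few lines of code: with mixed laws that put zero probability on cells the other law treats as possible, the KL-divergence blows up while the JS-divergence stays finite. The distributions are invented, and this KL implementation is written to return infinity (rather than raise an error) when absolute continuity fails.

```python
import math

def kl_divergence(p, q):
    """div(p, q) = sum p_i log2(p_i/q_i); infinite when some p_i > 0 has q_i = 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue                      # the term 0 * log 0 is taken to vanish
        if qi == 0:
            return math.inf               # p not absolutely continuous w.r.t. q
        total += pi * math.log2(pi / qi)
    return total

def js_divergence(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two mixed probabilistic laws over four cells: each assigns zero probability
# to a cell that the other treats as possible.
p = [0.5, 0.5, 0.0, 0.0]
q = [0.5, 0.0, 0.5, 0.0]
print(kl_divergence(p, q))   # inf -- KL cannot score this comparison
print(js_divergence(p, q))   # 0.5 -- finite, as required for mixed laws
```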

The following examples illustrate various possibilities in analyzing the approach to probabilistic laws.

Example 1.

Let L be a monadic language with two primitive predicates Fx = x is a swan and Gx = x is white. Then there are four Q-predicates in L:

$$\begin{aligned} Q_1x &= Fx\ \&\ Gx \\ Q_2x &= Fx\ \&\ {\sim}Gx \\ Q_3x &= {\sim}Fx\ \&\ Gx \\ Q_4x &= {\sim}Fx\ \&\ {\sim}Gx. \end{aligned}$$

Then the true constituent C* in L states that all Q-predicates are instantiated. Let H be the false universal generalization “All swans are white”. H states that Q2 is empty and leaves the other cells as question marks. Applying the min-sum definition with weights γ and γ′ for the min and sum factors, respectively, the degree of truthlikeness of H isFootnote 26

$$\mathrm{Tr}(H, C^*) = 1 - \gamma/4 - 5\gamma'/8.$$

Choosing γ = 2/3 and γ′ = 1/3, this is equal to 5/8. The degree of truthlikeness of the false constituent C1 with CT1 = {Q1, Q3, Q4} isFootnote 27

$$\mathrm{Tr}(C_1, C^*) = 1 - \gamma/4 - \gamma'/32 = 79/96 > 5/8.$$

The same numerical results hold for the nomic versions of C*, H, and C1. In the probabilistic framework, H corresponds to the law P(Gx/Fx) = 1, but now the target is the true probability distribution P* over the cells Q1,…,Q4, and H is a disjunction of probabilistic constituents with probability 0 for Q2. As the number of black swans is small in comparison to white swans, the true probabilistic law is something like P(Gx/Fx) = 0.95. It follows, for any reasonable distance measure, that H has a relatively high degree of truthlikeness, and in any case higher than that of the law \({\text{P}}\left( {{\text{Gx}}/{\text{Fx}}} \right) = 0.5\). But if the cognitive interest of the investigator is to know both the nomic and actual features of birds, so that the target is the conjunction P* & C*, then H’s overall truthlikeness is reduced, since it mistakenly excludes the cell Q2, while laws of the form \({\text{P}}\left( {{\text{Gx}}/{\text{Fx}}} \right) = {\text{r}} < 1\) allow for the actual existence of black swans.

Example 2.

Already Example 1 illustrates the fact that the comparison of ordinary, nomic and probabilistic constituents is a complicated matter, as they involve different targets. For example, a probabilistic constituent P is equivalent to a single nomic constituent B only in the special case where just one cell Qi is physically possible – and, hence, has probability 1. In other cases, the true nomic constituent B* is an infinite disjunction of probabilistic constituents, and the target-sensitivity does not allow a direct comparison of the degrees of truthlikeness of these different types of hypotheses. In particular, the atomistic nomic constituent, which states the possibility of all Q-predicates, is the disjunction of all pure probabilistic laws. This means that there is no connection between divergence and Clifford-distance for pure probabilistic laws. To see this, assume that P1 and P2 are two different laws, and B1 and B2 are the nomic constituents entailed by P1 and P2. If P1 and P2 are pure laws, then CT1 = CT2 = Q and CT1ΔCT2 = ∅, so that div(P1,P2) > 0 but the Clifford-distance ΔC(B1,B2) = 0.Footnote 28 But some simple comparisons can be made for the special case of uniform mixed laws. Thus, suppose C1 and C2 are monadic nomic constituents in a language with K Q-predicates such that \(|CT_1 - CT_2| = A\), \(|CT_2 - CT_1| = B\), and \(|CT_1 \cap CT_2| = D\), so that the Clifford-distance ΔC between C1 and C2 is (A + B)/K (see (3)). Let P1 and P2 be probabilistic constituents which allocate probability uniformly to CT1 and CT2 (1/c and 1/c′, respectively, where c = |CT1| and c′ = |CT2|), and assume \(c' \ge c\). Now D = c − A = c′ − B. Then the Manhattan distance satisfies

$$\Delta_1(P_1, P_2) = \frac{A}{c} + \frac{B}{c'} + \Big(\frac{1}{c} - \frac{1}{c'}\Big)D = \frac{1}{c'}(A + B) - \frac{c}{c'} + 1 = \frac{K}{c'}\,\Delta_C(C_1, C_2) + \frac{c' - c}{c'}.$$

If \({\text{c}} = {\text{c}^\prime}\), this value equals \({\text{K}}\Delta _{{\text{C}}} \left( {{\text{C}}_{1} ,{\text{C}}_{2} } \right)/{\text{c}}\). For the Euclidean distance with \({\text{c}} = {\text{c}^\prime}\) we have

$$\Delta_2(P_1, P_2) = (A + B)/c^2 = K\,\Delta_C(C_1, C_2)/c^2.$$

A similar connection to the Clifford-measure holds for the Jensen-Shannon divergence (again with c = c′, so that A = B):

$$\begin{aligned} \mathrm{div}_{JS}(P_1, P_2) &= \frac{\log 2c}{2c}\,(A + B) + \frac{\log c}{c}\,D - \log c \\ &= \frac{\log 2c}{2c}\,(A + B) - \Big(1 - \frac{D}{c}\Big)\log c \\ &= \frac{\log 2c}{2c}\,(A + B) - \frac{\log c}{c}\,A \\ &= \frac{\log 2}{c}\,A \\ &= \frac{K}{2c}\,\Delta_C(C_1, C_2). \end{aligned}$$
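
These identities for the uniform case can be verified numerically. The sketch below picks an arbitrary language with K = 8 Q-predicates and two uniform probabilistic constituents with A = B = 2 and D = 2; base-2 logarithms are used throughout.

```python
import math

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

K = 8
CT1, CT2 = {0, 1, 2, 3}, {2, 3, 4, 5}                  # c = c' = 4, A = B = 2, D = 2
P1 = [1 / len(CT1) if i in CT1 else 0.0 for i in range(K)]
P2 = [1 / len(CT2) if i in CT2 else 0.0 for i in range(K)]

clifford = len(CT1 ^ CT2) / K                          # Delta_C(C1, C2) = 0.5
manhattan = sum(abs(a - b) for a, b in zip(P1, P2))
sq_euclid = sum((a - b) ** 2 for a, b in zip(P1, P2))
c = len(CT1)

print(manhattan, K * clifford / c)            # both 1.0
print(sq_euclid, K * clifford / c ** 2)       # both 0.25
print(js(P1, P2), K * clifford / (2 * c))     # both 0.5
```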

However, such connections fail for non-uniform laws. For example, if the nomic constituents B1 and B2 are otherwise almost equal, but B1 makes correct possibility claims about cells Qi with high true probability \(p_i^*\) while B2 makes such claims about cells Qj with low probability \(p_j^*\), then it may happen that the truthlikeness ordering is reversed when the target changes from B* to P*: \(\Delta_C(B_1, B^*) > \Delta_C(B_2, B^*)\), but \(d(B_1, P^*) < d(B_2, P^*)\).Footnote 29

Example 3

If p and q are disjoint mixed probabilistic laws (i.e.,\({\text{CT}}_{{\text{p}}} \cap {\text{CT}}_{{\text{q}}} ={\oslash}\)), then

$$\Delta_1(p, q) = \sum p_i + \sum q_i = 1 + 1 = 2,$$
$$\mathrm{div}_{JS}(p, q) = -\sum \frac{p_i}{2}\log\frac{p_i}{2} - \sum \frac{q_i}{2}\log\frac{q_i}{2} - \frac{1}{2}\mathrm{H}(p) - \frac{1}{2}\mathrm{H}(q) = 1.$$

Example 4

The Poisson distribution for a randomly occurring rare event, \(p(i)\) for \(i = 0, 1, 2, \ldots\), with a constant mean λ is defined by

$$p(i) = \frac{\lambda^i}{i!}\,e^{-\lambda}.$$

The KL-divergence between two Poisson distributions with rates λ and λ′ (where λ′ > λ) is

$$\lambda' - \lambda - \lambda\log\frac{\lambda'}{\lambda}.$$

The proof uses the Taylor series

$$e^{\lambda} = \sum_{i=0}^{\infty} \frac{\lambda^i}{i!}.$$
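
A quick numerical check of this closed form (with natural logarithms and illustrative rates) can be done by truncating the defining sum; the log-space computation below avoids overflow in the factorials.

```python
import math

def kl_poisson_numeric(lam, lam2, n_terms=200):
    """Numerical KL divergence (natural log) of Poisson(lam) from Poisson(lam2)."""
    total = 0.0
    for i in range(n_terms):
        log_pmf = i * math.log(lam) - math.lgamma(i + 1) - lam       # log p(i)
        log_ratio = i * math.log(lam / lam2) + (lam2 - lam)          # log [p(i)/p'(i)]
        total += math.exp(log_pmf) * log_ratio
    return total

lam, lam2 = 2.0, 3.5                                # illustrative rates, lam2 > lam
closed_form = lam2 - lam - lam * math.log(lam2 / lam)
print(kl_poisson_numeric(lam, lam2), closed_form)   # both about 0.381
```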

Example 5

The Manhattan difference between two exponential laws (10) with decay rates λ and λ′ (where λ′ > λ) is

$$\Delta_1\big(1 - e^{-\lambda t},\, 1 - e^{-\lambda' t}\big) = \int_0^{\infty} \big(e^{-\lambda t} - e^{-\lambda' t}\big)\,dt = \frac{\lambda' - \lambda}{\lambda\lambda'}.$$

Example 6

Let p and q be deterministic laws of succession such that \({\text{p}}_{{2/1}} = 1,{\text{ p}}_{{1/1}} = 0\;{\text{ and}}\;{\text{ q}}_{{2/1}} = 0,{\text{ q}}_{{1/1}} = 1\). Then

$$\mathrm{div}_{JS}(p, q) = -\Big(\frac{1}{2}\log\frac{1}{2} + \frac{1}{2}\log\frac{1}{2}\Big) - (0 + 0)/2 = -\log\frac{1}{2} = \log 2 = 1.$$

For an indeterministic r with \(r_{2/1} = r_{1/1} = 1/2\),

$$\mathrm{div}_{JS}(p, r) = -\frac{3}{4}\log\frac{3}{4} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{2}\Big(-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2}\Big) = \frac{3}{2}\log 2 - \frac{3}{4}\log 3 \approx 0.311.$$
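
These base-2 values can be confirmed directly from definition (12); the short check below treats the transition distributions from state 1 as ordinary two-cell probability vectors.

```python
import math

def js(p, q):
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Transition distributions from state 1 (over the two possible next states).
p = (0.0, 1.0)      # deterministic: always move to state 2
q = (1.0, 0.0)      # deterministic: always stay in state 1
r = (0.5, 0.5)      # indeterministic

print(js(p, q))     # 1.0
print(js(p, r))     # about 0.311
```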

These examples illustrate that several alternative distance measures give fairly similar comparative results. For uniform probabilistic constituents, their results are related to the Clifford-measure between the corresponding nomic constituents, but this relation is not straightforward for non-uniform constituents and disappears for pure probabilistic laws. When it comes to measuring the distance between particular probability values, geometrical and quadratic differences seem simple and useful, but for the distance between whole probability distributions or densities divergence is a convenient choice. The applicability of the Kullback–Leibler divergence is restricted to pure probabilistic laws, so that the Jensen-Shannon divergence turns out to be a valuable complement which can be applied to mixed laws that assign zero probabilities to some Q-predicates or sample points.

5 Vertical versus horizontal distance measures

An important debate about the explication of truthlikeness for monadic languages concerned the question whether the distance between constituents should reflect distances between Q-predicates. The Clifford-measure \(\Delta_C(C_1, C^*)\) counts all errors of C1 about the Q-predicates equally: mistaken existence claims in \(CT_1 - CT^*\) and mistaken non-existence claims in \(CT^* - CT_1\) have the same weight 1/K in (3). It is natural to consider also situations where the cognitive seriousness of the errors in a false constituent is treated differently, so that the distance from the truth is not simply the cardinality of the symmetric difference. Niiniluoto proposed in 1976 two modifications of the Clifford-measure on the basis of the ρ-measure between Q-predicates.Footnote 30 In the Jyväskylä measure dJ false existence claims are weighted by their distance to the nearest non-empty cell, and false non-existence claims by their distance to the nearest empty cell, while in the weighted symmetric difference dw the first condition holds, but false non-existence claims are weighted by the minimum distance to a really non-empty cell. Then ΔC and dw (unlike dJ) are symmetric, and ΔC and dJ (unlike dw) are specular, where a specular distance (in the sense of Festa, 1993) satisfies the condition that the maximally distant constituent from Ci is its photographic negative (i.e., all positive claims are replaced by negative ones and vice versa).Footnote 31 If the ρ-measure reflects resemblances between predicates in a family (e.g., colors), then for the Jyväskylä measure the generalization “All ravens are grey” is closer to the truth than “All ravens are white”.Footnote 32

Tichý’s (1976) general definition of truthlikeness implies for the monadic case a distance measure between constituents which differs from the Clifford-measure ΔC and its modifications dJ and dw.Footnote 33 A linkage η between sets CTi and CTj is a surjective mapping from the larger of the sets to the smaller one. The cardinality card(η) of η is then \({\text{max}}\left\{ {\left| {{\text{CT}}_{{\text{i}}} } \right|,\;|{\text{CT}}_{{\text{j}}} |} \right\}\), and the breadth of η is the average distance between the linked predicates:

$$B(\eta) = \frac{1}{card(\eta)}\sum_{\langle Q_u, Q_v\rangle \in \eta} \rho(Q_u, Q_v).$$
(13)

The distance \({\text{d}}_{{\text{T}}} \left( {{\text{C}}_{{\text{i}}} ,{\text{C}}_{{\text{j}}} } \right)\) between constituents Ci and Cj is then defined as the breadth of the narrowest linkage between CTi and CTj.

Niiniluoto (1987) rejects Tichý’s proposal for several reasons. The use of the average in (13) leads to unintuitive examples, and constituents should not be treated as if they consisted only of existence claims. Indeed, dT is not specular and does not reflect the cognitive goal of finding true universal generalizations. The most fundamental objection is that \(d_T(C_i, C^*)\) can be derived as the minimum distance between two state descriptions s and s′, where s entails the uniformly distributed infinite structure description entailing Ci and s′ entails C*. Thus, Tichý is not defining the distance between Ci and C* in terms of the counted or weighted differences in claims about the Q-predicates (and thereby the ability of Ci to express true generalizations), but rather in terms of putting an infinite number of individuals in their right places in a classification system.Footnote 34 The latter problem should be solved by choosing the target as the true state description and by representing constituents (in a non-ad hoc way) as disjunctions of state descriptions.

In spite of this criticism, Tichý’s basic idea is interesting, since the notion of a linkage resembles metrics defined for trees in terms of the number of transformations needed to change one tree to another.Footnote 35 A linkage takes seriously (but perhaps in a wrong way) the demand that “horizontal” distances between Q-predicates are relevant. The goal of distributing an infinite number of individuals to their right places could be viewed as analogous to the task of distributing a probability mass (of measure 1) to its right place. Indeed, a discrete probabilistic constituent (8) allocates the probabilities to a finite number of points in the space Q of Q-predicates, and a continuous probability density f on a state space \({\mathbf{Q}} \subset {\text{R}}^{{\text{n}}}\) does the corresponding assignment to an infinite number of points. This can be illustrated by the simple case where Q is a subset of the real line R and f: \({\mathbf{Q}} \to {\text{R}}^{ + }\). If we denote by Df the region between the curve f(x) ≥ 0 and the real axis, i.e.,

$$D_f = \{\langle x, y\rangle \mid x \in \mathbf{Q},\ 0 \le y \le f(x)\},$$

then the density f gives the probability measure 1 to Df. For two probability densities f and g, the symmetric difference \({\text{D}}_{{\text{f}}} \Delta {\text{D}}_{{\text{g}}}\) covers the region between the functions f(x) and g(x) (see Fig. 2). The Manhattan distance is simply the area of this region:

$$\Delta_1(f, g) = |D_f\,\Delta\,D_g|,$$

which is a direct analogue of the Clifford-distance (3) between constituents.

Fig. 2 Distance between probability densities f and g

The Manhattan measure, as well as its Euclidean and divergence alternatives, is in an obvious sense vertical, since these measures assess the distance between probability distributions by the absolute, quadratic, or logarithmic differences between the values of f(x) and g(x), without consideration of horizontal distances between points in the sample space Q. This verticality is dramatically seen in Example 3, where the distances between the values of two disjoint discrete probabilistic constituents p and q are maximal, and the distances Δ1(p,q) and divJS(p,q) have their maximal values quite independently of the location of p and q with respect to the space Q. A counterpart of this result for probability densities is the following observation: if f1 and f2 are geometrical distributions with the same shape but disjoint domains, then Δ1(f1,f2), Δ2(f1,f2), and divJS(f1,f2) have their maximal values quite independently of the geometrical distance a between these densities (see Fig. 3). In fact, all distance measures surveyed by Cha (2007) that are applicable to mixed probabilistic laws share this feature of verticality.

Fig. 3 Disjoint geometrical distributions

The observations above motivate the idea that one could try to find measures which in some way take into account the horizontal distances between probability distributions (in addition to their vertical ones). Then a modification of Tichý’s linkages might be fruitful. The detailed development of this suggestion has to be left for another occasion, but a simple illustration of the idea can be given here. Consider again real-valued probability densities f which define regions Df in a subspace S of R2. Let \({\upbeta }:{\text{ S}} \to {\text{S}}\) be an area-preserving function, so that β[A] has the same area as A for all subregions A of S. Thus, β maps Df onto Dg by moving the whole probability mass from Df to Dg.Footnote 36 The length of the vector (< x,y > ,β(x,y)) is defined by the metric of S, and the breadth of β is defined as the sum (integral) of all these lengths for points < x,y > in Df. Then the distance between probabilities f and g is the breadth of the narrowest transformation β between Df and Dg. For example, in Fig. 3 the mapping \({\upbeta }\left( {{\text{x}},{\text{y}}} \right) = \left( {{\text{x}} - {\text{a}},{\text{ y}}} \right)\), i.e., linear shift to the left,Footnote 37 gives a linkage between f2 and f1 whose breadth is a, since

$$\int_{D_{f_2}} |\langle x, y\rangle, \langle x - a, y\rangle|\,dx\,dy = \int_{D_{f_2}} a\,dx\,dy = a.$$

For probabilities on the discrete sample space Q, which in effect define columns on the points of Q with total length one, the corresponding idea is to measure the distance between p and q by looking for the shortest length-preserving transformation between p and q. Such a transformation divides the columns of q into pieces and moves them in order to reach a fit with p. If a part of a column qi is moved to Qj, then the length of this part is multiplied by the distance ρ(Qi,Qj). For example, let Q = {0, 1, 2} with ρ(0,1) = ρ(1,2) = 1/2 and ρ(0,2) = 1. Then p and q have the maximal distance 1 if p gives all probability to 0 and q to 2. If

$$\begin{aligned} &p_1 = p_2 = p_3 = 1/3 \\ &q_1 = 1/6,\quad q_2 = 2/3,\quad q_3 = 1/6, \end{aligned}$$

then the distance between p and q is

$$\frac{1}{6}\cdot\frac{1}{2} + \frac{1}{6}\cdot\frac{1}{2} = \frac{1}{6}.$$

But if

$$q_1 = 1/6,\quad q_2 = 1/6,\quad q_3 = 2/3,$$

then the distance between p and q is ¼. These measures, which combine vertical and horizontal aspects, are applicable to both pure and mixed probabilistic laws.
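
In this one-dimensional discrete setting the transformation-based distance just described appears to coincide with the classical earth mover’s (Wasserstein) distance, which scipy can compute once the Q-predicates are placed at positions reproducing the assumed ρ-distances; the encoding below is an illustrative choice, not part of the proposal itself.

```python
from scipy.stats import wasserstein_distance

# Place the points of Q so that rho(0,1) = rho(1,2) = 1/2 and rho(0,2) = 1.
positions = [0.0, 0.5, 1.0]

p  = [1/3, 1/3, 1/3]
q1 = [1/6, 2/3, 1/6]
q2 = [1/6, 1/6, 2/3]

print(wasserstein_distance(positions, positions, p, q1))   # 1/6
print(wasserstein_distance(positions, positions, p, q2))   # 1/4
```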

6 Estimating distance from probabilistic truth

According to the similarity approach to the epistemic problem of truthlikeness, unknown degrees of truthlikeness can be estimated by their expected value (4) using a posterior probability distribution over constituents. The same idea can be applied to the estimation of unknown degrees of divergence, which measure the distance from the true probabilistic law.

Example 7

Footnote 38 If p is the true probability of success in a binomial model

$$B(p, s) = \binom{n}{s} p^s (1 - p)^{n - s}$$

and q is our guessed value, then the divergence of q from p in a single trial is

$$\mathrm{div}(q, p) = p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}.$$

As divergence is additive for independent distributions, this divergence in n trials is

$$n\Big[p\log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}\Big].$$

If the prior distribution g(p) of p is the uniform Beta(1,1), i.e., g(p) = 1 for 0 ≤ p ≤ 1, then by Bayes’ theorem the posterior distribution g(p/s) of p, given s successes and n − s failures, is Beta(s + 1, n − s + 1), i.e.,

$$g(p/s) = \frac{\Gamma(n + 2)}{\Gamma(s + 1)\,\Gamma(n - s + 1)}\,p^s (1 - p)^{n - s},$$

whose mean is (s + 1)/(n + 2).Footnote 39 Then the estimated distance of q from p can be calculated by

$$\int_0^1 g(p/s)\,\mathrm{div}(q, p)\,dp.$$

It follows that, given s successes in n trials, the guess q′ is estimated to be closer to the truth than q if and only if

$$\begin{aligned} &\log\frac{q'}{q}\int_0^1 g(p/s)\,p\,dp \,>\, \log\frac{1-q}{1-q'}\int_0^1 g(p/s)(1-p)\,dp \\ &\text{iff}\quad \log\frac{q'}{q}\cdot\frac{s+1}{n+2} \,>\, \log\frac{1-q}{1-q'}\cdot\Big[1 - \frac{s+1}{n+2}\Big] \\ &\text{iff}\quad \log\frac{q'}{q}\Big/\log\frac{1-q}{1-q'} \,>\, \frac{n-s+1}{s+1}. \end{aligned}$$

Note that for the deterministic hypothesis q = 1 the value of div(q,p) is not well defined, but for the Jensen-Shannon distance

$$\mathrm{div}_{JS}(1, p) = \frac{1}{2}\,p\log\frac{p}{2} - \frac{1}{2}(p + 1)\log\frac{p + 1}{2} + \frac{1}{2}\log 2.$$

Hence,

$$\mathrm{div}_{JS}(1, 0) = \frac{1}{2}\log 2 - \frac{1}{2}\log\frac{1}{2} = \log 2 = 1.$$
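
Example 7 can also be checked numerically: the sketch below computes the estimated single-trial divergence under the Beta(s+1, n−s+1) posterior for two rival guesses and compares the outcome with the derived condition. The evidence figures and the guesses are invented, and the condition in its ratio form presupposes q′ > q.

```python
import math
from scipy import integrate
from scipy.stats import beta

def estimated_divergence(q, s, n):
    """Expected single-trial divergence of guess q from the true p,
    under the Beta(s+1, n-s+1) posterior for p."""
    posterior = beta(s + 1, n - s + 1)
    def integrand(p):
        return posterior.pdf(p) * (p * math.log(p / q) +
                                   (1 - p) * math.log((1 - p) / (1 - q)))
    value, _ = integrate.quad(integrand, 1e-12, 1 - 1e-12)
    return value

s, n = 7, 10                      # illustrative evidence: 7 successes in 10 trials
q, q_prime = 0.5, 0.7
print(estimated_divergence(q, s, n), estimated_divergence(q_prime, s, n))
# q' = 0.7 comes out with the smaller estimated divergence here.

# The comparative condition derived above (for q' > q):
lhs = math.log(q_prime / q) / math.log((1 - q) / (1 - q_prime))
rhs = (n - s + 1) / (s + 1)
print(lhs > rhs)                  # True, matching the direct comparison
```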

Example 8

Footnote 40 Let x1,…, xn be independent measurements of an unknown real-valued quantity θ with a normal distribution \({\text{N}}({{\uptheta }},\sigma ^{2} )\):

$$f(x/\theta) = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-(x - \theta)^2/2\sigma^2}.$$

Then their mean value \({\text{y}} = \left( {{\text{x}}_{1} + \cdots + {\text{x}}_{{\text{n}}} } \right)/{\text{n}}\) is normally distributed N(θ,σ2/n). If the prior probability of θ is sufficiently flat normal, then the posterior distribution g(θ/y) of θ is approximately N(y,σ2/n), where y is the observed mean. If f(x/θ) is the true distribution, and f(x/θo) is our guess, then their estimated directed divergence is

$$\begin{aligned} &\int g(\theta/y)\int f(x/\theta)\log[f(x/\theta)/f(x/\theta_0)]\,dx\,d\theta \\ &= \int g(\theta/y)\,\big[(\theta - \theta_0)^2/2\sigma^2\big]\,d\theta \\ &= \frac{1}{2\sigma^2}\Big(\mathrm{Var}(\theta/y) + (y - \theta_0)^2\Big) \\ &= \frac{\sigma^2/n + (y - \theta_0)^2}{2\sigma^2}. \end{aligned}$$

Here the mean y as the best estimate (the value of θ0 that minimizes the estimated divergence) agrees with the result of the Bayes rule of minimizing expected quadratic loss.

7 Conclusion

We have seen in this paper that the basic idea of the similarity approach to truthlikeness can be extended from qualitative and quantitative first-order languages to cases where probabilistic statements (and their disjunctions) are compared with probabilistic targets. Sections 2 and 3 show how one can naturally proceed from universal and deterministic laws to probabilistic laws. Section 4 argues that the Kullback–Leibler divergence has to be supplemented by the Jensen-Shannon divergence as a measure between mixed probabilistic laws, i.e., laws which assign zero probabilities to some sample points and thereby entail some universal laws. Section 5 formulates a research program for studying a new class of measures which account for the horizontal differences between probability densities, based on distances between sample points. In this way the theory of probabilistic truth approximation not only borrows tools from the probability calculus but may also suggest novel kinds of problems for mathematicians. Finally, Sect. 6 gives examples to show that the method of estimating degrees of legisimilitude by their expected value can be generalized from the case of deterministic laws to probabilistic laws.