
Entropy increase and information loss in Markov models of evolution


Abstract

Markov models of evolution describe changes in the probability distribution of the trait values a population might exhibit. In consequence, they also describe how entropy and conditional entropy values evolve, and how the mutual information that characterizes the relation between an earlier and a later moment in a lineage’s history depends on how much time separates them. These models therefore provide an interesting perspective on questions that usually are considered in the foundations of physics—when and why does entropy increase and at what rates do changes in entropy take place? They also throw light on an important epistemological question: are there limits on what your observations of the present can tell you about the evolutionary past?


Notes

  1. In this paper ‘Markov process’ will refer to either a discrete Markov process (a Markov chain) or a continuous-time Markov process, in each case on a finite state space.

  2. Throughout this paper, log is with base e, rather than 2; the conclusions remain the same regardless of the base of the log, though some formulae change slightly.

  3. The formulae for a continuous-time model are similar; just replace the term \( (1 - u - v)^{t} \) by \( e^{ - (u + v)t} \).

  4. For example, at t = 0, \( \Pr (X_{t} = 1|X_{0} = 1) = \Pr (X_{t} = 0|X_{0} = 0) = 1 \), and \( \Pr (X_{t} = 1|X_{0} = 0) = \Pr (X_{t} = 0|X_{0} = 1) = 0 \).

  5. A sufficient condition for a function to be strictly increasing is that it has a strictly positive slope everywhere, though this condition is not necessary. This can be seen by considering the function f(x) = 1 + (x − 1)3, which is strictly increasing on the interval [0,2] but has zero slope at x = 1.

  6. We also assume, as usual, that the process is time-homogeneous—that is, the transition probabilities (or rates) do not change over time.

  7. The exceptional cases arise if u and v are both zero (or, in the discrete chain case, if u and v are both 1).

  8. See, for example, Häggström (2002) for a precise definition of aperiodicity. A sufficient condition for it to hold is that \( \Pr (X_{t} = i|X_{0} = i) > 0 \) for each state i and t = 1. Moreover, if the chain is irreducible a sufficient condition is simply that \( \Pr (X_{t} = i|X_{0} = i) > 0 \) for at least one state i and t = 1.

  9. A stationary distribution of a Markov process is any distribution on the states that satisfies the condition that if \( X_{0} \) is chosen according to that distribution, then \( X_{t} \) also has this distribution for all t > 0. An equilibrium distribution is stationary, but not conversely; indeed, a Markov chain may have infinitely many stationary distributions but no equilibrium distribution (see the sketch following these notes).

  10. The Markov chain convergence theorem is a consequence of the well-known Perron-Frobenius theorem (as applied to irreducible matrices), see e.g. Grimmett and Stirzaker (2001, p. 295).

  11. Summations are over all the states (or pairs of states). We use i throughout to denote the initial state and j to denote the state at time t.

  12. The Wright-Fisher model is a classical Markov chain in population genetics in which each generation is replaced by the next (see e.g. Durrett 2002).

  13. In the Moran model each step of the process involves the death of a single individual and its replacement by a new one (see e.g. Durrett 2002).

  14. Since both \( H(X_{t} ) \) and \( H(X_{t} |X_{0} ) \) converge to \( - \sum\limits_{i} {\pi_{i} } \log (\pi_{i} ) \).

  15. We describe a further interesting property of this chain in the Section “Tree-like evolution”.

  16. An equivalent definition of singularity is that we can write some row of the matrix as a linear combination of the other rows. Singular matrices are thus, in some sense, ‘exceptional’.

  17. Since the transition matrix of a continuous process acting for time t with intensity matrix Q can be written as exp(Qt), and Jacobi’s formula assures us that \( \det (\exp (Qt)) = \exp ({\text{tr}}(Q)t) > 0 \).

  18. \( Y \to W \to Z \) means that W screens off Y from Z. If, in addition, Z screens off Y from W, then the inequality \( I(Y;Z) \le I(Y;W) \) becomes an equality (see, e.g., Cover and Thomas 1991).

  19. Non-trivial here means that the equilibrium distribution assigns strictly positive probability to at least two states.

  20. Equation (6) follows from symmetry of the mutual information function I, so that \( H(X_{t} ) - H(X_{t} |X_{0} ) = I(X_{0} ;X_{t} ) = I(X_{t} ;X_{0} ) = H(X_{0} ) - H(X_{0} |X_{t} ) \).

  21. A process is time-reversible precisely when it is in equilibrium and \( \pi_{i} \Pr (X_{t} = j|X_{0} = i) = \pi_{j} \Pr (X_{t} = i|X_{0} = j) \) for all states i, j and t > 0. Any two-state process in equilibrium is time-reversible, whether it involves drift or selection. For any n > 2 there are n-state Markov processes that are not time-reversible (Häggström 2002).

  22. This discussion corrects some of what Sober (2008, pp. 300–306) says about a model of frequency-dependent selection for the majority trait; the model is non-Markovian.

  23. A lumped Markov process is not generally a Markov process; necessary and sufficient conditions for it to be so are well known (see, for example, Kemeny and Snell 1976).

  24. This follows from results in Tuffley and Steel (1997b).

  25. Sober (1989) uses a likelihood framework to show that with a two-state drift process, the pair of observations \( D_{1} = 1 \) and \( D_{4} = 1 \) provides stronger evidence that R = 1 than does the pair of observations \( D_{2} = 1 \) and \( D_{3} = 1 \); however, with other processes, the relationship reverses.

  26. The mutual information is positive for \( (D_{2} ,D_{3} ) \) and R because if we observe state 1 at \( D_{2} \) and state 3 at \( D_{3} \), then C must have been in either state 3 or state 4, and so R cannot have been in state 2. The mutual information is zero if we replace \( (D_{2} ,D_{3} ) \) by \( (D_{1} ,D_{4} ) \) because taking two steps in the chain produces the uniform distribution from any starting state.

  27. A related example, involving maximum parsimony, was described by Fischer and Thatte (2009).

  28. This inequality is an immediate consequence of the data-processing inequality (referred to just before Proposition 5a) since \( X_{0} \to X \to Y \) is a Markov chain.

  29. ‘Accuracy’ here refers to the expected probability of correctly reconstructing the ancestral state by the method.

  30. If the state of all the leaves on a tree provides at least as much information about the state of the root as any subset of the leaves provides, and if parsimony sometimes does better at reconstructing the state of the ancestor when it consults only a subset of the leaves, then parsimony sometimes misinterprets what the full data set is saying. Of course, the sub-optimal performance of parsimony in this context leaves open that the method might perform optimally under other models of evolution.

  31. To see this, note that we can take \( c_{i} = {\frac{{\Pr (X_{0} = i)}}{{\Pr (X_{0} = 1)}}};d_{\alpha } = p_{1}^{\alpha } . \)
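
As an aside to Note 9, here is a minimal sketch (Python with NumPy; the chain is our own illustrative choice, not one discussed in the text) of a chain with a stationary distribution but no equilibrium distribution:

```python
import numpy as np

# The deterministic 'flip' chain is periodic: (1/2, 1/2) is stationary,
# but a point-mass initial distribution oscillates forever, so there is
# no equilibrium distribution.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi = np.array([0.5, 0.5])
print(pi @ P)                                               # [0.5 0.5] -- stationary
print(np.array([1.0, 0.0]) @ np.linalg.matrix_power(P, 7))  # [0. 1.] -- still oscillating
```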

References

  • Barrett M, Sober E (1994) The second law of probability dynamics. Br J Philos Sci 45:941–953

  • Barrett M, Sober E (1995) When and why does entropy increase? In: Savitt S (ed) Time’s arrow today. Cambridge University Press, Cambridge, UK, pp 230–258

  • Berger JO (1985) Statistical decision theory and Bayesian analysis, 2nd edn. Springer Series in Statistics, Springer-Verlag, Berlin

  • Brooks D, Wiley E (1988) Evolution as entropy. University of Chicago Press, Chicago

  • Chang J (1996) Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math Biosci 134:189–215

  • Cover TM, Thomas JA (1991) Elements of information theory. Wiley, New York

  • Crow J, Kimura M (1970) An introduction to population genetics theory. Burgess Publishing Company, Minneapolis

  • Durrett R (2002) Probability models for DNA sequence evolution. Springer-Verlag, New York

  • Evans W, Kenyon C, Peres Y, Schulman LJ (2000) Broadcasting on trees and the Ising model. Adv Appl Probab 10:403–433

  • Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Sunderland, MA

  • Fischer M, Thatte B (2009) Maximum parsimony on subsets of taxa. J Theor Biol 260:290–293

  • Grimmett G, Stirzaker D (2001) Probability and random processes, 3rd edn. Oxford University Press, Oxford, UK

  • Häggström O (2002) Finite Markov chains and algorithmic applications. Cambridge University Press, Cambridge, UK

  • Kemeny JG, Snell JL (1976) Finite Markov chains. Springer-Verlag, New York

  • Mossel E (1998) Recursive reconstruction on periodic trees. Random Struct Algorithm 13(1):81–97

  • Mossel E (2003) On the impossibility of reconstructing ancestral data and phylogenies. J Comput Biol 10:669–678

  • Mossel E, Steel M (2005) How much can evolved characters tell us about the tree that generated them? In: Gascuel O (ed) Mathematics of evolution and phylogeny. Oxford University Press, Oxford, pp 384–412

  • Pinelis I (2003) Evolutionary models of phylogenetic trees. Proceedings of the Royal Society B 270(1522):1425–1431

  • Rozanov YA (1969) Probability theory: a concise course. Dover Publications Inc, New York

  • Semple C, Steel M (2003) Phylogenetics. Oxford University Press, Oxford, UK

  • Seneta E (1973) Non-negative matrices: an introduction to theory and applications. Wiley, New York, pp 52–54

  • Sober E (1989) Independent evidence about a common cause. Philos Sci 56:275–287

  • Sober E (2008) Evidence and evolution: the logic behind the science. Cambridge University Press, Cambridge, UK

  • Sober E, Barrett M (1992) Conjunctive forks and temporally asymmetric inference. Aust J Philos 70:1–23

  • Sober E, Steel M (2002) Testing the hypothesis of common ancestry. J Theor Biol 218:395–408

  • Tuffley C, Steel MA (1997a) Links between maximum likelihood and maximum parsimony under a simple model of site substitution. Bull Math Biol 59(3):581–607

  • Tuffley C, Steel MA (1997b) Modeling the covarion hypothesis of nucleotide substitution. Math Biosci 147:63–91

  • Weber B, Depew D (1988) Entropy, information, and evolution: new perspectives on physical and biological evolution. MIT Press, Cambridge

  • Yockey H (2005) Information theory, evolution, and the origin of life. Cambridge University Press, Cambridge


Acknowledgments

We thank the editor and referee for useful suggestions. ES thanks the William F. Vilas Trust of the University of Wisconsin-Madison for financial support and MS thanks the Royal Society of New Zealand for funding under its James Cook Fellowship scheme.

Author information

Correspondence to Elliott Sober.

Appendix: Technical details

Part A: Proofs of propositions

First recall that the ‘mutual information’ I between \( X_{0} \) and \( X_{t} \) is defined, as usual, as:

$$ I(X_{0} ;X_{t} ) = \sum\limits_{ij} \Pr (X_{0} = i\& X_{t} = j)\log \left( {{\frac{{\Pr (X_{0} = i\& X_{t} = j)}}{{\Pr (X_{0} = i)\Pr (X_{t} = j)}}}} \right). $$
(7)

This can be rewritten in terms of the transition probabilities \( \Pr (X_{t} = j|X_{0} = i) \), the initial distribution \( (p_{i} ) \) and the subsequent distribution \( (\Pr (X_{t} = j)) \) of states as:

$$ I(X_{0} ;X_{t} ) = \sum\limits_{ij} p_{i} \Pr (X_{t} = j|X_{0} = i)\log \left( {{\frac{{\Pr (X_{t} = j|X_{0} = i)}}{{\Pr (X_{t} = j)}}}} \right). $$
(8)

A straightforward and well-known identity in information theory is that:

$$ I(X_{0} ;X_{t} ) = H(X_{t} ) - H(X_{t} |X_{0} ). $$
(9)

To justify Proposition 4, it suffices, by (9), to show that \( I(X_{0} ;X_{t} ) > 0. \) Notice that \( I(X_{0} ;X_{t} ) \) is the Kullback–Leibler distance between the probability distributions \( p_{i} \Pr (X_{t} = j|X_{0} = i) \) and \( p_{i} \Pr (X_{t} = j) \). In particular, \( I(X_{0} ;X_{t} ) = 0 \) implies that these two probability distributions are identical—that is, \( p_{i} \Pr (X_{t} = j|X_{0} = i) = p_{i} \Pr (X_{t} = j) \) for all pairs of states i, j (including i = j). Since there are two distinct values of i for which \( p_{i} > 0 \), the two corresponding rows of the matrix \( P = [p_{ij} ] = [\Pr (X_{t} = j|X_{0} = i)] \) are identical (since \( p_{ij} = \Pr (X_{t} = j) \) in both cases), and so det(P) = 0. This contradicts our assumption in Proposition 4, and so \( I(X_{0} ;X_{t} ) > 0. \)
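
The quantities in this proof are easy to compute directly. Below is a minimal sketch (Python with NumPy; the two-state parameters u = 0.1, v = 0.2 and the uniform initial distribution are hypothetical choices) that evaluates Eq. (7) and confirms that the mutual information is strictly positive when det(P) ≠ 0:

```python
import numpy as np

def mutual_information(p0, P):
    """I(X_0; X_t) as in Eq. (7), for initial distribution p0 and
    t-step transition matrix P (natural log, as in note 2)."""
    joint = p0[:, None] * P                # Pr(X_0 = i & X_t = j)
    pt = joint.sum(axis=0)                 # marginal Pr(X_t = j)
    prod = p0[:, None] * pt[None, :]       # Pr(X_0 = i) * Pr(X_t = j)
    mask = joint > 0                       # terms with joint = 0 contribute 0
    return float((joint[mask] * np.log(joint[mask] / prod[mask])).sum())

u, v = 0.1, 0.2                            # hypothetical substitution probabilities
P1 = np.array([[1 - u, u], [v, 1 - v]])    # one-step transition matrix
P5 = np.linalg.matrix_power(P1, 5)         # transition matrix after t = 5 steps
p0 = np.array([0.5, 0.5])                  # non-degenerate initial distribution
print(np.linalg.det(P5) != 0, mutual_information(p0, P5) > 0)   # True True
```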

Next we turn to the proof of Proposition 6. The regularity assumption implies that, for all states i, j, we have \( \pi_{i} \ge \varepsilon \) and \( |\Pr (X_{t} = j|X_{0} = i) - \pi_{j} | \le Be^{ - ct} \) for strictly positive constants \( B,c,\varepsilon \) (see, for example, Rozanov 1969, Theorem 7.4). It follows that the logarithmic term in \( I(X_{0} ;X_{t} ) \) from (7), namely \( \log \left( {{\frac{{\Pr (X_{t} = j|X_{0} = i)}}{{\Pr (X_{t} = j)}}}} \right) \), can be written as \( \log (1 + O(e^{ - ct} )) = O(e^{ - ct} ) \), where O is the usual order notation. In particular, for some C > 0 we have \( I(X_{0} ;X_{t} ) \le Ce^{ - ct} \sum\limits_{ij} p_{i} \Pr (X_{t} = j|X_{0} = i) = Ce^{ - ct} , \) as claimed.
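
Continuing the sketch above, the decay promised by Proposition 6 is visible numerically: the successive ratios I(t+1)/I(t) settle near a constant below 1 (this illustrates the geometric upper bound; the exact constant is not asserted by the proposition):

```python
# Geometric decay of I(X_0; X_t) for the regular two-state chain defined above.
I = [mutual_information(p0, np.linalg.matrix_power(P1, t)) for t in range(1, 13)]
print([round(I[k + 1] / I[k], 4) for k in range(len(I) - 1)])   # roughly constant, < 1
```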

We turn now to the proof of Proposition 7. For the Wright-Fisher model, it is well known that:

$$ \mathop {\lim }\limits_{t \to \infty } \Pr (X_{t} = j|X_{0} = i) = \begin{cases} i/N, & {\text{if }}j = N; \\ (N - i)/N, & {\text{if }}j = 0; \\ 0, & {\text{otherwise}}. \end{cases} $$

Thus,

$$ \mathop {\lim }\limits_{t \to \infty } H(X_{t} |X_{0} = i) = - \frac{i}{N}\log \left( {\frac{i}{N}} \right) - \left( {1 - \frac{i}{N}} \right)\log \left( {1 - \frac{i}{N}} \right). $$
(10)

Now, for i selected from the uniform distribution, \( \mathop {\lim }\limits_{t \to \infty } H(X_{t} |X_{0} ) = {\frac{1}{N + 1}}\sum\limits_{i = 0}^{N} {\lim_{t \to \infty } H(X_{t} |X_{0} = i)} , \) and so, from Eq. (10) (using the symmetry of the summands under \( i \leftrightarrow N - i \)), \( \mathop {\lim }\limits_{t \to \infty } H(X_{t} |X_{0} ) = \log (N) - {\frac{2}{N(N + 1)}}\sum\limits_{i = 1}^{N} i\log (i). \) Thus,

$$ \mathop {\lim }\limits_{t \to \infty } H(X_{t} |X_{0} ) = {\frac{2}{(N + 1)}}\sum\limits_{i = 1}^{N} - \frac{i}{N}\log \left( {\frac{i}{N}} \right)\sim 2\int\limits_{0}^{1} { - x\log (x){\text{d}}x} = \frac{1}{2}, $$

where ~ refers to asymptotic equivalence as \( N \to \infty \). In summary, we have the following inequality: \( \mathop {\lim }\limits_{N \to \infty } \mathop {\lim }\limits_{t \to \infty } H(X_{t} |X_{0} ) = \frac{1}{2} < \log (2) = \mathop {\lim }\limits_{N \to \infty } \mathop {\lim }\limits_{t \to \infty } H(X_{t} ). \) Now consider \( H(X_{1} |X_{0} ) \). Conditional on \( X_{0} = i \), the distribution of \( X_{1} \) under the Wright-Fisher model is binomial with N trials and probability \( p = \frac{i}{N} \). Thus, \( H(X_{1} |X_{0} = i) \) is the entropy of a binomial distribution with these parameters. Since i is uniformly distributed between 0 and N, it can be shown that \( H(X_{1} |X_{0} ) \) grows like log(N); so, for N sufficiently large, \( 0 = H(X_{0} |X_{0} ) < H(X_{1} |X_{0} ) \), and also \( H(X_{1} |X_{0} ) > \mathop {\lim }\limits_{t \to \infty } H(X_{t} |X_{0} ). \) This justifies the claims in Proposition 7.
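
For readers who wish to check these limits concretely, the following sketch (Python with NumPy and SciPy; the population size N = 100 is an arbitrary choice) builds the Wright-Fisher transition matrix and computes \( H(X_{t} |X_{0} ) \) for a uniform initial state, exhibiting \( 0 = H(X_{0} |X_{0} ) < H(X_{1} |X_{0} ) \) together with the much smaller large-t limit:

```python
import numpy as np
from scipy.stats import binom

def wright_fisher_matrix(N):
    """Row i of the transition matrix is Binomial(N, i/N) over j = 0..N."""
    i = np.arange(N + 1)
    return binom.pmf(i[None, :], N, (i / N)[:, None])

def cond_entropy(P, p0):
    """H(X_t | X_0) = sum_i p0[i] * entropy of row i of P."""
    L = np.log(P, out=np.zeros_like(P), where=P > 0)   # treat 0 log 0 as 0
    return float(p0 @ (-(P * L)).sum(axis=1))

N = 100                                   # hypothetical population size
P = wright_fisher_matrix(N)
p0 = np.full(N + 1, 1.0 / (N + 1))        # uniform initial state
Pinf = np.linalg.matrix_power(P, 20000)   # proxy for the t -> infinity limit
print(cond_entropy(np.eye(N + 1), p0))    # H(X_0|X_0) = 0
print(cond_entropy(P, p0))                # H(X_1|X_0), of order log(N)
print(cond_entropy(Pinf, p0))             # near 1/2 for large N
```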

Finally we justify Proposition 9. Let us suppose that \( \mathop {\lim }\nolimits_{t \to \infty } I(X_{0} ;X_{t} ) = 0 \); we will show that this implies the rows of Π are linearly dependent. Notice that the condition \( \mathop {\lim }\nolimits_{t \to \infty } I(X_{0} ;X_{t} ) = 0 \) implies (via Pinsker’s inequality) that: \( \mathop {\lim }\limits_{t \to \infty } \left[ {\Pr (X_{t} = j\& X_{0} = i) - \Pr (X_{t} = j)\Pr (X_{0} = i)} \right] = 0, \) for all pairs of states i, j, which in turn implies that:

$$ \mathop {\lim }\limits_{t \to \infty } \left[ {\Pr (X_{t} = j|X_{0} = i) - \Pr (X_{t} = j)} \right] = 0, $$
(11)

for all i, j (note that the event \( X_{0} = i \) has strictly positive probability for each state i, by the assumption that the mixture is proper). This further implies that the following condition holds for all states i, i′, j:

$$ \mathop {\lim }\limits_{t \to \infty } \left[ {\Pr (X_{t} = j|X_{0} = i) - \Pr (X_{t} = j|X_{0} = i^{\prime})} \right] = 0, $$
(12)

by using Eq. (11) twice (once for the conditioning event \( X_{0} = i \) and once for \( X_{0} = i^{\prime} \)). Let Z denote the index \( \alpha \in \{ 1, \ldots ,k\} \) of the model \( M_{\alpha} \) that is selected at the start of the process; thus Z = α with probability \( q_{\alpha} \). We have:

$$ \Pr (X_{t} = j|X_{0} = i) = \sum\limits_{\alpha = 1}^{k} \Pr (X_{t} = j|X_{0} = i\& Z = \alpha )\Pr (Z = \alpha |X_{0} = i). $$

Now, since the Markov process \( M_{\alpha} \) is regular, the Markov chain convergence theorem assures us that \( \mathop {\lim }\limits_{t \to \infty } \Pr (X_{t} = j|X_{0} = i\& Z = \alpha ) = \pi_{j}^{\alpha } \) for all i, and so:

$$ \mathop {\lim }\limits_{t \to \infty } \Pr (X_{t} = j|X_{0} = i) = \sum\limits_{\alpha = 1}^{k} \pi_{j}^{\alpha } \Pr (Z = \alpha |X_{0} = i). $$

Combining this with Eq. (12) gives the following constraint:

$$ \sum\limits_{\alpha = 1}^{k} \pi_{j}^{\alpha } \cdot (\Pr (Z = \alpha |X_{0} = i) - \Pr (Z = \alpha |X_{0} = i^{\prime})) = 0, $$
(13)

for all states i, i′, j. Now, suppose that:

$$ \Pr (Z = \alpha |X_{0} = i) - \Pr (Z = \alpha |X_{0} = i^{\prime}) = 0 , $$
(14)

for all states i, i′ and index α. We have \( \Pr (Z = \alpha |X_{0} = i) = \Pr (X_{0} = i|Z = \alpha )q_{\alpha } /\Pr (X_{0} = i) \) by Bayes’ theorem. Furthermore, since \( \Pr (X_{0} = i|Z = \alpha ) = p_{i}^{\alpha } \) (the initial state distribution for \( M_{\alpha} \)), Eq. (14) implies that \( p_{i}^{\alpha } /\Pr (X_{0} = i) = p_{{i^{\prime}}}^{\alpha } /\Pr (X_{0} = i^{\prime}) \) for all states i, i′ and index α. It follows that we can write \( p_{i}^{\alpha } = c_{i} d_{\alpha } \) for all states i and indices α, where \( c_{i} \) does not depend on α and \( d_{\alpha } \) does not depend on i (see note 31). Thus, from the identity \( \sum\limits_{i} {p_{i}^{\alpha } } = 1 = \sum\limits_{i} {p_{i}^{\beta } } \) for any two distinct indices α, β, we obtain \( d_{\alpha } = d_{\beta } \) and thus \( p_{i}^{\alpha } = p_{i}^{\beta } \) for all i; that is, the initial distributions of any two models are equal, in violation of our assumption.

Thus we may suppose that there exist states i, i′ and an index α for which \( \Pr (Z = \alpha |X_{0} = i) - \Pr (Z = \alpha |X_{0} = i^{\prime}) \ne 0 \). Then the row vector \( \Delta = [\Delta_{\alpha } ] \) defined by

$$ \Delta_{\alpha } : = \Pr (Z = \alpha |X_{0} = i) - \Pr (Z = \alpha |X_{0} = i^{\prime}), $$

for \( \alpha = 1,2, \ldots ,k \), is not the zero vector, and yet from Eq. (13) we have \( \Delta \Pi = 0, \) so the rows of Π are linearly dependent. This completes the proof.
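
A small numerical experiment makes the proposition concrete. In the sketch below (Python with NumPy; all parameter values are hypothetical), the two regular two-state chains have distinct initial distributions and linearly independent equilibrium rows, and the mutual information of the mixture levels off at a strictly positive value instead of decaying to zero:

```python
import numpy as np

def two_state(u, v):
    return np.array([[1 - u, u], [v, 1 - v]])

# Each entry is (q_alpha, one-step matrix of M_alpha, initial distribution p^alpha).
models = [(0.5, two_state(0.1, 0.3), np.array([0.9, 0.1])),
          (0.5, two_state(0.3, 0.1), np.array([0.2, 0.8]))]

def mixture_mutual_information(t):
    # Pr(X_t = j & X_0 = i) = sum_alpha q_alpha * p_i^alpha * Pr^alpha(X_t = j | X_0 = i)
    joint = sum(q * p[:, None] * np.linalg.matrix_power(M, t)
                for q, M, p in models)
    p0, pt = joint.sum(axis=1), joint.sum(axis=0)
    return float((joint * np.log(joint / np.outer(p0, pt))).sum())

print([round(mixture_mutual_information(t), 4) for t in (1, 10, 100, 1000)])
# the values level off at a strictly positive limit
```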

Part B: Proofs of other claims in Section “Non-Markov processes”

First we justify the analysis reported for the mixed model in which the lineage starts, with equal probability, in state 0 or 1. If the lineage starts in state 0, the process converges to an equilibrium of (x, 1 − x), where 0 < x < ½; if the lineage starts in state 1, the process converges, at the same rate, to an equilibrium of (1 − x, x). From (7) we have:

$$ I(X_{0} ;X_{t} ) = \sum\limits_{i = 0}^{1} {\sum\limits_{j = 0}^{1} {\Pr (X_{t} = j\& X_{0} = i)\log \left( {{\frac{{\Pr (X_{t} = j\& X_{0} = i)}}{{\Pr (X_{t} = j)\Pr (X_{0} = i)}}}} \right)} } . $$

Now, \( \Pr (X_{t} = j) = \frac{1}{2} \) for all t ≥ 0. Also, for i = 0,1, if \( {\Pr}^{i} \) refers to the Markov process that applies if the initial state is i, then \( {\Pr} (X_{t} = j\& X_{0} = i) = \frac{1}{2}{\Pr}^{i} (X_{t} = j|X_{0} = i), \) and so:

$$ I(X_{0} ;X_{t} ) = \sum\limits_{i = 0}^{1} \sum\limits_{j = 0}^{1} \frac{1}{2}{\Pr}^{i} (X_{t} = j|X_{0} = i)\log (2{\Pr}^{i} (X_{t} = j|X_{0} = i)). $$
(15)

Setting \( a(t): = x + (1 - x)e^{ - t/\gamma } ;\quad b(t): = (1 - x)(1 - e^{ - t/\gamma } ), \) gives:

$$ \begin{gathered} {\Pr}^{0} (X_{t} = 0|X_{0} = 0) = {\Pr}^{1} (X_{t} = 1|X_{0} = 1) = a(t); \hfill \\ {\Pr}^{0} (X_{t} = 1|X_{0} = 0) = {\Pr}^{1} (X_{t} = 0|X_{0} = 1) = b(t). \hfill \\ \end{gathered} $$

Consequently, by Eq. (15), we have:

$$ I(X_{0} ;X_{t} ) = a(t)\log (2a(t)) + b(t)\log (2b(t)). $$
(16)

Since \( a(t) = b(t) = \frac{1}{2} \) for the value \( t_{x} \) of t for which \( e^{ - t/\gamma } = {\frac{1 - 2x}{2(1 - x)}}, \) we have \( I(X_{0} ;X_{t} ) = 0 \) at \( t = t_{x} \). Routine analysis with Eq. (16) now gives \( \mathop {\lim }\nolimits_{t \to \infty } I(X_{0} ;X_{t} ) = \log (2) - g(x), \) as claimed.
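
Equation (16) is straightforward to evaluate numerically. The following sketch (Python with NumPy; the values x = 0.2 and γ = 1 are hypothetical, and g(x) = −x log x − (1 − x) log(1 − x) is read off from the limit of Eq. (16)) verifies both claims:

```python
import numpy as np

x, gamma = 0.2, 1.0                        # hypothetical values, 0 < x < 1/2
a = lambda t: x + (1 - x) * np.exp(-t / gamma)
b = lambda t: (1 - x) * (1 - np.exp(-t / gamma))

def I(t):                                  # Eq. (16)
    return a(t) * np.log(2 * a(t)) + b(t) * np.log(2 * b(t))

t_x = -gamma * np.log((1 - 2 * x) / (2 * (1 - x)))   # where a(t) = b(t) = 1/2
g = -x * np.log(x) - (1 - x) * np.log(1 - x)
print(abs(I(t_x)) < 1e-12)                     # True: I vanishes at t = t_x
print(abs(I(50.0) - (np.log(2) - g)) < 1e-9)   # True: the t -> infinity limit
```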

Finally, we justify the assertion that a finite mixture of Markov processes can be described as a lumped Markov process. Suppose a process \( X_{t} \) is described by a finite mixture of Markov processes \( M_{\alpha } \;(\alpha = 1, \ldots ,k) \), each on state space S, with model \( M_{\alpha } \) selected with probability \( q_{\alpha } \), and with \( p_{i}^{\alpha } \) as the initial probability of state i in model \( M_{\alpha } \). Let \( \Omega = \{ (s,\alpha ):s \in S,\;\alpha = 1, \ldots ,k\} \) and consider the following Markov process \( Z_{t} \) on Ω. Select the initial state \( (i,\alpha ) \in \Omega \) with probability \( q_{\alpha } \cdot p_{i}^{\alpha } \), and define transition probabilities on all ordered pairs of states from the set Ω as follows: \( \Pr (Z_{t} = (j,\alpha )|Z_{0} = (i,\alpha )) = \Pr^{\alpha } (X_{t} = j|X_{0} = i) \), where \( \Pr^{\alpha } (X_{t} = j|X_{0} = i) \) is the transition probability within the model \( M_{\alpha } \), and \( \Pr (Z_{t} = (j,\beta )|Z_{0} = (i,\alpha )) = 0 \) for all α ≠ β and all states i, j. Then, under the function \( f:\Omega \to S \) defined by \( f((s,\alpha )) = s \), the processes \( (X_{t} ;\;t \ge 0) \) and \( (f(Z_{t} );\;t \ge 0) \) have the same distribution; that is, mixtures are just a special case of lumped processes, as claimed.
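
To make the lumping construction concrete, here is a simulation sketch (Python with NumPy, using the same hypothetical two-state components as in the sketch for Proposition 9). It selects a component (i, α) with probability \( q_{\alpha } \cdot p_{i}^{\alpha } \), runs the within-component transitions, and returns the lumped trajectory f(Z_t):

```python
import numpy as np

rng = np.random.default_rng(0)

def two_state(u, v):
    return np.array([[1 - u, u], [v, 1 - v]])

# Hypothetical mixture components: (q_alpha, one-step matrix, initial dist. p^alpha).
models = [(0.5, two_state(0.1, 0.3), np.array([0.9, 0.1])),
          (0.5, two_state(0.3, 0.1), np.array([0.2, 0.8]))]

def lumped_trajectory(models, t_max):
    """Simulate Z_t on Omega = S x {1,...,k}, returning the lumped path f(Z_t)."""
    qs = np.array([q for q, _, _ in models])
    alpha = rng.choice(len(models), p=qs)   # choose the component: Z = alpha
    _, M, p = models[alpha]
    s = rng.choice(len(p), p=p)             # initial state; overall prob. q_a * p_i^a
    path = [s]
    for _ in range(t_max):
        # Transitions never leave the chosen component, mirroring
        # Pr(Z_t = (j, beta) | Z_0 = (i, alpha)) = 0 for beta != alpha.
        s = rng.choice(M.shape[1], p=M[s])
        path.append(s)
    return path                             # distributed as (X_t; t >= 0)

print(lumped_trajectory(models, 10))
```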

Cite this article

Sober, E., Steel, M. Entropy increase and information loss in Markov models of evolution. Biol Philos 26, 223–250 (2011). https://doi.org/10.1007/s10539-010-9239-x
