Introduction

Manfred Eigen’s evolution model (Eigen 1971; Eigen 1996) stands as a landmark attempt at coherently relating evolution, molecular biology, and information theory. That work views selection as condensation in information space, and evolution as a succession of phase transformations. Indeed, Eigen’s quasispecies model, with its error catastrophe, corresponds exactly to a phase transition in a two dimensional Ising system (Leuthausser 1986). The essence of Eigen’s paradox is that the error catastrophe limits genome length in RNA precursor organisms to much less than observed in DNA organisms having error correcting enzymes, which, themselves, cannot be created in the absence of just such a long genome.

As Holmes (2005) put it,

To create more genetic complexity, it is therefore necessary to encode more information in longer genes by using a replication system with greater fidelity. But there’s the catch: to replicate with greater fidelity requires a more accurate and hence complex replication enzyme, but such an enzyme cannot be created because this will itself require a longer gene, and longer genes will breach the error threshold.

Here we will reconsider Eigen’s paradox from a highly formal perspective which hews quite closely to the fundamental asymptotic limit theorems of information theory, in particular the Rate Distortion Theorem, and its zero-error limit, the Shannon–McMillan, or Source Coding Theorem. These, like the Central Limit Theorem in parametric statistics, permit derivation of ‘regression-like’ models which can be applied to real data. We use these theorems in a principled manner to derive the high rate of mutation inherent to RNA virus replication and suggest a plausible ‘Red Queen’ coevolutionary ratchet (Van Valen 1973) leading toward an evolutionary condensation resulting in effective error-correction mechanisms.

Our program has these components:

  1. An increasingly complicated network of simple interacting ‘RNA-like’ organisms creates a collective biochemical system – a ‘vesicle’ – which, as a parallel communication channel, can have a much higher channel capacity for low-error replication than do the individual components.

  2. Several such distinct, properly interacting, collectives – compartments in the sense of Eigen and Szathmary – become each other’s most intimate environments, generating a coevolutionary ratchet resulting in a Red Queen structure which, given sufficient energy, can support quasi-stable states with very low reproductive error rates.

  3. High error rate, but low energy, systems-of-vesicles can become subject to systematic ‘large deviations’ excursions which, over sufficient time, can lead to the establishment of a distribution of low error rate, but higher energy, chemical systems: even prebiological quasi-organisms can, apparently, build pyramids, as it were. Thus many different chemistry-of-life solutions seem possible, and these, of course, would be subject to evolutionary processes of selection and chance extinction.

  4. The latter steps depend critically on the availability of adequate energy sources, which, we hold, will be largely driven by changes in the protoecosystem, i.e., ecosystem resilience domain shifts, in the sense of Holling (1973) (Gunderson 2000; Wallace and Wallace 2008). One imagines availability of a new metabolic cycle, onset of predation, and/or crude photosynthesis as possible examples.

This is not, except perhaps for [4], a particularly new perspective, being essentially similar to the ‘bags of genes’ (Holmes 2005) model of Szathmary and colleagues (Szathmary and Demeter 1987; Szathmary 1989, 2006; Fontanari et al. 2006) and, of course, to Eigen’s hypercycle compartment model (Eigen 1996), but with hypercycles generalized to broad coevolutionary interactions. Our innovations lie in a highly formal use of the asymptotic limit theorems of information theory and in the systematic extension of phase transition techniques from statistical physics to information theory via ‘biological’ renormalizations.

Although the basic line of reasoning is fairly straightforward, a number of mathematical tasks need to be confronted. The first is to reconfigure the error catastrophe of Eigen’s model in terms of average distortion. That done, the famous homology between information source uncertainty and free energy density can be extended to the rate distortion function, and, since the rate distortion limit is always a non-increasing convex function of the distortion, the basic model emerges as a kind of energy minimization near the error catastrophe. Pettini’s (2007) topological hypothesis regarding the relation between topological shifts in structure and phase transition can then be applied, using characteristic ‘biological renormalizations’ to specify different forms of phase transition. A theory of ‘all possible’ Eigen models is direct. For analogs to the Gaussian channel, zero distortion – no error at all – is unattainable, requiring infinite energy. Such systems would, then, always be subject to some mutational variation.

A standard coevolutionary argument applied to two or more vesicles interacting through a mutual information crosstalk produces the essential results, crudely similar to the set of stable strategies of elaborate stochastic coevolutionary game theory models (e.g., Champagnat et al. 2006). Again, the simplest possible system has two quasi-equilibrium points, one near the error limit, which is the low energy solution, and the other near zero error, the high energy solution.

The many different possible chemical strategies in this broad spread of possible solutions would themselves become higher order Darwinian individuals subject to the vagaries of evolutionary process in the context of large deviations, producing, on our planet, the familiar RNA/DNA chemistry basis of life. Other possibilities, however, seem particularly interesting, and may perhaps be observed in vitro, or ex planeta.

Reconsidering the Eigen Model

Following Campos and Fontanari (1999), the Eigen quasispecies model can be characterized in its binary version as follows:

A molecule is represented by a string of L digits, \((s_{1}, s_{2}, ..., s_{L})\), with the variates \(s_{\alpha}\) allowed to take only two different values, each of which represents a different type of monomer used to build the molecule. The concentrations \(x_{i}\) of molecules of type \(i = 1, 2, ..., 2^{L}\) evolve in time according to the equations

$$ dx_{i}/dt = \sum\nolimits_{j}W_{i,j}x_{j} - \Phi(t)x_{i} $$
(1)

where Φ(t) is a dilution flux that keeps the total concentration constant, determined by the condition that \(\sum_{i} dx_{i}/dt = 0\). Taking \(\sum_{i} x_{i} = 1\) gives

$$ \Phi = \sum\nolimits_{i,j}W_{i,j}x_{\!j}. $$
(2)

The elements of the replication matrix \(W_{i,j}\) depend on the replication rate or fitness \(A_{i}\) of the strings of type i, as well as on the Hamming distance \(d(i,j)\) between strings i and j. They are

$$ W_{i,i} = A_{i}q^{L} $$
(3)

and

$$\label{eq4} W_{i,j}=A_{j}q^{L-d(i,j\,)}[1-q]^{d(i,j\,)}, i \neq j. $$
(4)

Here 0 ≤ q ≤ 1 is the single-digit replication accuracy, assumed the same for all digits. Again, Leuthausser (1986) shows this model corresponds exactly to that of a two dimensional Ising system.

The famous figure on p. 83 of Eigen (1996), taken from Swetina and Schuster (1982), shows a numerical realization of the model for L = 50. The fittest sequence has Hamming distortion d = 0, and is characterized as the ‘Master’ sequence. As the error rate (1 − q) increases from zero, the proportion of the population corresponding to the master sequence, \(x_{0}\), declines, while the proportions corresponding to other distortion values rise accordingly. The essential point, however, is that, for all values below the critical error threshold, the average across the quasispecies – across all the \(x_{d}\) – what Eigen calls the ‘consensus sequence’, remains precisely the master sequence, even while the proportion constituting the master sequence itself falls. Beyond the error threshold, however, that consensus average is lost, and the distributions become strictly random: the information of the master sequence has been dissipated.

Our interest is in reconsidering this effect from the perspective of the average distortion. As the error threshold is approached from below, the proportion corresponding to the master sequence, having d = 0, declines, while the proportion of the population having d > 0 increases. Thus the average distortion, \(D = \sum_{j}d_{j}x_{j}\), with \(\sum_{i}x_{i}=1\), itself increases monotonically as the error rate approaches the error threshold, until a critical value of the average distortion D is reached.
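To make this reformulation concrete, the following minimal Python sketch computes the equilibrium quasispecies of Eqs. 1–4 as the dominant eigenvector of the replication matrix W and reports the average distortion \(D = \sum_{j}d_{j}x_{j}\) as the single-digit accuracy q declines. The string length, fitness values, and grid of q values are illustrative assumptions only, not parameters taken from the text.

```python
import numpy as np

# Minimal numerical sketch of the binary Eigen model (Eqs. 1-4), tracking the
# average Hamming distortion D = sum_j d_j x_j as the per-digit accuracy q falls.
# L = 8 and the single-peak fitness A_0 = 10 are illustrative assumptions.

L = 8                      # string length; all 2^L genotypes are enumerated
A0, A_other = 10.0, 1.0    # master sequence i = 0 is the fittest

def hamming(i, j):
    return bin(i ^ j).count("1")

def quasispecies(q):
    """Equilibrium concentrations x_i: the dominant eigenvector of W (Eqs. 3-4)."""
    n = 2 ** L
    A = np.full(n, A_other)
    A[0] = A0
    W = np.array([[A[j] * q ** (L - hamming(i, j)) * (1 - q) ** hamming(i, j)
                   for j in range(n)] for i in range(n)])
    vals, vecs = np.linalg.eig(W)
    x = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return x / x.sum()

for q in (0.999, 0.99, 0.95, 0.90, 0.85, 0.80, 0.75, 0.72):
    x = quasispecies(q)
    D = sum(hamming(i, 0) * xi for i, xi in enumerate(x))
    print(f"q = {q:.3f}   x_0 = {x[0]:.3f}   average distortion D = {D:.3f}")
```

For these illustrative values the average distortion rises steadily as q falls and then climbs toward L/2 once q drops past roughly \(A_{0}^{-1/L}\), the familiar sharp-peak error threshold, mirroring the loss of the consensus sequence described above.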

The Rate Distortion Theorem, described in more detail below, states that there is a minimum necessary rate of transmission of information, R(D), through an information channel for any given average distortion D. That is, rates of transmission – channel capacities – greater than R(D) are guaranteed to have average distortion less than D for any chosen distortion measure (Cover and Thomas 1991; Dembo and Zeitouni 1998).

We have translated the Eigen model from a focus on a single error threshold into one involving a critical value of average distortion in the transmission of information, and this will prove to be important for our subsequent modeling exercise.

We will, like others, attempt to solve Eigen’s paradox by invoking a population of related mutants as constituting a parallel transmission channel having a sufficiently high capacity to ensure collective reproductive fidelity in spite of individual molecular reproductive errors: The ‘consensus average’ becomes the essential reproductive message, and different such populations are separated into compartments or vesicles in the sense of Szathmary and Demeter (1987). These interact to become each other’s principal environments, engaging in a crosstalk that enables the coevolutionary ratchet. Many versions of this mechanism may have developed, becoming subject to evolutionary selection and chance extinction. Certainly both Eigen’s compartment hypercycle (Eigen 1996) and Szathmary’s stochastic corrector (Szathmary and Demeter 1987) solve the essential problem. Likely, so too may a plethora of different structures, or, as Szathmary (1989) comments, ‘while it is true that hypercycles need compartments, do compartments need hypercycles?’. Here we will attempt a kind of general compartment model, using a broad information-theoretic brush.

Genetic Inheritance as an Information Source

Eigen’s and Szathmary’s perspective is not the only information theoretic analysis of reproduction, although it certainly takes things to their earliest history. Adami et al. (2000) make a case for reinterpreting the transmission of genetic heritage in terms of a formal information process. They assert that genomic complexity can be identified with the amount of information a sequence stores about its environment: genetic complexity can be defined in a consistent information-theoretic manner. In their view, information cannot exist in a vacuum and must have an instantiation. For biological systems the instantiation of information is DNA. To some extent it is the blueprint of an organism and thus information about its own structure. More specifically, it is a blueprint of how to build an organism that can best survive in its native environment, and pass on that information to its progeny. They assert that an organism’s DNA thus is not only a ‘book’ about the organism, but also a book about the environment it lives in, including the species it coevolves with. They identify the complexity of genomes by the amount of information they encode about the world in which they have evolved.

Ofria et al. (2003) continue in the same direction and argue that genomic complexity can be defined rigorously within standard information theory as the information the genome of an organism contains about its environment. From the point of view of information theory, it is convenient to view evolution on the molecular level as a collection of information transmission channels, subject to a number of constraints. In these channels, they state, the organism’s genomes code for the information (a message) to be transmitted from progenitor to offspring, and are subject to noise due to an imperfect replication process. Information theory is concerned with analyzing the properties of such channels, how much information can be transmitted and how the rate of perfect information transmission of such a channel can be maximized.

Adami and Cerf (2000) argue, using simple models of genetic structure, that the information content, or complexity, of a genomic string by itself (without referring to an environment) is a meaningless concept and a change in environment (catastrophic or otherwise) generally leads to a pathological reduction in complexity.

The transmission of genetic information is thus a contextual matter which involves operation of an information source which, according to this development, must interact with embedding (ecosystem) structures.

It is much in this spirit that we now address the Eigen paradox, beginning with a brief review of some facts and theorems from information theory.

Information Sources and Their Dynamics

Channel Capacity

Messages from a source, seen as symbols \(x_{j}\) from some alphabet, each having probabilities \(P_{j}\) associated with a random variable X, are ‘encoded’ into the language of a ‘transmission channel’, a random variable Y with symbols \(y_{k}\), having probabilities \(P_{k}\), possibly with error. Someone receiving the symbol \(y_{k}\) then retranslates it (without error) into some \(x_{k}\), which may or may not be the same as the \(x_{j}\) that was sent.

More formally, the message sent along the channel is characterized by a random variable X having the distribution

$$P(X=x_{\!j})=P_{\!j}, j=1,...,M.$$

The channel through which the message is sent is characterized by a second random variable Y having the distribution

$$P(Y=y_{k})=P_{k}, k=1,...,L.$$

Let the joint probability distribution of X and Y be defined as

$$P(X=x_{\!j},Y=y_{k})=P(x_{\!j},y_{k})=P_{\!j,k} $$

and the conditional probability of Y given X as

$$P(Y=y_{k} \mid X=x_{\!j}) = P(y_{k} \mid x_{\!j}).$$

Then the Shannon uncertainty of X and Y independently and the joint uncertainty of X and Y together are defined respectively as

$$\begin{array}{lll}\label{eq5} H(X)&=&-\sum\nolimits_{j=1}^M P_{\!j}\log(P_{\!j})\\ H(Y)&=&-\sum\nolimits_{k=1}^L P_{k}\log(P_{k})\\ H(X,Y)&=&-\sum\nolimits_{j=1}^M \sum\nolimits_{k=1}^L P_{j,k} \log(P_{j,k}). \end{array}$$
(5)

The conditional uncertainty of Y given X is defined as

$$\label{eq6} H(Y \mid X)=-\sum\nolimits_{j=1}^M \sum\nolimits_{k=1}^L P_{j,k} \log[P(y_{k} \mid x_{\!j})] $$
(6)

For any two stochastic variates X and Y, H(Y) ≥ H(Y |X), as knowledge of X generally gives some knowledge of Y. Equality occurs only in the case of stochastic independence.

Since \(P(x_{j}, y_{k}) = P(x_{j})P(y_{k} \mid x_{j})\), we have

$$ H(X \mid Y)=H(X,Y) - H(Y) $$

The information transmitted by translating the variable X into the channel transmission variable Y – possibly with error – and then retranslating without error the transmitted Y back into X is defined as

$$ I(X \mid Y) \equiv H(X) - H(X \mid Y) = H(X) + H(Y) - H(X,Y). $$
(7)

See, for example, Ash (1990), Khinchin (1957) or Cover and Thomas (1991) for a modern treatment. The original development can be found in Shannon and Weaver (1949) and Shannon (1957). The essential point is that if there is no uncertainty in X given the channel Y, then there is no loss of information through transmission.

In general this will not be true, and herein lies the essence of the theory.
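A small worked example may help fix the definitions of Eqs. 5–7. The joint distribution \(P_{j,k}\) below is an arbitrary illustrative choice; the sketch simply evaluates H(X), H(Y), H(X,Y), and the transmitted information of Eq. 7.

```python
import numpy as np

# Worked example of Eqs. 5-7 for a 2-symbol source X and 2-symbol channel
# variable Y. The joint distribution is an arbitrary illustrative choice.

P = np.array([[0.40, 0.10],       # P_{j,k} = P(X = x_j, Y = y_k)
              [0.05, 0.45]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_X  = H(P.sum(axis=1))           # marginal over Y
H_Y  = H(P.sum(axis=0))           # marginal over X
H_XY = H(P.ravel())               # joint uncertainty
I = H_X + H_Y - H_XY              # Eq. 7, the transmitted information
print(f"H(X) = {H_X:.3f}, H(Y) = {H_Y:.3f}, H(X,Y) = {H_XY:.3f}")
print(f"transmitted information = {I:.3f} bits")
```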

Given a fixed vocabulary for the transmitted variable X, and a fixed vocabulary and probability distribution for the channel Y, we may vary the probability distribution of X in such a way as to maximize the information sent. The capacity of the channel is defined as

$$\label{eq8} C \equiv \mathop{\max\nolimits_{P(X)}} I(X \mid Y), $$
(8)

subject to the subsidiary condition that ∑ P(X) = 1.

The central mechanism of the Shannon Coding Theorem which enables the sending of a message with arbitrarily small error along the channel Y at any rate R < C is to encode it in longer and longer ‘typical’ sequences of the variable X; that is, those sequences whose distribution of symbols approximates the probability distribution P(X) above which maximizes C.

If S(n) is the number of such ‘typical’ sequences of length n, then

$$ \log[S(n)] \approx n H(X) $$

where H(X) is the uncertainty of the stochastic variable defined above. Some consideration shows that S(n) is much less than the total number of possible messages of length n. Thus, as n → ∞, only a vanishingly small fraction of all possible messages is meaningful in this sense. This observation, after some considerable development, is what allows the Coding Theorem to work so well. In sum, the prescription is to encode messages in typical sequences, which are sent at very nearly the capacity of the channel. As the encoded messages become longer and longer, their maximum possible rate of transmission without error approaches channel capacity as a limit. Again, Ash (1990), Khinchin (1957) and Cover and Thomas (1991) provide details.
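The splitting into a vanishingly small set of typical sequences can be checked by direct counting for a biased binary source. In the sketch below the sequence length n, the bias p, and the typicality window eps are arbitrary illustrative choices.

```python
import math

# Count 'typical' binary sequences: those whose fraction of 1s lies within eps
# of P(1) = p. Compares log2[S(n)]/n with H(X) and shows the typical set is a
# vanishingly small fraction of all 2^n sequences. Parameters are arbitrary.

p, n, eps = 0.3, 200, 0.05
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
S = sum(math.comb(n, k) for k in range(n + 1) if abs(k / n - p) <= eps)
print("H(X)           =", round(H, 4))
print("log2[S(n)] / n =", round(math.log2(S) / n, 4))
print("fraction of all 2^n sequences that are typical:", f"{S / 2**n:.2e}")
```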

A trivial but important generalization of the simple channel without memory is the Parallel Channel. Consider a set of K discrete memoryless channels having capacities \(C_{1}, ..., C_{K}\). Assume these are connected in parallel in the sense that each unit of time an arbitrary symbol is transmitted and received over each channel. The input \(x = (x_{1}, ..., x_{K})\) to the parallel channel is a K-vector whose components are inputs to the individual channels, and the output \(y = (y_{1}, ..., y_{K})\) is a K-vector whose components are the individual channel outputs. The capacity of the parallel channel is

$$ C_{Total} = \sum\nolimits_{i=1}^{K}C_{i}. $$
(9)

Thus the consensus sequence in a compartment or vesicle, which is clearly a parallel channel, can be transmitted at a much higher overall rate than is possible using individual replicators. This is no small matter.
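As a toy illustration of Eq. 9, suppose each replicator in a vesicle is modeled, purely as an assumption for this sketch, as a binary symmetric channel with its own error rate; the collective ‘vesicle’ capacity is then the sum of the individual capacities.

```python
import numpy as np

def bsc_capacity(p):
    """Capacity (bits/symbol) of a binary symmetric channel with error rate p."""
    if p in (0.0, 1.0):
        return 1.0
    h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    return 1.0 - h

# One noisy replicator per channel; error rates are arbitrary illustrative values.
error_rates = [0.05, 0.08, 0.10, 0.12, 0.15]
individual = [bsc_capacity(p) for p in error_rates]
print("individual capacities       :", np.round(individual, 3))
print("parallel (vesicle) capacity :", round(sum(individual), 3))   # Eq. 9
```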

The Shannon–McMillan Theorem

Not all statements – sequences of the random variable X – are equivalent. According to the structure of the underlying language of which the message is a particular expression, some messages are more ‘meaningful’ than others, that is, in accord with the grammar and syntax of the language. The other principal result from information theory, the Shannon–McMillan or Asymptotic Equipartition Theorem, describes how messages themselves are to be classified.

Suppose a long sequence of symbols is chosen, using the output of the random variable X above, so that an output sequence of length n, with the form

$$x_{n}=(\alpha_{0}, \alpha_{1}, ... , \alpha_{n-1}) $$

has joint and conditional probabilities

$$\begin{array}{lll} &&P\big(X_{0}=\alpha_{0}, X_{1}=\alpha_{1}, ... ,X_{n-1}=\alpha_{n-1}\big)\\ &&P\big(X_{n}=\alpha_{n} \mid X_{0}=\alpha_{0}, ... ,X_{n-1}=\alpha_{n-1}\big). \end{array} $$
(10)

Using these probabilities we may calculate the conditional uncertainty

$$ H\big(X_{n} \mid X_{0}, X_{1}, ... ,X_{n-1}\big). $$

The uncertainty of the information source, H[X], is defined as

$$ H[\mathbf{X}] \equiv \mathop{\lim\nolimits_{n \rightarrow \infty}} H\big(X_{n} \mid X_{0}, X_{1}, ... , X_{n-1}\big). $$
(11)

In general

$$H\big(X_{n} \mid X_{0}, X_{1}, ... ,X_{n-1}\big) \leq H(X_{n}).$$

Only if the random variables \(X_{j}\) are all stochastically independent does equality hold. If there is a smallest n such that, for all m > 0,

$$ H\big(X_{n+m} \mid X_{0}, ... ,X_{n+m-1}\big)=H\big(X_{n} \mid X_{0}, ... ,X_{n-1}\big), $$

then the source is said to be of order n. It is easy to show that

$$H[\mathbf{X}]=\lim_{n \rightarrow \infty} \frac{H\big(X_{0}, ... X_{n}\big)}{n+1}. $$

In general the outputs of the X j , j = 0, 1, ... ,n are dependent. That is, the output of the communication process at step n depends on previous steps. Such serial correlation, in fact, is the very structure which enables most of what follows.

Here, however, the processes are all assumed stationary in time: the underlying probabilities, and hence the serial correlations, do not change in time.
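A concrete instance of Eqs. 11 and 13: for a stationary Markov source the limiting conditional uncertainty can be evaluated directly, and conditioning on the past lowers the uncertainty whenever the outputs are serially correlated. The transition matrix used below is an arbitrary illustrative choice.

```python
import numpy as np

# Source uncertainty H[X] of a 2-state Markov source: for an order-1 source the
# limit in Eq. 11 equals sum_i pi_i * H(row_i of P), where pi is the stationary
# distribution. The transition matrix P is an arbitrary illustrative choice.

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

vals, vecs = np.linalg.eig(P.T)                 # stationary pi solves pi P = pi
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

H_source = np.sum(pi * np.array([H(row) for row in P]))
print("stationary distribution        :", np.round(pi, 3))
print("source uncertainty H[X] (bits) :", round(H_source, 4))
print("unconditioned H(X_n) (bits)    :", round(H(pi), 4))  # conditioning only reduces uncertainty
```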

A very broad class of such self-correlated information sources, the so-called ergodic sources, for which the long-run relative frequency of a sequence converges stochastically to the probability assigned to it, has a particularly interesting property:

It is possible, in the limit of large n, to divide all sequences of outputs of an ergodic information source into two distinct sets, \(S_{1}\) and \(S_{2}\), having, respectively, very high and very low probabilities of occurrence, with the source uncertainty providing the splitting criterion. In particular the Shannon–McMillan Theorem states that, for a (long) sequence having n (serially correlated) elements, the number of ‘meaningful’ sequences, N(n) – those belonging to set \(S_{1}\) – will satisfy the relation

$$\label{eq12} \log[N(n)]/n \approx H[\mathbf{X}]. $$
(12)

More formally,

$$\begin{array}{lll}\label{eq13} &&\mathop{\lim\nolimits_{n \rightarrow \infty}} \log[N(n)]/n = H[\mathbf{X}]\\ &&{\kern32pt}= \mathop{\lim\nolimits_{n \rightarrow \infty}} H\big(X_{n} \mid X_{0}, ... ,X_{n-1}\big)\\ &&{\kern32pt}= \mathop{\lim\nolimits_{n \rightarrow \infty}} H\big(X_{0}, ... ,X_{n}\big)/(n+1). \end{array} $$
(13)

The Shannon Coding theorem, by means of an analogous splitting argument, shows that for any rate R < C, where C is the channel capacity, a message may be sent without error, using the probability distribution for X which maximizes I(X |Y) as the coding scheme. Using the internal structures of the information source permits limiting attention only to meaningful sequences of symbols. This restriction can greatly raise the maximum possible rate at which information can be transmitted with arbitrarily small error: if there are M possible symbols and the uncertainty of the source is H[X], then the effective capacity of the channel C, using this ‘source coding,’ becomes (Ash 1990)

$$ C_{E} = C \log(M)/H[\mathbf{X}]. $$
(14)

As H[X] ≤ log(M), with equality only for stochastically independent, uniformly distributed random variables,

$$ C_{E} \geq C. $$
(15)

Note that, for a given channel capacity, the condition

$$\label{eq16} H[\mathbf{X}] \leq C $$
(16)

always holds.
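A quick numeric illustration of Eqs. 14 and 15 follows; the alphabet size, source distribution, and raw capacity are arbitrary illustrative values, not taken from the text.

```python
import math

# Effective capacity under source coding, C_E = C log(M) / H[X] >= C (Eqs. 14-15).
# The alphabet, distribution, and raw capacity C are arbitrary illustrative values.

C = 1.0                                   # raw channel capacity, bits per symbol
P = [0.7, 0.1, 0.1, 0.1]                  # skewed source over M = 4 symbols
M = len(P)
H = -sum(p * math.log2(p) for p in P)     # source uncertainty H[X], bits
C_E = C * math.log2(M) / H
print(f"H[X] = {H:.3f} bits,  log2(M) = {math.log2(M):.3f}")
print(f"C_E  = {C_E:.3f}  (>= C = {C})")
```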

Source uncertainty has a very important heuristic interpretation. As Ash (1990) puts it,

...[W]e may regard a portion of text in a particular language as being produced by an information source. The probabilities \(P(X_{n}=\alpha_{n} \mid X_{0}=\alpha_{0},...,X_{n-1}=\alpha_{n-1})\) may be estimated from the available data about the language; in this way we can estimate the uncertainty associated with the language. A large uncertainty means, by the [Shannon–McMillan Theorem], a large number of ‘meaningful’ sequences. Thus given two languages with uncertainties \(H_{1}\) and \(H_{2}\) respectively, if \(H_{1} > H_{2}\), then in the absence of noise it is easier to communicate in the first language; more can be said in the same amount of time. On the other hand, it will be easier to reconstruct a scrambled portion of text in the second language, since fewer of the possible sequences of length n are meaningful.

The Rate Distortion Theorem

The Shannon–McMillan Theorem is the zero error limit of the Rate Distortion Theorem (Dembo and Zeitouni 1998; Cover and Thomas 1991), which can be expressed in terms of a splitting criterion that identifies high probability pairs of sequences. We follow closely the treatment of Cover and Thomas (1991).

The origin of the problem is the question of representing one information source by a simpler one in such a way that the least information is lost. For example we might have a continuous variate between 0 and 100, and wish to represent it in terms of a small set of integers in a way that minimizes the inevitable distortion that process creates. Typically, for example, an analog audio signal will be replaced by a ‘digital’ one. The problem is to do this in a way which least distorts the reconstructed audio waveform.

Suppose the original stationary, ergodic information source Y with output from a particular alphabet generates sequences of the form

$$y^{n}=y_{1}, ... , y_{n}. $$

These are ‘digitized,’ in some sense, producing a chain of ‘digitized values’

$$b^{n}=b_{1}, ... , b_{n}, $$

where the b-alphabet is much more restricted than the y-alphabet.

\(b^{n}\) is, in turn, deterministically retranslated into a reproduction of the original signal \(y^{n}\). That is, each \(b^{n}\) is mapped onto a unique n-length y-sequence in the alphabet of the information source Y:

$$ b^{n} \rightarrow \hat{y}^{n}=\hat{y}_{1}, ..., \hat{y}_{n}. $$

Note, however, that many y n sequences may be mapped onto the same retranslation sequence \(\hat{y}^{n}\), so that information will, in general, be lost.

The central problem is to minimize that loss.

The retranslation process defines a new stationary, ergodic information source, \(\hat{Y}\).

The next step is to define a distortion measure, \(d(y, \hat{y})\), which compares the original to the retranslated path. For example the Hamming distortion is

$$\begin{array}{lll} d(y, \hat{y}) &=& 1, y \neq \hat{y}\\ d(y, \hat{y}) &=& 0, y = \hat{y}. \end{array} $$
(17)

For continuous variates the Squared error distortion is

$$ d(y, \hat{y}) = (y - \hat{y})^{2}. $$
(18)

The Rate Distortion Theorem applies to a broad class of possible distortion measures.

The distortion between paths y n and \(\hat{y}^{n}\) is defined as

$$ d(y^{n}, \hat{y}^{n})=(1/n)\sum\nolimits_{j=1}^{n}d(y_{\!j}, \hat{y}_{\!j}). $$
(19)

Suppose that with each path \(y^{n}\), and each \(b^{n}\)-path retranslation into the y-language, denoted \(\hat{y}^{n}\), there are associated individual, joint, and conditional probability distributions

$$p(y^{n}), p(\hat{y}^{n}), p(y^{n} \mid \hat{y}^{n}).$$

The average distortion is defined as

$$ D \equiv \sum\nolimits_{y^{n}}p(y^{n})d(y^{n},\hat{y}^{n}). $$
(20)

It is possible, using the distributions given above, to define the information transmitted from the incoming Y to the outgoing \(\hat{Y}\) process in the usual manner, using the Shannon source uncertainty of the strings:

$$I(Y, \hat{Y}) \equiv H(Y) - H(Y \mid \hat{Y}) = H(Y) + H(\hat{Y}) - H(Y, \hat{Y}).$$

If there is no uncertainty in Y given the retranslation \(\hat{Y}\), then no information is lost.

In general, this will not be true.

The information rate distortion function R(D) for a source Y with a distortion measure \(d(y, \hat{y})\) is defined as

$$\label{eq21} R(D)={\min\limits_{p(y,\hat{y});\sum_{(y,\hat{y})}p(y)p(y \mid \hat{y})d(y,\hat{y})\leq D}}I(Y,\hat{Y}). $$
(21)

The minimization is over all conditional distributions \(p(y \mid \hat{y})\) for which the joint distribution \(p(y,\hat{y})=p(y)p(y \mid \hat{y})\) satisfies the average distortion constraint (i.e., average distortion ≤ D).

The Rate Distortion Theorem states that R(D) is the minimum necessary rate of information transmission (effectively, the channel capacity) so that the average distortion does not exceed the distortion D. Cover and Thomas (1991) or Dembo and Zeitouni (1998) provide details.

Pairs of sequences \((y^{n}, \hat{y}^{n})\) can be defined as distortion typical; that is, for a given average distortion D, defined in terms of a particular measure, pairs of sequences can be divided into two sets, a high probability one containing a relatively small number of (matched) pairs with \(d(y^{n}, \hat{y}^{n}) \leq D\), and a low probability one containing most pairs. As n → ∞, the smaller set approaches unit probability, and, for those pairs,

$$ p\big(\hat{y}^{n}\big) \geq p\big(\hat{y}^{n} \mid y^{n}\big)\exp[-nI(Y, \hat{Y})]. $$
(22)

Thus, roughly speaking, \(I(Y, \hat{Y})\) embodies the splitting criterion between high and low probability pairs of paths.

For the theory of interacting information sources, then, \(I(Y, \hat{Y})\) can play the role of H in the dynamic treatment that follows.

The rate distortion function of Eq. 21 can actually be calculated in many cases by using a Lagrange multiplier method – see Section 13.7 of Cover and Thomas (1991). For a simple Gaussian channel having noise with zero mean and variance \(\sigma^{2}\),

$$\begin{array}{lll} R(D) &=& 1/2 \log[\sigma^{2}/D], 0 \leq D \leq \sigma^{2}\\ R(D) &=& 0, D > \sigma^{2}. \end{array} $$
(23)

For this particular channel, zero distortion, no mutations at all, requires an infinite channel capacity, which, we will show, requires infinite energy.

A second important observation is that any rate distortion function R(D), following the arguments of Cover and Thomas (1991, Lemma 13.4.1) is necessarily a decreasing convex function of D, that is, a reverse-J-shaped curve. This requirement, like the singularity of Gaussian-like channels at zero distortion, has profound consequences for replication dynamics.
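One standard numerical route to such calculations, not described in the text, is the Blahut–Arimoto iteration, which realizes the Lagrange multiplier idea by sweeping a slope parameter s < 0 along the convex R(D) curve. The sketch below applies it to a binary source with Hamming distortion, for which the closed form R(D) = H(p) − H(D) is available as a check; all parameter values are illustrative.

```python
import numpy as np

def blahut_arimoto_RD(p_y, d, s, iters=500):
    """One point (D, R) on the rate distortion curve for slope parameter s < 0."""
    n_y, n_yhat = d.shape
    q = np.full(n_yhat, 1.0 / n_yhat)           # output marginal q(yhat)
    for _ in range(iters):
        A = q * np.exp(s * d)                   # unnormalized p(yhat | y)
        cond = A / A.sum(axis=1, keepdims=True)
        q = p_y @ cond                          # update the output marginal
    D = np.sum(p_y[:, None] * cond * d)
    R = np.sum(p_y[:, None] * cond * np.log2(cond / q))   # mutual information, bits
    return D, R

# Uniform binary source with Hamming distortion; closed form R(D) = 1 - H(D).
p_y = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
for s in (-1.0, -2.0, -4.0, -8.0):
    D, R = blahut_arimoto_RD(p_y, d, s)
    print(f"s = {s:5.1f}   D = {D:.4f}   R(D) = {R:.4f} bits")
```

Sweeping s traces out exactly the reverse-J shape just described: R(D) climbs toward the full source uncertainty as D falls toward zero and drops to zero as D grows.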

Information System Phase Transitions

As many have noted, the first part of Eq. 13,

$$ H \equiv \lim_{n \rightarrow \infty} \log[N(n)]/n, $$

is homologous to the thermodynamic limit in the definition of the free energy density of a physical system. This has the form

$$\label{eq24} F(K) ={\lim_{V \rightarrow \infty}}\log[Z(K)]/V, $$
(24)

where F is the free energy density, K the inverse temperature, V the system volume, and Z(K) is the partition function defined by the system Hamiltonian. Any good statistical mechanics text will provide details (e.g., Landau and Lifshitz 2007).

The Appendix shows how this homology permits a highly natural generalization of standard renormalization methods from statistical mechanics to information theory, producing phase transitions and analogs to evolutionary punctuation in systems characterized by piecewise, adiabatically stationary, ergodic information sources. Thus biological phase changes appear to be ubiquitous in natural systems.

The approach uses a mean field approximation in which average strength (or probability) of nondisjunctive linkages between interacting nodes (vesicles) – crosstalk – serves as a kind of inverse temperature parameter. Phase transitions can then be described using various ‘biological’ renormalization strategies, in which universality class tuning becomes a principal second order mechanism. These are continuous phase transitions in that there is no latent heat required, as in boiling water until it turns to steam, although a treatment of ecosystem resilience in dynamic manifolds (e.g., Wallace and Wallace 2008) can perhaps be viewed as an analog. Under a resilience domain shift the underlying system topology changes, either within or between dynamic manifolds, and this can be viewed as similar to the fundamental geometric structural change water undergoes when it transforms from crystal to liquid, or from liquid to gas. Indeed, recent work (Franzosi and Pettini 2004; Pettini 2007) uses Morse theory from differential topology to identify necessary conditions for topological shifts on dynamic manifolds of systems undergoing first and second order phase transition. The converse, sufficiency, is an open question. See the Appendix for a brief introduction to Morse Theory.

The important point is that the homology between Eqs. 12 and 24 ensures that some form of emergent behavior, akin to a physical phase transition, is inevitable for a broad class of systems which transmit information.

Phenomenological Onsager Theory

The homology between the information source uncertainty and the free energy density of a physical system arises, in part, from the formal similarity between their definitions in the asymptotic limit. Information source uncertainty can be defined as in Eq. 13. This is, as noted, quite analogous to the free energy density of a physical system, Eq. 24.

Feynman (1996) provides a series of physical examples, based on Bennett’s work, where this homology is, in fact, an identity, albeit for very simple systems. Bennett argues, in terms of idealized irreducibly elementary computing machines, that the information contained in a message can be viewed as the work saved by not needing to recompute what has been transmitted.

Feynman explores in some detail Bennett’s ideal microscopic machine designed to extract useful work from a transmitted message. The essential argument is that computing, in any form, takes work. Thus the more complicated a process, measured by its information source uncertainty, the greater its energy consumption, and our ability to provide energy is limited. In a psychological context, the striking phenomenon of inattentional blindness, Wallace (2007) argues, emerges as a thermodynamic limit on processing capacity in a focused cognitive system, that is, one which has been strongly configured about a particular task. Other biological generalizations seem obvious.

Understanding the time dynamics of information systems away from the kind of phase transition critical points described above and in the Appendix requires a phenomenology similar to the Onsager relations of nonequilibrium thermodynamics. This will lead to a more general theory involving large-scale topological changes in the sense of Morse theory.

If the source uncertainty associated with a biological process is parametrized by some vector of quantities \(\mathbf{K} \equiv (K_{1}, ..., K_{m})\), then, in analogy with nonequilibrium thermodynamics, gradients in the \(K_{j}\) of the disorder, defined as

$$\label{eq25} S \equiv H(\mathbf{K})-\sum\nolimits_{j=1}^{m}K_{\!j} \partial H/\partial K_{\!j} $$
(25)

become of central interest.

Equation 25 is similar to the definition of entropy in terms of the free energy density of a physical system, as suggested by the homology between free energy density and information source uncertainty described above.

Pursuing the homology further, the generalized Onsager relations defining temporal dynamics become

$$\label{eq26} dK_{\!j}/dt = \sum\limits_{i} L_{j,i} \partial S/\partial K_{i}, $$
(26)

where the L j,i are, in first order, constants reflecting the nature of the underlying phenomena. The L-matrix is to be viewed empirically, in the same spirit as the slope and intercept of a regression model, and may have structure far different than familiar from simpler chemical or physical processes. The \(\partial S/\partial K\) are analogous to thermodynamic forces in a chemical system, and may be subject to override by external physiological driving mechanisms: biological systems, unlike physical systems, can make choices as to energy allocation.

That is, an essential contrast with simple physical systems driven by (say) entropy maximization is that complex biological structures can make decisions about resource allocation, to the extent resources are available. Thus resource availability is a context, not a determinant.

Equations 25 and 26 can be derived in a simple parameter-free covariant manner which relies on the underlying topology of the information source space implicit to the development (e.g., Wallace and Wallace 2008). We will not pursue that development here.

The dynamics, as we have presented them so far, have been noiseless, while biological systems are always very noisy. Equation 26 might be rewritten as

$$dK_{\!j}/dt = \sum_{i} L_{j,i} \partial S/\partial K_{i} + \sigma W(t) $$

where σ is a constant and W(t) represents white noise.

This leads directly to a family of classic stochastic differential equations having the form

$$dK^{j}_{t}= \sum\nolimits_{i} L^{j,i}\big(t, \partial S/\partial K^{i}\big)dt + \sigma^{j}(t)dB^{j}_{t},$$
(27)

where the \(L^{j,i}\) and \(\sigma^{j}\) are appropriately regular functions of t and \(K^{i}\), \(dB^{j}_{t}\) represents the noise structure, and we have readjusted the indices.

Further progress in this direction would require methods from stochastic differential geometry and related topics in the sense of Emery (1989), which our approach does not need. The obvious inference is that noise, which need not be ‘white’, can serve as a tool to shift the system between various modes, as a kind of crosstalk and the source of a generalized stochastic resonance.

Effectively, topological shifts between and within dynamic manifolds constitute another theory of phase transitions. Indeed, similar considerations have become central in the study of phase changes for physical systems (Franzosi and Pettini 2004).

Our phenomenological Onsager treatment would likely be much enriched by adoption of a Morse theory perspective.

Rate Distortion Dynamics

It is possible to restate Eqs. 26 and 27 in a manner which relates them closely to our central concerns. First recall the relation between information source uncertainty and channel capacity from Eq. 16:

$$ H[\mathbf{X}] \leq C. $$

Next, the definition of channel capacity, Eq. 8:

$$ C \equiv \max_{P(X)} I(X \mid Y). $$

Finally, the definition of the rate distortion function, from Eq. 21:

$$R(D)=\min_{p(y,\hat{y});\sum_{(y,\hat{y})}p(y)p(y \mid \hat{y})d(y,\hat{y})\leq D}I(Y,\hat{Y}).$$

R(D) defines the minimum channel capacity necessary for average distortion D, placing limits on information source uncertainty. Thus, we suggest that, as in Eigen’s original model, distortion measures can drive information system dynamics.

We are led to propose, as a heuristic, that the dynamics of Eqs. 25, 26, and 27 will, through Eq. 16, be constrained by the system as described in terms of a parametrized rate distortion function. To do this, take R as parametrized not only by the distortion D, but by some vector of variates \(\mathbf{D} = (D_{1}, ..., D_{k})\), for which the first component is the average distortion. The assumed dynamics are then driven by gradients in the rate distortion disorder defined as

$$\label{eq28} S_{R} \equiv R(\mathbf{D})-\sum\limits_{i=1}^{k}D_{i}\partial R/\partial D_{i}, $$
(28)

leading to the deterministic and stochastic systems of equations

$$ dD_{j}/dt=\sum\nolimits_{i}L_{j,i}\partial S_{R}/\partial D_{i} $$
(29)

and

$$dD^{j}_{t}=\sum\nolimits_{i}L^{j,i}\big(t,\partial S_{R}/\partial D^{i}\big)dt+ \sigma^{j}(t)dB^{j}_{t}.$$
(30)

A simple Gaussian channel, taking \(\sigma^{2} = 1\), has a Rate Distortion function

$$R(D)= 1/2 \log[1/D], $$

so that,

$$ S_{R}(D) = R(D)-D\,dR/dD = 1/2 \log(1/D) + 1/2. $$
(31)

The simplest possible Onsager relation becomes

$$ dD/dt \propto -dS_{R}/dD=1/(2D), $$
(32)

where − dS R /dD represents the force of the ‘biochemical wind’.

This has the solution

$$ D \propto \sqrt{t}. $$
(33)

Similar results will accrue to any of the reverse-J-shaped relations which must inevitably characterize simple RNA-like replicators. The implication is that all simple RNA(-like) organisms – including RNA viruses – will inevitably be subject to a relentless biochemical force, a constant torrent, which can drive them very close to their critical mutation rates. That is, in general, absent a contravening biological or other constraint,

$$ D(t) = f(t), $$
(34)

with f(t) monotonic increasing in t.

It is not surprising, then, that so many RNA viruses are found close to their error thresholds, which do indeed constitute a powerful biological constraint (Holmes 2003).
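As a quick numerical check of Eqs. 31–33, the sketch below integrates the simplest Onsager relation for the \(\sigma^{2} = 1\) Gaussian channel and compares the result with the closed form \(D(t)=\sqrt{D_{0}^{2}+t}\). The initial condition, step size, time span, and unit proportionality constant are arbitrary choices.

```python
import numpy as np

# Euler integration of dD/dt = -dS_R/dD = 1/(2 D) for the sigma^2 = 1 Gaussian
# channel (Eqs. 31-32); the exact solution D(t) = sqrt(D0**2 + t) is the
# D ~ sqrt(t) 'biochemical wind' of Eq. 33. D0, dt and the time span are arbitrary.

D0, dt, steps = 0.01, 1e-6, 500_000
D = D0
for _ in range(steps):
    D += dt / (2.0 * D)
t = steps * dt
print("numerical D(t) =", round(D, 5))
print("analytic  D(t) =", round(float(np.sqrt(D0**2 + t)), 5))
```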

In the next section we will show how two or more interacting RNA-like vesicles can, given enough energy, by virtue of Van Valen’s (1973) Red Queen dynamic, oppose this fierce chemical hurricane.

Prebiotic Coevolution

Natural systems subject to coevolutionary interaction may become enmeshed in the Red Queen dilemma of Alice in Wonderland, in that they must undergo constant evolutionary change in order to avoid extinction – they must constantly run just to stay in the same place. An example would be a competitive arms race between predator and prey: Each evolutionary advance in predation must be met with a coevolutionary adaptation which allows the prey to avoid the more efficient predator. Otherwise the system will collapse, since a highly specialized predator can literally eat itself to extinction. Similarly, each prey defense must be matched by a predator adaptation for the system to persist.

Here we present a fairly elaborate model of coevolution, in terms of interacting information sources. Interaction events, we will argue, can be highly punctuated. These are assumed to be between prebiotic vesicles which engage in crosstalk.

We begin by examining ergodic information sources and their dynamics under the self-similarity of a renormalization transformation near a punctuated phase transition, using a simple version of the more elaborate arguments in the Appendix. We then study the linked interaction of two information sources in which the richness of the quasi-language of each affects the other, that is, when two information sources have become one another’s primary environments. This leads directly and naturally to a coevolutionary Red Queen. We will generalize the development to a ‘block diagonal’ structure of several interacting sources, and extend the model to second order, using a large deviations argument, suggesting a means for the necessary series of punctuated biochemical evolutionary events.

Fragmentation and Coalescence

The structures of interest to us here can be most weakly, and hence universally, described in terms of an adiabatically, piecewise stationary, ergodic information source involving a stochastic variate X which, in some general sense, sends symbols α in correlated sequences \(\alpha_{0}, \alpha_{1}, ..., \alpha_{n-1}\) of length n (which may vary), according to a joint probability distribution, and its associated conditional probability distribution,

$$\begin{array}{l} P\big[X_{0} = \alpha_{0}, X_{1} = \alpha_{1},...X_{n-1}=\alpha_{n-1}\big], \\\\ P\big[X_{n-1} = \alpha_{n-1} \mid X_{0}=\alpha_{0},...X_{n-2}=\alpha_{n-2}\big]. \end{array}$$

If the conditional probability distribution depends only on m previous values of X, then the information source is said to be of order m (Ash 1990).

By ‘ergodic’ we mean that, in the long term, correlated sequences of symbols are generated at an average rate equal to their (joint) probabilities. ‘Adiabatic’ means that changes are slow enough to allow the necessary limit theorems to function, ‘stationary’ means that, between pieces, probabilities don’t change (much), and ‘piecewise’ means that these properties hold between phase transitions, which are described using renormalization methods.

As the length of the (correlated) sequences increases without limit, the Shannon–McMillan Theorem permits division of all possible streams of symbols into two groups, a relatively small number characterized as meaningful, whose long-time behavior matches the underlying probability distribution, and an increasingly large set of gibberish with vanishingly small probability. Let N(n) be the number of possible meaningful sequences of length n emitted by the source \(\textbf{X}\). Again, uncertainty of the source, \(H[\textbf{X}]\), can be defined by the relation

$$ H[\textbf{X}] = \lim_{n \rightarrow \infty} \frac{\log[N(n)]}{n}. $$

The Shannon–McMillan Theorem shows how to characterize \(H[\textbf{X}]\) directly in terms of the joint probability distribution of the source \(\textbf{X}\).

Let P[x i |y j ] be the conditional probability that stochastic variate X = x i given that stochastic variate Y = y j and let P[x i , y j ] be the joint probability that X = x i and Y = y j . Then the joint and conditional uncertainties of X and Y, H(X,Y), and H(X |Y) are given by expressions like those of Eqs. 5 and 6.

And again, the Shannon–McMillan Theorem states that \(H[\textbf{X}]\) is determined according to the limits of Eq. 13.

Estimating the probabilities of the sequences α 0,...α n − 1 from observation, the ergodic property allows us to use them to estimate the uncertainty of the information source. Thus \(H[\textbf{X}]\) is directly measurable.

Equation 16, \(H[\textbf{X}] \leq C\), establishes C as the maximum rate at which the external world can transmit information originating with the information source, or at which internal vesicles can communicate. Much of the subsequent development could, in fact, be expressed using this relation, and the analog in terms of the rate distortion Eqs. 28–30.

Again recall the relation between the expression for source uncertainty and the free energy density of a physical system which undergoes a phase transition depending on an inverse temperature parameter K = 1/T at a critical temperature T C .

Imposition of a renormalization symmetry on F(K) in Eq. 24 describes, in the infinite volume limit, the behavior of the system at the phase transition in terms of scaling laws (Wilson 1971). After some development, taking the limit n → ∞ as an analog to the infinite volume limit of a physical system, we will apply this approach to a parametrized source uncertainty. We will examine changes in structure as a fundamental ‘inverse temperature’ changes across the underlying system.

We use three parameters to describe the relations between an information source and its environment or between different interacting sources.

The first, J ≥ 0, measures the degree to which acquired characteristics are transmitted. J ≈ 0 thus represents a high degree of genetic as opposed to epigenetic inheritance.

J will always remain distinguished, a kind of inherent direction or external field strength in the sense of Wilson (1971).

The second parameter, Q ≥ 0, represents the inverse availability of resources. Q ≈ 0 thus represents a high ability to renew and maintain a particular system.

The third parameter, K = 1/T, is an inverse index of a generalized temperature T, which we will more directly specify below in terms of the richness of interacting information sources.

We suppose further that the structure of interest is implicitly embedded in, and operates within the context of, a larger manifold stratified by metric distances.

Take these as multidimensional vector quantities \(\textbf{A}\), \(\textbf{B}\), \(\textbf{C}\).... \(\textbf{A}\) may represent location in space, time delay, or the like, and \(\textbf{B}\) may be determined through multivariate analysis of a spectrum of observed behavioral or other factors, in the largest sense, etc.

It may be possible to reduce the effects of these vectors to a function of their magnitudes \(a= \mid \textbf{A} \mid \), \(b=\mid \textbf{B} \mid \) and \(c= \mid \textbf{C} \mid \), etc. Define a generalized distance r as

$$\label{eq35} r^{2} = a^{2} + b^{2} + c^{2} + ...\ \ . $$
(35)

To be more explicit, we assume an ergodic information source \(\textbf{X}\) is associated with the reproduction and/or persistence of a biological population or other structure. The source \(\textbf{X}\), its uncertainty \(H[J, K, Q, \textbf{X}]\) and its parameters J, K, Q all are assumed to depend implicitly on the embedding manifold, in particular on the metric r of Eq. 35.

A particularly elegant and natural formalism for generating such punctuation in our context involves application of Wilson’s (1971) program of renormalization symmetry – invariance under the renormalization transform – to source uncertainty defined on the r-manifold. The results predict that language in the most general sense, which includes the transfer of information within a biological or prebiotic context, will undergo sudden changes in structure analogous to phase transitions in physical systems.

We emphasize that this is an argument by abduction from physical theory: Much current development surrounding physical phase change is based on the assumption that at transition a system looks the same under renormalization. That is, phase transition represents a stationary point for a renormalization transform in the sense that the transformed quantities are related by simple scaling laws to the original values.

Renormalization is a clustering semigroup transformation in which individual components of a system are combined according to a particular set of rules into a ‘clumped’ system whose behavior is a simplified average of those components. Since such clumping is a many-to-one condensation, there can be no unique inverse renormalization, and, as the Appendix shows, many possible forms of condensation.

Assume it possible to redefine characteristics of the information source \(\textbf{X}\) and J, K, Q as functions of averages across the manifold having metric r, which we write as R. That is, ‘renormalize’ by clustering the entire system in terms of blocks of different sized R.

Let N(K, J, Q, n) be the number of high probability meaningful correlated sequences of length n across the entire community in the r-manifold of Eq. 35, given parameter values K, J, Q. We study changes in

$$ H[K, J, Q, \textbf{X}] \equiv \lim_{n \rightarrow \infty} \frac{\log[N(K, J, Q, n)]}{n} $$

as \(K \rightarrow K_{C}\) and/or \(Q \rightarrow Q_{C}\) for critical values \(K_{C}, Q_{C}\) at which the system begins to undergo a marked transformation from one kind of structure to another.

Given the metric of Eq. 35, a correlation length, χ(K, J, Q), can be defined as the average length in r-space over which structures involving a particular phase dominate.

Now clump the ‘community’ into blocks of average size R in the multivariate r-manifold, the ‘space’ in which the system of interest is implicitly embedded.

Following the classic argument of Wilson (1971), reproduced and expanded in the Appendix, it is possible to impose renormalization symmetry on the source uncertainty on H and χ by assuming at transition the relations

$$\label{eq36} H\big[K_{R}, J_{R}, Q_{R}, \textbf{X}\big] = R^{\delta}H[K, J, Q, \textbf{X}] $$
(36)

and

$$\label{eq37} \chi\big(K_{R}, J_{R}, Q_{R}\big) = \chi(K, J, Q)/R $$
(37)

hold, where \(K_{R}\), \(J_{R}\) and \(Q_{R}\) are the transformed values of K, J and Q after the clumping of renormalization. We take \(K_{1}, J_{1}, Q_{1} \equiv K, J, Q\) and permit the characteristic exponent δ > 0 to be nonintegral. The Appendix provides examples of other possible relations.

Equations 36 and 37 are assumed to hold in a neighborhood of the transition values K C and Q C .

Differentiating these with respect to R gives expressions for dK R /dR, dJ R /dR and dQ R /dR depending simply on R which we write as

$$\begin{array}{lll} dK_{R}/dR &=& u\big(K_{R}, J_{R}, Q_{R}\big)/R\\ dQ_{R}/dR &=& w\big(K_{R}, J_{R}, Q_{R}\big)/R\\ dJ_{R}/dR &=& \big[v\big(K_{R}, J_{R}, Q_{R}\big)/R\big]J_{R}. \end{array} $$
(38)

Solving these differential equations gives K R , J R and Q R as functions of J, K, Q and R.

Substituting back into Eqs. 36 and 37 and expanding in a first order Taylor series near the critical values K C and Q C gives power laws much like the Widom-Kadanoff relations for physical systems (Wilson 1971). For example, letting J = Q = 0 and taking κ ≡ (K C  − K)/K C gives, in first order near K C ,

$$\begin{array}{lll} H &=& \kappa^{\delta/y}H_{0}\\ \chi &=& \kappa^{-1/y}\chi_{0} \end{array} $$
(39)

where y is a constant arising from the series expansion.

Note that there are only two fundamental equations – Eqs. 36 and 37 – in n > 2 unknowns: The critical ‘point’ is, in this formulation, most likely to be a complicated implicitly defined critical surface in J, K, Q, ... -space. The ‘external field strength’ J remains distinguished in this treatment, i.e., the inverse of the degree to which acquired characteristics are inherited, but neither K , Q nor other parameters are, by themselves, fundamental, rather their joint interaction defines critical behavior along this surface.

That surface is a fundamental object, not the particular set of parameters (except for J) used to define it, which may be subject to any set of transformations which leave the surface invariant. Thus inverse generalized temperature, resource availability, or whatever other parameters may be identified as affecting H, are inextricably intertwined and mutually interacting, according to the form of this critical surface. That surface, in turn, is unlikely to remain fixed, and should vary with time or other extrinsic parameters, including, but not likely limited to, J.

At the critical surface a Taylor expansion of the renormalization Eqs. 36 and 37 gives a first order matrix of derivatives whose eigenstructure defines fundamental system behavior. For physical systems the surface is a saddle point (Wilson 1971), but more complicated behavior seems likely in what we study. See Binney et al. (1986) for some details of this differential geometry.

Taking, for the moment, the simplest formulation, (J = Q = 0), that is, a well-provisioned system with memory, as K increases toward a threshold value K C , the source uncertainty declines and, at K C , the average regime dominated by the ‘other phase’ grows. That is, the system begins to freeze into one having a large correlation length for the second phase. The two phenomena are linked at criticality in physical systems by the scaling exponent y.

Assume the rate of change of \(\kappa = (K_{C}-K)/K_{C}\) remains constant, \(|d\kappa/dt| = 1/\tau_{K}\). Analogs with physical theory suggest there is a characteristic time constant for the phase transition, \(\tau \equiv \tau_{0}/\kappa\), such that if changes in κ take place on a timescale longer than τ for any given κ, we may expect the correlation length \(\chi =\chi_{0} \kappa^{-s}\), s = 1/y, will be in equilibrium with internal changes and result in a very large fragment in r-space. Following Zurek (1985, 1996), the ‘critical’ freezeout time, \(\hat{t}\), will occur at a ‘system time’ \(\hat{t}=\chi/ \mid d\chi/dt \mid \) such that \(\hat{t}=\tau\). Taking the derivative \(d\chi/dt\), remembering that by definition \(|d\kappa/dt| = 1/\tau_{K}\), gives

$$ \frac{\chi}{\mid d\chi/dt \mid}=\frac{\kappa \tau_{K}}{s}=\frac{\tau_{0}}{\kappa} $$

so that

$$\kappa =\sqrt{s \tau_{0}/\tau_{K}}. $$

Substituting this value of κ into the equation for correlation length, the expected size of fragments in r-space, \(d(\hat{t})\), becomes

$$d \approx \chi_{0}\left(\frac{\tau_{K}}{s\tau_{0}}\right)^{s/2} $$

with s = 1/y > 0. The more rapidly K approaches \(K_{C}\), the smaller is \(\tau_{K}\), and the smaller and more numerous are the resulting r-space fragments. Thus rapid change produces small fragments more likely to risk extinction in a system dominated by economies of scale.
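The freezeout algebra above can be verified symbolically. The sketch below, an illustrative check rather than part of the original development, assumes \(\chi = \chi_{0}\kappa^{-s}\) and \(|d\kappa/dt| = 1/\tau_{K}\) and solves the freezeout condition \(\chi/|d\chi/dt| = \tau_{0}/\kappa\) for κ.

```python
import sympy as sp

# Symbolic check of the Zurek freezeout argument: with chi = chi0 * kappa**(-s)
# and |d kappa/dt| = 1/tau_K, solve chi / |d chi/dt| = tau_0 / kappa for kappa.
kappa, tau0, tauK, s, chi0 = sp.symbols('kappa tau0 tauK s chi0', positive=True)
chi = chi0 * kappa**(-s)
abs_dchi_dt = s * chi0 * kappa**(-s - 1) / tauK          # |d chi/d kappa| * |d kappa/dt|
sol = sp.solve(sp.Eq(chi / abs_dchi_dt, tau0 / kappa), kappa)
print(sol)                                   # the positive root kappa = sqrt(s*tau0/tauK)
print(sp.simplify(chi.subs(kappa, sol[0])))  # chi0*(s*tau0/tauK)**(-s/2), the fragment-size scaling
```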

An absolutely central observation is that this argument suggests the possibility of a reverse transformation: from a set of weakly linked, quasi-independent vesicles in simple loose association, into a large, coherent, strongly-linked structure in which the vesicles are now intimate, mutually-supporting, subcomponents. From a Morse Theory perspective, this must involve a fundamental topological transformation of the underlying biochemical networks.

The next section explores the dynamics of coevolutionary coherence in more detail.

Recursive Interaction

The next iteration of the argument involves envisioning reciprocally interacting biological information sources – vesicles – as subject to a coevolutionary Red Queen by treating their respective reproductive source uncertainties as recursively parametrized by each other. That is, we assume the information sources are each other’s primary environments. These are, respectively, characterized by information sources \(\textbf{X}\) and \(\textbf{Y}\), whose uncertainties are parametrized

  1. by measures of both inheritance and inverse resources – J, Q as above – and, most critically,

  2. by each other’s inverse uncertainties, \(1/H[\textbf{X}]\) and \(1/H[\textbf{Y}]\), i.e.,

$$\begin{array}{l} H[\textbf{X}]=H\big[J, Q, 1/H[\textbf{Y}], \textbf{X}\big]\\ H[\textbf{Y}]=H\big[J, Q, 1/H[\textbf{X}], \textbf{Y}\big]. \end{array}$$
(40)

This is a recursive system having complicated behaviors.

Assume a strongly heritable genetic system, J = 0, with fixed inverse resource base, Q, for which \(H[\textbf{X}]\) follows something like the lower graph in Fig. 1, a reverse-J-shaped curve in \(K = 1/H[\textbf{Y}]\), and similarly \(H[\textbf{Y}]\) depends on \(1/H[\textbf{X}]\). That is, increase or decline in the source uncertainty of one system leads to increase or decline in the source uncertainty of the other, and vice versa. The richness of the two information sources is closely linked.

Fig. 1

A reverse-J-shaped curve for source uncertainty \(H[\textbf{X}]\) – measuring language richness – as a function of an inverse temperature parameter \(K=1/H[\textbf{Y}]\). To the right of the critical point K C the system breaks into fragments in r-space whose size is determined by the rate at which K approaches K C . A collection of fragments already to the right of K C , however, would be seen as condensing into a single unit as K declined below the critical point. If K is an inverse source uncertainty itself, i.e., \(K=1/H[\textbf{Y}]\) for some information source \(\textbf{Y}\), then under such conditions a Red Queen dynamic can become enabled, driving the system strongly to the left. No intermediate points are asymptotically stable in this development. To the right of the critical point K C the system is locked into disjoint fragments. Thus there are two quasi-stable points, a low energy solution near the error limit phase transition point, and a high energy state nearer to, but never at, the zero error limit

Start at the right of the lower graph for \(H[\textbf{X}]\) in Fig. 1, the source uncertainty of the first system, but to the left of the critical point \(K_C\). Assume \(H[\textbf{Y}]\) increases, so that \(K=1/H[\textbf{Y}]\) decreases, and thus \(H[\textbf{X}]\) increases, walking up the lower curve of Fig. 1 from the right: the richness of the first system's internal language increases.

The increase of \(H[\textbf{X}]\) leads, in turn, to a decline in \(1/H[\textbf{X}]\), triggering an increase of \(H[\textbf{Y}]\), whose increase leads to a further increase of \(H[\textbf{X}]\), and so on: the Red Queen, taking the system from the right of Fig. 1 to the left, up the lower curve, as the two systems mutually interact.
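A toy iteration may make the ratchet concrete. Here a hypothetical smooth reverse-J curve (the parameters H_max, K_c, and w are illustrative assumptions, not quantities from the text) stands in for the lower graph of Fig. 1, and the recursive coupling is iterated directly.

```python
import numpy as np

# Hypothetical reverse-J curve standing in for the lower graph of Fig. 1:
# source uncertainty as a decreasing function of the inverse temperature
# analog K, dropping sharply near a critical point K_c.
H_max, K_c, w = 10.0, 1.0, 0.1

def H_of_K(K):
    z = np.clip((K - K_c) / w, -50.0, 50.0)   # clip only to avoid overflow
    return H_max / (1.0 + np.exp(z))

def red_queen(HX0, HY0, n_steps=50):
    """Iterate the recursive coupling H[X] = f(1/H[Y]), H[Y] = f(1/H[X])."""
    HX, HY = HX0, HY0
    for _ in range(n_steps):
        HX, HY = H_of_K(1.0 / HY), H_of_K(1.0 / HX)
    return HX, HY

# Starting just left of the critical point, the ratchet drives both source
# uncertainties up toward the rich, high-H quasi-stable state.
HX, HY = red_queen(1.2, 1.2)
print(f"start left of K_c : H[X]={HX:.3g}, H[Y]={HY:.3g}")

# Starting to the right of K_c (both uncertainties low), the system stays
# locked in the fragmented, low-H state.
HX, HY = red_queen(0.5, 0.5)
print(f"start right of K_c: H[X]={HX:.3g}, H[Y]={HY:.3g}")
```

The two outcomes correspond to the two quasi-stable points discussed below; no intermediate configuration survives the iteration.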

Now enlarge the scale of the argument, and consider the possibility of other interactions.

The upper graph of Fig. 1 represents the disorder

$$S=H[K, \textbf{X}]-K\,dH[K, \textbf{X}]/dK, \quad K\equiv 1/H[\textbf{Y}],$$

whose gradient, in the absence of the Red Queen interaction, would simply drive the system toward the minimum energy critical point for this system, as in the section on rate distortion dynamics.

Clearly this ratchet can also be reversed. Thus the system has two quasi-stable limit points, a low energy solution near the error limit phase transition point, and a high energy state near to, but never at, the zero error limit, depending on the availability of sufficient energy to the system. Absent a relatively high energy source, low error rate reproduction would be impossible, according to the model. This suggests that some major, large scale, ecosystem transformation in energy availability was a necessary condition for low error rate reproduction.

This simple model has a surprising number of implications that are not confined to a prebiotic context. It is possible to formulate the contest between a highly mutating pathogen, operating at a low energy configuration, and a complicated immune system, in terms of phenotype, rather than simply genotype, coevolution, in the sense of West-Eberhard (2003). Then the phenotype-phenotype ‘two-vesicle’ coevolutionary ratchet has, again, two quasi-fixed points: fragmentation near a high variability phenotype, and immune/viral phenotype coalescence near a low variability phenotype. The latter will, ultimately, write an image of itself, in the sense of Adami et al. (2000), on the pathogen gene in spite of its high mutation rate, producing protected zones defining that phenotype.

Extending the Model

The model directly generalizes to multiple interacting information sources.

First consider a matrix of crosstalk measures between a set of information sources. Assume the matrix of mutual information crosstalk can be block diagonalized into two major components, characterized by mutual information source measures like

$$I_{m}(X_{1}...X_{i}), m=1,2.$$

Then apply the two-component theory above.

Extending the development to multiple, recursively interacting information sources resulting from a more general block diagonalization seems direct. First use the inverse measures \(1/I_{j}\) as parameters for each of the other blocks, writing

$$I_{m}=I_{m}\big(K_{1},...,K_{s},...,1/I_{j},...\big), \quad j \neq m,$$

where the \(K_{s}\) represent other relevant parameters.

Next segregate these parameters according to their relative rates of change.

The dynamics of such a system becomes a recursive network of stochastic differential equations, similar to those used to study many other highly parallel dynamic structures (e.g., Wymer 1997).

Letting the \(K_{j}\) and the \(1/I_{m}\) all be represented as parameters \(Q_{j}\) (with the caveat that \(I_{m}\) not depend on \(1/I_{m}\) itself), one can define

$$S^{m}_{I}\equiv I_{m}-\sum_{i}Q_{i}\partial I_{m}/\partial Q_{i} $$

to obtain a complicated recursive system of phenomenological ‘Onsager relations’ stochastic differential equations like Eq. 27 or, in terms of rate distortion functions, like Eq. 30,

$$\label{eq41} dQ^{j}_{t} = \sum\nolimits_{i}\big[L_{j,i}\big(t,...\partial S^{m}_{I}/ \partial Q^{i}...\big)dt + \sigma_{j,i}\big(t,...\partial S^{m}_{I}/\partial Q^{i}...\big)dB^{i}_{t}\big], $$
(41)

where, again, for notational simplicity only, we have expressed both the reciprocal \(I\)'s – the \(1/I_{j}\) – and the external K's in terms of the same \(Q_{j}\).

The index m ranges over the crosstalk measures, and we could allow different kinds of 'noise' \(dB^{i}_{t}\), having particular forms of quadratic variation which may, in fact, represent a projection of environmental factors under something like a rate distortion manifold (Wallace and Wallace 2008).
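As a rough illustration only, the following sketch integrates a drastically simplified two-block instance of Eq. 41 by the Euler–Maruyama method. The quadratic forms chosen for the I_m, the constant L coefficients, and the scalar noise intensity are all assumptions introduced purely to make the example concrete; in Eq. 41 itself the drift and noise coefficients are general functions of the entropy gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: two crosstalk measures I_m depending on a shared
# parameter vector Q, with I_m(Q) = -0.5 * Q.A_m.Q, so that the associated
# 'entropy' S^m_I = I_m - sum_i Q_i dI_m/dQ_i = +0.5 * Q.A_m.Q.
A = [np.array([[2.0, 0.5], [0.5, 1.0]]),
     np.array([[1.0, -0.3], [-0.3, 2.0]])]

def grad_S(Q, m):
    """Gradient of S^m_I = 0.5 * Q.A_m.Q (A_m symmetric), i.e. A_m @ Q."""
    return A[m] @ Q

def euler_maruyama(Q0, dt=0.01, n_steps=5000, sigma=0.05):
    """Integrate dQ_t = sum_m L_m grad(S^m_I) dt + sigma dB_t (Eq. 41, schematically)."""
    Q = np.array(Q0, dtype=float)
    L = [-0.5, -0.5]          # constant drift coefficients (an assumption)
    for _ in range(n_steps):
        drift = sum(L[m] * grad_S(Q, m) for m in range(2))
        Q += drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(Q.shape)
    return Q

# With negative L the trajectory relaxes toward the origin and then
# fluctuates about it: a caricature of a single quasi-stable mode.
print(euler_maruyama([1.0, -1.0]))
```

Richer choices of I_m and of the coefficient functions would produce the multiple quasi-stable modes and full-scale fragmentation discussed below.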

Indeed, the I m and/or the derived S m might, in some circumstances, be combined into a Morse function, permitting application of Pettini’s Topological Hypothesis.

The model rapidly becomes unwieldy, probably requiring some clever combinatorial or groupoid convolution algebra and related diagrams for concise expression, much as in the usual field theoretic perturbation expansions (Hopf algebras, for example). The virtual reaction method of Zhu et al. (2007) is another possible approach.

As in the simple model above, there will be, first, multiple quasi-stable points within a given system’s I m , representing a class of generalized resilience modes accessible via punctuation, enmeshing the crosstalking vesicles in a manner analogous to the simple model of Fig. 1, and depending critically on available energy.

Second, however, will be analogs to the fragmentation of Fig. 1 when the system exceeds the critical value K c . That is, the K-parameter structure will represent full-scale fragmentation of the entire structure, and not just punctuation within it.

There are other possible patterns:

  1. [1]

    Setting Eq. 41 equal to zero and solving for stationary points – ‘coevolutionary stable strategies’ – again gives attractor states since the noise terms preclude unstable equilibria.

  2. [2]

    However, this system may also converge to limit cycle or ‘strange attractor’ behaviors in which it seems to chase its tail endlessly.

  3. [3]

    What is converged to in both cases is not a simple state or limit cycle of states. Rather it is an equivalence class, or set of them, of highly dynamic information sources coupled by mutual interaction through crosstalk. Thus ‘stability’ in this structure represents particular patterns of ongoing dynamics rather than some identifiable ‘state’. This leads directly to elaborate groupoid descriptions of dynamic manifolds, and to consideration of directed homotopy equivalence classes which we will not explore further here (Wallace and Wallace 2008).

We are, then, deeply enmeshed in a highly recursive system of phenomenological stochastic differential equations, but in a dynamic rather than static manner: the objects of this dynamical system are equivalence classes of information sources and their crosstalk, rather than simple 'states' of a dynamical or reactive chemical system.

Other formulations may well be possible, but our work here serves to illustrate the method.

It is, however, interesting to compare our results to those of Dieckmann and Law (1996), who invoke evolutionary game dynamics to obtain a first order coevolutionary canonical equation having the form

$$\label{eq42} ds_{i}/dt=K_{i}(s)\, \partial W_{i}\big(s_{i}^{\prime}, s\big)/\partial s_{i}^{\prime} \big|_{s_{i}^{\prime}=s_{i}}. $$
(42)

The s i , with i = 1, ..., N, denote adaptive trait values in a community comprising N species. The \(W_{i}(s_{i}^{\prime},s)\) are measures of fitness of individuals with trait values \(s_{i}^{\prime}\) in the environment determined by the resident trait values s, and the K i (s) are non-negative coefficients, possibly distinct for each species, that scale the rate of evolutionary change. Adaptive dynamics of this kind have frequently been postulated, based either on the notion of a hill-climbing process on an adaptive landscape or some other sort of plausibility argument.

When this equation is set equal to zero, so there is no time dependence, one obtains what are characterized as ‘evolutionary singularities’, i.e., stationary points.
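A minimal sketch of such adaptive dynamics, assuming a hypothetical two-species fitness function with stabilizing selection toward species-specific optima plus a weak trait coupling (none of which is specified by Dieckmann and Law's general formalism), is as follows; the integration simply runs the canonical equation forward until it settles at the stationary point.

```python
import numpy as np

# Toy two-species canonical equation ds_i/dt = K_i * dW_i/ds_i' |_{s_i'=s_i}.
theta = np.array([1.0, -1.0])   # hypothetical trait optima
c = 0.2                          # hypothetical coupling strength
K = np.array([1.0, 0.5])         # per-species rate coefficients

def selection_gradient(s):
    """dW_i/ds_i' at s_i'=s_i for W_i = -(s_i' - theta_i)^2 / 2 + c * s_i' * s_j."""
    return -(s - theta) + c * s[::-1]

def integrate(s0, dt=0.01, n_steps=2000):
    s = np.array(s0, dtype=float)
    for _ in range(n_steps):
        s += K * selection_gradient(s) * dt
    return s

# Converges to the evolutionary singularity where both selection gradients vanish.
print(integrate([0.0, 0.0]))
```

Setting the gradient to zero by hand gives the same stationary point, here approximately (0.83, −0.83) under the assumed parameters.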

Dieckmann and Law contend that their formal derivation of this equation satisfies four critical requirements:

  1. [1]

    The evolutionary process needs to be considered in a coevolutionary context.

  2. [2]

    A proper mathematical theory of evolution should be dynamical.

  3. [3]

    The coevolutionary dynamics ought to be underpinned by a microscopic theory.

  4. [4]

    The evolutionary process has important stochastic elements.

Our Eq. 41 seems clearly within this same ballpark, although we have taken a much different route, one which indeed produces elaborate patterns of phase transition punctuation in a highly natural manner. Champagnat et al. (2006), in fact, derive a higher order canonical approximation extending Eq. 42 which is much closer to Eq. 41, that is, a stochastic differential equation describing evolutionary dynamics.

Champagnat et al. (2006) go even further, using a large deviations argument to analyze dynamical coevolutionary paths, not merely evolutionary singularities. They contend that, in general, the issue of evolutionary dynamics drifting away from trajectories predicted by the canonical equation can be investigated by considering the asymptotics of the probability of 'rare events' for the sample paths of the diffusion. By 'rare events' they mean diffusion paths drifting far away from the canonical equation.

The probability of such rare events is governed by a large deviation principle: when a critical parameter (designated ε) goes to zero, the probability that the sample path of the diffusion is close to a given rare path ϕ decreases exponentially to 0 with rate I(ϕ), where the ‘rate function’ I can be expressed in terms of the parameters of the diffusion. This result, in their view, can be used to study long-time behavior of the diffusion process when there are multiple attractive evolutionary singularities. Under proper conditions the most likely path followed by the diffusion when exiting a basin of attraction is the one minimizing the rate function I over all the appropriate trajectories. The time needed to exit the basin is of the order \(\exp(H/\epsilon)\) where H is a quasi-potential representing the minimum of the rate function I over all possible trajectories.

An essential fact of large deviations theory is that the rate function I which Champagnat et al. (2006) invoke can almost always be expressed as a kind of entropy, that is, in the form \(I=-\sum_{j}P_{j}\log(P_{j})\) for some probability distribution. This result goes under a number of names: Sanov's Theorem, Cramer's Theorem, the Gartner–Ellis Theorem, the Shannon–McMillan Theorem, and so forth (Dembo and Zeitouni 1998). Here we will use it to suggest the possibility of second order effects in prebiotic coevolutionary processes. The fluctuational paths defined by the system of equations in Eq. 41 may, over sufficient time, lead to relatively sudden, and very marked, evolutionary excursions.

Large Deviations in Prebiotic Evolution

We begin with a recapitulation of large deviations and fluctuation formalism.

Information source uncertainty, according to the Shannon–McMillan Theorem, serves as a splitting criterion between high and low probability sequences (or pairs of them) and displays the fundamental characteristic of a growing body of work in applied probability, often termed the Large Deviations Program (LDP), which seeks to unite information theory, statistical mechanics, and the theory of fluctuations under a single umbrella.

Following Dembo and Zeitouni (1998, p.2),

Let X 1, X 2,... X n be a sequence of independent, standard Normal, real-valued random variables and let

$$ S_{n} = (1/n) \sum_{j=1}^{n}X_{j}. $$
(43)

Since S n is again a Normal random variable with zero mean and variance 1/n, for all δ > 0

$$ \mathop{\lim\nolimits_{n \rightarrow \infty}} P(\mid S_{n} \mid \geq \delta)=0, $$
(44)

where P is the probability that the absolute value of S n is greater or equal to δ. Some manipulation, however, gives

$$ P(\mid S_{n} \mid \geq \delta) = 1 - \frac{1}{\sqrt{2\pi}}\int_{-\delta \sqrt{n}}^{\delta \sqrt{n}} \exp\big(-x^2/2\big) dx, $$
(45)

so that

$$ \mathop{\lim\nolimits_{n \rightarrow \infty}} \frac{\log P(\mid S_{n} \mid \geq \delta)}{n} = -\delta^2/2. $$
(46)

This can be rewritten for large n as

$$ P(\mid S_{n} \mid \geq \delta) \approx \exp\big(-n\delta^2/2\big). $$
(47)

That is, for large n, the probability of a large deviation in S n follows something much like the asymptotic equipartition relation of the Shannon–McMillan Theorem, i.e., that meaningful paths of length n all have approximately the same probability \(P(n) \propto \exp(-n H[\mathbf{X}])\).
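A brief Monte Carlo check of this rate estimate, under the stated Normal assumptions, is straightforward; the empirical rate (1/n) log P approaches −δ²/2 only slowly in n, but the trend is visible even at modest sample sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
delta = 0.5

# Monte Carlo check of the large-deviation rate: (1/n) log P(|S_n| >= delta)
# should tend to -delta**2/2 = -0.125 as n grows (convergence is slow).
for n in [10, 20, 40]:
    samples = rng.standard_normal((200000, n)).mean(axis=1)   # S_n replicates
    p_emp = np.mean(np.abs(samples) >= delta)
    print(f"n={n:3d}  P={p_emp:.3e}  (1/n) log P = {np.log(p_emp)/n:+.3f}  "
          f"(target {-delta**2/2:+.3f})")
```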

Questions about meaningful paths thus appear as formally isomorphic to the central argument of the LDP, which brings statistical mechanics, fluctuation theory, and information theory into a single structure (Dembo and Zeitouni 1998).

Again, the cardinal tenet of large deviation theory is that the rate function − δ 2/2 can, under proper circumstances, be expressed as a mathematical entropy having the standard form

$$ -\sum\nolimits_{k}p_{k}\log p_{k}, $$
(48)

for some distribution p k .

Next we briefly recapitulate part of the standard treatment of large fluctuations (Onsager and Machlup 1953; Freidlin and Wentzell 1998).

The macroscopic behavior of a complicated physical system in time is assumed to be described by the phenomenological Onsager relations giving large-scale fluxes as

$$ \sum\nolimits_{j} R_{i,j}\, dK_{\!j}/dt = \partial S/\partial K_{i}, $$
(49)

where the \(R_{i,j}\) are appropriate constants, S is the system entropy, and the \(K_i\) are the generalized coordinates which parametrize the system's free energy.

Entropy is defined from free energy F by a Legendre transform – more of which follows below:

$$S \equiv F - \sum_{j} K_{\!j} \partial F/\partial K_{\!j}, $$

where the K j are appropriate system parameters.

Neglecting volume problems for the moment, free energy can be defined from the system’s partition function Z as

$$F(K)=\log[Z(K)]. $$

The partition function Z, in turn, is defined from the system Hamiltonian – defining the energy states – as

$$ Z(K) = \sum_{j} \exp[-K E_{\!j}], $$

where K is an inverse temperature or other parameter and the E j are the energy states.
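A toy two-level system (the energy values are illustrative assumptions) can be used to check that, with the text's conventions F(K) = log Z(K) and the Legendre construction for S, one recovers exactly the Shannon entropy of the associated Boltzmann distribution, which is the free energy/entropy homology exploited throughout this paper.

```python
import numpy as np

# Illustrative two-level system with energies E_j; conventions follow the text.
E = np.array([0.0, 1.0])

def Z(K):
    return np.sum(np.exp(-K * E))

def F(K):
    return np.log(Z(K))

def S_legendre(K, h=1e-5):
    dFdK = (F(K + h) - F(K - h)) / (2 * h)   # numerical derivative of F
    return F(K) - K * dFdK                    # S = F - K dF/dK

def S_shannon(K):
    p = np.exp(-K * E) / Z(K)                 # Boltzmann distribution over states
    return -np.sum(p * np.log(p))

for K in [0.5, 1.0, 2.0]:
    print(f"K={K:.1f}  F={F(K):+.4f}  S_legendre={S_legendre(K):.4f}  "
          f"S_shannon={S_shannon(K):.4f}")
```

The two entropy columns agree to numerical precision, illustrating why the large deviations rate function can be written in the entropy-like form discussed above.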

Inverting the Onsager relations gives

$$ dK_{i}/dt = \sum\nolimits_{j} L_{i,j} \partial S/\partial K_{\!j} = L_{i}(K_{1},...,K_{m}, t) \equiv L_{i}(K, t). $$
(50)

The terms \(\partial S/\partial K_{i}\) are macroscopic driving forces dependent on the entropy gradient.

Let a white Brownian noise ε(t) perturb the system, so that

$$ dK_{i}/dt = \sum\nolimits_{j} L_{i,j}\partial S/\partial K_{\!j} + \varepsilon(t) = L_{i}(K, t) + \varepsilon(t), $$
(51)

where the time averages of ε are < ε(t) > = 0 and < ε(t)ε(0) > = Dδ(t), with δ(t) the Dirac delta function, and we take K as a vector in the \(K_i\).

Following Luchinsky (1997), if the probability that the system starts at some initial macroscopic parameter state K 0 at time t = 0 and gets to the state K(t) at time t is P(K, t), then a somewhat subtle development (e.g., Feller 1971) gives the forward Fokker–Planck equation for P:

$$ \partial P(K,t)/\partial t=-\nabla \cdot(L(K,t)P(K,t))+(D/2)\nabla^{2}P(K,t). $$
(52)

In the limit of weak noise intensity this can be solved using the WKB, i.e., the eikonal, approximation, as follows: take

$$ P(K,t)=z(K,t)\exp(-s(K,t)/D). $$
(53)

z(K,t) is a prefactor and s(K,t) is a classical action satisfying the Hamilton–Jacobi equation, which can be solved by integrating the Hamiltonian equations of motion. The equation reexpresses P(K,t) in the usual parametrized negative exponential format.

Let \(p \equiv \nabla s\). Substituting and collecting terms of similar order in D gives

$$ dK/dt = p + L, \qquad dp/dt = -p\, \partial L/\partial K, $$
(54)

and

$$ -\partial s/\partial t \equiv h(K, p, t) = pL(K,t) + \frac{p^{2}}{2}, $$
(55)

with h(K, p, t) the Hamiltonian for appropriate boundary conditions.

Again following Luchinsky (1997), these Hamiltonian equations have two different types of solution, depending on p. For p = 0, dK/dt = L(K, t) which describes the system in the absence of noise. We expect that with finite noise intensity the system will give rise to a distribution about this deterministic path. Solutions for which p ≠ 0 correspond to optimal paths along which the system will move with overwhelming probability.
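A one-dimensional sketch, assuming the simple linear drift L(K) = −K (an illustrative choice, not anything derived above), shows the p ≠ 0 case: on the zero-energy manifold p = −2L of the Hamiltonian h = pL + p²/2, the optimal escape trajectory retraces the deterministic relaxation backwards in time, with h remaining near zero along the path.

```python
import numpy as np

# Auxiliary Hamiltonian system for the eikonal approximation, in one dimension:
#   dK/dt = p + L(K),   dp/dt = -p dL/dK,   h = p L(K) + p**2 / 2,
# with the illustrative drift L(K) = -K (stable deterministic state at K = 0).

def L(K):
    return -K

def dLdK(K):
    return -1.0

def integrate_escape(K0=0.01, dt=1e-3, n_steps=4000):
    K, p = K0, 2.0 * K0              # start on the h = 0 optimal-path manifold
    for _ in range(n_steps):
        dK = p + L(K)
        dp = -p * dLdK(K)
        K += dK * dt
        p += dp * dt
    h = p * L(K) + 0.5 * p**2
    return K, p, h

K, p, h = integrate_escape()
print(f"K(t)={K:.4f}  p(t)={p:.4f}  h={h:.2e}  (h stays near 0 along the optimal path)")
```

Here K(t) grows away from the stable point as the time-reversed relaxation, the fluctuational analog of the 'jet-like' transitions invoked below.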

In sum, to again paraphrase Luchinsky (1997), large fluctuations, although infrequent, are fundamental in a broad range of processes, and it was recognized by Onsager and Machlup (1953) that insight into the problem could be gained from studying the distribution of fluctuational paths along which the system moves to a given state. This distribution is a fundamental characteristic of the fluctuational dynamics, and its understanding leads toward control of fluctuations. Fluctuational motion from the vicinity of a stable state may occur along different paths. For large fluctuations, the distribution of these paths peaks sharply along an optimal, most probable, path. In the theory of large fluctuations, the pattern of optimal paths plays a role similar to that of the phase portrait in nonlinear dynamics.

The essential generalization is that, by invoking the results of the section on rate distortion dynamics, we can substitute the rate distortion function for free energy in this argument. For the two-vesicle system of Fig. 1 this suggests the possibility of a transition from the low energy, high error state to the high energy, low error condition, assuming enough time and energy are available.

For more complicated cooperative prebiotic assemblies of RNA-like vesicles, these sudden, relatively large, jet-like transitions might – again given enough time and available energy – lead to far more complicated structures having error correction capabilities within the overall constraints of the particular chemical system. A central question then revolves around what classes of systems would be more likely to make such a transition. These might be the ones best sought via observation ex planeta.

Discussion and Conclusions

Like many others, we have resolved Eigen’s paradox by proposing appropriate interaction between compartments consisting of closely related populations of prebiotic mutants carrying a ‘consensus sequence’. These are isolated in vesicles, which are themselves linked. Our innovation has been to impose a formalism driven by the asymptotic limit theorems of information theory. Rather than abduct spin glass, two dimensional Ising, or similar models from physics, we have used the homology between information source uncertainty and free energy density to invoke highly modified biological phase change and generalized Onsager relation schemes in order to describe the dynamics of information processing systems at, and away from, phase transition analogs. A restatement of Eigen’s model in rate distortion terms, compounded with a coevolutionary perspective, suggests RNA virus-like structures are likely to be found as either a fragmented community, with each mutational population driven toward the mutation error limit, or else in a cooperative state of linked vesicles which, collectively, can have a relatively low reproductive error rate, depending on available energy.

More complicated coevolutionary dynamics seem possible, not only ‘quasi-evolutionarily stable strategies’, but even analogs to limit cycles or pseudorandom strange attractors, all having various rates of reproductive error.

A large deviations argument suggests that certain of these prebiotic chemical systems may, given time, undergo highly structured excursions among their various possible equilibrium or quasi-equilibrium modes, depending on available energy. The more viable of these collective modes having low error rates would then become subject to further evolutionary process, resulting in a condensation from RNA-like to DNA-like error correcting structures, where the various vesicles become ever more closely intertwined. Energy availability may indeed be the critical matter, with the probability of a ‘large deviation’ dependent on development of new energy sources. The inverse perspective is that the availability of a new energy source – a new metabolic cycle, or elementary photofixation, for example – can drive large deviations, and hence evolutionary process. One can argue that large scale energy availability is enmeshed with large scale ecosystem resilience shifts, sensu Holling (1973), another example in which resilience drives evolution, albeit in a prebiotic circumstance (Wallace and Wallace 2008).

Our relatively systematic invocation of core concepts from information theory apparently marks a departure from current theorizing on these subjects, although we have, in fact, purged the groupoid and topological foundations of a ‘resilience’ analysis of similar topics (Wallace and Wallace 2008). These permit further classification of quasi-stable sets within the dynamic manifolds implicit to the development here. That is, a far richer theoretical structure is available, and will probably be required: One is, in spite of all the formal heavy lifting of this paper, still somehow reminded of the famous cartoon in which a white-coated scientist, standing in front of a blackboard filled with mathematical scribbles, turns to his audience, chalk in hand, saying “...and here a miracle occurs...”.

Nonetheless, we have added significantly to the store of needed theoretical building blocks. The keys to further progress, however, seem, as usual, to lie in more extensive and penetrating observational and laboratory studies.