Understanding and explaining the origin of life on Earth are perhaps the most difficult problems that science faces (Morowitz 1992; Deamer 1994; de Duve 1995; Koonin 2011). There are difficulties everywhere: most likely, there is no remnant of the original set of molecules that began their fight against the second law of thermodynamics (but see the discussion in Benner et al. 2004; Davies et al. 2009), so that the molecules in use by the most primitive life extant in the biosphere today are almost completely different from those molecules that began it all (Joyce et al. 1987). In view of all the problems that an RNA world scenario creates in terms of an “impossible” chemistry (Levy and Miller 1998; Sutherland 2010), we cannot even assume that RNA (in its present form) was part of the ancient equation.

Given the unfathomable number of variables (geological, chemical, and environmental), is it even worthwhile pursuing an answer to a question we can barely formulate? Here I propose–as others have before me (Joyce 2012; Walker and Davies 2013)–that in parallel to the ongoing work exploring the conditions of the pre-biotic Earth and the possible chemistries that could have given rise to a chemical system supporting self-replication, we ought to pursue theoretical research that is disconnected from any particular chemistry and that focuses instead entirely on information-theoretic aspects of the problem, as well as on simulations of (and with) artificial (abstract) chemistries.

Information-theoretic considerations are, after all, the only thing we can be certain about in this wilderness of ideas and uncertainties concerning the emergence of a biosphere. In order to be considered living, a system must be able to maintain information on a time scale that exceeds (likely by many orders of magnitude) the abiotic scale (defined below) (Adami 1998). Indeed, if for a moment we define information as the difference between the maximum (thermodynamic) and the actual entropy of the system, then certainly an abiotic system will have a vanishing amount of information unless it begins in a non-equilibrium state and approaches thermodynamic equilibrium. Let’s agree to call this timescale–the time it takes for a non-equilibrium system to approach maximum entropy–the abiotic scale. Living systems can stay away from maximum entropy for much longer, indeed arbitrarily long (the biotic time scale is, for all we know, only limited by the existence of the biosphere). It is this ability to persist in a state of reduced entropy over biotic rather than abiotic time scales that defines a set of molecules as living, and this set of molecules must achieve that feat via the self-replication of information.

From this point of view, the transition from non-life to life has occurred if a thermodynamic system permanently moves from the maximum entropy $H_{\max}$ to $H_{\max}-\Delta H$, where $\Delta H$ is the entropy deficit, or information. I argue here that the likelihood that such a transition occurs spontaneously depends mostly on the size of $\Delta H$, and I will investigate some quantities that are bound to affect the size of $\Delta H$ without referring to any aspects of the local chemistry that gives rise to that information. In this manner, I hope we can learn something about the environmental characteristics we should look for in candidate origin-of-life scenarios, characteristics that hopefully go beyond those that we are already investigating.

In the following, I will focus on potentially self-replicating molecular sequences of L monomers. Obviously, this constrains potential origin-of-life scenarios, but not as much as one might think. First, for the information-theoretic treatment that follows, it does not matter whether the replicator is a single contiguous sequence or instead a set of sequences that replicate each other in an autocatalytic cycle (Eigen and Schuster 1979), in such a manner that the end result is the self-replication of the information contained in the set. It is possible in principle to calculate the amount of information contained within the autocatalytic set (this is potentially non-trivial because of likely redundancies within the set of molecules), in which case the analysis that follows applies to this (non-redundant) information. After all, when our own DNA is replicated, this also happens via the intermediary of a myriad of other molecules, all of which are, however, encoded within the chromosomal or mitochondrial DNA. In other words, our genetic information also replicates itself through intermediaries.

Second, constraining our self-replicators to a limited number of possible monomers they could use (the “repertoire”) is only a weak constraint: in an information-theoretic treatment of molecular chemistries we can ignore any monomers that are rarely used, because those contribute only marginally to entropies (and hence to information) on account of the $p_i\log p_i$ form of the Shannon entropy. Third, while it is easier to treat sequences of fixed length, the formalism can easily be extended to treat molecules of varying length. Focusing on linear polymers, however, is a restriction that ultimately might not turn out to be warranted. Indeed, it is possible to encode information in molecular assemblies that are not linear chains (Segre et al. 1998, 2000), and for those my treatment would have to be modified.

I first calculate the fraction of molecules of length L (with monomers drawn from an alphabet or repertoire of size D) that are functional–in the sense that they display the ability to self-replicate. In the following, I do not worry about the error rate, because I simply assume that even if self-replication is imprecise, there will be a finite fraction of molecules that are replicated accurately. I assume this because if the error rate were too high, information could not be maintained, and indeed the sequence of symbols would strictly speaking not be information, precisely because it could not persist. Note that the critical error rate depends on the population size, because in very large populations a few error-free sequences might be generated even if most copies are flawed, which is all that is required for the maintenance of information. All these statements can be made precise, but I will not undertake this here.

Let us first define the fraction of molecules $F(x>0)$ with replication rate $x>0$ as $F(x)=N_\nu/N$, where $N_\nu$ is the number of self-replicators (at any rate $x$ or larger), and $N$ is the total number of possible molecules of that length. If we assume (I will shortly relax this assumption) that all potential self-replicators appear with equal probability, then the Shannon entropy (also called “uncertainty”) of the self-replicating ensemble is $H=\log N_\nu$. This uncertainty is related to the probability of finding such a self-replicating sequence by chance in a set of other sequences. The probability to find such a sequence among all possible sequences is of course just $F(x)$. The information self-replicators have (about what it takes to self-replicate in their world) is the difference between the maximal and the actual entropy (Adami and Cerf 2000; Adami 2004, 2012)

$$ I_{S}=H_{\max}-\log N_{\nu} =L-H\;, $$
(1)

because the entropy of a random polymer of length L is $H_{\max}=L$ mers (if we take logarithms with respect to base D) (Adami 2004). Because we also have $H_{\max}=\log N$, Eq. 1 implies a relationship between the fraction of functional molecules and their information content, as suggested by Szostak (2003): $I_S=-\log F(x>0)$. Here I will go beyond this relationship and explore a more accurate estimate of H that takes into account the composition of the polymer in terms of monomers, the relative frequency of the polymers, and the rate at which each monomer is produced abiotically.
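
To make this relationship concrete, here is a minimal numerical sketch in Python (the alphabet size and fraction of functional sequences below are hypothetical values chosen purely for illustration):

```python
import math

def information_from_fraction(F, D):
    """Information I_S (in mers, i.e., units of log base D) implied by a
    fraction F of functional sequences: I_S = -log_D F (Szostak 2003)."""
    return -math.log(F, D)

def fraction_from_information(I_S, D):
    """Inverse relation: fraction of functional sequences F = D**(-I_S)."""
    return D ** (-I_S)

# Hypothetical example: one in a billion random sequences over a
# four-letter alphabet (D = 4) can self-replicate.
D, F = 4, 1e-9
I_S = information_from_fraction(F, D)
print(f"I_S = {I_S:.2f} mers = {I_S * math.log2(D):.1f} bits")
print(f"round trip: F = {fraction_from_information(I_S, D):.1e}")
```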

Let us first relax the assumption that all possible replicators appear with equal frequency in the population. Let us instead enumerate all replicators in terms of their genotype i. Then, the entropy of self-replicators is

$$ H=-\sum\limits_{i=1}^{N_{\nu}}p_{i}\log_{D} p_{i}\;, $$
(2)

where $p_i$ is the likelihood to find genotype i in an infinite population. Of course, we do not know those probabilities, as infinite populations do not exist, so we will have to estimate them. We do this by writing the entropy of the molecules in terms of the entropy of monomers. Indeed, if all monomers in the sequence were independent of each other, then the entropy of the random variable that represents the ith polymer

$$ X_{i}=X^{(1)}_{i}\times X^{(2)}_{i}\times \cdots \times X^{(L)}_{i} $$
(3)

could simply be written as the sum of the entropies of each monomer variable $X_{i}^{(j)}$ in that sequence

$$ H(X_{i})=\sum\limits_{j=1}^{L} H(X_{i}^{(j)})\;. $$
(4)

But we know from general arguments that monomers in a biological polymer are not independent, because if they were, self-replicators would be easy to find. Indeed, if monomers were independent, then the time it takes to find a particular sequence of length L is of the order L (while it is of the order $D^L$ in the worst-case scenario). However, information theory allows us to write the entropy Eq. 2 in terms of monomer entropies as well as correlation entropies. Correlation entropies between pairs of monomers are called “information”, but higher-order correlations can exist too. If we define the Shannon information between two residues i and j as $H(i:j)$, the correlation entropy between three residues i, j, and k as $H(i:j:k)$, and so on, we can write Eq. 2 as

$$ H=\sum\limits_{i=1}^{L} H(i)-\sum\limits_{i>j}^{L}H(i:j)+\sum\limits_{i>j>k}^{L}H(i:j:k)-{\cdots} $$
(5)

In Eq. 5, the series runs over correlation entropies of increasing order with alternating signs, culminating in the term $(-1)^{L-1}H(1:2:3:\cdots:L)$.
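
In practice, the leading terms of Eq. 5 can be estimated from a finite sample of functional sequences (an alignment). The following Python sketch computes the per-site entropies H(i) and the pairwise informations H(i:j) in base-D units and truncates the expansion after the second-order terms; the toy “alignment” is invented purely for illustration, and finite-sample corrections to the entropies are ignored.

```python
import math
from collections import Counter
from itertools import combinations

def entropy(column, D):
    """Shannon entropy (base D) of one alignment column (a list of symbols)."""
    n = len(column)
    return -sum(c / n * math.log(c / n, D) for c in Counter(column).values())

def pair_information(col_i, col_j, D):
    """Pairwise information H(i:j) = H(i) + H(j) - H(i,j), in base-D units."""
    joint = list(zip(col_i, col_j))
    return entropy(col_i, D) + entropy(col_j, D) - entropy(joint, D)

def sequence_entropy(seqs, D):
    """Eq. (5) truncated after the pairwise terms: sum_i H(i) - sum_{i>j} H(i:j)."""
    L = len(seqs[0])
    cols = [[s[i] for s in seqs] for i in range(L)]
    first_order = sum(entropy(col, D) for col in cols)
    second_order = sum(pair_information(cols[i], cols[j], D)
                       for i, j in combinations(range(L), 2))
    return first_order - second_order

# Toy alignment of "functional" sequences over a D = 4 alphabet.
seqs = ["ACGA", "ACGT", "ACGA", "TCGT", "ACGA", "TCGT"]
H = sequence_entropy(seqs, D=4)
print(f"H ≈ {H:.2f} mers, so I_S ≈ L - H = {len(seqs[0]) - H:.2f} mers")
```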

So now, the information contained in a molecule is

$$ I_{S}=L-H=L-\sum\limits_{i=1}^{L} H(i)+\sum\limits_{i>j}^{L}H(i:j)-\cdots\;. $$
(6)

Let me assume for a moment that we do not have to worry about the higher-order informations $H(i:j)$, $H(i:j:k)$, and so on, in the sense that these terms will be small compared to the first-order estimate $L-\sum_{i} H(i)$. This is not at all an obvious assumption, because while each of the higher-order terms might be small, there is an exponentially increasing number of them as the order increases. We can start worrying about these terms when everything else is said and done: at the moment let me just mention that I have seen very few cases where terms of the order $H(i:j:k)$ or higher play a role, while pairwise correlations such as $H(i:j)$ can play very important roles indeed (Adami and Cerf 2000; Gupta and Adami 2014).

For the sake of the argument, let me just consider the size of $I_{S}=L-\sum_{i}^{L}H(i)$. Such a term might be large, in particular if the positions are fairly well conserved ($H(i)\approx 0$). By the preceding arguments, such a sequence will be very unlikely to emerge by chance, as this probability is

$$ P_{0}=F(x)=D^{-I_{S}}\;. $$
(7)

Now let me make one other simplifying assumption, namely that the entropy at each site is roughly the same, equal to $H_b$ (‘b’ for “biotic”):

$$ H(i)\approx H_{b} \ \ \ \ \forall i\;. $$
(8)

Such an assumption is of course ludicrous when we think about the per-site entropy of known biomolecules, which varies tremendously from site to site (see, e.g., Adami 2004, 2012; Gupta and Adami 2014). Let us thus say that $H_b$ is really the average per-site entropy

$$ H_{b}=\frac1L{\sum\limits_{i}^{L}}H(i) $$
(9)

so that

$$ I_{S}=L-LH_{b}=L(1-H_{b})\;. $$
(10)

What sets the value of $H_b$? At each site, the entropy is determined by how often any particular monomer is found there on average in a typical functional protein. Suppose that each monomer appears on average with probability $q_j$ in such an informational molecule. The entropy of an average site is then

$$ H_{b}=-\sum\limits_{j=1}^{D} q_{j}\log_{D} q_{j}\;. $$
(11)

If each monomer occurs roughly with the same frequency $q_j\approx 1/D$, then $H_b\approx 1$, and $I_S=0$. Indeed, it is not possible to encode information in such a way, unless information is stored in higher-order correlations. In Eq. 10 we assumed that the set of all possible molecules had entropy L, which came from the assumption that in random (abiotic) molecules each monomer did indeed appear with probability 1/D. What if that was not the case?
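
The point is easy to verify numerically. A short sketch (with made-up monomer frequencies) shows that uniform monomer usage gives H_b = 1 and hence I_S = 0 in Eq. 10, while any bias in usage lowers H_b and creates a non-zero information content:

```python
import math

def entropy_mers(p, D):
    """Shannon entropy of a monomer distribution p, in base-D units (mers)."""
    return -sum(x * math.log(x, D) for x in p if x > 0)

D, L = 4, 100
uniform = [1 / D] * D                 # every monomer used equally often
biased  = [0.70, 0.20, 0.06, 0.04]    # hypothetical biased usage frequencies

for q in (uniform, biased):
    H_b = entropy_mers(q, D)
    print(f"H_b = {H_b:.3f} mers  ->  I_S = L(1 - H_b) = {L * (1 - H_b):.1f} mers")
```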

What if, by chance, monomers in random molecules have different frequencies? Indeed, amino acids in abiotic proteins do not occur at equal frequencies at all. Rather, their abundance is dictated by the rate at which they form abiotically (see, e.g., Dorn et al. 2011 and references therein). Let us say this abundance distribution is $\pi_j$, with entropy

$$ H_{a}=-\sum\limits_{j=1}^{D}\pi_{j}\log_{D}\pi_{j}\;, $$
(12)

and the information is then

$$ I_{S}=L(H_{a}-H_{b})\;. $$
(13)

We now see that if by sheer luck the abiotic entropy $H_a$ is close to the biotic one $H_b$, then the entropy gap can be made arbitrarily small. And because the likelihood to find a sequence with information $I_S$ by chance is $D^{-I_S}$, such a reduction in the entropy gap could affect the likelihood of the spontaneous emergence of that information tremendously.

Imagine for example that by chance the biotic distribution $q_j$ is fairly close to the abiotic one, that is, $q_j=\pi_j(1+\varepsilon_j)$, where $\varepsilon_j\ll 1$ is symmetrically distributed around 0, so that $\sum_{j}\varepsilon_{j}=0$. Then

$$ H_{a}-H_{b}=-\sum\limits_{j}(\pi_{j}\log_{D}\pi_{j}-q_{j}\log_{D} q_{j})\approx \sum\limits_{j}\pi_{j}{\varepsilon_{j}^{2}} +{\mathcal O}(\varepsilon^{4})\;. $$
(14)

Note that because I assumed that the biotic distribution $q_j$ is derived from the abiotic one, the expression for the entropy gap Eq. 14 is guaranteed to be positive.

How big of a difference does a reduced entropy gap make? It can be dramatic, as I will now show. Consider for example the entropy gap for self-replicators in the digital-life system avida (Adami and Brown 1994; Adami 1998; Wilke and Adami 2002; Ofria and Wilke 2004; Adami 2006), a system in which self-replicating computer programs mutate and evolve to adapt to a fitness landscape that can be specified by the user. A simple self-replicator can be written in avida using only 15 instructions drawn from an alphabet of D=26 (these 15 mers are the equivalent of about 70 bits of information, as $15\times\log_2 26\approx 70.5$). The probability to find such a self-replicator by chance is rather low:

$$ P_{0} = 26^{-15}\approx 6\times 10^{-22}\;. $$
(15)

If we were to test a million random sequences per second on a parallel cluster of 1,000 CPUs, a random search for a single self-replicating sequence would take over 50,000 years, on average. Now, it makes no sense to test sequences of length 15, because presumably the per-site entropy of such a compressed replicator is about zero, so there is virtually no redundancy. Let us imagine that instead we test sequences of length 100 that have the same information content of 15 mers. In that case, we expect the average per-site entropy of the self-replicator to be about 0.85 mers, so that $I_S=100\times(1-0.85)=15$; that is, the per-site entropy gap for self-replicators of this type is $\Delta H=0.15$. $P_0$ is of course still given by Eq. 15, which still all but rules out a random search.
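
The waiting-time estimate above is simple arithmetic; the following sketch reproduces it (the search rate is the one assumed in the text):

```python
D, I_S = 26, 15                  # avida alphabet size and information in mers
P0 = D ** (-I_S)                 # Eq. (15): chance that a random program replicates
rate = 1e6 * 1_000               # assumed: 10^6 sequences/s on each of 1,000 CPUs
years = 1 / (P0 * rate) / (3600 * 24 * 365)
print(f"P0 = {P0:.1e}, expected search time ≈ {years:,.0f} years")
```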

What if we change the abiotic entropy to the one that we expect for a replicator? We can estimate this biotic entropy $H_b$ from the entropy profile for avidians studied by Dorn et al. (2011), reproduced in Fig. 1 for a typical case.

Fig. 1 Probability distribution of instructions (in random order on the x-axis) for an adapted avidian self-replicator with D=29 (from Dorn et al. 2011). The largest fraction is the nop-A instruction, which is special in avida as it is used to initialize empty instructions. The dashed line is the uniform prior, with entropy $H_a=1$

This distribution has an average entropy of ≈0.87 mers per site. Let us imagine then that avidian instructions are not produced with equal likelihood ($H_a=1$, the uniform prior given by the dashed line in Fig. 1) but with an entropy more like the actual observed one, say $H_a\approx 0.9$. This seemingly innocuous change turns the probability $P_0$ into the probability $P_\star$

$$ P_{0}\to P_{\star} = 26^{-100(0.9-0.85)}=26^{-5}\;. $$
(16)

The effective information appears to have been reduced from 15 to 5! The probability to find the self-replicator is now $P_\star\approx 8.4\times 10^{-8}$. When Rupp et al. performed a random search for self-replicators of length 100 using the biased $H_a$ (Rupp et al. 2006), that is, by drawing monomers not uniformly at random but in a biased manner according to the probabilities we see in Fig. 1, they found 10 self-replicators among the $2\times 10^{8}$ sequences they tried, that is, a rate of $5\times 10^{-8}$, which is well in line with the reduced estimate above. But we should keep in mind that it is important here not just to match the biotic entropy (after all, this can be achieved with any number of distributions), but to mimic the distribution that appears in biotic polymers, such as the one in Fig. 1 for avidians, so that the entropy gap is reduced as in Eq. 14.
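
The enhancement can be read off Eq. 16 directly; the sketch below computes $P_\star$ from the entropies quoted above and compares it to the empirical rate of Rupp et al. (2006):

```python
D, L = 26, 100
H_a, H_b = 0.9, 0.85               # biased abiotic entropy vs. per-site biotic entropy
P_star = D ** (-L * (H_a - H_b))   # Eq. (16)
empirical = 10 / 2e8               # 10 replicators found among 2 x 10^8 random trials
print(f"P_star = {P_star:.1e}  vs.  empirical rate = {empirical:.1e}")
```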

Of course, the entropy gap for biomolecules could be much more severe than for digital organisms. We know, for example, that the information content of the HIV-1 protease (a 99-mer molecule) is approximately 75 mers (Gupta and Adami 2014). The probability to find such a molecule by random search is astronomically low, $P_0=20^{-75}\approx 2.6\times 10^{-98}$. Now, it is of course well known that the probability distribution describing the abiotic generation of amino acids is far from uniform. Indeed, if we restrict ourselves to only the 20 amino acids used in biochemistry, the abiotic distribution actually has a significantly smaller entropy than the biotic one, simply because the heavy amino acids are not formed abiotically at all (see Dorn et al. 2011). It is clear that the biotic distribution has evolved to be far from the initial abiotic distribution, so we can no longer assume $q_j=\pi_j(1+\varepsilon_j)$. Let us instead investigate what happens to $P_0$ if we change the abiotic distribution away from uniform. In a 99-mer molecule with 75 mers of information, the average entropy per site is $24/99\approx 0.2424$; that is, we write the information content of the protease as $I_P=99\times(1-0.2424)$. If we assume that the abiotic entropy per monomer is 0.5 rather than the 1.0 coming from the uniform assumption, the probability to find this molecule by chance changes to

$$ P_{\star}=20^{-99(0.5-0.2424)}\approx 6.6\times 10^{-34}\;, $$
(17)

an enhancement of about 64 orders of magnitude. When such extreme amplifications of the likelihood of the spontaneous discovery of information are possible, scenarios that were previously deemed impossible (Shapiro 2000) could move much closer to reality. For example (even though I do not believe that the first self-replicator was a protein), the abundance distribution of amino acids (and of other molecules such as mono-carboxylic acids) found in meteorites is much closer to that found in sediment than to the distribution found in synthesis experiments (Dorn et al. 2011), indicating that the abiotic formation distribution depends strongly on the local environment, and that environments more conducive to mimicking biotic distributions may well occur outside of Earth.
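
The same arithmetic applies to the protease example; the sketch below reproduces the two probabilities and the size of the enhancement quoted above:

```python
import math

D, L = 20, 99
H_b = 0.2424                      # average biotic entropy per site (24/99)
P0     = D ** (-L * (1.0 - H_b))  # uniform abiotic distribution (H_a = 1)
P_star = D ** (-L * (0.5 - H_b))  # abiotic entropy of 0.5 per monomer, Eq. (17)
orders = math.log10(P_star / P0)
print(f"P0 = {P0:.1e}, P_star = {P_star:.1e}, gain ≈ {orders:.0f} orders of magnitude")
```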

To be fair, nobody expects life to have originated from a self-replicator of about 100 instructions drawn from an alphabet of 20-30 monomers. I presented these examples only to illustrate how abiotic monomer distributions that are close to the biotic distribution can enhance the likelihood of stumbling upon a rare self-replicator by chance by many orders of magnitude, just as the likelihood to find a self-replicator was enhanced tremendously in the biogenesis experiments with avida by Rupp et al. (2006).

Indeed, even efforts to create self-replicating sets of molecules in the laboratory that are not related to terran life support this view. Lincoln and Joyce created a self-replicating system in the laboratory encoding about 86 bits of information, using existing RNA enzymes (even though only 24 of those bits were evolvable, see Lincoln and Joyce (2009)). The spontaneous emergence of this replicator from an unbiased library is, soberingly, all but impossible, as $P_0\approx 7.7\times 10^{-25}$, or three orders of magnitude smaller than the likelihood I calculated for the spontaneous emergence of an avidian self-replicator.

Clearly, more work is required to test the relationship between spontaneous emergence and biased monomer abundance distributions. We could, for example, study different prior (abiotic) distributions and conduct biogenesis experiments in avida, just as Rupp et al. (2006) did, to see if the correlation between a reduced gap and enhanced emergence holds quantitatively. In the non-digital realm, we could repeat the experiment of Keefe and Szostak (2001), who searched for proteins that bind ATP within a library of $6\times 10^{12}$ randomly generated 80-mer proteins. Among this random set they found four proteins that bound ATP, suggesting that the information necessary for ATP binding is $I_S=-\log_D(\tfrac{2}{3}\times 10^{-12})\approx 9.4$ mers, a value not inconsistent with deletion experiments and sequence analysis. By creating a biased (rather than random) library that takes into account the amino acid frequency bias of the ATP-binding proteins they found, it should be possible to increase significantly the fraction of ATP-binding proteins found by chance. Indeed, Hackel et al. showed that biasing protein libraries with conserved domains (zero-entropy regions) as well as variable regions constrained by the entropic profile of functional molecules, as described here, leads to an increased rate of finding functional proteins by chance (Hackel et al. 2010), compared to the rate observed in unbiased libraries.
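
As a sketch of how such a biased library could be generated in silico, one would sample each position from the estimated biotic frequencies rather than uniformly; the amino-acid frequencies below are invented placeholders, not the measured composition of the Keefe and Szostak binders:

```python
import random

random.seed(0)  # reproducible sketch

# Hypothetical amino-acid frequencies (placeholder values only; a real bias
# would be estimated from the composition of the ATP-binding proteins found).
amino_acids  = "ACDEFGHIKLMNPQRSTVWY"
biotic_freqs = [0.12, 0.02, 0.06, 0.06, 0.03, 0.09, 0.02, 0.04, 0.05, 0.08,
                0.02, 0.04, 0.04, 0.04, 0.05, 0.07, 0.05, 0.07, 0.01, 0.04]

def biased_library(n_seqs, length=80):
    """Generate an 80-mer library whose monomers are drawn from the biased
    (biotic-like) distribution rather than uniformly at random."""
    return ["".join(random.choices(amino_acids, weights=biotic_freqs, k=length))
            for _ in range(n_seqs)]

print(biased_library(n_seqs=3)[0])
```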

Even though we still do not know which set of monomers gave rise to the first self-replicator (if ever there was one), the information-theoretic musings I have presented here should convince even the skeptics that, within an environment that produces monomers at relative ratios not too far from those found in a self-replicator, the probabilities can move very much in favor of the spontaneous emergence of life. For every candidate chemistry, then (where a self-replicator can be constructed), we should look for the environment that is best suited to produce it.