The cellular world

All self-sufficient life forms on earth share a cellular organization. Myriads of parasites are also cellular but exploit the resources of other cellular organisms to reproduce; some of these parasites go through their entire life cycles inside other cells. The second vast empire of life forms consists of viruses and other selfish elements that possess genetic but not biochemical autonomy and fully depend on cellular hosts for their reproduction (Koonin 2011; Koonin and Dolja 2013). I believe that a discussion of the origin and early evolution of cells is particularly appropriate for the celebratory anniversary issue of Antonie van Leeuwenhoek. Although Van Leeuwenhoek might not have been the discoverer of cells sensu stricto, he definitely discovered unicellular organisms and the remarkable diversity of the microbial world (Dobell 1932), the origin of which is the theme of the present article.

Even the simplest known cells are exquisitely organized agglomerates of intricate macromolecular complexes. The two classes of such complexes that define the cellular state and cleanly separate cells from virus-like entities are (i) membrane-embedded energy transformation and molecular transport systems and (ii) the translation system that makes all the proteins required for cell function. A fundamental and striking feature of cells is that formation of a cell de novo has never been observed. According to the famous dictum of Rudolf Virchow, Omnis cellula e cellula, i.e. new cells are generated exclusively from old ones, by various forms of division or budding (Virchow 1858). Using the definitions of Szathmary and Maynard Smith (Szathmary and Maynard Smith 1997), cells are reproducers rather than replicators, i.e. cellular reproduction is not reducible to genome replication, in contrast to the reproduction of many simple selfish elements. Indeed, all the advances of synthetic biology notwithstanding, we are far from being able to generate a cell using genomic DNA alone, whereas, for example, positive-strand RNA viruses or plasmids can be reproduced indefinitely starting from pure RNA or DNA.

The simplest free-living prokaryotes, including autotrophs, encompass about 1,200 genes (Giovannoni et al. 2005; Koonin and Wolf 2008). Attempts to reconstruct the gene set of a minimal cell by combining comparative genomic data on gene conservation with biological considerations yield estimates of about 250–300 genes (Gil et al. 2004; Koonin 2003; Mushegian 1999; Mushegian and Koonin 1996). Numerous intracellular parasitic bacteria with gene repertoires of this size, and even smaller, have indeed been isolated (McCutcheon and Moran 2012; Moran et al. 2008). Some of the smallest bacteria in this category, with fewer than 200 genes, have lost genes for certain components of the translation system and might be on their way to becoming organelles (Nakabachi et al. 2006; Sloan et al. 2014). Endosymbiotic organelles, most notably mitochondria and chloroplasts, arguably represent the penultimate stage of cellular degradation (Gray 1999, 2012; Gray et al. 2001; Lang et al. 1999). These organelles retain their own genomes, albeit with very few genes, their own internal translation systems and their own membranes, although the latter are substantially modified from the ancestral bacterial membranes. In particular, the organelle membranes are equipped with protein import systems that deliver into the organelle the protein products of former endosymbiont genes that have been transferred to the nuclear genome of the host. Thus, the organelles remain cell-like reproducers but most of their protein-coding capacity has been relegated to the genome of a distinct reproducer, the host cell. The ultimate stage of cellular degradation is represented by derivatives of the mitochondria, such as hydrogenosomes and mitosomes, that have lost their genomes and translation systems but retain the membrane (Embley 2006; van der Giezen 2009). All the proteins required for the function of these organelles are encoded in the nucleus, synthesized by the cytoplasmic translation system and transported into the organelle. Thus, in this case, the membrane seems to be the only remaining entity that maintains the cell-like status of the organelles.

There are strong indications that all cellular life forms are infected by viruses or virus-like selfish elements, and the genomes of most cells contain multiple integrated genomes of such elements (Koonin and Dolja 2013). Strikingly, viruses are the most abundant biological entities on earth, in terms of the number of particles, total mass and genetic complexity (Edwards and Rohwer 2005; Kristensen et al. 2010; Rohwer 2003; Suttle 2005, 2007). Theoretical modeling studies suggest that selfish elements have inevitably parasitized independent replicators since the very early stages of evolution, once the replicators reached a certain minimal complexity (Szathmary and Demeter 1987; Takeuchi and Hogeweg 2008, 2012; Takeuchi et al. 2011).

Quoting Carl Woese, the origin of cells is “the greatest of evolutionary problems” (Woese 2002). In a sense, all of biological evolution that occurred after the origin of cells—or more precisely, cells with their associated selfish elements—is routine compared to that momentous breakthrough (Koonin 2014; Koonin et al. 2006; Woese 2002). Given the intricacy of the molecular machinery of even the simplest cells, the lack of any observations on cell origin, except from other cells, and the lack of any apparent intermediate stages in the evolution of cells, the origin of cells engenders the ominous specter of irreducible complexity. Of course, this only means that we might still lack the essential data and theoretical models, let alone experimental approaches, to reconstruct the early stages of cellular evolution. In the rest of this article, I attempt to make the most of the few relevant observations and ideas that are at our disposal.

Comparative genomics, ancestral gene repertoires, and LUCA

As numerous complete genomes of diverse organisms become available, comparative genomics turns into a truly powerful methodology (Delsuc et al. 2005; Doolittle 2005; Koonin et al. 2000; Wolfe and Li 2003). It has the ability not only to determine which genes are conserved and which are not, but also to reconstruct the gene composition of ancestral life forms including the hypothetical Last Universal Common (Cellular) Ancestor (LUCA)—under certain assumptions, of course (Charlebois and Doolittle 2004; Glansdorff et al. 2008; Harris et al. 2003; Koonin 2003; Mushegian 2008). The key assumption is that genes shared by many diverse extant species are most likely to be inherited from the common ancestor of these species; in particular, genes that are present in all modern cellular life forms hark back to LUCA. The number of such ubiquitous genes is very small, fewer than 60, and nearly all of them encode proteins involved in translation and the core transcription machinery; adding the genes for rRNAs and tRNAs, the universal set comprises about 100 genes (Charlebois and Doolittle 2004; Harris et al. 2003; Koonin 2003).

This limited repertoire of genes obviously could not provide for a viable life form, so a considerable number of genes that must have been present in LUCA were lost or displaced by non-orthologous but functionally analogous genes in some lines of descent during the subsequent evolution. Consequently, formal evolutionary reconstruction approaches have to be applied in order to delineate the likely gene complement of LUCA. The simplest reconstruction methods are based on the principle of evolutionary parsimony, i.e., attempt to derive the evolutionary scenario that includes the smallest number of elementary events (the most parsimonious scenario) (Kunin and Ouzounis 2003; Mirkin et al. 2003; Snel et al. 2002). The set of relevant events is small: (i) gene “birth”, that is, emergence of a new gene, typically, via gene duplication followed by radical divergence, (ii) gene acquisition via horizontal gene transfer (HGT), (iii) gene loss. Counting these events for different scenarios and choosing the one with the minimum number of events seems to be a straightforward task. However, realization of this goal encounters hurdles at several levels, resulting in highly conservative estimates of ancestral gene repertoires, with the amount of underestimate remaining uncertain. The more sophisticated maximum likelihood approaches in genome reconstruction rely on the same set of elementary events but the analysis of these events is formalized in a gene birth-and-death model so that for each gene a posterior probability of presence in each ancestral node of the underlying tree is estimated (Csuros 2010; Csuros and Miklos 2009; Csuros et al. 2011).
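The event-counting logic of the parsimony approach can be illustrated with a minimal sketch. The code below is not the procedure used in the cited studies; it is a generic weighted-parsimony (Sankoff-type) recursion applied to the presence or absence of a single gene on a toy tree. The tree, the gene profile, the function names and the penalty values are all hypothetical, and charging one "birth" event when the gene is placed in the root is an assumption made here so that the two scenarios are compared on equal terms.

```python
from math import inf

def sankoff(tree, present, gain_cost, loss_cost):
    """Return (min cost if the gene is absent at this node,
               min cost if the gene is present at this node)."""
    if isinstance(tree, str):  # leaf: the observed state is free, the other impossible
        return (inf, 0.0) if present[tree] else (0.0, inf)
    cost_absent = cost_present = 0.0
    for child in tree:
        child_absent, child_present = sankoff(child, present, gain_cost, loss_cost)
        # Parent lacks the gene: the child either lacks it too or gained it (birth or HGT).
        cost_absent += min(child_absent, child_present + gain_cost)
        # Parent has the gene: the child either kept it or lost it.
        cost_present += min(child_present, child_absent + loss_cost)
    return cost_absent, cost_present

def infer_root(tree, present, gain_cost=1.0, loss_cost=1.0):
    """Compare the cheapest scenarios with and without the gene in the root."""
    cost_absent, cost_present = sankoff(tree, present, gain_cost, loss_cost)
    cost_present += gain_cost  # assumption: one gene birth before the root
    return {"absent": cost_absent, "present": cost_present,
            "gene_in_root": cost_present < cost_absent}

# Hypothetical toy data: four genomes in two clades, gene found in A and C only.
TOY_TREE = (("A", "B"), ("C", "D"))
TOY_PROFILE = {"A": 1, "B": 0, "C": 1, "D": 0}

if __name__ == "__main__":
    for g in (1.0, 3.0):
        print(g, infer_root(TOY_TREE, TOY_PROFILE, gain_cost=g))
```

With equal penalties for gain and loss, the toy gene is assigned to two independent gains; with a gain penalty of 3, it is assigned to the root followed by two losses. The sensitivity of the inferred ancestral gene set to the assumed relative cost of gain (including HGT) versus loss is one of the hurdles mentioned above.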

Not only maximum parsimony but even the probabilistic approaches to the reconstruction of genome history have major limitations. First, these methods rely only on the information on the presence or absence of individual genes (more precisely, members of orthologous gene clusters) in extant genomes. In principle, however, even a ubiquitous gene might not be ancestral but rather might have spread via a sweep of HGT. Second, although probabilistic methods do not rule out the presence of currently rare genes in ancestral forms, the posterior probabilities assigned to such genes are necessarily low and cannot fully account for possible independent losses in many lineages. Finally, and perhaps most fundamentally, any evolutionary reconstruction is predicated on a defined topology of the underlying organismal tree, which is the “tree of life” (TOL) when the reconstruction of the LUCA is involved. With the discovery of extensive HGT in archaea and bacteria, the relevance and indeed the validity of the TOL have been sharply questioned (Bapteste and Boucher 2009; Bapteste et al. 2005; Doolittle 2000, 2009; Doolittle and Bapteste 2007; O’Malley and Boucher 2005; O’Malley and Koonin 2011). However, statistical analysis of the “forest” of thousands of individual gene trees reveals a clear trend of vertical inheritance that is centered at the tree for (nearly) universal genes, primarily those encoding translation system components (Puigbo et al. 2009, 2012, 2013). The prominence of this trend vindicates the reconstruction of ancestral gene sets using the translation-centered evolutionary tree as a scaffold, and the rest of the problems faced by this reconstruction can be considered technical.
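The second limitation can be made concrete with a small numerical sketch. The code below is not the birth-and-death model of the cited studies; it is a generic two-state (absent/present) Markov model evaluated with the standard pruning recursion, and the rates, branch lengths, star-shaped tree and use of the stationary frequency as the root prior are all assumptions chosen only for illustration.

```python
from math import exp

def transition(gain, loss, t):
    """2x2 transition matrix P[parent_state][child_state] for branch length t."""
    total = gain + loss
    pi1 = gain / total                 # stationary frequency of "present"
    decay = exp(-total * t)
    p01 = pi1 * (1.0 - decay)          # absent -> present
    p10 = (1.0 - pi1) * (1.0 - decay)  # present -> absent
    return ((1.0 - p01, p01), (p10, 1.0 - p10))

def partial_likelihood(node, present, gain, loss):
    """Likelihood of the leaf data below `node` given each state at `node`.
    A leaf is a name (str); an internal node is a tuple of (child, branch_length)."""
    if isinstance(node, str):
        return (0.0, 1.0) if present[node] else (1.0, 0.0)
    like = [1.0, 1.0]
    for child, t in node:
        child_like = partial_likelihood(child, present, gain, loss)
        p = transition(gain, loss, t)
        for state in (0, 1):
            like[state] *= p[state][0] * child_like[0] + p[state][1] * child_like[1]
    return tuple(like)

def root_posterior(tree, present, gain=0.1, loss=1.0):
    """Posterior probability that the gene was present at the root."""
    like0, like1 = partial_likelihood(tree, present, gain, loss)
    pi1 = gain / (gain + loss)         # assumed prior at the root
    return pi1 * like1 / ((1.0 - pi1) * like0 + pi1 * like1)

if __name__ == "__main__":
    names = "ABCDEFGH"
    star = tuple((name, 0.5) for name in names)       # hypothetical star tree
    rare = {name: int(name == "A") for name in names}   # gene in 1 of 8 genomes
    common = {name: int(name != "A") for name in names} # gene in 7 of 8 genomes
    print("rare gene:  ", root_posterior(star, rare))
    print("common gene:", root_posterior(star, common))
```

Under these assumed parameters, a gene found in a single genome receives a very low posterior probability of ancestral presence even though loss is ten times more frequent than gain, whereas a gene found in seven of the eight genomes is assigned to the root with near certainty.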

All the difficulties and uncertainties of evolutionary reconstruction notwithstanding, parsimony and probabilistic methods confidently map to the LUCA at least several hundred genes (Koonin 2003; Mirkin et al. 2003). Analyses performed from a different angle and reaching even deeper into the history of life, through the reconstruction of the evolution of individual protein families, yield results that are compatible with the extensive gene repertoire of the LUCA and shed some light on the preceding stages of evolution (Anantharaman et al. 2002; Aravind et al. 2002a, b; Leipe et al. 2002). The reconstructed gene set of the LUCA encompasses several families of paralogs, for example among aminoacyl-tRNA synthetases or translation factors, indicating that extensive evolution of protein families antedated the LUCA.

Given the uncertainty associated with the methods for evolutionary reconstruction, perhaps the most compelling argument for a complex LUCA is the elaborate organization of the modern translation machinery that, almost in its entirety, comprises indisputable LUCA heritage. The functioning of such an advanced translation system is predicated on commensurate metabolic capabilities including not only the pathways for the synthesis of all nucleotides and (nearly) all amino acids but also those for a variety of coenzymes, e.g. S-adenosylmethionine, the cofactor of numerous RNA methylases, several of which can be traced back to LUCA with high confidence (Anantharaman et al. 2002; Kozbial and Mushegian 2005).

The richness of the reconstructed gene repertoire of the LUCA would imply that this ancestral life form possessed the same level of organization as modern bacteria and archaea if not for two glaring holes in its reconstructed gene set: (i) the key components of the DNA replication machinery, namely the polymerases that are responsible for the initiation (primases) and elongation of DNA replication, and for gap-filling after primer removal, and the principal DNA helicases, and (ii) most of the enzymes of lipid biosynthesis. These essential proteins fail to make it into the reconstructed gene repertoire of LUCA because the respective processes in bacteria, on the one hand, and archaea, on the other hand, are catalyzed by distinct, unrelated enzymes and, in the case of membrane phospholipids, yield chemically distinct membranes (the archaeal membrane phospholipids are isoprenoid ethers of glycerol 1-phosphate whereas bacterial lipids are fatty acid esters of glycerol 3-phosphate, i.e., the lipids in the two domains differ not only in their chemical composition but also in chirality) (Edgell and Doolittle 1997; Leipe et al. 1999; Lombard et al. 2012; Martin and Russell 2003; Mushegian and Koonin 1996; Pereto et al. 2004).

Following the most straightforward logic, one might hypothesize that the LUCA possessed neither DNA replication nor membranes. However, it is almost certain that such a conclusion would have been a gross over-simplification. Any reconstruction of the evolutionary events involving the LUCA must account for both the universally conserved and the archaea- and bacteria-specific components of the DNA replication and membrane enzymatic machineries. Indeed, although the polymerases and replicative helicases are distinct, other key replication proteins, such as the sliding clamp, clamp loader and single-stranded DNA-binding protein, are orthologous in bacteria and archaea (along with eukaryotes) (Leipe et al. 1999; Makarova and Koonin 2013). Similarly, although bacteria and archaea possess distinct membranes, some key membrane protein complexes, such as the rotary ATPases, the signal recognition particles and the Sec translocon, are universal and supposedly ancestral, suggestive of the presence of some form of membranes in the LUCA (Beckwith 2013; Koonin and Martin 2005; Mulkidjanian et al. 2009).

The “uniformitarian assumption”, namely, that LUCA was a more or less regular, modern-type cell, akin to the extant bacteria and archaea, often seems to be adopted in accounts of early evolution without much critical evaluation (Forterre et al. 1992; Forterre and Philippe 1999; Forterre et al. 2005). To account for the lack of conservation of key elements of the DNA replication and membrane biogenesis machineries, the uniformitarian hypotheses of LUCA invoke one of two scenarios:

  (i) LUCA somehow combined both versions of these systems, with subsequent differential loss in the archaeal and bacterial lineages.

  (ii) LUCA had a particular version of each of these systems, with subsequent non-orthologous displacement in archaea or bacteria.

Specifically, with respect to membrane biogenesis, it has been proposed that LUCA had a mixed, heterochiral membrane, the two versions with opposite chiralities emerging as a result of subsequent specialization in archaea and bacteria (Koga 2011; Pereto et al. 2004). With regard to the DNA replication, a hypothesis has been developed under which one of the modern replication systems is ancestral whereas the other system evolved in viruses and subsequently displaced the original one in either the archaeal or the bacterial lineage (Forterre 1999).

More radical proposals on LUCA’s nature take a “what you see is what you get” approach by postulating that LUCA lacked those key features that are not homologous in extant archaea and bacteria, at least in their modern form (Koonin 2009). The possibility that LUCA was substantially different from any known cells was originally raised in the concept of the “progenote”, a hypothetical, primitive entity in which the link between the genotype and the phenotype was not yet firmly established (Woese and Fox 1977). In its original form, the progenote idea involves primitive, imprecise translation, a notion that is not viable given the extensive diversification of proteins prior to LUCA that is demonstrated beyond doubt by the analysis of diverse protein superfamilies. More realistically, it can be proposed that the emergence of the major features of cells was substantially asynchronous (Koonin 2014; Woese 1998) so that LUCA closely resembled modern cells in some ways but was distinctly “primitive” in others (in the memorable phrase of Carl Woese, different cellular systems “crystallized” asynchronously). Focusing on the major areas of non-homology between archaea and bacteria, it has been hypothesized that LUCA:

  (i) did not have a typical, large DNA genome (Forterre 2006; Leipe et al. 1999) and/or

  (ii) was not a typical membrane-bounded cell (Koonin and Martin 2005; Martin and Russell 2003) (Fig. 1).

Fig. 1 Cellular evolution: non-cellular vs protocellular scenarios. The left panel shows the protocellular scenario whereby evolution started from lipid vesicles that enclosed multiple RNA segments. Viruses, shown by hexagonal and bacilliform shapes, coevolved with protocells. The evolution of membranes toward increasing organization, multifunctionality and ion non-permeability is shown by changing line styles. The right panel shows the non-cellular scenario under which the primordial membranes evolved within virus-like entities, with protocells being a relatively late innovation.

In the next sections I explore the implications of these propositions and elaborate on the nature of LUCA and cellular origins.

The primordial pool of virus-like genetic elements: a workshop for the evolution of genome replication strategies

With respect to DNA replication, the conundrum to resolve is the conservation of some, but not other, key proteins of the DNA replication machinery between archaea (and eukaryotes) and bacteria. To account for this mixed pattern of conservation and divergence, it has been proposed that LUCA replicated its genome via a retrovirus-like replication cycle, in which the universal transcription machinery was involved in the transcription of provirus-like dsDNA molecules and the conserved components of the DNA replication system played accessory roles in this process (Leipe et al. 1999). This speculative scheme combined, within a single hypothetical replication cycle, the conserved proteins that are involved in transcription and replication with proteins, such as reverse transcriptase (RT), that, among the extant life forms, are represented, primarily or exclusively, in viruses and other selfish genetic elements. The proposal formally accounts for the universal conservation of these proteins but has no direct analogy in extant genetic systems.

Perhaps a more plausible portrayal of the LUCA would follow the general lines of Woese’s vision of a “collective” primordial genome (Woese 1998). Under this view, the LUCA was not a species in the modern sense but rather a pool of diverse genetic elements that rapidly recombined, fused and split. The origin of the first replicators within the framework of the RNA World, in which the functions of both information carriers and catalysts were embodied in RNA molecules (Bernhardt 2012; Neveu et al. 2013), implies a subsequent transition to the modern, DNA-based genetic systems, presumably via a reverse transcription stage. Most likely, different variants of the genetic cycles coexisted within the primordial gene pool (Fig. 1). This simple logic naturally leads to the concept of a primordial “virus-like world” given that, among the extant life forms, diversity of the genetic cycles exists only among viruses and related selfish genetic elements (Koonin and Dolja 2014; Koonin et al. 2006). Similar considerations apply to the size of the primordial genomic segments. In modern life forms, all large (over about 30 kilobases) replicons, both viral and cellular, consist of dsDNA (Koonin 2009). Conceivably, this strong preference stems from the combination of stability and regular structure of dsDNA molecules that provides for accurate replication of large genomes. The transitional (from the RNA world to the DNA world) primordial gene pool would necessarily encompass numerous small segments of RNA and DNA, with the latter gradually increasing in size thanks to accretion of shorter segments, which provided a selective advantage to viable gene combinations (Koonin and Martin 2005).

Invoking viruses in the discussion of the primordial gene pool might appear, at best, confusing, and at worst, ridiculous because modern viruses, by definition, are obligate intracellular parasites. To avoid this semantic dead end, it is indeed more appropriate to speak in this context of virus-like genetic elements and perhaps of virus-like particles. I believe, however, that this virus-like character of the primordial genetic elements is prominent and manifest at more than one level.

On the origin of cell membranes

As pointed out above, the second major area of non-homology between archaea and bacteria involves the lipid chemistry and the enzymes of lipid biosynthesis. The glycerol moieties of archaeal and bacterial phospholipids are of the opposite chiralities, and the hydrophobic chains differ as well, being based on fatty acids in bacteria and on isoprenoids in archaea. In addition, in bacterial lipids, the hydrophobic tails are linked to the glycerol moiety by ester bonds whereas archaeal lipids contain ether bonds (Boucher et al. 2004; Pereto et al. 2004; Thomas and Rana 2007). Accordingly, the suites of enzymes involved in the phospholipid biosynthesis in archaea and bacteria are either non-homologous or distantly related but not orthologous (Boucher et al. 2004; Koga and Morii 2007; Koonin and Martin 2005; Lombard et al. 2012). The dichotomy of the membranes and their biogenesis prompted the iconoclastic hypothesis that the LUCA was not a typical, membrane-bounded cell but rather a consortium of diverse, virus-like genetic elements that might have dwelled in networks of inorganic compartments (“bubbles”) that are abundant in the vicinity of hydrothermal vents on the sea floor (Koonin and Martin 2005; Martin and Russell 2003) (see discussion below). Such a membrane-less LUCA hypothesis is compatible not only with the dissimilarity of the bacterial and archaeal membranes and the apparent lack of homology between the membrane biogenesis machineries but also with the non-homology of the key proteins involved in DNA replication (Edgell and Doolittle 1997; Leipe et al. 1999; Mushegian and Koonin 1996), which implies absence of large DNA genomes in the LUCA genomic consortium. Thus, the LUCA is envisaged as a pool of genetic elements inhabiting a network of inorganic compartments, with ample opportunity for recombination and ligation to form larger genomes. This version of the LUCA resonates with Carl Woese’s concept of the “high genetic temperature” of the early stages of evolution, including the LUCA, whereby the high rate of genetic exchanges dramatically accelerated evolution (Woese 1998, 2002). Under the pre-cellular LUCA scenario, cellular organization evolved on (at least) two independent occasions yielding archaea and bacteria (Koonin and Martin 2005; Martin and Russell 2003). There might have been many more “attempts” at cellular escape from the primordial habitats but only two were successful in the long term.

The problem with the membrane-less LUCA is that, at least in its strong form, this model contradicts the data of comparative genomics. The nearly universal conservation, in modern cellular life forms, of elaborate, membrane-embedded macromolecular complexes, such as parts of the general protein secretory pathway, in particular the signal recognition particle (SRP) (Cao and Saier 2003) and the membrane ATP synthase (Gogarten et al. 1989; Mulkidjanian et al. 2007; Nelson 1989), implies that LUCA possessed some form of membranes although not necessarily that it was a genuine cellular organism (Jekely 2006; Koonin and Martin 2005).

The chemical nature of the putative primordial membranes remains unknown. An attractive possibility seems to be that the primordial membranes were simpler than the extant membranes of archaea or bacteria, and several candidates for the role of such primitive membranes have been proposed. Thus, it has been argued that fatty acids are the simplest amphiphilic molecules that could form abiogenically and could have been subsequently recruited by the first organisms (Deamer 1986; Mansy and Szostak 2008; Monnard et al. 2002; Monnard and Deamer 2002; Namani and Deamer 2008; Segre et al. 2001). Alternatively, the first membranes might have consisted of polyprenyl phosphates similar to the membrane components of modern archaea (Gotoh et al. 2007; Nakatani et al. 2012; Nomura et al. 2002; Ourisson and Nakatani 1994; Streiff et al. 2007). Membrane vesicles capable of enclosing polynucleotides and proteins have been produced in the laboratory with both fatty acids (Mansy and Szostak 2008) and with polyprenyl phosphates (Nomura et al. 2002). The original experiments with fatty acid vesicles were unable to maintain efficient RNA synthesis due to the membrane-disruptive effect of the relatively high concentrations of Mg2+ that are required for nucleotide polymerization. However, this hurdle has been recently cleared through an extremely simple chemical modification, namely inclusion of citrate in the reaction solution (Adamala and Szostak 2013). These notable results suggest that the potential of simple membrane vesicles for modelling pre-cellular evolution is far from being exhausted.

Nevertheless, the scenario of cell origin from simple lipid vesicles (Deamer 2005, 2008; Szostak et al. 2001), although intuitive, engenders major difficulties. A pure lipid bilayer is not a viable solution for the membrane of a primordial cellular life form because it would effectively prevent exchange of all ions and complex molecules between the inside of a vesicle and the environment. Because of the hydrophobic barrier, ions can penetrate the lipid bilayer only with the help of specialized membrane proteins, such as channels or translocases. The membrane-embedded portions of these proteins consist almost entirely of hydrophobic amino acids, and the proteins themselves are water-insoluble. Hence a chicken and egg problem: channels and translocases could not evolve without membranes, whereas membranes without channels and translocases could not support the first life forms. Given the dependence between the conductivity of lipid bilayers and the length of the acyl tails (Paula et al. 1996), it has been proposed that primordial membranes might have been built of lipids with short hydrophobic tails and so were thinner than modern ones, with a lower hydrophobic barrier that allowed ions to spontaneously cross the membrane (Deamer 1997, 2008). This mechanism, however, would not work for large molecules such as proteins or polynucleotides, the transport of which is thought to have been important during the early steps of cellular evolution, for proto-cell division and/or for the functioning of primordial virus-like particles (see also below).

A closer examination of the “genomic” (the lack of homology between the core components of the DNA replication systems in archaea and bacteria; see above) and the “membrane” (the radical difference between the structures of the phospholipids and the enzymes of lipid biosynthesis between archaea and bacteria) challenges to the LUCA suggests that the two are tightly linked. A complex LUCA without a large DNA genome similar to modern bacterial and archaeal genomes apparently could only have a genome consisting of several hundred segments of RNA (or provirus-like DNA), each several kilobases (i.e. several protein-coding genes) in size. This limitation is dictated by the dramatically lower stability of RNA molecules compared to DNA and is empirically supported by the fact that the largest known RNA genomes (those of coronaviruses) reach only ~30 kb (Gorbalenya et al. 2006); the largest retroelements, with RNA or DNA genomes, are considerably smaller. It has been proposed that LUCA was an RNA cell that subsequently gave rise to three major RNA cell lineages (the ancestors of bacteria, archaea and eukaryotes) in each of which the RNA genome was then independently replaced by DNA as a result of acquisition of the DNA replication machinery from distinct viruses (Forterre 2006). However, full-fledged RNA cells appear to be an unrealistic proposition. Indeed, the necessity to accurately segregate hundreds of genomic RNA segments at cell division would create an insurmountable problem as this would require extremely elaborate mechanisms of genome segregation not known to exist in modern organisms. Otherwise, the change in the gene complement brought about by each cell division would, effectively, preclude reproduction. The segregation mechanisms that operate in modern bacteria and, probably, archaea involve pumping dsDNA into daughter cells with the help of a specific ATPase and, probably, coevolved with large dsDNA genomes (Donachie 2002; Errington et al. 2003; Iyer et al. 2004; Weiss 2004). Thus, if the LUCA lacked a large dsDNA genome and instead had a “collective” genome composed of numerous RNA segments, it must have been a life form distinct from modern cells. In principle, the possibility can be envisaged that protocellular, membrane-bounded vesicles harbored multiple segments of RNA or DNA that would undergo imprecise, “statistical” segregation during division. However, the imprecision of inheritance in such protocells would not allow their diversification into distinct cellular lineages; the conversion to large dsDNA genomes would have to come first.
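The scale of the segregation problem can be gauged with a back-of-the-envelope sketch. The model below is an assumption introduced only for illustration and is not taken from the cited work: every distinct genomic segment is present in the same number of copies, and each copy is partitioned to a given daughter independently with probability 1/2. The probability that a daughter receives at least one copy of each of N segments present in c copies is then (1 - 2^-c)^N. The function names and parameter values are hypothetical.

```python
import random

def prob_complete_daughter(n_segments, copies):
    """Analytic probability that one daughter receives at least one copy of
    every segment under independent, unbiased partitioning of each copy."""
    return (1.0 - 0.5 ** copies) ** n_segments

def simulate_complete_daughter(n_segments, copies, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(trials):
        # The daughter is complete if, for every segment, at least one copy lands in it.
        if all(any(rng.random() < 0.5 for _ in range(copies))
               for _ in range(n_segments)):
            complete += 1
    return complete / trials

if __name__ == "__main__":
    print(prob_complete_daughter(300, 5))     # several hundred segments, 5 copies each
    print(prob_complete_daughter(10, 3))      # small toy genome, analytic
    print(simulate_complete_daughter(10, 3))  # small toy genome, simulated
```

Under these assumptions, with 300 segments in 5 copies each, fewer than one daughter in ten thousand would inherit the complete gene complement, which illustrates why purely statistical segregation of a genome of several hundred segments appears incompatible with stable inheritance unless copy numbers are very high.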

Another aspect of early life forms, including the LUCA, that is considered central to the advent of advanced life forms, is the rampant HGT (Woese 2002). Indeed, HGT is the principal route of rapid innovation in microbes (Treangen and Rocha 2011) and innovation is bound to have been rapid at the earliest stages of evolution. Moreover, a hypothesis has been proposed and buttressed by mathematical modeling that the universality of the genetic code might be linked to the critical role of HGT in the evolution of the primordial life forms. Under conditions of extensive HGT, a single version of the code would necessarily sweep the populations of ancestral life forms, and conversely, any organisms with deviant codes would be unable to benefit from HGT and, being isolated from other organisms, would be eliminated by selection (Goldenfeld and Woese 2007; Vetsigian et al. 2006). Analogies with the history of human civilization are obvious and, perhaps, illuminating: the existence of a lingua franca greatly accelerates progress, and conversely, isolated communities are stalled in their development and doomed to eventual extinction. Constant, extensive HGT is an intrinsic, natural feature of models of pre-cellular evolution but certainly cannot be taken for granted for primitive cells. Below I discuss scenarios of pre-cellular evolution that take this conundrum into account.

Coevolution of membranes and membrane proteins

A common solution to chicken and egg paradoxes is coevolution of the egg and the chicken. With regard to the evolution of cell membranes, this idea implies coevolution of the membranes and membrane proteins. Multiple lines of evidence indicate that the universal membrane ATPases (ATP synthases) evolved from simple membrane pores that recruited helicases and other proteins (Mulkidjanian et al. 2007). This scenario, in agreement with some previous ideas (Williams 1978), holds that the first membranes were porous and enabled passive exchange of ions, small molecules and even polymers between proto-cells and their environment; the membrane impermeability to ions increased gradually with time. The first membrane proteins most likely were amphiphilic, with a tendency to assemble into membrane pores (Mulkidjanian et al. 2009).

Integral membrane proteins contain long runs of hydrophobic amino acid residues that function as transmembrane segments; by contrast, in water-soluble, globular proteins, the distribution of polar and non-polar amino acids in the polypeptide chain is quasi-random (Finkelstein and Ptitsyn 2002). Assuming that the quasi-random distribution is ancestral, the advent of membrane proteins can be envisaged as a gradual, multi-step evolution of long hydrophobic stretches capable of forming membrane-spanning alpha-helices. A specific model of evolution within this framework has been developed and tested via molecular dynamics simulations which show that a stand-alone amphiphilic helix can spontaneously insert into a lipid bilayer provided that the helices dimerize on the membrane surface and then make pores in the membrane (Pohorille and Deamer 2009; Pohorille et al. 2005, 2003). The formation of a water-filled pore would stabilize the polar residues of the helices, whereas the non-polar residues would interact with the lipid phase. A problem with this hypothesis is that a single helix is not a stable fold, so the starting point of this scenario remains unclear.
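
The contrast between long hydrophobic runs and a mixed distribution of polar and non-polar residues can be made concrete with a sliding-window hydropathy scan. The sketch below uses the Kyte-Doolittle hydropathy scale, which is not part of the analyses cited above and is brought in here only for illustration. The window of 19 residues and the threshold of 1.6 follow a commonly quoted rule of thumb for membrane-spanning segments, and the function names and test sequences are hypothetical.

```python
# Illustrative sketch: sliding-window hydropathy scan (Kyte-Doolittle scale).
# The window length and threshold are rule-of-thumb assumptions, not values
# taken from the studies discussed in the text.

KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    if len(scores) < window:
        return []
    total = sum(scores[:window])
    means = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]  # slide the window by one residue
        means.append(total / window)
    return means


def has_candidate_tm_segment(seq, window=19, threshold=1.6):
    """True if any window is hydrophobic enough to suggest a membrane-spanning helix."""
    return any(mean > threshold for mean in window_hydropathy(seq, window))


if __name__ == "__main__":
    # Invented sequences: a polar-flanked hydrophobic run and an alternating peptide.
    print(has_candidate_tm_segment("KKDE" + "L" * 20 + "EDKK"))
    print(has_candidate_tm_segment("LK" * 15))
```

A peptide with the same overall composition can thus score very differently depending on how its non-polar residues are arranged, which is the property that would have to change gradually under the scenario outlined above.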

A refined version of this scheme has been proposed (Mulkidjanian et al. 2009) starting from the idea that membrane proteins are “inside-out” versions of globular proteins, as has been suggested for the specific case of bacteriorhodopsin (Engelman and Zaccai 1980). Globular proteins have a quasi-random distribution of polar and non-polar residues such that the non-polar residues are packed inside the globule whereas the polar residues are exposed on the surface. The first membrane proteins might have evolved by turning globular proteins “inside-out” as a result of their interaction with membranes, with their hydrophobic side chains turning towards the lipid bilayer. One of the simplest protein folds is the widespread alpha-helical hairpin (long alpha-hairpin according to the Structural Classification Of Proteins (SCOP); Andreeva et al. 2008) that is stabilized by hydrophobic interaction of the two helices. This stabilisation, however, is relatively weak, so upon interaction with the membrane, the helices would spread on its surface, and then reassemble within the membrane such that the non-polar side chains would interact with the hydrophobic lipid phase. The hairpins, then, could aggregate to form water-filled pores with inner polar surfaces. This putative primordial arrangement of alpha-helices is partially retained by the c-ring of the F-ATPase that is plugged by lipid only from the outer, periplasmic side of the membrane but is apparently filled with water and segment(s) of the gamma-subunit of the ATPase from the cytoplasmic side (Pogoryelov et al. 2008). Integral membrane proteins would evolve gradually, via multiple amino acid replacements, yielding long, hydrophobic helices capable of forming tight bundles. Concomitantly, some membrane proteins would join to form the first translocons (White and von Heijne 2005), enabling controlled insertion of these bundles into the membrane.
This scenario for the origin of membrane proteins is compatible with the results of the global analysis of the known membrane transporters which led to the conclusion that the evolution of membrane proteins went from non-specific oligomeric channels, built of peptides with one or two transmembrane segments each, towards larger, specific membrane translocators that emerged by duplication of the primordial membrane proteins (Saier 2003).

The alternative, rather popular hypothesis that derives the first membrane proteins from stand-alone hydrophobic α-helices that, via duplication, would yield increasingly complex membrane proteins (see e.g. Fiedler et al. 2010; Popot and Engelman 2000) does not appear plausible. A stand-alone hydrophobic helix is water-insoluble and can hardly be released from the ribosome after translation. Modern membrane proteins are co-translationally inserted into the membrane by translocons, which themselves are membrane-embedded protein complexes (White 2003; White and von Heijne 2008) that could not have evolved before membrane proteins.

The putative ancient mechanism of spontaneous protein insertion into membranes via the “inside-out” transition does not seem to have been abandoned even in modern cells: it appears to be still employed by diverse pore-forming proteins such as colicins (Duche et al. 1994; Padmavathi and Steinhoff 2008), diphtheria toxin (Lai et al. 2008), and the proapoptotic protein BAX (Annis et al. 2005). These proteins are monomeric in the water phase but oligomerize in the membrane during pore formation (Anderluh and Lakey 2008; Iacovache et al. 2008).

The next key step in the evolution of membrane transport would have been the advent of active, ATP-driven biopolymer translocases from a combination of a helicase and a membrane pore as detailed previously (Mulkidjanian et al. 2007). The difficulty with the translocase scenario is that the only known RNA translocases in modern cells seem to be the ATP-driven translocases that mediate tRNA import into mitochondria (Salinas et al. 2008), a process that is limited to eukaryotes and hence is bound to be a secondary innovation. Strikingly, however, RNA translocation across a membrane, driven by a distinct motor ATPase that is encoded in a bacteriophage genome and directly interacts with the bacteriophage RNA polymerase, is a regular part of the life cycle of double-stranded RNA-containing bacteriophages (Lisal and Tuma 2005; Mancini and Tuma 2012). It appears plausible that the first biological membranes evolved within lipid-containing, virus-like particles that contained RNA translocases, rather than in modern-type proto-cells (Koonin et al. 2006). This possibility could reinforce a non-cellular rather than a protocellular model of the LUCA (Fig. 1) under which such particles would have been agents of HGT that, as discussed above, appears essential for pre-cellular evolution.

The habitats of pre-cellular life forms and the origin of cells

Compartmentalization is obviously essential for the evolution of ensembles of replicators, even if only as a means for concentrating the required monomers and thus enabling their polymerization (Meyer et al. 2012). It is well recognized that molecular crowding, i.e. high intracellular concentrations of macromolecular complexes, such as ribosomes, as well as of proteins, RNAs and small molecules, is essential for cell functioning. Molecular crowding eliminates rate limitations imposed by diffusion and ensures that chemical reactions within the cell occur at sufficiently high rates (Ellis 2001; Phillip and Schreiber 2013; Zhou et al. 2008). More subtly but no less importantly, compartmentalization is necessary to avert the collapse of replicator systems under the pressure of the inevitably evolving parasites (Takeuchi and Hogeweg 2012). Membranous vesicles appear to be the most likely form of prebiotic compartmentalization but, as discussed above, simple lipid vesicles present substantial difficulty for the exchange of their content with the environment, whereas evolution of modern-type membrane transport is a complex process. Thus, instead of or, more plausibly, concomitant with membrane compartmentalization, different forms of abiogenic compartmentalization have been considered for the role of cradles of the evolving protocells.

The most popular scenarios involve networks of inorganic compartments that exist in the vicinity of hydrothermal vents on the ocean floor (Martin et al. 2008; Martin and Russell 2003; Russell and Hall 1997; Russell et al. 1988). More specifically, hydrothermal vents are surrounded by expansive honeycomb-like structures, composed primarily of iron sulfide, that have been proposed as hospitable compartments for prebiological and the earliest stages of biological evolution (Martin and Russell 2007). Indeed, such compartments can concentrate small molecules and polymers, and would provide both the chemical and thermal gradients that are essential as the energy supply for the evolving life forms and the catalysts that would accelerate the first biochemical reactions. The networks of inorganic compartments would provide an excellent medium for extensive HGT that, as discussed above, would be essential at the early stages of evolution (Koonin and Martin 2005). Importantly, these inorganic compartments could have been the sites where the first biological membranes, and eventually the first cells, evolved.

The key role of inorganic compartments in pre-cellular evolution appears almost certain as a general principle but the specific nature of these compartments remains an open question. The search for the primordial evolution sites can be guided by the “chemistry conservation principle” according to which the chemical and in particular ionic composition of an organism is highly conserved and reflects the environment in which it evolved (Mulkidjanian et al. 2012a, b; Mulkidjanian and Galperin 2007). All modern cells contain much more potassium, phosphate, and transition metals and, conversely, much less sodium than modern or reconstructed primeval oceans (Anbar 2008; Pinti 2009; Williams and Frausto da Silva 2006). Cells maintain steep ion gradients thanks to sophisticated, energy-dependent membrane enzymes (membrane pumps) that are embedded in elaborate, ion-tight membranes. As discussed above, the first cells could have possessed neither ion-tight membranes nor membrane pumps, so the concentrations of small inorganic molecules and ions within protocells and in their environment would equilibrate. Hence, the ion composition of modern cells is likely to reflect the inorganic ion composition of the habitats of protocells. Concordant with this notion, the activity of proteins and functional systems that are ubiquitous in modern cellular life forms, and so by inference ancestral, including the translation system, typically requires K+, Zn2+ or Mn2+, and phosphate. Thus, the first cells most likely evolved in environments with a high K+/Na+ ratio and relatively high concentrations of Zn, Mn, and phosphate. Such ionic composition conducive to the origin of cells apparently could not have existed in marine environments but resembles the chemistry of vapor-dominated zones of inland geothermal systems.
A simple geochemical reconstruction suggests that under the anoxic, CO2-dominated primordial atmosphere, the chemistry of basins at geothermal fields would resemble the internal milieu of modern cells (Mulkidjanian et al. 2012a, b). Thus, the most plausible habitats for the pre-cellular stages of evolution might have been shallow ponds of condensed and cooled geothermal vapor that were lined with porous silicate minerals mixed with metal (primarily Zn) sulfides and enriched in K+, Zn2+, and phosphorus compounds. These terrestrial geothermal fields could have hosted all stages of the early evolution of life, from the primordial RNA world to the emergence of the first cells, possibly both major cellular types, archaea and bacteria. Under this scenario, the primitive cells moved into oceans only after evolving modern-type, ion-tight membranes.
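
The equilibration step in this argument can be written out as a minimal sketch, assuming purely passive permeation across a leaky boundary, a well-mixed protocell interior, an environment large enough that its concentration stays constant, and no membrane potential. The symbols are introduced here for illustration only: C_in and C_out are the internal and external concentrations of an ion, P is the permeability of the boundary, A and V are the surface area and volume of the protocell, and r is its radius if it is treated as a sphere.

```latex
\[
\frac{dC_{\mathrm{in}}}{dt} = \frac{P A}{V}\,\bigl(C_{\mathrm{out}} - C_{\mathrm{in}}\bigr),
\qquad
C_{\mathrm{in}}(t) = C_{\mathrm{out}} + \bigl(C_{\mathrm{in}}(0) - C_{\mathrm{out}}\bigr)\,e^{-t/\tau},
\qquad
\tau = \frac{V}{P A} = \frac{r}{3P}.
\]
```

Under these assumptions, any initial difference between the interior and the environment decays exponentially, and without pumps there is no term that could hold C_in away from C_out. The smaller the compartment and the leakier its boundary, the faster the interior comes to mirror the habitat.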

Conclusions and outlook

The emergence of the cellular organization is the central problem in the study of the evolution of life (Koonin 2014; Woese 2002), so much so that all the subsequent evolution, even such major transitions as the emergence of eukaryotes, can be viewed as “mere history.” So is there a chance that we will ever “know”, with a high degree of confidence, how cells actually evolved? Probably not. The fundamental difficulty is that we are unaware of any evolutionary intermediate on the path from relatively simple molecular complexes to cells. The simplest modern cells, namely intracellular symbionts and parasites, or the even simpler organelles do not “really count” because they clearly are products of the reverse process of degradation, i.e. derived cells, rather than intermediates on the path from non-cellular to cellular life forms. Perhaps somewhat paradoxically, more insight into the emergence of cells seems to come from the study of viruses and related mobile elements that constitute the second major empire of life. The major classes of selfish elements are likely to originate directly from a primordial, pre-cellular pool in which diverse replication strategies were “tried out” during the transition from the primordial RNA world to the modern-type cellular life forms endowed with large DNA genomes.

The existence of a primordial pool of diverse genetic elements is compatible with one of the major distinctions between the two fundamental cell types, archaeal and bacterial, namely the non-homology of the key proteins involved in DNA replication. Under this scenario, a variety of replication strategies and the respective molecular systems evolved in the primordial pool, and several of these survive to this day in selfish elements, but only two have been adopted by evolutionarily successful cellular life forms that escaped their primordial habitats. Essentially the same pertains to the other fundamental gulf between archaea and bacteria, the non-homology of the membranes and the enzymatic apparatus behind them. In this case, however, the existence of universal membrane components implies the possibility of two distinct scenarios, one of which depicts the common ancestor of modern cells as a cell-like entity with a primitive membrane whereas the other includes evolution of membranes in virus-like moieties (Fig. 1).

Regardless of the specific scenario, the key concept in membrane evolution is the coevolution of membranes and membrane proteins. The “inside-out” route of evolution of membrane proteins is physically plausible and provides for a credible scenario of evolution of the first membrane transport systems. This scenario includes primitive, “porous” membranes as an intermediate step between pre-cellular life or protocellular life forms and modern cells that are bounded by ion-tight membranes. Such porous membranes would contain various protein and polynucleotide translocases, favouring HGT and gene mixing.

Finally, a combination of theoretical modelling of the evolution of replicators, geochemical analysis and comparative genomics offers a plausible picture of the “cradles” for cell evolution. The role of networks of inorganic compartments containing potential catalysts and providing energy gradients appears essential. Contrary to the common belief, the most plausible habitats for the earliest life forms might be terrestrial rather than marine hydrothermal areas.

We may never truly know how cells evolved. Nevertheless, synthesis of theoretical, computational and experimental approaches from diverse disciplines seems to be instrumental in progressively narrowing down the space of open possibilities, which conceivably is the best one can hope for when it comes to events that took place about 4 billion years ago, and perhaps only once in the observable universe.