Chapter Summary

General overviews of eukaryote genomes are first discussed, including organelle genomes, introns, and junk DNAs. We then discuss the evolutionary features of eukaryote genomes, such as genome duplication, C-value paradox, and the relationship between genome size and mutation rates. Genomes of multicellular organisms, plants, fungi, and animals are then briefly discussed.

1 Major Differences Between Prokaryote and Eukaryote Genomes

A eukaryotic cell has a nucleus surrounded by the nuclear membrane. Eukaryotic cells also contain other membrane systems, such as the endoplasmic reticulum, Golgi apparatus, and vacuoles. Prokaryotes have neither these membranes nor organelles. Therefore, the existence of membrane systems and of organelles, particularly mitochondria, are the two major characteristics of eukaryotes. It should be noted that some parasitic eukaryotes have lost mitochondria. Genome sizes of eukaryotes are much larger than those of prokaryotes, and accordingly eukaryotes also have more genes. It is not clear whether the formation of the nucleus triggered the increase in genome size.

There are various differences in genome structure between prokaryotes and eukaryotes, and they are listed in Table 8.1. Most bacterial genomes are circular, while all eukaryotic genomes so far known are linear (organelle genomes are not considered here). The main reason for the large genome size of eukaryotes is the existence of many repeat sequences, which are a minor component of prokaryotic genomes. Pseudogenes and introns are also few in prokaryotic genomes, while both are abundant in eukaryotic genomes. Frequent gene duplications in eukaryotes have produced many pseudogenes. Horizontal gene transfers are known to be quite frequent in prokaryotes but are rare in eukaryotes. Finally, genome duplications sometimes occur in eukaryotes, especially in plants and in vertebrates, but genome duplication is so far not known for prokaryotic genomes.

Table 8.1 Comparison of prokaryotic and eukaryotic genomes

Because the gene number of a typical eukaryotic genome is much larger than that of prokaryotes, many genes are shared among most eukaryote genomes but absent from prokaryote genomes. Some examples are listed in Table 8.2. For example, myosin is found in animal muscle tissues, and its homologous proteins exist in the cytoskeleton of all eukaryotes but are not found in prokaryotes.

Table 8.2 Examples of genes shared among most of eukaryote genomes but nonexisting in prokaryote genomes

Recently, Kryukov et al. (2012; [1]) constructed a new database of oligonucleotide sequence frequencies and conducted a series of statistical analyses. Frequencies of all possible oligonucleotides of length 1–10 were counted for each genome, and these observed values were compared with expected values computed from the observed frequencies of oligonucleotides of length 1–4. Deviations from the expected values were much larger for eukaryotes than for prokaryotes, except for fungal genomes. Figure 8.1 shows the distribution of the deviation for various organismal groups. The biological reason for this difference is not known.
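The comparison of observed and expected oligonucleotide counts can be illustrated with a minimal sketch. It assumes the simplest possible expectation, in which nucleotides occur independently at their observed single-base frequencies; the study cited above used frequencies of oligonucleotides up to length 4, so this is a simplification, and the function names are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping oligonucleotides of length k in seq."""
    seq = seq.lower()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def expected_count_order0(seq, kmer):
    """Expected count of kmer if bases were independent (assumed order-0 model)."""
    seq = seq.lower()
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "acgt"}
    prob = 1.0
    for b in kmer.lower():
        prob *= base_freq[b]
    # Number of positions at which a k-mer can start
    return prob * (n - len(kmer) + 1)

def observed_expected_ratio(seq, kmer):
    """Ratio > 1 means the k-mer is over-represented under this simple model."""
    observed = kmer_counts(seq, len(kmer))[kmer.lower()]
    expected = expected_count_order0(seq, kmer)
    return observed / expected if expected > 0 else float("nan")
```

A ratio far from 1 for many oligonucleotides corresponds to a large deviation from expectation in the sense used above.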

Fig. 8.1 Comparison of genome complexity among eukaryote genomes (From Kryukov et al. 2012; [1])

2 Organelle Genomes

There are two major types of organelles in eukaryotes: mitochondria and plastids. Figure 8.2 shows schematic views of a mitochondrion and a chloroplast. These two organelles have their own independent genomes. This suggests that they were initially independent organisms which started intracellular symbiosis with primordial eukaryotic cells. Because most eukaryotes have mitochondria, the ancestral eukaryote, a lineage that emerged from Archaea, most probably started intracellular symbiosis with the mitochondrial ancestor. The parasitic bacterium Rickettsia prowazekii is so far phylogenetically closest to mitochondria [2], and a rickettsia-like bacterium is the best candidate for the mitochondrial ancestor. However, there is an alternative “hydrogen hypothesis” [3]. Plastids include chloroplasts, leucoplasts, and chromoplasts and exist in land plants, green algae, red algae, glaucophyte algae, and some protists such as euglenoids.

Fig. 8.2 Schematic views of mitochondrion and chloroplast

Mitochondrial genome sizes of some representative eukaryotes are listed in Table 8.3. Most animal mitochondrial genomes are smaller than 20 kb, and those of protists and fungi are somewhat larger. Mitochondrial genomes of plants are much larger than those of other eukaryotic lineages, yet most are smaller than 500 kb.

Table 8.3 Size of mitochondrial genomes

2.1 Mitochondria

An ancestral eukaryotic cell, probably of an archaeal lineage, hosted a bacterial cell, and intracellular symbiosis started. Initially, the archaeal host and the bacterial symbiont both carried genes responsible for basic metabolism. This situation resembles gene duplication for many genes, although the homologous genes were not identical, having diverged a long time before. In any case, division of labor followed, and only limited metabolic pathways were left in the bacterial system, which eventually became the mitochondrion.

Animal mitochondrial genomes contain a very small number of genes: 13 for peptide subunits, 22 for tRNA, and 2 for rRNA [4]. Figure 8.3 shows the gene orders of the mitochondrial DNA genomes of five representative animal species. Although most vertebrate mitochondrial DNA genomes have the same gene order as in human (Fig. 8.3a), gene order may vary from phylum to phylum. Yet the gene content and the genome size are more or less constant among animals. It is not clear why animal mitochondrial genomes are so small. One possibility is that animal individuals are highly integrated compared to fungi and plants, and this might have driven a drastic reduction of the mitochondrial genome size. Another interesting feature of animal mitochondrial DNA genomes is the heterogeneous rate of gene order change. For example, platyhelminthes exhibit great variability in mitochondrial gene order (Sakai and Sakaizumi 2012; [5]).

Fig. 8.3 Gene orders of five animal mitochondrial DNA genomes (From Saitou 2007; [103])

In contrast, plant mitochondrial genomes are much larger (see Table 8.3). Figure 8.4 shows the structure of the tobacco mitochondrial genome (from Sugiyama et al. 2005; [6]). Horizontal gene transfers are also known to occur in plant mitochondrial DNA, even between remotely related species [7].

Fig. 8.4 Genome structure of tobacco mitochondria (From Sugiyama et al. 2005; [6])

The melon (Cucumis melo) mitochondrial genome, ca. 2.9 Mb, is exceptionally large, and its draft sequence was recently determined [8]. Interestingly, the melon mitochondrial genome resembles a vertebrate nuclear genome in its content, although its size is similar to that of a bacterial genome. Protein coding regions account for only 1.7 % of the genome, and about half of the genome is composed of repeats. The remaining part is mostly homologous to melon nuclear DNA, and 1.4 % is homologous to melon chloroplast DNA. Most of the protein coding genes of melon mitochondrial DNA are highly similar to those of the related cucurbits watermelon and squash, whose mitochondrial genome sizes are 119 kb and 125 kb, respectively. This indicates that the huge expansion of the melon mitochondrial genome occurred only recently. Interestingly, cucumber (Cucumis sativus), a congeneric species, also has a ~1.8-Mb mitochondrial genome with many repeat sequences [9]. It will be interesting to study whether the expansions of the melon and cucumber mitochondrial genomes were independent.

2.2 Chloroplasts

Chloroplasts exist only in plants, algae, and some protists. A chloroplast may change into a leucoplast or a chromoplast; because of this, the generic name “plastid” may also be used. The chloroplast seems to have originated from a cyanobacterium that started intracellular symbiosis, as in the case of mitochondria.

A unique but common feature of chloroplast genomes is the existence of inverted repeats [10], which mainly contain rRNA genes. Chloroplast DNA content may change during plant growth, and mature leaves are devoid of DNA in their chloroplasts [11].

Chloroplast genomes had been determined for more than 340 species as of December 2013 [106]. Their sizes range from 59 kb (Rhizanthella gardneri) to 521 kb (Floydiella terrestris). Although the largest chloroplast genome is still much smaller than a typical bacterial genome, its average intergenic length is 4 kb, much longer than that of bacterial genomes.

2.3 Interaction Between Nuclear and Organelle Genomes

Fragments of mitochondrial DNA are sometimes inserted into nuclear genomes, and they are called “numts.” An extensive analysis of the human genome found over 600 numts [12]. Their origins are randomly distributed over the mitochondrial genome. This suggests that mitochondrial DNA itself was inserted, not cDNA reverse-transcribed from mitochondrial mRNA. A possible source is sperm mitochondrial DNA that was fragmented after fertilization [12]. Transfer in the reverse direction, from nucleus to mitochondria, was observed in melon, as discussed in subsection 8.2.1.

3 Intron

An intron is a DNA region of a gene that is eliminated by splicing after transcription of a long precursor mRNA molecule. Introns were discovered by Phillip A. Sharp and Richard J. Roberts in 1977 as “intervening sequences” [13], but the name “intron,” coined by Walter Gilbert in 1978 [14], is now widely used. It should be noted that some descriptions of introns by Kenmochi [15] were used in writing this section.

3.1 Classification of Intron

There are various types of introns, but they can be classified into two classes: those requiring spliceosomes (spliceosome type) and the self-splicing type. Figure 8.5 shows the splicing mechanisms of these two major types. Most introns in nuclear genomes of eukaryotes are of the spliceosome type, and there are the common GU–AG type and the rare AU–AC type, named after the nucleotide sequences at the intron–exon boundaries [16]. The spliceosomes acting on these two types differ [17].
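The boundary-based naming of spliceosome-type introns can be expressed as a short sketch. It assumes that an intron is classified purely by its first two and last two nucleotides, as described above; the function name and the "other" category are hypothetical additions for illustration.

```python
def classify_spliceosomal_intron(intron):
    """Classify an intron by its terminal dinucleotides (DNA or RNA input)."""
    s = intron.upper().replace("T", "U")  # treat DNA input as its RNA transcript
    if len(s) < 4:
        raise ValueError("intron sequence is too short to have both boundaries")
    ends = (s[:2], s[-2:])
    if ends == ("GU", "AG"):
        return "GU-AG"   # common type
    if ends == ("AU", "AC"):
        return "AU-AC"   # rare type
    return "other"       # assumed catch-all, not a category from the text

print(classify_spliceosomal_intron("GTAAGTTTTCAG"))
```

Boundary dinucleotides alone are a coarse criterion, so a real annotation pipeline would use additional sequence signals.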

Fig. 8.5 Two major types of introns. (a) Spliceosome type. (b) Self-splicing type

Self-splicing introns are divided into three groups: groups I, II, and III. Group I introns exist in organellar and nuclear rRNA genes of eukaryotes and in prokaryotic tRNA genes. Group II introns are found in organellar and some eubacterial genomes. Cavalier-Smith [18] suggested that spliceosome-type introns originated from group II introns because of the similarity of their splicing mechanisms and the structural similarity between group II introns and spliceosomal RNA. Group III introns exist in organellar genomes, and their splicing system is similar to that of group II introns, though they are smaller and have a unique secondary structure.

Fig. 8.6 An overall pattern of Alu element evolution (From Saitou 2007; [103])

There is yet another type of intron, which exists only in tRNA genes of single-cell eukaryotes and Archaea [19]. These introns are not self-splicing; instead, an endonuclease and an RNA ligase are involved in their splicing. Introns of this type are often located at a certain position in the tRNA anticodon loop.

3.2 Introns Early/Late Controversy

After the discovery of introns, their probable functions and evolutionary origin have long been argued (e.g., [20, 21]). Because self-splicing introns could have arisen at any time, even at the very early stage of the origin of life, we consider only spliceosome-type introns. For brevity, we hereafter call this type simply “intron.” There are two major hypotheses: introns early and introns late. The former claims that exons existed as functional units in the common ancestor of prokaryotes and eukaryotes, and “exon shuffling” was proposed as a way of creating new protein functions [14]. Introns, which separate exons, should then also be of quite ancient origin [14, 22]. In contrast, according to the introns-late hypothesis, introns emerged only in the eukaryotic lineage [23, 24].

The protein “module” hypothesis proposed by Go [25] is related to the introns-early hypothesis. Patterns of intron gain and loss have been estimated by various methods (e.g., [21, 26]). Kenmochi and his colleagues analyzed in detail the introns of ribosomal protein genes in mitochondrial genomes and eukaryotic nuclear genomes [27–29]. These studies supported the introns-late hypothesis, because introns in mitochondrial and cytosolic ribosomal protein genes seem to have independent origins, and introns seem to have emerged in many ribosomal protein genes after eukaryotes appeared.

3.3 Functional Regions in Introns

Introns do not code for amino acid sequences by definition. In this sense, most introns may be classified as junk DNA (see the next section). There are, however, evolutionarily conserved regions in introns, suggesting that some intronic regions have functional roles.

4 Junk DNAs

Ohno (1972; [30]) proclaimed that most of the mammalian genome is nonfunctional and coined the term “junk DNA.” With the advent of eukaryotic genome sequence data, it is now clear that he was right: eukaryotic genomes do contain a great deal of junk DNA. Junk DNAs, or nonfunctional DNAs, can be divided into repeat sequences and unique sequences. Repeat sequences are either of the dispersed type or of the tandem type. Unique sequences include pseudogenes that retain homology with functional genes.

4.1 Dispersed Repeats

Prokaryote genomes sometimes contain insertion sequences; however, this kind of dispersed repeats constitutes the major portion of many eukaryotic genomes. These interspersed elements are divided into two major categories according to their lengths: short ones (SINEs) and long ones (LINEs).

One well-known example of a SINE is the Alu element of primate genomes. It is about 300 bp long and originated from the 7SL RNA gene. Let us look at a real Alu element sequence from the human genome. If we retrieve the DDBJ/EMBL/GenBank International Sequence Database entry with accession number AP001720 (a part of chromosome 21), there are 128 Alu elements in the 340-kb sequence. The density is 0.38 Alu elements per 1 kb. If we extrapolate to the whole human genome of ~3 billion bp, Alu repeats are expected to exist in ~1.13 million copies. One example of an Alu sequence from this entry, at coordinates 133600 to 133906, is shown below:

ggcgggagcg atggctcacg cctgtaatgc cagcactttg ggaggccgag

gtgggtggat cacaaggtca ggagatagag accatcctgg ctaacacggt

gaaacactgt ctctactaaa aacacaaaaa actagccagg cgtggtggcg

ggtgcctgta atcccagcta ctcgggaggc tgaggcagga gaatggtgtg

aacccaggaa gtggagcttg cagtgagctc agattgcgcc actgcactcc

agcctgggtg acagagtgag actccatctc aaaaaaaata aaataaataa

aaaaaa

If we do a BLAST homology search (see Chap. 14) with this sequence using the DDBJ system (http://blast.ddbj.nig.ac.jp/blast/blastn) against nonhuman primate sequences (the PRI division of the DDBJ database), the best hit is obtained from chimpanzee chromosome 22, which is orthologous to human chromosome 21. Interested readers are encouraged to try this homology search as practice.
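The Alu copy-number estimate given earlier in this subsection follows from simple proportional scaling. The sketch below reproduces that arithmetic; the function name is hypothetical, and the calculation assumes that the Alu density of the 340-kb region is representative of the whole genome.

```python
def extrapolate_copies(n_elements, region_bp, genome_bp):
    """Scale a count observed in one region to the whole genome.

    Assumes the element density in the region is representative.
    """
    density_per_kb = n_elements / (region_bp / 1000)
    genome_copies = n_elements / region_bp * genome_bp
    return density_per_kb, genome_copies

# Values taken from the text: 128 Alu elements in 340 kb, genome of ~3 billion bp
density, copies = extrapolate_copies(128, 340_000, 3_000_000_000)
print(f"{density:.2f} Alu elements per kb, about {copies / 1e6:.2f} million copies")
```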

Alu elements were first classified into J and S subfamilies [31]. The reason for choosing these two letters is not clear, but the two authors (Jurka and Smith) probably used the initials of their surnames. In any case, this division was based on the distance from the Alu consensus sequence; Alu elements closer to the consensus were classified as S and the others as J. Later, a subset of the S subfamily was found to be highly similar to each other, and these elements were named Y after “young,” for they appeared relatively recently. Rough estimates of the divergence times of Alu elements are as follows: the J subfamily appeared about 60 million years ago, the S subfamily separated from J about 44 million years ago, and Y separated further about 32 million years ago [32]. Figure 8.6 shows the overall pattern of Alu element evolution (based on [32]).

Fig. 8.7 A schematic view of synonymous distance distribution of duplogs with and without genome duplication (From Saitou 2007; [103])

4.2 Tandem Repeats

Tandemly repeated sequences are also abundant in eukaryotic genomes, and the representative ones are found in heterochromatin regions. Heterochromatin is a highly condensed, nonfunctional region of nuclear DNA, in contrast to euchromatin, in which many genes are actively transcribed. Heterochromatin usually resides at telomeres, the terminal parts of chromosomes, and at centromeres, the internal parts of chromosomes to which spindle fibers attach during cell division. More than 1 Mb of the centromeric regions of Arabidopsis thaliana was found to consist of tandem repeats of a ca. 180-bp unit [33, 34]. The nucleotide sequence below is the Arabidopsis thaliana tandemly repeated sequence AR12 (International Sequence Database accession number X06467):

aagcttcttc ttgcttctca atgctttgtt ggtttagccg aagtccatat

gagtctttgt ctttgtatct tctaacaagg aaacactact taggctttta

ggataagatt gcggtttaag ttcttatact taatcataca catgccatca

agtcatattc gtactccaaa acaataacc

The human genome also has a similar but nonhomologous sequence in centromeres, called “alphoid DNA” with the 171-bp repeat unit [35]. The following is the sequence (International Sequence Database accession number M21746):

catcctcaga aacttctttg tgatgtgtgc attcaagtca cagagttgaa

cattcccttt cgtacagcag tttttaaaca ctctttctgt agtatctgga

agtgaacatt aggacagctt tcaggtctat ggtgagaaag gaaatatctt

caaataaaaa ctagacagaa g

If we do a BLAST homology search (see Chap. 13) against the human genome sequences of the NCBI database, there is no hit with this alphoid sequence. This clearly shows that the human genome sequences currently available are far from complete, for they do not include most of these tandem repeat sequences.

Telomeres of the human genome are composed of hundreds of copies of the 6-bp repeat ttaggg. If we search the human genome with NCBI BLAST using a 36-bp query consisting of six tandem copies of this 6-bp unit, many hits are obtained.
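The 36-bp telomere query described above is easy to construct, and the same repeat unit can be used to scan a local sequence for tandem runs. The following is a minimal sketch using exact string matching rather than BLAST; the function names are hypothetical.

```python
import re

TELOMERE_UNIT = "ttaggg"  # human telomere repeat unit given in the text

def make_query(unit=TELOMERE_UNIT, copies=6):
    """Build a query of tandem copies of the unit (6 x 6 bp = 36 bp by default)."""
    return unit * copies

def longest_tandem_run(seq, unit=TELOMERE_UNIT):
    """Largest number of consecutive exact copies of unit found in seq."""
    pattern = re.compile("(?:%s)+" % re.escape(unit))
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.lower())]
    return max(runs, default=0)

print(make_query())
```

Exact matching misses degenerate repeat copies, which is one reason a similarity search such as BLAST is used in practice.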

4.3 Pseudogenes

As we already discussed in Chap. 4, authentic pseudogenes have no function, and they are genuine members of junk DNA. When a gene duplication occurs, one of the two copies often becomes a pseudogene. Because gene duplication is prevalent in eukaryote genomes, pseudogenes are also abundant. Pseudogenes are, by definition, homologous to functional genes. However, after a long evolutionary time, many selectively neutral mutations accumulate in pseudogenes, and eventually they lose detectable sequence homology with their functional counterparts. There are many unique sequences in eukaryote genomes, and the majority of them may be pseudogenes of this homology-lost kind.

4.4 Junk RNAs and Junk Proteins

A long RNA is initially transcribed from a genomic region having an exon–intron structure, and then the RNAs corresponding to introns are spliced out. These leftover RNAs may be called “junk” RNAs, for they are soon degraded by RNases. Only a limited set of genes is transcribed in each tissue of a multicellular organism, but leaky expression of some genes may happen in tissues in which these genes should not be expressed. Again these are “junk” RNAs, and they are swiftly decomposed. A series of studies (e.g., [36, 37]) claimed that many noncoding DNA regions are transcribed. However, van Bakel et al. [38] showed that most of these transcripts were artifacts of the chip-chip technologies used in those studies. If a nonsense or frameshift mutation occurs in a protein coding sequence, that gene cannot make a functional protein. Yet its mRNA may be produced continuously until the promoter or enhancer becomes nonfunctional. In this case, the mutated gene produces junk RNAs. If RNAs are found in cells only in small quantities and are not evolutionarily conserved, they are probably some kind of junk RNA.

As junk DNAs and junk RNAs exist, cells may also have “junk” proteins. If mature mRNAs are not produced in the expected way, various aberrant mRNA molecules will be produced, and ribosomes will translate them into peptides based on this erroneous mRNA information. Proteins produced in this way may be called “junk” proteins, for they often have little or no function. Even a protein that is correctly translated and moved to its expected cellular location can still be considered a “junk” protein if it has lost its function. One good example is the ABCC11 transporter protein associated with dry-type cerumen (earwax), for one nonsynonymous substitution in this gene made the protein essentially nonfunctional [39].

5 Evolution of Eukaryote Genomes

There are various genomic features that are specific to eukaryotes other than existence of introns and junk DNAs, such as genome duplication, RNA editing, C-value paradox, and the relationship between genome size and mutation rates. We will briefly discuss them in this section.

5.1 Genome Duplication

The most dramatic and influential change of genome structure is genome duplication. Genome duplication is also called polyploidization, but this term is tightly linked to karyotype, that is, chromosome constitution.

Prokaryotes are so far not known to have experienced genome duplications, which are restricted to eukaryotes. Interestingly, genome duplications are quite frequent in plants, while they are relatively rare in the other two multicellular eukaryotic lineages. An ancient genome duplication was found by genome analysis of baker’s yeast [40], and Rhizopus oryzae, a basal lineage fungus, was also found to have experienced a genome duplication [41]. Among protists, Paramecium tetraurelia is known to have experienced at least three genome duplications [42]. Because humans are vertebrates and two rounds of genome duplication occurred in the common ancestor of vertebrates (see Chap. 9), we may be inclined to think that genome duplications often happen in many animal species. This is not the case. So far, only vertebrates and some insects are known to have experienced genome duplications. The reason for this scattered distribution of genome duplications is not known.

If we plot the distribution of the number of synonymous substitutions between duplogs in one genome, it is possible to detect a relatively recent genome duplication. This is because all genes duplicate at the same time when a genome duplication occurs, while only a small number of genes duplicate in other modes of gene duplication (see Chap. 3). Figure 8.7 shows a schematic view of the two cases: with and without genome duplication. Lynch and Conery (2000; [44]) applied this method to various genome sequences and found that the Arabidopsis thaliana genome showed a clear peak indicative of a relatively recent genome duplication, while the genome sequences of the nematode Caenorhabditis elegans and the yeast Saccharomyces cerevisiae showed curves of exponential decay. It is interesting to note that a genome duplication was not known for Arabidopsis thaliana before its genome sequence was determined, while the genome of Saccharomyces cerevisiae is known to have been duplicated a long time ago [40].
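The idea of detecting a genome duplication from the distribution of synonymous distances can be sketched as follows. This is a deliberately crude illustration: it assumes that, without genome duplication, bin counts never increase with distance, and it flags any increase as a peak. The function names, the bin width, and the toy distances are hypothetical and are not taken from any real genome.

```python
def histogram(values, bin_width, max_value):
    """Count values in equal-width bins covering [0, max_value)."""
    n_bins = int(round(max_value / bin_width))
    counts = [0] * n_bins
    for v in values:
        if 0 <= v < max_value:
            counts[min(int(v / bin_width), n_bins - 1)] += 1
    return counts

def has_interior_peak(counts):
    """True if any bin exceeds the preceding one (crude peak criterion)."""
    return any(counts[i] > counts[i - 1] for i in range(1, len(counts)))

# Hypothetical synonymous distances between duplog pairs (toy values)
decaying = [0.05] * 8 + [0.15] * 4 + [0.25] * 2 + [0.35]
with_duplication = decaying + [0.25] * 6  # extra pairs created at one time point

print(histogram(decaying, 0.1, 0.4), has_interior_peak(histogram(decaying, 0.1, 0.4)))
print(histogram(with_duplication, 0.1, 0.4),
      has_interior_peak(histogram(with_duplication, 0.1, 0.4)))
```

Real distributions are noisy, so an actual analysis would need smoothing or a statistical test rather than this simple comparison of adjacent bins.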

Fig. 8.8 A negative correlation between the rate of synonymous substitutions and the protein-coding region size (From Rajic et al. 2005; [43])

When a genome duplication occurred in the ancient past, the number of synonymous substitutions may be saturated and cannot give an appropriate result. In this case, the number of amino acid substitutions may be used, although proteins vary in their rates of amino acid substitution. In any case, accumulation of mutations will eventually make two homologous genes dissimilar to each other. Therefore, although the possibility of genome duplications in prokaryotes has so far been rejected [45], it is not possible to infer events of the remote past simply by searching for sequence similarity. We should be careful before reaching a final conclusion.

5.2 RNA Editing

Modification of particular RNA molecules after they are produced via transcription is called RNA editing. All three major RNA molecules (mRNA, tRNA, and rRNA) may experience editing [46]. There are various patterns of RNA editing; substitutions, in particular between C and U, and insertions and deletions, particularly U, are mainly found in eukaryote genomes. Guide RNA molecules exist in one of the main RNA editing mechanisms, and they specify the location of editing, but there are some other mechanisms [47].
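Substitution-type editing between C and U can be modeled as a position-specific rewrite of a transcript. The sketch below is a toy illustration only: it assumes the edited positions are already known and supplied as 0-based indices, and it does not model guide RNAs or any other editing machinery. The function name is hypothetical.

```python
def edit_c_to_u(rna, sites):
    """Apply C-to-U substitution editing at the given 0-based positions."""
    bases = list(rna.upper())
    for pos in sites:
        if bases[pos] != "C":
            raise ValueError(f"position {pos} is {bases[pos]}, not C")
        bases[pos] = "U"
    return "".join(bases)

# Toy transcript; editing position 1 turns the codon ACG into AUG
print(edit_c_to_u("ACGCAU", [1, 3]))
```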

It is not clear how the RNA editing mechanism evolved. Tillich et al. [47] studied chloroplast RNA editing and concluded that suddenly many nucleotide sites of chloroplast DNA genome started to have RNA editing, but later the sites experiencing RNA editing constantly decreased via mutational changes. They claimed that there was no involvement of RNA editing on gene expression. This result does not give RNA editing a positive significance.

Because there are many types of RNA molecules inside a cell, there also exist many sorts of enzymes that modify RNAs. It may be possible that some of them suddenly started to edit RNAs via a particular mutation. RNA editing which did not cause deleterious effects to the genome may have survived by chance at the initial phase. This view suggests the involvement of neutral evolutionary process in the evolution of RNA editing.

5.3 C-Value Paradox

Organisms with complex metabolic pathways, such as multicellular organisms, have many genes, and their genome sizes are generally expected to be large. In contrast, viruses, whose genomes contain only a handful of genes, have small genomes, so their potential for genome evolution is rather limited. Even if amino acid sequences change rapidly because of high mutation rates, protein function may not change. Unless the gene number and genome size increase, viruses cannot evolve their genome structures. It is thus clear that an increase in genome size is crucial for producing the diversity of organisms. However, genomes often contain DNA regions that are dispensable, and organisms with large genomes have many such junk DNA regions. Because of their existence, genome size and gene number are not necessarily highly correlated. This phenomenon was historically called the C-value paradox (e.g., [48]): around 1950, the haploid DNA amount was found to be constant within a species, yet to vary considerably among species (e.g., [49–51]). The “C-value” is the amount of haploid DNA, and C probably stands for “constant” or “chromosomes.” We now know that the majority of eukaryotic genome DNA is junk, so there is no longer a paradox in C-values among species.

5.4 Conserved Noncoding Regions

While bacterial genomes consist mostly of protein coding genes, a considerable part of eukaryote genomes is noncoding. Most of this noncoding DNA is junk and has no function. If we find evolutionary conservation, however, the conserved regions should have some function maintained by purifying selection. From the initial stage of molecular evolutionary studies, noncoding regions were suspected to be involved in gene regulation (Zuckerkandl and Pauling 1965; Britten and Davidson 1971; King and Wilson 1975). Now it is becoming clear that at least some noncoding regions play important roles in gene regulation (e.g., Carroll 2005; [55]). Functional elements are expected to evolve more slowly than the surrounding nonfunctional DNA, as they are under purifying selection. Therefore, conserved noncoding sequences (CNSs) are likely to be functionally important.

Animal CNSs were discovered through comparison of human and fugu fish genome sequences by How et al. (1996; [52]). CNS analyses have proved to be powerful for detecting regulatory elements (e.g., Hardison 2000; [53], Levy et al. 2001; [54]). Bejerano et al. (2004; [102]) found highly conserved noncoding sequences through comparison of human, mouse, and rat genomes. Siepel et al. (2005; [56]) found conserved noncoding DNA sequences in insects, nematodes, and yeasts by comparing closely related species. We will discuss conserved noncoding sequences of vertebrates further in Chap. 9.

As for plants, Kaplinsky et al. (2002; [57]) found six short (<60 bp) CNSs in seven DNA regions containing protein coding gene orthologs of rice and maize. Guo et al. (2003; [58]) identified 20 bp as the minimal criterion for a CNS in grasses. Inada et al. (2003; [59]) examined 3,000 bases upstream and downstream of 52 orthologous protein coding genes of rice and maize and found that most CNSs were shorter than 20 bases. Thomas et al. (2007; [60]) compared Arabidopsis thaliana paralogous sequences and found 14,944 intronic conserved noncoding sequences, ranging in length from 15 to 285 bp. D’Hont et al. (2012; [61]) determined the banana genome and found 116 CNSs in genome sequences of commelinid monocots (banana, palm, and grasses). Kristas et al. (2012; [62]) compared genome sequences of Arabidopsis, grape, rice, and Brachypodium and found CNSs to be >100 times more abundant in monocots than in dicots. Hettiarachchi and Saitou [63] compared genome sequences of 15 plant species and searched for lineage-specific CNSs. They found 2 and 22 CNSs shared by all vascular plants and by angiosperms, respectively, and also confirmed that monocot CNSs are much more abundant than those of dicots.
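A minimal-length criterion for CNSs, such as the 20 bp mentioned above, can be illustrated by scanning a pairwise alignment for long identical stretches. The sketch below assumes two already-aligned sequences of equal length with "-" for gaps and requires exact identity; the studies cited use their own identity thresholds and filtering steps, so the function and its defaults are hypothetical simplifications.

```python
def conserved_runs(aln_a, aln_b, min_len=20):
    """Return (start, end) column ranges of gap-free identical runs >= min_len.

    Ranges are 0-based and end-exclusive.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    for i, (a, b) in enumerate(zip(aln_a.lower(), aln_b.lower())):
        match = a == b and a != "-"
        if match and start is None:
            start = i                      # a new identical run begins
        elif not match and start is not None:
            if i - start >= min_len:
                runs.append((start, i))    # keep runs that are long enough
            start = None
    if start is not None and len(aln_a) - start >= min_len:
        runs.append((start, len(aln_a)))   # run extending to the alignment end
    return runs
```

In practice, coding regions would be masked beforehand so that only noncoding conservation is reported.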

5.5 Mutation Rate and Genome Size

What kind of relationship exists between genome size and mutation rate? If all the genetic information contained in the genome of an organism is necessary for its survival, an individual will die even if only one gene in its genome loses its function by mutation. An organism with a small genome and hence a small number of genes, such as a virus, can survive even if the mutation rate is high. In contrast, organisms with many genes may not be able to survive if highly deleterious mutations happen often. Therefore, such organisms must reduce the mutation rate.

Rajic et al. (2005; [43]) compared the rate of synonymous substitutions per year with the protein coding region size for organisms ranging from viruses to human and found a clear negative correlation, as shown in Fig. 8.8. Sanjuán et al. (2010; [64]) compared many studies on viral mutations and found a clear negative correlation between the substitution-type mutation rate per nucleotide site per cell infection and viral genome size.

Fig. 8.9 The relationship between the genome size and gene numbers among 88 fungi genomes

However, when the nucleotide substitution-type mutation rate per generation was compared with the whole-genome size, Lynch (2006; [65]) found a positive correlation. More recently, Lynch (2010; [66]) admitted that for organisms with small genomes, these two values were in fact negatively correlated. When eukaryotes with large genomes were compared, however, a positive correlation was observed.

We have to be careful when we discuss these two contradictory reports. One measured the rate per physical year, while the other measured it per generation. Another difference is whether only the size of protein coding regions or the whole-genome size was used. The relationship between mutation rate and genome size is not simple. Drake et al. (1998; [67]) examined this problem and found that the mutation rate per genome per replication was approximately 1/300 for bacteria, while mutation rates of multicellular eukaryotes vary between 0.1 and 100 per genome per sexual or individual generation. Table 8.4 lists mutation rates and genome sizes for various organisms. Apparently there is no clear tendency.
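The unit differences discussed above can be made explicit with a few conversions. The sketch below assumes simple proportionality between per-site and per-genome rates and between per-generation and per-year rates; the function names are hypothetical, and the genome size and generation time in the example are illustrative values, not measurements.

```python
def per_genome_rate(per_site_rate, genome_size_bp):
    """Mutations per genome = mutations per site x number of sites."""
    return per_site_rate * genome_size_bp

def per_site_rate(per_genome, genome_size_bp):
    """Inverse conversion: per-site rate implied by a per-genome rate."""
    return per_genome / genome_size_bp

def per_year_rate(rate_per_generation, generation_time_years):
    """Convert a per-generation rate to a per-year rate."""
    return rate_per_generation / generation_time_years

# If the per-genome rate stays near 1/300 (the value quoted in the text),
# the implied per-site rate falls as the genome grows.
for size in (1_000_000, 5_000_000):  # hypothetical genome sizes in bp
    print(size, per_site_rate(1 / 300, size))
```

This shows why a roughly constant rate per genome corresponds to a negative correlation between per-site rate and genome size, and why rates in different time units should not be compared directly.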

Table 8.4 Mutation rates and genome sizes of various organisms

6 Genome of Multicellular Eukaryotes

We will discuss genomes of three multicellular lineages of eukaryotes: plants, fungi, and animals in this section. Unfortunately, there seems to be no common feature of genomes of multicellular organisms, so each lineage is discussed independently.

6.1 Plant Genomes

Arabidopsis thaliana was the first plant species whose genome (125 Mb) was determined, in 2000 [68]. A. thaliana is a model organism for flowering plants (angiosperms), with a generation time of only 2 months. In spite of its small genome size, only 4 % of that of the human genome, it has 32,500 protein coding genes. The genome sequence of a closely related species, A. lyrata, was also recently determined [69].

Angiosperms are divided into monocots and dicots. A. thaliana is a dicot, and genome sequences of six more species were determined as of December 2013 (see Table 8.5).

Table 8.5 List of plant species whose genome sequences were determined

Rice, Oryza sativa, is a monocot, and its genome size, 370–410 Mb, is much smaller than that of the wheat genome. The genomes of its japonica and indica subspecies were determined [70, 71], and the origin of rice domestication is currently highly controversial, particularly regarding whether domestication occurred once or multiple times (e.g., [72, 73]). The number of protein coding genes in the rice genome is 37,000–40,000 [74].

Wheat corresponds to the genus Triticum, and there are many species in this genus. The typical bread wheat is Triticum aestivum, and it is a hexaploid with 42 (7 × 6) chromosomes. Its genome arrangement is conventionally written as AABBDD [75]. Because it now behaves as a diploid, genomic sequencing of the 21 chromosomes (A1–A7, B1–B7, and D1–D7) is under way (see http://www.wheatgenome.org/ for the current status). The hexaploid genome structure emerged by hybridization of the tetraploid (AABB) cultivated species T. durum and the diploid (DD) wild species Aegilops tauschii [75]. A genome duplication followed the hybridization.

Non-seed land plants are ferns, lycophytes, and bryophytes, in the order of closeness to seed plants (e.g., [76]). A draft genome sequence of a moss, Physcomitrella patens, was reported in 2008 [77], followed by genome sequencing of a lycophyte, Selaginella moellendorffii, in 2011 [78]. These genome sequences from different plant lineages are helping to decipher the stepwise evolution of land plants.

6.2 Fungi Genomes

The genome sequence of baker’s yeast (Saccharomyces cerevisiae) was determined in 1996, the first for a eukaryotic organism [79]. There are 16 chromosomes in S. cerevisiae, and its genome size is about 12 Mb. There are a total of 8,000 genes in its genome: 6,600 ORFs and 1,400 other genes. The genome-wide GC content is 38 %, slightly lower than that of the human genome. The proportion of introns is very small compared to that of the human genome, and the average length of one intron is only 20 bp, in contrast to the 1,440-bp average length of exons [80]. As we already discussed, the ancestral genome of baker’s yeast experienced a genome-wide duplication [40]. Pseudogenes, which are common in vertebrate genomes, are rather rare in the genome of baker’s yeast; they constitute only 3 % of the protein coding genes [80]. Baker’s yeast is often considered the model organism for all eukaryotes; however, its genome may not be a typical eukaryote genome.

As of December 2013, genome sequences of more than 400 fungi species are available (see NCBI genome list at http://www.ncbi.nlm.nih.gov/genome/browse/ for the present situation). Figure 8.9 shows the relationship between the genome size and gene numbers for 88 genomes. There is a clear positive correlation between them. However, there are some outliers. The Perigord black truffle (Tuber melanosporum), shown as A in Fig. 8.9, has the largest genome size (~125 Mb) among the 88 fungi species whose genome sequences were so far determined, yet the number of genes is only ~7,500 [81].

Fig. 8.10 Gain and loss of genes in each branch of the phylogenetic tree for fungi species (Based on [81])

Three other outlier species are Postia placenta, Ajellomyces dermatitidis, and Melampsora larici-populina, shown as B, C, and D in Fig. 8.9, respectively. Interestingly, these four outlier species are not well clustered phylogenetically; two belong to Pezizomycotina of Ascomycota, and the other two belong to Agaricomycotina and Pucciniomycotina of Basidiomycota. If we exclude these four outlier species, a good linear regression is obtained, as shown in Fig. 8.9. This straight line indicates that, on average, one gene corresponds to 2.9 kb in a typical fungi genome. If we apply this average gene size to the truffle genome, its genome size should be ~22 Mb, but the real size is 103 Mb larger. This suggests that there is an unusually large amount of junk DNA in this genome. In fact, 58 % of its genome consists of transposable elements [81]. The excess 103 Mb is about 82 % of the genome, so a further 24 % of the truffle genome must be junk DNA other than these transposable elements. Gain and loss of genes in each branch of the phylogenetic tree for fungi species are shown in Fig. 8.10 (based on [81]). It will be interesting to examine genome sizes of species related to the Perigord black truffle, so as to infer the evolutionary period when the genome size expansion occurred.
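The arithmetic behind the truffle estimate can be retraced with a short Python sketch. All input values are those quoted in the text (7,500 genes, 2.9 kb per gene, a 125-Mb genome, and 58 % transposable elements); the function and variable names are introduced here only for illustration, and the expected size is rounded to whole megabases as in the text.

```python
def excess_dna(genes: int, kb_per_gene: float, observed_mb: float, te_fraction: float) -> dict:
    """Compare an observed genome size with the size expected from gene number."""
    # Expected genome size if the fungal regression line applied (rounded to Mb).
    expected_mb = round(genes * kb_per_gene / 1000)
    # DNA not accounted for by the regression line.
    excess_mb = observed_mb - expected_mb
    excess_fraction = excess_mb / observed_mb
    # Part of the excess that is not explained by transposable elements.
    other_fraction = excess_fraction - te_fraction
    return {
        "expected_mb": expected_mb,
        "excess_mb": excess_mb,
        "excess_fraction": excess_fraction,
        "other_fraction": other_fraction,
    }


truffle = excess_dna(genes=7_500, kb_per_gene=2.9, observed_mb=125, te_fraction=0.58)
for key, value in truffle.items():
    print(key, round(value, 3))
```

The same function could be applied to the other three outliers if their gene numbers and genome sizes were read from Fig. 8.9, which are not given in the text.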

Fig. 8.11 The Hox gene clusters found in each animal phylum (From Saitou 2007; [103])

6.3 Animal Genomes

Animals, or metazoa, are the most integrated multicellular organisms. Genome sequences of 35 invertebrate species and 32 vertebrate species were determined by the end of 2011 according to the GCDB of Kryukov et al. (2012; [1]). As of December 2013, genomes of 35 invertebrate and 43 vertebrate species were determined according to the KEGG database (http://www.genome.jp/kegg/catalog/org_list.html). A major gene system responsible for the integrated body organization of animals is the Hox genes. We thus first discuss this gene system in this subsection. The genome of C. elegans, the first animal genome to be determined, will be discussed next, followed by genomes of insects and those of deuterostomes. Because genomes of many vertebrate species were determined, we discuss them in Chap. 9 and, in particular, the human genome in Chap. 10.

6.3.1 Hox Code

Hox genes were initially found by Edward B. Lewis through studies of homeotic mutations that dramatically change the segmental structure of Drosophila [82]. They code for transcription factors, and a DNA-binding peptide, now called homeobox domain, was later found in almost all animal phyla [83]. Figure 8.11 shows the Hox gene clusters found in 12 animal groups. There are four Hox clusters in mammalian and avian genomes, and they were most probably generated by the two-round genome duplication in the common ancestor of vertebrates (see Chap. 9).

Fig. 8.12 Highly conserved noncoding sequences found from comparison of Hox A cluster regions of many vertebrate species (From Matsunami et al. 2010; [85])

Interestingly, the physical order of Hox genes on chromosomes corresponds to the order of their expression during development, a property called “collinearity” [84]. This suggests that some sort of cis-regulation is operating in Hox gene clusters, and in fact, many long transcripts are found, and some of their transcription start sites are highly conserved among vertebrates [85]. Figure 8.12 shows highly conserved noncoding sequences found from comparison of Hox A cluster regions of many vertebrate species (from Matsunami et al. 2010; [85]).

Fig. 8.13 Distribution of the protein coding genes in the genome of Caenorhabditis elegans (From [86])

The Hox genes control expression of different groups of downstream genes, such as transcription factors, elements in signaling pathways, or genes with basic cellular functions. Hox gene products interact with other proteins, in particular those in signaling pathways, and contribute to the modification of homologous structures and the creation of new morphological structures [87].

There are other gene families that are thought to be involved in the diversity of animal body plans. One of them is the Zic gene family [88]. The Zic gene family exists in many animal phyla with high amino acid sequence homology in a zinc-finger domain called ZF, and members of this gene family are involved in neural and neural crest development, skeletal patterning, and left–right axis establishment. This gene family has two additional domains, ZOC and ZF-BC. Interestingly, Cnidaria, Platyhelminthes, and Urochordata lack the ZOC domain, and their ZF-BC domain sequences are quite diverged compared to those of Arthropoda, Mollusca, Annelida, Echinodermata, and Chordata. This distribution suggests that the Zic family genes with the entire set of the three conserved domains already existed in the common ancestor of bilaterian animals, and some of the domains may have been lost in parallel in the platyhelminthes, nematodes, and urochordates [88]. Interestingly, phyla that lost the ZOC domain have quite distinct body plans although they are bilaterians.

6.3.2 Genome of C. elegans

Caenorhabditis elegans was the first animal species whose draft genome sequence (97 Mb) was determined, in 1998 [89]. This organism belongs to the phylum Nematoda, which includes a vast number of species [90]. Brenner (1974; [91]) chose this species as a model organism to study the nervous system, because of its short generation time (~4 days) and its small size (~1 mm). Figure 3.3 in Chap. 3 shows the cell genealogy of this species.

The description in this section is based on the information given in the online “WormBook” [86]. There are 22,227 protein coding genes in C. elegans including 2,575 alternatively spliced forms, with 79 % confirmed to be transcribed at least partially. The number of tRNA genes is 608, and 274 of them are located on the X chromosome. The three kinds of rRNA genes (18S, 5.8S, and 26S) are located on chromosome I in 100–150 tandem repeats, while ~100 5S rRNA genes are also in tandem form but are located on chromosome V. The average protein coding gene length is 3 kb, with an average of 6.4 coding exons per gene. In total, protein coding exons constitute 25.6 % of the whole genome. Figure 8.13 shows the distribution of the protein coding genes, and Fig. 8.14 shows the distribution of exon numbers per gene. Both distributions have long tails. The median sizes of exons and introns are 123 bp and 65 bp, respectively. Introns of C. elegans are quite short compared to those of vertebrate genes (see Chap. 9). The density of protein coding genes varies among chromosomes: it is slightly higher on the five autosomes than on the X chromosome, and higher in the central region of a chromosome than near its ends. Processed, i.e., intronless, pseudogenes are rare, and a total of 561 pseudogenes were reported in WormBase version WS133. About half of them are homologous to functional chemoreceptor genes.
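As a rough consistency check on these statistics, the following Python sketch derives the average coding length per gene and per exon from the numbers quoted above. It assumes that the 97-Mb draft size can stand in for the whole-genome size and that all 22,227 entries can be treated as separate genes; both are simplifications made here, so the results are only approximate.

```python
# Values quoted in the text; using the 97-Mb draft size is an assumption.
GENOME_SIZE_BP = 97e6
CODING_EXON_FRACTION = 0.256
PROTEIN_CODING_GENES = 22_227
CODING_EXONS_PER_GENE = 6.4

# Total length of protein coding exons in the genome.
coding_bp = GENOME_SIZE_BP * CODING_EXON_FRACTION

# Average coding length per gene and per coding exon.
coding_bp_per_gene = coding_bp / PROTEIN_CODING_GENES
coding_bp_per_exon = coding_bp_per_gene / CODING_EXONS_PER_GENE

# Average genomic distance available per gene (a simple gene-density measure).
bp_per_gene = GENOME_SIZE_BP / PROTEIN_CODING_GENES

print(f"coding length per gene: {coding_bp_per_gene:.0f} bp")
print(f"coding length per exon: {coding_bp_per_exon:.0f} bp")
print(f"one gene per {bp_per_gene / 1000:.1f} kb of genome")
```

Under these assumptions the mean coding exon length comes out above the 123-bp median, which is what a right-skewed size distribution would produce, although the text itself reports long tails only for gene length and exon number.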

Fig. 8.14 Distribution of exon numbers per gene in the genome of Caenorhabditis elegans (From [86])

Genome sequences of four congeneric species of C. elegans (C. brenneri, C. briggsae, C. japonica, and C. remanei) were determined (http://www.ncbi.nlm.nih.gov/genome/browse/).

6.3.3 Insect Genomes

The fruit fly Drosophila melanogaster was used by Thomas Hunt Morgan’s group in the early twentieth century and has been used for many genetic studies since. Because of this importance, its genome sequence was the first to be determined among arthropods, in 2000 [92]. Heterochromatin regions of ~50 Mb were excluded from sequencing, and only the 120-Mb euchromatin regions were determined. Genome sequences of 12 Drosophila species (D. ananassae, D. erecta, D. grimshawi, D. melanogaster, D. mojavensis, D. persimilis, D. pseudoobscura, D. sechellia, D. simulans, D. virilis, D. willistoni, and D. yakuba) were determined in 2007 [93]. Their genome sizes vary from 145 to 258 Mb, and the number of genes is 15,000–18,000. Interestingly, D. melanogaster has the largest genome size and the smallest number of genes.

A total of 12 insect species other than the 12 Drosophila species were sequenced by the end of 2011 [1]. As of December 2013, their genome sizes range from 108 Mb to 540 Mb, a fivefold difference, and the gene numbers range from 9,000 to 23,000.

6.3.4 Genomes of Deuterostomes

Deuterostomes contain five phyla: Echinodermata, Hemichordata, Chaetognatha, Xenoturbellida, and Chordata. The genome of the sea urchin Strongylocentrotus purpuratus was determined in 2006 [94]. Its genome size is 814 Mb with 23,300 genes. Genomes of two other echinoderms, the sea urchin Lytechinus variegatus and the sea star Patiria miniata, are also being sequenced, as is that of the hemichordate Saccoglossus kowalevskii.

Chordata is classified into Urochordata (ascidians), Cephalochordata (lancelets or amphioxus), and Vertebrata (vertebrates). Because we will discuss genomes of vertebrates in Chap. 9, let us discuss genomes of ascidians and lancelets only. The genome of ascidian Ciona intestinalis was determined in 2002 [95], and the genome sequence of its congeneric species, C. savignyi, was also determined three years later [96]. The genome size of C. intestinalis is ~155 Mb with ~16,000 genes. Interestingly it contains a group of cellulose synthesizing enzyme genes, which were probably introduced from some bacterial genomes via horizontal gene transfer [8, 97].

The C. intestinalis genome also contains several genes that are considered to be important for heart development [95], and this suggests that the hearts of ascidians and vertebrates may be homologous. Through the superimposition of phylogenetic trees (see Chapter A2) for five genes coding muscle proteins, Oota and Saitou ([98]) estimated that vertebrate heart muscle was phylogenetically closer to vertebrate skeletal muscles. If both results are true, the muscles used in the heart might have been substituted in the vertebrate lineage. The genome sequence of an amphioxus (the cephalochordate Branchiostoma floridae) was determined by Holland et al. (2008; [104]), and it provides good outgroup sequence data for vertebrates.

7 Eukaryote Virus Genomes

Eukaryotic viruses rely on their eukaryotic host species for most metabolic pathways. Therefore, the number of genes in virus genomes is usually very small. For example, influenza A virus has 8 RNA fragments coding for 11 protein genes, and the total genome size is ~13.6 kb.

As in bacteriophages, there are both DNA type and RNA type genomes in eukaryotic viruses. Table 8.6 shows one example of classification of eukaryotic viruses based on their genome structure [99]. Genomes of double-strand DNA viruses are of four types: circular, simple linear, linear with proteins covalently attached to both ends, and linear with both ends closed. Genomes of single-strand DNA viruses are either circular or linear.

Table 8.6 Classification of eukaryotic viruses based on their genome structure (From Sadaie et al. eds. 2004; [99])

Genomes of RNA viruses are all linear, in both single- and double-strand types. Single-strand RNA genomes are classified into two types: plus strand and minus strand. A subset of plus-strand single-strand RNA viruses, the retroviruses, such as human T-cell leukemia virus (HTLV), go through a DNA intermediate during replication.

Some DNA viruses have unusually large genomes, similar in size to a small bacterial genome. Megavirus, parasitic on amoebae, has a 1.26-Mb genome with 1,120 protein coding genes [100]. Megavirus is phylogenetically close to mimivirus [101], a member of the nucleocytoplasmic large DNA viruses, which also include poxviruses. Recently, a virus with an even larger genome, Pandoravirus, with a genome of more than 2.5 Mb, was discovered [105]. The phylogenetic status of these DNA viruses with large genomes is unknown at this moment.