The genome of the polar eukaryotic microalga Coccomyxa subellipsoidea reveals traits of cold adaptation
Little is known about the mechanisms of adaptation of life to the extreme environmental conditions encountered in polar regions. Here we present the genome sequence of a unicellular green alga from the division chlorophyta, Coccomyxa subellipsoidea C-169, which we will hereafter refer to as C-169. This is the first eukaryotic microorganism from a polar environment to have its genome sequenced.
The 48.8 Mb genome contained in 20 chromosomes exhibits significant synteny conservation with the chromosomes of its relatives Chlorella variabilis and Chlamydomonas reinhardtii. The order of the genes is highly reshuffled within synteny blocks, suggesting that intra-chromosomal rearrangements were more prevalent than inter-chromosomal rearrangements. Remarkably, Zepp retrotransposons occur in clusters of nested elements, with strictly one cluster per chromosome, probably residing at the centromere. Several protein families overrepresented in C. subellipsoidea include proteins involved in lipid metabolism, transporters, cellulose synthases and short alcohol dehydrogenases. Conversely, C-169 lacks proteins that exist in all other sequenced chlorophytes, including components of the glycosyl phosphatidyl inositol anchoring system, pyruvate phosphate dikinase and the photosystem 1 reaction center subunit N (PsaN).
We suggest that some of these gene losses and gains could have contributed to adaptation to low temperatures. Comparison of these genomic features with the adaptive strategies of psychrophilic microbes suggests that prokaryotes and eukaryotes followed comparable evolutionary routes to adapt to cold environments.
aluminum-activated malate transporter
conserved pairs of adjacent orthologs
expressed sequence tag
fatty acid synthase
glycosyl phosphatidyl inositol
Joint Genome Institute
long interspersed nucleotide elements
pyruvate phosphate dikinase
photosystem 1 reaction center subunit N
reactive oxygen species
short interspersed nucleotide elements
small interfering RNA.
Algae consist of an extremely diverse, polyphyletic group of eukaryotic photosynthetic organisms. To characterize the genetic and metabolic diversity of chlorophytes (eukaryotic green algae) and to better understand how this diversity reflects adaptation to different habitats, we sequenced the trebouxiophyceaen Coccomyxa subellipsoidea C-169 NIES 2166. C-169 is a small elongated non-motile unicellular green alga (cell size of approximately 3 to 9 μm; Figure S1A in Additional file 1) isolated in the polar summer of 1959/60 at Marble Point, Antarctica, from dried algal peat. The Antarctic is a particularly harsh environment, with extremely low temperatures (as low as -88°C), frequent and rapid fluctuations from freezing to thawing temperatures, severe winds, low atmospheric humidity, and alternating long periods of sunlight and darkness. C-169 is psychrotolerant with an optimal temperature for growth at around 20°C; in comparison, psychrophiles and psychrotrophs are organisms that have optimal growth temperatures of < 15°C and > 15°C, respectively, and a maximum growth temperature of < 20°C. C-169 was originally classified as Chlorella vulgaris, but present sequence data led to re-classification of the alga into the Coccomyxa genus with a species name of C. subellipsoidea (Supplemental Results in Additional file 2 and Figure S1 in Additional file 1).
C. subellipsoidea strains were first isolated in England and Ireland, where they form jelly-like incrustations on mosses and rocks [2, 3]. In contrast to its closest sequenced relative, the trebouxiophyte Chlorella variabilis NC64A, which is an endosymbiont of paramecia, C-169 is free living. However, the type strain C. subellipsoidea SAG 216-13 as well as other isolates in the same species are known to form lichens with subarctic basidiomycetes of the genus Omphalina; other Coccomyxa spp. are intracellular symbionts of Ginkgo and Stentor amethystinus and intracellular parasites of mussels. In the past 20 years C-169 has been used as a model organism in pioneering studies on green algal chromosome architecture. For example, early studies indicated that approximately 1.5% of its genome consists of LINE- and SINE-type retrotransposons [9, 10]. Additional studies provided a detailed analysis of the smallest chromosome (980 kb) [11, 12].
Here we report the gene content, genome organization, and deduced metabolic capacity of C-169 and compare those features to other sequenced chlorophytes. We show that the C-169 gene repertoire encodes enzymatic functions not present in other sequenced green algae that are likely to represent hallmarks of its adaptation to the polar habitat.
Results and discussion
The C-169 genome was draft sequenced using the whole genome shotgun Sanger sequencing approach. After sequencing, the C-169 genome was assembled into 29 gap-free scaffolds (12-fold coverage) encompassing 48.8 Mb (Figure S2 in Additional file 1), which is 2.6 Mb (5%) larger than the genome of C. variabilis. Alignments of 28,322 ESTs from C-169 indicate that the assembly is 97% complete. Twelve scaffolds represent complete chromosomes with telomeric repeat arrays at both ends. Pulsed-field gel electrophoresis and Southern hybridization were used to assign the remaining 17 scaffolds to chromosomal bands (Supplemental Results in Additional file 2). This allowed nine scaffolds to be assigned to another four complete chromosomes. The eight remaining scaffolds could not be assigned unambiguously because several chromosomes have nearly identical sizes. Each of these eight scaffolds has a telomeric repeat array at one end, which indicates that together they correspond to four additional chromosomes. Thus, sequence assembly and Southern hybridization suggest that the C-169 karyotype consists of 20 chromosomes (12 + 4 + 4).
Genomic features of C. subellipsoidea C-169
Nuclear genome size
Number of scaffolds
GC (%) genome
GC (%) exon
GC (%) intron
Repeated sequences (%)
Protein coding gene number
Mean protein length (amino acids)
Gene density (kb/gene)
Mean exon length
Mean intron length
About one-third of the mitochondrial genome sequence (20,739/65,497 bp, 31%) and 6% of the chloroplast genome sequence (11,312/175,731 bp) are integrated into the nuclear genome as 385 scattered individual DNA fragments with sizes ranging from 40 to 397 bp (Table S4 in Additional file 2), some containing truncated open reading frames. This phenomenon is more prominent in C-169 than in any other sequenced chlorophyte. Both the mitochondrial and chloroplast genomes have GC contents greater than 50% (53.2% for the mitochondrion and 50.7% for the chloroplast). This > 50% GC content is unusual, as most mitochondrial and plastid genomes are enriched in adenine and thymine. In fact, C-169 is one of only a few eukaryotes to have this property.
Non-random distribution of Zepp retrotransposon
Repeated sequences represent 7.2% (3.5 Mb) of the C-169 genome, a fraction comparable to other sequenced green algae, except for the chlorophyceaen species that have higher repeat contents (Table S3 in Additional file 2). Forty-one percent of the C-169 repeated sequences resemble known repeat families. The most prominent are non-long-terminal-repeat retrotransposons, including Zepp LINEs (16.2%) and retrotransposable elements RTE (5.8%), and SINEs (8.8%) (Table S5 in Additional file 2).
Clusters of nested Zepp retrotransposons were previously found at the termini of C-169 chromosomes. In the present study, we found 22 Zepp clusters in the genome assembly with sizes ranging from 1.5 to 42.3 kb and comprising one to several copies of nested Zepp elements. The 12 complete chromosome scaffolds plus the 4 chromosomes reconstructed by Southern hybridization contain one Zepp cluster each. These clusters most often lie inside chromosomes, where they are relatively distant from telomeres; only two chromosomes have a Zepp cluster in a sub-telomeric position (scaffolds 12 and 23; Figure S2 in Additional file 1). The eight remaining scaffolds corresponding to incomplete chromosomes have either one or no Zepp cluster: two have no Zepp cluster, two have an internally located Zepp cluster and four have Zepp retrotransposons at one end. The distribution pattern of Zepp retrotransposons in the genome assembly suggests that each C-169 chromosome contains strictly one Zepp cluster. Because the average GC content of individual Zepp elements is relatively high (61% GC) compared to the rest of the genome (53% GC), Zepp clusters produce local peaks of GC content within chromosomes.
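The local GC peaks mentioned above can be located with a simple sliding-window scan. The following Python sketch is only an illustration and is not the procedure used in this study; the function name gc_windows, the window and step sizes, and the toy sequence are assumptions made for the example.

```python
def gc_windows(seq, window=1000, step=500):
    """Return a list of (start, gc_fraction) for each full window along seq."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        # Count G and C bases; any other symbol (A, T, N) counts as non-GC.
        gc = sum(1 for base in chunk if base in "GC")
        results.append((start, gc / window))
    return results


# Toy sequence (hypothetical): AT-rich background with a GC-rich insert.
toy = "AT" * 10 + "GC" * 10 + "AT" * 10
profile = gc_windows(toy, window=10, step=10)

# The window with the highest GC fraction marks the candidate peak.
peak_start, peak_gc = max(profile, key=lambda item: item[1])
```

On real scaffolds, windows of a few kilobases would be more appropriate, and peaks would be judged against the genome-wide average of about 53% GC reported above.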
No sequence in the EST dataset originates from a Zepp element, indicating that Zepp elements are expressed at very low levels or totally inactivated under the conditions used for EST production. In a previous study, Zepp expression was only detected under specific conditions, such as irradiation with an electron beam or following a heat shock. A neutral explanation of the non-random distribution of Zepp retrotransposons is that they integrated into hotspots present as a single copy in each chromosome, for example, centromeric regions. Alternatively, a single Zepp cluster may be indispensable for normal chromosome function.
The report that Zepp elements were constantly present in neoformed minichromosomes supports this hypothesis. These observations suggest a role for Zepp elements or sequences therein in centromeric functions. No tandem satellite repeats, as occur in the centromeres of many eukaryotes, were identified within or in the vicinity of the Zepp clusters. The Zepp elements may be involved in centromere formation in a process similar to that of the LINE-1 retrotransposons in human neocentromeric regions. The canonical Zepp element possesses two open reading frames encoding reverse transcriptase and Gag-like proteins. BLASTP searches in public databases did not identify significant matches for the Zepp Gag-like protein, while the closest homolog to the reverse transcriptase protein was found in the fungus Ustilago maydis. No such Zepp clusters are found in the other green algal genome sequences.
Conserved synteny with poor gene colinearity
Protein family expansion
Annotated proteins of nine sequenced chlorophyte algae (C-169, C. variabilis NC64A, C. reinhardtii, V. carteri, Micromonas pusilla CCMP1545, Micromonas sp. RCC299, Ostreococcus sp. RCC809, Ostreococcus lucimarinus and Ostreococcus tauri) were organized into 23,507 families based on shared sequence similarity. Except for C-169, all these green algae are temperate and live in fresh water (C. variabilis, V. carteri), soil (C. reinhardtii) or marine water (Micromonas and Ostreococcus spp.). Assignment of PFAM domains to proteins identified several protein families that have a significantly higher number of proteins in C-169 than in other chlorophyte algae (Table S6 in Additional file 2). The expansion of some of these protein families might reflect adaptation of the alga to a new habitat with extreme conditions.
Four over-represented protein families correspond to important steps in lipid metabolism. They include putative type-I fatty acid (FA) synthases, FA elongases, FA ligases and type 3 lipases. In addition, we identified a family of three FA desaturase proteins not found in other green algae (Figure S4 in Additional file 1). These proteins may be involved in adaptive processes that allowed C-169 to survive in the Antarctic environment. These processes include modification of the FA composition (polyunsaturated and branched FAs) of membrane lipids to maintain membrane fluidity at low temperature  and production of antifreeze lipoproteins.
Although C-169 can grow on inorganic media, it encodes a large variety of amino acid transporters and amino acid permeases (Table S6 in Additional file 2) that presumably allow the alga to import amino acids from organic extracellular environments such as decomposing algal peat. C-169 also encodes six proteins with high sequence similarity to the plant aluminum-activated malate transporters (ALMT). In land plants, ALMTs mediate tolerance to external toxic aluminum cations by exuding malate that chelates and immobilizes Al3+ at the root surface, thus preventing it from entering root cells . Experimental studies are required to confirm that the algal ALMTs play a similar role in C-169.
Five expanded protein families are putatively involved in polysaccharide and cell wall metabolism (Table S6 in Additional file 2). The production of exopolysaccharides and antifreeze glycoproteins plays an important role in cryoprotection of cold-adapted microorganisms . C-169 encodes 22 putative glycosyl hydrolase proteins belonging to the cellulase family and 9 proteins that match the PFAM glycosyl hydrolase type-9 motif. In this last family, four of the proteins have their glycosyl hydrolase domain attached to a cellulose synthase-like domain that is highly similar to the cellulose synthase of tunicates . In algae, these cellulose synthase-like domains are only found in C-169, C. variabilis and Emiliania huxleyi and are not orthologous to the cellulose synthases and hemicellulose synthases of land plants (Figure S6 in Additional file 1). Interestingly, the tunicate cellulose synthase gene is also a fusion of a cellulose synthase domain and a glycosyl hydrolase domain (different from the algal glycosyl hydrolase type-9 domain) that has cellulase activity. Based on the identification of both cellulose synthase domains and cellulase domains, we predict that cellulose is a constituent of C-169 cell walls. Additional support for this prediction is that C-169 forms protoplasts after treatment with cellulases and Calcofluor white stains its cell wall .
C-169 encodes significantly more proteins containing short-chain dehydrogenase/reductase family signatures (PFAM adh_short motif) than other algae (Table S6 in Additional file 2). This large protein family uses a variety of substrates ranging from alcohols, sugars, steroids and aromatic compounds to xenobiotics , which is reflected in the wide phylogenetic diversity of short-chain dehydrogenases. Analysis of shared similarity between protein sequences indicates that the higher number of short-chain dehydrogenases in C-169 is essentially due to the specific expansion of a small number of subfamilies (Figure S7 in Additional file 1). Although no hypothesis can be presently advanced as to the functional role of these subfamilies, their specific expansion suggests that they contributed to C-169 adaptation.
Of the 2,305 predicted C-169 gene products with no detectable homolog in sequenced chlorophytes, 293 proteins grouped into 196 protein families with significant matches (BLASTP E-value < 1e-5) to other organisms (Table S8 in Additional file 2). Among these proteins are various enzymes putatively involved in defense and detoxification, transport, protection against solubilized dioxygen (for example, DOPA-dioxygenase), cell wall biosynthesis, and carbohydrate metabolism (Table S8 in Additional file 2). Overall, the majority (135/196, 69%) of these C-169-specific protein families have their closest phylogenetic homologs in Streptophytes and other Eukaryotes, which suggests that most of these genes existed in the common ancestor of chlorophytes and were subsequently lost in the Chlorophyceae, Mamiellophyceae and Chlorellaceae lineages. In contrast, bacteria are the closest phylogenetic counterpart of most of the C-169-specific proteins involved in carbohydrate metabolism and defense and detoxification pathways, which suggests that these important biological functions have been enriched by lateral gene transfer from prokaryotes.
Among the most remarkable C-169-specific proteins, we found a translation elongation factor-1α (protid: 54652) that functionally replaces the elongation factor-like EFL present in all the sequenced chlorophytes but C-169 . C-169 is also the only sequenced chlorophyte to encode a putative phospholipase D (Joint Genome Institute (JGI) ID: 38692), an important enzyme involved in stress responses and development in land plants . Furthermore, we found a chalcone synthase-like protein (protid: 45842) whose homologs in land plants and bacteria are involved in the synthesis of secondary metabolites for antimicrobial defense, pigmentation, UV photoprotection, and so on .
C-169 encodes a putative RNA-dependent RNA polymerase (RdRP) that resembles Arabidopsis homologs required for synthesizing small interfering RNAs (siRNA) involved in RNA silencing . Presumably functioning in the same pathway, C-169 also contains two argonaute-like proteins (AGLs; protid: 56022 and 56024) whose plant homologs bind siRNAs that regulate expression of their target genes. However, homologs to land plant Dicer ribonucleases and dsRNA binding proteins (DRBs), two key components of plant RNA silencing pathways, were absent in C-169. The apparent lack of a complete set of proteins required for RNA silencing suggests that this pathway is either non-functional or extensively modified compared to land plants.
Proteins involved in CO2 concentration
The CO2-concentrating mechanism (CCM) allows algae to accumulate internal concentrations of inorganic carbon (Ci; CO2 and HCO3-) well above the external concentrations in their aqueous environments, thereby promoting efficient photosynthesis and cell growth. Although most cyanobacteria and eukaryotic algae contain a functional CCM, its occurrence in C-169 was in question because another Coccomyxa strain symbiotic with a lichen lacks a CCM . However, annotation of the C-169 genome sequence identified 13 orthologs to genes known to be associated with the CCM in C. reinhardtii (Table S9 and Supplemental Results in Additional file 2, and Figure S8 in Additional file 1), the most thoroughly studied eukaryotic CCM. These genes include the well characterized CCM-associated genes (for example, CAH1, LCIB) as well as the master regulator of the C. reinhardtii CCM, CIA5/CCM1. These observations suggest that C-169 has a functional CCM.
Ubiquitous algal genes missing in C-169
Twenty-nine protein families whose genes were found in all sequenced chlorophytes are missing from the C-169 genome assembly (Table S10 in Additional file 2). C-169 does not encode any of the subunits of the glycosyl phosphatidyl inositol (Gpi) transamidase complex (Gpi8p, Gaa1p, Gpi16p, Gpi17p, and Cdc91p), which attach cell surface proteins to the cell membrane via preformed Gpi anchors . Homologs of Gpi8p, Gaa1p, and Gpi16p exist in all other sequenced chlorophytes, while Cdc91p was absent in both C-169 and C. variabilis; Gpi17p has not been identified in any algae. C-169 also lacks the Gpi-anchored wall transfer protein (Gwt) that is involved in Gpi-anchor biosynthesis. Thus, the Gpi anchoring system is lacking in this alga.
C-169 lacks a gene encoding a pyruvate phosphate dikinase (PPDK), an enzyme that ensures the interconversion of phosphoenolpyruvate and pyruvate. This protein is ubiquitous among other sequenced chlorophytes and streptophytes. PPDK plays a key role in gluconeogenesis and photosynthesis in C4 plants and is an ancillary glycolytic enzyme in C3 plants . In C-169, phosphoenolpyruvate/pyruvate conversion is apparently performed by three pyruvate kinases (PKs; protein ids: 32937, 61449 and 67234); however, the yield of glycolytically derived ATP per glucose is two in pyruvate kinase-dependent glycolysis and five in PPDK-dependent glycolysis. Thus, C-169 is potentially less effective in producing ATP from glycolysis than other chlorophytes.
Also missing in C-169 are genes encoding dolichyldiphosphatase, mannosyltransferase and carbohydrate kinase, three enzymes involved in glycan metabolism and cell wall maintenance, as well as genes of five families of transporter proteins, including the sodium/sulfate co-transporter, voltage-gated ion channel and maltose exporter families. C-169 lacks a cobalamin-dependent methionine synthase gene but has a cobalamin-independent methionine synthase gene, thus maintaining a functional methionine biosynthetic pathway .
C-169 lacks the photosystem 1 (PSI) reaction center subunit N (PsaN) involved in the docking of plastocyanin. Although PsaN is ubiquitous among green plants, it is not essential for phototrophic growth: Arabidopsis plants lacking PsaN can assemble a functional PSI complex but show a decrease in the rate of electron transfer from plastocyanin to PSI . Low temperatures induce an excess of electrons going through PSI that are eventually transported to oxygen, thereby generating reactive oxygen species (ROS), which are harmful to the cell . Thus, the unique loss of the PsaN gene in C-169 may be advantageous under cold climates because it may lead to reduced ROS formation.
Adaptive strategies of psychrophilic prokaryotes to cope with low temperatures and potential adaptation in C. subellipsoidea C-169
Adaptive process | Prokaryotic genes or events involved in the process | C-169-specific genes potentially involved in the process
Increased fluidity of cellular membranes at low temperature | Unsaturated fatty acid (FA) synthesis genes, FA desaturases | Lipid biosynthesis genes, including FA synthase type I, FA desaturases, lipases
Reduction of freezing point of cytoplasm and stabilization of macromolecules | Genes for synthesis of compatible solutes, membrane transporters, antifreeze proteins and ice-binding proteins | Production of antifreeze lipoproteins, exopolysaccharides and glycoproteins: lipid biosynthesis genes, including FA synthase type I and FA ligases; carbohydrate metabolism genes, including glycosyl hydrolases and glycosyl transferases
Protection against reactive oxygen species | Catalases, peroxidases, superoxide dismutases, oxidoreductases | Dioxygen-dependent FA desaturases, DOPA-dioxygenase, loss of the gene encoding photosystem 1 subunit PsaN
Maintenance of catalytic efficiency at low temperatures | Global change in amino acid composition of encoded proteins to decrease protein structural rigidity | No apparent change in global amino acid composition relative to mesophilic plants and green algae
The fact that C-169 has more enzymes involved in the biosynthesis and modification of lipids than other sequenced chlorophytes suggests that this lineage of green alga has adapted to extreme cold conditions through greater versatility of its lipid metabolism, allowing it to synthesize a greater diversity of cell membrane components. These new enzymes and metabolic properties are of potential interest in developing technologies for converting lipids from microalgae into diesel fuel or valuable fatty acids. C-169 encodes a specific dioxygenase (DOPA-dioxygenase) and FA desaturases that use dioxygen as a substrate, which, together with the loss of the PsaN gene, may give its metabolism a higher level of protection against ROS. In contrast to psychrophilic organisms that live in permanently cold environments, the C-169 proteome exhibits no evidence of systematic bias in amino acid composition relative to the proteomes of other sequenced Plantae that are mesophilic (Figure S9 in Additional file 1). This probably reflects the fact that C-169 lives in Antarctic soils, which undergo wide fluctuations in temperature (typically from -50°C to +25°C). Although C-169 inhabits polar ecological niches and can survive extremely low temperatures, its optimal growth temperature is close to 20°C. Thus, both optimal growth temperature and global amino acid composition indicate that C-169 is not fully specialized to grow in a permanently cold environment.
Materials and methods
C-169 was obtained from the Microbial Culture Collection, National Institute for Environmental Studies, Japan under strain #NIES 2166 Coccomyxa sp.
Genome sequencing and assembly
The C-169 genome was sequenced using the whole genome shotgun (WGS) sequencing strategy. The data were assembled using release 2.10.11 of Jazz, a WGS assembler developed at the JGI. After excluding redundant and short scaffolds from the initial assembly, there was 48.8 Mb of ungapped scaffold sequence. The filtered assembly contained 29 scaffolds, with sizes ranging from 0.112 to 4.035 Mb. The sequence depth derived from the assembly was 12.0 ± 0.15. Pulsed-field gel electrophoresis studies for assignment of scaffolds to chromosomes were carried out according to Agarkova et al. In addition, 28,322 validated ESTs were generated from C-169 cells grown to log phase at 25°C in modified Bold's basal medium (MBBM). A detailed description of methods is provided in Supplemental Methods in Additional file 2.
Genome annotation and sequence analysis
The genome assembly of C-169 was annotated using the JGI annotation pipeline, which combines several gene predictors: 1) putative full length genes derived from 7,984 cluster consensus sequences of clustered and assembled C-169 ESTs were mapped to genomic sequence; 2) homology-based gene models were predicted using FGENESH+  and Genewise  seeded by BLASTx alignments against sequences from NCBI non-redundant protein set; 3) the ab initio gene predictor FGENESH was trained on the set of putative full-length genes and reliable homology-based models. Genewise models were completed using scaffold data to find start and stop codons. Additional gene models were predicted using ab initio GeneMark-ES  and combined with the rest of the predictions. ESTs and EST clusters were used to extend, verify, and complete the predicted gene models. Because multiple gene models per locus were often generated, a single representative gene model for each locus was chosen based on homology and EST support and used for further analysis. This led to a filtered set of 9,851 gene models with their characteristics supported by different lines of evidence summarized in Tables S1 and S2 in Additional file 2.
All predicted gene models were annotated using InterProScan and hardware-accelerated double-affine Smith-Waterman alignments against SwissProt and other specialized databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes) and PFAM. Finally, KEGG hits were used to map EC numbers, and InterPro hits were used to map Gene Ontology terms. In addition, predicted proteins were annotated according to KOG classification. All scaffolds, gene models and clusters, and annotations thereof, may be accessed at the JGI Coccomyxa Portal and can also be found in the EMBL/GenBank data libraries under accession number AGSI00000000.
Repeated sequences were identified using two methods. First, de novo identification was performed by aligning the genome against itself using the BLASTN program (E-value < 1e-15). Individual repeat elements were organized into families with the RECON program using default settings. RECON constructed 2,976 repetitive sequence families from 11,044 individual repeat elements or fragments. Second, identification of known repetitive sequences was performed by aligning the prototypic sequences contained in Repbase v12.10 against the genome using TBLASTX. The results of the two methods were combined.
Annotated proteins of nine sequenced chlorophyte algae (C-169, C. variabilis NC64A, C. reinhardtii, V. carteri, M. pusilla CCMP1545, Micromonas sp. RCC299, Ostreococcus sp. RCC809, O. lucimarinus and O. tauri) were organized into 23,507 families based on shared sequence similarity (BLASTP, E-value < 1e-5) using the Tribe-MCL program with default parameters, except for the inflation parameter, which was set to 1.4. Of those, 6,326 families contained at least one Coccomyxa protein, including 1,851 protein families that were found in all 9 species and represent the core protein family set of chlorophyte plants. There were 2,214 protein families containing 2,305 predicted C-169 gene products with no detectable homolog in the other sequenced chlorophytes. Of these, 196 families contained 293 proteins that had significant matches (BLASTP E-value < 1e-5) to other organisms (Table S8 in Additional file 2). Phylogenetic relationships and potential horizontal gene transfer for these 293 proteins were further assessed using the BLAST-EXPLORER program, which combines a BLAST search with a suite of tools that allows interactive, phylogenetic-oriented exploration of the BLAST results.
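The partition of protein families into core, Coccomyxa-containing and Coccomyxa-specific sets described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the clustering step has already produced one (family, species) assignment per protein, and the function name classify_families, the species labels and the toy data are hypothetical.

```python
from collections import defaultdict


def classify_families(assignments, focal, all_species):
    """Partition families by species content.

    assignments: iterable of (family_id, species) pairs, one per protein.
    Returns (core, with_focal, focal_only) as sets of family ids.
    """
    species_by_family = defaultdict(set)
    for family_id, species in assignments:
        species_by_family[family_id].add(species)

    wanted = set(all_species)
    # Core families contain at least one protein from every species.
    core = {f for f, sp in species_by_family.items() if sp >= wanted}
    # Families with at least one protein from the focal species.
    with_focal = {f for f, sp in species_by_family.items() if focal in sp}
    # Families found in the focal species and nowhere else.
    focal_only = {f for f, sp in species_by_family.items() if sp == {focal}}
    return core, with_focal, focal_only


# Toy data with three species instead of nine (hypothetical).
toy_assignments = [
    ("fam1", "C-169"), ("fam1", "NC64A"), ("fam1", "Creinhardtii"),
    ("fam2", "C-169"), ("fam2", "C-169"),
    ("fam3", "NC64A"), ("fam3", "Creinhardtii"),
    ("fam4", "C-169"), ("fam4", "NC64A"),
]
core, with_focal, focal_only = classify_families(
    toy_assignments, focal="C-169",
    all_species=["C-169", "NC64A", "Creinhardtii"],
)
```

In this toy case fam1 is the only core family, and fam2 is the only family restricted to the focal species even though it holds two proteins, which mirrors why the number of specific families (2,214) is smaller than the number of specific gene products (2,305).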
Most phylogenetic analyses were performed through the Phylogeny.fr web platform. The Phylogeny.fr pipeline was set up as follows: homologous sequences were aligned with the MUSCLE program; poorly aligned positions were removed from the multiple alignment using the Gblocks program. The cleaned multiple alignment was then passed on to the PhyML program for phylogenetic reconstruction using the maximum likelihood criterion. Selection of the best fitting substitution model was performed using the ModelTest program for nucleotide sequences and ProtTest for amino acid sequences. PhyML was run with the approximate likelihood ratio test (aLRT), a statistical test of branch support. This test is based on an approximation of the standard likelihood ratio test and is much faster to compute than the usual bootstrap procedure, while branch supports are generally highly correlated between the two methods.
Synteny and colinearity
Pairwise scaffold synteny
We identified 5,232 putative orthologous gene pairs between C-169 and C. variabilis using the reciprocal best BLAST hit criterion. In Figure 2a, the statistical significance of the number of orthologous genes shared between pairs of scaffolds was estimated by comparison with a non-syntenic model using Z-score statistics. This non-syntenic model was constructed from 1,000 randomized datasets in which the 5,232 orthologous gene pairs were reassociated at random. The number of orthologous genes in each scaffold was kept constant across replicates. For each pair of scaffolds, we calculated the mean and standard deviation of the number of shared orthologous genes in the 1,000 random replicates. The Z-score was determined by subtracting the mean number of orthologous genes in the non-syntenic model from the observed number of orthologous genes in the real dataset and then dividing the difference by the standard deviation. A Z-score > 3 indicates that the observed number of orthologous genes is significantly higher than in the non-syntenic model with a P-value < 0.01.
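The randomization procedure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used in this study: each orthologous gene pair is reduced to the pair of scaffolds that carry it, random reassociation is implemented by shuffling the scaffold labels of one species (which keeps the number of orthologs per scaffold constant), and the population standard deviation is used. The function name synteny_z_scores and the toy data are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def synteny_z_scores(pairs, n_replicates=1000, seed=0):
    """pairs: list of (scaffold_a, scaffold_b), one per orthologous gene pair.

    Returns {(scaffold_a, scaffold_b): z_score} for scaffold pairs that
    share at least one ortholog in the real data.
    """
    rng = random.Random(seed)
    observed = Counter(pairs)
    a_labels = [a for a, _ in pairs]
    b_labels = [b for _, b in pairs]

    random_counts = {key: [] for key in observed}
    for _ in range(n_replicates):
        # Shuffling one column re-associates orthologs at random while
        # preserving the number of orthologs on every scaffold.
        rng.shuffle(b_labels)
        shuffled = Counter(zip(a_labels, b_labels))
        for key in observed:
            random_counts[key].append(shuffled[key])

    z_scores = {}
    for key, obs in observed.items():
        mu = mean(random_counts[key])
        sd = pstdev(random_counts[key])
        if sd > 0:
            z_scores[key] = (obs - mu) / sd
        else:
            z_scores[key] = 0.0 if obs == mu else float("inf")
    return z_scores


# Toy data (hypothetical): two strongly syntenic scaffold pairs plus noise.
toy_pairs = ([("A1", "B1")] * 30 + [("A2", "B2")] * 30
             + [("A1", "B2")] * 2 + [("A2", "B1")] * 2)
toy_z = synteny_z_scores(toy_pairs, n_replicates=1000, seed=1)
syntenic = {key for key, z in toy_z.items() if z > 3}
```

Scaffold pairs whose Z-score exceeds 3 are the ones that would be reported as significantly syntenic under the criterion given above.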
where ni,. is the row total of the number of genes on species A scaffold i with an ortholog anywhere in species B's genome, n.,j is the column total of the number of genes on species B scaffold j with an ortholog anywhere in A's genome and n is the total number of orthologous genes mapped between the two species.
For each pair of genomes, the mean and standard deviation of the synteny correlation in a non-syntenic model was calculated from 1,000 randomized datasets in which the orthologous gene pairs were re-associated at random. These parameters were used to assess the significance of the synteny correlation observed in the real data by means of the Z-score statistics.
Conserved adjacent gene pairs
For each pair of genomes, the non-syntenic model was constructed by reshuffling the order of all genes (that is, orthologs and non-orthologs) in one of the two genomes, keeping the number of genes in each scaffold constant across replicates. We used 1,000 randomized datasets to estimate the mean and standard deviation of the number of conserved adjacent gene pairs in the non-syntenic model. Z-score statistics were used to assess the significance of the observed number of conserved adjacent orthologous gene pairs in the real dataset relative to the number expected by chance in the non-syntenic model.
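Counting conserved adjacent gene pairs and building one randomized replicate can be sketched as below. This Python sketch rests on assumptions that the text does not spell out: adjacency is taken to mean immediate neighbours on the same scaffold regardless of orientation, and each genome is represented as a list of scaffolds holding gene identifiers in chromosomal order. All function names and the toy genomes are hypothetical.

```python
import random


def adjacent_pairs(genome):
    """Return the set of unordered neighbouring gene pairs in a genome."""
    pairs = set()
    for scaffold in genome:
        for left, right in zip(scaffold, scaffold[1:]):
            pairs.add(frozenset((left, right)))
    return pairs


def count_conserved(genome_a, genome_b, orthologs):
    """Count adjacent pairs in A whose orthologs are also adjacent in B.

    orthologs: dict mapping gene ids of genome A to gene ids of genome B.
    """
    neighbours_b = adjacent_pairs(genome_b)
    count = 0
    for pair in adjacent_pairs(genome_a):
        mapped = frozenset(orthologs[g] for g in pair if g in orthologs)
        if len(mapped) == 2 and mapped in neighbours_b:
            count += 1
    return count


def shuffle_genome(genome, rng):
    """Reshuffle all genes across scaffolds, keeping scaffold sizes constant."""
    genes = [g for scaffold in genome for g in scaffold]
    rng.shuffle(genes)
    shuffled, start = [], 0
    for scaffold in genome:
        shuffled.append(genes[start:start + len(scaffold)])
        start += len(scaffold)
    return shuffled


# Toy genomes (hypothetical): "x" is a gene without an ortholog.
genome_a = [["a1", "a2", "a3", "a4"]]
genome_b = [["b2", "b1", "x", "b3"], ["b4"]]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}

observed = count_conserved(genome_a, genome_b, orthologs)
replicate = shuffle_genome(genome_b, random.Random(1))
null_count = count_conserved(genome_a, replicate, orthologs)
```

Repeating the shuffle 1,000 times and collecting the resulting counts gives the mean and standard deviation needed for the Z-score described above.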
The work conducted by the DOE JGI is supported by the Office of Science of the US Department of Energy under contract number DE-AC02-05CH11231. This work was partially supported by Marseille-Nice Genopole, the PACA-Bioinfo platform, NSF-EPSCoR grant EPS-1004094 (JLVE), DE-FG36-08GO88055 (JLVE), grant P20-RR15635 from the COBRE program of the National Center for Research Resources (JLVE) and the NIH grant HG00783 (MB).
2. West W: Fresh-water algae, with a supplement of marine diatoms. Proc R Irish Acad. 1911, 31: 16.1-16.62.
3. Acton E: Coccomyxa subellipsoidea, a new member of the Palmellaceae. Ann Bot. 1909, 23: 573-578.
4. Blanc G, Duncan G, Agarkova I, Borodovsky M, Gurnon J, Kuo A, Lindquist E, Lucas S, Pangilinan J, Polle J, Salamov A, Terry A, Yamada T, Dunigan DD, Grigoriev IV, Claverie J-M, Van Etten JL: The Chlorella variabilis NC64A genome reveals adaptation to photosymbiosis, coevolution with viruses, and cryptic sex. Plant Cell. 2010, 22: 2943-2955. 10.1105/tpc.110.076406.
10. Yamamoto Y, Fujimoto Y, Arai R, Fujie M, Usami S, Yamada T: Retrotransposon-mediated restoration of Chlorella telomeres: accumulation of Zepp retrotransposons at termini of newly formed minichromosomes. Nucleic Acids Res. 2003, 31: 4646-4653. 10.1093/nar/gkg490.
13. Derelle E, Ferraz C, Rombauts S, Rouzé P, Worden AZ, Robbens S, Partensky F, Degroeve S, Echeynié S, Cooke R, Saeys Y, Wuyts J, Jabbari K, Bowler C, Panaud O, Piégu B, Ball SG, Ral J-P, Bouget F-Y, Piganeau G, De Baets B, Picard A, Delseny M, Demaille J, Van de Peer Y, Moreau H: Genome analysis of the smallest free-living eukaryote Ostreococcus tauri unveils many unique features. Proc Natl Acad Sci USA. 2006, 103: 11647-11652. 10.1073/pnas.0604795103.
14. Smith DR, Burki F, Yamada T, Grimwood J, Grigoriev IV, Van Etten JL, Keeling PJ: The GC-Rich mitochondrial and plastid genomes of the green alga Coccomyxa give insight into the evolution of organelle DNA nucleotide landscape. PLoS ONE. 2011, 6: e23624. 10.1371/journal.pone.0023624.
21. Moyer CL, Morita RY: Psychrophiles and psychrotrophs. Encyclopedia of Life Sciences. 2007, Chichester: John Wiley & Sons, Ltd, 1-6.
28. Garcia-Ruiz H, Takeda A, Chapman EJ, Sullivan CM, Fahlgren N, Brempelis KJ, Carrington JC: Arabidopsis RNA-dependent RNA polymerases and dicer-like proteins in antiviral defense and small interfering RNA biogenesis during turnip mosaic virus infection. Plant Cell. 2010, 22: 481-496. 10.1105/tpc.109.073056.
36. Médigue C, Krin E, Pascal G, Barbe V, Bernsel A, Bertin PN, Cheung F, Cruveiller S, D'Amico S, Duilio A, Fang G, Feller G, Ho C, Mangenot S, Marino G, Nilsson J, Parrilli E, Rocha EPC, Rouy Z, Sekowska A, Tutino ML, Vallenet D, von Heijne G, Danchin A: Coping with cold: the genome of the versatile marine Antarctica bacterium Pseudoalteromonas haloplanktis TAC125. Genome Res. 2005, 15: 1325-1335. 10.1101/gr.4126905.
46. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene Ontology: tool for the unification of biology. Nat Genet. 2000, 25: 25-29. 10.1038/75556.
47. Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Rogozin IB, Smirnov S, Sorokin AV, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol. 2004, 5: R7. 10.1186/gb-2004-5-2-r7.
48. The JGI Coccomyxa Portal. [http://jgi.doe.gov/Coccomyxa]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.