Background

The green algae represent an ancient lineage of photosynthetic eukaryotes; molecular clock analyses estimate their origin between 700 and 1,500 million years ago [1]. This lineage (Viridiplantae) split very early into two major divisions: the Chlorophyta, containing the majority of the described green algae, and the Streptophyta, containing the charophyte green algae and their land plant descendants. In the last decade, substantial advances have been made in our understanding of the broad-scale relationships among the streptophytes, in particular the land plants ([2] and references therein); however, progress on the chlorophytes has lagged behind.

Early hypotheses on green algal phylogeny were based on morphology and ultrastructural data derived from the flagellar apparatus and processes of mitosis and cell division [3],[4]. These ultrastructural features, which apply to most green algae, supported the existence of the Streptophyta and Chlorophyta and revealed four distinct groups within the Chlorophyta that were recognized as classes: the predominantly marine, unicellular Prasinophyceae; the predominantly marine and morphologically diverse Ulvophyceae; and the freshwater or terrestrial, morphologically diverse Trebouxiophyceae (=Pleurastrophyceae) and Chlorophyceae [5],[6]. It was hypothesized that the Prasinophyceae gave rise to the Ulvophyceae, Trebouxiophyceae and Chlorophyceae (UTC). Later, phylogenetic analyses based on the nuclear-encoded small subunit rRNA gene (18S rDNA) largely corroborated these hypotheses [1],[5],[7]. It was found, however, that the Prasinophyceae are paraphyletic, with the nine main lineages of prasinophytes identified so far representing the earliest branches of the Chlorophyta [8]. For the Ulvophyceae and Trebouxiophyceae, the limited resolution of 18S rDNA trees made it impossible to assess the monophyly of these classes [1],[6],[7]. Analyses of 18S rDNA data uncovered a myriad of lineages within each of the three UTC classes, but could not resolve their precise branching order. Despite these uncertainties, many taxonomic revisions have been implemented: new species not distinguished by light microscopy were described, new genera were erected, the circumscription of several main lineages was modified, and existing orders were elevated to the class level (e.g. Chlorodendrophyceae and Pedinophyceae). A recurrent theme that emerged from such studies is the finding that multiple genera containing taxa with reduced morphologies (such as unicells and filaments) are polyphyletic, with members often distributed across more than one class (e.g. Chlorella [9],[10]).

For ancient groups of eukaryotes such as the green algae, a large number of genes from many species need to be analyzed using reliable models of sequence evolution to resolve relationships at higher taxonomic levels [11]. Multi-gene data sets can be assembled by concatenating the sequences of protein-coding genes that are shared by the chloroplast or nuclear genomes. The chloroplast phylogenomic studies reported so far for green algae have provided valuable insights into the phylogeny of prasinophytes [12],[13], streptophytes [14]-[18] and the Chlorophyceae [19],[20], but only limited information is currently available regarding the relationships within the Trebouxiophyceae. For the Ulvophyceae, an analysis of ten concatenated gene sequences from both the nuclear and chloroplast genomes enabled Cocquyt et al. [21] to resolve the branching pattern of the main lineages of this class. In this context, it is worth mentioning that datasets of concatenated nuclear and chloroplast genes have also proved very useful to reconstruct phylogenetic relationships within specific green algal orders [22].

The present investigation is centered on the Trebouxiophyceae as delineated by Friedl [23]. This species-rich class displays remarkable variation in both morphology (comprising unicells, colonies, filaments and blades) and ecology (occurring in diverse terrestrial and aquatic environments) [1],[5],[7]. No flagellate vegetative form has been identified in this class. Several species (e.g. Trebouxia, Myrmecia and Prasiola) participate in symbioses with fungi to form lichens [24],[25] and others (e.g. Chlorella, Coccomyxa, and Elliptochloris) occur as photosynthetic symbionts in ciliates, metazoa and plants [26]. The Trebouxiophyceae also comprises species that have lost photosynthetic capacity and have evolved free-living or parasitic heterotrophic lifestyles (e.g. Prototheca and Helicosporidium) [27]-[29]. Aside from their intrinsic biological interest, trebouxiophycean algae have drawn the attention of the scientific community because of their potential utility in a variety of biotechnological applications such as the production of biofuels or other molecules of high economic value [30],[31].

Phylogenies based on 18S rDNA data have identified multiple lineages within the Trebouxiophyceae, and these include the Chlorellales, Trebouxiales, Microthamniales, and the Prasiola, Choricystis/Botryococcus, Watanabea, Oocystis and Geminella clades [32]-[39]. While the majority of the observed monophyletic groups are composed of several genera, a number of lineages consist of a single species or genus (e.g. Xylochloris, Leptosira, Lobosphaera). The interrelationships between most of the trebouxiophycean lineages are still unresolved. Interestingly, taxa with highly different morphologies (e.g. the minute unicellular Stichococcus and the macroscopic filamentous or blade-shaped Prasiola) have been recovered in the same clade, demonstrating that vegetative morphology can evolve relatively rapidly. Polyphyly has been reported not only in morphologically simple genera [5],[7],[40], but also in those with colonial forms [36],[41].

In this study, we have sought to decipher the relationships among the main trebouxiophycean lineages and to evaluate the monophyly of the Trebouxiophyceae. Toward these goals, we have analyzed data sets of 79 chloroplast DNA (cpDNA)-encoded proteins and genes spanning the broad range of diversity of the Trebouxiophyceae. Twenty-nine chlorophyte chloroplast genomes were newly sequenced to generate these data sets. The trees we inferred using the maximum likelihood (ML) and Bayesian inference methods enabled us not only to clarify the internal structure of the Trebouxiophyceae but also to gain insights into their ancestral status with regard to the type of environment they first colonized and their subsequent adaptations to different ecosystems.

Results

In the course of this study, we generated the chloroplast genome sequences of 27 trebouxiophycean taxa, thus bringing to 35 the total number of trebouxiophyceans sampled in our phylogenetic analyses (Table 1). These taxa represent the variety of trebouxiophycean lineages that had been recognized prior to January 2013; at least two representatives were examined for each of the lineages that include multiple genera. The chloroplast genome sequences of two flagellates belonging to the Pedinophyceae (Pedinomonas tuberculata and Marsupiomonas sp. NIES 1824) were also determined because Pedinomonas minor, the previously sampled taxon from this group, had been found to be related to the Chlorellales and a member of the Oocystis lineage in an earlier phylogenomic study [42]. Only the results of our phylogenetic analyses are presented here; in a separate article, we will report the salient features of the newly sequenced chloroplast genomes and discuss how these structural data advance understanding of chloroplast genome evolution in the Chlorophyta.

Table 1 Pedinophycean and trebouxiophycean taxa used in the chloroplast phylogenomic analyses

All data sets analyzed in our study were assembled from 79 cpDNA-encoded proteins and taxon sampling included up to 63 green algal taxa, i.e. the 38 trebouxiophyceans and pedinophyceans listed in Table 1, 23 additional chlorophytes (12 prasinophytes, nine chlorophyceans, and two ulvophyceans) and two streptophyte algae (Mesostigma viride and Chlorokybus atmophyticus). We favored the use of amino acid rather than nucleotide sequences in our phylogenomic study because, in analyses of ancient divergences, amino acid data sets are less prone than nucleotide data sets to saturation problems, convergent compositional biases and convergent codon-usage biases [49]-[51]. We initiated our phylogenomic study by analyzing the amino acid data set comprising all 63 taxa (15,549 sites). Note that some of the genes coding for the proteins analyzed are missing from a number of taxa, in particular from prasinophytes and chlorophyceans (see Figure 1); however, the proportion of missing data in the analyzed data sets does not exceed 6%.
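The assembly of a concatenated data set with missing genes, and the per-taxon proportion of missing data shown in the Figure 1 histograms, can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, the toy alignments, and the choice to count "?" and "-" as missing data are all assumptions made for illustration.

```python
from typing import Dict, List


def concatenate(gene_alignments: List[Dict[str, str]],
                taxa: List[str],
                missing: str = "?") -> Dict[str, str]:
    """Concatenate per-gene alignments into one supermatrix.

    Each alignment maps taxon name -> aligned sequence. A taxon that lacks
    a gene receives a run of `missing` characters of that gene's width.
    """
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        widths = {len(seq) for seq in alignment.values()}
        if len(widths) != 1:
            raise ValueError("sequences within a gene alignment must have equal length")
        width = widths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, missing * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def missing_fraction(sequence: str, missing_chars: str = "?-") -> float:
    """Fraction of positions treated as missing (assumption: '?' and '-')."""
    if not sequence:
        return 0.0
    return sum(ch in missing_chars for ch in sequence) / len(sequence)


# Hypothetical toy data: two "genes", the second absent from taxon B.
genes = [
    {"A": "MKV", "B": "MRV", "C": "MKI"},
    {"A": "LL", "C": "LI"},
]
matrix = concatenate(genes, ["A", "B", "C"])
for taxon, seq in matrix.items():
    print(taxon, seq, round(missing_fraction(seq), 2))
```

In this toy case taxon B lacks the second gene, so two of its five positions are filled with "?".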

Figure 1

Phylogeny of 61 chlorophytes inferred using a data set of 15,549 positions assembled from 79 cpDNA-encoded proteins. The tree presented here is the best-scoring ML tree inferred under the GTR + Γ4 model. Support values are reported on the nodes: from top to bottom, or from left to right, are shown the posterior probability (PP) values for the PhyloBayes CATGTR + Γ4 analyses and the bootstrap support (BS) values for the RAxML GTR + Γ4, LG4X and gcpREV + Γ4 analyses. Black dots indicate that the corresponding branches received BS and PP values of 100% in all four analyses. Shaded areas identify the clades that are well supported in 18S rDNA phylogenies. The histograms on the left indicate the proportion of missing data for each taxon. The scale bar denotes the estimated number of amino acid substitutions per site.

Even though amino acid phylogenies are more robust to compositional effects than nucleotide phylogenies, they may still suffer from a general mutational pressure acting at the nucleotide level [52],[53]. For this reason, we also inferred trees from nucleotide data sets corresponding to the 63-taxon amino acid data set and examined whether they are congruent with those derived from amino acid data sets.

Analysis of the amino acid data sets

The amino acid data set comprising all 63 taxa was analyzed with PhyloBayes using the site-heterogeneous CATGTR + Γ4 model and also with RAxML using the site-homogeneous GTR + Γ4 and gcpREV + Γ4 models as well as the LG4X model (Figure 1). gcpREV is an empirical amino acid substitution model that has been recently developed for use with green plant chloroplast protein data [54]; it proved to be the best-scoring empirical model among those we tested using RAxML (cpREV, JTT, gcpREV, LG, WAG, and their + F alternatives). LG4X is a mixture model based on four substitution matrices [55]. The fits of the gcpREV + Γ4, GTR + Γ4 and CATGTR + Γ4 models to the 63-taxon data set were assessed using cross-validation (Table 2). CATGTR + Γ4 was found to be the best-fitting model; this finding was expected considering that site-heterogeneous models are known to provide a better fit than site-homogeneous models and to minimize the impact of systematic errors arising from the difficulty of detecting and interpreting multiple substitutions [56]-[59]. Because the GTR + Γ4 model was also found to fit better than the gcpREV + Γ4 model (Table 2), it appears that the 63-taxon data set is sufficiently large to estimate a GTR amino acid substitution matrix that models our data more accurately than the empirical gcpREV matrix.

Table 2 Comparison of evolutionary models using cross validation and the chloroplast data set of 15,549 positions

The majority-rule consensus trees inferred from the 63-taxon amino acid data set using ML and Bayesian inference methods displayed essentially the same topology (Figure 1). As expected, the prasinophyte lineages represent the first branches and their divergence order is identical to that reported for a recent phylogenomic tree with the same sampling of prasinophyte taxa [12]. The trebouxiophyceans are recovered as a non-monophyletic assemblage. The monophyletic group formed by the six members of the Chlorellales is sister to the Pedinophyceae and the Chlorellales + Pedinophyceae clade is sister to all other UTC algae. The rest of the trebouxiophyceans, designated hereafter as core trebouxiophyceans, form a strongly supported clade that shares a sister relationship with the Ulvophyceae + Chlorophyceae clade. The deep node of the trees coinciding with the common ancestor of the UTC and pedinophycean algae received maximal support in all analyses, but the following node corresponding to the divergence of the core trebouxiophyceans from the Chlorellales + Pedinophyceae received lower support, especially in the ML analyses as indicated by the BS values of 73, 57 and 45%.

The 29 taxa within the core trebouxiophyceans are resolved as a grade of several strongly supported lineages. Three monophyletic groups containing multiple genera can be distinguished (i.e. clades A, B and C). Clade A, which consists of Koliella corcontica and members of the previously recognized Geminella and Oocystis clades, represents the earliest-diverging lineage of the core trebouxiophyceans. Clade B includes Neocystis brevis and representatives of the highly diversified Prasiola clade. Clade C, the largest of the three identified monophyletic groups, consists of 15 taxa belonging to the Xylochloris, Microthamniales, Trebouxiales, Lobosphaera, Watanabea, Choricystis and Elliptochloris clades. Clades A and B as well as clades B and C are separated from one another by a lineage consisting of a single taxon, i.e. the Pleurastrosarcina brevispinosa and the Parietochloris pseudoalveolaris lineage, respectively.

Considering that heterogeneity in amino acid composition may violate the stationarity assumption made by the evolutionary models in the analyses presented above, we explored whether the inferred relationships were affected by composition-related artifacts. As a first approach, we examined the amino acid composition of the data set by plotting the first two components of a correspondence analysis of the 20 amino acid frequencies (Figure 2) but identified no large deviation in composition of the chloroplast proteins among the taxa examined. We also used the Dayhoff recoding strategy, which recodes the 20 amino acids into six groups on the basis of their physical and chemical properties. We found that the tree inferred from the Dayhoff-recoded data set under the CATGTR + Γ4 model exhibits the same topology as that obtained using standard 20-state models, except that the Chlorellales are not affiliated with the Pedinophyceae (data not shown). In this Bayesian analysis, which showed convergence problems (maxdiff = 1), the position of the Chlorellales relative to the core trebouxiophyceans is unresolved, whereas the Pedinophyceae is sister to the UTC clade (PP = 0.79). These observations together with the finding that the Chlorellales and Pedinophyceae are grouped in the correspondence analysis (Figure 2) suggest a possible compositional attraction between these two groups.
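To make the recoding step concrete, the following Python sketch maps the 20 amino acids onto six categories. The text does not list the groups used; the sketch assumes the commonly cited Dayhoff six-state grouping (AGPST, DENQ, HKR, ILMV, FWY, C). The digit labels and the function name are arbitrary choices for illustration, not the recoding implementation used in the study.

```python
# Assumed grouping: the commonly cited Dayhoff six-state categories.
DAYHOFF_GROUPS = {
    "AGPST": "1",  # small residues
    "DENQ": "2",   # acidic residues and their amides
    "HKR": "3",    # basic residues
    "ILMV": "4",   # aliphatic hydrophobic residues
    "FWY": "5",    # aromatic residues
    "C": "6",      # cysteine
}

# Flatten to a per-residue lookup table (20 entries).
RECODE = {aa: code for group, code in DAYHOFF_GROUPS.items() for aa in group}


def dayhoff_recode(sequence: str, unknown: str = "-") -> str:
    """Recode an amino acid sequence into six states.

    Gaps and ambiguous symbols (anything outside the 20 standard
    residues) are written as `unknown`.
    """
    return "".join(RECODE.get(residue, unknown) for residue in sequence.upper())


print(dayhoff_recode("MKC"))  # → 436
```

Because residues within a group are collapsed into one state, substitutions between them become invisible, which reduces the influence of compositional bias at the cost of some phylogenetic signal.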

Figure 2

Correspondence analysis of amino acid usage in the data set of 15,549 positions. The members of the Chlorellales and Pedinophyceae are found within the circled area.

Given the possibility that the affiliation between the Chlorellales and Pedinophyceae is caused by systematic errors of tree reconstruction, we tested whether removal of the three members of the Pedinophyceae affects the position of the Chlorellales. As shown in Figure 3A, the RAxML tree inferred under the GTR + Γ4 model still identifies the Chlorellales as sister to the Chlorophyceae + Ulvophyceae + core trebouxiophyceans (BS = 89%). To determine whether the two other possible positions occupied by the Chlorellales (topologies T2 and T3 in Figure 3B) can be dismissed with statistical confidence, we carried out the approximately unbiased (AU) test of phylogenetic tree selection [60]. Both topologies were found to be significantly different (P < 0.05) from the best tree (T1) and were thus rejected by the AU test (Figure 3B).

Figure 3

Influence of the Pedinophyceae on the placement of the Chlorellales. (A) Phylogeny of chlorophytes inferred under the GTR + Γ4 model using the amino acid data set of 15,549 positions after exclusion of the Pedinophyceae. The best-scoring RAxML tree is presented and support values are reported on the nodes, with black dots indicating 100% BS values. The scale bar denotes the estimated number of amino acid substitutions per site. (B) Confidence assessment of the three possible topologies (T1, T2 and T3) for the placement of the Chlorellales under the GTR + Γ4 model. The ΔlnL value indicates the difference in log likelihood relative to the best tree (T1). Local bootstrap probabilities were estimated by resampling of the estimated log-likelihood (RELL). pAU, P value for the approximately unbiased (AU) test [60] as implemented in CONSEL 0.20 [61].
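The RELL procedure mentioned in the legend can be sketched in a few lines of Python. The sketch assumes that per-site log-likelihoods are already available for each candidate topology; it resamples sites with replacement and counts how often each topology attains the highest summed log-likelihood. It is not the CONSEL implementation and does not perform the AU test, which additionally varies the resampled sequence length. The function name, replicate count and toy values are hypothetical.

```python
import random
from typing import Dict, List


def rell_bootstrap(site_lnl: Dict[str, List[float]],
                   replicates: int = 1000,
                   seed: int = 0) -> Dict[str, float]:
    """Estimate RELL bootstrap proportions for competing topologies.

    `site_lnl` maps a topology name to its per-site log-likelihoods;
    all lists must cover the same alignment sites in the same order.
    """
    names = list(site_lnl)
    n_sites = len(site_lnl[names[0]])
    if any(len(values) != n_sites for values in site_lnl.values()):
        raise ValueError("all topologies need the same number of sites")

    rng = random.Random(seed)
    wins = {name: 0 for name in names}
    for _ in range(replicates):
        # Resample site indices with replacement; no re-optimization is done.
        sample = [rng.randrange(n_sites) for _ in range(n_sites)]
        totals = {name: sum(site_lnl[name][i] for i in sample) for name in names}
        best = max(names, key=lambda name: totals[name])
        wins[best] += 1
    return {name: wins[name] / replicates for name in names}


# Hypothetical per-site log-likelihoods for two topologies.
toy = {"T1": [-1.0] * 50, "T2": [-2.0] * 50}
print(rell_bootstrap(toy, replicates=200, seed=1))
```

Since the per-site values are reused rather than re-estimated for each replicate, the procedure is far cheaper than a full bootstrap of tree inference.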

Analysis of the nucleotide data sets

We analyzed two nucleotide data sets corresponding to the 63-taxon amino acid data set, both of which were designed to minimize deleterious effects of rapid sequence evolution and/or heterogeneous composition. The degen1 data set comprises all three codon positions (46,404 sites) that were degenerated using the Degen1.pl script [62], whereas the nt1 + 2 data set contains only the first and second codon positions (30,936 sites). The RAxML trees inferred from these data sets under the GTR + Γ4 model display essentially the same trebouxiophycean relationships as in the 63-taxon amino acid tree (Figure 4), except that the Marvania clade is sister to the Chlorella + Parachlorella clade (BS = 60 and 76%) and that Parietochloris pseudoalveolaris is recovered as sister to the Prasiola clade (BS = 53 and 43%). As observed for the amino acid phylogenies, the Chlorellales remained sister to the Chlorophyceae + Ulvophyceae + core trebouxiophyceans when the three algae belonging to the Pedinophyceae were excluded from the sampled taxa (data not shown).
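As an illustration of how an nt1 + 2 data set relates to the full codon alignment, the Python sketch below keeps the first and second position of every codon in an in-frame sequence. The function name is hypothetical and the sketch assumes gap-free or codon-aligned input; it does not reproduce the degenerate coding performed by the Degen1.pl script.

```python
def first_second_positions(cds: str) -> str:
    """Return only codon positions 1 and 2 of an in-frame coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    # Take the first two nucleotides of each codon and drop the third.
    return "".join(cds[i:i + 2] for i in range(0, len(cds), 3))


print(first_second_positions("ATGGCCTAA"))  # → ATGCTA

# Site counts reported above: two thirds of the 46,404 codon-aligned sites.
print(46404 // 3 * 2)  # → 30936
```

Dropping third positions removes the sites that are usually the fastest evolving and the most compositionally biased, which is the rationale given above for analyzing this data set.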

Figure 4

Phylogeny of 61 chlorophytes inferred using nucleotide data sets assembled from 79 cpDNA-encoded genes. The tree presented here is the best-scoring ML tree inferred using the degen1 data set under the GTR + Γ4 model. Support values are reported on the nodes: from top to bottom, or from left to right, are shown the BS values for the analyses of the degen1 and nt1 + 2 data sets. Black dots indicate that the corresponding branches received BS values of 100% in the two analyses. Shaded areas identify the trebouxiophycean lineages uncovered in this study. Open and filled squares denote aquatic and terrestrial/aeroterrestrial habitats, respectively; an open square containing a star indicates that the taxon is a symbiont. The scale bar denotes the estimated number of nucleotide substitutions per site.

Discussion

Identifying the relationships among the main lineages of the Trebouxiophyceae is crucial for understanding the evolutionary history of this morphologically and ecologically diversified class of chlorophytes. For the first time, a robust phylogeny of trebouxiophyceans with sampling of most of the lineages recognized on the basis of 18S rDNA data is inferred using a phylogenomic approach. Our study reveals that the class Trebouxiophyceae sensu stricto [23] is not a monophyletic group. In the chloroplast phylogenies we inferred from both amino acid and nucleotide data sets, the Chlorellales and a core group containing all other 29 trebouxiophyceans constitute two distinct, strongly supported monophyletic groups that emerge before the Chlorophyceae and Ulvophyceae (Figures 1 and 4). Prior to our investigation, a number of multi-gene trees with sparse sampling of trebouxiophyceans had recovered with little support the Trebouxiophyceae as nonmonophyletic [2],[42],[63]-[67], thus casting doubt on the monophyletic status of this class.

To our knowledge, no morphological features can be invoked to support or refute the phylogenetic relationship we observed between the Chlorellales and the core trebouxiophyceans. Mattox and Stewart [3] defined the class Pleurastrophyceae (=Trebouxiophyceae) based on the ultrastructure of the flagellar apparatus (counterclockwise orientation of basal bodies) and features related to cytokinesis and mitosis (phycoplast-mediated cytokinesis and mitosis with a non-persistent telophase spindle). Because all members of the Chlorellales lack motile stages and divide by autosporulation, the ultrastructural characters used by Mattox and Stewart are not available for this algal group, thus precluding an evaluation of the monophyletic status of the Trebouxiophyceae sensu stricto [23].

The phylogenetic relationships inferred in this study provide insights into the type of ecosystems colonized by the core trebouxiophyceans in their early evolutionary history (Figure 4). Considering that, like most of the chlorellaleans, the earliest-diverging core trebouxiophyceans (i.e. the Oocystis and Geminella clades) are predominantly planktonic species and that the core trebouxiophyceans occupying more derived lineages are mostly terrestrial algae, it appears that the first core trebouxiophyceans lived in aquatic ecosystems and that very early during evolution they evolved strategies to avoid desiccation [68] and conquered the land. This early transition from aquatic to terrestrial environments likely occurred just after the emergence of the Oocystis/Geminella clade. In this context, it is worth mentioning that a subaerial lifestyle has been inferred for the last common ancestor of the early-diverging clade Prasiola, which comprises terrestrial as well as aquatic species [69]. Therefore, the early evolution of desiccation tolerance likely accounts for the success of the core trebouxiophyceans in terrestrial/aeroterrestrial environments, and once this trait was acquired, reversals to aquatic habitats probably involved only minor molecular changes, explaining why transitions from terrestrial to aquatic habitats were frequent during the evolution of core trebouxiophyceans.

The main lineages of the core trebouxiophyceans

The core trebouxiophyceans form a grade of lineages, with several containing two or more genera and some containing a single known genus or taxon. Although the short internal branches separating the major clades of core trebouxiophyceans suggest that lineage diversification occurred rapidly, it is remarkable that only the placement of the single-taxon lineage occupied by the terrestrial alga Parietochloris pseudoalveolaris is supported by modest BS values in both the amino acid and nucleotide analyses (Figures 1 and 4). We highlight below the main evolutionary relationships uncovered for the core trebouxiophyceans in our chloroplast phylogenomic study.

The strongly supported assemblage formed by the Oocystis and Geminella clades represents the deepest branching trebouxiophycean lineage in both the protein- and DNA-based phylogenies (Figures 1 and 4). The placement of the Oocystis clade within the core trebouxiophyceans contrasts sharply with the sister relationship of the Oocystaceae and Chlorellales observed in a number of 18S rDNA studies [32],[37]-[39],[70]. With regard to the Geminella clade, we found that the “Koliella” corcontica taxon is robustly allied with this clade and thus should be considered to be a bona fide member; this association was previously observed in a phylogeny inferred from 18S rDNA, albeit with no support [37].

The sarcinoid green alga Pleurastrosarcina brevispinosa, for which no 18S rDNA sequence is currently available in public databases, occupies the next branch after the Oocystis/Geminella lineages. This desert crust alga, originally designated as Chlorosarcina brevispinosa, was assigned to the genus Pleurastrosarcina by Sluiman and Blommers [48]. The phylogenies reported here confirm that this taxon belongs to the Trebouxiophyceae and indicate that it represents a novel lineage of this class. In a very recent study, Fučíková et al. [71] reported that most major trebouxiophycean lineages contain desert-dwelling taxa and presented evidence for three new lineages of free-living trebouxiophyceans found in North American desert soil crusts. While the Desertella lineage is nested within the Watanabea clade, the Eremochloris and Xerochlorella lineages represent independent clades of the Trebouxiophyceae. In future studies, it will be interesting to investigate whether the sarcinoid Pleurastrosarcina brevispinosa belongs to one of the latter lineages. Another lineage that should be examined for a possible affinity with Pleurastrosarcina is the Leptochlorella clade, which was recently discovered by Neustupa et al. [38] and further delineated by Fučíková et al. [71].

The branching order observed for the representatives of the Prasiola clade is mostly congruent with 18S rDNA phylogenies [33]-[35],[39], and in agreement with the studies of Krienitz et al. [72] and Gaysina et al. [70], the crescent-shaped green alga Neocystis brevis is recovered as sister to this clade. Given that this affiliation is supported with maximal BS values in all analyses, the Neocystis lineage clearly represents a basal branch of the Prasiola clade. Chlorella mirabilis shares a sister relationship with the Pabia + Koliella clade in all our analyses (Figures 1 and 4); in contrast, 18S rDNA trees frequently identify C. mirabilis as sister to all other lineages of the Prasiola clade [32]-[35],[39].

The coccoid soil alga Parietochloris pseudoalveolaris forms an independent lineage between the Prasiola clade and the monophyletic group uniting the Microthamniales and the Xylochloris clade in the amino acid-based phylogeny (Figure 1). Parietochloris is allied with the Microthamniales in a number of published 18S rDNA trees [32]-[34],[37],[38],[73], but this alliance is weakly supported. The Xylochloris clade is a newly identified assemblage of two lineages for which no sister groups were previously identified; it consists of the coccoid subaerial alga Xylochloris irregularis and the filamentous soil alga Leptosira terrestris. The recent discovery of a coccoid soil alga (Chloropyrula uraliensis) belonging to a lineage related to the genus Leptosira suggests that the Xylochloris clade likely represents a diversified group of trebouxiophyceans [70].

The five remaining clades of core trebouxiophyceans consist of the Trebouxiales and the Lobosphaera, Watanabea, Choricystis and Elliptochloris clades. All these clades, except the Lobosphaera lineage, include algae that occur as symbionts; the Trebouxiales, in particular, are the most common photobionts in lichens. The branching order reported here for the five clades of core trebouxiophyceans was not observed in 18S rDNA trees, even though these clades were often found as neighboring lineages. Only the most recent divergence of core trebouxiophycean lineages we identified (i.e. the Choricystis/Elliptochloris + Watanabea assemblage) was also recovered in 18S rDNA studies [32],[72], but with no support. In contrast to 18S rDNA trees where the Trebouxiales and the Lobosphaera clade display an unsupported sister relationship [32],[33],[38],[72], the Lobosphaera clade consistently emerges with strong support as an independent lineage after the Trebouxiales in all chloroplast trees.

The Chlorellales and their relationship with other core chlorophytes

Three distinct clades of Chlorellales were recovered in this study: the Parachlorella, Chlorella and Marvania clades (Figure 4). As observed by Somogyi et al. [74] in 18S rDNA trees (albeit with no support), we found that the Parachlorella clade is sister to the other two lineages in most amino acid-based trees; however, this position is occupied by the Marvania clade in the phylogenies inferred from nucleotide data. A recent 18S rDNA study [75] recovered Pseudochloris wilhelmii and the Parachlorella and Chlorella clades as part of a large assemblage that is sister to Marvania, a topology that contrasts with the finding that Marvania and Pseudochloris are sister taxa in all our analyses.

The results presented here reveal an affinity between the Chlorellales and the Pedinophyceae, although support is weak in the Bayesian analysis under the CATGTR + Γ4 model (PP = 0.84, Figure 1). This finding is consistent with previous chloroplast phylogenomic studies with scarce sampling of trebouxiophyceans, wherein the freshwater flagellate Pedinomonas minor was found to be sister to the clade formed by members of the Chlorellales [42],[66]. But subsequently, Marin [76] identified no association between the Pedinophyceae and the Chlorellales using nuclear and chloroplast rRNA operon data sets, the Pedinophyceae being placed as an independent lineage that is sister to the Chlorodendrophyceae + UTC. Note that the clade formed by the Chlorellales and other trebouxiophyceans was not supported with high confidence in these rRNA operon trees and that the branching order of most trebouxiophycean lineages was unresolved.

Given the conflicting positions of the Chlorellales and Pedinophyceae in the aforementioned analyses, the weak PP support that the Chlorellales + Pedinophyceae clade received in the PhyloBayes analyses of the amino acid data set and the basal position occupied by the Pedinophyceae in trees inferred from the Dayhoff-recoded data set, we conclude that the question as to whether the Chlorellales and Pedinophyceae form a monophyletic group remains unsettled. It is possible that the Chlorellales + Pedinophyceae affiliation is the result of systematic errors of phylogenetic reconstructions. Solving this issue will require sampling of the Chlorodendrophyceae and the inclusion of additional taxa from the Ulvophyceae and the lineage represented by the prasinophyte CCMP 1205. The two ulvophycean taxa used in our study represent distinct basal lineages of the Ulvophyceae (Oltmannsiellopsidales and Ulvales/Ulotrichales); taxa from the BCDT (Bryopsidales, Cladophorales, Dasycladales, and Trentepohliales) and Ignatius clades will need to be examined for a more representative sampling of ulvophycean diversity [21],[65]. We expect that resolving the ancient and rapid radiations of the core chlorophyte lineages (Pedinophyceae, Chlorodendrophyceae and UTC lineages) using a chloroplast phylogenomic approach will be challenging and will require optimized models of sequence evolution.

Conclusions

The phylogeny reported in this study forms a solid basis for future studies aimed at advancing knowledge about the nature of the morphological and ecological diversification of the Trebouxiophyceae. It provides important insights into the origins and adaptations of terrestrial and symbiotic lifestyles. Members of this group clearly occupy a pivotal position in the Viridiplantae and display considerable genetic diversity. A fundamental understanding of the molecular mechanisms underlying their adaptations to changing environments will require the analysis of genomes from key trebouxiophycean taxa.

Methods

Strains and culture conditions

The 29 green algal strains that were selected for chloroplast genome sequencing are listed in Table 1 (their accession numbers are marked with an asterisk). All strains were grown in K [77] or C [78] medium at 18°C under alternating 12 h-light/12 h-dark periods.

Genome sequencing, assembly and annotation

As indicated in Table 1, three methods were used to determine the sequences of the 29 green algal chloroplast genomes. Nine of these genomes were sequenced using the Sanger method, 12 using the 454 pyrosequencing method, and the remaining eight using the Illumina method. Sanger sequencing was carried out from random clone libraries of A + T-rich DNA fractions as described [79]. Chloroplast genome sequences were assembled using Sequencher 5.1 (Gene Codes Corporation, Ann Arbor, MI) and genomic regions not represented in the assemblies were sequenced from polymerase chain reaction (PCR)-amplified fragments using primers specific to the flanking contigs.

For 454 sequencing, shotgun libraries of A + T-rich DNA fractions (700-bp fragments) were constructed using the GS-FLX Titanium Rapid Library Preparation Kit of Roche 454 Life Sciences (Branford, CT, USA). Library construction and 454 GS-FLX DNA Titanium pyrosequencing were carried out by the “Plateforme d’Analyses Génomiques de l’Université Laval” [80]. Reads were assembled using Newbler v2.5 [81] with default parameters, and contigs were visualized, linked and edited using the CONSED 22 package [82]. Contigs of chloroplast origin were identified by BLAST searches against a local database of organelle genomes. Regions spanning gaps in the chloroplast assemblies were amplified by PCR with primers specific to the flanking sequences. Purified PCR products were sequenced using Sanger chemistry with the PRISM BigDye Terminator Ready Reaction Cycle Sequencing Kit (Applied Biosystems, Foster City, CA, USA).

For Illumina sequencing, total cellular DNA was isolated using the EZNA HP Plant Mini Kit of Omega Bio-Tek (Norcross, GA, USA). Libraries of 700-bp fragments were constructed using the TruSeq DNA Sample Prep Kit (Illumina, San Diego, CA, USA) and paired-end reads were generated on the Illumina HiSeq 2000 (100-bp reads) or the MiSeq (300-bp reads) sequencing platforms by the Innovation Centre of McGill University and Genome Quebec [83] and the “Plateforme d’Analyses Génomiques de l’Université Laval” [80], respectively. Reads were assembled using Ray 2.3.1 [84] and contigs were visualized, linked and edited using the CONSED 22 package [82]. Identification of chloroplast contigs and gap filling were performed as described above for 454 sequence assemblies.

Genes and ORFs were identified on the final assemblies using a custom-built suite of bioinformatics tools [85]. Genes coding for rRNAs and tRNAs were localized using RNAmmer [86] and tRNAscan-SE [87], respectively. Intron boundaries were determined by modeling intron secondary structures [88],[89] and by comparing intron-containing genes with intronless homologs.

Phylogenomic analyses of amino acid data sets

The chloroplast genomes of 63 green algal taxa were used in the phylogenomic analyses. The GenBank accession numbers of the pedinophycean and trebouxiophycean genomes are presented in Table 1; those of the remaining taxa are as follows: Mesostigma viride, [GenBank:NC_002186]; Chlorokybus atmophyticus, [GenBank:NC_008822]; Prasinococcus sp. CCMP 1194, [GenBank:KJ746597]; Prasinoderma coloniale CCMP 1220, [GenBank:KJ746598]; Prasinophyceae sp. MBIC 106222, [GenBank:KJ746602]; Pyramimonas parkeae, [GenBank:NC_012099]; Monomastix sp. OKE-1, [GenBank:NC_012101]; Ostreococcus tauri, [GenBank:NC_008289]; Micromonas sp. RCC 299, [GenBank:NC_012575]; Nephroselmis olivacea, [GenBank:NC_000927]; Nephroselmis astigmatica, [GenBank:KJ746600]; Pycnococcus provasolii, [GenBank:NC_012097]; Picocystis salinarum, [GenBank:KJ746599]; Prasinophyceae sp. CCMP 1205, [GenBank:KJ746601]; Oltmannsiellopsis viridis, [GenBank:NC_008099]; Pseudendoclonium akinetum, [GenBank:NC_008114]; Oedogonium cardiacum, [GenBank:NC_011031]; Floydiella terrestris, [GenBank:NC_014346]; Stigeoclonium helveticum, [GenBank:NC_008372]; Schizomeris leibleinii, [GenBank:NC_015645]; Scenedesmus obliquus, [GenBank:NC_008101]; Chlamydomonas moewusii, [GenBank:EF587443-EF587503]; Dunaliella salina, [GenBank:NC_016732]; Volvox carteri f. nagariensis, [GenBank:GU084820]; and Chlamydomonas reinhardtii, [GenBank:NC_005353].

A total of 79 protein-coding genes were used to construct the data sets: accD, atpA, B, E, F, H, I, ccsA, cemA, chlB, I, L, N, clpP, cysA, T, ftsH, infA, minD, petA, B, D, G, L, psaA, B, C, I, J, M, psbA, B, C, D, E, F, H, I, J, K, L, M, N, T, Z, rbcL, rpl2, 5, 12, 14, 16, 19, 20, 23, 32, 36, rpoA, B, C1, C2, rps2, 3, 4, 7, 8, 9, 11, 12, 14, 18, 19, tufA, ycf1, 3, 4, 12, 20, 47, 62. Amino acid data sets were prepared as follows: the deduced amino acid sequences from the 79 individual genes were aligned using MUSCLE 3.7 [90], the ambiguously aligned regions in each alignment were removed using TRIMAL 1.3 [91] with the options block = 6, gt = 0.7, st = 0.005 and sw = 3, and the protein alignments were concatenated using Phyutility 2.2.6 [92].
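The concatenation step can be illustrated with a minimal Python sketch. It is not the Phyutility implementation; the function name `concatenate_alignments`, the input layout and the choice to pad missing genes with gap characters are assumptions made for illustration only. The returned coordinate ranges correspond to the per-gene partitions used in the ML analyses.

```python
def concatenate_alignments(alignments, missing="-"):
    """Concatenate per-gene alignments into a supermatrix (illustrative sketch).

    alignments: list of (gene_name, {taxon: aligned_sequence}) pairs.
    Taxa lacking a gene are padded with `missing` (an assumption of this sketch).
    Returns (supermatrix, partitions), where partitions holds 1-based,
    inclusive (gene, start, end) coordinates in the concatenated alignment.
    """
    taxa = sorted({taxon for _, aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences are empty or differ in length")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, missing * length))
        partitions.append((gene, start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

Each `(gene, start, end)` tuple can be written out as one line of a partition file, so that a substitution model is applied separately to each gene.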

Phylogenies were inferred from the amino acid data sets using the ML and Bayesian methods. ML analyses were carried out using RAxML 8.0.20 [93] and the gcpREV + Γ4 [54], LG4X [55] and GTR + Γ4 models of sequence evolution; in these analyses, the data sets were partitioned by gene, with the model applied to each partition. Confidence of branch points was estimated by fast-bootstrap analysis (f = a) with 500 replicates and confidence assessment of phylogenetic tree selections under the GTR + Γ4 model was carried out by the approximately unbiased (AU) test [60] as implemented in CONSEL 0.20 [61]. Bayesian analyses were performed with PhyloBayes 3.3f [94] using the site-heterogeneous CATGTR + Γ4 model [57]. To establish the appropriate conditions for these analyses, five independent chains were run for 2,000 cycles and consensus topologies were calculated from the saved trees using the BPCOMP program of PhyloBayes after a burn-in of 500 cycles. Under these conditions, the largest discrepancy observed across all bipartitions in the consensus topologies (maxdiff) was lower than 0.30, indicating that convergence between the chains was achieved. Bayesian analysis of the Dayhoff-recoded version of the amino acid data set was also performed using PhyloBayes and the CATGTR + Γ4 model.
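The maxdiff statistic described above can be illustrated with a short Python sketch. It is not the BPCOMP code; the function name `maxdiff` and the dictionary input format are hypothetical, and the sketch assumes that each chain has already been summarized as bipartition frequencies computed after burn-in.

```python
def maxdiff(chain_frequencies):
    """Largest discrepancy in bipartition frequency between chains (sketch).

    chain_frequencies: list with one dict per chain, mapping a bipartition
    label to its posterior frequency (0..1) in that chain. A bipartition
    absent from a chain is treated as having frequency 0.
    """
    bipartitions = set().union(*chain_frequencies)
    largest = 0.0
    for bipartition in bipartitions:
        freqs = [chain.get(bipartition, 0.0) for chain in chain_frequencies]
        largest = max(largest, max(freqs) - min(freqs))
    return largest
```

Under the criterion used here, a value below 0.30 returned for the five chains would be taken as evidence of convergence.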

Cross-validation tests were conducted to evaluate the fits of the gcpREV + Γ4, GTR + Γ4 and CATGTR + Γ4 models of amino acid substitutions to the data set. They were carried out with PhyloBayes using ten randomly generated replicates. Cross-validation is a very general statistical method for comparing models. The procedure can be summarized as follows. The data set is randomly partitioned into two unequal subsets, the learning set (also called the training set) and the test set. The learning set serves to estimate the parameters of the model and these parameters are then used to compute the likelihood of the test set. To reduce variability, multiple rounds of cross-validation are performed using different partitions and the resulting log likelihood scores (which measure how well the test sets were predicted by the model) are averaged over the rounds.
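The cross-validation procedure summarized above can be sketched in Python as follows. This is only a toy illustration: it replaces the phylogenetic models with a single set of amino acid frequencies estimated with pseudocounts, and the function names, the 10% test fraction and the random seed are all assumptions, not the settings used with PhyloBayes.

```python
import math
import random
from collections import Counter

def fit_frequencies(columns, alphabet, pseudocount=1.0):
    """Estimate state frequencies from the learning set (toy model)."""
    counts = Counter(ch for col in columns for ch in col if ch in alphabet)
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

def log_likelihood(columns, freqs):
    """Log likelihood of alignment columns under the fitted frequencies."""
    return sum(math.log(freqs[ch]) for col in columns for ch in col if ch in freqs)

def cross_validate(columns, alphabet, replicates=10, test_fraction=0.1, seed=0):
    """Average test-set log likelihood over random learning/test splits."""
    rng = random.Random(seed)
    n_test = max(1, int(len(columns) * test_fraction))
    scores = []
    for _ in range(replicates):
        shuffled = list(columns)
        rng.shuffle(shuffled)
        test, learn = shuffled[:n_test], shuffled[n_test:]
        scores.append(log_likelihood(test, fit_frequencies(learn, alphabet)))
    return sum(scores) / len(scores)
```

Comparing the averaged scores obtained for two models on the same splits indicates which model predicts unseen sites better; the higher (less negative) score reflects the better fit.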

To analyze the amino acid composition of the 63-taxon data set, we first assembled a 20 × 63 matrix containing the frequency of each amino acid per species using the program Pepstats of the EMBOSS package [95]. A correspondence analysis of this data set was then performed using the R package ca [96].
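The layout of the composition matrix can be illustrated with a minimal Python sketch. It does not reproduce Pepstats; the function name `composition_matrix` and the decision to ignore gaps and ambiguous residues are assumptions of this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def composition_matrix(sequences):
    """Build an amino acid x taxon frequency matrix (illustrative sketch).

    sequences: {taxon: concatenated protein sequence}.
    Gaps and non-standard symbols are ignored (an assumption of this sketch).
    Returns (taxa, matrix) with one row per amino acid and one column per taxon.
    """
    taxa = sorted(sequences)
    columns = []
    for taxon in taxa:
        residues = [c for c in sequences[taxon].upper() if c in AMINO_ACIDS]
        total = len(residues)
        columns.append(
            [residues.count(aa) / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    # Transpose so that rows are amino acids and columns are taxa.
    matrix = [list(row) for row in zip(*columns)]
    return taxa, matrix
```

For the 63-taxon data set this yields 20 rows and 63 columns, the shape expected as input for a correspondence analysis.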

Phylogenomic analyses of nucleotide data sets

Nucleotide data sets containing the gene sequences represented in the amino acid data set of 63 taxa were prepared as follows. To obtain the data set with all three codon positions, the multiple sequence alignment of each protein was converted into a codon alignment, the poorly aligned and divergent regions in each codon alignment were excluded using Gblocks 0.91b [97] with the -t = c, -b3 = 5, -b4 = 5 and -b5 = half options, and the individual codon alignments were concatenated using Phyutility 2.2.6 [92]. The nt1 + 2 data set was obtained by excluding the third codon positions using Mesquite 2.75 [98]. The degen1 data set was prepared using the Degen1.pl 1.2 script of Regier et al. [62]. This script fully degenerates all codons that encode single amino acids by substituting one of the four standard nucleotides with ambiguity codes that allow for all possible synonymous change for that amino acid. It operates by degenerating nucleotides at all sites that can potentially undergo synonymous change in all pairwise comparisons of sequences in the data matrix, thereby making synonymous change largely invisible and reducing compositional heterogeneity but leaving the inference of nonsynonymous changes largely intact.
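The construction of the nt1 + 2 data set can be illustrated with a short Python sketch. It is not the Mesquite procedure; the function name `drop_third_positions` is hypothetical, and the sketch assumes an in-frame codon alignment whose length is a multiple of three.

```python
def drop_third_positions(codon_alignment):
    """Keep first and second codon positions only (illustrative sketch).

    codon_alignment: {taxon: in-frame aligned nucleotide sequence}.
    """
    reduced = {}
    for taxon, seq in codon_alignment.items():
        if len(seq) % 3 != 0:
            raise ValueError(f"{taxon}: length {len(seq)} is not a multiple of 3")
        # Take the first two nucleotides of every codon.
        reduced[taxon] = "".join(seq[i:i + 2] for i in range(0, len(seq), 3))
    return reduced
```

Applied to the concatenated codon alignment, this reduces the number of sites by one third while preserving the gene order, so gene partitions can be rescaled accordingly.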

ML analyses of nucleotide data sets were carried out using RAxML 8.0.20 [93] and the GTR + Γ4 model of sequence evolution; in these analyses, the data sets were partitioned by gene, with the model applied to each partition. Confidence of branch points was estimated by fast-bootstrap analysis (f = a) with 500 replicates.

Availability of supporting data

The sequence data generated in this study are available in GenBank under the accession numbers KM462860-KM462888 (see Table 1). The data sets supporting the results of this article are available in the Dryad Digital Repository (doi: 10.5061/dryad.q4432) [99].

Authors’ contributions

CL and MT conceived the study, designed taxon sampling and wrote the manuscript. CO performed the experimental work. CO and CL carried out the genome assemblies and annotations. CL performed the phylogenetic analyses and generated the figures. MT and CL analyzed the phylogenetic data. All authors read and approved the final manuscript.