Background

In every living organism, developmental, morphological and physiological mechanisms, such as those allowing acclimation to environmental changes, are the result of genome expression modulation. One level of this modulation is related to gene expression, in which proteins that regulate transcription are among the key players [1]. These regulators can be divided into two groups: transcription factors (TFs) and transcriptional regulators (TRs). These groups interact with each other and affect gene transcription. TFs are characterized by a DNA binding domain (DBD), an oligomerization domain (allowing interaction with other TFs, as well as with other transcriptional regulators) and a transcription regulation domain (allowing control of gene expression). These proteins (also called trans-factors) control the expression of multiple target genes by binding to specific DNA motifs in their promoter regions. TRs interact with TFs or with chromatin, allowing genes to be transcribed by either (1) facilitating the recruitment of the basal transcription machinery, or (2) modifying chromatin structure, making genes more accessible [2].

TFs are classified according to their DBD [3]. Most TFs have only one DBD, which can be present in one or multiple copies in the same sequence. However, some TFs can have several DBD types in their sequence [4].

Since the first study on the identification of TFs in four archaeal genomes [5], the increase in the number of sequenced genomes has facilitated the identification of putative TFs in unrelated taxa through in silico studies [6–10]. Such taxonomically diverse data allow comparative analyses between different species or lineages [6, 7, 9–13] and give insight into evolutionary history through TFs [11, 14, 15]. This kind of study can reveal taxonomic characteristics (i.e., the specificity and expansion of TF families) of the TF complement of different organisms. In silico analysis of TFs performed on Arabidopsis thaliana (A. thaliana) showed that 45 % of TFs are plant specific. Moreover, a plant-specific expansion of the MYB superfamily was demonstrated (190 copies in the A. thaliana genome compared with 6 and 10 in Drosophila melanogaster and Saccharomyces cerevisiae, respectively) [6]. Another example of such lineage-specific expansion of a TF family is the retinoic acid receptors in the nematode Caenorhabditis elegans. Using the AnimalTFDB database, 239 putative TFs belonging to this family were identified, whereas in other animals, such as Tetraodon nigroviridis, this TF family is only represented by 19 members [10].

Among microalgae, comparative studies of the TF complement have been undertaken for stramenopiles [9] and to investigate the evolutionary history of both red and green algae among photosynthetic organisms [11, 15]. Microalgae arose from the endosymbiosis of a photosynthetic prokaryote, related to today’s cyanobacteria, by a primitive eukaryotic heterotroph. Glaucophyta, Rhodophyta and Chlorophyta all originated from this primary endosymbiosis [16, 17]. A series of secondary and tertiary endosymbioses would have then led to the diversity of microalgae observed today [18, 19]. Haptophytes, like stramenopiles, would have arisen from the secondary endosymbiosis of both a green and a red alga by a heterotrophic eukaryote [19, 20]. Haptophytes are one of the key players in the evolutionary history of photosynthetic organisms [21] and are widely distributed among the photosynthetic unicellular eukaryotes in today’s oceans. However, in silico comparative studies in haptophytes are limited because few data are available.

Here, we conducted the first genome-wide identification and comparison of the TF complement in haptophytes using an optimized and automated pipeline. This analysis pipeline combines searches for similarity with known TFs and for protein domains, using a large database containing plant, fungal, mammal and cyanobacterial TFs. Using our pipeline, we performed the in silico identification of the TF complement in three haptophytes (Tisochrysis lutea, Emiliania huxleyi and Pavlova sp.) and two stramenopiles (the eustigmatophyte Nannochloropsis gaditana and the diatom Phaeodactylum tricornutum), which are closely related groups [19, 22], as well as in the green alga Chlamydomonas reinhardtii and the red alga Porphyridium purpureum. We focused on the identification of the main families of TFs found in these microalgal species and compared their respective abundance in each. Moreover, the present study identified, for the first time, the presence of cyanobacterial TFs in each of the microalgal genomes studied.

Results and discussion

Evaluation of transcription factor identification accuracy

Pipeline analysis is essential for whole genome TF identification. Since no universal pipeline exists, each study uses its own. However, every pipeline is based on the same tools: a single identification with BLAST searches against a plant database [9, 15], and/or a single protein domain search with HMMER software focused on plant DBDs [11–13]. Several pipelines combine both methods so as to be more accurate and exhaustive [2, 8]. Moreover, the HMMER software is used either with the Pfam database alone or with Pfam combined with another database. Our pipeline combines the same identification strategies, but with two specificities: it includes more protein domain databases (the eleven databases of the InterProScan consortium), and the search is not restricted to plants but extended to fungi, algae and cyanobacteria.

In order to estimate the accuracy of our pipeline (Fig. 1), we applied it to the predicted proteome of A. thaliana and three cyanobacteria (see Methods section). The sensitivity and the positive predictive value (PPV) were measured in the same way as in [23] and [24].

Fig. 1

Identification pipeline. The pipeline is divided into three steps. Step One uses two strategies: i) a similarity search against an algae-based self-built database of known TFs with BLAST software; ii) functional domain annotation with InterProScan and HMMER software. The resulting protein list is passed to Step Two: the filtering out of false positives according to specific parameters (see Methods). The last step consists of the classification of the putative TF list obtained in Step Two using a homemade Perl script, followed by manual curation for specific cases (see Methods).
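The three steps in the caption can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' Perl implementation: the input structures, the e-value cutoff `max_evalue`, the filtering rule and the domain-to-family mapping are all assumptions made for the example, since the actual parameters are given in the Methods.

```python
def identify_tfs(blast_hits, domain_hits, family_of_domain, max_evalue=1e-5):
    """Toy version of the three-step pipeline (all rules are hypothetical).

    blast_hits:       protein id -> best BLAST e-value against known TFs
    domain_hits:      protein id -> list of annotated domain names
    family_of_domain: DNA-binding domain name -> TF family
    """
    # Step One: a protein is a candidate if either strategy reports it.
    candidates = set(blast_hits) | set(domain_hits)

    classified, to_curate = {}, []
    for protein in sorted(candidates):
        domains = domain_hits.get(protein, [])
        families = sorted({family_of_domain[d] for d in domains
                           if d in family_of_domain})
        blast_ok = blast_hits.get(protein, float("inf")) <= max_evalue

        # Step Two (assumed rule): discard candidates supported neither by
        # a known DNA-binding domain nor by a strong BLAST hit.
        if not families and not blast_ok:
            continue

        # Step Three: classify by DBD; ambiguous cases go to manual curation.
        if len(families) == 1:
            classified[protein] = families[0]
        else:
            to_curate.append(protein)
    return classified, to_curate


# Made-up proteins and annotations, for illustration only.
blast_hits = {"p1": 1e-30, "p2": 0.5, "p3": 1e-12}
domain_hits = {"p1": ["Myb_DNA-binding"], "p2": ["Kinase"],
               "p4": ["HSF_DNA-bind", "PAS"]}
family_of_domain = {"Myb_DNA-binding": "MYB", "HSF_DNA-bind": "HSF"}

classified, to_curate = identify_tfs(blast_hits, domain_hits, family_of_domain)
print(classified)
print(to_curate)
```

In this toy run, `p2` is dropped as a likely false positive, while `p3` has a strong similarity hit but no recognised DBD and is therefore set aside for manual curation.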

The analysis of the pipeline accuracy against eleven plant TF families showed that nine were identified with good sensitivity and PPV values equal to one (Tables 1 and 2). Only the MADS and bHLH TF families were identified with a low sensitivity and a PPV value of 0.99, respectively. Using a more recent gold standard than [23] and [24], our sensitivity and PPV values are equivalent to or better than those of previous pipelines [24, 25].

Table 1 Evaluation of the pipeline accuracy for each TF family for plant TFs. A sensitivity value less than one means inclusion of false negatives, and a PPV value less than one means inclusion of false positives
Table 2 Evaluation of the pipeline accuracy for each TF family for cyanobacterial TFs. A sensitivity value less than one means inclusion of false negatives, and a PPV value less than one means inclusion of false positives

Concerning the cyanobacterial TF families, the sensitivity value was one for all families (no false negatives identified). The PPV values were equal to one for cyanobacterial TFs, except for the GntR and Crp families (0.83 and 0.88, respectively). These lower PPV values are mostly due to the small number of TFs in these organisms: with so few members per family, only one false positive (GntR) or two (Crp) are enough to lower the PPV noticeably. These results indicate the high accuracy (few false positives) and performance (few false negatives) of our analysis pipeline for the in silico identification of TFs, not only in plants and cyanobacteria but also in other organisms such as algae.
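For reference, the two metrics follow the standard definitions: sensitivity = TP / (TP + FN) and PPV = TP / (TP + FP), where TP, FN and FP are the numbers of true positives, false negatives and false positives. The short Python sketch below shows why a single false positive weighs heavily in a small family. The counts used are hypothetical and chosen only for illustration; the real counts are those underlying Tables 1 and 2.

```python
def sensitivity(tp, fn):
    """Fraction of gold-standard TFs that the pipeline recovers."""
    if tp + fn == 0:
        raise ValueError("no gold-standard members to evaluate")
    return tp / (tp + fn)


def ppv(tp, fp):
    """Fraction of predicted TFs that are in the gold standard."""
    if tp + fp == 0:
        raise ValueError("no predictions to evaluate")
    return tp / (tp + fp)


# Hypothetical small family: 5 true positives, 1 false positive, 0 missed.
small_sens = sensitivity(tp=5, fn=0)
small_ppv = ppv(tp=5, fp=1)

# Hypothetical large family: 99 true positives, 1 false positive.
large_ppv = ppv(tp=99, fp=1)

print(round(small_sens, 2), round(small_ppv, 2), round(large_ppv, 2))
```

The same single false positive lowers the PPV far more in the five-member family than in the hundred-member one, which is the effect invoked above for the GntR and Crp families.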

Transcription factor content in algae

In this study, predicted TFs from seven algae representing four different lineages were identified and classified using our analysis pipeline (Table 3). In total, 155, 128 and 478 TFs were identified in the haptophytes Tisochrysis lutea (T. lutea), Pavlova sp. and Emiliania huxleyi (E. huxleyi), respectively. Concerning the two stramenopiles, 196 and 93 TFs were identified in Phaeodactylum tricornutum (P. tricornutum) and Nannochloropsis gaditana (N. gaditana), respectively. Finally, 199 and 212 TFs were identified in the rhodophyte Porphyridium purpureum (P. purpureum) and the chlorophyte Chlamydomonas reinhardtii (C. reinhardtii), respectively. All TFs identified belong to common families that are widely distributed among the species studied. Here, the predicted TFs of the haptophytes T. lutea, Pavlova sp. and E. huxleyi were divided into 27, 24 and 25 families, respectively. Twenty-two families were reported for each of the stramenopiles (P. tricornutum and N. gaditana), while 25 and 37 families were identified for P. purpureum and C. reinhardtii, respectively. Relative to the predicted proteomes, the proportion of TFs was estimated at between 0.8 and 2.4 % (Fig. 2). Such percentages in microalgae are consistent with previous studies [9, 13]. By way of comparison across the eukaryotic world, the unicellular organism Saccharomyces cerevisiae dedicates 3.5 % of its proteome to TFs [26], whereas multicellular eukaryotes such as Drosophila melanogaster, A. thaliana and Homo sapiens contain 4.6, 5.9 and 8 to 9 % TFs, respectively [6, 26, 27]. In accordance with the fact that TFs play a role in the morphological diversification of organisms [28–30], these proportions show a correlation between the complexity of organisms and the proportion of TFs found in their proteome [2, 14, 31–33]. This is illustrated by the coincidence of the expansion of TF families with the divergence of the great eukaryotic lineages [11].
Indeed, it is well known that the evolutionary history of eukaryotes, especially plants, is punctuated by multiple biological processes, such as duplication [34–36] or domain shuffling, allowing modifications resulting in the emergence of new TF families [6, 11, 37]. These whole or partial genome duplications and domain shuffling have not been shown in algae. However, it can be reasonably assumed that such phenomena, leading to the emergence of new TF families, have also occurred in algae. This is suggested by the presence of TF families found only in green algae compared to the other algal lineages.

Table 3 Transcription factor families identified and their proportions in seven microalgae
Fig. 2

Percentages of the predicted proteomes dedicated to transcription factors in the 7 algae

These lineage-specific gains and losses of TF families mirror, to some extent, the evolutionary history of the lineages. To illustrate this idea, a binary table representing the presence/absence of TF families in seven algae representing four different lineages was built. On this basis, a similarity matrix was computed to infer a dendrogram using R version 3.1.0 (Fig. 3). The resultant dendrogram (deposited in TreeBase: http://purl.org/phylo/treebase/phylows/study/TB2:S19079) confirms the relationship between algae derived from the four different lineages. Haptophytes, stramenopiles, red algae and green algae are clearly separated. We also found that T. lutea is more closely related to E. huxleyi than to Pavlova sp., as has been described in the literature [38, 39]. The rhodophyte P. purpureum is located between haptophytes and stramenopiles. This position is mostly due to the absence of MADS-box and C2C2-GATA families in stramenopiles, which makes them a more distant group from the four previous algae. Finally, the chlorophyte C. reinhardtii is the most distant from the others because of the presence of the TF families specific to the green lineage. This illustrates that the composition of this TF content is partly lineage specific. To discriminate the TF families, a heatmap was built using the data of Table 3. TF families were clustered according to their given proportions in the seven algal genomes (Fig. 4). Four interesting clusters were found: (i) TF families described as specific to the green lineage; (ii) TF families with equivalent proportions among the 7 algal genomes; (iii) TF families present in the 7 algae but with different proportions; (iv) TF families absent only in stramenopiles.
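The presence/absence approach can be sketched as follows. This is a minimal illustration in Python rather than the R analysis used for Fig. 3: the Jaccard distance and the average-linkage clustering are assumptions (the text does not state which measures were used), and the table covers only eight families whose presence or absence is stated in the text, not the full content of Table 3, so it is not expected to reproduce the published dendrogram.

```python
from itertools import combinations

# Subset of TF families per species, taken from statements in the text.
HAPTO = {"MADS-box", "C2C2-GATA", "LIM", "AP2/ERF", "TUB", "Fungal TRF"}
STRAM = {"bHLH", "AP2/ERF", "TUB", "Fungal TRF"}
presence = {
    "T. lutea": set(HAPTO),
    "E. huxleyi": set(HAPTO),
    "Pavlova sp.": set(HAPTO),
    "P. tricornutum": set(STRAM),
    "N. gaditana": set(STRAM),
    "P. purpureum": {"bHLH", "MADS-box", "C2C2-GATA", "LIM"},
    "C. reinhardtii": {"bHLH", "MADS-box", "C2C2-GATA", "LIM",
                       "AP2/ERF", "TUB", "WRKY"},
}


def jaccard_distance(a, b):
    """1 - |shared families| / |families present in either species|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(items):
    """Agglomerative clustering; returns the merges in the order performed."""
    def dist(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard_distance(items[x], items[y])
                   for x, y in pairs) / len(pairs)

    clusters = [(name,) for name in items]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        merges.append((c1, c2, dist(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


merges = average_linkage(presence)
for left, right, d in merges:
    print(f"{' + '.join(left)}  |  {' + '.join(right)}  (distance {d:.2f})")
```

With only eight families, species of the same lineage are indistinguishable here; resolving relationships within a lineage, as in Fig. 3, requires the full table.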

Fig. 3

Dendrogram representing the distribution of the four lineages according to the presence/absence of TF families. The green lineage is colored in green, stramenopiles in orange, red lineage in red and haptophytes in purple. The scale indicates distance measurement

Fig. 4

Heatmap showing the clustering of TF families according to their proportion in the algal genomes. Cluster 1 comprises TF families described as specific to the green lineage. Cluster 2 is composed of families with equivalent proportions across algal genomes. Cluster 3 is composed of families present in the 7 algae but in different proportions. Cluster 4 is composed of 3 families that are absent in stramenopiles

In the following section, the TF content of the seven algae and their specificities of lineage, based on Table 3 and Fig. 4, are examined in more detail.

Comparison of TF families among microalgae lineages

Common TF families with equivalent proportions

The proportions of each TF family in the seven algae were compared. We found that four families were present in similar proportions across the algal lineages (Table 3). Among these, the Cold Shock Domain (CSD) family accounts for around 1 to 5 % of the TFs in the analyzed algae. Our analysis pipeline identified for the first time three CSD TFs in the rhodophyte P. purpureum, representing 1.5 % of its predicted TF content. This family was previously described as absent from red microalgae [15]. The absence of identification of CSD TFs from the red lineage may be explained by the fact that research on red microalgae was performed only in the genomes of the extremophiles Galdieria sulfuraria (G. sulfuraria) and Cyanidioschyzon merolae (C. merolae). These organisms are adapted to the particular selection pressure due to their living environment (in hot springs such as in Yellowstone National Park) [40]. Consequently, the absence of this TF family from G. sulfuraria and C. merolae cannot be taken as a common characteristic of the red lineage.

The E2F/DP family, present in all eukaryotes and known for its involvement in the cell cycle [41], is also equally distributed among algae (around 1 to 3 %).

The MYB family is large, functionally diverse and represented in all eukaryotes, including algae (around 30 %). MYB factors are characterized by a highly conserved DNA-binding domain: the MYB domain. MYB TFs can be divided into different classes depending on the number of adjacent repeats. Three repeats of MYB protein are referred to as R1, R2 or R3, and repeats identified on other related MYB proteins are named in accordance with their similarity with R1, R2 and R3. Although most of these TFs are not functionally characterized in plants, some have been identified as involved in key mechanisms, such as cellular morphogenesis, secondary metabolism, response to biotic and abiotic stresses and signal transduction [42–45]. Finally, the last family equally distributed among algae is the Sigma-70 family. Members of the Sigma-70 family of sigma factors serve as components of the RNA polymerase that direct it to specific promoter elements. In photosynthetic eukaryotes, these Sigma-70 TFs are nuclear encoded and play a role in plastid transcription [46].

Common TF families with different proportions

Four TF families exhibit differences in proportion between species and are grouped in cluster 3 in Fig. 4. Among these, the C3H type zinc finger family, whose DBD forms a zinc finger, is twice as common in haptophytes and green algae (around 10 %, except for Pavlova sp. (5.5 %)) as in stramenopiles and red algae (around 5 %) (Table 3). This protein family is widespread in the tree of life [47–49] and involved in the response to biotic and abiotic stresses [50, 51]. The second family that shows different proportions is the basic leucine zipper (bZIP) TF family, which accounts for about 2 % in the three haptophytes analyzed in this study, while its proportion is about 10 % in the other algae (P. tricornutum: 12.8 %, N. gaditana: 11.8 %, P. purpureum: 10.6 % and C. reinhardtii: 9.4 %).

The third case is that of a particular class of MYB-related TFs: the SHAQKYF-like TFs. This family was described in plants, green algae, as well as in stramenopiles and Amoebozoa [9, 52, 53]. MYB-SHAQKYF is a minority among MYB-rel in E. huxleyi and T. lutea (2 and 4.7 %, respectively). For Pavlova sp. and C. reinhardtii, non-negligible amounts of MYB-SHAQKYF were identified among MYB-rel (13.3 and 22.2 %, respectively). In contrast, MYB-SHAQKYF represent half or more of the MYB-rel TFs in the two stramenopiles P. tricornutum and N. gaditana, as well as in the rhodophyte P. purpureum (50, 53.3 and 69.6 %, respectively) (Fig. 5). Such a distribution, together with the presence of such TFs in Amoebozoa, suggests that MYB-SHAQKYF proteins have an ancient origin.

Fig. 5

Percentages of MYB-SHAQKYF among MYB-related TFs in algae

Finally, the Nuclear Factor-Y (NF-Y) family, also present in all eukaryotes, comprises three subunits: NF-YA, NF-YB and NF-YC. In plants, all three subunits were identified [13]; these TRs are involved in mechanisms as diverse as chloroplast biogenesis, stress response, nodule formation, flowering time control, fatty acid biosynthesis, or response to abscisic acid and blue light [54–59]. Subunits NF-YB and NF-YC form a dimer in the cytosol, which is then translocated into the nucleus. The NF-YB/NF-YC dimer interacts in the nucleus with the NF-YA subunit. The functional trimer binds to a cis-element called CCAAT-box in the promoter of its target genes [60, 61]. However, no NF-YA subunit was identified in T. lutea and C. reinhardtii. Such an absence in chlorophytes was previously reported using a similar approach for C. reinhardtii, Volvox carteri and Ostreococcus tauri [13]. The absence of the NF-YA subunit would therefore imply that it is impossible to form the functional trimer. However, it was demonstrated that other TFs are able to interact with the NF-YB and NF-YC subunits. For example, the NF-YB/NF-YC complex can interact with a TF belonging to the C2C2-CO-like family thanks to its CCT domain [62]. Moreover, the interaction between the NF-YB/NF-YC complex and bZIP TFs of A. thaliana is sufficient to activate the transcription of target genes, either in the presence or absence of abscisic acid (ABA) [63]. Alternatively, the NF-YB/NF-YC dimer could be active without NF-YA in these taxa.

TF family expansion

During evolutionary history, duplication events occur. Following these duplications, the number of genes of a given family increases. These gene family expansions may be lineage or species specific [64]. In most of the algae studied, MYB is the most represented family; in P. tricornutum, E. huxleyi and P. purpureum, however, another TF family is more represented because of such an expansion. In the stramenopile P. tricornutum, the Heat Shock Factor family (HSF) was the most represented among the TF families (34.2 % of the TF content) (Table 3). Such a proportion of HSF was previously shown in the diatoms P. tricornutum, Thalassiosira pseudonana and Fistulifera solaris [9, 65]. This expansion seems to be specific to diatoms since neither N. gaditana nor other photosynthetic stramenopiles exhibit such expansion of HSFs [9].

In the haptophyte E. huxleyi, the most represented family, accounting for 33 %, is AP2/ERF, involved in growth and development as well as various responses to environmental stimuli. This family was described as specific to the green lineage [15] and its expansion in E. huxleyi was also previously described [9]. However, such a proportion of AP2/ERF is not common to all haptophytes since T. lutea and Pavlova sp. have AP2/ERF proportions of 1.8 and 5.5 %, respectively, which are close to values recovered for stramenopiles and green algae, respectively. The non-detection of the AP2/ERF family in the rhodophyte P. purpureum is noteworthy, confirming the absence of AP2/ERF in algae belonging to the red lineage [12, 15].

Finally, the C2H2 type zinc finger family was identified as the most represented family in the rhodophyte P. purpureum. We found that the C2H2 proportion represents 30.2 % compared to less than 8 % in the other algae. Interestingly, in the two extremophiles, G. sulfuraria and C. merolae, the C2H2 family was reported to account for less than 5 % [12].

These examples of lineage- or species-specific TF expansion illustrate the phenomena that govern the story of TF evolution: gene duplication [66] and diversification through the emergence of lineage-specific families via functional domain shuffling [4, 6, 14, 67]. In the algal world, one of the best examples of lineage-specific TF families is the group of “green TF” families, which are specific to the green lineage.

Lineage-specific TF families

Are TF families specific to the green lineage highly specific?

Previous comparative studies of the TF content of diverse photosynthetic organisms revealed that some TF families are specific to the green lineage because of their absence from red microalgae [11, 15]. Among all green lineage-specific TF families identified in this study, only nine were present solely in the green alga C. reinhardtii: NF-X1, S1Fa-like, SBP, VARL, Whirly, WRKY, GARP-ARR-B, C2C2-CO-like and C2C2-Dof (Table 3). However, some TF families previously described as specific to the green lineage were also identified in haptophytes, stramenopiles or in the rhodophyte P. purpureum. First of all, one TF belonging to the ABI3/VP1 family was identified in T. lutea and the C2C2-LSD family has one member in both T. lutea and Pavlova sp. In the heatmap (Fig. 4), these two TF families are clustered with the nine families only identified in C. reinhardtii. Moreover, the CSD family was identified in all predicted proteomes and the AP2/ERF and TUB families are absent in P. purpureum, but present in the six other algae. Another interesting finding is the unique identification of a member of the Double B-box (DBB) family in P. purpureum. This family had only previously been identified in land plants [68] and was thought to be involved in light signal transduction mechanisms, such as early photomorphogenic development of A. thaliana [69–72].

This presence of “green TFs” in algae that do not belong to the green lineage could be explained either (i) by a loss of these families during evolutionary history of rhodophytes, or (ii) by the acquisition of these families by horizontal gene transfer from a green algal endosymbiont to the nuclear genome. This last hypothesis is consistent with the endosymbiosis of a green and a red alga in the evolutionary history of haptophytes and stramenopiles [19].

Specific features of stramenopiles

The stramenopiles P. tricornutum and N. gaditana are distinguished by the absence of the C2C2-GATA family and the MADS-box family, which are involved in plant homeotic functions [73–75] (Table 3). These results confirm those of Rayko et al. [9] for stramenopile micro- and macro-algae. Moreover, our results also highlight the absence of TFs from the LIM family in stramenopiles, while LIM TFs are present in all other studied algae. LIM, C2C2-GATA and MADS-box families are clustered together in Fig. 4. To examine whether these features are shared by other stramenopiles not investigated in this work, a targeted search for LIM, MADS-box and C2C2-GATA TFs was carried out in the two diatoms Pseudo-nitzschia multiseries and Fragilariopsis cylindrus. No member of these families was identified (data not shown). By contrast, the MADS-box, C2C2-GATA and LIM families were identified in P. purpureum and C. reinhardtii (this study), as well as in other chlorophytes and rhodophytes (the green algae Bathycoccus prasinos, Micromonas pusilla, Micromonas sp., Ostreococcus lucimarinus, Ostreococcus sp., Ostreococcus tauri and Volvox carteri; the red algae C. merolae and G. sulfuraria) [12, 13]. This distribution suggests that the MADS-box, C2C2-GATA and LIM families were present in the hypothetical ancestor of the algae and secondarily lost in stramenopiles.

Another feature of stramenopiles concerns some particular combinations of functional domains. Two domain associations shared by both stramenopiles N. gaditana and P. tricornutum were identified. The first is composed of a bHLH domain and a PAS domain (named after the first three proteins in which it was identified: Per, Arnt and Sim) and the second of a bZIP and a LOV (Light, Oxygen, Voltage) domain. The bHLH-PAS TFs are well known in vertebrates, in which two PAS domains are present, contrary to the stramenopile sequences that have only one PAS [9, 76]. In vertebrates, the PAS domains are involved in the dimerization of PAS domain-containing TFs, such as the Hypoxia Inducible Factor [77, 78]. The presence of bHLH and PAS domains in the same sequence in both vertebrates and stramenopiles may be an example of convergent evolution, which suggests that this fusion occurred in a parallel fashion in different lineages.

The second stramenopile-specific combination is that of the bZIP and LOV domains. These sequences, called aureochromes, are an atypical case that couples both blue light receptor and transcription factor functions [79]. We identified three and four aureochromes in N. gaditana and in P. tricornutum, respectively. Such sequences have only been identified in photosynthetic stramenopiles [9, 79–82]. In marine environments, sea water absorbs wavelengths other than blue, so blue light is the only part of the spectrum that travels long distances within the water column [83]. Blue light is thus expected to play an important role in algae, as suggested by the involvement of aureochromes in key mechanisms such as the cell cycle [84]. Moreover, mechanisms like photomorphogenesis and phototropism observed in algae [85] are influenced in land plants by phototropins [86]. These are blue light receptors harboring two LOV domains and have a role in signal transduction [87]. Thus, aureochromes are lineage-specific TFs that evolved in photosynthetic stramenopiles and confer an adaptive capacity for success in an aquatic environment.

Specific features of haptophytes

The bHLH TFs were identified in the predicted proteome of P. tricornutum, N. gaditana, C. reinhardtii and P. purpureum, but not in the three haptophytes (Table 3). Yet bHLH is one of the most widespread TF families in eukaryotes and the second most represented in plants [13, 88]. This distribution suggests that the bHLH TF family was secondarily lost in T. lutea, E. huxleyi and Pavlova sp. These results confirm previous conclusions derived from the comparison of the TF content of six stramenopiles with that of E. huxleyi [9], and extend the number of haptophytes known to share this absence of the bHLH family.

Interestingly, we identified two and four Heat Shock transcription factors (HSFs) in E. huxleyi and T. lutea, respectively, that share the association of a HSF DBD with a PAS domain. Moreover, two other HSF proteins, harboring two PAS domains, were identified only in T. lutea.

The PAS domain is known for playing a role in stress perception in all categories of living organisms [89]. Its sensor function is applied to stimuli such as light, oxygen or redox potential. Such stimuli are also known to induce HSF expression. In plants in particular, HSFs are involved in response to oxidative stress and redox state changes [90, 91]. This functional convergence led us to hypothesize that the sensor function of the PAS domain may play a role in the detection of stimuli involved in HSF activation. The PAS domain also enables protein-protein interactions, especially with other PAS-containing proteins [89, 92]. This function may stabilize the homotrimer formed by activated HSFs. Likewise, four TFs in T. lutea show a previously undescribed association of a PAS domain and a homeobox domain.

Potential gene transfer cases

Identification of cyanobacterial TFs in the nuclear genome of algae

Remarkably, our TF prediction pipeline allowed the identification of cyanobacterial TFs in the predicted proteome of all the microalgae studied (Table 4). We investigated whether the presence of these genes could be due to bacterial contamination, and if not, whether these genes are localized in the nuclear, chloroplastic or mitochondrial genome. Because information concerning bacterial contamination is only available for T. lutea (G. Carrier, pers. comm.), Pavlova sp. (transcriptomic data) and C. reinhardtii (JGI portal), it was only possible to answer the contamination question for these three algae. For T. lutea, Pavlova sp. and C. reinhardtii, we conclude that the identification of cyanobacterial TFs is not due to bacterial contamination. Concerning the localization of the cyanobacterial TFs in the algae, we cannot draw any conclusions for Pavlova sp., for which no mitochondrial or chloroplastic genome is available. For P. purpureum, the TFs are not localized in the chloroplastic genome; however, since the mitochondrial genome is not available, we cannot draw a conclusion about a mitochondrial localization. We found that these TFs are nuclear genes for T. lutea, E. huxleyi, P. tricornutum, N. gaditana and C. reinhardtii.

Table 4 Number of cyanobacterial transcription factors (TFs) identified in the seven algae for each TF family

Only one TF belonging to the arsenic resistance operon regulator (arsR) family was identified, in N. gaditana. This family is involved in stress response to metal ions in cyanobacteria [93]. Considering the Bac_DNA_Binding family, one member was identified in all the algae except P. purpureum. This protein family is involved in transcription regulation, transposition and DNA chaperone activity [94, 95]. Several members of the BolA family were identified in all algae. BolA is a widespread family identified in all groups of the tree of life [2] and is involved in cell cycle regulation and abiotic stress response in cyanobacteria [96]. The GerE family, which is part of a two-component response regulator, was only identified in the haptophytes T. lutea and E. huxleyi (but not in Pavlova sp.), and in the two stramenopiles N. gaditana and P. tricornutum. This family is characterized by the presence of a LuxR DBD and is involved in processes such as signal transduction [97], quorum sensing [98] and sporulation [99]. One member of the LysR family was identified in N. gaditana. In cyanobacteria, this family is involved in CO2 fixation [100] and nitrate assimilation [101]. Finally, the SfsA family was identified in all algae except P. tricornutum. The SfsA TF is known to be involved in sugar fermentation [102].

So far, no genome-wide TF identification study has shown the presence of such sequences in microalgae, except for the BolA family in the chlorophyte C. reinhardtii, the diatom Thalassiosira pseudonana, the rhodophyte C. merolae and the cryptophyte Guillardia theta [2]. Since these TF families are found either in cyanobacteria or bacteria, their presence in the algal genomes could be explained either by an endosymbiotic gene transfer (EGT), which is a gene transfer taking place from the chloroplastic genome to the nuclear genome during evolutionary history [103, 104], or a horizontal gene transfer (HGT) from a prokaryotic organism to the algal genome [105].

Fungal TRF: fungus in algae

The TF families described above are of bacterial type, but TFs from the fungal TRF family (also called Zn-clus) were also identified. These TFs are abundant and well described in fungi [106]. Their DBD is characterized by a conserved Cys-X2-Cys-X6-Cys-X5–16-Cys-X2-Cys-X6–8-Cys motif. The six conserved cysteines coordinate two Zn(II) ions allowing correct folding of the domain [107]. This DBD was first identified in the Saccharomyces cerevisiae Gal4 TF [108]. Members of this TF family are implicated in the regulation of genes involved in diverse mechanisms, such as amino acid biosynthesis [109], multidrug resistance [110], ethanol catabolism [111] or lipid catabolism [112, 113].
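The six-cysteine spacing described above can be written directly as a regular expression. The following Python sketch is only an illustration of the motif itself: domain detection in this study relied on InterProScan and HMMER profiles, not on pattern matching, and the test sequence below is synthetic (alanine filler between the cysteines), not a real protein.

```python
import re

# Cys-X2-Cys-X6-Cys-X5..16-Cys-X2-Cys-X6..8-Cys, with X = any residue.
ZN_CLUSTER = re.compile(r"C.{2}C.{6}C.{5,16}C.{2}C.{6,8}C")


def find_zn_cluster(sequence):
    """Return (start, end) of the first motif match, or None if absent."""
    match = ZN_CLUSTER.search(sequence.upper())
    return (match.start(), match.end()) if match else None


# Synthetic sequence with spacings 2, 6, 6, 2 and 6 between the cysteines.
synthetic = ("MA" + "C" + "AA" + "C" + "A" * 6 + "C" + "A" * 6
             + "C" + "AA" + "C" + "A" * 6 + "C" + "KR")

# Same sequence with the fifth cysteine replaced by alanine.
mutant = synthetic[:22] + "A" + synthetic[23:]

print(find_zn_cluster(synthetic))
print(find_zn_cluster(mutant))
```

A plain pattern like this is far less sensitive and specific than a profile-based search, which is why it should be read as a description of the motif rather than as an identification method.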

Fungal TRF were identified in T. lutea, Pavlova sp., E. huxleyi, N. gaditana and P. tricornutum. However, no fungal TRF were identified in either C. reinhardtii or in P. purpureum. In previous studies TFs from this family were identified in the rhodophyte G. sulfuraria [12, 15].

The presence of fungal type TFs in algal genomes is another illustration of the complex evolutionary history of algae [114]. The multiple endosymbioses that gave rise to algal diversity [18] were punctuated by numerous gene transfer events. These events comprise both EGT [115, 116] and HGT, such as the original case of HGT from bacteria to the plastid genome [117], or HGT from bacteria or archaebacteria to the nuclear genome [40, 105, 118, 119]. In these HGT events, the donor organism is prokaryotic, but interesting cases of HGT from a fungus to an alga were recently shown [120]. All these gene transfers give rise to metabolic and regulatory diversity, leading to the adaptation of algae to a wide variety of environments and conditions.

Conclusion

Using a pipeline with very good sensitivity and PPV for both plant and cyanobacterial TFs, we undertook the first genome-wide identification of TFs in haptophytes, coupled with a comparison of TF content between haptophytes and other algal lineages. The identification highlighted the presence of cyanobacterial TFs in algal nuclear genomes, which are likely to originate from either an EGT or an HGT. Moreover, members of the Fungal TRF family were identified in T. lutea, Pavlova sp., E. huxleyi, P. tricornutum and N. gaditana. The presence of fungal type TFs in algal genomes also illustrates the complex evolutionary history of these organisms. This comparison study confirms and extends the lineage-specific features highlighted between haptophytes and stramenopiles by previous work [9] and broadens the panel of genomes used for this comparison (Fig. 6). To investigate the evolutionary history of organisms through genome-wide studies, some gaps in the available genomic data need to be filled, and the red algae are one of them. The only two red algae used so far in this kind of study are the two extremophiles G. sulfuraria and C. merolae. The extreme environmental pressures they face make these two algae peculiar cases that should not be considered representative of the red lineage. Here, we used the mesophilic species P. purpureum. Genomic data from haptophytes are also lacking. In this study, we provide the first genomic data for T. lutea. The characteristics revealed include some clues consistent with the hypothesis of an endosymbiosis of green and red algae in the evolutionary history of haptophytes and stramenopiles [19]. Therefore, this work provides a basis to better understand gene regulation in T. lutea, which is a species of ecological interest as part of the haptophytes, a diverse and often ecologically dominant group in the planktonic photic realm [121].

Fig. 6 Expansion, gain and loss of TF families during the evolutionary history of microalgae

Methods

Source datasets

The predicted proteomes used in this study were downloaded from different sources (Additional file 1: Table S1). The C. reinhardtii CC-503 cw92 mt+, P. tricornutum CCAP1055/1 and E. huxleyi CCMP1516 predicted proteomes were downloaded from the JGI genome portal at http://genome.jgi.doe.gov/. The N. gaditana CCMP526, Pavlova sp. CCMP459 and P. purpureum DBLAB2 predicted proteomes were downloaded from http://nannochloropsis.genomeprojectsolutions-databases.com/, http://data.imicrobe.us/project/view/104 and http://cyanophora.rutgers.edu/porphyridium/, respectively. The genome of the T. lutea CCAP927/14 strain was recently sequenced and annotated in our laboratory (data not shown). Raw read data are available at SRA (RUN: SRR3156597).

Identification and classification of transcription factors

The TF identification and classification pipeline was calibrated with the model plant A. thaliana (TAIR 10). Overall, the pipeline uses two strategies: (1) a similarity search with BLAST software against a self-built database of known TFs from algae, A. thaliana, Saccharomyces cerevisiae and cyanobacteria; (2) identification of TF DBDs with InterProScan and HMMER software. Combining the results of these programs allowed us to obtain a list of putative TFs (Fig. 1).

Construction of a TF database for BLAST software

The TF database is composed of TFs from different organisms (the model plant A. thaliana; the green algae Bathycoccus prasinos, Chlorella sp., Coccomyxa sp., Micromonas pusilla, Micromonas sp., Ostreococcus lucimarinus, Ostreococcus sp., Ostreococcus tauri and Volvox carteri; the red algae Cyanidioschyzon merolae and Galdieria sulfuraria; the diatom Thalassiosira pseudonana and the yeast Saccharomyces cerevisiae). These sequences were retrieved from online databases (Additional file 2: Table S2). Since algae originate from the engulfment of a cyanobacterium-like organism by a primitive eukaryotic heterotroph, we added all cyanobacterial TFs of the cTFbase [8] to the self-built database.

Identification of protein functional domains

Each protein domain contained in the protein domain databases is stored as a Hidden Markov Model (HMM) and linked to a putative function. This statistical method computes a matrix based on the multiple alignments of a protein domain [122]. For functional domain annotation of all the predicted proteomes, we employed InterProScan 5 version 5.4-47.0 [123], which uses a consortium of eleven protein domain databases (PROSITE, HAMAP, Pfam, PRINTS, ProDom, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, CATH-Gene3D and PANTHER). However, twelve DBDs (G2-like, BELL, HD-ZIP, HRT, NF-YB, NF-YC, SAP, STAT, Trihelix, VOZ, WOX and VARL) are not supported by the eleven databases of the consortium. HMMs for these DBDs were built with HMMER3, v3.1b1 [124], from the multiple alignments available in the TF databases PlantTFDB [13] and PlnTFDB [12].

Pipeline description

First step

Sequences of each predicted proteome were analyzed in parallel by HMMER (hmmscan, default parameters) and InterProScan (default parameters) for protein functional domains, and by BLAST (e-value threshold 10^-10) for a similarity search against known TFs (Fig. 1).

Second step

The results of each software analysis were filtered using different homemade PERL scripts. For InterProScan, false positives were filtered out by keeping only annotated domains with an e-value lower than or equal to 10^-3. Among these, only TF DBDs were retained. For HMMER, filtering was based on the score value: sequences with a significant hmmscan match (according to the database thresholds) were added as TF candidates. For BLAST searches, hits were filtered with an identity percentage threshold of 35 % and an alignment length threshold of 100 residues. Then, the best BLAST hit was taken for each query. Finally, the results of all software processes were combined in one file.
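The BLAST filtering described above can be sketched in Python as follows. This is an illustrative sketch, not the PERL scripts used in this study. The Hit record and the best_hits function are hypothetical names, the thresholds are assumed to be inclusive minima, and "best hit" is assumed to mean the highest bit score, which the text does not specify.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One BLAST hit (hypothetical record layout)."""
    query: str
    subject: str
    identity: float  # percent identity
    length: int      # alignment length in residues
    bitscore: float


MIN_IDENTITY = 35.0  # identity threshold stated in the text
MIN_LENGTH = 100     # alignment length threshold stated in the text


def best_hits(hits):
    """Keep hits passing both thresholds, then one best hit per query."""
    best = {}
    for hit in hits:
        if hit.identity < MIN_IDENTITY or hit.length < MIN_LENGTH:
            continue
        current = best.get(hit.query)
        # Assumption: the best hit is the one with the highest bit score.
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


# Invented example hits, for illustration only.
example_hits = [
    Hit("p1", "tfA", 40.0, 150, 200.0),
    Hit("p1", "tfB", 60.0, 120, 310.0),
    Hit("p2", "tfC", 34.9, 300, 500.0),  # fails the identity threshold
    Hit("p3", "tfD", 80.0, 99, 90.0),    # fails the length threshold
]
print(best_hits(example_hits))
```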

Third step

Once identified, putative TFs were classified into specific families according to their DBD(s). We used a compilation of the “family assignment rules” described by the web databases PlantTFDB [13], PlnTFDB [12] and cTFbase [8], as well as previous studies [9, 11]. A PERL script was used to automatically classify the putative TFs in families following the assignment rules.
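As an illustration of rule-based family assignment, the Python sketch below assumes that each rule lists the domains a family requires and the domains it forbids. The rule table, the domain names and the classify function are hypothetical placeholders, not the actual rules of PlantTFDB, PlnTFDB or cTFbase, and not the PERL script used here.

```python
# Hypothetical rules: family -> (required domains, forbidden domains).
RULES = {
    "FamilyA": ({"DBD_A"}, set()),
    "FamilyB": ({"DBD_B"}, {"DBD_A"}),
    "FamilyAB": ({"DBD_A", "DBD_C"}, set()),
}


def classify(domains, rules=RULES):
    """Return the sorted families whose rule is satisfied by a domain set."""
    found = set(domains)
    return sorted(
        family
        for family, (required, forbidden) in rules.items()
        if required <= found and not (forbidden & found)
    )


print(classify(["DBD_A", "DBD_C"]))
print(classify(["DBD_A", "DBD_B"]))
```

A protein can satisfy several rules in this sketch, so a real classifier would also need a precedence order between families, which is left out here.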

Final step

Manual curation was necessary, in particular for three complex cases. (1) MYB: the calibration stage revealed that filtering on the e-value generated false negatives. To overcome this, MYB identification was performed using the same protocol but without the e-value validation step on the InterProScan results. Moreover, each candidate was manually inspected (BLAST) to confirm each MYB domain and to classify putative TFs into the appropriate family (MYB-3R, MYB-2R or MYB-related). (2) G2-like: because the InterProScan databases lack a G2-like domain and this domain closely resembles the MYB-SHAQKYF domain, cross-annotation between these two domains was manually checked using HMMER. (3) TF families characterized by the repetition of a single domain: for proteins identified as belonging to the DBB and AP2/ERF families, the presence of two or more B-Box or AP2/ERF domains, respectively, was verified.

Evaluation of pipeline accuracy

To estimate the accuracy and reliability of our identification method, we applied our pipeline to the predicted proteome of A. thaliana (TAIR 10) and compared the identification of eleven well-annotated families to published datasets [13], used as a gold standard. For the identification of cyanobacterial TFs, we applied our pipeline to the predicted proteomes of Synechocystis sp. PCC 6803 (GenBank Assembly: GCA_000009725.1), Synechococcus sp. CC9605 (downloaded from CyanoBase) and Nostoc punctiforme PCC73102 (GenBank Assembly: GCA_000020025.1) and compared our prediction results with published data [8]. The accuracy was evaluated by the measurement of sensitivity:

$$ \frac{True\ positives}{True\ positives+ False\ negatives} $$

and Positive Predictive Value (PPV):

$$ \frac{True\ positives}{True\ positives+ False\ positives} $$

A sensitivity value of less than one means inclusion of false negatives and a PPV of less than one means inclusion of false positives.
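The two measures can be computed from a set of predicted TFs and a gold-standard set, as in the following Python sketch. The function name and the gene identifiers are hypothetical and used only for illustration.

```python
def sensitivity_ppv(predicted, reference):
    """Return (sensitivity, PPV) for predicted IDs against a gold standard."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    false_neg = len(reference - predicted)   # gold-standard TFs that were missed
    false_pos = len(predicted - reference)   # predictions absent from the gold standard
    sensitivity = true_pos / (true_pos + false_neg) if reference else float("nan")
    ppv = true_pos / (true_pos + false_pos) if predicted else float("nan")
    return sensitivity, ppv


# Invented identifiers: three shared, one missed, one spurious.
example_predicted = {"g1", "g2", "g3", "g5"}
example_reference = {"g1", "g2", "g3", "g4"}
print(sensitivity_ppv(example_predicted, example_reference))
```

In this example both values are 3/4, reflecting one false negative (g4) and one false positive (g5).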

Availability of data and material

The datasets supporting the conclusions of this article are included within the article (and its additional files).