Background

Dinoflagellates are a diverse group of unicellular microalgae that are ubiquitous in marine and freshwater environments. In coral reefs, dinoflagellates of the family Symbiodiniaceae are the predominant photosynthetic symbionts of cnidarians (e.g. corals, sea anemones and jellyfish), giant clams, sponges and other microorganisms including foraminiferans and ciliates [1]. Symbiodiniaceae can contribute more than 90% of their carbon fixed via photosynthesis, to meet the energetic needs of the coral host [2].

Coral reef ecosystems worldwide are under severe threat from warming oceans and increased human activities in coastal areas [3]. A modest episodic increase in ocean surface temperature in this environment can result in oxidative damage and the decoupling of carbon flow between the symbiont and the host. Specifically, breakdown of the coral-dinoflagellate symbiosis (i.e. coral bleaching) puts the coral host at risk of starvation, disease and eventual death [4,5,6]. The global coral bleaching event between 2014 and 2017 is the longest on record, and episodic mass bleaching events continue to occur [7, 8]. Conservation strategies are urgently needed to maintain and restore existing coral reefs. The design of such interventions requires a multi-pronged approach to understand the role of each biotic component to sustain a healthy, resilient symbiosis [9,10,11]. Genomic resources have proven useful to inform conservation efforts [12], but genome-scale data for the coral reef symbionts remain scarce. The dearth of genome data from Symbiodiniaceae is explained by the relatively large sizes (1–5 Gbp) [13, 14], and complex, atypical structure of dinoflagellate genomes and chromosomes [15, 16].

The genetic diversity of Symbiodiniaceae can be explained by natural selection acting on genomes involved in a broad spectrum of symbiotic associations that vary in host specificity, transmission mode and permanence in hospite [17, 18] as well as by stochastic forces that can lead to genetic drift [19]. Symbiosis, or the lack thereof, has been implicated in the genome evolution of Symbiodiniaceae [20]. Most symbiotic species are thought to be facultative to some extent, with the potential to shift between a free-living motile stage (mastigote) and a spherical symbiotic stage (coccoid). The genomes of facultative and recent intracellular bacterial symbionts are usually dynamic, characterised by extensive structural rearrangements, intensified activity of transposable elements (TEs) and increased gene duplication that leads to the accumulation of pseudogenes [21, 22]. Symbiotic Symbiodiniaceae, predominantly facultative, are expected to display similar genomic features in contrast to free-living taxa; the latter group includes species that have thus far been found only in environmental samples and, in laboratory experiments, fail to successfully infect potential hosts [23, 24]. Based on current taxonomic classification, Family Symbiodiniaceae contains the greatest number of described species within the phylogenetically distinct Order Suessiales [25, 26] that also includes other free-living taxa such as Polarella [27] and Sphaerodinium [28].

Here, we generated draft genome assemblies from seven members of the genus Symbiodinium: two free-living, one opportunistic and four symbiotic isolates that represent distinct lifestyles of Symbiodiniaceae. In combination with other available data, we systematically compared whole-genome sequences of 15 dinoflagellate taxa (of which 13 are Symbiodiniaceae) to assess the divergence and genetic diversity of Symbiodiniaceae relative to those within the single genus of Symbiodinium. We uncovered extensive genome sequence divergence within Symbiodinium that is comparable to that among different genera of Symbiodiniaceae and gene families that may contribute to niche adaptation. This genetic diversity likely translates into more-complex interactions than previously thought between coral (and other) hosts and Symbiodiniaceae symbionts.

Results

High genome divergence among Symbiodiniaceae taxa

We generated de novo genome assemblies for seven Symbiodinium isolates encompassing distinct lifestyles; two are hybrid assemblies incorporating both short- and long-read sequence data (Table 1 and Additional file 3: Supplementary Table 1). We included in our analysis available genome assemblies from six other Symbiodiniaceae (largely derived from short-read data) and two hybrid assemblies from the outgroup species Polarella glacialis (all in Order Suessiales; Additional file 3: Supplementary Table 2), totalling 15 dinoflagellate genomes. We assessed genome sequence similarity based on pairwise alignment of whole-genome sequences (see “Methods”). In each pairwise comparison, we assessed the overall percentage of the query genome sequence that aligned to the reference (Q), and the average percent identity of the reciprocal best one-to-one aligned sequences (I). Our results revealed extensive sequence divergence among these genomes at the order (Suessiales), family (Symbiodiniaceae) and genus (Symbiodinium) levels (Fig. 1a). As expected, genome pairs that exhibit the highest sequence similarity are isolates from the same species, e.g. between S. microadriaticum CassKB8 and 04-503SCI.03 (Q = 87.44%, I = 99.72%; CassKB8 as query), and between the two P. glacialis isolates (Q = 97.10%, I = 98.59%; CCMP1383 as query). In contrast, genome sequences of the two S. tridacnidorum isolates appear more divergent (Q = 30.07%, I = 87.18%; CCMP2592 as query). Remarkably, some genomes within Symbiodinium are as divergent as those of distinct genera: for instance, Q = 1.10% and I = 91.88% for S. pilosum compared against S. natans as reference, and Q = 1.03% and I = 92.15% for S. tridacnidorum CCMP2592 against Cladocopium sp. C92 (Fig. 1a).

Table 1 The seven Symbiodinium isolates for which de novo genome assemblies were generated in this study
Fig. 1
figure 1

Genome divergence among Symbiodiniaceae. a Similarity between Symbiodiniaceae (and the outgroup P. glacialis) based on pairwise whole-genome sequence alignments. The colour of the square depicts the average percent identity of the best reciprocal one-to-one aligned regions (I) between each genome pair and the size of the square is proportional to the percent of the query genome that aligned to the reference (Q), as shown in the legend. The tree topologies on the left and bottom indicate the known phylogenetic relationship [26] among the isolates. Isolates in Symbiodinium are highlighted in grey, and their comparisons are highlighted in a bounded box. b Neighbour-joining tree based on 21-mers shared by genomes of Suessiales; branch lengths are proportional to the estimated distances. The shortest and longest distances (d) in the tree, as well as average distances (δ) among representative clades are shown following the bottom-left colour code. ‘Clade BCF’: clade including B. minutum, the two Cladocopium isolates, and F. kawagutii

Scalable, alignment-free phylogenetic approaches [29], which bypass multiple sequence alignment and the computationally intensive tree-inference step, have been shown to yield accurate phylogenetic relationships among thousands of genome sequences from Bacteria and Archaea [30]. Adopting the approach by Bernard et al. [30], we assessed the proportion of shared k-mers (short, sub-sequences of defined length k) between each pair of genomes (optimised k = 21; see “Methods”) and calculated a pairwise distance (d) (Additional file 3: Supplementary Table 3). These distances were used to derive the phylogenetic relationship of the genomes as a neighbour-joining (NJ) tree (Fig. 1b) and as similarity networks at different similarity thresholds T (Additional file 2: Supplementary Figure 1). Symbiodinium isolates are about as distant from the other Symbiodiniaceae (mean\( \delta \) = 7.24) as they are from the outgroup P. glacialis (\( \delta \) = 7.23). The free-living S. natans [31] and S. pilosum are the most divergent from all others in the genus (d > 4.50 between either of them and one other Symbiodinium taxon; Additional file 3: Supplementary Table 3). The distance between S. natans and S. pilosum (d = 5.64) is similar to that between F. kawagutii and C. goreaui (d = 5.75) that are members of distinct genera. The genome sequence divergence among Symbiodinium isolates is further supported by the proportion of mapped sequence reads (Additional file 2: Supplementary Figure 2).

We used the same gene prediction workflow, customised for dinoflagellates [32, 33], for the seven Symbiodinium genome assemblies generated in this study and for the eight other assemblies included in our analyses (Additional file 3: Supplementary Table 4). To further assess genome divergence, we identified conserved synteny based on collinear syntenic gene blocks (see “Methods”). Figure 2 illustrates the gene blocks shared between all possible genome pairs. S. microadriaticum CCMP2467 and S. tridacnidorum CCMP2592 share the largest number of gene blocks (853 implicating 8589 genes). The 15 genome assemblies analysed here vary in sequence contiguity (Table 1, Additional file 3: Supplementary Table 2). However, the number of core conserved eukaryote genes identified in each assembly is comparable (mean 68.96% of BUSCO eukaryote_odb9 genes; Additional file 2: Supplementary Figure 3), suggesting similar data completeness among the genes that were predicted using a consistent workflow. Although we cannot dismiss the impact of assembly contiguity and completeness on our observations here (and results from the whole-genome alignment and k-mer analyses above), these results provide an overview of genome divergence at the level of species, genus and family.

Fig. 2
figure 2

Conserved synteny of Suessiales genomes. Number of collinear syntenic gene blocks shared by pairs of Suessiales genomes. Gene blocks shared by more than two isolates are not shown

Figure 3a shows the composition of repeats for each of the 15 genomes. The repeat composition of the P. glacialis genome is distinct from Symbiodiniaceae, largely due to the prevalence of simple repeats [33]. To assess sequence divergence of repeats in each genome, Kimura distances [34], in which DNA transitions and transversions are assigned distinct substitution rates, were calculated for each repeat type (see “Methods”). Long interspersed nuclear elements (LINEs) in Symbiodiniaceae and in P. glacialis are highly divergent, with the Kimura distance centred between 15 and 40; these elements likely represent remnants of LINEs from an ancient expansion pre-dating the diversification of Suessiales [33, 35]. The proportion of these elements is substantially larger in the genomes of Symbiodinium and P. glacialis (the outgroup) than in those of other Symbiodiniaceae (Fig. 3b). For instance, LINEs comprise between 74.10 Mbp (S. tridacnidorum Sh18) and 96.90 Mbp (S. linucheae) of each of the Symbiodinium genomes, except in S. pilosum, where they encompass nearly twice as much in the genome (171.31 Mbp). In comparison, LINEs comprise, on average 7.49 Mbp of the genomes of the other Symbiodiniaceae (Additional file 2: Supplementary Figure 4). This implies that the remnants of LINEs were lost in the other Symbiodiniaceae genera. Interestingly, most LTRs and DNA transposons (recovered only in the hybrid assemblies incorporating both short- and long-read sequence data; Fig. 3b) are largely conserved (Kimura distance < 5), suggesting that they may remain active. The genome of S. pilosum has a nearly twofold increased abundance of LINEs, and a nearly twofold larger genome size estimate (1.99 Gbp) than other Symbiodinium genomes (Table 1 and Additional file 2: Supplementary Figure 4). Although this suggests whole-genome duplication or potential diploidy, we found no evidence to support either scenario (Additional file 2: Supplementary Figure 5). The prevalence of repetitive regions in S. pilosum, however, would explain in part why the total assembled bases of the genome constitute only 54.64% of the estimated genome size (Table 1). Our results reveal high sequence and structural divergence, and variable genome sizes among these Symbiodinium taxa, and compared to other Symbiodiniaceae, a distinct composition of repeats.

Fig. 3
figure 3

Repeat composition of Suessiales genomes. a Percentage of sequence regions comprising the major classes of repetitive elements, shown for each genome assembly analysed in this study. b Interspersed repeat landscape for each assembled genome. Both a and b follow the colour code shown in the legend. ‘No repeats’ refers to non-repetitive regions of the genome, ‘Unknown’ represents repeats that are not classified into any known types in the RepeatMasker database, including novel repeats

Gene features of Symbiodiniaceae genomes

Differences among predicted genes of Symbiodiniaceae have been attributed to phylogenetic relationships and to the implementation of distinct gene prediction approaches [32]. Our principal component analysis (PCA) approach, based on metrics of genes (Additional file 3: Supplementary Table 4) that were predicted using the same workflow (see “Methods”), revealed substantial variation within the genus Symbiodinium (Fig. 4a), and the contributions of the distinct metrics to this variation (Fig. 4b). We noted that the observed variation in gene architecture can be associated with one or more factors: (1) phylogenetic relationship, (2) the type of sequence data used for genome assembly and the consequent assembly quality and (3) lifestyle of the isolates. The variation resulting from the phylogenetic relationship among the genomes is illustrated by the separation of distinct genera along PC2 (explaining 24.82% of the variance). The metrics contributing the most to PC2 are associated with proportion of splice donors (Fig. 4b). The type of sequence data used for genome assembly and assembly quality is reflected along PC1 (explaining 42.79% of the variance). For instance, taxa for which hybrid assemblies were available (incorporating both short-read and long-read sequence data), i.e. the free-living S. natans and P. glacialis, and the symbiotic S. tridacnidorum CCMP2592, are distributed between −4.5 and 0.1 along PC1 (Fig. 4a). The distribution of the symbiotic Symbiodinium is limited (between 0.5 and 1.5 of PC1), with the exception of the two S. tridacnidorum isolates, for which the genome assemblies are of distinct quality, i.e. the high-quality hybrid assembly of CCMP2592 (Table 1) compared to the draft assembly of Sh18 that is fragmented (Additional file 3: Supplementary Table 2) and incomplete (Additional file 2: Supplementary Figure 3). In addition, the opportunistic S. necroappetens and free-living S. pilosum are distributed at > 2 along PC1 (Fig. 4a). These observations suggest that distinct lifestyles may contribute to differences in Symbiodiniaceae gene architecture.

Fig. 4
figure 4

Gene features in Symbiodiniaceae genomes. a Principal component analysis (PCA) based on metrics of predicted genes from the analysed 15 genomes. Data points are coloured by genus and shaped by lifestyle according to the legends to the right. Data points enclosed in a light blue area correspond to isolates with hybrid genome assemblies. Smi: S. microadriaticum, Sne: S. necroappetens, Sli: S. linucheae, Str: S. tridacnidorum, Sna: S. natans, Spi: S. pilosum, Bmi: B. minutum, Cgo: C. goreaui, Csp: Cladocopium sp. C92, Fka: F. kawagutii, Pgl: P. glacialis. Isolate name is shown in subscript for those species with more than one isolate. b Loading plot showing the contribution of the distinct gene metrics employed for the PCA to PC1 and to PC2

The predicted coding sequences (CDS) among Symbiodinium taxa exhibit biases of nucleotide composition at the different codon positions (Additional file 2: Supplementary Figure 6) and in codon usage (Additional file 2: Supplementary Figure 7). The G+C content among CDS (Additional file 3: Supplementary Table 4) and among third codon positions (Additional file 2: Supplementary Figure 6) varies slightly but is generally higher relative to the overall G+C content (Table 1 and Additional file 3: Supplementary Table 2). This is consistent with the results previously reported for genomes and transcriptomes of Symbiodiniaceae [36]. Of all Symbiodinium isolates, S. microadriaticum CassKB8 and 04-503SCI.03 contain the largest number of CDS with a strong codon preference; S. microadriaticum CCMP2467 has the least (Additional file 2: Supplementary Figure 7). These observations highlight the genetic variation not only within a single genus, but also within a single species.

Gene-family evolution

Using all 555,682 predicted protein sequences from the 15 genomes, we inferred 42,539 homologous protein sets of size ≥ 2 (see “Methods”); here, we refer to these sets as gene families. Of these families, 2500 (5.88%) contain genes from all 15 Suessiales isolates, 406 are exclusive and common to all 13 Symbiodiniaceae isolates, and 193 are exclusive and common to all nine Symbiodinium isolates (Fig. 5a). Remarkably, predicted proteins coded by the vast majority of these conserved, lineage-specific genes (> 95% of almost all core genes at each internal node within Symbiodinium in Fig. 5a) do not share significant sequence similarity to any protein sequences in the UniProt database. We consider these as “dark” genes, i.e. they are either novel or too highly diverged to identify putative homologues in existing data; some families of dark genes appear to be specific to some isolates (Fig. 5b). Although proteins of S. microadriaticum CCMP2467 are available in UniProt (to which we expected many Symbiodinium proteins would share significant sequence similarity), the proportion (> 90%) of dark genes among the core genes of Symbiodinium remains higher than those among the core genes for other genera, e.g. the node encompassing Breviolum, Cladocopium and Fugacium (83.56% dark genes; Fig. 5a). This result suggests that the function of core genes in Symbiodinium species is largely unknown and that the divergence of gene sequences has been more extensive among the Symbiodinium isolates than among the other Symbiodiniaceae genera. Although these observations also reflect the dearth of dinoflagellate sequences in UniProt (and public databases in general), the presence of highly divergent homologues implies lineage-specific functional innovation, lending support to the observation of conserved, lineage-specific dark genes in dinoflagellates [37, 38]. Conserved dark genes were also reported in core genes of other eukaryote lineages including green algae [39] and the haptophyte Emiliania [40].

Fig. 5
figure 5

Number of gene families and conserved dark genes in Suessiales. a Tree topology shows the phylogenetic relationship of the 15 Suessiales isolates, follows the species tree inferred based on 28,116 gene families containing four or more sequences from any isolates, rooted with P. glacialis as outgroup to Symbiodiniaceae. At each node, the total number of families that include genes from all diverging isolates (and not others) is shown; the proportion of dark genes among all genes in these families are shown as a pie chart. b Number of isolate-specific gene families for each isolate, showing the proportion of families that are dark. c Number of gene families specific to Symbiodiniaceae, Symbiodinium, and other genera

The species tree inferred using OrthoFinder (see “Methods”) based on the homologous protein families (Fig. 5a) is in complete topological congruence to the accepted Symbiodiniaceae phylogeny reconstructed using large subunit rRNA [26]. Our alignment-free tree (Fig. 1b) is also topologically congruent to the two phylogenies (in Fig. 5a and in [26]), except for the branching order of S. natans and S. pilosum basal to the Symbiodinium clade. This result demonstrates, in the case of Symbiodiniaceae and despite the extensive genome divergence, a strong consensus of phylogenetic signal exhibited by the different datasets (i.e. single marker gene, homologous protein families or whole-genome sequences), and the power of scalable alignment-free approaches for inferring phylogenetic relationships comprehensively using whole-genome sequences rather than a small subset of genes.

Of the 42,539 families, 18,453 (43.38%) contain genes inferred to be specific to Symbiodiniaceae at the permissive identity thresholds used. Interestingly, more (8828) gene families are specific to sequenced isolates of Symbiodinium than to sequenced isolates of all the other Symbiodiniaceae combined (2043 are specific to Breviolum, Cladocopium and Fugacium isolates; Fig. 5c). Although the simplest explanation is that substantially more gene families have been gained (or preserved) in Symbiodinium than in the other three genera, this may also be caused by the overrepresentation of this genus in our analysis. A previous study reported that substantially more gene families are specific to the clade encompassing Breviolum, Cladocopium and Fugacium (26,474) than are specific to Symbiodinium (3577) [36]. We consider our results here to be more reliable than those based largely on transcriptome data [36], in which gene numbers can be overestimated due to differential transcription, transcript isoforms, RNA editing and intrinsically shorter sequences. The smaller number of gene families shared among Symbiodiniaceae found here (i.e. 18,453 compared to 76,087 in González-Pech et al. [36]) reflects our more-conservative approach based on whole-genome sequence data, and our use of larger and/or higher-quality datasets. Nonetheless, our observations support the notion that the evolution of gene families has contributed to the diversification of Symbiodiniaceae [36].

Gene functions related to symbiosis and stress response are less abundant in symbiotic Symbiodiniaceae

We examined the functions annotated for the predicted genes from all 15 Suessiales genomes based on the annotated Gene Ontology (GO) terms and protein domains. A recent study, focusing on the transcriptomic changes in Cladocopium sp. following establishment of symbiosis with coral larvae [41], compiled a list of symbiosis-related gene functions in Symbiodiniaceae. We searched for these functions and assessed their relative abundance in each analysed genome (see “Methods”) (Additional file 2: Supplementary Figure 8). GO terms related to N-glycan processing, immune response, transmembrane transport and the metabolism of carbohydrate, nitrogen and lipid were recovered at lower relative abundance in the prominent symbiotic lineages (i.e. B. minutum, C. goreaui and Cladocopium sp. C92) than in others. Genes encoding various Pfam domain types of ankyrin and tetratricopeptide repeat show varied relative abundances in the different lineages, supporting the distinct functions of these domains in specific host-symbiont recognition [41, 42].

Compared to the symbiotic Breviolum and Cladocopium lineages, we observed a higher relative abundance of gene functions related to DNA damage repair and cell division in genomes of Symbiodinium, Fugacium and Polarella (Additional File 2: Supplementary Figure 9). This is consistent with results from earlier studies [36, 43], lending support for sexual reproduction [35, 44, 45] and recombination as a contributing factor to the genetic diversity of Symbiodiniaceae [46,47,48,49,50]. Sexual reproduction has been described in other dinoflagellates [51]. A recent study [45] revealed that Symbiodiniaceae have the genetic capacity for sexual reproduction, potentially via an alternative mechanism for chromosomal crossover. In the genomes of B. minimum and the two Cladocopium isolates, we also observed a lower relative abundance of GO terms and protein domains associated with stress response, photosynthesis, motility, and phagocytosis (Additional file 2: Supplementary Figure 9), suggesting there are fewer genes encoding these functions in the symbiotic lineages compared to other lineages that include species with a non-symbiotic lifestyle (i.e. Symbiodinium, F. kawagutii and the outgroup P. glacialis), or an evolutionary trajectory of genome reduction that is expected in obligate intracellular symbionts [20, 22]. The prevalence of cold-shock DNA-binding and bacteriorhodopsin domains in P. glacialis is consistent with the finding of highly duplicated gene families encoding these functions in the genome [33] and highlights the adaptation of this species to extreme cold and low-light environments.

We assessed the potential impact of organellar genome sequences in the seven de novo assembled Symbiodinium genomes on our results (see “Methods”). For each genome, we found no more than 0.63% of assembled genome sequences that are putatively plastid or mitochondrial, and no gene models were predicted in these sequences (Additional file 3: Supplementary Table 5). Therefore, the contribution of organelle DNA to our results is negligible.

Can symbiosis drive genome evolution in Symbiodiniaceae?

To inspect genome divergence potentially associated with a symbiotic lifestyle, we compared the two high-quality assemblies of (a) the symbiotic S. tridacnidorum [52] CCMP2592 that was isolated from a stony coral from the Great Barrier Reef and (b) the free-living S. natans [31] CCMP2548 (synonym HA3–5) isolated from sediment at the beach of Coconut Island, Hawaii [53]. Both of these genomes were generated using a combination of short- and long-read sequence data (Table 1 and Additional file 3: Supplementary Table 1). S. tridacnidorum encompasses isolates in ITS2-type A3 that are predominantly symbionts of giant clams in the Indo-Pacific [26]. Although the nature of this symbiosis is extracellular, S. tridacnidorum can also establish intracellular symbiosis with cnidarian hosts, both in experimental settings and in nature [52]. On the other hand, S. natans (the type species of the genus) is free-living. S. natans occurs frequently in environmental samples, exhibits a widespread distribution and, thus far, has not been shown to colonise cnidarian hosts [26, 31].

Compared to other genome assemblies generated using only short-read data (Table 1), these two genome assemblies provide better-resolved repetitive genomic regions, allowing for a more-in-depth comparative analysis of these regions. The estimated genome size is 1.29 Gbp for S. tridacnidorum and 0.74 Gbp for S. natans (Table 1 and Additional file 3: Supplementary Table 6); the latter is the smallest reported for any symbiodiniacean genome to date, although we note that accuracy for k-mer-based genome size estimation is sensitive to k length and the abundance of repeats in the data. The two genomes are highly dissimilar; we observed a low read-mapping rate (< 15%) of read pairs from one genome dataset against the genome assembly of the counterpart, and vice versa (Fig. 6a). Only 14.70 Mbp (1.33%) of the genome sequence of S. tridacnidorum aligned to 11.84 Mbp (1.55%) of S. natans at 90% identity or greater. Figure 6b shows the most-represented protein domains in the two genomes; gene functions related to methylation and retrotransposition, previously reported in other Symbiodiniaceae [54], are more abundant in S. tridacnidorum, whereas functions related to ion transport are more prominent in S. natans (Additional file 1: Supplementary Note [55,56,57] and Additional file 3: Supplementary Table 7).

Fig. 6
figure 6

Comparison of S. tridacnidorum and S. natans genomes. a Mapping rate of filtered read pairs generated for each species against the assembled genomes of itself and of the counterpart: S. tridacnidorum (St) versus S. natans (Sn). b Top ten most abundant protein domains recovered, sorted in decreasing relative abundance (from bottom to top) among proteins of St (left) and those of Sn (right). The abundance for each domain in both genomes is shown in each chart for comparison. Domains common among the top ten most abundant for both species are connected with a line between the charts. ‘MORN’: MORN repeat, ‘RCC1’: Regulator of chromosome condensation repeat, ‘RVT’: reverse transcriptase, ‘DUF’: domain of unknown function, ‘PPR’: pentatricopeptide repeat, ‘EFH’: EF-hand, ‘IonTr’: ion transporter, ‘Pkin’: protein kinase, ‘Ank’: ankyrin repeat, ‘DNAmet’: C-5 cytosine-specific DNA methylase. c Contribution of genomic features to the distinct composition of S. tridacnidorum and S. natans genomes, based on the ratio (Δ) of the total length of the implicated sequence region in S. tridacnidorum to the equivalent length in S. natans, shown in log2 scale. The ratio of the estimated genome sizes is shown as reference (marked with a dashed line). The untransformed Δ for each feature is shown in its corresponding bar. A genome feature with Δ greater than the reference likely contributed to the discrepancy of genome sizes. Bars are coloured based on the genome in which they are more abundant as shown in the legend. Pseudogenes are not included in this plot. d Volcano plot comparing gene-family sizes against Fisher’s exact test significance (p value). The colour of the circles indicates the species in which those gene families are larger according to the top-right legend; families recovered only in one genome but not in the other are not shown. The number of gene families with the same ratio and significance is represented with the circle size following the bottom-right legend. Filled circles represent size differences that are considered statistically significant (adjusted p ≤ 0.05)

We assessed the genome features distinct to each species that may have contributed to the discrepancy in genome size. Specifically, we assessed, for each feature, the ratio (Δ) of the total length of the implicated sequence regions in S. tridacnidorum to the equivalent length in S. natans (Fig. 6c). The genome size estimate for S. tridacnidorum is 1.74 times larger than that for S. natans (Additional file 3: Supplementary Table 6); we use this ratio as a reference for comparison. Most of the examined genome features span a larger region in the genome of S. tridacnidorum, as expected. The Δ for each inspected genic feature (even for exons and introns separately) approximates 1.74. However, six features related to duplicated genes and repetitive elements have Δ > 1.74. The abundance of repeats characteristic of TEs (such as LINEs and LTRs; Fig. 6c) suggests the enhanced activity of retrotransposition in S. tridacnidorum. In both genomes, we recovered genes that encode relicts of dinoflagellate spliced leader (DinoSL) sequences indicating retroposition of genetic elements. These genes encode various enzymatic functions (Additional file 2: Supplementary Figure 10); the presence of DinoSL relicts suggests the importance of high transcription activity of these genes in genome evolution. We recovered putative DinoSL sequences in both genomes (496 on 350 scaffolds of S. tridacnidorum, and 244 on 211 scaffolds of S. natans; see “Methods”), some in DinoSL or DinoSL-5S rRNA tandem arrays, as previously described in dinoflagellates [58]. We also recovered genes that encode the reverse transcriptase domains indicating retrotransposition (Additional file 3: Supplementary Tables 8 and 9). Thus, gene duplication and repeats likely expanded in S. tridacnidorum, and/or contracted in S. natans, contributing to the genome size discrepancy (Additional file 2: Supplementary Figure 11).

Tandem duplication of exons and genes is common in dinoflagellates [33, 59], and may serve as an adaptive mechanism to enhance functions relevant for their biology. Whereas in some dinoflagellates, genes in tandem arrays contain hundreds of copies, e.g. up to 5000 copies of the peridinin-chlorophyll a-binding protein (PCP) gene in Lingulodinium polyedra [60], these arrays are not as prominent in the genomes of S. tridacnidorum and S. natans (Additional file 2: Supplementary Figure 12), with the largest array comprising 10 and 13 gene copies, respectively. The length of duplicated gene blocks is markedly longer in S. tridacnidorum than in S. natans (Δ = 6.32; Fig. 6c). This observation, and the number of gene-block duplicates in each of the two species, suggests that segmental duplication has occurred more frequently (or alternatively, has been more preserved) in the course of genome evolution of S. tridacnidorum. We found 23 syntenic collinear blocks within the S. tridacnidorum genome (i.e. within-genome duplicated gene blocks) implicating 242 genes in total. Of these genes, 20 encode protein kinase functions (Additional file 3: Supplementary Table 10) that are associated with distinct signalling pathways. In comparison, only five syntenic collinear blocks implicating 62 genes were found in the S. natans genome; these genes largely encode functions of cation transmembrane transport, relevant for the maintenance of pH homeostasis. Ankyrin and pentatricopeptide repeats are common in the predicted proteins of duplicated genes in both genomes.

Using all predicted genes, we inferred 58,541 homologous gene families, of which only 16,663 (28.46%) contain genes from both genomes. Duplicated genes can experience distinct fates [61]. If the function remains the same or changes slightly (e.g. through subfunctionalisation), the duplicated gene sequences will remain similar, resulting in gene-family expansion. We assessed the difference in gene-family sizes between S. tridacnidorum and S. natans using Fisher’s exact test (see “Methods”) and considered those with an adjusted p ≤ 0.05 to be significantly different (Fig. 6d). We observe no strong evidence for gene-family expansion in S. tridacnidorum compared to S. natans; only 20 families are significantly larger, including OG0000004 that putatively encodes protein kinases and glycosyltransferases that are necessary for the biosynthesis of glycoproteins, and OG0000013 that encodes ankyrin and transport proteins (Additional file 3: Supplementary Table 11). These functions are important for the recognition of and interaction with the host among symbiodiniacean symbionts [41, 42]. In comparison, five gene families were significantly larger in S. natans than in S. tridacnidorum, of which one (OG0000003) encodes a sodium-transporter and another (OG0000034) a transmembrane protein. Many genes in the expanded families encode retrotransposition functions in both genomes, lending support to the contributing role of retrotransposons in gene-family expansion in Symbiodiniaceae [54] and more broadly in dinoflagellates [38].

If novel beneficial functions of the gene copies emerge (i.e. neofunctionalisation), the sequence divergence between gene copies may become too large to be recognised as the same family. This scenario could at least partially explain the higher number of single-copy genes exclusive to S. tridacnidorum (25,649, of which only 13,189 [51.42%] are supported by transcriptome evidence) than those exclusive to S. natans (16,137, of which 13,320 [82.54%] supported by transcriptome evidence). Although most dinoflagellate genes are known to be constitutively expressed regardless of exposure to abiotic stress [62], it remains unclear if the genes with no transcriptome support are functional genes, or not expressed in the conditions used for culture. However, the annotated functions for single-copy genes exclusive to each genome are similar in both species (Additional file 3: Supplementary Table 12), suggesting the presence of highly diverged homologues. In contrast, duplicated genes can also undergo loss of function (i.e. nonfunctionalisation or pseudogenisation). Pseudogene screening in both genomes (see “Methods”) identified 183,516 putative pseudogenes in S. tridacnidorum and 48,427 in S. natans. The nearly four-fold difference in the number of pseudogenes between the two genomes further supports the notion that duplication events [61] are more frequent in S. tridacnidorum and may explain the lower proportion of genes with transcript support in this species (Additional file 3: Supplementary Table 4).

These results suggest that the extensive genomic divergence between the symbiotic S. tridacnidorum and the free-living S. natans, including the discrepancy in genome sizes, is attributed to TEs, genetic duplication, structural rearrangements and pseudogenisation. These genome features are common in facultative and recent intracellular bacterial symbionts and parasites [21, 22], and expected in symbiotic Symbiodiniaceae [20]. The abundance of pseudogenes and larger genome size of the symbiotic species suggest the lack of the evolutionary pressure to purge the excessive genetic content during prolonged symbiotic associations. This observation contrasts with the smaller genomes of parasitic dinoflagellates, e.g. the genome of Amoebophrya ceratii (87.7 Mbp) that encodes fewer (19,925) genes [63]. Additional high-quality genome data from free-living and symbiotic taxa are required to gain a clearer understanding of the evolutionary transition(s) between free-living and symbiotic lifestyles in Symbiodiniaceae, and the impacts of symbiosis on genome evolution in these taxa (see Additional file 1: Supplementary Note).

Discussion

Our results reveal high sequence and structural divergence among genomes of Symbiodinium, and more broadly of the family Symbiodiniaceae. Genomic divergence in microbial eukaryotes has been associated with adaptation to specialised or harsh environments [33, 64], lifestyle specialisation [63] and the capacity to inhabit diverse environments [40]. Although sexual recombination likely contributes to the extensive genetic diversity of the family Symbiodiniaceae [45, 46, 50, 65], its limitation to highly similar (likely homologous) sequence regions renders unlikely its role as the sole driver of genome divergence. The evolutionary transition from a free-living to a symbiotic lifestyle can contribute to the loss of conserved synteny as consequence of large- and small-scale structural rearrangements [20]. The enhanced activity of mobile elements in the early stages of this transition can further disrupt synteny, impact gene structure and accelerate mutation rate [66, 67]. However, S. natans and S. pilosum, for which the free-living lifestyle appears to be ancestral, are still quite diverged from each other (Additional file 2: Supplementary Figure 1). Ancient events, such as changes of sea level or climate, are thought to have influenced diversification of Symbiodiniaceae [26, 68, 69] and may help explain the divergence of the extant lineages. These events may have severely reduced effective population size, thereby weakening selection, and allowing genetic drift to impact genome structure. For example, in a hypothetical scenario, a drastic drop of the sea level could have split the ancestral Symbiodiniaceae population into multiple sub-populations with very small population sizes. This could have enabled rapid genome divergence among the sub-populations that, in turn, evolved and diversified independently into the extant taxa. However, the capacity of dinoflagellates to form cysts under stress [70] could have counteracted this effect by facilitating dispersal. Dinoflagellates are estimated to have diversified ~190 million years ago (MYA) [71]. The diversification time of Symbiodiniaceae, recently revised at ~160 MYA [26], corresponds to the adaptive radiation of reef-building (scleractinian) corals ~240 MYA [72] and supports the notion that symbiosis impacted genome evolution of Symbiodiniaceae. Nevertheless, other evolutionary processes (e.g. natural selection, gene loss and/or lateral genetic transfer) contributing to genomic divergence in these taxa remain to be systematically investigated.

The evolutionary mechanisms underscoring Symbiodiniaceae diversity have been examined in earlier studies based on host specificity, coevolution, biogeography, ecology and marker-based phylogenies [18, 26, 50, 73,74,75]. Our work, based on whole-genome sequence data, emphasises how much genomic diversity can lie hidden beneath a “simple” morphology that may differ only subtly, for example, with respect to plate arrangements on the cell surface of Symbiodiniaceae [26]. Beyond neutral evolution, symbioses with corals and other hosts may be driving both morphological uniformity and massive genome divergence in this family. Understanding the evolutionary mechanisms that facilitate genome rearrangements and gene-function innovation in dinoflagellate symbionts remains an important step in evolutionary studies of the coral holobionts. The astonishing number of dark genes in Symbiodinium species signifies that, with available protein annotations, we are addressing only the tip of the iceberg of divergent and/or novel gene functions in these dinoflagellates. To reconstruct genomic changes and novel gene origins more robustly, the divergence of whole-genome sequences among Symbiodinium should be considered in the light of known and yet-to-be discovered biodiversity of this genus.

Conclusions

Our results reveal that, among these dinoflagellates, the genome architecture that underlies symbiotic associations with corals and other organisms is very diverse. Therefore, a one-size-fits-all approach with Symbiodiniaceae for engineering environmentally robust corals may prove ineffective. A combination of in situ and culture-based studies are needed to address these and other outstanding questions about the evolution of dinoflagellate symbionts of coral reefs, both with respect to genome architecture and innovation of novel gene functions, and about the resilience of coral-dinoflagellate symbioses in changing environments. The wealth of data and insights we have generated elucidate how symbiosis may underpin molecular mechanisms that drive genome evolution and divergence of Symbiodiniaceae. In the future, these insights could be used to reconstruct lifestyle impact on genome evolution of other microbial eukaryotes that presently exist as obligate or facultative symbionts or may once have been in their evolutionary history.

Methods

Symbiodinium cultures

Single-cell monoclonal cultures of S. microadriaticum CassKB8 and S. microadriaticum 04-503SCI.03 were acquired from Mary Alice Coffroth (Buffalo University, New York, USA), and those of S. natans CCMP2548, S. tridacnidorum CCMP2592, S. necroappetens CCMP2469, S. linucheae CCMP2456 and S. pilosum CCMP2461 were purchased from the National Center for Marine Algae and Microbiota at the Bigelow Laboratory for Ocean Sciences, Maine, USA (Table 1). The cultures were maintained in multiple 100-mL batches (in 250-mL Erlenmeyer flasks) in f/2 (without silica) medium (0.2 mm filter-sterilised) under a 14:10 h light-dark cycle (90 μE/m2/s) at 25 °C. The medium was supplemented with antibiotics (ampicillin [10 mg/mL], kanamycin [5 mg/mL] and streptomycin [10 mg/mL]) to reduce bacterial growth.

Nucleic acid extraction

Genomic DNA was extracted following the 2×CTAB protocol with modifications. Symbiodinium cells were first harvested during exponential growth phase (before reaching 106 cells/mL) by centrifugation (3000g, 15 min, room temperature (RT)). Upon removal of residual medium, the cells were snap-frozen in liquid nitrogen prior to DNA extraction, or stored at −80 °C. For DNA extraction, the cells were suspended in a lysis extraction buffer (400 μL; 100 mM Tris-Cl pH 8, 20 mM EDTA pH 8, 1.4 M NaCl), before silica beads were added. In a freeze-thaw cycle, the mixture was vortexed at high speed (2 min), and immediately snap-frozen in liquid nitrogen; the cycle was repeated 5 times. The final volume of the mixture was made up to 2% w/v CTAB (from 10% w/v CTAB stock; kept at 37 °C). The mixture was treated with RNase A (Invitrogen; final concentration 20 μg/mL) at 37 °C (30 min) and Proteinase K (final concentration 120 μg/mL) at 65 °C (2 h). The lysate was then subjected to standard extractions using equal volumes of phenol:chloroform:isoamyl alcohol (25:24:1 v/v; centrifugation at 14,000g, 5 min, RT) and chloroform:isoamyl alcohol (24:1 v/v; centrifugation at 14,000g, 5 min, RT). DNA was precipitated using pre-chilled isopropanol (gentle inversions of the tube, centrifugation at 18,000 g, 15 min, 4 °C). The resulting pellet was washed with pre-chilled ethanol (70% v/v), before stored in Tris-HCl (100 mM, pH 8) buffer. DNA concentration was determined with NanoDrop (Thermo Scientific), and DNA with A230:260:280 ≈ 1.0:2.0:1.0 was considered appropriate for sequencing. Total RNA was isolated from Symbiodinium cells in culture (4 weeks post-inoculum, at exponential growth) using the RNeasy Plant Mini Kit (Qiagen) following directions of the manufacturer. RNA quality and concentration were determined using Agilent 2100 BioAnalyzer.

Genome sequence data generation and de novo assembly

All genome sequence data generated for all seven Symbiodinium isolates are detailed in Additional file 3: Supplementary Table 1. Short-read sequence data (2 × 150 bp reads, insert length 350 bp) were generated using paired-end libraries on the Illumina HiSeq 2500 and 4000 platforms at the Australian Genome Research Facility (Melbourne) and the Translational Research Institute Australia (Brisbane). Some of the paired-end libraries (insert length 250 bp) were designed such that the read pairs of 2 × 150 bp would overlap. Quality assessment of the raw paired-end data was done with FastQC v0.11.5, and subsequent processing with Trimmomatic v0.36 [76]. To ensure high-quality read data for downstream analyses, the paired-end mode of Trimmomatic was run with the settings: ILLUMINACLIP:[AdapterFile]:2:30:10 LEADING:30 TRAILING:30 SLIDING WINDOW:4:25 MINLEN:100 AVGQUAL:30; CROP and HEADCROP were run (prior to LEADING and TRAILING) when required to remove read ends with nucleotide biases. Overlapping read pairs (from the library with insert size of 250 bp) were merged with FLASH v1.2.11 [77]. Library adapters from the mate pair data were removed with NxTrim v0.41 [78]. De novo genome assembly was performed for all isolates with CLC Genomics Workbench v7.5.1 (https://digitalinsights.qiagen.com/) at default parameters and using the filtered read pairs and single-end reads. These genome assemblies were further scaffolded with transcriptome data (where applicable; see below) using L_RNA_scaffolder [79].

Long-read sequence data for S. natans and S. tridacnidorum were generated on a PacBio Sequel system at the Ramaciotti Centre for Genomics (Sydney). In combination, these data and the paired-end libraries (adding up to a coverage of 152-fold for S. natans and 200-fold for S. tridacnidorum) were incorporated in hybrid de novo genome assembly with MaSuRCA 3.3.0 [80], following the procedure described in the manual. PacBio subreads were filtered to a minimum length of 5 kb, and all other sequence data were used as input without being pre-processed, as recommended [80]. The genome assemblies were further scaffolded with transcriptome data generated in this study (see below) using L_RNA_scaffolder [79]. These hybrid genome assemblies are more contiguous than the draft assemblies generated using only short-read sequence data (above), thus are of higher quality. Short sequences (< 1000 bp) were removed from all assemblies.

Estimation of genome size and ploidy

Genome size and sequence read coverage were estimated based on the k-mer frequency in short-read data, following earlier studies [33, 35]. Briefly, for each genome, k-mers were enumerated from the Illumina paired-end reads using Jellyfish v2.2.6 [81], independently for k = 17, 19, 21, 23, 25, 27, 29 and 31. For each k, genome size is estimated by dividing the total number of observed k-mers by the maximum fold-coverage of the k-mers as determined from the frequency distribution peak. The final genome size estimate is the mean size estimate derived from the different k values. Ploidy of a genome can also be estimated based on the frequency distribution of k-mers of short-read data [82]; a single peak suggests that the genome is likely haploid (single copy, n). In the scenario of a diploid (2n) genome or recent whole-genome duplication, two peaks would be observed, such that k-mer coverage of the second peak is exactly (or approximately) twofold compared to that of the first.

Removal of putative microbial contaminants

To identify putative sequences from bacteria, archaea and viruses in the genome scaffolds, we followed the two-phase approach of Chen et al. [32]. In brief, we first searched the scaffolds (BLASTn) against a database of bacterial, archaeal and viral genomes from RefSeq (release 88); hits with E ≤ 10−20 and alignment bit score ≥ 1000 were considered as significant. We then calculated the proportion of bases in each scaffold covered by significant hits. Next, we assessed the added length of implicated genome scaffolds across different thresholds of these proportions, and the corresponding gene models in these scaffolds as predicted from available transcripts using PASA v2.3.3 [83] (see below), with a modified script available at https://github.com/chancx/dinoflag-alt-splice that recognises an additional donor splice site (GA), and TransDecoder v5.2.0 [83]. Any scaffolds with significant bacterial, archaeal or viral hits covering ≥ 5% of its length was considered as a putative contaminant and removed from the assembly. Additionally, the length of the remaining scaffolds was plotted against their G+C content; scaffolds (> 100 kb) with irregular G+C content (in this case, G+C ≤ 45% or ≥ 60%) were considered as putative contaminant sequences and removed. On average, 12.10% of the assembled sequences were removed from each genome based on these criteria.

Generation and assembly of short-read transcriptome data

We generated transcriptome sequence data for the Symbiodinium isolates, except for S. necroappetens CCMP2469 for which the extraction of total RNAs failed (Additional file 3: Supplementary Table 13). Short-read sequence data (2 × 150 bp reads) were generated using paired-end libraries on the Illumina NovaSeq 6000 platform at the Australian Genome Research Facility (Melbourne). Quality assessment of the raw paired-end data was done with FastQC v0.11.4, and subsequent processing with Trimmomatic v0.35 [76]. To ensure high-quality read data for downstream analyses, the paired-end mode of Trimmomatic was run with the settings: HEADCROP:10 ILLUMINACLIP:[AdapterFile]:2:30:10 CROP:125 SLIDING WINDOW:4:13 MINLEN:50. The surviving read pairs were further trimmed with QUADTrim v2.0.2 (https://bitbucket.org/arobinson/quadtrim) with the flags -m 2 and -g to remove homopolymeric guanine repeats at the end of the reads (a known bias of Illumina NovaSeq 6000 data). Transcriptome assembly was done with Trinity v2.1.1 [84] in de novo and genome-guided modes. De novo transcriptome assembly was done using default parameters and the trimmed read pairs. For genome-guided assembly, high-quality read pairs were aligned to the preliminary de novo genome assembly using Bowtie v2.2.7 [85]. Transcriptomes were then assembled with Trinity in the genome-guided mode using the alignment information, and setting the maximum intron size to 100,000 bp. Both de novo and genome-guided transcriptome assemblies from each sample were used for scaffolding (see above) and gene prediction.

Generation of full-length transcript data

Full-length transcripts for S. tridacnidorum and S. natans were generated using the PacBio IsoSeq technology (Additional file 3: Supplementary Table 13). All sequencing was conducted using the PacBio Sequel platform at the Institute for Molecular Bioscience (IMB) Sequencing Facility, The University of Queensland (Brisbane, Australia). Full-length cDNA was first synthesised and amplified using the TeloPrime Full-Length cDNA Amplification Kit (Lexogen) and TeloPrime PCR Add-on Kit (Lexogen) following the protocols provided in the product manuals. One synthesis reaction was performed for each sample using 821 ng from S. tridacnidorum and 1.09 μg from S. natans of total RNA as starting material. Next, 25 (S. tridacnidorum) and 23 (S. natans) PCR cycles were carried out for cDNA amplification. PCR products were divided into two fractions, which were purified using 0.5× (for S. tridacnidorum) and 1× (for S. natans) AMPure PB beads (Pacific Biosciences), and then pooled with equimolar quantities. The recovered 699 ng (S. tridacnidorum) and 761 ng (S. natans) of cDNA were used for sequencing library preparation with the SMRTbell Template Prep Kit 1.0 (Pacific Biosciences). The cDNA from these libraries were sequenced in two SMRT cells.

To generate the dinoflagellate spliced leader (DinoSL)-specific transcript library, 12 PCR cycles were carried out for both samples using the conserved DinoSL fragment (5′-CCGTAGCCATTTTGGCTCAAG-3′) as forward primer, the TeloPrime PCR 3′-primer as reverse primer and the fraction of full-length cDNA purified with 0.5× (for S. tridacnidorum) and 1× (for S. natans) AMPure PB beads. The above-described PCR purification and sequencing library preparation methods were used for the DinoSL transcript libraries; cDNA from these libraries was sequenced in one SMRT cell per sample.

Due to the abundance of undesired 5′–5′ and 3′–3′ pairs, and to recover as much transcript evidence as possible for gene prediction, we adopted two approaches for processing IsoSeq data (Additional file 2: Supplementary Figure 13). First, the IsoSeq 3.1 workflow (https://github.com/PacificBiosciences/IsoSeq) was followed. Briefly, circular consensus sequences (CCS) were generated from the subreads of each SMRT cell with ccs v3.1.0 without polishing, and by setting the minimum number of subreads to generate CCS (--minPasses) to 1. Removal of primers was done with lima v1.8.0 in the IsoSeq mode, with a subsequent refinement step using IsoSeq v3.1.0. At this stage, the refined full-length transcripts of all SMRT cells (excluding those from the DinoSL library) were combined to be then clustered by similarity and polished with IsoSeq v3.1.0. High- and low-quality transcripts resulting from this approach were further used for gene prediction.

Second, we repeated the IsoSeq workflow with some modifications. We polished the subreads with the Arrow algorithm and used at least three subreads per CCS with ccs v3.1.0 to generate high-accuracy CCS. Primer removal and refinement were done as explained above. The subsequent clustering and polishing steps were skipped. The resulting polished CCS and full-length transcripts were also used for gene prediction. IsoSeq data from the DinoSL library were processed separately following the same two approaches.

Gene prediction and function annotation

We adopted a comprehensive, customised approach for ab initio gene prediction from dinoflagellate genomes, following earlier studies [32, 33]; the workflow is available at https://github.com/TimothyStephens/Dinoflagellate_Annotation_Workflow. For a detailed schematic overview of this workflow, see Figure S1 in Chen et al. [32]. A de novo repeat library was first derived for the genome assembly using RepeatModeler v1.0.11 (http://www.repeatmasker.org/RepeatModeler/). All repeats (including known repeats in RepeatMasker database release 20180625) were masked using RepeatMasker v4.0.7 (http://www.repeatmasker.org/); masked sequences were used in the subsequent steps for gene prediction. We used scripts available from RepeatMasker to calculate Kimura distances for sequences of each repeat type (calcDivergenceFromAlign.pl) and to generate a repeat landscape for each genome (createRepeatLandscape.pl).

As direct transcript evidence, we used the de novo and genome-guided transcriptome assemblies from Illumina short-read sequence data. For S. necroappetens CCMP2469, we used transcriptome data of the other six Symbiodinium isolates for gene prediction, as well as other available transcriptome datasets of Symbiodinium: S. microadriaticum CassKB8 [86], S. microadriaticum CCMP2467 [87], and S. tridacnidorum Sh18 [88]. We also combined the S. microadriaticum CassKB8 transcriptome data generated here with those from a previous study [86]. For each sample, we concatenated all RNA-Seq transcripts and “cleaned” them using SeqClean (https://sourceforge.net/projects/seqclean/) and the UniVec database build 10.0. For S. natans and S. tridacnidorum, we also incorporated the PacBio IsoSeq full-length transcript data (above) as evidence to guide gene prediction. We used PASA v2.3.3 [83], customised to recognise dinoflagellate alternative splice donor sites (see above), and TransDecoder v5.2.0 [83] to predict coding sequences (CDS). These CDS were searched (BLASTp, E ≤ 10− 20) against a protein database that consists of RefSeq proteins (release 88) and a collection of available and predicted (with TransDecoder v5.2.0 [83]) proteins of Symbiodiniaceae (total of 111,591,828 sequences; Additional file 3: Supplementary Table 14). We used the analyze_blastPlus_topHit_coverage.pl script from Trinity v2.6.6 [84] to retrieve only those CDS having a hit with > 70% coverage of the database protein sequence (i.e. nearly full-length) in the database for subsequent analyses.

The near full-length gene models were checked for TEs using HHblits v2.0.16 (probability = 80% and E-value = 10− 5), searching against the JAMg transposon database (https://sourceforge.net/projects/jamg/files/databases/) and TransposonPSI (http://transposonpsi.sourceforge.net/). Gene models containing TEs were removed from the gene set, and redundancy reduction was conducted using cd-hit v4.6 [89, 90] (ID = 75%). The remaining gene models were processed using the prepare_golden_genes_for_predictors.pl script from the JAMg pipeline (altered to recognise GA donor splice sites; http://jamg.sourceforge.net/). This script produces a set of “golden genes” that was used as training set for the ab initio gene prediction tools AUGUSTUS v3.3.1 [91] (customised to recognise the non-canonical splice sites of dinoflagellates; https://github.com/chancx/dinoflag-alt-splice) and SNAP v2006-07-28 [92]. Independently, the soft-masked genome sequences were passed to GeneMark-ES v4.32 [93] for unsupervised training and gene prediction. UniProt-Swiss-Prot proteins (downloaded on 27 June 2018) and predicted proteins of Symbiodiniaceae (Additional file 3: Supplementary Table 14) were used to produce a set of gene predictions using MAKER v2.31.10 [94] protein2genome; the custom repeat library was used by RepeatMasker as part of MAKER prediction. A primary set of predicted genes was produced using EvidenceModeler v1.1.1 [95], modified to recognise GA donor splice sites. This package combined the gene predictions from PASA, SNAP, AUGUSTUS, GeneMark-ES and MAKER protein2genome into a single set of evidence-based predictions. The weightings used for the package were as follows: PASA 10, MAKER protein 8, AUGUSTUS 6, SNAP 2 and GeneMark-ES 2. Only gene models with transcript evidence (i.e. predicted by PASA) or supported by at least two ab initio prediction programmes were kept. We assessed completeness by querying the predicted protein sequences in a BLASTp similarity search (E ≤ 10− 5, ≥ 50% query/target sequence cover) against the 458 core eukaryotic genes from CEGMA [96]. Transcript data support for the predicted genes was determined by BLASTn (E ≤ 10− 5) similarity search, querying the transcript sequences against the predicted CDS from each genome. Genes for which the transcripts aligned to their CDS with at least 50% of sequence cover and 90% identity were considered as supported by transcript data.

Following Liu et al. [35], functional annotation of the predicted genes was conducted based on sequence similarity searches against known proteins, in which the predicted protein sequences were used as query (BLASTp, E  ≤ 10− 5, minimum query or target cover of 50%) against the manually curated Swiss-Prot database, and those with no Swiss-Prot hits were subsequently searched against TrEMBL (both databases from UniProt, downloaded 27 June 2018). The best UniProt hit with associated Gene Ontology (GO, http://geneontology.org/) terms was used to annotate the query protein with those GO terms using the UniProt-GOA mapping (downloaded 3 June 2019). Pfam domains [97] were searched in the predicted proteins of all samples using PfamScan [98] (E ≤ 0.001) and the Pfam-A database (release 30 August 2018) [97]. Tests for enrichment of Pfam domains were done with one-tailed Fisher’s exact tests, independently for over- and under-represented features; domains with Benjamini and Hochberg [99] adjusted p ≤ 0.05 were considered significant. Enrichment of GO terms was performed using the topGO Bioconductor package [100] implemented in R v3.5.1, applying Fisher’s exact test with the ‘elimination’ method to correct for the dependence structure among GO terms. GO terms with a p ≤ 0.01 were considered significant.

To assess the potential impact of organellar genome sequences on our analysis, we used available sequences of plastid genomes [101,102,103] and mitochondrial genes [104,105,106,107] from other dinoflagellates as query to search (BLASTn, E ≤ 10− 10) against the seven de novo assembled genomes from this study. For each genome scaffold that shares significant similarity to known organellar sequences, we assessed if the scaffold contains other protein-coding genes (predicted above) that encode non-organellar functions. We consider those that encode only organellar functions as putative organellar genome sequences.

Comparison of genome sequences

We compared the genome data of 15 isolates in Order Suessiales: seven for which we generated genome assemblies in this study (Table 1), and as shown in Additional file 3: Supplementary Table 2, eight others represented by S. microadriaticum CCMP2467 [87], S. tridacnidorum Sh18 [88], B. minutum [108], C. goreaui [35], Cladocopium sp. C92 [88], F. kawagutii [35] and P. glacialis CCMP1383 and CCMP2088 [33]. Genes were consistently predicted from all genomes using the same workflow above.

Whole-genome sequence alignment was carried out for all possible genome pairs (225 combinations counting each genome as both reference and query) with nucmer v4.0.0 [109], using anchor matches that are unique in the sequences from both reference and query sequences (--mum). Here, the similarity between two genomes was assessed based on the proportion of the total bases in the genome sequences of the query that aligned to the reference genome sequences (Q) and the average percent identity of one-to-one alignments (i.e. the reciprocal best one-to-one aligned sequences for the implicated region between the query and the reference; I). If two genomes are identical, both Q and I would have a value of 100%. Filtered read pairs (Additional file 3: Supplementary Table 1) from all isolates were aligned to each other’s (and against their own) assembled genome scaffolds using BWA v0.7.13 [110]; mapping rates relative to base quality scores were calculated with SAMStat v1.5.1 [111]. For each possible genome pair, we further assessed sequence similarity of the repeat-masked genome assemblies based on the similarity between their k-mers profiles, to capture a comprehensive phylogenetic signal using whole-genome sequences. To determine the appropriate k-mer size to use, we extracted and counted k-mers using Jellyfish v2.2.6 [81] at multiple k values (between 11 and 101, step size = 2); k = 21 was found to capture an adequate level of uniqueness among these genomes as inferred based on the proportion of distinct and unique k-mers [112] (Additional file 2: Supplementary Figure 14). We then computed pairwise D2S distances (d) for the 15 isolates following Bernard et al. [30]. The calculated distances were used to build a NJ tree with Neighbor (PHYLIP v3.697) [113] at default settings. For deriving an alignment-free similarity network, pairwise similarity was calculated as 10 − d [114].

Analysis of conserved synteny

To assess conserved synteny, we identified collinear syntenic gene blocks common to each genome pair based on the predicted genes and their associated genomic positions. Following Liu et al. [35], we define a syntenic gene block as a region conserved in two genomes in which five or more genes are coded in the same order and orientation. First, we concatenated the sequences of all predicted proteins to conduct all-versus-all BLASTp (E ≤ 10− 5) searching for similar proteins between each genome pair. The hit pairs were then filtered to include only those where the alignment covered at least half of either the query or the matched protein sequence. Next, we ran MCScanX [115] in inter-specific mode (-b 2) to identify blocks of at least five genes shared by each genome pair. We independently searched for collinear syntenic blocks within each genome (i.e. duplicated gene blocks). Likewise, we conducted a BLASTp (E ≤ 10−5) to search for similar proteins within each genome; the hit pairs were filtered to include only those where the alignment covered at least half of either the query or the matched protein sequence. We then ran MCScanX in intra-specific mode (-b 1).

Analysis of genic features, gene families and functional enrichment

We examined variation among the predicted genes for all Suessiales isolates with a principal component analysis (PCA) using relevant metrics (Additional file 3: Supplementary Table 4), following Chen et al. [32]. We calculated G+C content in the third position of synonymous codons and effective number of codons used (Nc) with CodonW (http://codonw.sourceforge.net/) for complete CDS (defined as those with both start and stop codons) of all isolates. Groups of homologous sequences from all genomes were inferred with OrthoFinder v2.3.1 [116] and considered as gene families. A rooted species tree was inferred using 28,116 families encompassing at least 4 genes from any isolate using STAG [117] and STRIDE [118], following the standard OrthoFinder pipeline. Gene Ontology (GO) enrichment of genes in families core to Symbiodiniaceae and to Symbiodinium (defined as those common to all isolates in, and exclusive to, each group) was conducted using the topGO Bioconductor package [100] implemented in R v3.5.1, implementing Fisher’s exact test and the ‘elimination’ method; the GO terms associated to the genes of all isolates surveyed here were used as the background for each comparison. We considered a p ≤ 0.01 as significant. The significance of size differences of the gene families shared by S. tridacnidorum and S. natans was assessed with a two-tailed Fisher’s exact test correcting p values for multiple testing [99]; difference in size was considered significant for gene families with adjusted p ≤ 0.05. We assessed the relative abundance of gene functions based on the annotated GO terms and protein domains in all predicted genes. For each gene-function feature, we used Z-score to compare its relative abundance among the 15 Suessiales genomes, in which Z = (x − μ) / δ, where x is the relative abundance of a feature in a genome, μ is the mean of x among all 15 genomes and δ is the standard deviation of μ. For each genome, x for each GO term was calculated relative to the number of genes annotated with the term, and x for each protein domain was calculated relative to the total number of domains annotated among all proteins predicted. Hierarchical clustering of Z-scores in the heatmap was conducted based on pairwise Spearman’s correlation coefficients using hclust implemented in R v4.0.2.

Analysis of duplicated genes and pseudogenes in S. tridacnidorum and S. natans

We used the predicted genes and their associated genomic positions to identify potential segmental genome duplications in S. tridacnidorum CCMP2592 and S. natans CCMP2548, as well as in P. glacialis CCMP1383. First, we used BLASTp (E ≤ 10−5) to search for similar proteins within each genome; the hit pairs were filtered to include only those where the alignment covered at least half of either the query or the matched protein sequence. Next, we ran MCScanX [115] in intra-specific mode (-b 1) to identify collinear syntenic blocks of at least five genes and genes arranged in tandem within each genome separately.

Identification of genes with DinoSL and pseudogenes was done in a similar way to Song et al. [119]. We queried the original DinoSL sequence (DCCGUAGCCAUUUUGGCUCAAG) [120], excluding the first (ambiguous) position, against the upstream regions (up to 500 bp) of all genes in a BLASTn search, keeping the default values of all alignment parameters but with word size set to 9 (-word_size 9). To identify full-length DinoSL sequences in the genome scaffolds, we used the same query above and searched against the assembled genome scaffolds using BLAT [121] (-tileSize=11 -stepSize=5). Pseudogenes were identified using tBLASTn, using the predicted protein for each genome as query against the genome sequences, in which regions covered by the predicted genes were masked, as target; in doing so, pseudogenes would not overlap with the predicted gene models. Matched regions with ≥ 75% identity were considered part of pseudogenes and surrounding matching fragments were considered as part of the same pseudogene as long as they were at a maximum distance of 1 kb from another pseudogene fragment and in the same orientation.