Oxford Nanopore long-read sequencing
The nuclear genomes of Phaeodactylum tricornutum (CCMP632) and Thalassiosira pseudonana (CCMP1335) were de novo sequenced and assembled using long-read data generated on Oxford Nanopore’s MinION device (Fig. S1). For P. tricornutum, 986,604 reads (820,187 of which “passed” QC based on Albacore q-score binning) resulted in a total of 8.2 gigabase pairs (Gbp) of data (~300x coverage) with a mean read length of 8.3 Kbp, a mean read quality score (Q-score) of 8.5, and a read-length N50 of ~ 18.8 Kbp (Table 1, Table S1). For T. pseudonana, 701,596 reads were obtained (580,845 “passed”) totaling 7.5 Gbp of sequence (~230x coverage) with a mean read length of 10.6 Kbp, a Q-score of 9.9 and read-length N50 of ~ 20.1 Kbp (Table 1, Table S1). The reads were mapped to the existing diatom reference genomes to estimate nanopore read accuracy. For P. tricornutum, 76.8% of the unfiltered nanopore reads aligned to the reference genome with an average of 73.7% identity, while 76.6% of the unfiltered T. pseudonana long-read data aligned to the reference genome with an average identity of 71.5% (Table 1, Table S1).
In order to produce high-quality genome assemblies for T. pseudonana and P. tricornutum, we curated sub-datasets of reads by filtering the original long-read datasets and selecting the highest quality reads of ≥20 Kbp and ≥ 30 Kbp for P. tricornutum and T. pseudonana, respectively, while maintaining at least 50x coverage of both genomes (based on the estimated genome sizes reported by Armbrust et al.  and Bowler et al. ). The read length cut-offs for each species were determined in part by considering the read length N50 of the filtered datasets as well as the minimum read-length necessary to span transposable elements (based on previously reported TE sequences by Maumus et al. ) and the estimated gap sizes in the original reference genomes [3, 11]. Filtering resulted in smaller but higher quality read datasets for both organisms, indicated by improved mean read length, mean Q-score and read length N50 metrics (Table 1, Table S1; Fig. S2). The filtered dataset for P. tricornutum included 8.6% (84,445 reads) of the original long-read dataset (986,604 reads); however, the mean read length improved over four-fold to 32.6 Kbp and the mean Q-score increased to 9.6 (Table 1, Table S1). The filtered T. pseudonana dataset included only 6.7% (46,708 reads) of the initial read dataset (701,596). The mean read length increased over three-fold (37.9 Kbp) while the Q-score increased to 10.9 (Table 1, Table S1). For both diatoms, the read length N50 increased from ~ 20 Kbp to over 30 Kbp (Table 1, Table S1). These curated datasets were used for subsequent de novo genome assembly.
De novo genome assembly and analysis
De novo assemblies of the filtered datasets were produced using two dedicated long-read assemblers, Canu  and Flye [50, 51]. The raw T. pseudonana and P. tricornutum Canu and Flye assemblies suffered most noticeably from low percent identify to the reference genome and poor gene completeness (see below)—both symptoms of the high per-base error rate of nanopore sequencing (Table 2, Table S2). To correct mismatches and indels, the raw assemblies were first polished with long-read data using two rounds of Racon , followed by the complete Nanopolish  pipeline and at least ten rounds of Pilon , which uses Illumina-generated short-read data to correct false SNPs and erroneous insertions and deletions.
Overall, our polishing approach resulted in progressively improved measures of genome quality and completeness (contiguity, percent identity to the reference genome, error rate, gene content and Assembly Likelihood Evaluation (ALE) score; Table 2, Table S2). Comparison of the final polished Canu and Flye long-read assemblies to the previously published reference genomes yielded average sequence identities of ~ 99% for both T. pseudonana and P. tricornutum (Table 2, Table S2), a noticeable improvement over the initial 71–73% average mapping identities (Table 1).
The final polished Canu and Flye assemblies for T. pseudonana included a minimum of 40x read depth coverage, with a genome size for the final polished Flye assembly of 33.8 Mbp; this is consistent with the reference genome [3, 11] (Table 2, Table S2). In stark contrast, the polished Canu-derived assembly for T. pseudonana was 47.3 Mbp, which is over 10 Mbp larger than the reference genome size (Table 2, Table S2). The Flye assembly was more contiguous (52 contigs) than both the existing reference genome (115 contigs) and the Canu assembly (222 contigs; Table 2, Table S2). Compared to the Canu assembly, the Flye assembly had a longer contig N50 (1.38 Mbp vs 0.98 Mbp) and a lower contig L50 (8 vs 14; Table 2, Table S2).
Genome assembly trends were similar for P. tricornutum in that the final polished long-read assembly generated with Flye yielded a smaller genome size and lower number of contigs compared to Canu (Table 2, Table S2). While the P. tricornutum Flye assembly was somewhat larger than the existing reference genome (33.4 Mbp versus 27.4 Mbp), the Canu assembly was 66.8 Mbp, more than double the expected genome size. Despite using reads ≥30 Kbp, both the Flye (196 contigs) and Canu assemblies (291 contigs) were less contiguous than the existing P. tricornutum reference assembly (179 contigs). Unsurprisingly, the more contiguous Flye assembly had better contig N50 and L50 statistics, although the largest contig generated for P. tricornutum was produced by the Canu assembly (2.51 Mbp vs 1.66 Mbp with Flye; Table 2, Table S2).
In order to select the best overall assembly, the Canu and Flye assemblies were evaluated based on a combination of traditional assembly metrics, statistical analysis tools and gene completeness assessments. First, the accuracy of our de novo genome assemblies was assessed using the ALE pipeline . This statistical tool uses a Bayesian framework to detect synthetic errors in genome assemblies and calculate the likelihood that an assembly is correct given the raw data underlying it. The overall ALE score is calculated based on four ‘sub-scores’: (i) the placement score, which assesses how well each mapped read corresponds to the assembly, (ii) the insert score, which evaluates the expected paired-end read length in the assembly, (iii) the depth score, which measures sequencing depth consistency across the assembly, and (iv) the k-mer score, which uses k-mer frequency to calculate the assembly likelihood independent of the read data . Combined, these four values provide a more objective measure for comparing assemblies based on the same read dataset but produced by different assembly tools; the assembly with the highest ALE score is statistically the best and thus most likely to be correct. For T. pseudonana, the more contiguous Flye assembly with better L50 and N50 values was determined to be the ‘best’ genome assembly (Table 2, Table S2). However, our data show that contiguity may not always be the best indicator of the highest quality assembly. While the Flye assembly for P. tricornutum was the most contiguous, it was statistically worse than the more fragmented Canu assembly based on ALE (Table 2, Table S2).
To assess genome completeness, orthologs from a set of conserved single-copy eukaryotic genes were identified for each genome using BUSCO v3.0.2 . For T. pseudonana, BUSCO completeness for the Flye assembly and the published reference genome assembly was similar [Flye = 80.6%, 244 out of 303 total genes present; reference = 81.2%, 246 out of 303 total genes], with the majority (~ 79.0%) of the genes in both the Flye and reference assemblies existing as complete single copies (Table 2, Table S2; Fig. 1a). The Canu assembly was deemed similarly complete (79.2%, 240 out of 303 genes), although it contained fewer complete single copy genes (59.7%) and a larger proportion of complete duplicated genes (19.5%) than the T. pseudonana Flye and reference genome assemblies (Table 2, Fig. 1a).
For P. tricornutum, BUSCO completeness values for our Flye assembly and the reference genome were similar: 80.9% (245 out of 303 total genes) and 82.5% (250 out of 303 total genes), respectively (Table 2, Table S2; Fig. 1b). While our analyses detected mostly single copy complete genes (80.2%, 243 genes) and few complete duplicated genes (2.3%, 7 genes) for the original reference genome, the proportion of duplicated complete genes in the Flye assembly increased 4-fold (9.9%, 30 genes). Our Canu assembly had the highest BUSCO score (85.4%, 259 out of 303 genes), although it recovered fewer single copy genes (33.3%, 100 out of 303 genes) and a disproportionate number of complete duplicated genes (52.1%, 158 out of 303 genes) relative to the P. tricornutum Flye and reference assemblies (Table 2, Table S2; Fig. 1b). The large number of duplicated genes detected in the P. tricornutum Canu assembly raised the possibility that the assembly algorithm either resolved both haplotypes of the diploid genome (see below) or revealed large segmental duplications that were collapsed in the reference assembly.
We ultimately settled on the T. pseudonana Flye and P. tricornutum Canu assemblies, which were finalized with our full polishing pipeline, as the ‘best’ overall de novo assemblies for downstream analyses. Their robustness and completeness reflect the benefits of combining long-reads for the generation of long contigs with the accuracy of Illumina short-read data. The short-reads are needed to correct indel errors in the nanopore data, as indicated by the dramatically improved estimates of genome completeness and ALE scores with each iteration of our polishing pipeline (Fig. 1; Table 2, Table S2). Although the final T. pseudonana Flye assembly achieved greater contiguity than the original reference genome (which included 37 unplaced contigs), the P. tricornutum Canu assembly was over three times more fragmented than expected (Table 2, Table S2). The T. pseudonana Canu assembly was also significantly more fragmented than that produced with Flye.
The fragmented nature of the Canu assemblies for both diatom genomes is a consequence of the different way that the two assemblers handle allelic diversity and repetitive genomic content. While Canu is a more conservative assembler that is capable of resolving highly divergent haplotypes, low-complexity and highly repetitive areas , Flye may be prone to merging alleles and collapsing repetitive content, often resulting in more artifactually contiguous assemblies . After our analyses were completed, a newer version of Flye (v2.4) was released that is less prone to collapsing repeats and alleles; it produced an assembly for P. tricornutum that was slightly larger in size (39 Mbp), more fragmented (433 contigs) and had a smaller contig N50 (0.15 Mbp) than our initial Flye (v2.3) assembly. Although further analysis is required, the more fragmented nature of the Flye v2.4 assembly suggests that less repetitive content was collapsed and/or fewer alternative haplotypes were merged. It is worth noting that although Canu generates more fragmented assemblies that are less useful for inferring genomic structure and organization, Flye assemblies (v2.3 and older) are also imperfect in that they are more likely to exclude biologically real and potentially important genetic information. The abundance of LTR-RTs in the P. tricornutum genome (see below) likely confounded the Canu assembly algorithm and contributed to the fragmented nature of the final assembly. In the P. tricornutum genome, LTR-RT insertions often occur in just one of the haplotypes. Canu’s conservative algorithm likely detected discrepancies between allele-specific reads that were otherwise the same and did not merge those reads into a single contig. Instead, the algorithm separated those reads (those with the LTR-RT insertion and those without) and produced two distinct contigs (i.e., alternative haplotypes), which resulted in double the expected genome size based on the reference . That said, the Canu assembly is likely closer to reality than the reference or Flye v2.3 assemblies because it captures more of the complexities intrinsic to the P. tricornutum genome. In the case of T. pseudonana, further exploration of the Canu assembly is needed in order to determine if its fragmented nature is the result of greater LTR-RT content than previously recognized in the reference genome [3, 11]. For the time being, if we assume that LTR-RTs in T. pseudonana follow a similar pattern of haplotype-specific insertions, the decreased number of LTR-RT insertions in T. pseudonana compared to P. tricornutum (see below) likely resulted in less ‘haplotype phasing’ by the Canu algorithm and as a result, the T. pseudonana Canu assembly was not as inflated relative to the reference [3, 11].
Long-read sequencing resolves outstanding issues in existing diatom reference genomes
Resolution of telomeres and unlinked chromosome scaffolds
Our long-read assemblies resolved some of the unanswered questions posed by the T. pseudonana and P. tricornutum reference genomes, including unresolved telomeres and unplaced scaffolds. The 27.4 Mbp P. tricornutum reference genome was predicted to contain 33 chromosomes based on the assembly of 33 scaffolds (87.9 Kbp to 2.5 Mbp) . Out of the 33 chromosome-level scaffolds, 12 scaffolds achieved telomere-to-telomere resolution, 16 scaffolds contained one telomere, and five scaffolds lacked both telomeres. None of the contigs in our Canu long-read assembly contained telomeres at both ends, although 58 of these contigs have a telomere at one end. Bionano optical mapping (see below) assigned 32 Canu contigs with single telomeres to 29 Bionano-Canu hybrid chromosome-level scaffolds (out of 49 total hybrid scaffolds), resulting in three hybrid scaffolds with telomeres at both ends, and 26 hybrid scaffolds with a telomere at one end. Mapping our Canu contigs to the reference genome scaffolds indicated that 34 telomeres on the reference scaffolds were also present on the ends of the homologous Canu contigs. In some cases, telomeres were present on our Canu contigs but not their homologous counterparts in the reference scaffolds. While the P. tricornutum reference assembly has one telomere each for chromosomes 18 and 29, we were unable to uncover telomere sequence for the homologous Canu contigs. Our long-read sequencing combined with Bionano optical mapping suggests that the P. tricornutum genome contains at least 29 chromosomes. This estimate was further supported by pulsed-field gel electrophoresis (PFGE), which resolved at least 29 chromosomes ranging from ~ 480 Kbp to ~ 3.0 Mbp in size (Fig. S3). Additional chromosomes in the P. tricornutum genome may have been overlooked in our PFGE observations owing to the co-migration of multiple, similarly sized chromosomes.
The T. pseudonana reference genome was predicted to contain at least 24 chromosomes (297 Kbp to 3.04 Mbp), which were represented by six genome scaffolds with telomeres at both ends, 17 scaffolds with a telomere at only one end, and four scaffolds without telomeres at either end [3, 11]. Without scaffolding via optical mapping, our Flye long-read assembly did not achieve the same level of completion as the original reference. Comprised of 52 contigs, the Flye assembly contained only a single fully resolved telomere-to-telomere chromosome. That contig was homologous (99.6% identity) to reference chromosome 3, which has a telomere at one end. Single telomeres were resolved for 25 of the remaining Flye contigs, which, when mapped to the reference scaffolds, validated the resolution of the majority of the ‘single-telomere’ reference scaffolds. So, while our Flye assembly did not resolve chromosome-level contigs, it did map well to the more complete scaffolds of the reference genome [3, 11]. In some cases, the Flye contigs were able to resolve one or more telomeres where a reference chromosome only resolved one or none (i.e., Flye contigs included telomere sequence flanking the homologous region to the reference scaffold).
Based on optical restriction site mapping, the T. pseudonana reference genome identified two reference scaffolds as representative of chromosome 11 and two scaffolds corresponding to chromosome 16 [3, 11]. Telomere sequence was identified at one end of each of those T. pseudonana scaffolds (‘chr11a’, ‘chr11b’, ‘chr16a’, and ‘chr16b’); however, the two respective scaffolds could not be definitively linked to form two authentic chromosomes [3, 11]. In the case of ‘chr16a’ & ‘chr16b’, this was because the scaffolds were separated by repetitive sequence that was too long to be resolved by the length of a fosmid insert. Similarly, optical mapping attributed three T. pseudonana scaffolds (‘chr19a’, ‘chr19b’, and ‘chr19c’) to a single chromosome but sufficient nucleotide sequence data to demonstrate their connection was lacking. Mapping our Flye contigs to the reference genome allowed us to resolve the missing nucleotide data linking these fragmented reference chromosomes together, thereby validating three more complete chromosomes for T. pseudonana.
Resolution of unplaced scaffolds and gap filling with long-read data
The P. tricornutum and T. pseudonana reference genomes both include substantial amounts of sequence data that could not be placed in a larger chromosomal context at the time of publication. These unlinked scaffolds, termed “bottom drawer” (scaffolds prefixed as ‘Bd’), were predicted either to fall within unresolved gaps on the main chromosome-level scaffolds or to represent alternative haplotypes [3, 11]. The original P. tricornutum genome included 55 unlinked scaffolds 450 bp to 293 Kbp in size, while the T. pseudonana genome included 37 such scaffolds (2282 bp to 138 Kbp). Using long-read sequencing, we placed 30 out of 37 and 38 out of 55 T. pseudonana and P. tricornutum “bottom drawer” scaffolds, respectively (Table S3). This was achieved by manually identifying “bottom drawer” scaffolds with homology to contigs in our assemblies and bridging the reference genome gaps with our long-read-derived contigs.
The P. tricornutum and T. pseudonana reference genomes include 69 (~ 0.33 Mbp) and 18 (~ 0.10 Mbp) gap regions on the main chromosome scaffolds, respectively. We used local alignments between the reference chromosomes and their homologous long-read derived contigs to identify regions where our contigs spanned gaps in the reference chromosomes. In doing so, we filled in 13 gaps in the main scaffolds for T. pseudonana and 18 gaps for P. tricornutum (Table S4). We also assessed the gaps in the “bottom drawer” scaffolds for each diatom reference genome. Out of 31 (~ 53 Kbp) gaps for T. pseudonana, we resolved 12 (Table S4). For P. tricornutum, the 20 gaps (~ 97 Kbp) in the “bottom drawer” scaffolds could not be resolved as none of these scaffolds showed obvious homology to our Canu contigs. In total, by using long-read data to resolve the unlinked scaffolds and gaps associated with the reference assemblies, we were able to integrate 0.10 Mbp (T. pseudonana) and 0.49 Mbp (P. tricornutum) of additional sequence data into our assemblies relative to the original reference genomes.
Detection of structural variation
To assess small (< 50 bp) and large (> 50 bp) structural variation between the reference and long-read diatom genomes, we used Assemblytics  which detects and catalogs variants based on whole genome alignments generated by MUMmer. We found 1.20 Mbp of variants between the T. pseudonana Flye assembly and the original reference, with insertions and tandem expansions contributing to 58% of the total size variation (Fig. 2; Table S5). A total of 4.68 Mbp of structural variation was detected between our Canu P. tricornutum genome and the reference (Fig. 2; Table S5). Over 1.75 Mbp of that difference (624 variants in total) were attributed to insertions, with the majority (~ 1.12 Mbp) 4000–10,000 bp in size (Fig. 2). When variants in that size range were extracted and compared to a local database of diatom long-terminal repeat retrotransposons, 84% (157 out of 187 variants) were found to be CoDi LTR-RTs. Further investigation of all 1569 variants reported for the Canu P. tricornutum genome identified 25.5% (400 variants) as LTR-RTs, versus only 2.3% (21 out of 935 total variants) in T. pseudonana (a complete analysis of LTR-RTs is described below). A case-by-case investigation of the possible biological significance of these structural variations is beyond the scope of the present study but is certainly warranted.
Resolution of ribosomal RNA operons
Due to their multi-copy, homogeneous nature, nuclear ribosomal RNA (rRNA) operons are notoriously difficult to assemble using traditional sequence data; they thus serve as a useful test of the potential for long-read sequencing to improve genome assembly. To that end, the reference and polished long-read diatom genome assemblies were assessed for copies of the complete rRNA operon (18S, ITS1, 5.8S, ITS2, 28S). Whereas a single complete rRNA operon was detected on scaffold chromosome 17 in the T. pseudonana reference genome, our Flye assembly contained a single 733,359 bp contig (Flye contig3, which is homologous to reference scaffold chromosome 17) containing five complete tandem rRNA operon copies (Table 3). The average length of each complete rRNA operon was 5826.8 bp with an average of 4521.8 bp between each operon (Table 3). The five complete operons have an average identity of 99.6%. Two partial rRNA copies (1742 bp & 793 bp) were detected on the unlinked ‘bottom drawer’ reference scaffold Bd36x69, while the five complete rRNA copies on Flye contig3 were followed by a truncated (5538 bp) copy that was missing ~ 300 bp from the 28S portion of the operon.
To assess whether the tandem rRNA copies in T. pseudonana were mis-assemblies, we mapped our long-read data to the de novo Flye assembly. We detected multiple examples of single MinION reads that spanned all five rRNA copies located on Flye contig3, suggesting that the rRNA tandem array assembled by Flye was biologically accurate. However, the average Illumina read depth at those five rRNA loci was over four times the average read depth for the rest of the genome (638x vs. 148x) indicating that the T. pseudonana genome contains additional rRNA loci that were collapsed by the Flye assembly algorithm. Our detection of multiple rRNA copies at the end of Flye contig3, which is homologous to reference chromosome 17, is consistent with previous assessments of rRNA repeats for T. pseudonana. The initial version of the reference assembly for T. pseudonana reported a cluster of ~ 35 rRNA copies on chromosome 17; however, the assembler that was used to generate the second version of the T. pseudonana genome seems to have collapsed those repeats into a single rRNA locus on chromosome 17 [3, 11].
In the case of P. tricornutum, we identified a complete rRNA operon (5043 bp) on scaffold chromosome 13 of the reference genome as well as a partial operon (766 bp) on scaffold chromosome 7. In our long-read-derived assembly, we detected two Canu contigs with complete tandem rRNA copies (5935.6 bp average length) – two rRNA copies (99.9% identical) on contig2792 (homologous to reference scaffold chromosome 7) and five copies (99.9% average identity) on contig74 (homologous to reference scaffold chromosome 13; Table 3). Two partial rRNA copies (766 bp & 205 bp) were also detected on contig2792. Single MinION reads were found to span the two complete rRNA copies on contig2792, while multiple MinION reads were identified as spanning the five tandem rRNA copies on contig74. The P. tricornutum raw long-read data thus support the rRNA arrays detected on Canu contig2792 and contig74 as biologically authentic and not the result of mis-assemblies.
In contrast to T. pseudonana, the average Illumina read depth at the P. tricornutum Canu assembly rRNA loci was relatively similar to the average read depth across the entire genome (82.5x vs. 66x), suggesting that the long-read data capture the total number of rRNA loci in the genome as one would expect the read coverage to be a multiple of the average genomic coverage (e.g., 132x) if there were other copies of the rRNA operon that had been collapsed into this area. To assess if failure to resolve tandem rRNA arrays in the original reference genomes was a symptom of the assembly process collapsing highly repetitive genomic regions, we mapped the raw sequence data produced by Bowler et al.  to the intergenic spacer region (IGS) resolved in our Canu assembly. Those raw reads mapped to the IGS regions for both contig74 and contig2792 with an average read depth of ~ 7.0x (average read depth for entire contig = 7.0x) and 5.9x (average read depth for entire contig = 9.1x), respectively. This suggests that the tandem rRNA copies were indeed present in the reference data, but those regions were collapsed by the assembly algorithm. All things considered, our long-read assemblies provide a more accurate picture of the ribosomal RNA operon organization in the T. pseudonana and P. tricornutum genomes.
De novo gene prediction and annotation for Thalassiosira pseudonana
Rastogi et al.  recently used RNA-Seq and traditional EST data to re-annotate the P. tricornutum genome, identifying 12,233 genes versus the 10,402 genes originally reported in the reference genome . To our knowledge there have been no attempts to reinvestigate the gene content of T. pseudonana since Bowler et al.  predicted 11,673 genes in the nuclear genome, (~ 4000 of which were supported by EST data) based on their improvements to the initial reference assembly . Our polished T. pseudonana Flye assembly served as the foundation for the gene comparison and re-discovery analyses below.
Comparison of our T. pseudonana Flye long-read assembly to the complete set of proteins predicted for the reference resulted in the identification of 99.9% of the previously reported genes; only eight genes out of 11,673 were not detected. Long-read mapping against both the reference and Flye assemblies identified long-reads that supported the presence of those eight genes in the reference sequence as well as long-reads that authenticated their absence in our Flye assembly. It is possible that each of those genes occurs at a single locus in the genome and is represented by only a single allele, resulting in two distinct haplotypes at a given locus. While the reference assembly resolved the haplotype version containing the allele, the Flye assembly resolved the alternate haplotype in which the allele has been lost.
Exploration of potentially ‘new’ gene content in T. pseudonana was performed using the Flye long-read assembly and a newly assembled transcriptome. Four RNA-Seq datasets previously published by Goldman et al.  were downloaded from NCBI and assembled into 22,600 transcripts corresponding to 10,383 protein coding genes. The assembled transcriptome was mapped against our T. pseudonana Flye genome for an overall alignment completeness of 95.3%. When the reference protein coding genes were compared against the new transcriptome, a total of 344 reference genes were not detected at the nucleotide level. In comparison, when the transcriptome was compared against the reference dataset of protein coding genes for T. pseudonana, ~ 2500 out of the 22,600 transcripts were not recovered (e-value = 1e-15), even though ~ 560 of the ~ 2500 transcripts were identified as being homologous to other diatom sequences in the NCBI protein database (e-value ≤1e-05) (mostly Thalassiosira oceanica; see below).
Comparison of the newly assembled, RNA-seq-based T. pseudonana transcriptome against the reference and Flye assemblies indicated a very small proportion of missing genes. Out of 22,600 transcripts (corresponds to 10,383 protein coding genes), only 57 did not have hits against the reference genome, versus 52 transcripts that did not have hits against the Flye de novo assembly. A total of 37 transcripts lacking hits were shared between the reference and Flye genomes and likely correspond to poorly assembled transcripts or contamination. The 20 transcripts missing from the reference genome and 15 transcripts missing from the Flye assembly likely correspond to genes located in missing genomic regions in each assembly.
We used the T. pseudonana transcriptome and long-read assembly to carry out a de novo gene prediction, resulting in a protein coding dataset of 16,491 genes, substantially larger than the 11,673 genes reported for the reference [3, 11]. Of the newly predicted genes, 13,805 (83.7%) were found to have high similarity (≥70% amino acid identity) to previously reported T. pseudonana genes (Fig. S4). Out of the remaining 2686 predicted genes (16%), 1971 had no match against the Armbrust et al. (2004) reference genome, while 715 had only a weak match (i.e., percent identity ≤70%), suggesting that those genes represent paralogs to recognized T. pseudonana genes (Fig. S4).
When the 2686 ‘new’ T. pseudonana gene sequences were compared to the NCBI protein database, 2010 genes had hits (<1e-03; Fig. S4). While 148 genes were most similar to transposon genes (transposon-related genes are not included in the reference protein coding gene set), the remaining 1862 genes were found to be most similar to genes identified in other diatom species (Fig. S4). Notably, 1042 of these genes were most similar to genes in Thalassiosira oceanica (data which were not available when the reference genome was published), suggesting that these 1862 genes are authentic, newly recognized T. pseudonana genes and not artefacts of the gene finding process. A blastp analysis against the NCBI protein database indicated that 1189 genes out of the 1862 newly predicted genes for T. pseudonana had homology to genes with known functions (not including genes associated with transposons) in other species. 676 genes did not have obvious homologs in the NCBI protein database and thus require further investigation to determine if they represent T. pseudonana-specific genes or are artefacts (e.g., due to intron retention in RNA-Seq data). 67 ‘new’ genes mapped to the “gap-resolved” regions (see above) of the T. pseudonana Flye assembly. Out of those 67 genes, 33 genes (49.2%) were among the 2686 genes without a blast hit showing ≤70% identity to the reference genome. While three of the 33 genes were identified as being transposons, the remaining 30 genes were previously unidentified in T. pseudonana.
Out of 11,673 genes predicted in the original T. pseudonana reference genome, 3900 were inferred to be specific to this genome and another 1407 were deemed diatom-specific [3, 11]. These numbers are based on comparison of the protein coding gene datasets for T. pseudonana and P. tricornutum, which were the only diatoms datasets available at that time. Since then, genomes and transcriptomes have been sequenced from a variety of additional diatom orders and genera, allowing for a more comprehensive and accurate assessment of species-specific and diatom-specific gene content. Comparison of our T. pseudonana proteome to protein datasets for seven other diatoms identified 3731 orthologous groups shared among the eight diatom species (Table 4). A total of 7512 (45.6%) T. pseudonana genes predicted in our study were assigned to these groups; 5136 genes were inferred to be diatom-specific, 1959 as shared with other stramenopile lineages (e.g., oomycetes, Blastocystidae, Pelagophyceae) and 1082 as having strong similarity to bacteria (predominantly Proteobacteria) as determined by subsequent PLAST analyses against the NCBI protein database. The T. pseudonana genes with a strong affinity to bacteria were not investigated further, although they could represent instances of HGT, as inferred by previous studies [11, 13].
A total of 2692 T. pseudonana proteins were not assigned to orthologous groups (16.3% in total) and PLAST assessment against the NCBI protein database (e-value 1e-10, query coverage ≥70%, ≥40% identity) identified 2502 genes that were likely T. pseudonana-specific genes/proteins (Table 4). The remaining 190 genes showed obvious homology to other diatom sequences (predominantly T. oceanica). Out of the T. pseudonana-specific proteins, 716 were among the putative novel genes identified here-in.
The discovery of 1862 previously unreported genes in T. pseudonana was unexpected– it enhances our understanding of gene content for this species and provides a framework for consideration of which of its genes are ‘species-specific’ and ‘diatom shared’. The number of genes T. pseudonana shares with other diatoms will no doubt continue to grow as more genomes are sequenced.
Bionano optical mapping of the Phaeodactylum tricornutum genome
Ploidy assessment of the de novo P. tricornutum Canu genome assembly is consistent with previous suggestions that P. tricornutum is a diploid organism (Fig. S5) [11, 58,59,60,61]. As noted above, PFGE (Fig. S3) supports the existence of at least 29 chromosomes (~ 480 Kbp to ~ 3.0 Mbp in size) totaling ~ 30–32 Mbp, which is roughly consistent with the number of chromosomes (33) and genome size (27.4 Mbp) reported for the reference. As noted above, our Canu long-read assembly was roughly double the expected genome size, supporting the separation of reads into two contigs representing different alleles. Bionano optical mapping was performed in an attempt to more accurately resolve both P. tricornutum haplotypes.
The Canu assembly was used to select the direct labeling enzyme DLE-1 as the best enzyme for achieving the recommended labeling density of 8–25 sites per 100 Kbp required for optimal resolution (DLE1 recognition site: CTTAAG, labeling density 7.501/100 Kbp). The Bionano system generated 4,760,428 virtually labeled molecules that were filtered (molecules ≥100 Kbp) for a total of 1,055,998 molecules (average length 252.6 Kbp) totaling 267 Gbp. Only Canu contigs greater than 150 Kbp (138 of 293 contigs, 71.8% of total assembly) were scaffolded onto the de novo Bionano physical consensus maps. Hybrid scaffolding produced 49 super-scaffolds (128 Kbp − 2.78 Mbp) totaling 50.6 Mbp (Table S6). When combined with the 155 contigs that were too small to be anchored to the physical consensus maps (28.2% of the Canu assembly), the total genome size increased to 66.8 Mbp. The N50 of the Bionano-Canu hybrid assembly was found to be 1.06 Mbp, representing a 4.2-fold increase when compared to the Canu assembly alone (Table 2; Table S6). The resulting 49 super-scaffolds included 9.5 Mbp of gaps (23 bp - 1.0 Mbp) with the majority of gaps (86%) being ≤300 Kbp in length.
Our Bionano data resolved multiple super-scaffolds that are homologous to the same regions of the reference chromosomes, supporting the separation of the Canu contigs and scaffolds into two copies. As the haploid P. tricornutum reference genome contains 33 chromosomes (12 scaffolds with telomeres at both ends), we anticipated resolution of 66 total haplotypes. The presence of only 49 super-scaffolds indicates that the Bionano-Canu hybrid assembly is missing 17 haplotype representative scaffolds. The 49 super-scaffolds were classified as either a “full-length haplotype” (i.e., super-scaffolds aligned to their homologous reference chromosome sequences across their entire length; 19 scaffolds in total), a “partial haplotype” (i.e., the super-scaffold aligned to only part of its homologous reference chromosome; 17 scaffolds), a “mis-assembled haplotype” (i.e., portions of a super-scaffold aligned to more than one reference chromosome; nine scaffolds) or an “unresolved haplotype” (i.e., the super-scaffold could not confidently be resolved to a reference chromosome; four scaffolds; Table S6).
When the super-scaffolds were mapped against the P. tricornutum reference chromosomes, we were only able to resolve full-length haplotypes for four reference chromosomes (chr1, chr8, chr16, and chr26, eight super-scaffolds in total; Fig. S6). Seventeen reference chromosomes were characterized by super-scaffolds representing either one full-length haplotype and one partial haplotype, two partial haplotypes, a single full-length haplotype or a single partial haplotype (25 super-scaffolds; Table S6, Fig. S6). Reference chromosome 5 was the only exception and was represented by one full-length haplotype and two partial haplotypes (Table S6, Fig. S6). Resolution of the remaining 11 reference chromosomes as distinct haplotypes was even less straightforward; two reference chromosomes (chr30 & chr31) were not confidently resolved while nine mapped to nine super-scaffolds in what can be described as ‘hybrid-hybrids’. These super-scaffolds corresponded to single Bionano-Canu scaffolds with each half of the scaffold representing a different reference chromosome (either a partial or full-length haplotype) or, in a single case, a Bionano-Canu scaffold containing portions of four different reference chromosomes (Figs. S7 & S8). No fewer than fourteen reference chromosomes were identified as contributing to portions of the hybrid-hybrids, with some reference chromosomes appearing more than once.
Perhaps not surprisingly, deeper investigation of the nine hybrid-hybrid super-scaffolds, their respective Canu-contig sequences, and homologous reference chromosomes revealed that LTR-RT insertions and segmental duplications were a confounding factor in their formation. In three cases (SS100002, SS00015, SS100022), Bionano appeared to erroneously resolve contigs to single super-scaffolds when segmental duplications (consistent with those detected for the P. tricornutum reference genome ) and/or LTR-RTs were present near the contig ends (Fig. S7). The similar labeling enzyme patterns of those repetitive genomic regions appear to have been detected by the Bionano software as portions of a single molecule that should be joined in a single molecule map. Interestingly, the resolution of hybrid-hybrid super-scaffold SS100022, which linked contigs homologous to reference chromosomes 24 and 29 (Fig. S7B), is consistent with Diner et al. ; these authors identified similar putative centromere sequences located at the termini of those scaffolds (which also lack telomeres) and hypothesized that chromosomes 24 and 29 are in fact two portions of a single chromosome. Our Bionano data support the genomic arrangement theorized by Diner et al. , but due to the repetitive nature of the area where the Canu contigs are joined, we could not confidently join chromosomes 24 and 29.
Assessment of the remaining eight hybrid-hybrid super-scaffolds was more complicated. For four of these hybrid-hybrids, we observed that the portions of the scaffold identified as coming from different chromosomes were separated by one or more large gap regions (Fig. S8A-D). In other cases, the ‘breakpoints’ between the inter-chromosomal mergers occurred in the middle of a single Canu contig (Fig. S8E-F). To assess if those Canu contigs represent mis-assemblies produced by the Canu assembly process, we mapped the raw nanopore long-read sequences to the Canu contigs in question. In both cases, we identified multiple long-reads spanning the ‘breakpoint’. Clearly there are artefacts being introduced but with the data in hand we cannot determine where and why.
To determine if some of the 155 unscaffolded contigs that did not meet the minimum size requirement for inclusion in the Bionano optical mapping could be used to manually complete partially resolved haplotypes and fill in gap regions inserted into the Bionano-Canu super-scaffolds, we aligned the Bionano super-scaffolds to their respective homologous reference chromosomes. Based on these alignments, we used blastn to compare the unscaffolded contigs against the specific reference chromosome regions determined to be missing from our partially resolved haplotypes. The creation of a more complete haplotype was straightforward in cases where a partial haplotype for a given reference chromosome was also represented by a full-length haplotype, which could be used as a guide for positioning the homologous unscaffolded contigs. However, in cases where a chromosome was represented by two partial haplotypes, correct haplotype assignment was not possible due to a lack of genomic context (File S1).
Given that our Bionano data support the separation of the Canu contigs into two haplotypes, we attempted to confirm that the original reference genome is indeed the product of haplotype amalgamation. A SNP frequency analysis was performed using only reference chromosomes for which two full-length haplotypes were represented (five case studies in total). First, long-read data were mapped to the Canu contigs representing the full-length haplotypes and their corresponding reference chromosome. The two Canu haplotypes were then aligned to the appropriate reference genome scaffold and manually examined to compare the sites of difference (Fig. S9). SNP visualization for all five chromosomes examined strongly suggests that the reference chromosomes are a mixture of the two haplotypes resolved by Canu and supported by Bionano optical mapping (Table S7).
To assess the possibility that it is the Canu contigs that are the amalgams and that the published reference chromosomes represent only one of the two haplotypes, Illumina short reads were mapped to the Canu haplotypes and reference chromosomes. The SNP differences across individual Illumina reads (~ 120 bp) were consistent with the differences observed across the long-read contigs/haplotigs, further validating the notion that it is the reference sequence of Bowler et al.  that is mosaic. Roughly an equivalent number of Illumina reads were found to support each haplotype, which is consistent with the diploid nature of P. tricornutum (Fig. S10).
The level of allelic divergence in our P. tricornutum data was clearly significant enough for the Canu assembly algorithm to separate the two haplotypes rather than collapsing them together. We wanted to determine whether such high levels of allelic differences could influence the Bionano results. More specifically, could the underlying haplotype sequence differences give rise to heterogeneous Bionano enzyme labelling sites and thus compromise the ability of the Bionano approach to find the two haplotypes? The Bionano system converts images of electrophoretically separated, fluorescently labelled, long DNA molecules into virtual molecules, which are then clustered into virtual consensus maps . These consensus maps are then compared to reference sequences that have been computationally labeled at sites with the same motif. This enables the pattern of fluorescently labelled enzyme sites in real DNA molecules (as represented by the Bionano consensus maps) to be compared to the equivalent sequences in a long-read assembly. This could potentially allow the assembly to be separated into haplotypes as the contigs are oriented and aligned into larger, chromosome-level scaffolds. The ability of the Bionano software to distinguish between haplotypes is dependent on the sites selected for incorporating the dye, as well as the density of the sites selected for labeling. If there are too few labeling sites in the DNA, there may not be enough information to make informative patterns to ascertain haplotypes. Conversely, if there are too many sites, the labeling pattern can become distorted and unreliable.
In P. tricornutum, the direct labeling enzyme DLE-1 (CTTAAG) was selected from a limited number of direct labeling enzymes. When we evaluated the DLE-1 enzyme sites and SNPs across the Canu haplotypes, we found that 15% of enzyme sites were altered by allelic differences, indicating that one of the haplotypes would have fewer enzyme sites available and, as a result, a lower labeling density. The labeling density across the whole P. tricornutum genome was 7.51 sites/100 Kbp, which falls just below Bionano’s recommended labeling density of 8–25 sites/100 Kbp . The reduced labeling density resulting from the altered enzyme sites likely further compromised the resolution of the Bionano approach and resulted in an inaccurate dye pattern. The combination of low labeling density and high SNP diversity appears to have impacted the number of available enzyme labeling sites and contributed to the inability of Bionano to fully phase both P. tricornutum haplotypes. Additionally, the low labeling density likely contributed to the generation of the hybrid-hybrid Bionano-Canu contigs.
All things considered, while the Bionano data validate the separation of the Canu contigs into haplotypes, neither Bionano nor nanopore sequencing (together or in isolation) was able to fully phase the P. tricornutum genome.
Repetitive DNA and long-terminal repeat retrotransposon content in diatom genomes
A prominent feature of the P. tricornutum and T. pseudonana genomes is the presence of repetitive elements. We first explored this using RepeatMasker to identify, characterize and compare repetitive content among the polished long-read assemblies and reference genomes for both organisms. Repetitive elements were found to contribute 11.3 Mbp (19.9%) and 2.55 Mbp (7.6%) to our de novo P. tricornutum and T. pseudonana genome assemblies, respectively (Tables S8 & S9). These proportions represent more than a two-fold increase in repeat content relative to our reassessments of the published reference genomes [P. tricornutum 8.1% (2.22 Mbp) and T. pseudonana 3.2% (1.03 Mbp); Tables S8 & S9]. Transposable elements (TEs) comprised ~ 41% more of the T. pseudonana Flye assembly (1484 TEs, 3.8%, 1.27 Mbp; Table S9) than the reference genome TE content reassessed using the same RepeatMasker parameters (1049 TEs, 1.4%, 0.45 Mbp; Table S9). We identified a 3.3-fold increase in the number of TEs found in our P. tricornutum Canu assembly (5605 TEs, 15.9%, 9.05 Mbp; Table S8) versus the reference genome (1706 TEs, 6.4%, 1.76 Mbp; Table S8). Consistent with the results of Rastogi et al. , we detected a small proportion of the P. tricornutum genome (0.2%, 0.13 Mbp) as being comprised of short interspersed nuclear elements (SINES; a type of non-long terminal repeat retrotransposon), which were undetected in the original reference genome annotation . To date, SINES have only been reported in two other diatom species, Cyclotella cryptica  and Skeletonema costatum  and their genomic impact and functional roles are not well understood.
The most dramatic difference in TEs between the T. pseudonana Flye and reference assemblies and the P. tricornutum Canu and reference assemblies was in the number of full-length, decaying and nested Ty1/copia-like long terminal repeat retrotransposons (LTR-RTs). While the T. pseudonana Flye genome had a ~ 1.5-fold increase in Ty1/copia-like LTR-RTs when compared to the reference genome, an even more striking pattern was observed in P. tricornutum. Whereas Ty1/copia-like LTR-RTs comprise 5.7% of the P. tricornutum reference genome (1383 LTR-RTs, 1.57 Mbp; Table S8), they were classified as an even larger fraction of the Canu genome assembly at 14.4% (4703 LTR-RTs, 8.18 Mbp; Table S8).
Previously, diatom Ty1/copia-like LTR-RTs were classified into seven copia-like groups (called CoDi for copia-like in diatoms) with six CoDi groups forming two diatom-specific copia lineages [3, 11, 15]. Several studies have elucidated the role that Ty1/copia-like LTR-RTs have played in diatom diversity, genome structure and ecological adaptation [15,16,17]. While LTR-RTs have been identified in a number of other diatom genomes as well, e.g., those of Fragilariopsis cylindrus, Pseudo-nitzschia multistriata, Pseudo-nitzschia multiseries and T. pseudonana), these genomes lack the degree of CoDi expansion seen in P. tricornutum. As part of a larger genome reannotation study, Rastogi et al.  reassessed repetitive content in P. tricornutum and reported a greater proportion of CoDi elements (~ 7.6%) than earlier estimates (~ 5.4%) by Maumus et al. . The genomic fraction of CoDi elements reported by Maumus et al.  and Rastogi et al.  included full-length, decaying and nested CoDi elements. To provide greater insight into the proportion of full-length, putatively active Copia-type and Gypsy-type LTR-RTs potentially contributing to T. pseudonana and P. tricornutum genome evolution, we assessed both genomes with rigorous, optimized software programs for de novo LTR-RT discovery. This involved consideration of the key signatures of LTR-RTs, namely gag-pol genes, LTR sequences at each end, and target site duplication sequences directly flanking each LTR.
We identified 22 full-length CoDi and Gypsy loci in the original T. pseudonana reference genome; this included 10 additional putatively active loci that were overlooked in previous analyses (Table S10). For the de novo T. pseudonana Flye genome, we detected a total of 38 putatively active LTR-RTs. These CoDi or Gypsy elements were characterized as “previously reported loci” (i.e., loci homologous to those previously reported by Maumus et al.  in the reference genome), “overlooked loci” (loci homologous to those present in the reference genome but not reported) and “novel loci” (i.e., loci detected in our long-read assembly but without a homologous insertion in the reference genome). Unexpectedly, we identified only eight previously reported loci, four overlooked loci, and 26 novel CoDi and Gypsy insertions in the Flye T. pseudonana genome assembly (Table S10, Fig. S11). Using LTR-retriever , the insertion time for 14 of the 26 novel insertions was estimated to be zero, consistent with the possibility that these 14 insertions represent very recent LTR-RT insertions absent from the original reference genome (i.e., present in our T. pseudonana culture but not in that used for the initial reference genome). That said, it is possible that these novel insertions were present in some but not all of the original reference genome sequence and did not make their way into the final consensus due to bioinformatic constraints associated with the handling of alleles.
Results obtained for P. tricornutum were even more striking. Detection of LTR-RTs in the reference genome confirmed the 42 full-length previously reported CoDi elements and an equal number of overlooked full-length CoDi loci (Table S11, Fig. 3). We identified the CoDi5 group as a main contributor to the LTR-RT expansion, in addition to the previously recognized CoDi2 and Codi4 groups  (Table S11, Fig. 3). Out of the 84 LTR-RT insertions detected in our search, 73 (87%) were identified in the P. tricornutum Canu assembly (36 previously reported loci and 37 overlooked loci, Table S11). In addition to those 73 loci, we detected 327 putative novel CoDi insertions in our Canu genome assembly (Table S12, Fig. 3). It is worth noting that further analysis of the Canu contigs representing alternative haplotypes indicated that most CoDi insertions were located in only a single haplotype, which is consistent with the observations of Maumus et al. .
In an attempt to determine if those novel loci were indeed the product of recent LTR-RT proliferation in our cell culture since the reference genome was published, or if these loci were present in the data of Bowler et al.  but not identified due to bioinformatic processing steps, we analyzed the raw sequencing reads generated for the original P. tricornutum genome project and mapped them to the Canu genome. These analyses indicated that the vast majority of the novel loci uncovered for P. tricornutum are actually supported by the raw Sanger sequencing reads, although a small proportion of insertions (~ 10%, 33 novel insertions) were not supported by raw reads and thus presumably represent authentic novel CoDi insertions present in one or both alleles.
Our LTR-RT investigation of P. tricornutum indicates that there are far more full-length CoDi elements present in the genome than previously recognized. Out of the 8.18 Mbp of LTR-RT sequences estimated by RepeatMasker for the P. tricornutum Canu genome assembly, ~ 32% (~ 2.69 Mbp) were identified as full-length CoDi elements predicted to have the required structural and enzymatic components needed for activation. The rate at which these loci are actively proliferating, and the biological significance of this proliferation, remains to be determined.