Introduction

Serious damage caused by global warming has already been documented (Maddison and Rehdanz 2011; Lemoine and Kapnick 2016). Confronting this issue and finding solutions is an overriding priority for humankind to sustain the planet. Every possible measure needs to be taken to decrease the emission of greenhouse gases, including CO2. Among these measures, the development of environmentally friendly industries based on renewable biomass, instead of conventional petrochemistry-based industries that consume fossil resources, is one of the promising options. Research into renewable biomass has moved from searching for sources of valuable compounds in edible biomass (e.g., bioethanol derived from starch and biodiesel derived from edible plant oil) to non-edible cellulose-based biomass. Microalgal biomass has additionally been recognized as a promising feedstock because it does not compete with food production and has higher biomass productivity than cellulose-based biomass (Maeda et al. 2018). Furthermore, microalgae-derived oil can be used as an alternative source of ω-3 polyunsaturated fatty acids (PUFAs), which are in high demand for sustainable aquaculture (Maeda et al. 2018), as well as of jet fuels (Fortier et al. 2014; Wei et al. 2019; Bwapwa et al. 2017), which cannot be mixed with the bioethanol and biodiesel produced in previous generations of biofuel research.

As it is rare for naturally occurring microalgae to achieve economically feasible productivity of valuable compounds, engineering efforts are still needed to improve it. Large-scale metabolic engineering by introducing a number of heterologous and/or homologous genes has great potential to address this issue. Indeed, sophisticated metabolic engineering of Saccharomyces cerevisiae has been demonstrated with the aid of computationally designed episomal vector platforms (e.g., artificial chromosomes), which are replicated and maintained in the yeast chassis (Walker and Pretorius 2018). In 2015, Karas et al. developed the world's first diatom episomal vectors, capable of introducing a 49-kbp heterologous sequence, for the large-scale metabolic engineering of the model diatoms Phaeodactylum tricornutum and Thalassiosira pseudonana (Karas et al. 2015). They found that the centromeres and autonomous replication sequence (ARS) derived from S. cerevisiae can be used to maintain the episomal vectors in the diatoms, although the mode of action of DNA replication and maintenance in diatom cells remains unclear (Diner et al. 2016). Subsequently, a method to predict diatom centromeres on each chromosome based on GC% was proposed (Diner et al. 2017). Nonetheless, they failed to find any consensus sequence motifs in the centromeres of these model diatoms, suggesting that more diatom species should be subjected to the centromere search. Identification of the features of the centromeres, including consensus sequence motifs, characteristic structures, and chromatin features (Talbert and Henikoff 2020), can provide insights into chromosome replication and maintenance in diatoms, as the functions of such motifs have been studied in S. cerevisiae towards the rational design of episomal artificial chromosomes.

The marine oleaginous diatom Fistulifera solaris is a promising microalga for the practical production of useful compounds because it accumulates significant amounts of lipids, up to ~ 65 wt% of its dry cell weight (Maeda et al. 2017), and outdoor mass cultivation of this diatom has been successfully demonstrated (Matsumoto et al. 2017). Multi-omics analyses including genomics (Nomaguchi et al. 2018; Tanaka et al. 2015), transcriptomics (Nomaguchi et al. 2018), proteomics (Nojima et al. 2013; Nonoyama et al. 2019), and metabolomics (Liang et al. 2015) have been performed for this diatom, and the biological factors involved in its excellent lipid productivity have been partially elucidated. However, to discuss centromere features on each chromosome, we have to point out the shortcomings of our previous analyses based on genomic information obtained from pyrosequencing, which generated short sequence reads (Tanaka et al. 2015). In general, short-read sequencing technologies such as pyrosequencing and Illumina sequencing generate relatively accurate but poorly contiguous assemblies, which tend to contain unresolved repeat regions and unlinked contigs or scaffolds (Tyson et al. 2018). In our case, the previous analysis generated an F. solaris draft genome assembly containing 117 gaps with unresolved sequences and 53 out of 295 scaffolds unlinked to the hypothetically determined pseudo-chromosomes (Tanaka et al. 2015). In addition, only 9 out of 84 pseudo-chromosomes achieved telomere-to-telomere resolution. These facts strongly indicate that the previously assembled pseudo-chromosomes did not fully represent the entire chromosome structures of the F. solaris genome, which is essential information for identifying the centromere features on each chromosome.

To address this issue, in this study, we re-sequenced the F. solaris genome using an Oxford Nanopore Technologies (ONT) sequencer, MinION. Nanopore sequencing is an emerging technology that allows particularly long-read sequencing and consequently leads to high contiguity. Re-sequencing efforts using MinION have recently been reported to resolve chromosome-scale genome information of model organisms including the nematode Caenorhabditis elegans (Tyson et al. 2018), the seed plant Arabidopsis thaliana (Michael et al. 2018), and, in particular, the diatoms P. tricornutum and T. pseudonana (Filloramo et al. 2021). These studies resolved unmapped sequences and unanchored or misassembled contigs, prompting us to re-examine the genome of F. solaris using MinION. Here, we extracted high-molecular-weight (HMW) genomic DNA from F. solaris and subjected it to nanopore sequencing. The generated contigs were assessed by bioinformatics analyses to resolve a chromosome-scale assembly, in which sequence features of the telomeres and centromeres were detected. Consensus motifs were explored in the putative centromeres of F. solaris. Besides the centromere analyses, nanopore sequencing unveiled previously unresolved genomic features that could be involved in the oleaginous phenotype of F. solaris.

Materials and Methods

Strains and Culture Conditions

The marine diatom F. solaris strain JPCC DA0580 was cultured in half-strength Guillard’s f medium (f/2 medium) (Guillard and Ryther 1962). Cultures were grown for 5 days at 25 °C under continuous illumination of 130 μmol photons/m²/s (photon flux density was measured in the range of 400–700 nm with a luminometer HD2302.01 equipped with a probe LP471PAR, Delta OHM S.r.l, Caselle di Selvazzano, Italy) with 0.8 l/l/min airflow containing 2% CO2. Prior to and during cultivation, Hoechst 33342 (Thermo Fisher Scientific, CA, USA) staining and fluorescence microscopy were used to confirm that the cultures were not contaminated by bacteria.

Genomic DNA Extraction

Cells were centrifuged at 8500 × g for 10 min to obtain wet microalgal cells (approximately 490 mg). The hexadecyltrimethylammonium bromide (CTAB) method was used for DNA extraction. The wet microalgal cells were frozen in liquid nitrogen, ruptured with a mortar and pestle, and suspended in a mixture of 9.5 ml of 10 mM Tris–HCl (pH 8.0), 0.5 ml of 10% sodium dodecyl sulfate (SDS), and 50 μl of 20 mg/ml proteinase K. After incubation for 1 h at 37 °C, 100 μl of 10 mg/ml RNase A (RNase Cocktail Enzyme Mix, 120 U/mg, Applied Biosystems, Thermo Fisher Scientific, CA, USA) was added to the suspension, which was then incubated for 30 min at 37 °C. Then, 5 M NaCl (1.8 ml) and a mixture of 10% CTAB/0.7 M NaCl (1.5 ml) were added to the suspension, and the mixture was incubated for 20 min at 65 °C. Subsequently, an equal volume of phenol/chloroform/isoamyl alcohol (25:24:1) was added, and the suspension was centrifuged at 13,000 × g for 15 min. The aqueous layer was then transferred to a new tube, and an equal volume of chloroform (100%) was added. The suspension was vortexed and centrifuged at 13,000 × g for 10 min. One-tenth volume of 3 M sodium acetate and three volumes of 100% ethanol (room temperature) were added to the collected aqueous layer, and the mixture was incubated at room temperature for 20 min. After centrifugation at 13,000 × g for 30 min, the supernatant was removed, and 15 ml of 70% ethanol (room temperature) was added to the pellet. The pellet was washed and centrifuged at 13,000 × g for 5 min. The pellet was dissolved in 200 µl of nuclease-free water (New England BioLabs, Massachusetts, USA) over 2 days. The extracted genomic DNA was subjected to electrophoresis using a 1% agarose gel. Prior to library preparation, genomic DNA was purified with Ampure XP beads (Beckman Coulter).

Nanopore Sequencing

The DNA library of F. solaris for nanopore sequencing was prepared following the manufacturer’s guidance for the rapid sequencing kit (SQK-RAD004; Oxford Nanopore Technologies (ONT), Oxford, UK), which attaches adapter sequences via transposase activity. The sequencing library for Fistulifera pelliculosa was prepared using the ligation sequencing kit (SQK-LSK308; see Supplementary Information for nanopore sequencing of F. pelliculosa). Sequencing of F. solaris was performed with the nanopore sequencer MinION and a MinION R9.4 flow cell (FLO-MIN106, ONT, Oxford, UK) for 48 h under the control of MinKNOW software. The resulting FAST5 files were base-called using Albacore with parameters for FLO-MIN106. The qualities of the passed reads were analyzed and visualized using NanoStat and NanoPlot (De Coster et al. 2018). The passed reads (Q score ≥ 7) were assembled using Canu (Koren et al. 2017). The assembled sequences were polished using Nanopolish (Loman et al. 2015).

Analysis of the Assembled Contigs

Lengths and GC contents of the assembled contigs were measured using the fx2tab function of seqkit (Shen et al. 2016). Minimap2 (Li 2018) was used to align the MinION read sequences to the genome sequence obtained in our previous pyrosequencing analyses (Tanaka et al. 2015). Minimap2 was also used to align the Illumina reads obtained in our previous study (Tanaka et al. 2015) to the MinION contigs. The alignment results were visualized using the Integrative Genomics Viewer (IGV) (Robinson et al. 2011). The read depth of each contig was analyzed using SAMtools (Li et al. 2009). The contigs were classified into subgenomes and organelle genomes by comparison with those analyzed in our previous studies (Nomaguchi et al. 2018; Tanaka et al. 2015) using D-GENIES (Cabanettes and Klopp 2018). Putative 5S rRNA genes at the termini of contigs 1973 and 1980 were predicted using BLASTn. The circular illustrations highlighting the landscapes of the nuclear and organelle genomes were prepared using ClicO FS (Cheong et al. 2015) and GeSeq (Tillich et al. 2017), respectively.

The centromeres were predicted by the previously proposed method (Diner et al. 2017). Briefly, the numbers of 100-bp windows with a GC content less than or equal to 32% within a 3-kbp window (Nbox32) were counted and plotted along the contigs. GC contents within 100-bp windows were obtained using the sliding and fx2tab functions of seqkit (Shen et al. 2016). Sequence similarities between the predicted centromeres or within the regions containing the tandemly repeated genes were assessed using BLASTn, Clustal Omega (Sievers et al. 2011), and Dotlet (Junier and Pagni 2000). Consensus motifs in the putative centromeres were searched using MEME SUITE (Bailey et al. 2009) with the default settings (motif discovery mode: classic mode, sequence alphabet: DNA/RNA/Protein, site distribution: zero or one occurrence per sequence, number of motifs: 3), and the motifs with E-values smaller than 10⁻¹⁰ were employed in this study. Phylogenetic analysis of lysophosphatidic acid acyltransferase (LPAAT) sequences was performed using MEGA 11 with the neighbor-joining method (Tamura et al. 2021).
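The Nbox32 calculation can be illustrated with a short Python sketch. This is not the pipeline used in this study, which relied on seqkit; the function names gc_fraction and nbox32 are hypothetical, and the use of non-overlapping 100-bp windows with a 3-kbp window advanced by one 100-bp window at a time is an assumption, because the step sizes are not stated here.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def nbox32(seq, small=100, large=3000, threshold=0.32):
    """Count low-GC small windows inside each large window.

    Assumptions (not taken from the source): small windows do not overlap,
    and the large window advances by one small window per step.
    Returns a list of (start_position, count) tuples.
    """
    # 1 if the 100-bp window has GC <= threshold, otherwise 0
    flags = [
        1 if gc_fraction(seq[i:i + small]) <= threshold else 0
        for i in range(0, len(seq) - small + 1, small)
    ]
    per_large = large // small  # number of small windows per large window
    return [
        (start * small, sum(flags[start:start + per_large]))
        for start in range(0, len(flags) - per_large + 1)
    ]


# Toy contig: GC-rich flanks around a 3-kbp AT-rich block
toy_contig = "GC" * 1500 + "AT" * 1500 + "GC" * 1500
profile = nbox32(toy_contig)
peak_position, peak_count = max(profile, key=lambda item: item[1])
print(peak_position, peak_count)
```

In this toy contig the count peaks where the 3-kbp window coincides with the AT-rich block, which is the kind of peak interpreted as a candidate centromere.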

PCR Amplification of Lipogenesis-Related Genes with a Tandemly Repeated Structure

The presence of a lipogenesis-related gene (gene ID: fso:g9516, having a tandemly repeated structure) was confirmed using PCR. Primers were prepared based on the sequence of a fso:g9516 and synthesized by Invitrogen (Massachusetts, USA) (Supplementary Table S1). As the DNA polymerase for PCR, PrimeSTAR GXL DNA Polymerase (Takara Bio Inc., Shiga, Japan) was used.

Results

Nanopore Sequencing and Assembly of Fistulifera solaris Genome

HMW genomic DNA of F. solaris was prepared with the CTAB method (Supplementary Fig. S1). A single run of nanopore sequencing of the extracted HMW DNA generated approximately 4.1 × 10⁵ passed reads (~ 4.8 Gbp). Because our previous study using pyrosequencing suggested that the size of the F. solaris nuclear genome was approximately 49.7 Mbp, the coverage depth was estimated to be 97-fold. The distributions of read length and quality (Q-score) of the passed reads are shown in Supplementary Fig. S2. The average and maximal read lengths were 11.6 and 115.8 kbp, respectively (Table 1). We found 5 and 6634 reads longer than 100 and 50 kbp, respectively. The average Q-score of the passed reads was 10.3, indicating a read error rate of approximately 9.3%.
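The two derived figures in this paragraph follow from simple arithmetic, sketched below in Python. The function names are hypothetical; the coverage is taken as total sequenced bases divided by genome size, and the error rate uses the standard Phred relation between a Q-score and an error probability.

```python
def coverage_depth(total_bases, genome_size):
    """Estimated coverage depth: total sequenced bases / genome size."""
    return total_bases / genome_size


def phred_to_error(q_score):
    """Convert a Phred quality score to an error probability: 10^(-Q/10)."""
    return 10 ** (-q_score / 10)


# Values reported in the text: ~4.8 Gbp of passed reads, 49.7 Mbp genome, Q = 10.3
depth = coverage_depth(4.8e9, 49.7e6)
error_rate = phred_to_error(10.3)
print(f"coverage ~ {depth:.1f}x, error rate ~ {error_rate * 100:.1f}%")
```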

Table 1 Summary of the nanopore sequencing of the genome of Fistulifera solaris, and comparison to other model organisms

Assembly of these reads was conducted with Canu (Koren et al. 2017), followed by sequence polishing using Nanopolish (Loman et al. 2015). As a result, a total of 62 contigs (~ 51.7 Mbp, Table 1) ranging from 3.4 kbp to 2.7 Mbp, with an N50 of 1.4 Mbp, were generated (Supplementary Fig. S3A).
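The N50 quoted here is the length of the shortest contig among the longest contigs that together cover at least half of the total assembly length. A minimal Python sketch of this definition is given below; the function name n50 and the toy lengths are illustrative and are not the contig lengths of this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 for empty input)."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy example with hypothetical contig lengths
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```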

Among the 62 contigs, 2 contigs (contigs 139 and 145) showed high sequence identity with the chloroplast genome, and their GC% (approximately 32%, Supplementary Fig. S3B) matched that of the chloroplast genome (Tanaka et al. 2011). We found that another contig (the smallest contig, contig 208, GC% = 54%) showed relatively low sequence identity to the chloroplast genome (77.3%), while it showed similarities (approximately 90%) to partial sequences of Sphingomonas spp. genomes, implying potential contamination. Three contigs (contigs 1, 4, and 6) showed high sequence identities to the mitochondrial genome, and their GC% (approximately 28%, Supplementary Fig. S3B) matched that of the mitochondrial genome (Tang and Bi 2016). Other than these 6 contigs, the remaining 56 contigs showed high sequence identity with the nuclear genome of F. solaris and had a GC% (approximately 46%, Supplementary Fig. S3B) consistent with the nuclear genome of F. solaris (Tanaka et al. 2015). Among the 56 contigs corresponding to the nuclear genome, 11 had short lengths ranging from 34 to 60 kbp (Supplementary Fig. S3A) and showed high sequence identity with other, longer contigs (Supplementary Table S2). It remains unclear why such redundant contigs were generated during the assembly process. However, the sequences of these short contigs were almost entirely contained within the corresponding longer contigs (Supplementary Table S2), and thus we concluded that they could be ignored in further analyses. We regarded the remaining 45 contigs as corresponding to the nuclear genome of F. solaris.

It should be noted that F. solaris has two distinct subgenomes, denoted Fso_h and Fso_l (Nomaguchi et al. 2018), and is the sole example of a unicellular microalgal allopolyploid reported thus far. In our previous studies, pyrosequencing of the F. solaris genome generated 295 scaffolds constituting 84 hypothetical pseudo-chromosome structures (Tanaka et al. 2015), among which 120 and 122 scaffolds were classified into the Fso_h and Fso_l subgenomes, respectively, each composed of 42 hypothetical pseudo-chromosomes (Nomaguchi et al. 2018). The remaining 53 scaffolds could not be classified into either subgenome because they were not included in the hypothetical pseudo-chromosome structures. When we aligned the 295 pyrosequencing scaffolds with the 45 MinION contigs, 115 of the 120 scaffolds belonging to the Fso_h subgenome aligned to 23 MinION contigs, and 121 of the 122 scaffolds belonging to the Fso_l subgenome aligned to 22 MinION contigs (Fig. 1). Therefore, we defined these sets of MinION contigs as the Fso_h and Fso_l subgenomes predicted by nanopore sequencing, respectively. The 23 MinION contigs classified as the Fso_h subgenome also included one scaffold formerly belonging to the Fso_l subgenome and 25 scaffolds formerly not classified into either subgenome (141 scaffolds in total). The 22 MinION contigs classified as the Fso_l subgenome also showed sequence similarity with 5 scaffolds formerly belonging to the Fso_h subgenome and 28 scaffolds formerly not classified into either subgenome (154 scaffolds in total). When we compared the sequences of the 23 and 22 MinION contigs classified as the Fso_h and Fso_l subgenomes, these sets of MinION contigs showed one-to-one homoeologous correspondence, with one exception: contigs 1973 and 1980 (Fso_h) together corresponded to the homoeologous counterpart of contig 76 (Fso_l), and thus we considered these two contigs to be parts of a single DNA molecule. The length and sequence of the gap between contigs 1973 and 1980 remain to be determined. We found putative 5S rRNA genes at the termini of contigs 1973 and 1980 (Supplementary Fig. S4), suggesting that repeat structures related to 5S rRNA genes, which are frequently found in eukaryotes (Long and Dawid 1980), might have prevented Canu from joining these contigs.

Fig. 1

Dot plot analysis by aligning the MinION contigs obtained in this study with previously obtained pyrosequencing scaffolds. Among 45 MinION contigs corresponding to the nuclear genome, 23 and 22 contigs were classified into Fso_h and Fso_l subgenomes of the allopolyploid genome of F. solaris, respectively. Among 295 scaffolds previously obtained by pyrosequencing, 141 and 154 scaffolds (both of which included the scaffolds previously classified into Fso_h, Fso_l, or neither subgenomes) showed sequence similarity to the MinION contigs in Fso_h and Fso_l subgenomes, respectively

Figure 2 shows a total of 22 pairs of these contigs as the homoeologous chromosomes predicted in this study (second outer circle), along with 84 types of the hypothetical pseudo-chromosomes previously predicted (first outer circle) (Tanaka et al. 2015). Several hypothetical pseudo-chromosomes predicted by pyrosequencing were included in single MinION contigs, suggesting the high contiguity of MinION contigs obtained in this study.

Fig. 2

Circular view of the landscape of the allopolyploid genome of Fistulifera solaris as revealed by MinION sequencing. The outermost circle (orange and blue tiles without outlines) shows the pyrosequencing scaffolds previously obtained. The second outermost circle shows the MinION contigs obtained in this study (orange and blue tiles with outlines). Telomeric repeats found in the MinION contigs are shown by green triangles. The chromosome numbers in the two subgenomes (Fso_h and Fso_l) are marked on these circles. The third circle represents the read depth along the MinION contigs. The lines in the innermost circle represent the positions of homoeologous gene pairs on the MinION contigs

Putative Centromeres in the Chromosomes of Fistulifera solaris

To assess the contiguity and completeness of the assembled contigs as the chromosomes of F. solaris, we examined whether the contigs had telomeres at both termini of their sequences. Telomeric repeats (CCCTAA, green triangles on the second outer circle of Fig. 2) (Bowler et al. 2008) were found at 82 positions out of 88 termini (~ 93%) of the 44 chromosomes. All chromosomes have telomeric repeats at either or both termini. This result suggested the high contiguity and completeness of the contigs obtained in this study. We propose that these sets of MinION contigs represent the chromosome structure of the allopolyploid genome of F. solaris (Fig. 2).
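A check for telomeric repeats at contig termini can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in this study: the function names, the 200-bp terminal window, and the requirement of at least three consecutive copies of CCCTAA (or its reverse complement TTAGGG) are all hypothetical choices.

```python
import re

TELOMERE_MOTIF = "CCCTAA"  # telomeric repeat unit given in the text


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def has_telomere(end_seq, min_copies=3):
    """True if the sequence contains >= min_copies tandem copies of the motif
    on either strand (min_copies is an assumed cutoff)."""
    end_seq = end_seq.upper()
    for motif in (TELOMERE_MOTIF, reverse_complement(TELOMERE_MOTIF)):
        if re.search(f"(?:{motif}){{{min_copies},}}", end_seq):
            return True
    return False


def telomere_ends(contig, window=200, min_copies=3):
    """Return (start_has_telomere, end_has_telomere) for one contig,
    inspecting an assumed terminal window at each end."""
    return (
        has_telomere(contig[:window], min_copies),
        has_telomere(contig[-window:], min_copies),
    )


# Toy contigs: one capped at both ends, one without telomeric repeats
capped = "CCCTAA" * 5 + "ACGT" * 100 + "TTAGGG" * 5
uncapped = "ACGT" * 200
print(telomere_ends(capped), telomere_ends(uncapped))

# Fraction of termini with telomeric repeats reported in the text
fraction_with_telomere = 82 / 88
print(f"{fraction_with_telomere:.1%}")
```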

Identification of the entire chromosome structure of F. solaris allowed us to predict centromeres in each chromosome (Supplementary Fig. S5). Among the proposed chromosomes of F. solaris, only a limited number had clear Nbox32 peaks. We hypothesized that the positions of these Nbox32 peaks indicated the centromeres. In addition, homoeologous chromosome pairs were presumably derived from distinct parental species that are phylogenetically closely related, and thus we assumed that the positional relationships of the putative centromeres and neighboring genes should be similar between homoeologous chromosome pairs. Among the chromosomes showing Nbox32 peaks, 8 homoeologous chromosome pairs (chromosomes 1, 3, 10, 11, 13, 14, 20, and 22) possessed the peaks at similar positions (red triangles in Supplementary Fig. S5). We confirmed the positional relationships and found that the proposed centromeres on each homoeologous chromosome pair and the neighboring genes were exactly conserved in 7 of the 8 chromosome pairs, the exception being chromosome 13 (Supplementary Fig. S6). Although a positional shift of the putative centromeres was found in chromosome 13, a similar array of genes surrounds the proposed centromeres. Furthermore, the putative centromere regions were narrowed by removing the regions generating transcription signals based on our previous transcriptome data (Tanaka et al. 2015), because, in general, eukaryotic centromeres are non-transcribed regions (Nakamura et al. 2018) (Supplementary Fig. S6). Eventually, we determined the 16 sequences with low GC contents (36 ~ 42%) in the 8 chromosome pairs as putative centromeres in the F. solaris genome (Supplementary data 1).

The putative centromeres of Chr 1_l, 13_l, and 22_h contain small-scale repeated elements (Supplementary Fig. S7), but none of the predicted centromeres appear to contain the highly repeated elements that are typical of other eukaryotic genomes (Willard and Waye 1987). A dot-plot analysis to assess the sequence similarities between the putative centromeres indicated that some of them showed sequence similarity to their homoeologous counterparts (i.e., Chr 1, 3, and 14, Supplementary Fig. S7), whereas, other than these, large-scale conserved sequences were not found within the proposed centromeres in F. solaris. Nonetheless, we further investigated the putative centromere sequences using MEME SUITE (default settings, see the Materials and Methods section) to find other small-scale consensus motifs. As a result, three putative consensus motifs (WTTTATTCCTAATTTCCTAAAGTYAGAATGYAATTTTGACATTCGACTG, TCCWTCWYTWSGATKHYRMHRAMKCAWRSWAACHGAMCRGWSCARRKAAA, and AWATGHAAAHRMAMAAWGGAAAATTCAGTCGAATRTCAADA) were discovered in 11 out of the 16 putative centromere sequences (Fig. 3; Supplementary Fig. S8). Specific patterns in the positions of these motifs (i.e., distance, order, and strand preference) were not found in the present study. The GC% values of motifs 1, 2, and 3 are 31.2 ± 2.8%, 41.7 ± 7.7%, and 30.8 ± 3.7%, respectively, suggesting that motifs 1 and 3 are particularly AT-rich.

Fig. 3

Sequence features of the centromeres putatively identified from the MinION contigs of F. solaris. A The 3 putative consensus motifs discovered in the centromeres. B Distributions of the consensus motifs in the centromeres. No motif was found in the putative centromeres of chromosomes 10_h, 13_h, 13_l, 20_h, and 20_l

Newly Discovered Genomic Features Potentially Related to High Oil Production

The previous pyrosequencing might have overlooked unresolved genomic regions that are related to the oleaginous phenotype of F. solaris. Furthermore, the limitations of the previous pyrosequencing might have hindered the elucidation of the complex genomic features of this diatom, in particular the allopolyploidy caused by interspecies hybridization between distinct parental species (Nomaguchi et al. 2018). In the present study, besides the putative centromere analyses, we investigated the previously unresolved genomic features (including organelle genomes, see Supplementary Information) and discovered that some of them could potentially be related to the high oil production of F. solaris.

We first assessed the ploidy distribution in F. solaris. We examined the coverage depth along the chromosomes by aligning the MinION reads to the assembled contigs (third circle in Fig. 2). Overall, all chromosomes showed similar depths (average 84, Supplementary Fig. S9), with the exception of chromosome 4, suggesting that the paired chromosomes derived from each parental species are contained in the F. solaris genome at a 1:1 ratio.

As an exception, chromosome 4 in subgenome Fso_l had a markedly high sequencing depth. We plotted the depth values at each base along the chromosomal positions and found a region (~ 135 kbp) with remarkably high depth, hereinafter referred to as the “high depth region” (Supplementary Fig. S10). The average depth of this high depth region was 759, while that of the other, ordinary regions in the same chromosome was 80, comparable to other chromosomes (Supplementary Fig. S10A). F. solaris genomic DNA was also sequenced on the PacBio RSII platform to confirm that this read depth distribution was not an artifact of the MinION platform. Although the PacBio data did not yield a highly contiguous assembly, the high depth region was also detected in the PacBio data at exactly the same position in chromosome 4 (Supplementary Fig. S10A).

The high depth region was not found on the homoeologous chromosome 4 in subgenome Fso_h (Fig. 2). To examine how the high depth region arose, we analyzed the sequence and mapping data at the boundary between the ordinary depth region and the high depth region on the assembled contig (Supplementary Fig. S10B) and found a telomeric repeat-like sequence in the middle of the assembled contig (Supplementary Fig. S10C). Subsequently, we focused on the mapping of the MinION reads to the assembled sequence and found that the MinION reads mapped to the boundary could be classified into two types: reads with and reads without telomeric repeats (Supplementary Fig. S11). The reads without telomeric repeats were mapped to both the ordinary and high depth regions and showed many sequence variations relative to the assembled sequence. By contrast, the mapping positions of the reads with telomeric repeats started from the boundary of the high depth region.

A possible explanation for these analytical results is the existence of independent mini-chromosomes with high sequence identity to the high depth region on chromosome 4 in the Fso_l subgenome (Supplementary Fig. S12A). This hypothetical mini-chromosome (~ 135 kbp, containing 47 genes, Supplementary Table S3) showed an approximately eightfold copy number compared with chromosome 4. The presence of an abnormal number of chromosomes suggests aneuploidy in the F. solaris genome. Although a significant Nbox32 peak was not detected, the 3 types of putative consensus motifs were found within this high depth region (Supplementary Fig. S12B), supporting the existence of the hypothetical mini-chromosomes because a centromere could be required for their maintenance. Aneuploidy in diatom genomes was previously suggested in the genome of Thalassiosira weissflogii (Von Dassow et al. 2008). Nonetheless, this is the first report of aneuploidy involving a partial region of a diatom chromosome. Genes related to photosynthesis, including those encoding chloride channel 7 and tetratricopeptide-like helical domain-containing protein, exist in this region (Supplementary Fig. S12 and Supplementary Table S3).
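One way to relate the reported read depths to the approximately eightfold copy number is sketched below. This is an interpretation and not necessarily the calculation performed in this study: it assumes that reads from both the chromosomal copy and the mini-chromosomes map to the high depth region, so that the chromosomal baseline depth is subtracted before taking the ratio. The function name and the subtract_chromosomal flag are hypothetical.

```python
def relative_copy_number(region_depth, baseline_depth, subtract_chromosomal=True):
    """Estimate copies of an extra element relative to the host chromosome.

    Assumption: the depth of the region is the sum of the chromosomal depth
    and the depth contributed by the extra copies, so the baseline is
    subtracted first when subtract_chromosomal is True.
    """
    if subtract_chromosomal:
        excess_depth = region_depth - baseline_depth
    else:
        excess_depth = region_depth
    return excess_depth / baseline_depth


# Depths reported in the text: 759 in the high depth region, 80 elsewhere
mini_chromosome_ratio = relative_copy_number(759, 80)
raw_ratio = relative_copy_number(759, 80, subtract_chromosomal=False)
print(f"{mini_chromosome_ratio:.1f} (baseline subtracted), {raw_ratio:.1f} (raw ratio)")
```

Under this assumption the ratio is about 8.5, which is in line with the approximately eightfold value stated above; the raw ratio without subtraction is about 9.5.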

In addition to the above-mentioned aneuploidy, we also found some tandemly repeated genes that were first revealed by long-read nanopore sequencing. A particularly striking feature that may be related to the oleaginous phenotype is a series of five tandem repeats of the 1-acyl-sn-glycerol-3-phosphate lysophosphatidic acid acyltransferase gene (LPAAT, fso:g9516) (green arrows in Fig. 4A), which encodes an enzyme that catalyzes the transfer of an acyl chain to lysophosphatidic acid and is essential for assembling glycerolipids including TAGs. The tandemly repeated LPAAT genes are located between the genes encoding bromodomain-containing protein (fso:g9515) and aarF domain-containing kinase (fso:g12324) in chromosome 9 belonging to subgenome Fso_h. Eight read sequences covered the entire region containing these tandemly arrayed genes (Fig. 4B). In addition, PCR targeting the tandemly repeated LPAAT genes and their neighboring regions generated amplification patterns that were consistent with the MinION assembly (Supplementary Fig. S13). These results indicate that this repeat feature is bona fide and not the result of misassembly. Results of functional estimation and expression analysis of the arrayed LPAAT genes are described in the Supplementary Information (Supplementary Figs. S14, 15, 16, and 17).

Fig. 4

Tandemly arrayed genes of 1-acyl-sn-glycerol-3-phosphate lysophosphatidic acid acyltransferase (LPAAT) found on chromosome 9 in the Fso_h subgenome (Chr. 9_h) of Fistulifera solaris. A Arrangement of the tandemly arrayed LPAAT genes and neighboring genes in Chr. 9_h. The tandemly arrayed LPAAT genes were not found in the homoeologous chromosome Chr. 9_l or in the corresponding contigs of Fistulifera pelliculosa CCMP543, a species closely related to F. solaris. B Eight sub-reads obtained by MinION sequencing covered the region of the tandemly arrayed LPAAT genes (shown in purple) in Chr. 9_h

In addition to the tandemly repeated LPAAT, we discovered tandemly repeated genes of enoyl-[acyl-carrier-protein (ACP)] reductase (EAR) as an additional feature potentially related to the oleaginous phenotype. EAR (EC1.3.1.9) functions in the fatty acid elongation process by reducing trans-2-enoyl-ACP to generate acyl-ACP, in which an NADPH is consumed. EAR genes were not fully identified in F. solaris genome in our previous study (Tanaka et al. 2015), and thus, this new discovery now completes the fatty acid generation pathway in this organism.

Discussion

Nanopore sequencing of the F. solaris genome on a MinION platform generated assembled contigs with high contiguity and completeness, which correspond to 44 chromosomes (22 chromosomes for each subgenome, Fso_h and Fso_l). This chromosome-scale assembly allowed us to predict centromeres based on the assumption that each diatom chromosome has one centromere (Diner et al. 2017). In fact, the centromere prediction method proposed by Diner et al. generated clear Nbox32 peaks in the P. tricornutum genome, which were often (but not always) found once per chromosome. The predicted regions were supported by chromatin immunoprecipitation combined with Illumina sequencing (ChIP-seq). We applied the same method and putatively identified 16 centromere sequences exhibiting high AT-richness (GC% ranging from 36 to 42%). This feature is consistent with those of other eukaryotic organisms, including the model diatom P. tricornutum (Talbert and Henikoff 2020). The MEME Suite allowed us to identify sequence motifs frequently found in the putative F. solaris centromeres (Fig. 3). A previous study (Diner et al. 2017) attempted to find consensus motifs in the centromeres of P. tricornutum, but none was discovered. Therefore, this is the first study to propose consensus motifs in diatom centromeres. Consensus motifs in centromeres have been reported in several eukaryotes, including S. cerevisiae (Carbon and Clarke 1984) and a fungus (Navarro-Mendoza et al. 2019), and are involved in centromere functions such as the assembly of kinetochore proteins (Camahort et al. 2007; Furuyama and Biggins 2007), although not all eukaryotes have consensus motifs in their centromeres. The consensus motifs found in the putative F. solaris centromeres show some sequence variation (Supplementary Fig. S8). Sequence variations were also observed in the motifs found in the centromeres of different eukaryotic organisms (Navarro-Mendoza et al. 2019). This might be attributable to the rapid evolution of eukaryotic centromeres (Talbert and Henikoff 2020). In addition, some chromosomes (for example, chromosomes 1_h, 2_h, 2_l, 4_h, 8_l, 9_h, 11_h, 12_l, 14_l and 19_l) have two or more noticeable peaks at separate positions along the chromosome. A similar result was shown in the centromere prediction for P. tricornutum (for example, chromosomes 1, 20, 23 and 30; Diner et al. 2017). This could be attributable to the setting of the threshold value (32%) used to calculate Nbox32. Although the previous study demonstrated that this threshold value was suitable for predicting some diatom centromeres, it remains unclear whether 32% is optimal for all diatom species. Determining more versatile thresholds might require comprehensive studies that draw on more diatom genomes with chromosome-scale assemblies. The functions of these motifs and their distribution outside of the putative centromeres also remain unclear. Therefore, at this point, we do not conclude that these motifs strictly define the diatom centromeres. Further analytical and experimental efforts are needed to fully elucidate the characteristics of diatom centromeres. Insights into functional centromeres in diatoms will be useful not only for fundamental biology studies dealing with diatom evolution, but also for biotechnological applications, in particular for the rational design of artificial chromosomes for efficient genetic engineering of diatoms.
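To illustrate how the choice of GC threshold can change which regions appear as peaks, the following is a minimal Python sketch of a low-GC window scan. It is not the Nbox32 implementation of Diner et al. (2017): the exact Nbox32 definition is not restated in this section, so the 100-bp window, the 3-kbp region size, and the function names (gc_percent, low_gc_windows, low_gc_count_profile) are illustrative assumptions. Only the 32% threshold is taken from the text.

```python
# Hypothetical sketch: count low-GC windows per region of a chromosome.
# Window size (100 bp) and region size (3000 bp) are assumptions, not
# values taken from the Nbox32 method itself.

def gc_percent(seq):
    """Return GC% of a DNA string, or None if it has no A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return None  # e.g. a window made only of N
    return 100.0 * gc / (gc + at)


def low_gc_windows(seq, window=100, threshold=32.0):
    """Yield (start, gc) for non-overlapping windows with GC% below threshold."""
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_percent(seq[start:start + window])
        if gc is not None and gc < threshold:
            yield start, gc


def low_gc_count_profile(seq, window=100, region=3000, threshold=32.0):
    """Map each region start to the number of low-GC windows it contains."""
    counts = {}
    for start, _ in low_gc_windows(seq, window, threshold):
        region_start = (start // region) * region
        counts[region_start] = counts.get(region_start, 0) + 1
    return counts


# Toy chromosome: 3 kbp of GC-rich sequence followed by 3 kbp of AT-only sequence.
toy = "GC" * 1500 + "AT" * 1500
profile = low_gc_count_profile(toy)

# A region with 35% GC is missed at a 32% threshold but flagged at 36%.
borderline = ("G" * 35 + "A" * 65) * 30
strict = low_gc_count_profile(borderline, threshold=32.0)
relaxed = low_gc_count_profile(borderline, threshold=36.0)
```

In this toy case, a region whose GC content sits slightly above the cutoff disappears from the profile entirely, which is one way a fixed threshold could produce missing or additional peaks in a species with a different base composition.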

Furthermore, read depth profiling along the chromosomes suggests aneuploidy of the F. solaris genome. Ratios of homoeologous chromosomes are not necessarily 1:1 in allopolyploid organisms owing to complicated hybridization events. For example, Okuno et al. revealed that strains of the lager brewing yeast Saccharomyces pastorianus (an allopolyploid hybrid between S. cerevisiae and S. eubayanus) possess one set of S. cerevisiae-type chromosomes (haploid) and two or three sets of S. eubayanus-type chromosomes (diploid or triploid) (Okuno et al. 2016). Uneven sequence depths in the middle of chromosomes were also found in S. pastorianus, although the underlying mechanism was not discussed (Okuno et al. 2016). In the F. solaris genome, an unusual depth was found not in the middle region but at the terminal region of Chr. 4_l, suggesting the existence of mini-chromosomes present in ~8 copies. Since this region contains photosynthesis-related genes (Supplementary Table S3), multiplication of these genes caused by aneuploidization might contribute to the reinforcement of photosynthesis and protein production, leading to robust growth of this diatom.
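The reasoning from read depth to copy number can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in this study: per-base depths are averaged in fixed windows, the median window depth is assumed to represent the baseline copy level, and the function names and the toy depth values (40x and 320x) are hypothetical.

```python
from statistics import median

# Hypothetical sketch: infer relative copy number from per-base read depth.
# The window size and the use of the median as the baseline are assumptions.

def window_means(depths, window):
    """Average per-base depth over non-overlapping windows."""
    return [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]


def relative_copy_number(window_depths, baseline_copies=1):
    """Scale each window depth by the median window depth (assumed baseline)."""
    baseline = median(window_depths)
    return [baseline_copies * d / baseline for d in window_depths]


# Toy chromosome: uniform 40x depth with a terminal segment at 320x.
toy_depths = [40] * 900 + [320] * 100
means = window_means(toy_depths, window=100)
copies = relative_copy_number(means)
```

With these toy values, the terminal window comes out at eight times the baseline, which is the kind of signal that would be read as a segment present in about eight copies. Real data would also need handling of mapping artifacts and repeats, which this sketch does not attempt.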

In addition, tandemly repeated genes of LPAAT, involved in glycerolipid synthesis, and EAR, involved in fatty acid synthesis, were discovered in the Fso_h subgenome and in both subgenomes, respectively. As F. solaris is an allopolyploid organism, we searched for a homoeologous LPAAT gene of fso:g9516. However, no homoeologous gene was found in the Fso_l subgenome (Fig. 4A). We found the homoeologous genes (fso:g8026 and fso:g8024 in Fso_l) of the neighboring genes (fso:g9515 and fso:g1232 in Fso_h), but only fso:g8025 lies between fso:g8026 and fso:g8024. These data suggested two possibilities: (1) the Fso_l subgenome lost the homoeologous gene after the hybridization event through diploidization, or (2) the ancestor species providing the Fso_h subgenome obtained these tandemly repeated genes prior to the hybridization event. To distinguish between these possibilities, we sequenced a closely related diatom, F. pelliculosa CCMP543, using MinION (Supplementary Table S4) and found that the contigs contain the homologs of the neighboring genes (i.e., fso:g9515 and fso:g12324). However, a homolog of fso:g9516 was not found between the homologs of the neighboring genes (Fig. 4A; Supplementary Fig. S18), suggesting that the tandemly repeated LPAAT genes might be a specific feature of the Fso_h subgenome derived from the ancestor species. By contrast, tandemly repeated EAR genes were found in both the Fso_h (chromosome 2_h, contig 1952) and Fso_l (chromosome 2_l, contig 0014) subgenomes, as well as in the genome of the closely related pennate diatom F. pelliculosa (Supplementary Fig. S19A). The EAR gene of the model pennate diatom P. tricornutum is not tandemly repeated, although the arrangement of the neighboring genes is well conserved (Supplementary Fig. S19A). These data suggest that tandemly repeated EAR genes are not common in pennate diatoms, but might be conserved in the genus Fistulifera. Alignment of the four EAR proteins from F. solaris, among which EAR1 (fso:g16257) and EAR3, and EAR2 (fso:g16258) and EAR4, are homoeologous pairs, revealed identical amino acid sequences except for the extended C-terminal regions (Supplementary Fig. S19B). DNA sequence identities between EAR1 and EAR3 and between EAR2 and EAR4 are 95% and 94%, respectively (Supplementary Fig. S19C), suggesting that the EAR homoeologous pairs have diverged relatively little compared with other homoeologous pairs (the median identity across all homoeologous pairs is 91%). In general, allopolyploid organisms take advantage of the functional plasticity of their genomes, which express similar but not identical proteins from homoeologous genes derived from distinct parental species. In contrast to this general notion, F. solaris might conserve the EAR genes to enhance fatty acid synthesis, which supports the abundant production of lipids in this organism.
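For readers who wish to reproduce this kind of comparison, percent identity between two already-aligned sequences can be computed as in the sketch below. This is an illustrative assumption rather than the method used here: identity definitions differ between tools (for example, in how gap columns are counted), and this version simply ignores any column containing a gap. The function name and the toy sequences are hypothetical.

```python
# Hypothetical sketch: percent identity of two pre-aligned sequences.
# Gap columns are excluded from the comparison; other tools may count them.

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return identity (%) over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy alignments (not real EAR sequences).
ungapped = percent_identity("ACGTACGTAC", "ACGTACGTAA")  # 9 of 10 columns match
gapped = percent_identity("ACGT-ACGT", "ACGTTACGA")      # 7 of 8 ungapped columns match
```

Applying such a function to every homoeologous pair and taking the median of the results would give a genome-wide reference value against which individual pairs, such as the EAR pairs, can be compared.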

Conclusion

A single run of the MinION nanopore sequencer revealed the previously unresolved chromosome structure of the F. solaris genome, which is composed of 44 chromosomes. We determined that, among the 44 chromosomes, 22 belong to the Fso_h subgenome and the remaining 22 belong to the Fso_l subgenome. Identification of the entire chromosome structure allowed us to find 16 putative centromere sequences, leading to the discovery of 3 putative consensus motifs in the predicted diatom centromeres. We also found a putative mini-chromosome and tandem gene repeats potentially related to the oleaginous phenotype of this diatom. Together, these findings improve our understanding of the molecular mechanisms underlying superior oil accumulation in F. solaris and are also useful for developing novel genetic tools, including artificial chromosomes for large-scale metabolic engineering.