Background & Summary

The Fusarium oxysporum species complex is a group of soil-borne plant pathogens with a broad host range worldwide, causing severe economic losses in valuable crops such as cotton, tomato, banana and watermelon1,2,3,4,5. This species complex consists of various formae speciales (f. sp.) that infect hundreds of plant species, causing vascular wilt diseases4,6,7,8,9. Its members are also natural producers of various toxic metabolites, such as fusaric acid, posing threats to plant and human health10,11. Furthermore, they can act as disease agents in immunocompromised humans and other mammals12,13. Thus, elucidating the molecular and evolutionary mechanisms underlying the pathogenesis of F. oxysporum is crucial for both agricultural safety and public health. F. oxysporum strain Fo5176 was first isolated from Brassica oleracea (cabbage)14,15 and is pathogenic to multiple ecotypes of Arabidopsis thaliana16. The Arabidopsis-Fo5176 pathosystem has been established to study host-pathogen interactions16,17. F. oxysporum has a dynamic genome organization, containing conserved “core” chromosomes and lineage-specific (LS) “accessory” chromosomes. The accessory chromosomes are typically repeat-rich (>74%) due to the enrichment of various transposons18, making them extremely challenging to assemble accurately and completely. To date, no complete, gap-free genome assembly of a pathogenic F. oxysporum has been reported. Although a chromosome-level genome sequence of the Fo5176 strain was assembled using PacBio Sequel long reads and Illumina short reads, it still has many gaps, leading to incomplete and incorrect annotation of genes17,19. Moreover, telomere and centromere sequences were not reported in previous assemblies, leaving these complex genomic regions unknown. A gap-free reference genome resource for Fo5176 would provide new insights into the molecular and evolutionary mechanisms underlying fungal growth, development, pathogenicity and mycotoxin production.

To produce a gapless reference genome of Fo5176, we extracted genomic DNA from fungal mycelia and generated 6.27 Gb of continuous long reads (CLR) (~90 × coverage) from the PacBio single-molecule real-time (SMRT) sequencing platform, and 2.40 Gb of PacBio circular consensus sequencing (CCS) high-fidelity (HiFi) reads (34.17 × coverage) (Table 1). We also used a previously reported Hi-C (high-throughput chromatin conformation capture sequencing) dataset17 to anchor contigs onto chromosomes. During the genome assembly process, the raw CCS HiFi reads of Fo5176 were assembled with four long-read assemblers: Hifiasm20, HiCanu21, Flye22, and NextDenovo23. The CLR reads were first corrected by HiCanu and then assembled by Hifiasm. All draft genome assemblies were then merged one by one using quickmerge after being individually polished by NextPolish24. The raw contigs were then scaffolded and corrected using Hi-C reads with the Juicer25/3D-DNA pipeline26. The final assembly contained 18 chromosomes (Fig. 1A) with a contig N50 of 4.37 Mb, a significant improvement (+29.2%) over the previous assembly by Fokkens et al. (hereafter Fokkens assembly)17 (Table 2). The number of chromosomes was determined by centromeric interaction regions detected in the Hi-C contact map (Fig. 1A)27. Consistent with this, each chromosome was represented by a single contig, indicating a gapless assembly. Comparing the two assemblies using the D-GENIES dotplot28, we found that Chr19 and Chr17 of the previous assembly were merged into a single chromosome in the new assembly (Fig. 2A–C), which was supported by mapped HiFi reads spanning the junction (Fig. 2D).

Table 1 The statistics for sequencing data output of Fusarium oxysporum strain Fo5176.
Fig. 1

Overview of the gap-free genome assembly and annotation of Fusarium oxysporum strain Fo5176. (A) Genome-wide Hi-C contact map showing the interaction matrices among the 18 chromosomes. (B) Circular plot showing the gene features of the Fo5176 genome. Tracks a to h: (a) ideograms of the 18 chromosomes with their lengths, in which the highlighted lines represent the genomic locations of secondary metabolite gene clusters; (b) GC content; (c) gene density (black: low, red: high); (d) exon density (green: low, yellow: high); (e) histogram of repetitive DNA density; (f) simple repeat density; (g) ncRNA (non-coding RNA); and (h) colony morphology of the Fo5176 strain, photographed after 5-day incubation at 25 °C. The densities are calculated in 20 kb windows (b, d and e). Chr: chromosome.

Table 2 Statistics for genome assembly and annotation of Fusarium oxysporum Fo5176. Annotation statistics are based on genome annotation of two assemblies using the same pipeline reported in this study.
Fig. 2

Comparison of two genome assemblies of Fusarium oxysporum strain Fo5176. (A) D-GENIES dotplot alignment comparing the chromosome features between the Fokkens assembly (19 chromosomes, horizontal) and the gap-free assembly (18 chromosomes, vertical). With numbered chromosomes as coordinates, Chr19 and Chr17 of the previous assembly are joined in Chr17 of the new assembly (upper right). (B-C) The correct joining of previous Chr19 and Chr17 (horizontal) into new Chr17 (vertical) is confirmed by genome dotplot (B) and Hi-C contact map (C). (D) IGV visualization of PacBio HiFi reads spanning the genomic junction at Chr17 of the gap-free assembly. The dashed red boxes denote the Chr19 and Chr17 chromosome regions of the Fokkens assembly, respectively, and the track underneath (Fo5176_gapfree_Chr17) represents the Chr17 region from the gap-free assembly. Chr: chromosome. Ann: annotation.

It has been reported that F. oxysporum strain Fo47, a fungal endophyte and biocontrol agent, carries 11 core chromosomes and a single accessory chromosome (Chr7)29. To characterize the chromosomes of Fo5176, we aligned the whole-genome sequences of Fo5176 and Fo47 using D-GENIES dotplot28. The Fo5176 sequences mapped to the 10 core chromosomes of Fo47 ranged in size from 1.9 Mb to 6.5 Mb and comprised Chr01, Chr03, Chr05, Chr06, Chr07, Chr08, Chr09, Chr12, Chr15, and Chr17 (Fig. S1), suggesting a set of 10 entire core chromosomes. Furthermore, a translocation between the chimeric (core/lineage-specific) chromosomes Chr04 and Chr13 of Fo5176 was observed. Among the LS chromosomes, Chr14 of Fo5176 is the counterpart of the LS Chr7 of Fo47. In addition to Chr14, five chromosomes (Chr02, Chr10, Chr11, Chr16, and Chr18) did not align to any Fo47 chromosome (Fig. S1), indicating five additional LS chromosomes in the Fo5176 genome. In total, Fo5176 contained 10 core chromosomes, two chimeric chromosomes (Chr04 and Chr13) and six LS chromosomes.

This gap-free assembly captured the centromeres of all 18 chromosomes (Figs. 3A,B, 4). A total of 20 telomeres were identified; the 16 missing telomeres are the 5′ ends of chromosomes 1, 6, 8, 10, 13, 16 and 17, and the 3′ ends of chromosomes 2, 3, 4, 5, 7, 8, 11, 13 and 16 (Figs. 3C,D, 4). Compared to the Fokkens assembly (67.98 Mb, N50 of 3.38 Mb), the gap-free assembly was larger (69.56 Mb) and had a contig N50 of 4.37 Mb (Table 2). A more comprehensive comparison between the previous assembly and this assembly using GenomeSyn plot30 showed that six gaps on Chr02, Chr03 and Chr11 were closed (Fig. 4). Furthermore, four small inversions on Chr03, Chr07 and Chr14, and one large inversion in the centromere region of Chr18 were corrected in the gap-free Fo5176 genome (Fig. 4).

Fig. 3

Validation of centromere and telomere regions in the Fusarium oxysporum strain Fo5176 genome. (A,B) Features of two representative centromere regions, on Chr10 and Chr11, visualized using IGV (Integrative Genomics Viewer) with mapped PacBio HiFi and RNA-seq reads. (C,D) IGV visualization of the 5′ and 3′ telomere regions of Chr09 with mapped PacBio HiFi and RNA-seq reads. The dashed red boxes denote the centromere (A,B) and telomere (C,D) regions. Chr: chromosome. Ann: annotation.

Fig. 4

Comparison of Fo5176 genome characteristics between the previous version (GCA_014154955.1) and the gap-free assembly in this study. For each chromosome, the top ideogram is the Fokkens assembly (GCA_014154955.1) with gaps (orange lines) and the bottom ideogram is the gap-free assembly from this study with all centromeres and 20 telomeres (brown triangles). Synteny and genomic inversion events are shown as grey and colored lines, respectively. Gene density is shown along each chromosome as a heatmap. GC content is shown as a line graph above or below each chromosome. The chromosome ideogram of previous Chr19, now part of Chr17 of the new assembly, is delineated by black dotted lines. Chr: chromosome.

For annotation of the protein-coding genes, three different sources of evidence were integrated: ab initio prediction, Fusarium homologous proteins and RNA-seq data. The same annotation pipeline was also applied to the Fokkens assembly17, allowing a fair comparison of the two assemblies. A total of 21,460 protein-coding genes were identified in our gap-free assembly (Fig. 1B; Table 2), 26.3% more than were identified in the Fokkens assembly. Repetitive elements (REs) are enriched in F. oxysporum accessory chromosomes and associated with virulence factors responsible for pathogenicity18. In our gap-free assembly, REs accounted for 20.13% of the genome, with a high density in accessory chromosomes such as chromosomes 2, 16, and 18. The most abundant REs were DNA transposons (~7.89% of REs), of which tc1-is630-pogo (2.98%) and hobo-activator (2.26%) were the most abundant families. The proportions of REs in the gapless assembly differed greatly from those in the Fokkens assembly: small RNA sequences increased by 100% (8,155 bp), whereas all Penelope elements and SINEs were lost in the new assembly (Table 2). Moreover, unclassified elements accounted for 4.72% of the genome (Table 2).

Fusarium oxysporum species are natural producers of bioactive secondary metabolites (SMs), some of which play important roles in pathogenicity, antiviral activity, defense response and nutrient acquisition10,11. For example, fusaric acid contributes to the virulence of F. oxysporum in plant and mammalian hosts10,11. In the Fo5176 genome, 58 putative SM gene clusters containing 71 key biosynthetic genes were predicted by antiSMASH v6.1.131, including 15 nonribosomal peptide synthetases (NRPS), 14 polyketide synthases (PKS), 13 NRPS-like genes, 11 terpenes, 5 indoles and 2 betalactones (Figs. 1B, 5; Table S1), with no predicted SM gene clusters on chromosomes 10, 14, 16 and 18. Effectors are important virulence factors of F. oxysporum during plant infection. F. oxysporum is known to secrete small cysteine-rich effector proteins, such as SIX (secreted in xylem) proteins, which can suppress host immunity and modify host cellular activities to promote infection9,32,33,34,35,36. In the Fo5176 gap-free genome, a total of 579 putative effectors were predicted across the 18 chromosomes (Table S2) using a pipeline integrating three tools: EffectorP37, SignalP38 and TMHMM39. The predicted effectors and putative secondary metabolite gene clusters provide a resource for future characterization of new pathogenicity factors of Fo5176 and elucidation of their molecular mechanisms.

Fig. 5

Features and distribution of secondary metabolite (SM) gene clusters in the Fo5176 genome. A total of 58 putative SM gene clusters including 71 key biosynthetic genes on 14 chromosomes (all except Chr10, Chr14, Chr16 and Chr18) are predicted by antiSMASH. Each track represents a unique gene cluster, with its chromosomal location and the types of SM genes it contains. Colors indicate the types of SM genes: green, biosynthetic gene; yellow, biosynthetic-additional gene; purple, other gene; blue, regulatory gene; red, transport gene. Chr: chromosome.

In summary, this is the first gap-free genome assembly of a pathogenic F. oxysporum, one of the most common and devastating fungal pathogens of crop plants and also a human pathogen. This genome represents a major improvement in contiguity and accuracy and will serve as an essential resource for researchers studying this fungus. The genome assembly and annotation are not only important for decoding the mechanisms of plant-pathogen interactions using the model plant Arabidopsis as a host, but also beneficial for elucidating the dynamic evolution of Fusarium species.

Methods

Fungal cultivation, DNA extraction, PacBio library preparation and sequencing

The Fo5176 strain used in this study was routinely grown on potato dextrose agar (PDA) medium at 25 °C. For genomic DNA isolation, the hyphae of Fo5176 were harvested from a 2-day-old potato dextrose broth (PDB) culture set at 150 rpm and immediately frozen in liquid nitrogen. High-molecular-weight genomic DNA was extracted using a cetyltrimethylammonium bromide (CTAB) method40. A PacBio HiFi sequencing library was prepared for sequencing according to PacBio’s standard protocol. The genomic DNA was fragmented using a Covaris g-TUBE device, and the size and concentration were measured with an Agilent 2100 Bioanalyzer. DNA damage repair, end repair, A-tailing and hairpin adapter ligation were performed using the PacBio SMRT Express Template Prep Kit 2.0. The library was then treated with nuclease using SMRTbell Enzyme Cleanup Kits (PacBio). For the PacBio CLR library construction, the high-quality genomic DNA was extracted using the standard CTAB extraction protocol. The DNA concentration was measured using the Qubit dsDNA BR Assay Kit on the Qubit Fluorometer, and the DNA integrity was evaluated using the Agilent 2100 Bioanalyzer system. The final CLR SMRTbell sequencing library was prepared using the SMRTbell Express Prep kit v2.0 Protocol and sequenced on the PacBio Sequel II system in CLR mode with the Sequel II Sequencing Kit 2.0. The resulting library was sequenced at Biomarker Technologies Corporation (Qingdao, China) and the raw CCS data was generated using the CCS algorithm.

Genome assembly

To improve the quality of the genome assembly, raw contigs of Fo5176 were assembled from CCS reads with four long-read de novo assemblers: Hifiasm v.0.16.1 with the ‘--primary’ option20, HiCanu v.1.4 with the ‘-pacbio-hifi oeaErrorRate=0.001’ options21, Flye v.2.9 with the ‘--pacbio-hifi’ option22, and NextDenovo with the ‘minimap2_options_cns = -x ava-hifi’ parameter23. Additionally, 6.27 Gb of PacBio CLR reads (~90 × read depth; average length: 14,715 bp) were generated using the PacBio Sequel II system and used in the genome assembly. The CLR reads were assembled using Hifiasm following the HiCanu v.1.4 correction. All draft assemblies were assessed and then merged one by one using the single-scaffolding tool quickmerge v.0.3 under default parameters. First, the draft assemblies of NextDenovo and Flye were merged; then the Hifiasm CCS draft assembly was merged on top of that, followed by the HiCanu draft assembly, and finally the Hifiasm CLR draft assembly. The final merged assembly was further polished by NextPolish v.1.4.024 with the HiFi reads.
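Contig N50 is the contiguity metric used here to compare assemblies. As a minimal sketch of how it is defined (not the script used in this study), the following Python function computes N50 from a list of contig lengths; the toy lengths are hypothetical and are not data from Fo5176.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    N50 is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the assembly.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths supplied")


# Hypothetical toy assembly (arbitrary units), for illustration only.
toy_contigs = [6, 5, 4, 3, 2]
print(n50(toy_contigs))
```

Applied to the contig lengths of each draft, a function of this kind gives one simple number for ranking drafts before and after merging.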

RNA extraction and sequencing

To annotate the protein-coding genes, total RNA was extracted using the TRIzol approach (Invitrogen, USA)17 from 2-day-old PDB hyphae of Fo5176. After evaluating RNA integrity using the RNA Nano 6000 Assay Kit, the high-quality RNA was used for total mRNA-Seq library construction with the Illumina TruSeq RNA Library Prep kit (Illumina, CA), following the manufacturer’s instructions. RNA-seq was performed on an Illumina NovaSeq 6000 sequencer at Biomarker Technologies Corporation (Qingdao, China). A total of 5.1 Gb of clean, paired-end (2 × 150 bp) RNA-seq data were obtained, which were then quality-checked using fastp v.0.23.241 and mapped to the assembled genome sequence using Hisat2 v.2.1.042 under default parameters. Mapping ratios were calculated using SAMtools v.1.1543.

Gene annotation and repetitive elements analysis

To annotate the assembly, three approaches were combined: ab initio prediction, homology-based protein predictions, and RNA-seq evidence. During the ab initio prediction process, the repetitive sequences dispersed in the genome were de novo predicted and compiled by RepeatModeler v.1.0.1144 (parameters: -database -engine ncbi -pa). For further analyses, RepeatMasker v. 4.1.2.p145 was used to extract and softmask the repetitive elements. The GeneMark-ET model46 was trained to predict gene models, followed by ab initio gene prediction using BRAKER2 for five rounds47 (parameters: --species=Fo --fungus --softmasking --genome --bam --prot_seq --prg=gth --gff3 --rounds=5). Then, the MAKER v.3.01.03 pipeline48, a genome annotation and data management tool, was applied to train the SNAP49 semi-HMM model for two rounds. The gene finder AUGUSTUS50 was run with its built-in Fusarium gene model. For homology-based protein prediction, sets of protein sequences (anchored at chromosome level) of Fo5176, Fol4287 and Fo47 were downloaded from the Fusarium public database and combined. For transcriptomic data, the RNA-seq reads of Fo5176 were aligned to the assembled genome using Hisat2 v.2.1.042. Reference-based and de novo assemblies of transcripts were also generated using Scallop (v.0.10.5)51 and Trinity (v.2.8.4)52 (parameters: --min_kmer_cov 3 --normalize_max_read_cov 100), respectively. To predict the final gene model, the above transcript and protein datasets were integrated and aligned to the genome using the MAKER v.3.01.03 pipeline48. Moreover, the non-coding RNAs in the Fo5176 genome were predicted by Rfam/Infernal v.1.1.453 (parameters: ‘cmscan --cut_ga --rfam --nohmmonly --fmt 2 --clanin --tblout’) and tRNAscan-SE v.2.0.954 (parameters: ‘-E -X 20 -f -m -b -j --detail’).

Fungal effector prediction

For effector prediction, EffectorP v3.037 and SignalP v.6.038 were independently used to identify putative secreted proteins from the Fo5176 proteome, and the overlapping effectors from the two predictions were obtained. The candidate effectorome was then scanned using TMHMM v.2.039 to exclude those containing transmembrane helices, yielding the final set of predicted effectors.
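The filtering logic described above (intersection of the two predictions, followed by removal of proteins with transmembrane helices) can be sketched in a few lines of Python. This is an illustrative sketch only, not the pipeline used in this study: the function name, the protein identifiers, and the assumption that proteins absent from the TMHMM table have zero predicted helices are all hypothetical, and parsing of the actual tool outputs is omitted.

```python
def filter_effectors(effectorp_hits, signalp_hits, tm_helix_counts):
    """Keep proteins called by both predictors that lack predicted TM helices.

    effectorp_hits, signalp_hits: iterables of protein IDs called by each tool.
    tm_helix_counts: dict mapping protein ID -> number of predicted TM helices.
    Assumption: a protein missing from tm_helix_counts has zero helices.
    """
    candidates = set(effectorp_hits) & set(signalp_hits)
    return sorted(p for p in candidates if tm_helix_counts.get(p, 0) == 0)


# Hypothetical protein IDs, for illustration only.
effectorp = {"g1", "g2", "g3"}
signalp = {"g2", "g3", "g4"}
tmhmm = {"g2": 0, "g3": 2}
print(filter_effectors(effectorp, signalp, tmhmm))
```

In this toy input, g2 and g3 are called by both tools, and g3 is then excluded because it carries predicted transmembrane helices.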

Data Records

The raw PacBio HiFi, CLR, and RNA-seq sequencing data have been deposited in the National Center for Biotechnology Information (NCBI) under the BioProject number PRJNA91052955 with the accession numbers SRR2274692156, SRR2274692057 and SRR2274691858, respectively. The final assembled genome is deposited under the same BioProject at NCBI (GCA_030345115.1)59 and also in the Genome Warehouse of the National Genomics Data Center (https://ngdc.cncb.ac.cn/) at the China National Center for Bioinformation under the accession number GWHDOBK0000000060. The genome annotations, including CDS and protein-coding region files, have been submitted to the online open access repository Figshare61.

Technical Validation

Manual correction, validation and evaluation of genome assembly

To obtain a nearly complete and error-free reference genome, we manually corrected misassemblies and removed redundant contigs based on Hi-C read alignments visualized in Juicebox62. To remove potential contamination sequences such as mitochondrial genomes and sequencing adaptors, we used megaBLAST63 to align our assembly to the species’ mitochondrial genome, the common contaminant database (ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/contam_in_euks.fa.gz), the adaptor screening database (ftp://ftp.ncbi.nlm.nih.gov/pub/kitts/adaptors_for_screening_euks.fa), and the nucleotide sequence database (remote mode), and found no contamination. Compared to the previous assembly (NCBI: GCA_014154955.1), our gap-free assembly has reduced the gap length from 4,000 to 0 based on high-coverage HiFi, CLR, and Hi-C reads. Furthermore, all centromeres and most telomeres (20/36; repeat unit TTAGGG) were identified using StainedGlass64 and trf v.4.09.165 (parameters: 2 7 7 80 10 90 2000 -d -m -l 2), and then visualized using IGV66 (Fig. 3). A one-to-one comparison between the previous and new assemblies using liftoff v.1.6.367 and BEDtools v.2.30.068 (parameters: intersect -wa -wb -f 0.9) identified 14,713 corresponding protein-coding genes, which account for 86.6% of the genes in the previous assembly and 68.6% of those in this study’s genome. Compared to the previous version (NCBI: GCA_014154955.1), our assembly has closed six gaps on Chr02, Chr03 and Chr11, and corrected five inversions on Chr03, Chr07, Chr14 and Chr18 (Fig. 4), as visualized via GenomeSyn plot30.
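The idea behind telomere detection, namely checking whether a chromosome end carries tandem copies of the TTAGGG unit, can be illustrated with a simplified Python sketch. It is not a substitute for the trf-based analysis used in this study: the window size and minimum copy number are hypothetical thresholds, only perfect repeats are matched, and either strand orientation of the unit (TTAGGG or its reverse complement CCCTAA) is accepted at either end.

```python
TELOMERE_UNITS = ("TTAGGG", "CCCTAA")  # repeat unit and its reverse complement


def has_telomere(seq, end, window=200, min_copies=3):
    """Report whether a chromosome end carries tandem telomeric repeats.

    seq: chromosome sequence as a string.
    end: "5p" to inspect the start of the sequence, "3p" to inspect its end.
    window, min_copies: hypothetical thresholds chosen for illustration.
    """
    seq = seq.upper()
    if end == "5p":
        region = seq[:window]
    elif end == "3p":
        region = seq[-window:]
    else:
        raise ValueError("end must be '5p' or '3p'")
    # A perfect tandem run of min_copies units is required in the window.
    return any(unit * min_copies in region for unit in TELOMERE_UNITS)


# Hypothetical toy chromosome: telomeric repeats at the 5' end only.
toy_chr = "CCCTAA" * 4 + "ACGT" * 100
print(has_telomere(toy_chr, "5p"), has_telomere(toy_chr, "3p"))
```

Counting the ends for which such a check succeeds across all chromosomes gives a tally comparable in kind to the 20/36 reported above, although a tandem repeat finder tolerates imperfect repeats that this exact-match sketch would miss.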

Assessment of assembly quality and completeness

The accuracy and completeness of the genome assemblies were assessed by BUSCO (Benchmarking Universal Single-Copy Ortholog)69 and CEGMA v.2.5 (Core Eukaryotic Genes Mapping Approach)70 analyses. First, HiFi and CLR reads were mapped to the assembly using Winnowmap2 v.2.0371 with parameters ‘-W repetitive_k15.txt -t 104 -ax map-pb’. Completeness evaluation for our gap-free assembly showed BUSCO and CEGMA values of 98.9% and 99.6%, respectively (Table 2), suggesting a highly accurate and complete assembly.