Background & Summary

The cereal cyst nematodes (CCNs) are a group of 12 closely related species and considered to be one of the most damaging plant parasitic nematodes (PPNs) that limit production of cereal crops in many parts of the world including Australia, China, India, USA, Europe, North Africa and West Asia1,2. The species Heterodera filipjevi, H. avenae, and H. litipones are among the most economically important species and caused significant economic losses3. The yield losses caused by CCNs have been recorded from 10–35% on wheat in China, 24% on spring wheat in Oregon, and 20% on barley in Australia4. Among the CCNs, H. filipjevi is an important constraint to cereal crop production in different climatic regions1, and now it was widely recorded in many countries such as Tadzhikistan, Russia, Morocco, Tunisia, Pakistan, Libya, Turkey5,6, Estonia, Sweden, India, Norway, Iran, China7, United Kingdom, and USA8. Smiley et al.4 reported a 35% yield loss in spring wheat in Oregon, USA, due to H. filipjevi4, and Karimipour et al.9 estimated yield losses in wheat yield ranging between 20% and 25% in Iran by the same nematode species9. Also, Hajihasani et al.10 reported that grain yield loss caused by H. filipjevi occurred even at the lowest population density and reached a maximum loss of 48% with an initial population density (Pi) of 20 eggs and J2/ (g soil)–1 in Iran10.

Genomic data have proven to be powerful tools to explain the successful parasitization of plant nematodes. The first plant-parasitic nematodes genomes Meloidogyne incognita and M. hapla have been published in 2008. Recently, several PPNs genomes from Globodera pallida11, Globodera rostochiensis12, Heterodera glycines13,14, Bursaphelenchus xylophilus15, Bursaphelenchus mucronatus16, Ditylenchus destructor17, M. floridensis18 and M. graminicola19 have been published. However, the available reference genome of CCNs was absent, only the transcriptome of H. avenae was sequenced using short-read sequencing technology20,21,22,23. In the present study, a total of 95.79 Gb (711.56 x) raw data was obtained by Illumina, PacBio, 10x Genomics and Hi-C technologies, the detailed sequencing data were summarized in Table 1. The 17-mers were counted as 17,119,184,513 from 21.8 G Illumina short reads, and the k-mer depth was 124 (Table 1). Then, we used PacBio long-read, Illumina short-read, 10 × Genomics and Hi-C data to generate a high-quality chromosome-level reference genome for H.filipjevi. The genome assembly spanned 134.19 Mb with a scaffold N50 length of 11.88 Mb (Table 2). After chromosome-level anchoring, 9 chromosomes with a total length of 120 Mb (89.4% of the draft assembly) were constructed corresponding to the karyotype. In addition, the mapping rate of Illumina short reads was 93.14% and the genome coverage was 99.71% (Table 3). Moreover, 2,226 homozygous single-nucleotide polymorphisms (SNP) and a low homozygous rate (0.0032%) were identified, suggesting a low error rate in the assembly (Table 4). In conclusion, our evaluation indicated a high quality of the assembled H. filipjevi genome. Finally, we annotated 13,352 protein-coding genes in H. filipjevi genome with a mean of 8.14 exons per gene (Table 5 and Table 6) and found 61.9 Mb (46.14%) repeat elements. The reference genome obtained in this study will provide a foundation for future investigations on the pathogenesis of CCNs.

Table 1 Sequencing data used for the genome Heterodera flipjevi assembly.
Table 2 The statistics of length and number for the de novo assembled Heterodera flipjevi genome.
Table 3 Statistics of reads coverage of the Heterodera flipjevi genome.
Table 4 SNP statistics of the Heterodera flipjevi genome.
Table 5 Statistical results of the repetitive sequences of the Heterodera flipjevi genome.
Table 6 Gene annotation of Heterodera flipjevi genome via three methods.

Methods

Nematode sample and DNA extraction

The cysts of H. filipjevi were collected from wheat fields in Xuchang city Henan province. Ten cysts were chosen and inoculated on susceptible wheat cultivars Wenmai 19 in greenhouse for 6 generations. The fresh, healthy and unbroken cysts were manually picked and used for extraction of eggs by crushed in sterile water. the eggs were subsequently collected using the sucrose flotation technique24 and cleaned with sterile distilled water for three times. Six developmental stages of H. flipjevi including pre-parasitic second stage juveniles (Pre-J2), parasitic-J2 (Para-J2), third stage juveniles (J3), fourth stage juveniles 4 (J4), adult females (Fes) and eggs were collected according to the previous report20.

DNA isolation and sequencing

Genomic DNA was isolated from H. filipjevi egg mass according to the CTAB method. The DNA quality and concentration were assessed using agarose gel electrophoresis and a Qubit 2.0 Fluorometer (Life Technologies, CA, USA). A 20 kb insert sizes library was produced following the manufacturer’s protocol (PacBio, CA) and sequenced with the PacBio RS technology. For short-read sequencing, libraries with 350 bp insert sizes was prepared and sequenced on Illumina HiSeq 2500 as 2 × 150 bp reads (Table 1). The GEM (Gel Beads in Emulsion) reaction was conducted for 10 × Genomics sequencing using approximately 1 ng input DNA of 50 kb length, and 16 barcodes were introduced into droplets, subsequence, the droplets were fractured and 600 bp fragments were used for constructing libraries, which were sequenced on the Illumina HiSeq X platform at the Novogene Bioinformatics Institute, Beijing.

Genome size estimation, assembly and evaluation

For survey analysis, the H. filipjevi genome size was estimated using the 21.8 Gb paired-end Illumina sequencing data, based on the K-mer formula: Genome size = (total number of 17-mer) / (position of the homozygous peak). With the 14.74 Gb long reads generated from PacBio Sequel platform, the contig assembly of H. filipjevi genome was conducted using the FALCON assembler (version 1.2.4)25. Then, the assembly from PacBio data was polished by Quiver (smrtlink 5.0.1)26. The heterozygosity of assembly was removed by using the Purge Haplotigs software (version 1.1.1)27. The resulting contigs were connected to super-scaffolds by 42.58 Gb 10 × Genomics linked-read data using the fragScaff software (Version 140324)28 (Table 1). Finally, the short reads from Illumina were used to correct any remaining errors by Pilon (Version 1.22)29. These processes would yield a final draft H. filipjevi genome.

To acquire a high-quality H. filipjevi genome, the draft assembly was further improved using Hi-C analysis with 16.67 Gb Hi-C data. Firstly, the Hi-C reads were mapped to the draft assembly by using BWA30. Then, the low-quality reads and duplications were removed to build raw inter/ intra-chromosomal contact maps. Last, based on the agglomerative hierarchical clustering algorithm31, Lachesis (Version 201701) was applied for clustering, ordering and orienting, and the scaffolds from genomics were clustered into 9 pseudochromosomes32. Finally, Juicebox software (Version 2.20.00) was used to manually correct scaffolded chromosomes and plot heatmap of genomic interactions33. Above all, we obtained a 134,189,547 bp H. filipjevi genome including 9 pseudo-chromosomes, covering ~89.4% of the whole genome (Fig. 1) and 652 supper-scaffolds, the contig N50 and scaffold N50 are 0.45 Mb and 11.88 Mb, respectively (Table 2). Circos (version 0.64) was used to visualize the H. filipjevi genome data34.

Fig. 1
figure 1

Characteristics of the H. filipjevi genome. (a) Hi-C intra-chromosomal contact map of the H. filipjevi genome assembly. (b) Circos plot of the H. filipjevi genome assembly. (1) TRF distribution density; (2) DNA type repeat density; (3) LINE type; (4) SINE type repeat density; (5) LTR repeat density; (6) gene density; (7) GC content.

The completeness of genome assembly was assessed by the following methods. First, the Core Eukaryotic Genes Mapping Approach (CEGMA)35 was conducted based on a core gene set involved in 248 evolutionarily conserved genes from six eukaryotic model organisms. The CEGMA evaluation results showed that 248 CEGs assembled 230 genes, with a proportion of 92.74%, indicating that the assembly results were relatively complete. Second, the BUSCO36 (version 5.4.3) at genome model was used to evaluate the completeness of genomes in this study using nematoda_odb10 as a database. And we obtained a 55.8% assembly completeness, similar to other reported cyst nematode genomes (43.4–59.3%)19. Finally, small fragment library reads were aligned to the assembled genome using BWA software the alignment rate of the total small fragment reads to the genome is about 93.14% and the coverage is about 99.71%, indicating a good consistency between the reads and the assembled genomes (Table 3).

Genomic repeat annotation

Two technologies were applied to the annotation of repetitive sequences within H. filipjevi genome, including homologous comparison and ab initio prediction. For homologous comparison, RepeatMasker (Version 3.3.0) and the associated RepeatProteinMask were performed by aligning against Repbase database37. For ab initio prediction, LTR_FINDER (version 1.0.7)38, RepeatScout (Version 1.0.5)39 and RepeatModeler (Version 1.0.4) were firstly used for de novo candidate database construction of repetitive elements. Followed by, the repetitive sequences were annotated using RepeatMasker, while the tandem repeats were ab initio predicted using TRF (Version 4.07b)40. By combining Repbase and de novo datasets, we obtained a total of 61.91 Mb of consensus and nonredundant repetitive sequences, which occupied 46.14% of the genome (Table 5).

Gene prediction and functional annotation

Three approaches were employed for predicting the protein-coding genes within H. filipjevi genome, including homology-based prediction, ab initio annotation, and transcriptome-based prediction. For homology-based prediction, firstly, protein repertoires of H. glycines (GCA_004148225.2)41, G. pallida (GCA_000724045.1)42, G. rostochiensis (GCA_900079975.1)43, M. incognita (GCA_900182535.1)44, M. hapla (GCA_000172435.1)45, Caenorhabditis elegans (GCA_000002985.3)46, Haemonchus contortus (GCA_000442195.1)47, Pristionchus pacificus (GCA_918442795.1)48, Brugia malayi (GCA_000002995.5)49, Drosophila melanogaster (GCA_029775095.1)50 and Homo sapiens (GCA_024586135.1)51 were aligned against the H. filipjevi genome using TBLASTN (Version 2.2.29)52. Secondly, the BLAST hits were conjoined by Solar software (version 0.9.6)53. Thirdly, GeneWise (version 2.2.0)54 was used to predict the exact gene structure of the corresponding genomic region on each BLAST hit. Notably, homology predictions were denoted as “Homology-set”. In addition, about 33.2 Gb clean data of RNA-sequencing (RNA-seq) data derived from six developmental stages of H. filipjevi were assembled with Trinity (version 2.0)55, followed by, the assembled sequences were aligned against H. filipjevi genome using Program to Assemble Spliced Alignment (PASA) (version 2.0.2)56. The resulting effective alignments were clustered based on genome mapping location and assembled into gene structures. Notably, gene models created by PASA were denoted as PASA-T-set (PASA Trinity set). For ab initio annotation, five tools were simultaneously employed, including Augustus (version 3.0.2)57, GeneID (version 1.4)58, GeneScan (version 1.0)59, GlimmerHMM (version 3.0.2)60 and SNAP (version 11-29-2013)61. Among them, Augustus, SNAP and GlimmerHMM were trained by PASA-T-set gene models. For transcriptome-based prediction, RNA-seq reads were directly mapped to the genome using Tophat (version 2.0.9)62. Then the mapped reads were assembled into gene models (Cufflinks-set) by Cufflinks (version 2.1.1)63. According to these three approaches, all the gene models were ultimately integrated by Evidence Modeler64. The weight of each evidence was set as follows: PASA-T-set > Homology-set > Cufflinks-set > Augustus > GeneID = SNAP = GlimmerHMM = GeneScan. Meanwhile, in order to get the untranslated regions (UTR) and alternative splicing variation information, PASA2 was used to update the gene models. Ultimately, a total of 13,352 protein-coding genes were predicted in the H. filipjevi genome. The average transcript length was 3,258.25 bp with an average coding sequence (CDS) length of 1,235.80 bp. The average exon number per gene was 8.14 with an average exon length of 151.89 bp and average intron length of 283.41 bp (Table 6). The statistics of gene models, including lengths of a gene, CDS, intron, and exon in H. filipjevi were comparable to those for close-related species (Fig. 2).

Fig. 2
figure 2

The composition of gene elements in the H. filipjevi genome to other species. (a) CDS length distribution and comparison with other species. (b) Exon length distribution and comparison with other species. (c) Exon number distribution and comparison with other species. (d) Gene length distribution and comparison with other species. (e) Intron length distribution and comparison with other species.

In addition, the gene structures of transfer RNAs (tRNA), ribosomal RNAs (rRNA) and other non-coding RNAs in H. filipjevi genome were predicted. Specifically, the tRNA were predicted using tRNAscan-SE software (version 1.3.1)65. The rRNA fragments were predicted by searching against invertebrate rRNA database using BLAST with an E-value of 1E−10. The microRNAs (miRNA) and small nuclear RNAs (snRNA) genes were predicted by INFERNAL (version 1.1.1)66 using Rfam database67.

The predicted protein-coding genes in H. filipjevi genome were functionally annotated based on homologous searches against databases of SwissProt68, NR database (NCBI)69, Gene Ontology70, InterPro71 and KEGG pathway72. Notably, InterproScan tool73 in coordination with InterPro database74 were applied to predict protein function based on the conserved protein domains and functional sites. A total of 10,036 genes (75.20%) were successfully annotated by at least one public database (Fig. 3).

Fig. 3
figure 3

Venn diagram of number of genes with homology or functional classification by each method.

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive75 in National Genomics Data Center (NGDC)76, China National Center for Bioinformation/Beijing Institute of Genomics, Chinese Academy of Sciences (GSA: CRA014195)77 that are publicly accessible at https://ngdc.cncb.ac.cn/gsa. The genome assembly has been deposited in DDBJ/ENA/GenBank under the accession number JBDPZO00000000078, and NGDC under the GSA accession CRA01500279. Data of the gene functional annotations had been deposited at Figshare80.

Technical Validation

Nucleic acid quality

The DNA quality and concentration were assessed using agarose gel electrophoresis and a Qubit 2.0 Fluorometer (Life Technologies, CA, USA).

Evaluation of genome assembly

Various different strategies were used to evaluate the completeness and accuracy of the H. filipjevi genome. First, our assembly genome was verified to have high completeness by CEGMA35 (92.74%), indicating that the assembly results are relatively complete. Second, the BUSCO36 (v5.4.3) at genome model was used to evaluate the completeness of genomes in this study and other published genome, using nematoda_odb10 as a database. We obtained a 55.8% assembly completeness, similar to other reported plant nematode genomes (43.4–59.3%)11,12,13,14,19. The low completeness of the BUSCO estimates can be attributed to the substantial genetic divergence between the nematoda_odb10 database and cyst nematodes, with large differences in protein sequences. Moreover, to evaluate the accuracy of the assembly, we used BWA software to align small fragment library reads to the assembled genome to calculate the alignment rate, coverage degree and depth of reads. The results show that the alignment rate of the total small fragment reads to the genome is about 93.14% and the coverage is about 99.71%, indicating a good consistency between the reads and the assembled genomes (Table 3).