Background & Summary

The plum fruit moth Grapholita funebrana is an important fruit borer in the family Tortricidae (Lepidoptera)1,2. Its larvae bore into the fruits of many wild and cultivated stone fruits and other plants in the family Rosaceae, such as apricot, cherry, peach, and plum3. The species is native to Europe and is currently found in fruit-growing regions of Europe, northern Africa, and Asia4. In orchards, G. funebrana often co-occurs with other fruit borers, such as the oriental fruit moth Grapholita molesta (Busck), the codling moth Cydia pomonella, and the peach fruit moth Carposina sasakii Matsumura5. While many studies have focused on the biology and management of fruit borers, research on G. funebrana lags behind6,7,8,9,10. In addition, moths of the family Tortricidae are well suited to studying the evolution of chromosome fusion11,12. Species of the order Lepidoptera often have a conserved chromosome number of n = 31, but many species in the Tortricidae have a reduced number of chromosomes because of chromosome fusions13,14. Recent research has found that a common ancestor of the subfamilies Tortricinae and Olethreutinae diverged from the ancestral lepidopteran chromosome pattern through a fusion of the sex chromosome with an autosome15. The karyotype of tortricid moths has traditionally been studied by cytogenetic methods and fluorescence in situ hybridization15. Determining genome sequences will improve understanding of the molecular evolution of chromosomes in tortricid moths16. Currently, chromosome-level genomes have been published for C. pomonella16 and G. molesta17, and many additional assemblies for Tortricidae are publicly available in GenBank (https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=7139).

In this study, we assembled a chromosome-level nuclear genome for G. funebrana, as well as its mitochondrial genome, using Oxford Nanopore Technologies (ONT) long-read sequencing, Illumina short-read sequencing, high-throughput chromatin conformation capture (Hi-C) sequencing, and RNA sequencing (RNA-seq). We obtained a nuclear genome assembly of 570.9 Mb with an N50 of 21 Mb. These high-quality genomes will provide valuable resources for the study of G. funebrana and for in-depth investigation of chromosome evolution at macroevolutionary and microevolutionary levels.

Methods

Material and sequencing

Apricot (Prunus armeniaca) fruits containing G. funebrana larvae were collected from Yanqing, Beijing, China, and reared in the laboratory for about 30 days to obtain specimens of different developmental stages. To reduce the effect of heterozygosity, a single larva was used for long-read, short-read, and Hi-C library construction. One larva, one pupa, and one adult (sex unknown) were each used to construct a separate RNA-seq library. All samples were flash-frozen in liquid nitrogen immediately after collection and stored at −80 °C for subsequent experiments.

Genomic DNA was extracted using a magnetic bead method (Invitrogen, Thermo Fisher Scientific, USA), and RNA was extracted using the RNAprep Pure Plus Kit (Tiangen, China). The quantity of DNA was measured using Qubit 3.0. To generate short-read data for the genome survey, an Illumina library with an insert size of 350 bp was constructed and sequenced on the Illumina NovaSeq 6000 platform. For de novo genome assembly, a 15–20 kb ONT library was prepared and sequenced on the ONT platform to generate long-read data. To generate the Hi-C data, tissue from a larva was fixed with paraformaldehyde and digested with the restriction enzyme DpnII, generating fragments with sticky ends. These sticky ends were repaired using DNA polymerase and ligated together to form chimeric circles using DNA ligase. The ligated DNA was then decrosslinked, purified, and sheared to an insert size of 350 bp. The Hi-C library was sequenced on the Illumina NovaSeq 6000 platform to generate 150-bp paired-end reads. For genome annotation, paired-end RNA-seq libraries were constructed using the VAHTS™ mRNA-seq V2 Library Prep Kit (Vazyme, Nanjing, China) and sequenced on the Illumina NovaSeq 6000 platform with 150-bp paired-end reads. In total, 33.7 Gb of Illumina short reads, 69.7 Gb of ONT long reads, 58.3 Gb of Hi-C reads, and 21.9 Gb of RNA-seq reads were generated. The raw Illumina reads were filtered using Fastp v0.21.018 with default parameters.
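As a rough sanity check on these yields, sequencing depth can be approximated as total bases divided by genome size. The sketch below is illustrative only: it uses the yields reported above and the final assembly size of 570.9 Mb as the denominator, which is an assumption (the k-mer-based estimate could be used instead), and the function name is hypothetical.

```python
# Sequencing yields reported in the text, in gigabases (Gb).
YIELD_GB = {"Illumina": 33.7, "ONT": 69.7, "Hi-C": 58.3}

# Assumption: the final assembly size (Mb) is used as the genome size.
ASSEMBLY_MB = 570.9


def fold_coverage(yield_gb: float, genome_mb: float) -> float:
    """Approximate fold coverage as total sequenced bases / genome size."""
    if genome_mb <= 0:
        raise ValueError("genome size must be positive")
    # Convert Gb to Mb so that both quantities share the same unit.
    return (yield_gb * 1000.0) / genome_mb


if __name__ == "__main__":
    for library, gb in YIELD_GB.items():
        print(f"{library}: ~{fold_coverage(gb, ASSEMBLY_MB):.0f}x")
```

These figures are derived values, not results reported by the study, and ignore reads lost to filtering.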

Genome survey

The genome survey was performed using a k-mer-based method. K-mers were counted from the Illumina short reads using Jellyfish version 2.2.1019 with parameters: ‘count -m 21 -C -s 5G’. Genome size, heterozygosity, and duplication rate were estimated using GenomeScope version 2.020. The results indicated a genome size of about 515 Mb, a heterozygosity rate of 1.91%, and a duplication rate of 1.21%.
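The principle behind a k-mer-based size estimate can be illustrated with a much simpler estimator than the one used here. GenomeScope fits a mixture model to the k-mer spectrum; the sketch below instead divides the total number of k-mer occurrences by the depth of the main peak. It assumes a histogram of depth versus number of distinct k-mers (the two columns written by `jellyfish histo`), and the function name and the `min_depth` cutoff are hypothetical choices for illustration.

```python
from typing import Dict


def naive_genome_size(hist: Dict[int, int], min_depth: int = 5) -> float:
    """Estimate genome size (in k-mers, roughly bp) from a k-mer histogram.

    hist maps k-mer depth -> number of distinct k-mers observed at that depth.
    """
    # Drop low-depth k-mers, which are dominated by sequencing errors.
    kept = {depth: count for depth, count in hist.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    # The peak is the depth at which the most distinct k-mers were observed.
    peak_depth = max(kept, key=kept.get)
    # Total k-mer occurrences divided by per-copy depth approximates genome size.
    total_kmers = sum(depth * count for depth, count in kept.items())
    return total_kmers / peak_depth
```

For a genome as heterozygous as this one, the spectrum typically has separate heterozygous and homozygous peaks, so a naive peak pick can select the wrong one; this is one reason a model-based tool is preferable in practice.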

Genome assembly

The Nanopore long reads were assembled into a primary set of nuclear genome contigs using NextDenovo v2.5.121 with parameters: ‘read_cutoff = 1k, genome_size = 400 m, pa_correction = 20, nextgraph_options = -a 1’. The primary assembly comprised 215 contigs with a total size of 594 Mb and an N50 of 6.6 Mb. Because assemblies based on ONT reads have a high error rate, the primary contigs were polished using NextPolish 1.4.122 with one round based on long reads and one round based on short reads. To achieve a chromosome-level assembly, the polished contigs were anchored into pseudomolecules using the Hi-C data. Specifically, the Hi-C reads were mapped to the contigs using Chromap 0.2.423 with options: “--preset hic --remove-pcr-duplicates --trim-adapters --SAM”. The SAM output was sorted by read name and written in BAM format using Samtools v1.1724 with options: “sort -n -O BAM”. YaHS v1.2a.125 and Juicebox 1.22.0126 were then used for unsupervised and supervised scaffolding, respectively. After scaffolding, most contigs (95.3% of contigs and 99.86% of base pairs) were anchored into 28 pseudo-chromosomes (Fig. 1a), consistent with the karyotype of most species in the subfamily Olethreutinae. To fill the gaps between contigs, we performed two rounds of polishing based on long and short reads using NextPolish. The final assembly has a genome size of 570.9 Mb, with an N50 of 21 Mb. The assembled genome is 56.9 Mb larger than the estimated genome size. The MitoZ v3.6 pipeline27 was used to assemble the mitochondrial genome with Megahit v1.2928 (“--kmers_megahit 39 59 79 99 119 141 --requiring_taxa Lepidoptera”) and to annotate it. The mitochondrial genome of G. funebrana is 15,488 bp in length and contains 13 protein-coding genes, 22 tRNA genes, and 2 rRNA genes (Fig. 1b).
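The N50 values quoted above follow the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of that calculation is given below; it is not the tool used in the study, and the function name is hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # Unreachable for positive lengths; kept for safety.
```

Applied to the sequence lengths of an assembly FASTA file, this gives a value directly comparable with the contig N50 (6.6 Mb) and the final N50 (21 Mb) reported here.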

Fig. 1

Hi-C interaction heat map of the nuclear genome (a) and distribution of genes and read coverage on the mitochondrial genome (b).

Genome annotations

For repeat sequence annotation, a species-specific repeat library was generated using RepeatModeler v2.0.429 with the option “-LTRStruct”. The species-specific repeat library, the RepBase database, and a repeat element library for Arthropoda from the Dfam database were then combined and passed to RepeatMasker v4.1.430 for repeat annotation. RepeatMasker was run with options: “-no_is -norna -xsmall -q”.

For gene structure annotation, we used a pipeline integrating RNA-seq-based, ab initio, and homology-based methods. The RNA-seq reads from the larva, pupa, and adult libraries were mapped to our final assembly with Hisat v2.2.027 and assembled into transcripts with Stringtie v2.1.231. The transcriptome assemblies and the protein sequences of Plutella xylostella (Accession: GCA_932276165.132) were provided as evidence to the MAKER v3.01.04 pipeline26 for integration. SNAP v2013-02-1628 and Augustus v3.2.329 were used for ab initio annotation. Transfer RNA (tRNA) genes were predicted using tRNAscan-SE 2.0.1233 with default parameters, and ribosomal RNA (rRNA) genes were predicted using Barrnap 0.9 (https://github.com/tseemann/barrnap). The above gene models were merged into consensus models using EvidenceModeler v2.1.033. Functional annotation of protein-coding genes was performed using EggNOG-mapper v234.

Chromosome feature

Gene number, repeat sequence density, and guanine-cytosine (GC) content were calculated in non-overlapping 500-kb windows using Bedtools v2.30.035. Chromosomes were named after the lepidopteran ancestral linkage groups14 (Merian elements), based on homology to Sesia bembeciformis36. Homology was detected using LAST37 alignment. A Circos plot of the chromosome features was generated with TBtools v2.02138 (Fig. 2a).
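To make the windowing concrete, the sketch below computes GC content in consecutive non-overlapping windows of a single sequence. It is a plain-Python stand-in for illustration, not the Bedtools workflow used in the study; the function name is hypothetical, and excluding N bases from the denominator is an assumption of this sketch.

```python
from typing import List, Tuple


def gc_by_window(seq: str, window: int = 500_000) -> List[Tuple[int, int, float]]:
    """Return (start, end, gc_fraction) for consecutive non-overlapping windows.

    Coordinates are 0-based and half-open; the last window may be shorter.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    # Soft-masked (lowercase) repeats are counted like any other base.
    seq = seq.upper()
    results = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        # Assumption: ambiguous bases such as N are excluded from the denominator.
        acgt = gc + chunk.count("A") + chunk.count("T")
        fraction = gc / acgt if acgt else 0.0
        results.append((start, start + len(chunk), fraction))
    return results
```

Gene counts and repeat density per window can be tallied in the same way by counting the annotated features, or masked bases, that fall within each interval.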

Fig. 2

Chromosome features of the Grapholita funebrana genome. (a) Circos plot of GC content, gene count, and repeat content. Chromosomes are labeled with Merian elements according to their homology with the lepidopteran ancestral linkage groups14. (b) Synteny blocks between G. funebrana and G. molesta reveal the same number of chromosomes and a highly conserved gene order in the two moths. The chromosomes of the two genomes are numbered according to their length. The grey lines show the synteny blocks between the two genomes.

Data Records

The Illumina, Nanopore, Hi-C, and transcriptome data for G. funebrana genome sequencing have been deposited in the NCBI Sequence Read Archive under accession number SRP48223139. The final assembled nuclear genome of G. funebrana has been deposited in NCBI GenBank under accession number GCA_038095595.140. The mitochondrial genome has been deposited in NCBI GenBank under accession number PP77602341. The genome assembly and annotation files are available in Figshare42.

Technical Validation

The Hi-C heatmap revealed a well-structured interaction pattern. Short-read sequencing data were mapped to the final assembly with BWA v0.7.1743, giving a mapping rate of 97.7%. The completeness of the G. funebrana genome assembly was evaluated using BUSCO44 based on the lepidoptera_odb10 database (n = 5286). The completeness of the initial (contig-level) assembly was 90.9%, and it increased to 97.7% (97.2% single-copy, 0.5% duplicated, 0.6% fragmented, and 1.7% missing genes) after polishing with NextPolish22 (Table 1). We identified 14,547 protein-coding genes, 11,673 of which were functionally annotated. The completeness of the annotated gene set was 95.8% (94.8% single-copy, 1.0% duplicated, 1.1% fragmented, and 3.1% missing genes). A synteny analysis between G. funebrana and G. molesta17 was performed using MCSCAN in the JCVI package45. Strong syntenic blocks were found between the two closely related species (Fig. 2b). Together, these results support the completeness and accuracy of the G. funebrana genome assembly.
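The BUSCO figures above can be checked for internal consistency: single-copy plus duplicated should equal complete, and complete plus fragmented plus missing should sum to 100%. The sketch below parses a one-line BUSCO-style summary and applies both checks. The summary strings in the usage note are reconstructed from the percentages in the text rather than copied from the authors' output files, and the function names and the rounding tolerance are hypothetical.

```python
import re
from typing import Dict

# Pattern for a one-line summary such as
# "C:97.7%[S:97.2%,D:0.5%],F:0.6%,M:1.7%,n:5286".
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary: str) -> Dict[str, float]:
    """Extract the C, S, D, F, M percentages and n from a summary line."""
    match = _BUSCO_RE.search(summary)
    if match is None:
        raise ValueError("not a recognised BUSCO summary line")
    return {key: float(value) for key, value in match.groupdict().items()}


def is_consistent(scores: Dict[str, float], tol: float = 0.15) -> bool:
    """Check S + D == C and C + F + M == 100, allowing for rounding."""
    complete_ok = abs(scores["S"] + scores["D"] - scores["C"]) <= tol
    total_ok = abs(scores["C"] + scores["F"] + scores["M"] - 100.0) <= tol
    return complete_ok and total_ok
```

For example, `is_consistent(parse_busco("C:97.7%[S:97.2%,D:0.5%],F:0.6%,M:1.7%,n:5286"))` checks the assembly-level values, and the same call with `"C:95.8%[S:94.8%,D:1.0%],F:1.1%,M:3.1%,n:5286"` checks the annotated gene set.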

Table 1 Statistics of the G. funebrana genome assembly.