Background & Summary

Achelura yunnanensis is a notorious pest that feeds on flowering cherry trees1,2, which are economically valuable ornamental plants3. During periods of high infestation, a single cherry tree can harbor up to hundreds of larvae, significantly impacting tree growth and resulting in substantial economic losses4. Currently, chemical pesticides remain the primary method for controlling A. yunnanensis outbreaks; however, these chemicals often lead to environmental pollution and pose food safety risks5,6. Moreover, previous studies have found that the expansion of the uridine diphosphate glycosyltransferase gene family in A. yunnanensis may be linked to its increased resistance to both plant metabolites and pesticides, exacerbating the challenges of chemical pest control7,8,9. Therefore, there is an urgent need to explore alternative biocontrol methods to achieve effective and environmentally safe management of this species. However, the lack of genomic resources has hindered the development of biocontrol strategies, including those based on specific molecular targets.

Taxonomically, A. yunnanensis belongs to the Zygaenidae family, a diverse group of moths distributed throughout the world1,10. Unlike most moths, which are nocturnal, most Zygaenidae species are diurnal and have eye-catching aposematic colors and patterns on their wings to warn off daytime predators11,12. Although diurnality is widespread in Lepidoptera (moths and butterflies) and has independently evolved many times, the molecular mechanisms underlying this behavior remain poorly understood13. A recent transcriptome-based study by Akiyama et al.14 suggested that the parallel evolution of opsins may contribute to the diurnal adaptation of certain day-flying species within the hawkmoth family (Lepidoptera, Sphingidae). To fully understand the genetic mechanism behind shifts in day-night activity in Lepidoptera, genome data are needed across various lepidopteran taxa, including Zygaenidae. However, genomic resources for Zygaenidae are extremely limited. Before this study, of the four subfamilies of Zygaenidae, only Zygaeninae had a sequenced genome, from a single species, whereas the subfamilies Chalcosiinae, Callizygaeninae, and Procridinae all lacked genomic data15. This scarcity of genomic resources has impeded further exploration of the genetic basis underlying diurnality in this moth family.

In this study, we present a chromosome-level genome of A. yunnanensis, a representative species of the Chalcosiinae subfamily of Zygaenidae. The final genome assembly was 368.15 Mb, with contig N50 and scaffold N50 values of 12.20 Mb and 12.61 Mb, respectively, indicating a high level of completeness and contiguity (Table 1). Comparative genomic analysis revealed a significant expansion of gene families associated with lipid catabolism and xenobiotic biodegradation and metabolism in A. yunnanensis, which may contribute to the species’ remarkable adaptability, including its broad host range and its ability to degrade toxic compounds from both plants and the environment. Overall, this genome assembly serves as a valuable resource for future endeavors in the integrated pest management of A. yunnanensis and has the potential to uncover the genetic mechanisms governing day-night activity patterns in Lepidoptera through comparative genomics studies.

Table 1 Summary statistics of the Achelura yunnanensis genome.

Methods

Sample collection, library construction and sequencing

Two fifth-instar larval samples of A. yunnanensis were collected in September 2023 from cherry trees located at Yunnan University, Kunming, Yunnan Province, China. The gut was removed from each larva to reduce gut microbe contamination, and each sample was then washed twice with phosphate-buffered saline (PBS).

DNA and RNA were extracted from one larval sample using the TIANGEN Blood & Tissue Kit (Tiangen, Beijing, China) and the TRIzol Reagent Kit (Invitrogen, USA), respectively. The quality and concentration of the nucleic acids were assessed using a Qubit 3.0 Fluorometer (Life Technologies, CA, USA) and 1.0% TBE agarose gel electrophoresis. For short-read genomic sequencing, DNA sequencing libraries were constructed according to the TruSeq DNA Sample Preparation Guide (Illumina, USA) and sequenced on the Illumina NovaSeq 6000 platform. For PacBio HiFi sequencing, circular consensus sequencing (CCS) libraries were constructed using the Pacific Biosciences SMRTbell Express Template Prep Kit 2.0 and sequenced on the PacBio Sequel II System in HiFi mode. For Hi-C sequencing, the Hi-C libraries were prepared from the other larval sample according to the standard procedure with minor modifications16 and sequenced on the Illumina NovaSeq 6000 platform. For transcriptome sequencing, RNA-seq libraries were constructed using the Illumina TruSeq Stranded mRNA Library Prep Kit (Illumina, USA) and sequenced on the Illumina NovaSeq 6000 platform. After filtering low-quality reads and trimming adaptor sequences from the raw data using fastp (v0.23.2)17, we obtained a total of 57.00 Gb of Illumina short reads (~155-fold coverage), 26.12 Gb of PacBio HiFi long reads (~71-fold coverage), 55.94 Gb of Hi-C reads (~152-fold coverage) and 6.5 Gb of RNA-seq data (Supplementary Table 1).
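The fold-coverage values quoted above can be reproduced with simple arithmetic. The following is a minimal sketch that assumes coverage was computed as data yield divided by the final assembly size (368.15 Mb); the text does not state the denominator explicitly, and the function and variable names are illustrative only.

```python
# Fold coverage = total sequenced bases / genome size.
# Assumption: the denominator is the final assembly size reported in the paper.
ASSEMBLY_SIZE_MB = 368.15

# Clean data yields in Gb, as reported in the text.
yields_gb = {
    "Illumina": 57.00,
    "PacBio HiFi": 26.12,
    "Hi-C": 55.94,
}


def fold_coverage(yield_gb: float, genome_mb: float) -> float:
    """Return the approximate fold coverage for one data set."""
    return yield_gb * 1000.0 / genome_mb


coverage = {
    name: round(fold_coverage(gb, ASSEMBLY_SIZE_MB))
    for name, gb in yields_gb.items()
}

for name, depth in coverage.items():
    print(f"{name}: ~{depth}-fold")
```

Under this assumption the rounded values match the reported ~155-, ~71- and ~152-fold coverage.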

Genome survey and de novo assembly

The 57.00 Gb of Illumina short reads were used for a genome survey to estimate genome characteristics such as genome size, repetitive sequence content, and heterozygosity. K-mer frequencies were counted using jellyfish (v2.3.0)18 with the k-mer length set to 17, and the resulting frequency distribution was analyzed using GenomeScope (v1.0)19. The estimated genome size was approximately 320.47 Mb, with a heterozygosity rate of 1.43% and a repetitive sequence content of 35.64% (Fig. 1a; Table 2).
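The principle behind a k-mer-based genome survey can be illustrated with a simplified estimator: genome size is roughly the total number of counted k-mers divided by the depth of the main peak. This is only a sketch on a made-up histogram; GenomeScope fits a mixture model that also accounts for heterozygosity and repeats, which this example does not attempt, and the `min_depth` cutoff is a hypothetical choice.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers at that depth.
    min_depth: hypothetical cutoff below which k-mers are treated as errors.
    Returns (estimated_size, peak_depth).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers above the error cutoff")
    # Simplification: take the tallest bin as the homozygous peak.
    peak_depth = max(solid, key=solid.get)
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram (not real data): error k-mers at depth 1-3, main peak at 20.
toy_histogram = {
    1: 5000, 2: 800, 3: 100,
    18: 200, 19: 600, 20: 1000, 21: 600, 22: 200,
    40: 50,
}

size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mers")
```

For a genome with high heterozygosity, the histogram typically shows a second peak at about half the main depth, so picking the tallest bin is a simplification that a model-based tool handles more carefully.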

Fig. 1

(a) Result of the 17-mer frequency distribution analysis for the Achelura yunnanensis genome. (b) Hi-C interaction heatmap for the A. yunnanensis genome. The color gradient bar on the right side of the map represents interaction strength, ranging from yellow (low) to red (high). Intra-chromosome interactions (red blocks on the diagonal) are stronger than inter-chromosome interactions.

Table 2 Results of the survey analysis.

The PacBio HiFi long-read data (quality value ≥ 20) were de novo assembled into a draft genome (comprising dozens of contigs) using Hifiasm (v0.19.6)20 with the default parameters. To generate a chromosome-scale genome assembly of A. yunnanensis, Hi-C reads were mapped to the draft genome with the BWA mem algorithm21. Based on the quality-controlled Hi-C read alignments, a contact matrix was generated using Juicer (v1.6.2)22 with default parameters. 3D-DNA (v190716)23 was then employed to correct misjoins and to order and orient the contigs, resulting in most of the contigs being anchored to the pseudochromosomes. JuiceBox (v2.17.0)24 was used to visualize the Hi-C interactions between contigs and to manually correct any remaining misjoins, translocations, and inversions. For contigs that could not be anchored to the chromosomes, BLASTN (v2.15.0)25 was used to search them against the Nucleotide Sequence Database (NT). Contigs that hit non-metazoan targets were regarded as contamination and discarded. Next, genome completeness was assessed by BUSCO (v5.4.3)26 based on the Lepidoptera_odb10 database (n = 5,286 single-copy orthologues). To calculate the mapping rate and identify sex chromosomes27, we mapped Illumina short reads to the genome assembly using BWA (v0.7.17)21. The mapping rate and sequencing depth for each chromosome were then calculated using QualiMap (v2.3)28. Chromosomes with approximately half the sequencing depth of the other chromosomes were identified as sex chromosomes. The initial assembly based on PacBio HiFi long reads yielded a draft genome of 375.40 Mb, comprising 96 contigs with an N50 size of 12.20 Mb. These contigs were anchored to 32 chromosomes by the Hi-C data (Fig. 1b). Chr1 and Chr32 were identified as the sex chromosomes. After removal of the contaminating contigs and mitochondrial sequences, the resulting chromosome-level genome was 368.15 Mb in length with a scaffold N50 of 12.61 Mb and a GC content of 35.15% (Table 1; Fig. 2).
Quality evaluation of the genome assembly showed that a total of 99.02% of the Illumina short-reads were properly mapped to it. Furthermore, a BUSCO assessment indicated that 98.0% of the target orthologous genes could be identified in complete form from the genome assembly (Supplementary Table 2). Together, these evaluations suggest a remarkably high level of completeness, contiguity, and accuracy of the genome assembly of A. yunnanensis.
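The half-depth criterion used above to flag sex chromosomes can be sketched as follows. The depth values, the chromosome subset, and the 0.4 to 0.6 ratio window are all hypothetical choices made for illustration; the text only states that chromosomes with half the sequencing depth were identified as sex chromosomes.

```python
from statistics import median


def flag_half_depth(depth_by_chrom, low=0.4, high=0.6):
    """Return chromosomes whose mean depth is about half the median depth.

    depth_by_chrom: dict mapping chromosome name -> mean read depth.
    low, high: hypothetical bounds on depth / median depth.
    """
    med = median(depth_by_chrom.values())
    return sorted(
        chrom for chrom, depth in depth_by_chrom.items()
        if low <= depth / med <= high
    )


# Toy per-chromosome depths (not real data).
toy_depths = {
    "Chr1": 77.0,
    "Chr2": 155.0,
    "Chr3": 152.0,
    "Chr4": 158.0,
    "Chr32": 80.0,
}

candidates = flag_half_depth(toy_depths)
print(candidates)
```

In a female (ZW) sample, both the Z and the W chromosome are present in a single copy, which is why both would be expected at roughly half the autosomal depth.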

Fig. 2

Circos plot of the Achelura yunnanensis genomic features. The tracks from inside to outside: (A) DNA TE abundance; (B) LINE abundance; (C) GC content; (D) Gene density; (E) Pseudo-chromosomes. Window size = 100 kb.

Repetitive element and noncoding RNA annotation

To annotate repeat elements in the A. yunnanensis genome, we first constructed a de novo repeat library from the genome using the integrated results of three programs embedded in RepeatModeler (v2.0.3)32: RECON (v1.0.8)29, RepeatScout (v1.0.6)30 and TRF (v4.09)31. This library was then merged with known repeat element databases, namely the Insecta set of Repbase-20181026 (ref. 33) and Dfam 3.7 (ref. 34), to form a custom library. Based on this custom library, RepeatMasker (v4.1.5)35 was used to identify and soft-mask repetitive regions in the genome assembly with the -xsmall parameter. In total, 136.55 Mb of repeat sequences were identified, accounting for 37.10% of the genome assembly. Among these repeat elements, long interspersed nuclear elements (LINEs) were the most abundant class, constituting 52.48 Mb (14.26% of the whole genome). Additionally, DNA transposons, short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs) accounted for 6.43%, 2.47%, and 1.56% of the whole genome, respectively (Table 3).
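Because soft-masking marks repeats as lowercase bases, the masked fraction of an assembly can be checked directly from the sequence. The following is a minimal sketch on toy sequences, not the authors' procedure; the function name is illustrative.

```python
def masked_fraction(sequences):
    """Return the fraction of bases that are lowercase (soft-masked).

    sequences: iterable of sequence strings, e.g. one per chromosome.
    """
    masked = 0
    total = 0
    for seq in sequences:
        masked += sum(1 for base in seq if base.islower())
        total += len(seq)
    return masked / total if total else 0.0


# Toy sequences (not real data): 8 of 20 bases are soft-masked.
toy_sequences = ["ACGTacgtACGT", "ttttGGGG"]
fraction = masked_fraction(toy_sequences)
print(f"soft-masked: {fraction:.2%}")
```

Applied to a real soft-masked FASTA file, the result should be close to the repeat percentage reported by the masking tool, with small differences possible depending on how ambiguous bases are counted.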

Table 3 Summary statistics of repeat annotation in the Achelura yunnanensis genome.

Transfer RNAs (tRNAs) were identified using tRNAscan-SE (v2.0.12)36 with eukaryotic parameters. Ribosomal RNAs (rRNAs) and their copies were identified using Barrnap (https://github.com/tseemann/barrnap). Other genomic noncoding RNAs (ncRNAs), such as small nuclear RNAs (snRNAs) and microRNAs (miRNAs), were identified by comparison with the Rfam37 database (release 14.10) using Infernal (v1.1.5)38. In total, 1,828 ncRNAs were identified in the A. yunnanensis genome, including 483 rRNAs, 66 miRNAs, 72 snRNAs, 1,099 tRNAs, and 108 other ncRNAs (Table 4).

Table 4 Summary statistics of noncoding RNA annotation in the Achelura yunnanensis genome.

Protein-coding gene prediction and function annotation

Protein-coding gene structures were predicted by combining evidence from transcriptome-based, ab initio, and homology-based predictions. For transcriptome-based prediction, RNA-seq data from the larval body and the adult sex pheromone glands39 were aligned to the genome with HISAT2 (v2.2.1)40, and the alignments were converted to BAM format with Samtools (v1.19)41. The RNA-seq alignments were used to perform genome-guided assembly with StringTie (v2.2.1)42, and likely open reading frames within the transcripts were identified with TransDecoder (v5.5.0)43. For the ab initio predictions, BRAKER (v3.0.7)44 was adopted, which automatically trained the predictors Augustus (v3.4.0)45 and GeneMark-ETP (v4.72)46 using the Arthropoda reference protein set from OrthoDB (v10)47 and the RNA-seq alignments mentioned above. SNAP (v2006-07-28)48 was also used for ab initio gene prediction, with B.mori.hmm selected as the training set. For the homology-based prediction, we downloaded the reference gene sets of six related species from the Ensembl and NCBI databases, namely Bombyx mori49, Colias croceus50, Helicoverpa armigera51, Spodoptera frugiperda52, Vanessa cardui53 and Zygaena filipendulae54, to generate a homology-based gene set (Supplementary Table 3). GeMoMa (v1.9)55, GenomeThreader (v1.7.3)56, and miniprot (v0.12)57 were used to align the homology-based gene set to the genome and predict gene structures. Finally, EVidenceModeler (v2.1.0)58 was used to integrate the predictions from the three approaches and generate a consensus gene set. In total, 15,523 protein-coding genes were predicted from the A. yunnanensis genome, with an average gene length of 7,701.7 bp. These genes have an average of 6.1 exons per gene, with an average exon length of 235.5 bp, and an average of 5.1 introns per gene, with an average intron length of 1,230.2 bp (Table 5).
The completeness of the predicted protein-coding gene set was 97.3% (96.3% single-copy genes and 1.0% duplicated genes), as assessed by a BUSCO (v5.4.3)26 search based on the Lepidoptera_odb10 database (n = 5,286 single-copy orthologues) (Table 5).
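Summary statistics of the kind reported above (average gene length, exons per gene, exon and intron lengths) can be derived from exon coordinates alone. The sketch below uses two made-up gene models with 1-based inclusive coordinates and one transcript per gene; the data structure and function name are illustrative assumptions, not the authors' pipeline.

```python
def gene_structure_stats(genes):
    """Compute summary statistics from exon coordinates.

    genes: dict mapping gene ID -> list of (start, end) exon intervals,
           1-based and inclusive, one transcript per gene.
    """
    exon_lengths, intron_lengths, gene_lengths = [], [], []
    for exons in genes.values():
        exons = sorted(exons)
        exon_lengths += [end - start + 1 for start, end in exons]
        # An intron is the gap between consecutive exons.
        intron_lengths += [
            nxt[0] - cur[1] - 1 for cur, nxt in zip(exons, exons[1:])
        ]
        gene_lengths.append(exons[-1][1] - exons[0][0] + 1)
    n_genes = len(genes)
    return {
        "mean_gene_length": sum(gene_lengths) / n_genes,
        "mean_exons_per_gene": len(exon_lengths) / n_genes,
        "mean_exon_length": sum(exon_lengths) / len(exon_lengths),
        "mean_intron_length": (
            sum(intron_lengths) / len(intron_lengths) if intron_lengths else 0.0
        ),
    }


# Toy gene models (not real data).
toy_genes = {
    "g1": [(1, 100), (201, 300), (501, 600)],
    "g2": [(1000, 1199)],
}

stats = gene_structure_stats(toy_genes)
for key, value in stats.items():
    print(f"{key}: {value:.1f}")
```

Note that the mean intron length here is averaged over all introns rather than per gene, which is one of several possible conventions.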

Table 5 Summary statistics of gene prediction in the Achelura yunnanensis genome.

To add functional annotation to the predicted protein-coding genes, we searched the predicted genes against the UniProtKB database (Swiss-Prot and TrEMBL) and the nonredundant protein sequence database (NR) using the high-sensitivity mode of Diamond (v2.1.8)59. We further employed eggNOG-mapper (v2.1.12)60 to search the eggNOG (v5.0)61 database. In addition, InterProScan (v5.59-91.0)62 was used to assign Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), and Reactome pathway annotations to the predicted genes, and to identify their protein domains. A total of 15,116 (97.38%) protein-coding genes received a functional annotation from at least one of the above steps (Table 6). The features of the final genome assembly were visualized using Circos (v0.69-8)63 (Fig. 2).

Table 6 Summary statistics of functional annotation in the Achelura yunnanensis genome.

Data Records

The raw sequencing data of A. yunnanensis reported in this paper have been submitted to the NCBI under BioProject ID PRJNA1115809. Illumina, PacBio, Hi-C, and transcriptome raw data have been deposited in the NCBI Sequence Read Archive with accession numbers SRR29152278-SRR29152281 (refs. 64,65,66,67). The final assembled genome has been submitted to the Genome database of NCBI with accession number GCA_041274885.1 (ref. 68). The annotation file is available in figshare 25962835 (ref. 69).

Technical Validation

Evaluation of the genome assembly

Three independent methods were used to assess the completeness, contiguity, and accuracy of the A. yunnanensis genome assembly. First, the initial assembly contained a total of 96 contigs, with a contig N50 size of 12.20 Mb and a longest contig of 17.44 Mb. After the Hi-C data were added, the chromosome-level assembly had a scaffold N50 size of 12.61 Mb and a longest scaffold of 17.44 Mb, which indicates high contiguity of the genome assembly. Second, the genome assembly displayed a BUSCO completeness of 98.0% (97.4% single-copy genes and 0.6% duplicated genes) based on the Lepidoptera_odb10 database. Finally, to verify the accuracy of the genome assembly, we calculated the mapping rate by aligning clean Illumina data to the genome assembly; 99.02% of the Illumina reads aligned to the assembly. Overall, these assessments reflect the high quality and accuracy of the chromosome-level assembly.
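For readers unfamiliar with the N50 metric used above, it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch on toy lengths follows; the function name is illustrative.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total.
        if running >= half_total:
            return length


# Toy lengths (not real data): total 30, so half is 15; 10 + 8 = 18 >= 15.
toy_lengths = [2, 10, 4, 8, 6]
toy_n50 = n50(toy_lengths)
print(f"N50 = {toy_n50}")
```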

Genomic synteny analysis

Genome synteny analysis of A. yunnanensis and another Zygaenidae species, Zygaena filipendulae54, was conducted using MCScanX70 to identify the Z and W chromosomes and evaluate the accuracy of the genome assembly. A high degree of collinearity was observed between our assembly and the Z. filipendulae genome (Fig. 3). All chromosomes in our assembled genome, except for the W chromosome, exhibited strong collinearity with those of Z. filipendulae. The lack of linear correlation between the W chromosomes is likely due to the fact that MCScanX synteny analysis relies on the collinear analysis of coding genes, while the W chromosome contains few coding genes. Additionally, we identified a chromosomal fusion and fission event between the two genomes, with chromosome 21 of Z. filipendulae being syntenic to chromosomes 28 and 29 of A. yunnanensis. Apart from the W chromosome and that chromosomal fusion-fission event, all chromosomes in the assembled genome demonstrated one-to-one collinearity with those of Z. filipendulae, highlighting the accuracy of our genome assembly.

Fig. 3

Chromosome-level genomic synteny between Achelura yunnanensis and another Zygaenidae species, Zygaena filipendulae. Gray lines indicate conserved syntenic blocks between the two genomes.

Phylogenetic analysis

To determine the phylogenetic position of A. yunnanensis, we performed a phylogenomic analysis based on 4,316 single-copy protein-coding genes collected from the genomes of A. yunnanensis and 14 other lepidopteran species49,54,71,72,73,74,75,76,77,78,79,80,81,82 (Supplementary Table 4). Orthologous sequences of single-copy protein-coding genes among all species were determined using OrthoFinder (v2.5.4)83. Protein sequence alignments for each gene were built using MAFFT (v7.505)84, and poorly aligned regions were removed using Gblocks (v0.91b)85 with default settings. A phylogenetic tree was constructed from the concatenated supermatrix using FastTree (v2.1.11)86 under the JTT+CAT model. Based on the phylogenetic tree, r8s (v1.81)87 was used to estimate the divergence times among taxa. To calibrate the timetree, the divergence time between E. monodactyla and B. mori was fixed at 98 million years ago (Mya) according to the divergence time documented in the TimeTree database88. Our phylogenetic tree (Fig. 4a; rooted with Plutella xylostella82) showed that Zygaenidae, to which A. yunnanensis and Z. filipendulae belong, is the sister group of Limacodidae, and that the two families diverged approximately 70.10 Mya. Within the family Zygaenidae, the divergence time between A. yunnanensis and its European relative Z. filipendulae was estimated to be 58.11 Mya.

Fig. 4

Gene family evolution in Achelura yunnanensis. (a) The numbers of the expanded gene families (red) and contracted gene families (blue) are shown to the right of each branch. The pie charts represent the proportions of gene family expansions (red) and contractions (blue). (b) GO enrichment analysis on the expanded gene families of A. yunnanensis. The top 20 most significant GO categories were included (p < 0.05). (c) KEGG pathways enrichment analysis on the expanded gene families of A. yunnanensis. The graph depicts the most highly enriched pathways.

Gene family expansion and contraction

To investigate genome-wide changes associated with adaptation in the A. yunnanensis genome, we performed an analysis of gene family expansion and contraction across the 15 lepidopteran species using CAFE (v5.0)89 with a p-value cut-off of 0.05. Subsequently, we used the R package clusterProfiler (v4.10.0)90 to conduct GO and KEGG enrichment analyses on the significantly expanded gene families (p < 0.05).

We identified 531 expanded and 467 contracted gene families in A. yunnanensis (Fig. 4a; the detailed results for the expanded and contracted gene families are given in Supplementary Tables 5, 6). GO enrichment analysis (Fig. 4b) showed that the expanded genes were significantly enriched in catabolic processes, such as glycosphingolipid catabolic process (GO:0046479, P = 3.72 × 10^−30) and lipid catabolic process (GO:0016042, P = 3.74 × 10^−22). KEGG pathway enrichment analysis (Fig. 4c) suggested that the expanded genes were significantly involved in lipid metabolism and xenobiotics biodegradation and metabolism, such as the metabolism of xenobiotics by cytochrome P450 pathway (ko00980, P = 5.39 × 10^−17).
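Enrichment P values of this kind are commonly obtained from an over-representation test based on the hypergeometric distribution. The sketch below shows that calculation on made-up counts; it is an illustration of the general test under that assumption, not a reimplementation of clusterProfiler, and it omits multiple-testing correction.

```python
from math import comb


def hypergeom_enrichment_p(background, annotated, drawn, hits):
    """Return P(X >= hits) for a hypergeometric over-representation test.

    background: number of genes in the background set (N)
    annotated:  background genes carrying the term (K)
    drawn:      genes in the list of interest, e.g. expanded families (n)
    hits:       genes in the list of interest carrying the term (k)
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, drawn - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(background, drawn)


# Toy counts (not real data): 20 background genes, 5 annotated,
# 4 genes of interest, 3 of which carry the annotation.
p_value = hypergeom_enrichment_p(background=20, annotated=5, drawn=4, hits=3)
print(f"P = {p_value:.4f}")
```

With these toy counts the tail probability is 155/4845, about 0.032. In a real analysis, P values for many terms would additionally be adjusted for multiple testing.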

A. yunnanensis larvae feed on various plant species of the Rosaceae family9. The expansion of catabolism-related genes may contribute to their ability to feed on a wide range of plants, enhancing their adaptability as pests. Additionally, detoxification-related genes are crucial for herbivorous insects to neutralize toxic chemicals from their host plants or the environment. Several expanded gene families in the A. yunnanensis genome were significantly enriched in xenobiotic detoxification pathways, which may further increase the species' adaptability and complicate pest control efforts. Therefore, understanding the functions of the genes within these expanded gene families may help in developing novel pest management strategies.