Background

Lepidoptera is the second largest insect order, with > 160,000 species [1]. This order includes both butterflies and moths, many of which are important model organisms in ecology and evolutionary biology [2]. In Lepidoptera, mitochondrial (mt) genomes are widely used to study population genetics, phylogeography, phylogenetics and molecular taxonomy [3,4,5]. In particular, the mitogenome represents an ideal tool for the analysis of phylogenetic relationships due to its simple structure, maternal inheritance, low recombination, and high conservation over the course of evolution [6, 7]. Mitogenomes may also provide information to identify novel genes that may serve as targets in future research [8]. The mitogenome size of Lepidoptera ranges from 15,000 bp to above 16,000 bp [6, 7], mostly due to the variable length of noncoding regions, particularly the control region [8]. Moreover, lepidopteran mt genomes have a conserved rich adenine and thymine (A + T) region and usually consist of 37 genes, encoding 13 conserved protein-coding genes (PCGs), 22 tRNAs, 2 rRNAs, and a noncoding control region [6, 7, 9]. Until recently, the bulk of Argyresthiidae phylogenetic analyses utilized a common set of 8–11 mitochondrial and nuclear genes [10] and a set of up to 27 protein-coding genes [11,12,13,14,15,16]. However, inadequate node support hindered research that attempted to unravel relationships among superfamilies, even with over 1500 genes [17,18,19,20]. The potential causes and consequences of the competing phylogenetic hypotheses were discovered to be compositional bias and other model violations [21].

Yponomeutoidea is a large superfamily of Lepidoptera with 11 families and 1800 species [16, 22]. Surprisingly, mt genomes are available in databases for only seven species of this superfamily [23]. For the family Argyresthiidae, which belongs to this superfamily and contains 157 species [16], no mt genomes exist. Thus, obtaining mt genomes from this family is warranted to further resolve patterns of genomic evolution and to assess phylogenetic relationships.

The apple fruit moth (Argyresthia conjugella, Zeller) has a wide circumpolar distribution [24, 25]. Its main host, rowan (Sorbus aucuparia), is a masting species with spatiotemporally synchronized crop output [26]. In heavy intermast years, the apple fruit moth can hatch to find no host material available and will therefore seek secondary hosts, causing serious damage to apple crops [27]. A. conjugella is known to have high genetic diversity and a wide distribution in Fennoscandia [28,29,30], but the lack of complete mitogenomes for A. conjugella and the family Argyresthiidae hampers further studies on systematics, population genetics, taxonomy and evolutionary biology.

Our primary aim was to characterize the first complete mitogenome of a species in the family Argyresthiidae using the apple fruit moth in Norway as our study organism. Second, we analysed genome structure, base composition, substitution, and evolutionary rates among superfamilies using previously published Lepidoptera mitogenomes to obtain a better understanding of the phylogeny of Lepidoptera. We hypothesized that our phylogenetic analysis would recover Argyresthiidae nested within Yponomeutoidea. Furthermore, we evaluated the phylogenetic hypothesis that Argyresthiidae shows a sister-group relationship with Lyonetiidae, i.e., the ‘AL’ clade (Argyresthiidae + Lyonetiidae) of Sohn et al. (2013) [16], which was obtained based on nuclear genes. Finally, we aimed to provide an up-to-date identification of the source taxa of lepidopteran sequences lacking superfamily-, family-, and/or genus-level ID on GenBank using a phylogenetic systematics framework.

Results and discussion

Genome assembly

Except for the variable control region (CR) and Norgal assemblies, we recovered the same gene order and content using both of our mitogenome assembly strategies. We discovered that Norgal failed to assemble mitogenome sequences when using the de novo assembly strategy, with no mitogenomic features (PCGs, tRNAs, and rRNAs) found for both assemblies resulting from using default (assembly size = 24,871 bp) and adjusted parameters (assembly size = 29,520 bp, -m 500). However, when run using the baited de novo assembly strategy, Norgal recovered the same gene order and content as SPAdes and Geneious Prime®, with the exception of the large ribosomal RNA gene (rrnL), which differed in size by 1 bp and sequence from that recovered by SPAdes and Geneious Prime® (pairwise p-distance = 0.01). We found that the variations in mitogenome sizes were associated with the properties of the control region (CR), which include variation in the copy number of tandemly repeated sequences and extensive length variation of a variable domain [31, 32]. When using SPAdes under the de novo assembly strategy, the nearly complete CR (1101 bp) was recovered. When the baited de novo assembly strategy was used, SPAdes recovered a partial CR of 380 bp in which the repetitive sequences could not be assembled. As a result, we present the complete mitogenome sequences of the apple fruit moth from the SPAdes de novo assembly, where the mitogenome is a 16,044 bp closed circular molecule (GenBank accession: ON496993; Fig. 1). Interestingly, the mitogenome size of the apple fruit moth was similar to available Yponomeutoid mitogenomes [33,34,35], which are relatively longer on average compared to other superfamilies of Lepidoptera (n = 4, 16092 ± 353 bp, Table 1).

Fig. 1

Circular map of the complete mitogenome of Argyresthia conjugella depicting gene order. Labelling of tRNA genes was conducted in accordance with IUPAC-IUB single-letter amino acid codes

Table 1 General information and nucleotide composition for a subset of 51 representative mitochondrial genomes of the order Lepidoptera and 5 Trichopteran outgroups used in this study

Genome organization and base composition

The gene content of the apple fruit moth mitogenome is similar to that of other ditrysian insects studied previously, with 22 tRNA genes, 13 PCGs, 2 rRNAs and a noncoding control region. One strand codes for 9 PCGs (cob, cox1, cox2, cox3, atp6, atp8, nad2, nad3 and nad6) and 14 tRNAs (trnM, trnI, trnW, trnL2, trnK, trnD, trnG, trnA, trnR, trnN, trnS1, trnE, trnT and trnS2), while the opposite strand codes for 4 PCGs (nad1, nad4, nad4L and nad5), 8 tRNAs (trnC, trnF, trnH, trnL1, trnP, trnQ, trnV, trnY) and two mitochondrial rRNAs (rrnL and rrnS) (Fig. 1, Table 2). The lengths of the tRNA genes range from 64 to 75 bp (Table 2), which is well within the range reported for the tRNA genes of other lepidopterans: Plutella xylostella [34], Parnassius apollo [36], Leucoma salicis [7], Ephestia kuehniella [37] and Speiredonia retorta [6]. All 22 tRNAs had cloverleaf secondary structures, except trnS1, in which the dihydrouridine (DHU) arm is missing (Fig. 2). The loss of the DHU arm in tRNAs has been detected in various Lepidoptera species [6, 38, 39]. The lack of the DHU arm was hypothesized to have evolved in response to recognition signals for seryl-tRNA synthetases, reflecting potential differences in gene expression [40, 41]. The rrnL gene is located between trnV and trnL1, while rrnS lies between the control region and trnV. These are the same gene positions found in P. xylostella [34]. The lengths of rrnL and rrnS in A. conjugella are 1371 bp and 783 bp, while the lengths of these genes are 1371 bp and 783 bp, 1344 bp and 840 bp, and 1413 bp and 781 bp in S. retorta, L. salicis and P. xylostella, respectively [6, 7, 34]. The rRNA genes were A + T rich (82%), falling within the range detected in other Lepidoptera species, including Agrotis segetum [42], Agrotis ipsilon [43], Spodoptera frugiperda [44], and Papilio machaon [45]. The rRNA AT and GC skewness values were found to be negative in most of the analyzed Lepidoptera mitogenomes in the study, including A. conjugella; however, in Tecia solanivora [46], Spilarctia subcarnea [47] and S. retorta [6], these values were positive. In A. conjugella, the cox1 gene starts with ATT, which differs from the start codon in the Yponomeutoidea members P. xylostella, Leucoptera malifoliella and Prays oleae, where the start codon is CGA. The start codon of the cox1 gene was found to be variable in other Lepidoptera species [48]. The size of this gene (1534 bp) in A. conjugella is 3 bp larger than that in these three species (P. xylostella, L. malifoliella and P. oleae) of the same superfamily. The cox2 gene (682 bp) is the same size as that of L. malifoliella but larger than that found in P. xylostella and P. oleae (679 bp), while all these species have the same cox3 gene size (789 bp). The largest PCG found in the A. conjugella mitogenome is nad5 (1732 bp), and the smallest is atp8 (162 bp). Similar results are widely reported for various insect mitogenomes [49, 50]. The alignment of the overlapping atp8 and atp6 sequences in A. conjugella (Fig. 3) showed the conserved nucleotide sequence ATGATAA, which is detected in most lepidopteran species [34, 51].

Table 2 The organization and characteristics of the complete mitochondrial genome of Argyresthia conjugella

Fig. 2

Predicted secondary structures of the 22 typical tRNA genes in the A. conjugella mitogenome

Fig. 3

Alignment of the atp8 and atp6 overlap of the selected lepidopteran species in the study, including A. conjugella. The green arrow shows the atp6 start codon, and the red arrow shows the atp8 stop codon

We found that the location of the trnM gene follows the ditrysian arrangement trnM-trnI-trnQ [52], in which trnM is translocated relative to the ancestral order trnI-trnQ-trnM retained in non-ditrysian groups of Lepidoptera [52,53,54]. The control region of A. conjugella is large (1101 bp), which is a common feature detected in the superfamily Yponomeutoidea [35]. In comparison, the CRs of the olive and diamondback moths were found to be ~ 1600 bp and ~ 1081 bp, respectively [34, 35]. We found that the CR comprises nonrepetitive sequences, including the motif ‘ATAGA’ followed by a 20 bp poly-T stretch, dinucleotide microsatellites (AT)18 and (AT)53, each flanked by ATTTA motifs, a (TAAA)4 adjacent to trnM instead of the 11 bp poly-A adjacent to tRNAs, and several imperfect repeat elements, indicating that the sequence in the present study may be partial. The nucleotide composition of the CR was highly AT-rich, with an estimated AT content of 94.3% (A: 47.6%, T: 46.7%, G: 1.8%, C: 3.9%), a positive AT skew (0.010) and a negative GC skew (− 0.368). Overall, the nucleotide composition of the apple fruit moth mitogenome was also highly AT-rich, with an estimated AT content of 82% (A: 40.8%, T: 41.2%, G: 7.4%, C: 10.6%) and negative AT and GC skews (− 0.005 and − 0.178, respectively) (Table 1). These results are in agreement with results obtained in P. xylostella [34], L. salicis [7], E. kuehniella [37] and S. retorta [6].

The codon usage in A. conjugella was compared with twelve Lepidopteran species from different families (Fig. 4). The comparison showed that the pattern of codon usage in the PCGs of the A. conjugella mitogenome is very similar to the patterns in these Lepidopteran mitogenomes. Asn, Ile, Leu2, Met and Phe are the most commonly used codon families in all these species, while Cys codons are the rarest (Figs. 4 and 5). The relative synonymous codon usage (RSCU) was analysed for A. conjugella and compared with the same set of Lepidopteran insects (Fig. 6). CTG, CTC, AGG and ACG were completely absent in the A. conjugella mitogenome PCGs. Codons with high G and C contents are also rare or absent in the PCGs in other Lepidopteran mitogenomes. Moreover, TTA (Leu2), TCT (Ser2), CGT (Arg), GCT (Ala), and GGA (Gly) are the most frequently used codons and account for 36.41%. These five amino acids are also detected in other Lepidoptera species, such as Manduca sexta [55], Helicoverpa armigera [56], P. xylostella [34], T. solanivora [46], P. machaon [45], and Ostrinia nubilalis [57]. In particular, Leu2 was found to be the most frequently detected amino acid in all Lepidoptera species in the study, and this result is supported by results found in L. salicis [7] and S. retorta [6].
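As an illustration of the RSCU statistic discussed above, the following minimal sketch computes RSCU as the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The codon families listed (a three-family subset of the invertebrate mitochondrial code), the helper names and the toy coding sequence are assumptions made for illustration; they are not the software or data used in this study.

```python
from collections import Counter

# Illustrative subset of synonymous codon families (invertebrate mitochondrial
# code); a real analysis would list every amino acid family.
FAMILIES = {
    "Leu2": ["TTA", "TTG"],
    "Leu1": ["CTT", "CTC", "CTA", "CTG"],
    "Phe": ["TTT", "TTC"],
}


def count_codons(cds):
    """Count in-frame codons; a trailing partial codon is ignored."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def rscu(counts, families=FAMILIES):
    """RSCU = observed count / (family total / number of synonymous codons)."""
    values = {}
    for codons in families.values():
        total = sum(counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not used: RSCU is undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = counts.get(c, 0) / expected
    return values


# Toy, AT-biased coding sequence (hypothetical, not from A. conjugella)
toy_cds = "TTA" * 9 + "TTG" + "CTT" * 3 + "CTA" + "TTT" * 8 + "TTC" * 2
result = rscu(count_codons(toy_cds))
for codon in sorted(result):
    print(codon, round(result[codon], 2))
```

Within each family the RSCU values sum to the number of synonymous codons, so values above 1 mark codons used more often than expected under equal usage, and a value of 0 marks an unused codon such as CTG or CTC in this toy example.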

Fig. 4

Comparison of codon usage of the 20 selected mitochondrial genomes of the Lepidoptera species in the study, including A. conjugella

Fig. 5

Relative synonymous codon usage (RSCU) of the 20 selected mitochondrial genomes of Lepidoptera in the study, including A. conjugella. Codons are plotted on the x-axis

Fig. 6

The distribution of codons among the selected lepidopteran species in the study. CDspT, codons per thousand codons

Phylogenetics

To obtain an overview of A. conjugella and its relationships with other Lepidoptera species, our study investigated 18 superfamilies representing 42 families and 507 Lepidoptera species (Tables S1, S2 and Figure S2). This is the first phylogenetic study (using the mt genome) of A. conjugella in the family Argyresthiidae, which belongs to the superfamily Yponomeutoidea. Various studies have tried to resolve the phylogenetic tree of Lepidoptera using mitochondrial genomes, nucleotide alignments, amino acid alignments, transcriptomes and target enrichment approaches [6, 7, 9, 17,18,19,20,21, 58]. However, inadequate node support hindered research that attempted to unravel relationships among superfamilies [17,18,19,20,21]. The challenge is not a lack of data but how the data are analysed, the quality of the data and the number of taxa investigated [18, 21]. We constructed a phylogenetic tree using 507 Lepidoptera species (Fig. S2) and a subset of 51 species (Fig. 7) to understand the position of A. conjugella in the Lepidoptera phylogenetic tree. Using the ML approach, analyses of the three datasets (specified in the Materials and methods section) resulted in three topologies. Generally, our study agrees with the most recent study by Rota et al. (2022) [21], which detected nine main clades of superfamilies in a butterfly and moth phylogeny using 331 genes for 200 taxa. Additionally, our phylogenetic analysis supports the previous morphological characterization of the superfamily Yponomeutoidea [16, 59, 60]. The analysis of the 507 Lepidoptera species showed that some families clustered together, such as Papilionidae & Pieridae, Pyralidae & Tortricidae, Geometridae & Sphingidae, Erebidae & Noctuidae and Gelechiidae & Sphingidae, while other families, such as Tortricidae and Crambidae, clustered alone and separately. Yponomeutoidea was recovered as a well-supported monophyletic group and as one of the earliest lepidopteran groups after Tineoidea and the basal Hepialoidea (Fig. 7, Figures S1 and S2). However, the paraphyly of Tineoidea to some extent destabilized the monophyly of Yponomeutoidea in Datasets 1 and 2 (Fig. 7, Figure S1), which was fully resolved with dense taxon sampling (Figure S2). Wang et al. (2018) [61], Bao et al. (2019) [38], Jeong et al. (2022) [23] and Zhang et al. (2020) [62] all found similar results for the superfamilies Yponomeutoidea and Tineoidea. Furthermore, Bao et al. (2019) [38] and Jeong et al. (2022) [23] also found that Yponomeutoidea, Tineoidea and Gracillarioidea in Ditrysia have strong phylogenetic relationships. We also detected strong relationships between Yponomeutoidea, Zygaenidae and Tortricoidea, findings that are in line with the results of Liu et al. (2016) [48], Zhang et al. (2020) [62], Wang et al. (2018) [61], and Kim et al. (2014) [63]. Only a weak phylogenetic relationship was observed between the superfamilies Yponomeutoidea and Bombycoidea, a result that is supported by Liu et al. (2016) [64] and Liu et al. (2017) [65]. Nonetheless, we consistently recovered Argyresthiidae embedded in Yponomeutoidea with a sister-group relationship to Plutellidae (Dataset 1: SH-aLRT = 92, UFBoot2 = 100; Dataset 2: SH-aLRT = 88, UFBoot2 = 100; Dataset 3: SH-aLRT = 87, UFBoot2 = 99). Our phylogenetic tree hypothesis rejects the provisional ‘AL’ clade (Argyresthiidae + Lyonetiidae) recovered with nuclear gene datasets by Sohn et al. (2013) [16]. We found that the placement of Lyonetiidae was unstable, possibly due to its relatively long branch length. We recovered Lyonetiidae as basal to the Yponomeutoidea clade (Figure S1, Dataset 1: SH-aLRT = 99, UFBoot2 = 100), as a sister-group to Praydidae within Yponomeutoidea (Figure S2, Dataset 3: SH-aLRT = 84, UFBoot2 = 100), or as a sister-group to Gracillariidae of the superfamily Tineoidea, although with weak support (Fig. 7, Dataset 2: SH-aLRT = 43, UFBoot2 = 91). With increased taxon sampling, our phylogenetic tree hypotheses strongly supported the basal placement of Lyonetiidae within the Yponomeutoidea clade (Fig. 7, Figure S2, Dataset 2: SH-aLRT = 98, UFBoot2 = 99). Moreover, we consistently recovered the previously described pairing of Yponomeutoidea and Gracillariidae as internested subclades [16, 22]. At a higher level, our phylogenetic tree hypothesis recovers some fundamental and uncontroversial lepidopteran clades that agree with the majority of mitogenomic phylogenies as well as those that included both mitochondrial and/or nuclear markers. The analyses found that A. conjugella had the closest relationship with P. xylostella, L. malifoliella and P. oleae, which belong to the families Plutellidae, Lyonetiidae and Praydidae, respectively (Fig. 7, Figure S2). Wei et al. (2013) [34], Sohn et al. (2013) [16], Liu et al. (2016) [48], Yang et al. (2020) [66], Jeong et al. (2021) [67] and Jeong et al. (2022) [23] all found that P. xylostella, L. malifoliella and P. oleae are closely related.

Fig. 7

Phylogenetic tree including A. conjugella, constructed using the nearest neighbor interchange (NNI) approach to search for tree topology, with branch supports computed from 1000 replicates of the Shimodaira-Hasegawa approximate likelihood-ratio test (SH-aLRT) [68] and 1000 replicates of the ultrafast bootstrap (UFBoot2) approach [69]. Phryganea cinerea, Phryganopsyche latipennis, Cheumatopsyche brevilineata, Limnephilus hyalinus, and Stenopsyche angustata were used as outgroup species (Table 1)

In our study, the superfamily Tineoidea was represented by four species (Amorophaga japonica, Dahlica ochrostigma, Gibbovalva kobusi and Eudarcia gwangneungensis) with relatively high nodal support (Fig. 7, Figure S2). This superfamily is known to have high genetic diversity and has three different lineages [21]. The complexity of the relationships within the Tineidae group and the disagreements regarding the superfamilies Gelechioidea, Carposinoidea, and Pterophoroidea remain unresolved issues. Both this study and that of Rota et al. (2022) [21] detected a sister relationship between Yponomeutoidea and Tineoidea, with the subclades Gelechioidea, Tortricoidea and Zygaenoidea clustered together in the same clade. Rota et al. (2022) [21] found Gelechioidea clustered at different positions when different analyses were performed with different datasets; this may be explained by a high amount of compositional heterogeneity or by the limited material used in that study (five species). In contrast, in our study the 20 species of the superfamily Gelechioidea clearly clustered together using both datasets and both analyses (EME and NJ), but surprisingly, a single species (Periacma orthiodes) assigned to the superfamily Noctuoidea clustered together with this superfamily. This might reflect a misidentification of the taxon of the mt genome deposited in GenBank. In our study, Gelechioidea grouped together with Pyralidae, in agreement with the results of [17]. Pyraloidea was also sister to Carposinoidea, and Calliduloidea, Pterophoroidea, Gelechioidea and Thyridoidea were recovered in the same part of the tree, but with Thyridoidea sister to Macroheterocera [21]. Previously, Pterophoroidea was reported as a sister group to a monophyletic Papilionoidea, including Hedyloidea and Hesperioidea. In the same study, Choreutoidea and Immoidea were recovered as sister to Tortricoidea [21]. However, when 50 genes were removed, Choreutoidea was recovered as sister to Urodoidea and Pterophoroidea [21]. One phylogenetic study reported Pterophoroidea within the clade Obtectomera [19], but a more recent study showed results contrary to these findings [21]. The position of Pterophoroidea is highly dependent on the dataset. This superfamily is recovered in the same clade as Urodoidea regardless of the alignment analysed, whereas its recovery in the clade with Gelechioidea, Calliduloidea and Thyridoidea depends on which datasets are analysed [21]. Pterophoroidea can also be recovered as sister to Papilionoidea and Noctuoidea when different datasets are used [70]. It should also be noted that systematic errors and alignment issues can persist with regard to detecting homologies, even when using software designed to assess alignment quality with a threshold of alignment scores [71].

Comprehensive analyses of insect mitogenomes provide important phylogenetic information to identify potentially novel genes that may serve as valuable targets in future research efforts. Further investigations of the whole genome of A. conjugella along with other genomes of Lepidoptera species will facilitate the understanding of the taxonomy and evolutionary process acting on the Ditrysia natural group.

Materials and methods

Specimen collection and DNA extraction

During August 2016, we collected a single female apple fruit moth larva from an infested rowan berry in the field in Skiftenes (N 6471746 and E 472502) in southern Norway. To confirm species identification of the larva, we employed both morphological [24, 65] and molecular methods [28] using microscopy and STR markers, respectively. We placed the apple fruit moth larva on rolls of corrugated cardboard until it entered pupal diapause, and then we stored it at -80 °C until DNA was extracted. DNA was extracted from the apple fruit moth pupal tissue using the DNeasy Blood and Tissue Kit (Qiagen, Tokyo) following a modified version of the manufacturer’s instructions [28].

Mitogenome sequencing and assembly

We outsourced the whole genome sequencing of the apple fruit moth to the Norwegian Sequencing Centre (Oslo, Norway), where the whole genome library was prepared (insert size = 350 bp) and sequenced on one lane of the Illumina HiSeq 4000 platform (Illumina, USA) with paired end (PE) sequencing (2 × 150 bp). A total of 820,368,162 raw reads (of which 820,365,390 were paired in sequencing, i.e., 410,182,695 PE read clusters) were generated. We evaluated the quality of the Illumina sequencing run using MultiQC v.2.31 [72]. Then, we used AdapterRemoval v.2.1.3 to search for and remove adapter sequences and to trim low-quality bases from the 3' end of reads following adapter removal [73, 74]. After quality control (QC), we used the cleaned PE reads for mitogenome assembly by means of two assembly strategies: (i) de novo and (ii) ‘baited’ de novo.

For de novo sequence assembly (i), we employed the programs Norgal v.1.0 [75] and SPAdes v.3.15.3 [76]. The Norgal assembler was executed with (1) default parameters and (2) a k-mer range of 21–255 with an interval of 28 and a contig length threshold of 500 bp. We executed SPAdes with k-mer sizes of 21, 33, 55, 77, 99 and 127 (-k 21,33,55,77,99,127) and the --careful option to minimize the number of mismatches in the final contigs. For the baited de novo assembly strategy (ii), we first constructed the FM-index for the mitogenome reference sequence of the olive moth P. oleae (Bernard, 1788; Lepidoptera: Yponomeutoidea: Praydidae) (NCBI accession number NC_025948.1) [35] using the index command of the BWA v.0.7.17 aligner [77]. Additionally, we used the mitogenome reference sequence of the diamondback moth P. xylostella (Linnaeus, 1758; Lepidoptera: Yponomeutoidea: Plutellidae) (NCBI accession number JF911819.1) [34]. We selected the mitogenomes of the olive and diamondback moths as references due to their completeness, their taxonomic and phylogenetic placement in Yponomeutoidea, and the reliability of the PCR-based amplification method used to sequence these mitogenomes (14 segments of 1.2–2.4 kb and nine segments of variable size, respectively). Second, we aligned the cleaned PE reads of the apple fruit moth separately to the indexed reference genome of each ‘model’ moth using the BWA-MEM algorithm of BWA, excluding reads with a quality score < 30, and then used the SAMtools v.1.9 suite [78] to convert the SAM alignment file to BAM. Third, we sorted and indexed the BAM alignment file using the sort and index commands, respectively, from SAMtools. Fourth, we obtained the QC statistics for the sorted and indexed alignment using BAMQC as implemented in Qualimap v.2.2.20 [79]. Fifth, we extracted reads that mapped properly as pairs using SAMtools. Finally, we used the filtered mitochondrial reads for de novo mitochondrial genome assembly using Norgal, SPAdes and Geneious Prime® v.2022.1.1 (Biomatters Ltd., Auckland, New Zealand) [80].
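The baited de novo workflow described above can be sketched as a sequence of command lines. The helper below only builds the argument lists and executes nothing; all file names and the prefix are hypothetical, the Qualimap QC step is omitted, and reading the quality threshold as a mapping-quality filter (samtools view -q 30) is an assumption rather than a confirmed detail of our pipeline. Flags should be checked against the installed versions of BWA, SAMtools and SPAdes.

```python
def baited_assembly_commands(ref, r1, r2, prefix="afm"):
    """Return command lines (argument lists) for the baited strategy.

    Nothing is executed here; each list can be passed to subprocess.run.
    """
    sam = f"{prefix}.sam"
    bam = f"{prefix}.q30.bam"
    srt = f"{prefix}.sorted.bam"
    paired = f"{prefix}.paired.bam"
    byname = f"{prefix}.byname.bam"
    b1 = f"{prefix}_mito_R1.fastq"
    b2 = f"{prefix}_mito_R2.fastq"
    return [
        ["bwa", "index", ref],                        # build the FM-index
        ["bwa", "mem", "-o", sam, ref, r1, r2],       # align cleaned PE reads
        # Assumption: the threshold of 30 is applied as a MAPQ filter
        ["samtools", "view", "-b", "-q", "30", "-o", bam, sam],
        ["samtools", "sort", "-o", srt, bam],         # coordinate sort
        ["samtools", "index", srt],
        # Keep only reads flagged as mapped in a proper pair (flag 0x2)
        ["samtools", "view", "-b", "-f", "2", "-o", paired, srt],
        # Name-sort so that mates are adjacent before FASTQ export
        ["samtools", "sort", "-n", "-o", byname, paired],
        ["samtools", "fastq", "-1", b1, "-2", b2,
         "-0", "/dev/null", "-s", "/dev/null", byname],
        ["spades.py", "-1", b1, "-2", b2,
         "-k", "21,33,55,77,99,127", "--careful", "-o", f"{prefix}_spades"],
    ]


# Hypothetical input names for illustration
commands = baited_assembly_commands("P_oleae_mito.fasta", "clean_R1.fastq", "clean_R2.fastq")
for cmd in commands:
    print(" ".join(cmd))
```

Singletons are discarded at the FASTQ export step because a read can lose its mate to the quality filter, and the assembler expects both paired files to contain the same reads in the same order.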

Mitogenome annotation and visualization

We conducted a preliminary annotation of the mitogenome assembly using the MITOS2 webserver (http://mitos2.bioinf.uni-leipzig.de/index.py) [81] to assess the locations of protein-coding genes (PCGs), transfer RNA genes (tRNAs), and ribosomal RNA genes (rRNAs). Then, we confirmed gene boundaries for PCGs and rRNAs manually using BLASTn, SMART BLAST, BLASTp, and ORF Finder as implemented at the National Center for Biotechnology Information (NCBI) database [82]. Subsequently, we validated that coding sequences were translated in the correct reading frame and confirmed the initiation and termination codons in Geneious Prime® using the published mitochondrial genome sequences of other moths as references, including the olive moth. We then used the program ARWEN [83] to detect the tRNA genes of the apple fruit moth and finally predicted the secondary structures of tRNAs using MITFI [84] as implemented in MITOS2 and tRNAscan-SE v.2.0 [85, 88]. We also annotated the control (A + T-rich) region (CR) of the apple fruit moth by screening for structural elements characteristic of the region, which include (i) tandem repeats, identified using Tandem Repeats Finder v.4.10 [86] with default settings, and (ii) the motif ‘ATAGA’ and poly-T stretch. We produced the annotated circular map of the complete mitochondrial genome of the apple fruit moth using the beta version of the CGview server (http://cgview.ca) [87].

Comparative mitogenomics of Lepidoptera

We conducted a systematic and comprehensive search for complete mitochondrial genomes of Lepidopteran species published in the NCBI nucleotide database using the following keywords: (“lepidoptera”[Organism] OR “lepidoptera”[All Fields]) AND “complete mitochondrial genome”[All Fields] AND mitochondrion[filter] (10 May 2022: 842 hits). We downloaded and processed the full GenBank files in Geneious Prime® to (i) obtain taxonomy metadata, (ii) remove least recently modified duplicates, (iii) remove nonlepidopteran species, and (iv) remove mitogenomes with > 90% missing annotations (retained 507 species). To ensure that the taxonomic status of all species was the latest, we verified all the species names against 60 taxonomic databases, including the Catalogue of Life and the Integrated Taxonomic Information System (ITIS), using the R package taxize v.9.94.91 [89]. Then, we corrected any misspellings and used the classification function implemented in taxize to retrieve the taxonomic ranks of individual species. We included seven species of Trichoptera, representing five families and four superfamilies, to serve as outgroups.

We compared the assembled A. conjugella mitochondrial genome with the mitochondrial genomes of 507 other Lepidoptera obtained from GenBank, representing 18 superfamilies and 42 families (Supplementary Table S2). We included only one representative per valid species (longest mitogenome sequence) when more than one known sequence was available in GenBank. We calculated the overall composition of individual mitogenomes based on the proportion of A + T out of the total (%AT content) using MEGA v.11.0.11 [90]. To measure the base composition skewness of nucleotide sequences, we used the formulae of Perna and Kocher (1995) [91]: AT-skew = [A-T]/[A + T] and GC-skew = [G-C]/[G + C].
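The skew formulae above can be written as a short sketch. The function names are illustrative (the published values were computed with MEGA, not with this code); the percentages passed to the function are the base compositions reported for the A. conjugella mitogenome and its control region in the Results.

```python
def skew(x, y):
    """Return (x - y) / (x + y); accepts base counts or percentages."""
    if x + y == 0:
        raise ValueError("skew is undefined when both bases are absent")
    return (x - y) / (x + y)


def composition(seq):
    """Return (%AT, AT-skew, GC-skew) for a nucleotide sequence.

    Only unambiguous bases (A, C, G, T) are counted.
    """
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    at_content = 100 * (counts["A"] + counts["T"]) / total
    return at_content, skew(counts["A"], counts["T"]), skew(counts["G"], counts["C"])


# Whole mitogenome: A 40.8%, T 41.2%, G 7.4%, C 10.6%
genome_at_skew = skew(40.8, 41.2)   # reported as -0.005
genome_gc_skew = skew(7.4, 10.6)    # reported as -0.178

# Control region: A 47.6%, T 46.7%, G 1.8%, C 3.9%
cr_at_skew = skew(47.6, 46.7)       # reported as 0.010
cr_gc_skew = skew(1.8, 3.9)         # reported as -0.368

print(round(genome_at_skew, 3), round(genome_gc_skew, 3))
print(round(cr_at_skew, 3), round(cr_gc_skew, 3))
```

Because the skew is a ratio, it gives the same value whether raw base counts or rounded percentages are supplied, up to the rounding error of the percentages.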

Sequence alignment and phylogenetic reconstruction

We produced codon-aware multiple sequence alignments for each of the 13 PCGs using MACSE v.2.01 [92]. We inspected and manually trimmed each set of alignments using MEGA, and any remaining ambiguously aligned sites were then further trimmed using BMGE v.1.12.1, with a sliding window size of 3 and maximum entropy of 0.5 [93]. We aligned rRNA genes using the online version of MAFFT v.7.299 [94, 95] and removed ambiguously aligned sites using BMGE. Before phylogenetic analysis, we produced two concatenated mitogenomic datasets from (i) the aligned individual PCG datasets (Dataset 1: 13PCGs_NT dataset) and (ii) the 13 PCGs plus the large and small mitochondrial ribosomal RNA (rRNA) genes (rrnL and rrnS) (Dataset 2: 13PCGs_rRNAs_NT dataset) with the R package concatipede v1.0.1 [96]. We derived the third mitogenomic dataset by translating the 13PCGs_NT dataset in MEGA (Dataset 3: 13PCGs_AA dataset). Furthermore, we used DAMBE v.7.2.141 [97] to conduct two-tailed tests of substitution saturation [98] for each codon position of the 13 PCGs, taking into account the proportion of invariant sites as recommended by Xia and Lemey (2009) [99]. According to the observed index of substitution saturation (ISS), all codon positions showed little saturation (ISS < ISScSym (assuming a symmetrical topology) and ISS < ISScAsym (assuming an asymmetrical topology); see Supplementary Table S2). Likewise, visual inspection of nucleotide saturation for each codon position of the 13 PCGs with DAMBE by plotting transitions and transversions against Kimura two-parameter [100] distances showed little saturation in all codon positions. Therefore, none of the codon positions were excluded, and the 13 PCG nucleotide (Dataset 1) and protein (Dataset 3) datasets were initially gene-by-codon partitioned (39 partitions) and gene partitioned (13 partitions), respectively. 
For Dataset 2, we designated two partitions for the rRNA genes (rrnS and rrnL, treated each as a single partition) and 39 partitions covering the three codon positions in each of the 13 protein-coding genes.
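To make the initial partitioning of Dataset 2 concrete, the sketch below writes a NEXUS sets block with three codon-position charsets for each of the 13 PCGs and one charset per rRNA gene (39 + 2 = 41 partitions). The gene order in the concatenation and the aligned lengths are hypothetical placeholders, not the values of our alignments.

```python
PCGS = ["atp6", "atp8", "cob", "cox1", "cox2", "cox3", "nad1",
        "nad2", "nad3", "nad4", "nad4L", "nad5", "nad6"]
RRNAS = ["rrnL", "rrnS"]


def nexus_partitions(lengths, pcgs=PCGS, rrnas=RRNAS):
    """Build a NEXUS sets block for a concatenated alignment.

    Each PCG is split into three codon-position charsets; each rRNA gene
    is kept as a single charset. Genes are assumed to be concatenated in
    the order pcgs + rrnas.
    """
    lines = ["#nexus", "begin sets;"]
    start = 1
    for gene in pcgs + rrnas:
        end = start + lengths[gene] - 1
        if gene in pcgs:
            for pos in range(3):
                # "start-end\3" selects every third site from 'start'
                lines.append(f"  charset {gene}_pos{pos + 1} = {start + pos}-{end}\\3;")
        else:
            lines.append(f"  charset {gene} = {start}-{end};")
        start = end + 1
    lines.append("end;")
    return "\n".join(lines)


# Hypothetical aligned lengths in bp (PCG lengths are multiples of 3)
toy_lengths = {**{gene: 300 for gene in PCGS}, "rrnL": 1400, "rrnS": 800}
text = nexus_partitions(toy_lengths)
n_charsets = sum(line.strip().startswith("charset") for line in text.splitlines())
print(text)
```

Dropping the two rRNA entries from the gene list gives the 39-partition scheme used for Dataset 1.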

We used ModelFinder [101] to select the best-fitting partitioning scheme and models of evolution using the corrected Akaike Information Criterion (AICc) and the edge-linked proportional partition model [102] as implemented in IQ-Tree v.2.2.0.3 [103]. We applied the new model selection procedure (-m MF+MERGE), which additionally implements the FreeRate heterogeneity model, inferring the site rates directly from the data instead of drawing them from a gamma distribution (-cmax 20) [104]. To reduce the computational burden, only the top 30% of partition merging schemes were inspected using the relaxed clustering algorithm (-rcluster 30), as described in [105].
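The model selection settings described above correspond roughly to the following command line, shown here as a helper that only assembles the arguments. The executable name (iqtree2), the input file names and the output prefix are assumptions for illustration, and the exact spelling of options can differ between IQ-TREE releases, so they should be checked against the installed version.

```python
def modelfinder_command(alignment, partitions, prefix="modelfinder"):
    """Assemble an IQ-TREE 2 call for partition merging and model selection.

    Nothing is executed; the returned list can be passed to subprocess.run.
    """
    return [
        "iqtree2",
        "-s", alignment,       # concatenated alignment
        "-p", partitions,      # edge-linked proportional partition model
        "-m", "MF+MERGE",      # ModelFinder with partition merging, no tree search
        "--merit", "AICc",     # corrected Akaike Information Criterion
        "-cmax", "20",         # maximum number of rate categories for FreeRate models
        "-rcluster", "30",     # relaxed clustering: examine the top 30% of schemes
        "--prefix", prefix,
    ]


# Hypothetical file names
mf_cmd = modelfinder_command("dataset1_13PCGs_NT.fasta", "dataset1_partitions.nex")
print(" ".join(mf_cmd))
```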

We reconstructed phylogenies based on the maximum likelihood (ML) criterion in IQ-Tree, where we used the substitution models indicated by ModelFinder (Table 3). We used the nearest neighbor interchange (NNI) approach to search for tree topology and for computing branch supports with 1000 replicates of the Shimodaira-Hasegawa approximate likelihood-ratio test SH-aLRT [68] and 1000 bootstrapped replicates of the ultrafast bootstrapping (UFBoot2) approach [69]. We abided by the advice that clades with UFBoot2 ≥ 95 and SH-aLRT ≥ 80 can be regarded as being well supported [106].
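The support criterion above can be expressed as a small check, applied here to support values reported in the Phylogenetics section. The helper that assembles a tree-search command is a sketch only: the executable name, file names and prefix are hypothetical, and the option spellings should be verified against the installed IQ-TREE version.

```python
def tree_search_command(alignment, best_scheme, prefix="ml_tree"):
    """Assemble an IQ-TREE 2 ML search with SH-aLRT and UFBoot2 supports."""
    return ["iqtree2", "-s", alignment, "-p", best_scheme,
            "--alrt", "1000",   # 1000 SH-aLRT replicates
            "-B", "1000",       # 1000 ultrafast bootstrap replicates
            "--prefix", prefix]


def well_supported(sh_alrt, ufboot, min_alrt=80, min_ufboot=95):
    """Apply the rule SH-aLRT >= 80 and UFBoot2 >= 95."""
    return sh_alrt >= min_alrt and ufboot >= min_ufboot


# (SH-aLRT, UFBoot2) values as reported in the text
reported = {
    "Argyresthiidae + Plutellidae, Dataset 1": (92, 100),
    "Argyresthiidae + Plutellidae, Dataset 2": (88, 100),
    "Argyresthiidae + Plutellidae, Dataset 3": (87, 99),
    "Lyonetiidae + Gracillariidae, Dataset 2": (43, 91),
}
verdicts = {name: well_supported(*values) for name, values in reported.items()}
for name, ok in verdicts.items():
    print(name, "well supported" if ok else "weakly supported")
```

Under this rule a node must pass both thresholds, so the Lyonetiidae + Gracillariidae grouping fails on both SH-aLRT and UFBoot2, consistent with its description as weakly supported.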

Table 3 The substitution models selected by ModelFinder [101] and used to reconstruct phylogenies based on the maximum-likelihood (ML) criterion in IQ-TREE