Background

Eukaryotic mitochondria are monophyletic and originate from the bacterial phylum Alphaproteobacteria [1, 2]. One of the descendant lineages, the animal mitochondrion, contains a circular DNA molecule around 16 kb in length. It comprises 22 transfer RNAs (tRNAs), 13 protein-coding genes (PCGs), two ribosomal RNAs (rRNAs), and one or more non-coding regions including the control region [3]. Mitochondrial sequences have been the most popular genetic markers for many types of studies, including disease detection [4], species identification [5], phylogeography [6, 7], and phylogenetics [8, 9].

Mitogenomic data are effective for revealing higher-level relationships of diverse animal groups [918]. Although Sanger sequencing has been used to obtain high-quality mitogenomic sequences, the time required for primer design, and the cost invested to recover large numbers of mitogenomic sequences remains challenging. Recently, next-generation sequencing (NGS) methods have been shown to overcome such shortcomings [19]. This makes practical the task of re-examining and re-evaluating phylogenies with much larger datasets.

In the second largest insect order Lepidoptera, there were 172 mitogenomes representing 113 species on NCBI (as accessed on May 14, 2014). However, 44 of the 172 submitted mitogenomes were from the genus Bombyx[2022], 53 of the 113 species were from butterflies (Papilionoidea), and only ten of 43 currently recognised superfamilies have been covered [2327]. For inferring molecular phylogeny, mitochondrial sequences have been questioned on the basis that they are too saturated to distinguish deep-level relationships [28, 29]. Though some studies point out that other mitochondrial genes have more relevant information than the most commonly used cox1 or rrnL genes [9, 27, 30], researchers have instead focused on using nuclear genes to infer phylogenies. To evaluate sufficiently the utility of mitogenomes in inferring lepidopteran relationships at the level of family and below, here we focus on the well-known group of brush-footed butterflies, Nymphalidae.

The family Nymphalidae was chosen as our primary focus, firstly because this group represents the most diverse butterfly family with 12 subfamilies, 559 genera and 6,152 species, that is one third of all butterfly species [3134]. Secondly, Nymphalidae is also arguably the most utilised lepidopteran family in biological studies as they are distributed in various habitats worldwide and include many model species for ecological, conservation, evolutionary and developmental studies [3537]. Thirdly, there is a good existing framework against which to judge our results, because the backbone relationships of Nymphalidae have been inferred based on dense sampling of both morphological characters and DNA sequences [31, 38]. Nymphalidae phylogeny is far from fully resolved though, because although groups at subfamily and tribal levels appear to be monophyletic, some nodal support at these levels remains weak with no topological consensus. In particular, the branching pattern at deep nodes remains unresolved: both subfamilies Libytheinae and Danainae have been placed as sister to the rest of the family, but without strong nodal support (ML bootstrap ≤ 70, MP bootstrap < 48) [31]. Also in the subfamily case study we focus on here, the grouping of the tribes (Adoliadini and Limenitidini) in the Limenitidinae has been inconclusive due to low support values (MP bootstrap =40, and ML bootstrap not significant) [31].

There is also a particular prior interest in the tribe Limenitidini, as these butterflies include models for the evolution of mimicry [39]. The mimetic wing patterns of Limenitis butterflies are considered to evolve multiple times and hybridised frequently [40, 41] and the wing patterns of these gliders have converged to an extraordinary degree for butterflies not widely thought to be Müllerian mimics with strong predator unpalatability [42]. The phylogenetic relationships and evolutionary patterns of the American Limenitidini butterflies have been analysed in particular detail [39, 40, 43]. By contrast, the phylogenetic relationships of Palearctic Limenitidini butterflies have been almost neglected. Even though some studies have used nuclear (EF1-α) and mitochondrial genes (cox1 or cob gene) to investigate the relationships of East Asian Limenitidini [44, 45], results based on these genes are still unresolved and poorly supported. A further study even showed a polyphyletic relationship of the genus Athyma[46], part of which had a sister relationship with Limenitis butterflies [31]. Further effort is needed to clarify the relationship of this group and unravel the evolutionary basis of mimicry.

The molecular data deposited in public databases have taxonomic biases and annotation errors [47, 48]. However, they still enable the testing of various strategies of data partitioning and substitution models to improve the accuracy of phylogenetic inference [49, 50]. Branch lengths differ greatly when different substitution models are applied, potentially producing different topological and molecular dating estimates [51]. The mitogenome has long been considered as a single non-recombining locus because of maternal inheritance, storing substantial numbers of mutations heterogeneously distributed among genes [52, 53]. The lack of extensive taxon sampling across Nymphalidae, however, has impeded the evaluation of the effects of different partitioning methods on mitogenomic phylogenies. A denser coverage of nymphalid mitogenomes subjected to various partitioning methods is needed for accurate reconstruction of phylogenies in this group.

The aim of this study was to establish a highly supported phylogeny for the Nymphalidae using 70 mitogenomes obtained from both our pyrosequencing-derived mitogenomes (n = 30) and those previously deposited in GenBank (Additional file 1). The combined datasets were used to evaluate nodal supports using 12 different partitioning strategies (PSs) and two categories of substitution models. Finally, the phylogenies inferred from entire mitogenomes and single genes were compared to assess topological congruence.

Results

Sequence information

A total of 30 mitogenomic sequences (total length of 458,050 bp) were obtained from two Roche Junior runs (Additional file 1). After quality control treatments and sequence assembly, a total of 121,305 pyrosequencing reads were obtained (Additional file 2). The mean length of read was 403.6 bp with mean quality Q-score 28.2, indicating that the error rate of our sequences is near 0.001. Sequence coverage was not uniform (Figure 1, Additional file 2). Generally, highest coverage regions were located at both ends of the amplicons, whereas low coverage regions (<10X) were often found near the control region or in the tRNAs region between the nad3 and nad5 genes in cases where these regions contained polymeric stretches or tandem repeats (Figure 1, Additional file 3). The low sequence recovery for some species might have been caused by unequal DNA estimation. Details about the lengths sequenced via pyrosequencing and Sanger methods, overall sequence coverage, and gaps, are shown in Figure 1.

Figure 1
figure 1

An overview of 30 newly sequenced mitogenomes. The sequencing regions derived from pyrosequence-assembled contigs, Sanger sequencing, and gaps are shown.

To verify sequence information, most gaps and low-coverage regions were checked via Sanger sequencing. This also allowed us to fill in many gaps. However, some ambiguous sites remained unresolved (all the ambiguous sites are listed in Additional file 3). Six major undetermined regions are listed here: (1) for Athyma perius, 79 ambiguous sites in the control region; (2) for Dichorragia nesimachus, an undetermined 33-bp length within the rrnL gene; (3) for Euthalia irrubesens, a 52-bp region within a TA-repeat noncoding region between the trnE and trnF gene; (4) for Neope pulaha, a 16-bp region in the nad6 gene; (5) for Tanaecia julii, a 25-bp region within a TA-repeat noncoding region between the trnE and trnF genes; and (6) for Sumalia daraxa, a 601-bp fragment containing part of 3’ end of the nad1 gene and trnS2 gene which failed to sequence. The ratio of these sites was below 0.02% of our total sequenced mitogenomes (Additional file 3). This proportion of missing data seems not to seriously affect the evaluation of phylogenetic utility.

The lengths of these new and published mitogenomes are quite conserved between 15-16 kb. Neptis soma has the shortest sequence length (15,130 bp), whereas Papilio maraho has the longest (16,094 bp) (Additional file 1). Although most length variation was found in the control region, some additional lengths were also observed among these 30 mitogenomes. There was a 150-bp additional stretch found between the nad2 and trnW genes (for Bhagadatta austenia), a region including (TA)11-64 repeats that are found between the trnE and trnF genes (for Bhagadatta austenia, Dichorragia nesimachus, Dophla evelina, Euthalia irrubesens, Junonia almana, Parthenos sylvia, Polyura arja and Tanaecia julii), a 41-bp length found between the nad6 and cob genes (for Yoma sabina), a 34-bp fragment found between the trnL1 and rrnL genes (for Polyura arja), and a 59-bp fragment observed in the centre of the rrnL gene (for Bhagadatta austenia), and finally a 37-bp fragment found in the rrnS gene of Ypthima akragas.

The gene order and orientation were the same in all 70 assembled lepidopteran mitogenomes. This alignment suggests that the arrangement of three tRNAs (CR-M-I-Q-nad2) between the control region (CR) and the nad2 gene can be inferred as a derived character relative to the insect ground plan, CR-I-Q-M-nad2[54]. The A + T compositional ratio of the obtained mitogenomes ranged from 77.8-82.7% (Additional file 4), which is within the range of most insect mitogenomes [55].

The 37-gene aligned matrix contained 15,495 bp, of which 11,340 bp was derived from the 13 PCGs, 2,528 bp from the two rRNA genes, and 1,627 bp from the 22 tRNA genes. A total of 6,920 polymorphic sites, 5,714 of them parsimony-informative, were identified (Additional file 5). Removing the four outgroups resulted in 6,710 variable sites, 5,472 of them parsimony-informative. Both the numbers of variable and parsimony-informative sites were lower in the 15-gene dataset than those in the 37-gene dataset (Additional file 5). However, the ratio of informative sites to polymorphic sites was slightly higher than that in the 37-gene dataset. A similar pattern was also found in the 13-gene dataset.

Sequence variable sites were not homogeneously distributed across the gene (Figure 2). Most polymorphic sites were found towards the 5’ or 3’ ends of genes, while relatively conserved regions were found towards the centre of PCGs. Two genes (nad6 and the 3’ end of nad5) had the most variable regions, whereas the 5’ end of cox1 (barcode region) and rrnL were the most conserved (Figure 3).

Figure 2
figure 2

A sliding window of 70 sampled mitogenomes (alignment length: 18,646 bp). The sliding window was derived using DnaSP v5 software. The window width was set to 500 bp and the step size was set to 2 bp (excluding gaps and missing data).

Figure 3
figure 3

Genetic distance variation among different datasets. The red line is the mean and the cox1 B* is the barcode region. Genus level in pink; tribal level in green; subfamily level in blue; family level in yellow.

Phylogenies

Phylogenetic relationships inferred by Bayesian inference (BI) and maximum likelihood (ML) methods were congruent and strongly supported (>0.95 posterior probability and ML bootstrap >85% for most nodes), across each PS and substitution models for each of the three datasets (Figure 4; for more supporting information, see Additional files 6 and 7 for the BI trees, Additional file 8 for the ML trees, and Additional file 9 for 12 different PSs). Concerning family-level relationships, the topology is concordant with a recent study [32], in which the family Hesperiidae (along with Hedylidae) is subsumed within Papilionoidea, and the family Papilionidae is the sister clade of the remaining butterflies. At the subfamily level within Nymphalidae, ten sampled subfamilies were also well supported. Calinaginae and Charaxinae grouped with Satyrinae (hereafter referred as “satyroid” group). Apaturinae and Nymphalinae was a sister group and most closely related to the clade comprising Heliconiinae and Limenitidinae. Danainae was the sister of the remaining nymphalid subfamilies. However, the phylogenetic positions of Pseudergolinae and Libytheinae were not stable among the 12 different PSs (Figure 5). Libytheinae was placed as the sister of all nymphalids except Danainae, or as sister to a clade composed of Pseudergolinae, Apaturinae, Nymphalinae, Heliconiinae, and Limenitidinae (Figure 5). Pseudergolinae showed a sister relationship to the clade that included Apaturinae, Nymphalidae, Heliconiinae, and Limenitidinae, but also showed a relationship to the clade composed of Apaturinae and Nymphalidae only.

Figure 4
figure 4

Mitogenomic phylogeny of sampled butterflies. The BI topology was based on the PS2 partitioning strategy and the best-fit model setting. The values at each node are posterior probability (PP) and ML bootstrap (BS). The label “*” means PP =1 and BS = 100.

Figure 5
figure 5

Four mitogenomic topologies of Nymphalidae subfamilies. The four topologies were summarised from 36 datasets based on the BI and ML methods. The symbols "gPS1 to gPS12" mean BI topologies inferred with the GTR + G model, whereas “oPS1 to oPS12” means BI topologies inferred with the best-fit model. The symbols "mPS1 to mPS12" mean topologies inferred by the ML method with the GTR + G model.

Tribal-level relationships within Nymphalidae were well supported (Figure 4). According to our sampling focus on the subfamily Limenitidinae, four tribes were highly supported by different PSs (Figure 4; Additional files 6, 7 and 8). The tribe Parthenini was sister to a clade in which Neptini was sister to the group composed of Adoliadini + Limenitidini. Although the species Bhagadatta austenia has been classified within the tribe Limenitidini [56], our data showed this species always grouped with Neptini with high support. Within Limenitidini, a previous study indicated that the genus Athyma was paraphyletic [37]. Our analyses provided strong supports at nodes, indicating that Athyma is an artificial group.

Although our taxon sampling is limited, some well-supported relationships were also found in other subfamilies. Within the subfamily Heliconiinae, the tribe Argynnini was sister to the clade composed of Acraeini + Heliconiini [57]. Within the subfamily Nymphalinae, the three tribes Kallimini, Junoniini and Melitaeini formed a clade with strong support. Melitaeini, embedded within the subfamily Nymphalinae, in concordance with the result of Wahlberg et al. [31]; however, the ML bootstrap value is too low to support Kallimini as sister to the clade comprising Junoniini and Melitaeini.

The topologies inferred by parsimony (MP) were different from the BI and ML trees at family and subfamily levels (Figure 6 and Additional file 10). All three datasets showed the same higher-level (family, subfamily, and tribal level) relationships, however, there was some incongruence at species level. A total of 19 most parsimonious trees were found for the 37-gene dataset, seven for the 15-gene dataset, and two for the 13-gene dataset. Two major inconsistences were found when compared with the likelihood topologies at family level (1) Hesperiidae was placed as sister to four sampled butterfly families, and (2) a sister group relationship was found between Lycaenidae and Pieridae. The latter relationship was also found in the most recent analysis using MP [32], but these morphologically implausible family-level relationships had lower bootstrap values in MP analyses than in the ML and BI analyses. At subfamily level within Nymphalidae, Danainae was also found to be sister to the other nymphalids. Other subfamilies had similar relationships to those inferred by ML and BI, except for an unexpected result that Libytheinae and Pseudergolinae grouped together and were placed sister to other nymphalids.

Figure 6
figure 6

Most parsimonious topology. This topology was based on the 13-gene dataset using 50% majority rule. Two equally most parsimonious trees were found.

A total 15 individual single-gene phylogenies were inferred and none recovered the topologies generated using the whole mitogenomic alignment (Additional file 11).

Bayes factors

Bayes factors for the 12 PSs are shown in Table 1. Our results show that partitioned datasets are strongly preferred over non-partitioned ones (PS1, PS5, and PS9), and more partitioning is preferred over less. One exception is that codon partitioning (PS2, PS6, and PS10) performed better than gene partitioning (PS3, PS7, and PS11). Comparing the marginal likelihood values by two different model settings (optimal substitution models and GTR + G model), most of the Bayes factor results showed that the datasets based on the best-fit models had better performance than those based on the GTR + G model. Only the most highly partitioned strategy (PS4, PS8, and PS12, separately) was preferred over the same strategies using the best-fit substitution model.

Table 1 Bayes factor comparisons between models

The effect of data size and partitioning strategies on tree uncertainty

The summary of credible sets of Bayesian trees, treated by different PSs and substitution models, is presented in Table 2. The results show that the number of trees increases with increasing data partitioning among the 12 PSs. The number of credible trees produced from 15 single-gene datasets is two orders of magnitude greater than that from 12 PS datasets, indicating that 12 PS datasets had sufficient information to decrease tree uncertainty.

Table 2 The effect of used datasets and models on tree uncertainty

Discussion

Mitogenomes in butterfly systematics

Our results show that mitochondrial gene length and order are conserved among 70 sampled lepidopteran mitogenomes, and that the 37-gene aligned matrix includes over 36% parsimony-informative sites. Based on this genetic variation, our BI and ML analyses all show strong support for relationships at different hierarchical levels (Figure 4, and Additional files 6, 7 and 8). This strong congruence is not only supported with the most recent phylogenetic studies of butterflies at family and subfamily-level [32], but is also concordant with other mitogenomic phylogenies [23, 24]. Though taxon sampling is limited across butterfly families in our mitogenomic analyses, this study demonstrates that the entire mitogenome constitutes a particularly efficient marker for studying the phylogeny of butterflies as well as other groups of insects [11, 1618, 58].

Encouragingly, our mitogenomic phylogenies provide an insight into deep-level relationships of Nymphalidae. At subfamily level, Danainae is placed as sister to the remaining nymphalids (Figure 4). However, this outcome is both inconsistent with a morphologically-based study [38] and with the most comprehensive taxa-sampling study of Nymphalidae [31], in which Libytheinae emerges as sister to the other nymphalids. It should be noticed that the new position of Libytheinae (Figure 5) substantially alters the interpretation of the deep evolutionary history of Nymphalidae. Libytheinae has distinct morphologies compared to other nymphalids (a long labial palpus and a fully developed female foreleg that was thought to be reduced in all other nymphalids) [59], and it has well-preserved late Eocene fossils, important for calibrating the dating scheme of butterflies [60]. As well as being supported in previous mitogenomic studies [23, 24], this mitogenome-suggested position for both Danainae and Libytheinae may reflect important rather than conflicting signal in the mitogenomic data and merits further attention including recalibration as necessary.

For the “satyroid” assemblage of subfamilies, mitogenomic phylogenies provide strong support within these three subfamily-level topologies across 12 PSs (Figure 4, Additional files 6, 7 and 8). These results indicate that the mitogenome might be a potential marker to re-investigate a long-standing phylogenetic question within the subfamily Satyrinae [34, 38, 61], for which deep-level relationships, across the clades of Elymniini, Zetherini, Satyrini, Dirini, Melanitini, Haeterini, and Morphini, remain controversial. Though our sampling is limited for the large subfamily Satyrinae, the tribe Satyrini was monophyletic with respect to Melanitini with strong support, while Ypthimina (Ypthima akragas) was sister to the clade composed of Satyrina (Hipparchia autonoe) and Lethina (Neope pulaha).

As for relationships within the subfamily Limenitidinae, the four tribes were all monophyletic and well-supported (Figure 4). The tribal relationships are clearly resolved by the BI method, though ML bootstrap shows lower support for the clade comprising Adoliadini + Limenitidini, indicating that taxon sampling might still not be sufficient. At least, we can confirm that Bhagadatta is a Neptini and within the Limenitidini, the genus Athyma is polyphyletic and in need of taxonomic revision. We also note that further work is needed according to these phylogenetic results to illuminate mimetic and evolutionary processes of Asian Limenitidini butterflies in detail.

Some conflicting relationships are found when comparing our results to phylogenetic studies of Papilionidae [62, 63]. The subfamilies of Parnassiinae and Papilioninae are sister groups, and four Papilioninae tribes are clearly recognised [62]. However, our results show Parnassius bremeri did not group with other Parnassiinae members; instead, this species grouped with the tribe Papilionini (Figure 4). We checked its identification via NCBI and BOLD databases (using the cox1 gene), and this taxon clusters within Parnassius phoebe, showing consistency with Parnassius. This lack of nesting within Parnassiinae merits further examination, but we note that the tribal relationships of Parnassiinae are not strongly supported so far [63]. Moreover, our topology suggests that long-branch attraction has occurred in our mitogenomic phylogenies. In Papilionidae, the tribe Teinopalpini is regarded as the sister group of Papilionini + Troidini [62], but our result unexpectedly showed a sister relationship of Teinopalpini + Troidini as in other mitogenomic studies [23, 24]. These conflicting relationships might be caused by inadequate taxon sampling in our study, but we strongly suspect that the public papilionid mitogenomes in particular require detailed validation and/or resequencing. In Nymphalidae, the subfamily Pseudergolinae had low nodal support in our inferred topologies, yet its position was highly supported by Wahlberg et al. [31] (as also in our topology 4 of Figure 5), indicating that greater taxon sampling is important to resolve such equivocal relationships [64]. Further studies should add the subfamilies Biblidinae and Cyrestinae to clarify this point.

Utility of mitogenomic data in Lepidoptera

Higher level phylogenetic relationships of Lepidoptera have been studied and revised based largely on increasing numbers of nuclear genes, recently using the RNAseq approach [65, 66]. However, some relationships are still conflicting due to limited taxon sampling (for Lepidoptera, particularly limited at superfamily level) or missing sequences. Similar conflicts are found when the studies focus on a subset of taxa [67, 68]. We agree that nuclear genes are useful for inferring deep-level lepidopteran relationships, but it is a technological and financial challenge to obtain a huge dataset composed of several nuclear genes from many species and a serious computing challenge to analyse them. Instead, our study has demonstrated that it is much easier to obtain a large number of mitogenomes via NGS technologies, and the need for filling with Sanger sequencing will further reduce as technologies and coverage improves. To infer molecular phylogenies in Lepidoptera, we suggest using mitogenomic sequences to infer phylogeny as a first step, following by more nuclear genes as a second step to obtain highly stable relationships based on unlinked gene histories.

Mitochondrial genes have a mutation rate an order of magnitude faster than nuclear DNA [69], and these sequences are thus susceptible to phylogenetic noise and long-branch attraction in insects [29, 70, 71]. Even inference of deep-level relationships based on one or two mitochondrial genes shows much inferior results than that based on a handful of nuclear genes. This is why mitochondrial sequences have not historically been favoured in phylogenetic reconstruction. This viewpoint is misconceived. Obviously, phylogenetic results inferred from one or two genes cannot represent the phylogenetic utility of informative SNPs across whole mitogenomes. Comparing topologies inferred from the 13, 15, and 37-gene datasets, phylogenies inferred from individual genes do not even recover the same subfamily and tribal-level relationships (Additional file 11). It is apparent that single mitochondrial genes simply do not have enough information to infer relationships at this level, even though they include slow-evolving sites whose character changes support some deeper nodes. For investigating deep-level relationships, Cameron [9] points out that gene annotation of insect mitogenomes also provides useful information to reveal the evolution of insect mitogenomes. In Lepidoptera, the gene order of trnM-trnI-trnQ is considered a derived character, and only one study has addressed the ancestral gene arrangement found in Hepialidae [72], while one study has found that the atp6 and atp8 genes are absent between the cox2 and cox3 genes due to gene rearrangement in the species, Rohana parisatis (Nymphalidae, Apaturinae) [73]. Our 70 sampled mitogenomes all reveal the derived gene order, indicating that this genome rearrangement is conserved in most Lepidoptera. Any new gene annotations observed in Lepidoptera will be diagnostic characters for phylogenetic inference.

The effects of model selection and data partitioning in model-based analyses

The current approaches to reconstructing phylogenies emphasise the selection of proper substitution models and partitioning methods [49, 51, 74]. Although gene or codon partitioning is the appropriate method for analysing protein coding genes [74], the best-fit partitioning method is still under debate. It is evident from our sliding-window result (Figure 2) that SNPs were not uniformly distributed by gene region. Nevertheless, our results suggest that different partitioning strategies should always be adopted to evaluate the effect of partitioning on phylogenetic topology. In our case, the strategy with the most intense partitioning (PS4, PS8, and PS12) is favoured by the Bayes factors, even though all the topologies were independent of datasets with different lengths, partitioning schemes and substitution models (Figure 5). We also note that more free parameters (more complex models) would yield greater topological uncertainty (Table 2), a pattern consistent with other studies [49, 50].

Tree length is the product of substitution rate and time [51] but can be significantly affected by the assumptions of model selection [75]. In our case, the Bayes factors showed that phylogenies based on the best-fit model (oPS1-12) have better performance (Table 1), and these results confirm that although the mitochondrion evolves as a single gene because of maternal inheritance, dataset partitioning and model selecting are both important treatments to evaluate the phylogeny.

NGS and Sanger sequencing methods for mitogenomes

For the purpose of obtaining large numbers of long sequences, the NGS method has clear economic advantages over Sanger sequencing [76]. The greatest benefit of NGS is to obtain massive quantities of sequence without primer design. Although the NGS method has error rates of 0.01-1% [41], high sequence coverage largely compensates for base errors. NGS methods are also very effective when target mtDNAs are rare and highly degenerated [77, 78]. As shown by recent advanced studies [17, 19] and our work, the 454-pyrosequencing method using less than a quarter of a 454 Titanium run (around 0.2 million reads) can recover over 85% target regions for a goal of obtaining near 30 mitogenomic sequences (Additional file 2). In our analyses, regions which cannot be pyrosequenced are often concentrated near the control region or in noncoding regions between the cox3 and nad5 genes (Figure 1). These gaps can easily be compensated by Sanger sequencing. Moreover, we emphasise that using the unique-tag method [79] multiplexing reads can more easily be divided to avoid chimaeric mitogenomes and to accelerate bioinformatic processing. Similarly, Timmermans et al. [17] report an efficient option of combining high sequence coverage by NGS with Sanger sequences which are used to bait the pyrosequencing-contigs.

In contrast to the NGS method, Sanger sequencing is more economic when the main purpose is to sequence one or a small set of mitogenomes due to their limited length (~16 kb). For example, our designed primers were largely based on conserved regions in 66 butterfly mitogenomes (Additional file 12). These primers could be used to sequence mitogenomes from other species of nymphalids. The data processing for Sanger-derived sequences requires less training in bioinformatics. Overall, each sequencing technology has its own advantages and is mutually complementary for maximising the efficiency of high quality sequence retrieval. We recommend first the application of NGS to recover a large proportion of the mitogenome, followed by Sanger sequencing as a complementary approach to obtain sequences for regions with low coverage, gaps or high ambiguity, or tag baits.

Conclusions

This study aimed to obtain a large number of Lepidoptera mitogenomes simultaneously via NGS methods. The matrix of 30 newly obtained mitogenomes together with 40 others deposited in GenBank yields a well-supported phylogeny of the superfamily Papilionoidea and its subsets, suggesting that the entire mitogenome provides an excellent marker for studying the phylogeny of butterflies and other insect relationships. Our newly designed primers based on 66 nymphalids also provide a basis for sequencing additional butterfly mitogenomes by the Sanger method, while we anticipate that pyrosequencing will become prevalent for mitogenomic studies. We emphasise the need for good quality control or resequencing of existing public sequences, to avoid topological artefacts.

Methods

Sampling

A total of 36 butterfly mitogenomes deposited in GenBank (up to March 31, 2013) represent five of the seven recognised butterfly families and eight of the twelve subfamilies of Nymphalidae [23, 25, 30, 55, 80103] (also see the references in Additional file 1). Among these subfamilies, five are represented by a single species. To investigate more comprehensively subfamily and tribal relationships within Nymphalidae, 30 additional nymphalids were selected. The combined datasets, representing ten of the twelve nymphalid subfamilies [31], comprise 20 species of Limenitidinae, six Heliconiinae, four Nymphalinae, four Satyrinae, three Apaturinae, three Danainae, one Libytheinae, one Pseudergolinae, one Calinaginae, and one Charaxinae (Additional file 1). For inferring detailed tribal relationships within Nymphalidae, we focused on the subfamily Limenitidinae, in which four tribes are recognised [31]. The tribe Parthenini is considered to be sister of the remaining limenitidine groups, while the relationships of the other three tribes (Neptini, Limenitini, and Adoliadini) remain controversial [31, 39]. Moreover, to evaluate the utility of mitogenomes at within-tribal level, we focused on the genus Athyma, of which the members show paraphyletic relationships with low nodal support [46]. Finally, we also sampled two specimens of Athyma selenophora from Taiwan and Hainan for examining subspecies-level identification. In addition, four moth mitogenomes were selected as outgroups on the basis of previously reported relationships [66].

Molecular techniques

Genomic DNA was extracted from the thoracic muscle tissue or legs using the Purgene DNA Isolation kit (Gentra Systems, Minnesota, USA), following the manufacturer’s protocol. Precipitated DNAs were resuspended in 70 μL of sterile dH2O.

For amplifying long fragments of whole mitogenomes, TaKaRa LA Taq™ (Takara Bio Inc., Shiga, Japan) was used. Generally, two fragments were amplified, one of ~6.5 kb in length and covering the AT rich region, and the other of ~9.5 kb in length; the two fragments overlap at the cox2 and rrnL gene regions. Primers were mostly adopted from Wu et al. [30]. PCR products were checked using 1-2% agarose gel with TAE buffer. Each product was purified using the UltraClean™ DNA Purification kit (MoBio Inc., Solana, CA, USA). The concentration of purified genomic DNA was measured using the Nanodrop ND-1000 (NanoDrop Technologies, Wilmington, DE, USA). The two long PCR products for each sample were mixed in equimolar concentrations and sheared into fragments of size 200-1000 bp. Before pyrosequencing, each sample was ligated with Roche adaptors and a unique species identifying tag. A total of 30 tagged-samples were mixed together and sequenced using the Roche 454 GS Junior System (Roche/454 Life Sciences, Branford, CT) at the Genomics BioSci & Tech Co. (Taipei, Taiwan).

A total of two dataset runs (one full and two half runs) were obtained in our study. The reads extracted from the raw data (sff-file) were binned in accordance with unique tag sequences using the Mothur 1.27 software [104]. The adaptors and unique tag regions were trimmed by Mothur, following the 454 SOP (http://www.mothur.org/wiki/454_SOP) to get fasta-format files of each sample. Sequences with a length of less than 50 bp and quality of less than Q20 were excluded. The dataset for each sample was assembled into contigs via the software MIRA 3 [105], using the default setting for the 454 platform. Contigs for each sample were checked and edited using Sequencher 4.8 (GeneCode, Boston, USA). The definition of possible gene regions, gene annotation, and PCG codon position were determined against published references of Euploea mulciber, Argynnis hyperbius, Athyma sulpitia, Kallima inachus, and Hipparchia autonoe (Additional file 1) by using Sequencher and alignment by eye which was straightforward at codon level. These results were also double-checked via the MITOS pipeline [106]. Gaps and low coverage regions (<10X) were re-sequenced and checked using Sanger sequencing on an ABI 3730 (Applied Biosystems, Foster City, CA, USA). Newly designed primers for Sanger sequencing are listed in Additional file 12. Although most sequences could be obtained and verified by pyrosequencing and Sanger sequencing, a few regions with polymers or multiple tandem repeats failed to sequence correctly. These missing or ambiguous sites were further designated as IUPAC codes (Additional file 3).

For phylogenetic analyses, the 37 mitochondrial genes were separately aligned by MUSCLE in MEGA5 [107]. PCGs were aligned according to amino sequence similarity, whereas RNA genes were directly aligned according to sequence similarity using default settings. All genes were concatenated using Microsoft Excel, and the datasets were exported as fasta-format files. General sequence information was analysed using both DnaSP v5 [108] and MEGA 5 software. All sequenced mitogenomes were submitted to GenBank (accession numbers KF590524-KF590553, also listed in Additional file 1), whereas pyrosequencing rawdata and individual barcode information were both submitted to the Sequence Read Archive under the study number of SRP041730.

Data partitioning and model selection

In order to evaluate the effect of data partitioning and incorporation of RNAs on phylogeny, three datasets referred as the 37-gene dataset (13 PCGs plus two rRNAs and 22 tRNAs), the 15-gene dataset (13 PCGs plus two rRNAs), and the 13-gene dataset (13 PCGs only) were constructed. The three datasets were further partitioned by 12 strategies considering gene region and codon position (Additional file 9). For the 37-gene dataset, the partitioning strategies (PSs) were set as (1) no partition (the combined 37 genes), (2) four partitions with three for the codon position of the 13 PCGs and one for two rRNAs and 22 tRNAs combined, (3) 37 partitions, for each gene, and (4) 63 partitions, partitioning by both gene and codon position of the PCGs. For the 15-gene dataset, the PSs were set as (5) no partitions, combining the 13 PCGs and the two rRNAs, (6) four partitions, including three for the codon position of the PCGs and one for two rRNAs, (7) 15 gene partitions, and (8) 41 partitions, partitioning by both gene and codon position of the PCGs. The PSs of the 13-gene dataset was set as (9) no partition (the combined 13 PCGs), (10) three codon-based partitions, (11) 13 PCGs each as a single gene partition, and (12) 39 partitions based each gene and each codon position. The optimal substitution model of each partition (oPS) was determined by jModelTest 2 [109], using the corrected Akaike information criterion (AICc) (Additional file 13).

Phylogeny

Three phylogenetic methods, Maximum Parsimony (MP), Bayesian inference (BI) and Maximum Likelihood (ML) were used to compare topologies on our datasets. MP was carried out using the TNT 1.1 [110], BI was conducted using the MrBayes 3.2.1 [111], and ML was performed in the RAxML Pthreads-based SSE3 version 7.4.2 [112, 113], with 16 precursors on a Linux system. For the BI analysis, two model settings (GTR + G model and best-fit models) were carried out on the 12 PSs (Additional file 13). First, partitioning datasets were constructed using the best-fit models. Some models (TVM, TIM1, TIM2, TIM3, TPM3, and TrN) that could not be directly used in MrBayes were replaced by the nearest over-parameterised model following a previous study [114]. Second, the model GTR + G was taken following the previous study [31], because the gamma shape (parameter G) is highly correlated with invariable sites (parameter I) [115]. We set all the partitions to the GTR + G model to compare the topologies with the best-fit model setting. The analyses of each dataset were performed with eight chains (seven heated and one cold) and run for five million generations. Every 100 generations were sampled as a consensus tree. The log-likelihood scores were plotted against generation time to determine whether stationarity was reached. Stationarity of Bayesian phylogenies was further assessed using the “sump” option to get the effective sample size (ESS) of parameters. If the ESS value was below 100, the number of generations was increased to 10-20 million. If stationarity was achieved, the first 25% of sampled trees were discarded and the remaining trees were used to represent the posterior probability (Table 2).

ML and MP methods were compared with the results of the BI method. For ML analyses, datasets obtained from each PS were processed with the model of GTRGAMMA, and all model parameters were estimated and joined to the branch length optimisation for the best ML tree. Node stability was evaluated using 1000 bootstrap replicates with 10 additional ML searches of each replicate to improve bootstrapping performance [112]. For MP analyses, three different length datasets (37-gene, 15-gene, and 13-gene) were run. Initially, the MP topologies were searched with 1000 random addition replicates and a MaxTrees of 10000. TNT searches were executed using Tree Ratchet, Tree Drifting, and Tree Fusing methods [116]. Nodal support was evaluated using 1000 bootstrap replicates. All three datasets were analysed using the same MP settings.

The performance of one mitochondrial gene in phylogenetic reconstruction

To investigate the performance of each mitochondrial gene in reconstructing the phylogeny, the 15 major genes (13 PCGs and two rRNAs, respectively) were analysed. Each gene was analysed by BI: eight chains, the best-fit model (Additional file 13) and run for five million generations, sampled every 100. When stationarity was reached, the first 25% sampled trees were discarded and the remaining trees were used to calculate the posterior probability.

Bayes factors

To investigate which partitioning model is preferred, we calculated Bayes factors to compare pairs of likelihood models. The value, B10, is calculated as the ratio of the model likelihoods f (X | M1)/f (X | M0), where the symbol X represents a generating data set, M1 and M0 are two compared models, and f (X | M) stands for model likelihood. We calculated the Bayes factors via the marginal likelihood, which was estimated as the harmonic mean (HM) of the likelihood scores using MrBayes “sump” option. The comparison for two PSs was calculated as 2ln (B10) = 2[ln (HM1) – ln(HM0)], where HM1 and HM0 are two harmonic means of each posterior probabilities. Positive values of 2ln (B10) indicate a preference for the later strategy over the former. Interpretations of significance were applied following Kass and Raftery [117]. For Bayes factor between 0 and 2, no model is preferred; for values between 2 and 6, model 1 is favoured over model 0; for values between 6 and 10, model 1 is strongly favoured; for the value over 10, model 1 is very strongly preferred. We took values over 10 as significant [49].