Background

Molecular tools have provided a plethora of new opportunities to study questions in evolutionary biology (e.g. speciation processes) and in phylogenetic systematics. Only recently, however, have claims been made that the sequencing of a small (648 bp) fragment at the 5' end of the gene cytochrome c oxidase subunit 1 (COI or cox1) from the mitochondrial genome would be sufficient in most Metazoa to identify them to the species level [1, 2]. This approach, called "DNA barcoding", has gained momentum, and the "Consortium for the Bar Code of Life (CBOL)", founded in September 2004, intends to create a global biodiversity barcode database in order to facilitate automated species identifications. Right from the start, however, this approach received opposition, especially from the taxonomists' community [3–8]. Some arguments in this debate are political in nature, others have a scientific basis. Concerning the latter, one of the most essential arguments focuses on the so-called "barcoding gap". Advocates of barcoding claim that interspecific genetic variation exceeds intraspecific variation to such an extent that a clear gap exists which enables the assignment of unidentified individuals to their species with a negligible error rate [1, 9, 10]. The errors are attributed to a small number of incipient species pairs with incomplete lineage sorting (e.g. [11]). As a consequence, a sequence divergence between two samples above a given threshold (proposed to be at least 10 times greater than within species [10]) would indicate specific distinctness, whereas divergence below such a threshold would indicate taxonomic identity among the samples. Furthermore, the existence of a barcoding gap would even enable the identification of previously undescribed species ([11–13], but see [14]). Possible errors of this approach include false positives and false negatives. False positives occur if populations within one species are genetically quite distinct, e.g. in distant populations with limited gene flow or in allopatric populations with interrupted gene flow. In the latter case it must be noted that, depending on the amount of morphological differentiation and the species concept to be applied, such populations may also qualify as 'cryptic species' in the view of some scientists. False negatives, in contrast, occur when little or no sequence variation in the barcoding fragment is found between different biospecies (= reproductively isolated population groups sensu Mayr [15]). Hence, false negatives are more critical for the barcoding approach, because the existence of such cases would reveal examples where the barcoding approach is less powerful than the use of other and more holistic approaches to delimit species boundaries.

Initial studies on birds [10] and arthropods [9, 16] appeared to corroborate the existence of a distinct barcoding gap, but two recent studies on gastropods [17] and flies [18] challenge its existence. The reasons for these discrepancies are not entirely clear. Although levels of COI sequence divergence differ between higher taxa (e.g. an exceptionally low mean COI sequence divergence of only 1.0% was found in congeneric species pairs of Cnidaria compared to 9.6–15.7% in other animal phyla [2]), Mollusca (with 11.1% mean sequence divergence between species) and Diptera (9.3%) are not peculiar in this respect. Meyer & Paulay [17] assume that insufficient sampling at both the interspecific and the intraspecific level creates the artifact of a barcode gap. Proponents of barcoding might argue, however, that the main reason for this overlap is the poor taxonomy of these groups, e.g. cryptic species may have been overlooked which are differentiated genetically but very similar or even identical in morphology.

If the barcode gap does not exist, then the threshold approach in barcoding becomes inapplicable. Although more sophisticated techniques (e.g. using coalescence theory and statistical population genetic methods [19–21]) can sometimes help to delimit species with overlapping genetic divergences, these approaches require additional assumptions (e.g. about the choice of population genetic models or clustering algorithms) and are only feasible in well-sampled clades.

Barcoding nonetheless holds promise, especially in the identification of arthropods, the most species-rich animal phylum in terrestrial ecosystems. Identification of arthropods is often extremely time-consuming and generally requires taxonomic specialists for any given group. Moreover, the fraction of undescribed species is particularly high, as opposed to vertebrates. Hence, there is substantial demand for improved (and rapid) identification tools by scientists who seek identification of large arthropod samples from complex faunas. Therefore arthropods deserve to be considered the yardstick for the usefulness of barcoding approaches among Metazoa, and it is not surprising that several recent studies have tried to apply DNA barcoding in arthropods [9, 11–13, 16, 18, 19, 22–27]. Diversity is concentrated in tropical ecosystems, but measuring intra- and interspecific sequence divergence in tropical insects is hampered by the fragmentary knowledge of most taxa. In contrast, insects of temperate zones, and most notably the butterflies of the Holarctic region, are well known taxonomically compared to other insects. The species-rich Palaearctic genus (or subgenus) Agrodiaetus provides an excellent example to test the existence of the barcode gap in arthropods. This genus is exceptional because of its extraordinary interspecific variation in chromosome numbers, which have been investigated for most of its ca 120 species ([28–30] and references therein). As a result, several cryptic species which differ hardly or not at all in phenotype have been discovered (e.g. [31–39]). Available evidence suggests that, apart from a few exceptions (e.g. due to supernumerary chromosomes), differences in chromosome numbers between butterfly species are linked to infertility in interspecific hybrids [40]. This is due to problems in the pairing of homologous chromosomes during meiosis. Since major differences in chromosome numbers are indicative of clear species boundaries, they are also helpful for inferring species-level differentiation in allopatric populations. Agrodiaetus butterflies are therefore an ideal case for testing the validity of the barcoding approach. If valid, then it must be possible to safely recognize all species that can be distinguished by phenotype, karyotype or both character sets with reference to sequence divergences alone. Conversely, failure of DNA barcodes to differentiate between species that are distinguished by clear independent evidence would undermine the superiority of the barcoding approach, which has especially been attributed to taxa with "difficult" classical taxonomy, such as Agrodiaetus.

Results

Intraspecific divergence

The average divergence in 1189 intraspecific comparisons is 1.02% (SE = 1.13%). 95% of intraspecific comparisons have divergences of 0–3.2%. The few values higher than 3.2% are conspicuous and probably due to misidentifications (Lampides boeticus, Neozephyrus japonicus, Arhopala atosia, Agrodiaetus kendevani, see below), unrecognized cryptic species (Agrodiaetus altivagans [41], Agrodiaetus demavendi [30]), hybridization events (Meleageria marcida [30, 42]) or any of those (Agrodiaetus mithridates, Agrodiaetus merhaba).

The evidence for the possible misidentifications is the following:

• Lampides boeticus is the most widespread species of Lycaenidae and a well-known migrant which occurs throughout the Old World tropics and subtropics from Africa and Eurasia to Australia and Hawaii. Apart from a single unpublished sequence (AB192475), all other COI GenBank sequences of this species (from Morocco, Spain and Turkey) are identical with each other or only differ in a single nucleotide (= 0.15% divergence). They are also nearly identical to two specimens of Lampides boeticus in the CBOL database (BOLD) [43] from Tanzania and another sequence of this species from Papua New Guinea (Wiemers, unpubl. data). The GenBank sequence AB192475 (of unknown origin, but possibly from Japan), however, differs strongly (8.2–8.7%) from all other Lampides boeticus sequences and therefore we assume this to represent a distinct species. Its identity however remains a mystery because it is not particularly close to any other GenBank sequence and a request for a check of the voucher specimen has remained unanswered for more than a year.

• The questionable unpublished sequence of Neozephyrus japonicus (AB192476) is identical to a sequence of Favonius orientalis and therefore probably represents this latter species which is very similar in phenotype but well differentiated genetically (4.8% divergence).

• A similar situation applies to the questionable unpublished sequence of Arhopala atosia (AY236002), which is very similar (0.4%) to a sequence of Arhopala epimuta.

• Agrodiaetus kendevani is a local endemic of the Elburs Mts. in Iran. The two sequences of this species in the NCBI database which exhibit a divergence of 5.4% have been published in two different papers by the same work group [29, 44]. While one of them is identical to a sequence of Agrodiaetus pseudoxerxes, the other one is nearly identical to Agrodiaetus elbursicus (0.2% divergence). These latter two species however belong to separate species groups [30] and thus conspecificity of the two sequences of A. kendevani is very improbable as there is no evidence of hybridization between members of different species groups in Agrodiaetus [30].

Higher intraspecific divergence values are also found between North African and Eurasian populations of Polyommatus amandus (3.8%) and Polyommatus icarus (5.7–6.8%). In the former species the North African population is also well differentiated in phenotype (ssp. abdelaziz), while in the latter species phenotypic differences have never been noted. Cases with substantial, but lower genetic divergence between North African and European populations which do not correspond to differentiation in phenotype also occur in the butterflies Iphiclides (podalirius) feisthamelii (2.1%; [30]) and Pararge aegeria (1.9%; [45]). In all cases these allopatric populations may actually represent distinct species, although we do not currently have additional evidence in support of this hypothesis.

Although some of the other higher divergence values >2% are possibly due to cryptic species (e.g. in Agrodiaetus demavendi) or hybridization between closely related species (e.g. in the species pair Lysandra corydonius and L. ossmar, as evidenced by the comparative analysis of the nuclear rDNA internal transcribed spacer region ITS-2 [30]), most of those values represent cases in which there is hardly any doubt regarding the conspecificity of samples. The highest such value is 2.9% between distant populations of the widespread Agrodiaetus damon (from Spain and Russia). Outside the genus Agrodiaetus high values are also found between North African and Iranian populations of Lycaena alciphron (2.7%), Spanish and Anatolian populations of Polyommatus dorylas (2.3%) and even between Polish and Slovakian populations of Maculinea nausithous (2.3%). Table 1 lists mean intraspecific divergences in those species that are represented by more than one individual in the data set.

Table 1 Intraspecific nucleotide divergences

Interspecific divergence

The average divergence in 236348 interspecific comparisons is 9.38% (SE = 3.65%) ranging from 0.0% to 23.2% (between Baliochila minima and Agrodiaetus poseidon). Of these, 57562 are congeneric comparisons with an average divergence of 5.07% (SE = 1.73%) ranging from 0.0% (between 23 Agrodiaetus as well as 3 Maculinea species pairs) to 12.4% (between Arhopala abseus and Arhopala ace). 94% of those comparisons are within Agrodiaetus. Only congeneric comparisons were included in subsequent analyses in order to make comparisons feasible across taxonomic levels. Table 2 lists mean interspecific divergences in genera of which at least two species are represented in the data set. Sequence divergence in 95% of interspecific (congeneric) comparisons is above 1.9%, and 87.6% of such comparisons reveal distances above 3%.

Table 2 Interspecific nucleotide divergences

The barcode gap

As apparent in Figure 1 (and Figure 2 for comparisons within Agrodiaetus only) no gap exists between intraspecific and interspecific divergences. Since some (0.14%) interspecific divergences are as low as 0% no safe threshold can be set to strictly avoid false negatives. Although species pairs with such low divergences include some whose taxonomic status as distinct species is debatable, they also include many pairs which are well differentiated in phenotype, have a very different karyotype (in Agrodiaetus), and occur sympatrically without any evidence for interbreeding. Examples include Agrodiaetus peilei – A. morgani (0.0%), Agrodiaetus fabressei – A. ainsae (0.2%), Agrodiaetus peilei – A. karindus (0.2%), Polyommatus myrrhinus – P. cornelia (0.4%), or Agrodiaetus poseidon – A. hopfferi (0.6%).

Figure 1

Frequency distribution of intraspecific and interspecific (congeneric) genetic divergence in Lycaenidae. Total number of comparisons: 1189 intraspecific and 57562 interspecific pairs across 315 Lycaenidae species. Divergences were calculated using Kimura's two parameter (K2P) model.

Figure 2

Frequency distribution of intraspecific and interspecific (congeneric) genetic divergences in Agrodiaetus. Total number of comparisons: 737 intraspecific and 54209 interspecific pairs across 114 Agrodiaetus species. Divergences were calculated using Kimura's two parameter (K2P) model.

The minimum cumulative error based on false positives plus false negatives is 18% at a threshold level of 2.8% (Figure 3). Minimum errors are very similar for Agrodiaetus (18.6% at 3.0% threshold, not shown) and other Lycaenidae (18.6% at 2.0% threshold, not shown), but much lower in Arhopala (5.3% at 3.4% threshold, Figure 4).

Figure 3

Cumulative error based on false positives plus false negatives for each threshold value in 315 Lycaenidae species including only congeneric comparisons. The optimum threshold value is 2.8%, where error is minimized at 18.0%.

Figure 4

Cumulative error based on false positives plus false negatives for each threshold value in 30 Arhopala species. The optimum threshold value is 3.4%, where error is minimized at 5.3%.
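The threshold scan behind such cumulative error curves can be illustrated with a minimal Python sketch. It rests on an assumption that the text does not spell out: the false positive rate (intraspecific comparisons above the threshold) and the false negative rate (interspecific comparisons at or below the threshold) are each expressed as a fraction of their own class of comparisons and then added. The function names and the toy divergences are hypothetical and are not the data of this study.

```python
from typing import Iterable, Sequence, Tuple


def cumulative_error(intra: Iterable[float], inter: Iterable[float],
                     threshold: float) -> float:
    """Return false-positive rate plus false-negative rate at one threshold.

    Assumption (not stated in the text): each rate is a fraction of its
    own class of comparisons, and the two fractions are simply added.
    """
    intra = list(intra)
    inter = list(inter)
    # False positive: conspecific pair whose divergence exceeds the threshold.
    fp_rate = sum(d > threshold for d in intra) / len(intra)
    # False negative: heterospecific pair at or below the threshold.
    fn_rate = sum(d <= threshold for d in inter) / len(inter)
    return fp_rate + fn_rate


def best_threshold(intra: Sequence[float], inter: Sequence[float],
                   candidates: Sequence[float]) -> Tuple[float, float]:
    """Scan candidate thresholds and return (threshold, minimum error).

    Ties are resolved in favour of the lowest threshold.
    """
    errors = [(cumulative_error(intra, inter, t), t) for t in candidates]
    err, thr = min(errors)
    return thr, err


# Toy divergences in percent (hypothetical values, not the study's data).
intra_toy = [0.0, 0.2, 0.5, 1.0, 3.5]
inter_toy = [0.0, 2.0, 4.0, 5.0, 6.0, 9.0]
grid = [x / 10 for x in range(0, 101)]  # 0.0 %, 0.1 %, ..., 10.0 %
print(best_threshold(intra_toy, inter_toy, grid))
```

Because one interspecific divergence in the toy data is 0%, no threshold brings the error to zero, which mirrors the argument made above for the real data.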

For safe identification, it is the minimum distances between species (Figure 5) that are critical, not the average distances. In Agrodiaetus, all but two species (= 98.3%) have close relatives with interspecific distances below 3%. In the other genera combined, "only" 74% of taxa are affected, but this lower rate is probably due to undersampling and would rise if more sequences of more closely related species became available for the analysis.

Figure 5

Frequency distribution of minimum interspecific (congeneric) genetic distances across 263 Lycaenidae species.
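How a minimum interspecific distance per species might be extracted from a table of pairwise comparisons is sketched below. The record layout (species A, species B, divergence in percent) and the function names are assumptions made for illustration. The four divergences are those quoted in the text for species pairs with very low divergence, so the toy table is far from a complete data set.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per congeneric interspecific comparison:
# (species_a, species_b, divergence in percent). Hypothetical layout.
Pair = Tuple[str, str, float]


def min_interspecific(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Return, for each species, the distance to its closest congener."""
    nearest = defaultdict(lambda: float("inf"))
    for sp_a, sp_b, dist in pairs:
        if sp_a == sp_b:
            continue  # ignore intraspecific comparisons
        nearest[sp_a] = min(nearest[sp_a], dist)
        nearest[sp_b] = min(nearest[sp_b], dist)
    return dict(nearest)


def fraction_below(nearest: Dict[str, float], cutoff: float) -> float:
    """Fraction of species whose closest congener lies below the cutoff."""
    return sum(d < cutoff for d in nearest.values()) / len(nearest)


# Divergences quoted in the text for four Agrodiaetus species pairs.
pairs = [
    ("A. peilei", "A. morgani", 0.0),
    ("A. fabressei", "A. ainsae", 0.2),
    ("A. peilei", "A. karindus", 0.2),
    ("A. poseidon", "A. hopfferi", 0.6),
]
nearest = min_interspecific(pairs)
print(nearest["A. peilei"], fraction_below(nearest, 3.0))
```

A species with many distant congeners and a single very close one still receives a low value here, which is why the minimum is more informative for identification than the mean.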

Identification with NJ tree profile

The approach of species identification with a Neighbour-Joining (NJ) tree profile as proposed by [9] does not necessarily depend on the barcoding gap but on the coalescence of conspecific populations and the monophyly of species (for details see Data analysis).

The success rate in the identification of our Lycaenidae data set with this method was 58%. Five out of 158 misidentifications or ambiguous identifications (3.2%) can be attributed to incorrectly identified specimens (Lampides boeticus, Neozephyrus japonicus, Agrodiaetus kendevani, see above). A further 90 cases (57%) involved closely related sister species whose taxonomic status is in dispute (Table 3). If these cases are not taken into account (i.e. counted as successful identifications, an unrealistic best-case scenario for barcoding success), the success rate would rise to 84%. In Agrodiaetus the success rate would remain lower (79%) while in the remaining genera it would reach 91%. But even with these corrections, 61 cases of misidentifications (16%) remain, 46 of these in Agrodiaetus (affected taxa in Table 4). The complete Neighbour-joining tree (available for download as additional file 1: NJ-tree) shows the reason for this failure: only 46% of conspecific sequences form a monophyletic group on this tree while the others are either paraphyletic (10%) or even polyphyletic (44%). In Agrodiaetus, only 34% of species are monophyletic (Table 1), while the others are paraphyletic (11%) or polyphyletic (55%). If incorrectly identified specimens are excluded and critical taxa (Table 3) are lumped together, still only 59% of species are monophyletic (43% in Agrodiaetus) while 7% are paraphyletic and 34% polyphyletic (49% in Agrodiaetus).

Table 3 Sister species or species complexes with disputable species borders
Table 4 Taxa misidentified with the NJ tree profile approach
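The success rates reported above can be retraced with simple arithmetic. The sketch assumes that the denominator is the number of test sequences actually counted, i.e. the 377 test sequences minus the three that were not identified to species level (see Data analysis). This denominator is inferred here and is not stated explicitly in the text.

```python
test_sequences = 377   # sequences not used in the profile (see Data analysis)
not_counted = 3        # test sequences not identified to species level
failures = 158         # misidentifications or ambiguous identifications
remaining = 61         # failures left in the best-case scenario

# Assumed denominator: test sequences whose result was counted.
counted = test_sequences - not_counted

success = 1 - failures / counted
best_case_success = 1 - remaining / counted

print(f"success rate: {100 * success:.0f}%")
print(f"best-case success rate: {100 * best_case_success:.0f}%")
print(f"remaining misidentifications: {100 * remaining / counted:.0f}%")
```

Under this assumption the rounded values agree with the 58%, 84% and 16% given in the text.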

Conclusion

We found an upper limit for intraspecific sequence divergences in a wide range of species of the diverse butterfly family Lycaenidae, but no lower limit for interspecific divergences and thus no barcoding gap. This result is especially well documented in the comprehensively sampled genus Agrodiaetus (114 of ca 130 recognized species sequenced) while the smaller overlap in Arhopala can be attributed to the lower percentage of species sampled (33 of more than 200 species). The species in [46] were chosen to maximize coverage of divergent clades while minimizing the total number of species. This is a common and sensible approach for phylogenetic studies, but it undermines the power of such sequence data as critical tests for the barcoding approach. The general level of sequence divergence is not exceptionally low in Lycaenidae compared to other Lepidoptera. The mean congeneric interspecific sequence divergence of 5.1% in Lycaenidae (5.1% in Agrodiaetus and 5.0% in the other genera) was only slightly lower than the mean value of 6.6% found by [2] in various families of Lepidoptera.

We thus confirm the results of Meyer & Paulay [17] and Meier et al. [18]. Our results also agree with those from a recent study in the Neotropical butterfly subfamily Ithomiinae (Nymphalidae) [47] which records highly variable levels of divergence in mtDNA (COI & COII) between taxa of the same rank. Our results however fail to agree with those of Barrett & Hebert [9] on arachnids. In that study the mean percent sequence divergence between congeneric species was 16.4% (SE = 0.13) and thus three times higher than in our study while the divergence among conspecific individuals was only slightly higher with 1.4% (SE = 0.16). The contradiction between our study and theirs can be explained by the very incomplete and sparse taxon sampling in their data set amounting to just 1% of the species contained within the families. We conclude that the reported existence of a barcode gap in arachnids appears to be an artifact based on insufficient sampling across taxa.

Despite these difficulties, species identification of unidentified samples with the help of barcodes is entirely possible. The NJ tree profile approach, which does not rely on a barcode gap, enabled the correct assignment of many sequences, and other methods (e.g. applying population genetic approaches) might further increase the success rate. However, 16% of test sequences could still not be identified correctly, even in some sympatric species pairs which clearly differ in phenotype and chromosome number (e.g. Agrodiaetus ainsae [n = 108–110]/fabressei [n = 90], Agrodiaetus hopfferi [n = 15]/poseidon [n = 19–22]). The main reason for this failure is that a large proportion of species are not reciprocally monophyletic, e.g. due to incomplete lineage sorting, which is in accordance with a previous study [48]. Moreover, the success with this method is again completely dependent on comprehensive sampling. If the correct species is not included in the profile, the assignment must by necessity be incorrect and misleading. Because of the non-existence of a barcoding gap, this error will often be impossible to detect. This limits possible applications of the barcoding approach. For example, cryptic species can only be detected with the help of a barcoding approach at high genetic divergence from all phenotypically similar species. An example is Agrodiaetus paulae which was discovered in this way [41]. In contrast, the sympatric species pairs Agrodiaetus ainsae-fabressei, A. hopfferi-poseidon and A. morgani-peilei would have gone unnoticed by barcoding approaches even though their strong phenotypical and karyological differentiation (n = 108 vs. n = 90, n = 15 vs. n = 19–22 and n = 27 vs. n = 39, respectively) clearly indicates their specific distinctness. Conversely, sequence divergence in what is currently believed to represent one species does not per se prove the specific distinctness of the entities in question. In Polyommatus icarus or P. amandus, for example, the high divergences between North African and Eurasiatic samples are a strong hint for the presence of unrecognized cryptic species, but this needs to be rigorously tested with sequence data from samples that cover the geographic range more comprehensively. In practical application, the problem of misidentified specimens and sequences in GenBank also remains a real threat to the accuracy of barcode-based identifications. An example is the GenBank sequence AB192475 of Lampides boeticus which is also used in the CBOL database (see above). This underscores the importance of voucher specimens and documentation of locality data, an issue raised by barcoding supporters but unfortunately still much neglected by GenBank. Another case of misidentification (GenBank sequence AF170864 of Plebejus acmon which was originally submitted as Euphilotes bernardino) [30] has already been corrected with the help of the voucher specimen.

In conclusion, the barcoding approach can be very helpful, e.g. in identifying early stages of insects or when only fragments of individuals are available for analysis. However, correct identification requires that all eligible species can be included in the profile and that sufficient information is available on the amount of intraspecific genetic variation and genetic distance to closely related species.

The barcoding procedure is not very well suited for identifying species boundaries but it may help to give minimum estimates of species numbers in very diverse and inadequately known taxonomic groups at single localities. Our case study on Agrodiaetus shows that a substantial number of species would have gone unnoticed by the barcoding approach as 'false negatives'. Thus, especially in clades where many species have evolved rapidly as a result of massive radiations with minimum sequence divergence, the barcoding approach holds little promise of meeting the challenge of rapid and reliable identification of large samples. Yet, it is exactly these situations which pose the most problematic tasks in the morphological identification of insects.

Although molecular data can be helpful in discovering new species, a large genetic divergence is not sufficient proof since it must be corroborated by other data. Furthermore, most closely related species which are difficult to identify with traditional means are also similar genetically and would go unnoticed by an isolated barcoding approach. Mathematical simulations have shown that populations have to be isolated for more than 4 million generations (i.e. 4 million years in the mostly univoltine Agrodiaetus species) for two thresholds proposed by the barcoding initiative (reciprocal monophyly, and a genetic divergence between species which is 10 times greater than within species) to achieve error rates less than 10% [49]. This might help to explain why the barcoding approach appears to be more successful in the Oriental genus Arhopala, which is thought to represent a phylogenetically older lineage of Lycaenidae estimated to be about 7–11 million years old [50], while the origin of the Palaearctic genus Agrodiaetus is dated at only 2.5–3.8 million years [44].

Our data show that the lack of a barcoding gap and of reciprocal monophyly in Lycaenidae is not confined to the genus Agrodiaetus with its extraordinary interspecific variation in chromosome numbers, but also occurs in other genera of Lycaenidae with stable chromosome numbers. It should also be noted that in Agrodiaetus there is neither evidence for exceptionally rapid radiation as in cichlids of the East African lakes [51] nor for unusual (i.e. sympatric) speciation patterns caused by karyotype evolution. Rather, karyotype diversification seems to have been a mere by-product of the usual mode of allopatric speciation [29, 30, 44].

Methods

Data sources

A total of 694 barcode sequences were used for our analysis. We used a 690 bp fragment at the 5' end of cytochrome c oxidase subunit I (COI) of 309 Lycaenidae sequences from a molecular phylogenetic study by Wiemers [30]. Most sequences belong to Agrodiaetus (198), the others (111) mostly to closely related Polyommatinae. All sequences have been deposited in GenBank [52] (AY556844-AY556867, AY556869-AY556963, AY556965-AY557155) with LinkOuts provided to images of the voucher specimens deposited with MorphBank [53]. These sequences were supplemented by 385 further sequences of Lycaenidae deposited in GenBank as of March 2006 (Table 5). They include sequences from further studies on Agrodiaetus [29, 44], the Palaearctic genus Maculinea [54], Nearctic Lycaeides melissa [55], the Oriental genus Arhopala [46, 50], the Australian genera Acrodipsas [56] and Jalmenus [57], and the South African Chrysoritis [58] as well as a few sequences which have only been used as outgroups in non-Lycaenidae studies (e.g. [59, 60]). Sequence length in the 5' region as defined by CBOL ranged between 240 bp and the maximum of 987 bp. (18 COI sequences from a study on Japonica only contained a 3' end fragment and were therefore not included.) Of the sequences included, 89% are at least 648 bp long as recommended by CBOL and 98% at least 500 bp long, which is deemed sufficient for barcode sequences [13]. However, sequence overlap for sequences from different studies was sometimes lower because of slightly different sequence locations within the barcode region (Figure 6). It should be noted that such inconsistencies are a common situation in barcode comparisons due to differences in primer use (e.g. [2]).

Table 5 Material
Figure 6

Sequence overlap for pairwise barcode comparisons. Length of sequence overlap in 240471 cross-comparisons of 694 aligned sequences.
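The overlap between two aligned barcode sequences can be counted as sketched below. The sketch assumes that gaps and unknown bases are coded as `-`, `?` or `N`. This coding and the function name are illustrative assumptions, not a description of the software used in this study.

```python
def overlap_length(seq_a: str, seq_b: str, missing: str = "-?N") -> int:
    """Count alignment columns in which both sequences carry a base.

    Assumption: characters listed in `missing` mark gaps or unknown bases.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        a not in missing and b not in missing
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )


# Two toy sequences that cover different parts of the same alignment.
print(overlap_length("ACGT--ACGT", "--GTACAC--"))
```

Only the columns covered by both sequences contribute to a pairwise distance, so two long sequences from different studies can still share a short overlap.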

Laboratory protocols

DNA was extracted from thorax tissue recently collected and preserved in 100% ethanol using Qiagen® DNeasy Tissue Kit according to the manufacturer's protocol for mouse tail tissue. In a few cases only dried material was available and either thorax or legs were used for DNA extraction.

Amplification of DNA was conducted using the polymerase chain reaction (PCR). The reaction mixture (for a total reaction volume of 25 μl) included: 1 μl DNA, 16.8 μl ddH2O, 2.5 μl 10 × PCR II buffer, 3.2 μl 25 mM MgCl2, 0.5 μl 2 mM dNTP-Mix, 0.25 μl Taq Polymerase and 0.375 μl 20 pm of each primer. The two primers used were:

Primer 1: k698 TY-J-1460 TAC AAT TTA TCG CCT AAA CTT CAG CC [61]

Primer 2: Nancy C1-N-2192 (CCC) GGT AAA ATT AAA ATA TAA ACT TC [61]
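As a plausibility check of the reaction mixture listed above, the component volumes can be summed and the final concentration of the added MgCl2 derived by simple dilution. The dictionary keys are labels chosen for this sketch. The concentration refers only to the MgCl2 added separately and assumes that no further magnesium is contributed by the buffer.

```python
import math

# Component volumes in microlitres, taken from the protocol above.
volumes_ul = {
    "DNA": 1.0,
    "ddH2O": 16.8,
    "10x PCR II buffer": 2.5,
    "25 mM MgCl2": 3.2,
    "2 mM dNTP mix": 0.5,
    "Taq polymerase": 0.25,
    "primer 1": 0.375,
    "primer 2": 0.375,
}

total_ul = sum(volumes_ul.values())
# Floating-point sums can be off by a tiny amount, hence isclose().
assert math.isclose(total_ul, 25.0), total_ul

# Final concentration after dilution: stock concentration * volume / total.
mgcl2_final_mm = 25.0 * volumes_ul["25 mM MgCl2"] / total_ul
print(f"total volume: {total_ul:.3f} ul")
print(f"final added MgCl2: {mgcl2_final_mm:.1f} mM")
```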

PCR was conducted on thermal cyclers from Biometra® (models Uno II or T-Gradient) or ABI Biosystems® (model GeneAmp® PCR-System 2700) using the following profiles:

Initial 4 minutes denaturation at 94°C and 35 cycles of 30 seconds denaturation at 94°C, 30 seconds annealing at 55°C and 1 minute extension at 72°C.

PCR products were purified using purification kits from Promega® or Sigma® and checked with agarose gel electrophoresis before and after purification.

Cycle sequencing was carried out on Biometra® T-Gradient or ABI Biosystems® GeneAmp® PCR-System 2700 thermal cyclers using sequencing kits of MWG Biotech® (for Li-cor® automated sequencer) or ABI Biosystems® (for ABI® 377 automated sequencer) according to the manufacturers' protocols and with the following cycling times: initial 2 minutes denaturation at 95°C and 35 cycles of 15 seconds denaturation at 95°C, 15 seconds annealing at 49°C and 15 seconds extension at 70°C. Primers used were the same as for the PCR reactions for the ABI (primer 1 was used for forward and primer 2 for independent reverse sequencing), but for Li-cor truncated and labelled primers were used with 3 bases cut off at the 5' end and labelled with IRD-800. For ABI sequencing the products were cleaned using an ethanol precipitation protocol. Electrophoresis of sequencing reaction products was carried out on Li-cor® or ABI® 377 automated sequencers using the manufacturer's protocols.

Data analysis

Sequences were aligned with BioEdit 7.0.4.1 [62] and pruned to a maximum of 987 bp, the section proposed by CBOL for barcoding. Pairwise sequence divergences were calculated separately for intraspecific and for interspecific (but congeneric) comparisons with MEGA 3.1 [63] using Kimura's two parameter (K2P) distance model. This is not necessarily the best model to analyze the data (see [64]), but it was chosen to facilitate comparisons with other barcode studies of Hebert and co-workers [1, 9–12, 16] who have been using this model. Distance tables were processed to calculate divergence means (incl. standard errors and ranges) within and between species.
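For readers unfamiliar with the K2P model, the distance between two aligned sequences follows from the proportion of transitions P and the proportion of transversions Q as d = −½ ln(1 − 2P − Q) − ¼ ln(1 − 2Q). The Python sketch below implements this formula. It is an illustration only and not the MEGA implementation. In particular, the handling of gaps and ambiguous bases (columns containing them are skipped for that pair) is an assumption of the sketch.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences.

    Assumption: columns in which either sequence lacks an unambiguous
    base (A, C, G, T) are skipped for this pair.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("sequences do not overlap")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the model (non-positive argument).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# One transition in ten compared sites (toy sequences).
print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```

Multiplying the result by 100 gives the percentage divergences used throughout this paper.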

The taxonomy was taken from GenBank in most cases but two minor spelling inconsistencies were corrected. In four cases where a taxon within Agrodiaetus was treated as a species by one author but only as a subspecies by another, we matched them by treating those taxa as distinct species. The generic subdivision of Lycaenidae is very much in flux. Some genera are only treated as subgenera by some authors and many genera (like Polyommatus or Plebejus) are probably paraphyletic or polyphyletic; however, we undertook no revision of the GenBank taxonomy since it appeared consistent enough for our analysis. The remaining inconsistencies only affect few taxa in our analysis and include the treatment of Sublysandra (distinct genus or subgenus of Polyommatus), Eumedonia (distinct genus or subgenus of Aricia), Otnjukovia (synonym to Turanana), Maculinea (synonym to Phengaris) and Callipsyche (synonym to Satyrium). (A complete list of sequences with corresponding taxa names and voucher numbers is found in the additional file 1: NJ tree.)

A Lycaenidae species profile was created according to [9]. Of the 694 barcode sequences, we excluded 9 short Arhopala sequences with a barcode length of only 240 bp. (To check the position of those sequences, a separate analysis was run containing only the Arhopala sequences.) Of the remaining 685 sequences, we randomly selected 1 sequence from each of the 308 Lycaenidae species for inclusion into a COI species profile. We chose a sequence of Apodemia mormo (GenBank accession number AF170863) from the family Riodinidae as outgroup because this family appears to represent the sister group to Lycaenidae [65–67]. The other 377 sequences which had not been included in the profile were used as "test" sequences: they were singly added to the test profile in repeated Neighbour-joining analyses and their "classification success" was recorded. A test was recorded as successful if the test sequence grouped most closely with the conspecific profile sequence and not with another species. Results of three GenBank sequences which were not identified to species level (all belonging to the genus Agrodiaetus) were not counted. After the classification test, another NJ analysis was run including all sequences in order to understand possible failures in classification. The main reason for using Neighbour-joining as the tree-building method is its computational efficiency. Although this method is well suited for grouping closely related sequences, it should be noted that other methods (such as Maximum Parsimony, Maximum Likelihood or Bayesian inference of phylogeny) are usually superior in constructing phylogenetic trees.
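The logic of the classification test can be illustrated with a deliberately simplified sketch. Instead of placing each test sequence on a Neighbour-joining tree, it assigns the test sequence to the profile species with the smallest pairwise distance and treats ties between species as ambiguous. This nearest-neighbour rule is a stand-in chosen for brevity and is not the procedure used in this study. The uncorrected p-distance, the function names and the toy sequences are likewise hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def classify(test_seq: str, profile: Dict[str, str],
             distance: Callable[[str, str], float] = p_distance
             ) -> Optional[str]:
    """Return the closest profile species, or None if the match is tied."""
    dists = {sp: distance(test_seq, seq) for sp, seq in profile.items()}
    best = min(dists.values())
    closest = [sp for sp, d in dists.items() if d == best]
    return closest[0] if len(closest) == 1 else None


def success_rate(tests: List[Tuple[str, str]], profile: Dict[str, str],
                 distance: Callable[[str, str], float] = p_distance) -> float:
    """Fraction of (true species, sequence) tests assigned correctly."""
    hits = sum(classify(seq, profile, distance) == sp for sp, seq in tests)
    return hits / len(tests)


# Toy profile with one sequence per species (hypothetical data).
profile = {"sp_A": "ACGTACGTAC", "sp_B": "ACGTACGTAT", "sp_C": "TTGTACGGAC"}
tests = [
    ("sp_A", "ACGTACGTAC"),  # identical to its own profile sequence
    ("sp_B", "ACGAACGTAT"),  # one difference from sp_B, two from sp_A
    ("sp_C", "ACGTACGTAC"),  # carries the sp_A haplotype: misassigned
]
print(success_rate(tests, profile))
```

The third toy test shows the failure mode discussed in this paper: when two species share a haplotype or are not reciprocally monophyletic, the assignment is wrong and nothing in the distances flags the error.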