1 Introduction

The mitochondrial molecule of the honey bee, Apis mellifera L., was fully sequenced in 1993 by Crozier and Crozier. Since then, several regions of the mitochondrial molecule have been used for phylogeographic and molecular diversity studies of A. mellifera subspecies and populations, particularly the tRNAleu-cox2 intergenic region (Evans et al. 2013; Meixner et al. 2013). This intergenic region shows length and sequence variation that allows a grouping of subspecies largely congruent with the morphometric classification that subdivides the variation of A. mellifera into four evolutionary lineages: the West (M) and North Mediterranean (C) lineages; an African lineage (A), which groups the African honey bee subspecies; and the Oriental lineage (O), located geographically in the Middle East (Ruttner 1988). Several further divisions have been recognized within these lineages via sequencing of the intergenic region (Franck et al. 2001; De la Rúa et al. 2001; Pinto et al. 2012). Based on these molecular markers, Iberian honey bee populations have been recognized as part of a cline from North Africa to Europe (Cánovas et al. 2008; Pinto et al. 2013), while honey bee populations from the Atlantic coast, including the Macaronesian region, have been distinguished as an African sublineage with Atlantic distribution (De la Rúa et al. 2006, 2009; Muñoz et al. 2013; Murray et al. 2009).

For most organisms, the main noncoding region of the mitochondrial genome is the control region (CR). It is widely used for genetic inferences, especially in mammals, where it is also known as the D-loop (for displacement loop), a three-stranded structure that forms during replication of the molecule. At least in vertebrates, functional structures within this region have been identified across a large number of species, namely highly conserved sequences associated with replication of the molecule, conserved and variable blocks, and palindromic and stable cloverleaf structures that may be involved in replication, as well as transcription initiation sites for each strand (Chang and Clayton 1984; Saccone et al. 1991; Clayton 1982, 1996). In invertebrates, the control region is much less studied and understood. Its use as a genetic marker is rare, for good reasons already pointed out by Zhang et al. (1995) and Zhang and Hewitt (1997). In insects in particular, the control region is known as the A+T rich region because of its high content of adenine and thymine nucleotides, which can make up to 96 % of the base content, as is the case in Apis and Drosophila. Most of this content is due to length polymorphisms, tandem repetitions of (T)n(A)n motifs, stretches of Ts and As, and numerous insertions/deletions of nucleotides. Establishing sequence similarity across the A+T rich region, even among closely related sequences, is problematic, and its use to infer evolutionary relationships is limited (Brehm et al. 2003; Vila and Björklund 2004; Kim et al. 2007; Wan et al. 2011a, b). Attempts to use the A+T rich region in insects have focused on its structure and evolution, with particular emphasis on functional conserved structures and putative regulatory sequences rather than intraspecific comparisons (Simon et al. 1994; Brehm et al. 2001; Sugihara et al. 2006; Kim et al. 2007; Zhang et al. 2013).
Zhang and Hewitt (1997) divided the control regions of insects into two groups according to their structural conservation. One group included Drosophila, with conserved and variable domains; a second group, in which such a division is much less perceptible, was formed by grasshoppers, mosquitoes, and butterflies (Zhang and Hewitt 1997; Kim et al. 2007). Alignment of sequences may be possible only for those conserved blocks that retain some degree of homology, but the lack of variability in the conserved blocks is itself a handicap for the discrimination of lower taxa. On the other hand, the high mutation rate that characterizes the variable blocks is certainly the cause of the high number of parallel or recurrent mutations in different lineages, which produce poorly resolved phylogenetic networks or unsupported topologies. In the genus Apis, only three complete mitochondrial genomes containing the control region (A. mellifera, Apis cerana, and Apis florea) exist in databases (Crozier and Crozier 1993; Tan et al. 2011; Wang et al. 2013), but the sequence variation of the A+T rich region has not been characterized further.

The aim of the present work is to present a deeper analysis of the A+T rich region in A. mellifera by looking at sequences from individuals belonging to different populations of the Iberian Peninsula and Macaronesian islands. Such an analysis should provide us with an accurate picture of the structure of the A+T rich region and the way it is organized in terms of conserved/variable domains. As far as we know, this is the first time a comparison of sequences from the control region has been done in an otherwise well-studied species. We also intend to evaluate the usefulness of the A+T rich region as a genetic marker particularly to address intraspecific variation.

2 Material and methods

2.1 Samples

A. mellifera honey bee workers (n = 54) were taken from the inner frames of colonies (one worker per colony, from two to six colonies per locality) in three localities of the Iberian Peninsula [Cádiz (Spain), n = 4; Málaga (Spain), n = 4; Viseu (Portugal), n = 5], the Canary Islands [Tenerife, n = 5; La Palma, n = 5; La Gomera, n = 4; Gran Canaria, n = 5], Madeira (n = 7), Azores Islands (São Miguel, n = 5), Morocco [Tan Tan, n = 2; South Rabat, n = 2], and Cabo Verde archipelago (n = 6), and were preserved in ethanol until analysis. All the samples had previously been identified as belonging to the African evolutionary lineage through the determination of their tRNAleu-cox2 haplotype (De la Rúa et al. 2001, 2006, 2007, unpublished results; Cánovas et al. 2008). We did not perform a morphometric analysis of these samples to determine their subspecies status, but from the geographic location it can be inferred that the Iberian samples correspond to Apis mellifera iberiensis and the Moroccan samples to Apis mellifera intermissa (South Rabat) and Apis mellifera sahariensis (Tan Tan). Island samples were included in the African evolutionary sublineage of Atlantic distribution.

2.2 DNA isolation and PCR amplification

Total DNA was extracted from thoraces and legs of single individuals of A. mellifera following the E.Z.N.A. Insect DNA Isolation kit protocol (Omega Bio-Tek). PCR amplification of the mitochondrial DNA (mtDNA) control region was performed in a 50 μL reaction containing 10 ng of DNA, 10 pmol of each primer [12SAIR: 5′-AGGGTATCTAATCCTAGTTT-3′ and METAPIS: 5′-GACAGGGTATGAACCTGTTAGCTT-3′], 50 mM MgCl2, 5 μL 10X PCR Buffer, 2 U Taq DNA Polymerase (both from Invitrogen Life Technologies), and 10 mM of each dNTP. After initial heating at 95 °C for 180 s, 35 cycles of PCR amplification were performed, each consisting of three steps: denaturation at 94 °C for 60 s, annealing at 50 °C for 90 s, and extension at 72 °C for 120 s. A final extension at 72 °C for 240 s was included. A single fragment of about 1.2 kb was obtained. Amplified products were visualized after electrophoretic separation on 1 % agarose gels, stained with ethidium bromide, and photographed under UV illumination. As expected, these primary fragments varied slightly in size among individuals, so each product was amplified in triplicate for each specimen to ensure that each individual always yielded fragments of the same size.
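As a quick sanity check on the primers and cycling program listed above, the following sketch computes primer length, GC content, and the total programmed hold time. The function names are illustrative assumptions and not part of the original protocol; ramp times between temperature steps are ignored.

```python
# Primer sequences exactly as given in the text (5' to 3').
PRIMERS = {
    "12SAIR": "AGGGTATCTAATCCTAGTTT",
    "METAPIS": "GACAGGGTATGAACCTGTTAGCTT",
}


def gc_content(primer):
    """Return the fraction of G and C bases in a primer sequence."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def total_cycling_seconds(initial, cycles, steps, final):
    """Programmed hold time: initial step + cycles * (sum of steps) + final step."""
    return initial + cycles * sum(steps) + final


if __name__ == "__main__":
    for name, seq in PRIMERS.items():
        print(f"{name}: {len(seq)} nt, GC = {100 * gc_content(seq):.1f} %")
    # 95 C for 180 s; 35 cycles of 60 s + 90 s + 120 s; final 240 s.
    seconds = total_cycling_seconds(180, 35, (60, 90, 120), 240)
    print(f"Programmed time: {seconds} s ({seconds / 60:.1f} min)")
```

The low GC content of both primers is in line with the A+T bias of the region they flank.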

2.3 Sequencing of mtDNA

Direct sequencing of the 54 PCR products was performed in an ABI310 Genetic Analyzer (Applied Biosystems). Four internal primers were used [INTF1: 5′-ATCTTAAAAACTACAACATGA-3′, INT2F: 5′-CATTGTTTCAGAATTCTCT-3′, INT2R: 5′-TTATTATGCTTATTTATTC-3′, and INTR1: 5′-CGAATCTAGAGTTAAAGTTAGA-3′, see Figure 1]. All reactions were done in triplicate for each individual to ensure that no errors were introduced during PCR amplification and sequencing. Sequences were aligned using the program DNASTAR (Lasergene) to assemble complete fragments from all the partial sequences obtained with the forward and reverse primer pairs.

Figure 1.

The complete sequence of the control region from AZ04 and its main features mentioned in the text. The partial sequence of the 23S rRNA small subunit (green and underlined) and both transfer RNAs (tRNAglu and tRNAser, both colored and underlined) are also included and correspond to the sequences depicted in supplementary Tables S1 and S2. Length and position of three of the internal primers used are also depicted above the sequence. Boundaries of the control region are shown by dark square brackets. Variable blocks 1–4 are limited by arrows linked by solid lines. The sequence limited by arrows and a dashed line that includes the T-stretch refers to conserved block 1 (CB1) mentioned in the text. The sequence highlighted in pink is block B proposed by Zhang et al. (1995) discussed in the text. The putative STOP signal is underlined. The sequence wave-underlined corresponds to the long stem-loop structure discussed in the text and depicted in detail in Figure 2b.

2.4 Structural and phylogenetic sequence analysis

DNA sequences reported here are available on the European Nucleotide Archive at www.ebi.ac.uk/ena/ under accession numbers LN775355-LN775408. Multiple sequence alignments of the complete fragments were performed with CLUSTAL X (Thompson et al. 1997) implemented in MEGA 5 (Tamura et al. 2011) and maximized for sequence similarity by a careful visual inspection. Potential secondary structures of the A+T rich region were identified using the MFOLD program (Mathews et al. 1999; Zuker et al. 1999).

Genetic relationships of haplotypes were depicted using the software NETWORK v. 4.6.1.1 (http://www.fluxus-engineering.com/) following two strategies: firstly, we used all individuals to construct a network of haplotypes from the combined conserved blocks of the A+T rich region, and secondly, we performed the same analysis separately for each population using the whole region (conserved and variable blocks). For the network analysis with all the individuals, we followed the specifications of the program and applied the following constraints: transversions were weighted three times more than transitions, and individual characters were weighted 10 (default value) except where runs of Ts or As (as in the case of the T-stretch) produced individual variation as an insertion or deletion. In these cases, bases inserted or deleted in the runs were weighted zero. Other hypervariable characters characterized by an indel across individuals were weighted 5. The star contraction algorithm (SC, Forster et al. 2001) was applied to obtain a network with the minimum possible complexity. After this, the median-joining (MJ, Bandelt et al. 1999) option was applied to clean up the contracted network. Finally, we ran the maximum parsimony option (Polzin and Daneshmand 2003) to further clear the topology of unnecessary median vectors and links. In the case of population-specific networks, we applied the MJ option alone.
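The weighting scheme described above can be summarized in a short sketch. This is only an illustration of the stated constraints, not the NETWORK implementation; the category labels ("substitution", "homopolymer_indel", "hypervariable_indel") and function names are hypothetical and introduced here for clarity.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

# Character weights as stated in the text; the category labels are hypothetical.
CHARACTER_WEIGHTS = {
    "substitution": 10,        # default weight for ordinary characters
    "hypervariable_indel": 5,  # other hypervariable indel characters
    "homopolymer_indel": 0,    # indels inside runs of Ts or As (e.g. the T-stretch)
}


def substitution_type(a, b):
    """Classify a nucleotide change as a 'transition' or a 'transversion'."""
    a, b = a.upper(), b.upper()
    if a == b or not {a, b} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def relative_substitution_cost(a, b):
    """Transversions count three times as much as transitions."""
    return 3 if substitution_type(a, b) == "transversion" else 1


def character_weight(kind):
    """Look up the weight assigned to a character of the given category."""
    return CHARACTER_WEIGHTS[kind]


if __name__ == "__main__":
    print(substitution_type("A", "G"), relative_substitution_cost("A", "G"))
    print(substitution_type("A", "T"), relative_substitution_cost("A", "T"))
    print(character_weight("homopolymer_indel"))
```

Setting the weight of homopolymer indels to zero removes length variation in runs of Ts or As from the distance calculation while leaving the rest of the alignment untouched.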

Haplotype diversity and nucleotide composition of the whole A+T rich region and of the conserved blocks combined were obtained using Arlequin v. 3.0 (Excoffier et al. 2005). Genetic distances among sequences were computed using the conserved blocks combined. To assess differentiation among populations, the corrected average pairwise differences, population pairwise Φst (similar to Fst), and an analysis of molecular variance (AMOVA) were also computed using Arlequin, with the three populations from the Iberian Peninsula considered as a group.

As supporting files, we provide the complete aligned sequences as a FASTA file (additional Table S1, remove the .txt terminus) as well as an Excel file with all variable positions including indels (additional Table S2).

3 Results

The complete sequences of the A+T rich region, including flanking regions comprising a partial segment of a srRNA and two complete transfer RNAs (tRNAs) (tRNAglu and tRNAser), were obtained for 54 individuals. Identical sequences of the genes surrounding the control region were always obtained in the triplicate repeats from each individual, reducing the chance that we were dealing with numts of this region.

3.1 Base composition and size

The average nucleotide composition of the A+T rich region was 45.9 % T, 3.0 % C, 50.2 % A, and 0.9 % G. Thus on average, an AT base composition of 96.1 % was found reflecting an extremely high bias towards A and T. The transition/transversion bias (R) is 3.85 under a Kimura (1980) two-parameter model. The size of the A+T rich region varied across the 54 individuals studied and ranged from 897 bp (TF56) to 984 bp (HP8). Most of the length variation observed was due to three separate regions encompassing tandem repetitions of short sequences as described below. Other small insertions/deletions also occurred along the A+T rich region and did not contribute significantly to the overall size variation of this region. The sequence and main features of a complete control region and its flanking regions are presented in Figure 1 (in this case, we chose AZ04 as the model to which all comparisons were done; see also additional complementary Tables S1 and S2).
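Base composition figures of this kind can be reproduced from any aligned or unaligned sequence with a few lines of code. The sketch below is a minimal illustration; the function names are assumptions, gap and ambiguity characters are simply ignored, and the example string is a toy sequence rather than data from this study.

```python
from collections import Counter


def base_composition(seq):
    """Return the percentage of T, C, A, and G, ignoring gaps and ambiguity codes."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: 100.0 * counts[base] / total for base in "TCAG"}


def at_content(seq):
    """Return the combined percentage of A and T."""
    comp = base_composition(seq)
    return comp["A"] + comp["T"]


if __name__ == "__main__":
    toy = "TTAATTAA-TAATGC"  # toy example, not a sequence from this study
    print(base_composition(toy))
    print(f"A+T = {at_content(toy):.1f} %")
```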

3.2 Domains and tandem repetitions

To a large extent, size variation of the A+T rich region was due to three regions of different lengths containing a variable number of tandem repeat units. The first of these regions encompasses a variable stretch of mainly TAA tandem repeats located exactly at the beginning of the A+T rich region (variable block 1, VB1; Figure 1). This block can have 3–14 repeats of TAA and is interrupted at the beginning by a TATA motif that is common to all individuals analyzed. VB1 can thus vary from 16 to 49 nucleotides. A second block (variable block 2, VB2; Figure 1) lies within a region of TA repeats, although this TA region shows a high number of A↔T transversions. VB2 is the longest region of the A+T rich region with a variable number of repeats and mutated bases. Its length ranges from 118 nt (as in LP51 and LP52) to 187 nt (as in HP8). This region is a type of [T(T)A(A)]n sequence that is also apparent in similar positions in other insects (Zhang and Hewitt 1997), though definite proof of homology is still lacking. Variable block 3 (VB3; Figure 1) is situated towards the end of the CR. It ranges from 98 nt (LP51 and LP52) to 146 nt (AZ2, AZ3, and AZ4). This block includes a subregion of 34 to 36 bp with an extremely high content of adenines punctuated here and there by thymines. Three of the individuals analyzed (AZ2, AZ3, and AZ4) show a partial repetition of this variable block (Figure 1; see supplementary files). Finally, at the very end of the A+T rich region and close to tRNAglu lies VB4, a small variable region whose length varies from 9 to 12 nt.
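Counting tandem repeat units such as the TAA repeats of VB1 is straightforward to automate. The following sketch finds the longest uninterrupted run of a repeat unit; it is an illustrative helper with a hypothetical name, it does not tolerate the point mutations within repeats described above, and the example string is a toy sequence rather than one of the sequences reported here.

```python
import re


def longest_tandem_run(seq, unit):
    """Return (start_index, n_repeats) for the longest perfect run of `unit`.

    Returns (-1, 0) when the unit does not occur in the sequence.
    """
    if not unit:
        raise ValueError("repeat unit must not be empty")
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    best = (-1, 0)
    for match in pattern.finditer(seq.upper()):
        n_repeats = (match.end() - match.start()) // len(unit)
        if n_repeats > best[1]:
            best = (match.start(), n_repeats)
    return best


if __name__ == "__main__":
    toy = "TATATAATAATAATAAGC"  # toy example: TATA followed by TAA repeats
    start, n = longest_tandem_run(toy, "TAA")
    print(f"longest TAA run starts at index {start} with {n} repeats")
```

A more realistic analysis of VB2 or VB3 would need approximate matching, since the text notes that mutations occur inside the repeat units.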

The highly conserved domain between VB1 and VB2, called here conserved block 1 (CB1), harbors a sequence that putatively forms a stem-and-loop secondary structure described by Clary and Wolstenholme (1987) and Zhang et al. (1995) and possibly implicated in the replication of circular DNA molecules. This sequence is read on the complementary strand (Clary and Wolstenholme 1987) and, with a few nucleotide changes, is the one highlighted in the pink box of Figure 1 and the grey box in Figure 2a. Interestingly, within this structure there is an almost exact match to a stop signal described in humans and mice by Doda et al. (1981), which has the motif 5′-ACATTAAAYYAAT-3′ and is present on the A. mellifera complementary strand as 5′-ATATTAAAATAAT-3′ (the nucleotides at positions 2 and 9 differ from the motif; see the sequence underlined in CB1 of Figure 1 and also Figure 2a). This sequence, highlighted in pink in Figure 2, is conserved in all individuals analyzed but never forms a single stem-and-loop structure, contrary to the Apis model proposed by Zhang et al. (1995, see their Figure 4c), except at a high energetic cost. In Drosophila, it has been more or less established that a conserved stem-and-loop structure is a feature common to all species (Monforte et al. 1993; Zhang et al. 1995; Andrianov et al. 2010), and that it may be related to the replication of mtDNA, particularly the site of initiation of second-strand synthesis. In A. mellifera, this region apparently exists but not as the stem-and-loop structure described for Drosophila. It bears other features that resemble signals for initiation of L-strand replication. For example, the conserved motif 5′…G(A)nT…3′ (n can vary from 1 to 4), which flanks this region in mammals, insects, chickens, etc., is also apparently present in the honey bee but on the complementary strand, read as 3′…GAAT…5′.
The consensus sequence at the 5′ flanking region of this sequence does not have the TATA motif that is apparent in other organisms, although closely related sequences have been proposed (such as TAATA; Zhang et al. 1995). In view of the new STOP signal described above, we propose alternative locations for the TATA box, which are highlighted in blue in Figure 1: one TATA box is the one already mentioned in VB1, and a second overlaps the STOP signal just described. Further experimental approaches will allow the proper location to be inferred, although the second seems much more plausible since it is located in a conserved domain. As for structural elements generally found in other insects, a poly-thymidine stretch composed of 16–18 Ts was found just before block 2 (poly-T, Figure 1). Positive identification of this structure is further supported by the fact that the (T)n stretch is flanked by two guanines. As proposed by Clary and Wolstenholme (1987), this sequence represents the site of initiation of second-strand synthesis (rather than first-strand synthesis). The complete CB1 consistently forms five hairpin structures (Figure 2a), one of which carries the T-stretch just mentioned as the terminal loop of its stem. The other hairpin structures are quite stable and, with the exception of a single loop-out, all of them produce stems with almost completely paired nucleotides.
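The comparison between the Doda et al. (1981) stop-signal motif and the honey bee sequence can be made explicit with a small IUPAC-aware comparison. The sketch below is illustrative only; the function name is an assumption, and the two strings are copied from the text.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def motif_mismatches(motif, seq):
    """List (position, motif_code, observed_base) where seq violates the motif.

    Positions are 1-based; both strings must have the same length.
    """
    if len(motif) != len(seq):
        raise ValueError("motif and sequence must have the same length")
    mismatches = []
    for pos, (code, base) in enumerate(zip(motif.upper(), seq.upper()), start=1):
        if base not in IUPAC[code]:
            mismatches.append((pos, code, base))
    return mismatches


if __name__ == "__main__":
    vertebrate_motif = "ACATTAAAYYAAT"  # Doda et al. (1981), as quoted in the text
    apis_sequence = "ATATTAAAATAAT"     # A. mellifera complementary strand
    for pos, code, base in motif_mismatches(vertebrate_motif, apis_sequence):
        print(f"position {pos}: motif {code}, observed {base}")
```

Because Y stands for either pyrimidine, the T at position 10 is compatible with the motif, whereas the A at position 9 is not.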

Figure 2.

a Secondary structure model predicted for the conserved block 1 (CB1) showing the five stem-loops mentioned in the text. Arrows and accompanying letters refer to polymorphisms found in the 54 individuals analyzed. Asterisks denote insertions of a thymine. A white triangle denotes a deletion. The grey area corresponds to the putative STOP signal described in the text. b The most stable secondary structure of a long stem-loop existing in the conserved block 2 (CB2). An arrow denotes the only polymorphism found in all individuals analyzed.

The second highly conserved region (CB2) is the longest of the control region, ranging from 413 to 442 nucleotides. Excluding all variable positions and indels, this region is 82 % conserved among all individuals studied. Secondary structure elements within CB2 (a 30-bp-long stem with only four mismatches and a terminal loop of just five nucleotides; see Figure 1 for its location) appear consistently, varying only in the free energy (∆G) required, which points to structural folding stability in this region. This stem-loop structure, which never changes across the folding alternatives for this region, is depicted in detail in Figure 2b. The existence of stem-and-loop structures is to be expected in sequences that are 97 % biased to A+T. The stability of such secondary structures improves with the number of repeats. The present case seems to be a very stable segment that is always present within CB2, a fragment that has few polymorphic sites, with the stem-and-loop structure admitting just a single singleton. Finally, after VB3, the remaining sequence can also be considered highly conserved among the individuals studied, except for a very small stretch embedded in it that is named variable block 4 (VB4) in Figure 1.

3.3 The tRNAs

Figure 3 depicts the most probable secondary structures of tRNAglu and tRNAser. Both tRNA genes fully agree with the structures inferred by Crozier and Crozier (1993). The glutamate tRNA (tRNAglu) presents a polymorphism across the individuals studied, represented by deletions of two adenines in the dihydrouridine (DHU) arm. tRNAser lacks the typical cloverleaf structure because it has no DHU arm, an absence that is typical of most other insects (Yang et al. 2011; Zhang et al. 2013), and presented no intraspecific polymorphisms. The anticodons of tRNAglu and tRNAser are TTC and TCT, respectively.
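The codons recognized by the two anticodons can be derived by reverse complementation. The sketch below is a minimal illustration: the function names are assumptions, and the lookup table contains only the two codons relevant here, taken from the invertebrate mitochondrial genetic code (in which AGA encodes serine).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Only the two codons needed here (invertebrate mitochondrial genetic code).
CODON_SUBSET = {"GAA": "Glu", "AGA": "Ser"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def codon_for_anticodon(anticodon):
    """Return the codon that pairs perfectly with the given anticodon (5' to 3')."""
    return reverse_complement(anticodon)


if __name__ == "__main__":
    for trna, anticodon in (("tRNAglu", "TTC"), ("tRNAser", "TCT")):
        codon = codon_for_anticodon(anticodon)
        print(f"{trna}: anticodon {anticodon} pairs with codon {codon} "
              f"({CODON_SUBSET.get(codon, 'not in subset')})")
```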

Figure 3.

Most stable secondary structures of tRNAglu and tRNAser. Asterisks with lambda symbol represent polymorphisms since the adenine stretch may have 8–10 nucleotides.

3.4 Genetic variability indices and phylogenetic value of the A+T rich region

There are no significant differences between the base composition of the complete A+T rich region and that of the conserved blocks combined (49.2 % T, 3.5 % C, 46.1 % A, and 1.2 % G). The same holds when base composition is compared among the populations studied. This “homogeneity” is only apparent, because the individual conserved blocks differ greatly in base composition. Excluding CB4, the other three conserved blocks show statistically significant differences in base composition: CB1 (54.3 % T, 1.4 % C, 43.1 % A, 1.3 % G), CB2 (48.0 % T, 3.7 % C, 47.2 % A, 1.1 % G), and CB3 (32.1 % T, 11.3 % C, 54.7 % A, 1.9 % G) (Kruskal-Wallis test, P < 0.008). Such differences are mainly due to CB3, which diverges from the average, showing an increase in Cs at the expense of Ts. Finally, we performed an analysis of molecular variance based solely on the conserved regions combined to see whether a substructuring of the populations is perceptible, independently of a phylogeographic link among them. The overall Fst was 0.489 (P < 0.0001). Most of the variability is due to variation within populations (51 %); variation among populations within groups was 48 % (in this case, the only group formed was composed of the three populations from the Iberian Peninsula, and all other populations were treated separately), and only 1 % could be attributed to differences among groups. The fact that no common haplotypes exist is certainly the cause of these results, a feature common to other insects (Wan et al. 2011a, b). Pairwise differences (as Fst) varied from 0.0 (AZ vs. MAL) to 0.83 (CV vs. VIS) and were all statistically significant except for the pair AZ/MAL. The AMOVA performed using all the populations separately as groups did not change the results much: most of the variance is still attributed to within-population variability and the remainder to differences among populations (data not shown).
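The relationship between the AMOVA variance components and the overall fixation index can be checked with the standard hierarchical definitions. The sketch below is an illustration and not the Arlequin implementation; the function name is an assumption, and the inputs are the rounded percentages reported above.

```python
def amova_phi(among_groups, among_pops_within_groups, within_pops):
    """Compute hierarchical fixation indices from three variance components.

    Inputs may be raw variance components or percentages of the total
    variance, since only their ratios matter.
    """
    total = among_groups + among_pops_within_groups + within_pops
    lower = among_pops_within_groups + within_pops
    if total <= 0 or lower <= 0:
        raise ValueError("variance components must sum to positive values")
    return {
        "phi_ct": among_groups / total,                               # among groups
        "phi_sc": among_pops_within_groups / lower,                   # among populations within groups
        "phi_st": (among_groups + among_pops_within_groups) / total,  # overall
    }


if __name__ == "__main__":
    # Rounded percentages from the text: 1 %, 48 %, and 51 %.
    for name, value in amova_phi(1.0, 48.0, 51.0).items():
        print(f"{name} = {value:.3f}")
```

With the rounded percentages, the overall index comes out at 0.49, consistent with the reported value of 0.489.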

The final topology obtained with NETWORK (Figure 4) using the constraints described in the methods section shows one of the three shortest trees found, which differ from each other by a few minor branch swaps. Even using just the conserved blocks, with only 54 variable sites, it is possible to perceive that some geographically related groups retain close links, as is the case for the Iberian VIS and CA groups and CV. The Cape Verde honey bees (CV), whose subspecies classification remains unknown (but see Pedersen 2001), are all related to the Moroccan ones (A. m. intermissa from SR and A. m. sahariensis from TT). In congruence with previous results based on nuclear (microsatellite) variation (De la Rúa et al. 2001; Muñoz et al. 2013), honey bees from the Canary Islands (yellow in Figure 4) are not a homogeneous group, since at least two distinct clades emerge in the topology. A more or less complex group occupies a central position in the topology and connects most populations from Madeira (MD), Azores (AZ), and the Iberian Peninsula (VI, ML), which reflects the dispersion of the African sublineage with Atlantic distribution in the surveyed area (Pinto et al. 2012).

Figure 4.

Median-joining network topologies of mtDNA control region haplotypes observed in eight populations from the Iberian Peninsula and northwestern Atlantic Islands. Circle sizes are proportional to haplotype frequencies. Black dots represent median vectors that connect related haplotypes. Line sizes are proportional to the number of mutational positions connecting haplotypes.

4 Discussion

In A. mellifera, the near absence of G/C nucleotide pairs in the A+T rich region makes the search for homologous positions more difficult, since GC sites are most probably less affected by mutations because of a possible functional role. This is particularly true when looking for similarities among distant taxa, but it is not the case when looking at intraspecific variation. In fact, it was the comparison of individual honey bees from different locations that allowed the detection of highly conserved blocks that have been under severe constraint with respect to both point mutations and length variation events. The conserved blocks are thus suitable for use in phylogenetic studies given their low mutation rate. In this sense, the network topology obtained demonstrates that a few variable characters from conserved blocks allow a plausible connection among individuals to be recovered. If a network analysis is performed for each population separately, the topology becomes even clearer and almost free of the redundant sites or complex relationships among haplotypes that usually appear as highly complex reticulations (data not shown). This means that within populations the relationships among haplotypes are straightforward, making each population a solid and discrete unit. The substitution rate of these conserved blocks is probably much slower than that of third positions in coding regions, something that was already found in Drosophila and mosquitoes (Caccone et al. 1996; Brehm et al. 2001). Such a constraint on mutation rates further points to a possible functional role of these regions. It will be interesting to verify whether this mutation rate in other Apis species agrees with previous observations in other organisms.

The length-variable segments should be used with caution for phylogenetic inferences, since they may obscure relationships because of recurrent variation in the number of repeats and, equally important, the occurrence of mutations within the repeat units. Zhang and Hewitt (1997) claimed that length variation at the intraspecific level is mainly due to a high mutation rate, and although we did not find any signs of heteroplasmy in the individuals studied, the obvious pattern of these tandem repetitions among populations could well be the result of a phylogenetic relationship. Until now, length variation in the genus Apis has been observed in the intergenic region between the mtDNA cox2 and tRNAleu genes but not in the A+T rich region (Cornuet et al. 1991). The present data underline the importance of looking at intraspecific variation in order to detect such variation. The fact is that the control region in A. mellifera has important segments of variable size due solely to a variable number of tandem repeats. Moreover, mutations occurring within these variable repeats are in some cases individual specific (e.g., VB2). Other models have been proposed to explain such observations (for a review, see Zhang and Hewitt 1997), but most probably the length variation typically observed in the A+T rich region is better explained by replication slippage. Apart from the limited variable blocks in A. mellifera, the reduced variability of most of the A+T rich region may be due to high mutational pressure towards AT, TA, or GC/AT substitutions.

In conclusion, the A+T rich region of A. mellifera appears to contain valuable information, especially at the population level. The region has important conserved blocks with a limited number, if any, of parsimony-informative positions, which prevents its use to infer phylogenetic relationships. Nevertheless, the information retrieved from the comparison of individuals belonging to different populations was crucial to understand and characterize how the control region is organized in A. mellifera. Comparison of the A. mellifera A+T rich region with that of other species of the same genus will be most valuable to confirm the suggestions made in the present study regarding the role of particular sequences within the A+T rich region of this insect.