1 Introduction

Contemporary honey bee breeding programs often use a closed population mating structure to enhance phenotypes of interest (Laidlaw and Page 1997). Matings in these programs typically are controlled using instrumental insemination or geographic isolation. A limitation of these breeding strategies is the inherent risk of low genetic diversity and concomitant inbreeding-related issues which can directly impact colony productivity. Honey bees are particularly vulnerable to complications associated with low genetic diversity largely because of their haplodiploid sex-determining system which relies on heterozygosity at the complementary sex-determiner locus (csd). At csd, heterozygosity results in females, hemizygosity results in males, while homozygosity produces abnormal diploid males generally destroyed by worker bees (Woyke 1963). At the colony level, homozygosity at csd results in a loss of brood that can detrimentally affect population growth and productivity (Tarpy and Page 2001; Woyke 1980; Woyke 1981). Management of allelic diversity at this locus is critical for breeding programs, and specific information about csd allele distribution could directly benefit mating decisions (Page and Laidlaw 1982; Page and Marks 1982). Evaluation of csd in breeding programs in New Zealand (Hyink et al. 2013) and Russia (Kaskinova et al. 2019) found levels of allelic diversity and patterns of frequency distributions comparable with those found in minimally managed populations of honey bees worldwide (Lechner et al. 2014; Zareba et al. 2017).

Largely due to the economic impact of Varroa destructor, significant effort by USDA recently has gone into developing new stocks of bees resistant to this parasitic mite. Two, in particular, have been developed focusing on the Varroa-sensitive hygiene (VSH) trait (Harbo and Harris 2005). The additive genetics of VSH (Harbo and Harris 2001) enabled development of ‘Pol-line’ stock through introgression of VSH into bees managed by several large-scale beekeepers (Danka et al. 2016). Pol-line was developed during 2008–2014 by selecting colonies having good beekeeping functionality and mite resistance. Since 2014, the stock has been maintained by standard closed-population breeding; approximately 200 colonies are created each year by propagating queens from 12 to 16 queen lines and instrumentally inseminating the queens with pooled semen collected from drones from all available queens. A second, subsequent breeding effort in a public-private partnership currently is seeking to further improve Varroa-resistant honey bees for use in commercial beekeeping. This population of ‘Hilo’ bees now consists of foundational breeding lines derived from original VSH material and Pol-line stock (sourced during 2010–2016) and commercial Italian stock; these multiple sources should promote genetic diversity. Hilo bees recently have begun to be maintained in a semi-closed population, i.e., one that will be enlarged as selection continues before the stock is distributed widely to the beekeeping industry.

We assessed genetic diversity at the csd locus, specifically within the hypervariable region (HVR) in both Pol-line and Hilo honey bee stocks. Determination of differential patterns in allele frequencies coupled with phylogenetic analysis enabled us to characterize the diversity-related effects of breeding practices in the Pol-line and Hilo breeding programs. This information can be applied in each of the respective breeding programs to maintain or modify csd diversity, as needed.

2 Methods and materials

2.1 Sample collection

Drones were collected from 47 Pol-line colonies in Louisiana, USA and 42 Hilo bee colonies in Hawaii, USA; most colonies had queens of known pedigrees from the respective stocks’ breeding populations. A minimum of six drones (purple-eyed or older pupae or teneral adults) were collected per colony in the field and immediately placed on ice, then stored at − 20 °C until being processed. Pol-line samples were collected from five apiaries in autumn 2018 and summer 2019. Pol-line samples represented 17 lines that did not share maternal ancestry in the prior generation. Hilo samples were collected from ten apiaries in summer 2019. Maternal Hilo lines included 16 derived from VSH or Pol-line stocks, 11 derived from a commercial Italian stock, and 5 derived from high-performing field colonies of the mite-resistant or Italian sources but whose maternal lineage was unknown.

2.2 DNA extraction

DNA was extracted from the thorax of each bee as previously described (Bourgeois and Rinderer 2009), with some modifications. The process is summarized here. Samples were first homogenized in lysis buffer (100 mM Tris at pH 8.0, 10 mM EDTA at pH 8.0, 1% SDS) and 5 mm stainless steel beads for 3 min at a rate of 30 BPS and then treated with proteinase K (20 mg/mL) at 70 °C for 10 min. Protein precipitation was then completed, followed by ethanol precipitation and lyophilization. Pure genomic DNA was rehydrated in Millipore filtered and deionized dH2O and stored at − 20 °C.

2.3 csd amplification, screening, and sequencing

Region 3 of the csd gene was amplified following the protocol defined by Hyink et al. (2013). Briefly, six drones per colony were tested for the presence of their queen’s csd alleles. The initial screening assay included PCR amplification of region 3 of the csd gene, including the RS domain, the HVR, and P-rich region in exons 6–8 (Beye et al. 2003). The primers (Hyink et al. 2013) included flanking sequences to facilitate direct sequencing of the PCR products in the haploid drones. We used M13FConcsdrev: 5′-GTAAAACGACGGCCAGTCATCTCATWTTTCATTATTCAAT-3′ and AD1genoRfw: 5′TCGCTGTCGGTGAAGACRATATGAAAAATTACACAATGA-3′ primers. Amplification conditions were optimized and used for all subsequent reactions. The optimized amplification profile was 2 min at 94 °C, followed by 35 cycles of 15 s at 94 °C, 30 s at 48 °C, 15 s at 60 °C, 1 min at 65 °C, 5 min at 65 °C, and a 4 °C hold. Each 20 μL reaction included 50 ng of template DNA, 1.5 pmol of each primer, and 10 μL of 2X Platinum II Hot-Start Green PCR Master Mix (2X) (providing final reaction concentration of 1.5 mM MgCl2, Invitrogen).

A 10-μL aliquot of each drone PCR product was digested with FastDigest VspI (Thermofisher Scientific) restriction endonuclease for 7 min at 37 °C and visualized on a 2% E-Gel agarose gel infused with SYBR Safe DNA gel stain (Thermofisher Scientific, Inc.). Two drones with an RFLP profile that represented both of their queen’s alleles were selected and processed for DNA sequencing. The sequencing temperature profile was 1 min at 96 °C followed by 25 cycles of 10 s at 96 °C, 5 s at 50 °C, and 4 min at 60 °C. Sequencing reactions contained 1.5 μL of PCR product, 0.5× strength of Big Dye Terminator v3.1, 2.0 μL sequencing buffer and either the forward (M13FS-40: 5′-GTAAAACGACGGCCAGT-3′) or reverse (AD1: 5′TCGCTGTCGGTGAAGAC-3′) primers (Hyink et al. 2013). Sequencing reactions were purified with 100% EtOH and 0.125 M EDTA 8.0 and pellets reconstituted in 10 μL of HiDi formamide (ThermoFisher Scientific, Inc) and visualized on a 3500 Genetic Analyzer (ThermoFisher Scientific).

2.4 Data analysis

DNA sequences were checked for validity, aligned using ClustalW (Thompson et al. 1994) with manual adjustments and translated to csd protein sequences in MEGA-X (Kumar et al. 2018). Sequence accuracy was confirmed through BLAST search of known csd sequences. Haploid DNA sequence data were analyzed in DnaSP (Rozas et al. 2017) for diversity measurements, considering insertions/deletions as multiallelic states and using the sliding window option to account for alignment gaps that are inherent to the HVR of csd. The protein sequences of the HVR were aligned in MEGA-X using ClustalW, trimmed to include the HVR and the immediate flanking regions (Figure 1), and subsequently analyzed for HVR length and number of amino acid substitutions. HVR delineation was determined following (Hyink et al. 2013). HVR allele frequencies and overlap were determined in Excel. Phylogenetic analyses of both DNA and protein sequences were conducted in MEGA-X. The neighbor-joining tree was built in MEGA, based on number of amino acid differences in pairwise comparison (Figure 2). Sequence similarity between alleles in our population and those of a reference data set including a set of presumably functional heterozygotes found in honey bee populations from Kenya (Lechner et al. 2014) were explored using a network approach (Figure 3). We constructed a sparse adjacency matrix using the protein alignment which retained edges only between those alleles that had a sequence similarity greater than 95% across our region of interest. Our postulate is that alleles with this degree of similarity are likely to produce effective homozygotes. The network of csd alleles and visualizations were created using the igraph package in R (Csardi and Nepusz 2006).

Figure 1.
figure 1

A representative alignment of the amino acids in the hypervariable region (HVR) of the csd locus. Highlighted amino acids are conserved. The repetitive region is designated.

Figure 2.
figure 2

Maximum likelihood phylogenetic tree and allele frequency histogram of Pol-line- and Hilo-specific csd protein alleles. The figure outlines phylogenetic relationships inferred from protein sequence alignment between csd alleles. Each tip of the tree is labeled with the allele name.

Figure 3.
figure 3

Network and distribution of unique csd protein alleles. Each node represents one unique sequence in the data set. The size of the node corresponds to the total number of individuals with the allele, and the colors correspond to each of the populations that comprise each node. The Africa population data originates from Lechner et al. (2014). Lines in the chart link those alleles meeting our sequence similarity threshold (sequence identity > 95%) of the protein alignment. The purple arrow and outline mark the allele group with the highest frequency, and the teal arrow and outline mark the largest allele grouping.

3 Results

3.1 DNA diversity of the csd locus

We screened drones from 47 Pol-line colonies and 42 Hilo colonies. Usable sequences (of appropriate length and quality) from each colony were obtained for a total of 166 drones, 88 and 78 of which represented the Pol-line and Hilo stocks, respectively. Among these DNA sequences, 83 Pol-line and 62 Hilo DNA haplotypes were identified. Haplotype diversity was comparable between the two stocks (0.992 and 0.998, respectively; 0.997 over all samples). Nucleotide diversity (π) levels showed a similar pattern, for both stocks and all samples combined (0.059, 0.042, 0.051, respectively).

3.2 Amino acid diversity of the csd locus

HVR length varied in a continuous distribution of 10 to 29 amino acids (without gaps), with the exception of one Pol-line allele having 40 (Figure 1, Supplemental file F1). A total of 60 alleles were identified, 11 of which were shared between the two stocks (Figure 2). The number of pairwise amino acid differences for all colonies (including gaps) ranged from 1 to 42. Our assessment identified 35 allele groups based on sequence similarity, of which 14 were found only in Pol-line, 7 only in Hilo, and 14 shared between the stocks (Figure 2). In the allele network (Figure 3), all alleles are represented and groups of those alleles similar by sequence are connected by lines. Frequency distribution patterns within each allele group demonstrate stock-specific biases.

4 Discussion

Both the frequency distribution and diversity of alleles within the HVR of the csd locus in both Pol-line and Hilo stocks were high at the nucleotide and amino acid level and comparable with those reported in various populations of Apis mellifera worldwide. Nucleotide diversity levels worldwide ranged from 0.03 to 0.09 (Hasselmann and Beye 2006; Hasselmann et al. 2008; Lechner et al. 2014; Wang et al. 2012), encompassing the range we found in our samples. Allelic diversity of the csd protein is most commonly reported as frequency distributions and total numbers of protein alleles, showing total values from 16 alleles in small, closed populations, to over 100 alleles when assessed regionally in open populations (Hyink et al. 2013; Kaskinova et al. 2019; Zareba et al. 2017). Proportionally, our samples fell within this range.

The length of the HVR shows considerable variation across populations worldwide, regardless of whether queens are open mated or in closed populations. This is exemplified by HVR lengths reported in Kenyan and global samples ranging from 6 to 33 amino acids (Lechner et al. 2014) and in a smaller sampling of a closed breeding system ranging from 21 to 38 amino acids (Hyink et al. 2013). Our ranges in size of the HVR (10–29 for all but one sample) and numbers of pairwise differences fell within those reported for both open and closed populations of bees (Hasselmann and Beye 2006; Hasselmann et al. 2008; Hyink et al. 2013; Kaskinova et al. 2019; Lechner et al. 2014; Wang et al. 2012; Zareba et al. 2017). Variance in HVR length and diversity levels are inherently intertwined with sample size; however, the apparent high mutation rate of the HVR contributes to considerable variation that is detectable even in relatively low numbers of breeding lines, as was seen with the Pol-line stock being represented by only 17 breeding lines in our samples. The high variability in Pol-line stock may have come from extensive outcrossing in the original breeding plan (Danka et al. 2016) and may have been sustained after the population was closed by mating queens with pooled semen collected from drones of many colonies.

The network of allele groups visually demonstrates not only the relationship among alleles, as does the phylogeny, but also the frequencies and proportional population representation of each of those alleles (Figure 3). We employed a stringent minimum threshold of > 95% sequence similarity in pairwise comparisons to constitute an allele group. The high recombination rate in the honey bee genome and concomitant high mutation rate found within the HVR makes it difficult to define what constitutes functional differences between alleles. Beye et al. (2013) previously described functional alleles as having an average of 4.7 differences in amino acid composition, a more conservative and empirically tested number. Zareba et al. (2017) followed the same criteria (Beye et al. 2013). Lechner et al. (2014) further addressed the concept of functional heterozygosity of csd alleles, defining the minimum difference for a pair of alleles to be considered as functional is: dHVR ≥ 6, dPSD ≥ 1, and 3dPSD + 2de8 ≥ 9, where: dHVR is the difference in the length of the HVR region; dPSD is the number of amino acid mismatches in the PSD region; and de8 is the number of amino acid mismatches in exon 8. Our grouping criteria are bioinformatically derived and require empirical testing to demonstrate functional heterozygosity, but do serve as a general guideline to examine the degree of similarity in csd alleles across our breeding populations.

Our analysis highlights the interconnectedness of the Pol-line and Hilo populations, based on common csd alleles. This relationship stems from the genetic history of the two stocks; at their founding, both stocks are rooted in Italian bees that exhibited high Varroa resistance facilitated by the VSH trait (Danka et al. 2016). The largest two allele groups showed very divergent relationships among individual protein alleles, in terms of frequency distribution and population of origin. The most frequent allele group was predominantly populated with bees in the Pol-line stock, with a more heterogeneous distribution among Hilo and African bees (Figure 3). In contrast, the most widespread allele group, also with the second highest overall frequency, was composed primarily of Hilo and African bees (Figure 3). High diversity is evident in the quantity of allele groups and between sequences within groups (Figures 2 and 3). In addition, we see several unique allele groups in each stock. Pol-line and Hilo stocks have been isolated from one another (i.e., no gene flow) for approximately 3 years, and integral to the Hilo breeding strategy is introgression of genes from commercial Italian stock. Also, only a fraction of the Pol-line genetics contributed to the Hilo program. These characteristics may explain the differential bias in alleles and their distribution across the two stocks. When both Hilo and Pol-line HVR alleles are compared with a population from Kenya, commonalities are evident among all three groups (Figure 3). No direct connection exists between these populations, suggesting that alleles in Hilo and Pol-line that are shared with the Kenya bees are identical by descent.

Diversity at csd is being maintained despite different mating structures being used in these two breeding programs. Unlike typical populations of honey bees in which queens are open mated and admixture is common (Harpur et al. 2012), the breeding programs we studied here use instrumental insemination to maintain complete control over drone sources. In Pol-line, queens of each new generation are mated using pooled semen collected from drones of many lines in the population. In Hilo, pedigree information is used to avoid inbreeding by mating unrelated or distantly related lines. Combining csd allele designations with pedigree information to inform breeding choices (i.e., drone selection) should benefit both types of breeding programs as the populations diverge, this type of assessment may prove fruitful.

The primary concern over compromised csd diversity, especially in closed breeding populations, is reduced brood viability (Tarpy and Page 2001). The data provided here indicate high csd allelic diversity despite the closed breeding approach. The information developed for specific colonies here can be used to increase the probability that uncommon alleles are purposefully maintained as breeding choices are made. As both programs are relatively new, compared with long-standing open-mated commercial populations, periodic monitoring of csd diversity is recommended.