1 Introduction

It is sometimes necessary or desirable to obtain estimates of the density of honey bee colonies in an area, either as an instantaneous estimate, or across time (Utaipanon et al. 2019b). Circumstances in which such information may be needed include planning a biosecurity response to an incursion of an exotic pest or disease, assessing the impacts of feral honey bees on native flora and fauna, quantifying any declines in honey bee abundance over time, or determining whether there are sufficient colonies present in an agricultural area to provide an adequate pollination service (Breeze et al. 2011; Potts et al. 2016; Nolan and Delaplane 2017; Utaipanon et al. 2019b).

Assessing the density of wild honey bee colonies by direct surveys is almost impossible. First, the density can be as high as 150 colonies/km2, with a non-uniform distribution of nests that is influenced by hollow availability and the distribution of food resources (McNally and Schneider 1996; Oldroyd et al. 1997). Second, honey bee colonies mostly build their nests in lofty locations in trees (Seeley and Morse 1976), human-built structures, or caves and rock faces (Oldroyd and Wongsiri 2006) where they can be difficult to spot. For the cavity-nesting species (Apis mellifera, A. cerana, A. koschevnikovi, A. nuluensis and A. nigrocincta), the nest entrance is often small and cryptic, making it even more difficult to identify nests by direct observation than it is with the open-nesting species (A. dorsata, A. laboriosa, A. florea and A. andreniformis).

Unlike for most other insects, sampling foragers is not an appropriate method for estimating honey bee densities. Foragers recruit their nestmates to the best food sources (Visscher and Seeley 1982), and their foraging range can exceed 10 km (Eckert 1933; Visscher and Seeley 1982; Beekman and Ratnieks 2000; Steffan-Dewenter and Kuhn 2003). This means that the density of honey bees in an area can be drastically over- or under-estimated depending on when and where the physical sampling takes place (Utaipanon et al. 2019b).

An alternative to direct searches for colonies or foragers is to sample male bees at their mating aggregations (Moritz et al. 2008; Jaffé et al. 2010; Arundel et al. 2014; Hinson et al. 2015; Utaipanon et al. 2019b). Mature honey bee drones go on mating flights each afternoon, weather permitting. The exact timing of these flights depends on the species, the season and the region (Koeniger and Koeniger 1991; Loper et al. 1992; Yoshida et al. 1994; Otis et al. 2000). Once they leave their nest, drones fly to mating leks called Drone Congregation Areas (DCAs) where they search for a virgin queen with which to copulate (Ruttner 1974; Ruttner 1976). Each species has its own preferences for the location of a DCA. The DCAs of the western honey bee, A. mellifera, tend to be in an open space surrounded by tree lines such that there is a reduced horizon. (Suburban sports fields are often sites for DCAs.) These characteristics of DCAs allow human investigators to locate likely sites, often by using satellite maps such as Google Earth (Mortensen and Ellis 2014).

Once a DCA has been located, drones can be lured into a Williams drone trap (Williams 1987; Moritz et al. 2007; Hinson et al. 2015; Utaipanon et al. 2019a). The Williams trap is a tapered cylinder about 1 m long and 500 mm in diameter at the base, made up of three metal loops covered by thin insect netting (Scheiner et al. 2013). The trap uses dummy queens as lures to attract males. The lures are blackened cigarette filters infused with synthetic 9-oxo-2-decenoic acid (9-ODA) (Williams 1987), a major component of the queen’s sex attractant, queen mandibular pheromone (Butler et al. 1962; Brockmann et al. 2006). The lures are hung both inside and slightly beneath the trap, where they serve as both visual and chemical cues for males. The trap is normally lifted by a helium-filled weather balloon to about 10–30 m above the ground (Dietemann et al. 2013; Mortensen and Ellis 2014). Drones are attracted by the movements of the queen lures. Once a drone approaches a lure, he tends to fly upward and is caught inside the net. In good weather, it is easy to collect several hundred males in 30 min.

Males are assumed to be drawn from a circular area surrounding the DCA (Utaipanon et al. 2019a). Microsatellite genotypes of trapped drones can be used to group brother drones into families and to infer the number of families (colonies) in the trapping area that contributed drones to the sample (Kraus et al. 2005b; Moritz et al. 2007; Jaffé et al. 2010; Arundel et al. 2013; Moritz et al. 2013; Arundel et al. 2014; Utaipanon et al. 2019a). Honey bee males are haploid, greatly facilitating the assignment of males to individual families using maximum likelihood. This calculation approach is implemented in COLONY (Wang 2005), along with algorithms for minimizing the effects of genotyping and miscellaneous errors (Wang 2004; Wang 2013). The number of colonies inferred using this technique has acceptable accuracy provided that the sample size is adequate (Utaipanon et al. 2021). After the number of colonies present in a sample has been determined, this count can be converted into a colony density estimate based on the assumption that drones fly a maximum of 3.75 km from their colony to a DCA (Utaipanon et al. 2019a). Therefore, a sample of drones obtained at a DCA samples the colonies in an area of 44 km2.
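The conversion from a colony count to a density can be sketched as follows. This is a minimal illustration, assuming only the circular catchment and the 3.75 km maximum flight distance stated above; the function names and the colony count of 100 are hypothetical and are not taken from the study.

```python
import math


def catchment_area_km2(max_flight_km: float = 3.75) -> float:
    """Area (km2) of the circular catchment assumed to surround a DCA."""
    return math.pi * max_flight_km ** 2


def colony_density(n_colonies: float, max_flight_km: float = 3.75) -> float:
    """Colonies per km2, given the number of colonies inferred at one DCA."""
    return n_colonies / catchment_area_km2(max_flight_km)


# Hypothetical example: 100 colonies inferred from a single DCA sample.
area = catchment_area_km2()      # about 44 km2, as stated in the text
density = colony_density(100)    # colonies per km2
print(f"catchment = {area:.1f} km2, density = {density:.2f} colonies/km2")
```

Because the area scales with the square of the assumed flight distance, the density estimate is sensitive to that assumption.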

One of the limitations of the drone-sampling technique arises as a consequence of honey bee reproductive biology; colonies do not produce drones all year round (Allen 1958; Allen 1963; Free and Williams 1975; Seeley and Mikheyev 2003). Generally, mature drones are abundant during the period from spring to autumn, but largely absent from late autumn to early spring, when they are ejected from their colony and die (Free and Williams 1975; Currie and Jay 1988). In addition, even during the breeding season, drone flight activity is strongly affected by temperature and wind speed (Reyes et al. 2019). Hence, the sampling period can be critical to the quality of the sample in terms of its size and the proportion of colonies that are successfully sampled. Further, the larger the sample size, the more accurate the grouping of drones into families will be. Larger sample sizes also minimize non-sampling error, that is, the failure to sample drones from a colony within the DCA’s catchment. Utaipanon et al. (2021) showed that, provided the average number of sampled drones per family is equal to or greater than six, the number of colonies represented in a sample is estimated with less than 10% error from all sources (non-sampling and misclassification).
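The sample-size criterion of Utaipanon et al. (2021) can be expressed as a simple check. The sketch below is illustrative only: the function names and the example numbers are hypothetical, and the estimate of additional drones assumes that the number of inferred families stays fixed, which is unlikely to hold exactly because larger samples tend to reveal further families.

```python
import math


def drones_per_family(n_drones: int, n_families: int) -> float:
    """Average number of sampled drones per inferred family."""
    if n_families <= 0:
        raise ValueError("n_families must be positive")
    return n_drones / n_families


def sample_is_adequate(n_drones: int, n_families: int,
                       threshold: float = 6.0) -> bool:
    """True if the average number of drones per family meets the threshold."""
    return drones_per_family(n_drones, n_families) >= threshold


def extra_drones_needed(n_drones: int, n_families: int,
                        threshold: float = 6.0) -> int:
    """Lower bound on additional drones needed to reach the threshold,
    ASSUMING the number of inferred families does not increase."""
    return max(0, math.ceil(threshold * n_families - n_drones))


# Hypothetical example: 300 drones grouped into 60 families.
print(drones_per_family(300, 60))     # average drones per family
print(sample_is_adequate(300, 60))    # below the threshold of six
print(extra_drones_needed(300, 60))   # minimum further drones to collect
```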

COLONY utilizes population allele frequencies to obtain the most likely grouping of genotypes into families. In practice, allele frequencies are calculated directly from the genotyping data using a group-likelihood approach. This approach helps correct the allele frequencies for the unequal numbers of offspring from each family (Wang 2004). However, it is also possible to use independent estimates of allele frequency obtained by other means—for example as a result of repeated sampling at the same site. In theory, the better the allele frequency estimate, the more accurate the inference of family membership. However, there is a caveat to this generality because allele frequency is likely to fluctuate somewhat over time (Kraus et al. 2005a; Jaffé et al. 2009). Thus, using repeated samples from the same site may introduce its own set of problems.

In this study, we quantify the significance of sampling at different times of year on the number of inferred colonies at the same sites. To do this, we repeatedly collected drones at two highly populous DCAs every month during the drone-flight season for 2–2.5 years using a Williams drone trap (Williams 1987). We then analysed drone genotypes by year and by month for each DCA to infer the number of colonies represented at the DCA. Based on these analyses, we explore the effects of sample size and time of sampling on the estimate of the number of colonies present in the area. We also calculate independent estimates of allele frequency based on various groups of samples, using the estimates generated by COLONY itself during its calculations. We then explore the effects of using the different estimates of allele frequency on the reconstructed queen populations. Finally, we make recommendations about handling data sets that have been collected over time, or at different times of the year.

2 Methods

2.1 Sample collection

We sampled two DCAs in Sydney, Australia. The first site was number 1 oval at the University of Sydney’s (USYD) Camperdown campus (33° 53′ 16.1″ S, 151° 11′ 6.1″ E). This site is on the fringe of the CBD, and the DCA has been established since before 1980 (personal observations of BPO). The second site was Barra Brui oval (BB), St. Ives, NSW, Australia (33° 44′ 36.3″ S, 151° 10′ 3.8″ E), on the north east urban fringe of Sydney, which is surrounded by remnant bushland. Each site is a large sports field surrounded by trees. The two sites are 16 km apart, much further than the typical drone flight distance (Utaipanon et al. 2019a).

Drones were caught every month using Williams drone traps except during winter when we confirmed that no drones were present. Drones were preserved in 95% ethanol once trapped. If the weather on the sampling day was not suitable for drone flight, i.e. cloudy or windy, or if ‘comets’ of drones (i.e. 10–100 drones chasing each other) were not present during the sampling, we resampled the site during the following week. At the USYD site, we sampled from September 2017 to April 2019. At the BB site, we sampled from January 2018 to April 2020 (Table I).

Table I Numbers of inferred colonies (NC) present in the sample from each month using the monthly data and yearly data, and the number of inferred colonies using the monthly data. Estimates include the number of undetected colonies based on the Poisson distribution estimates of missing colonies (NT). Greyed-out cells indicate that sampling was not attempted. USYD = Number 1 oval, University of Sydney. BB = Barra Brui oval, St. Ives

2.2 Microsatellite genotyping

DNA of sampled drones was extracted from one hind leg using the Chelex protocol (Walsh et al. 1991). Extracted DNA was then used as template for nine pairs of microsatellite primers: A8, A24, A29, A79, A88, A107, A113, B124 and HbThe3 (Estoup et al. 1994; Solignac et al. 2003). These loci are not linked to each other or to centromeres. Reactions were performed in two multiplexed PCRs using fluorescently labelled primers. PCR products were electrophoresed in a 3130xl Genetic Analyzer (Thermo Fisher Scientific, Waltham, MA, USA). The output from the analyzer was collated and corrected using GeneMapper v5 software (Applied Biosystems®), and final microsatellite genotypes were assembled manually. We minimized the effects of genotyping errors and unequal information among datasets by removing any drone genotype in which more than one locus failed to amplify. Information on allelic diversity for each locus and population is given in Table S1. The genotype of each drone is given in Tables S2 and S3.
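The filtering rule described above (discard any drone with more than one failed locus) can be sketched as follows. The data structure is an assumption made for illustration: each haploid genotype is a mapping from locus name to allele size, with None marking a locus that failed to amplify. The drone identifiers and allele sizes are hypothetical.

```python
from typing import Dict, Optional

# The nine loci named in the text.
LOCI = ["A8", "A24", "A29", "A79", "A88", "A107", "A113", "B124", "HbThe3"]

# Haploid genotype: one allele (fragment size) per locus, None if it failed.
Genotype = Dict[str, Optional[int]]


def n_failed(genotype: Genotype) -> int:
    """Number of loci with no allele call (missing key or None)."""
    return sum(1 for locus in LOCI if genotype.get(locus) is None)


def filter_genotypes(genotypes: Dict[str, Genotype],
                     max_failed: int = 1) -> Dict[str, Genotype]:
    """Keep only drones with at most `max_failed` failed loci."""
    return {drone: g for drone, g in genotypes.items()
            if n_failed(g) <= max_failed}


# Hypothetical genotypes (allele sizes are invented for illustration).
complete = {locus: 150 + i for i, locus in enumerate(LOCI)}
one_missing = dict(complete, A8=None)
two_missing = dict(complete, A8=None, A24=None)

kept = filter_genotypes({"d1": complete, "d2": one_missing, "d3": two_missing})
print(sorted(kept))   # drones retained for family reconstruction
```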

2.3 Inference of number of colonies

Drone genotypes were used to group individuals into families of brothers using COLONY (Wang 2004; Yang et al. 2013). To explore the effects of sample size and the method used to estimate allele frequency on sibship assignment, we analysed the entire data set (2 years for USYD and 2.5 years for BB) in three alternative ways to determine which performs best (Figure 1).

Figure 1. Diagram of the inference methods used in this study.

First, drones were grouped together based on the pooled sample for each site-year combination, that is, the combined September-April samples in each of the 2 years were treated as a single sample. Then, the number of colonies present in each month was determined by assigning each individual drone to the month in which it was caught (method (i) in Figure 1).
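The second step of method (i), counting the colonies represented in each month after families have been reconstructed from the pooled sample, can be sketched as follows. The input format is an assumption: one (month, family identifier) pair per drone, where the family identifiers stand in for the sibship groups returned by the pooled analysis. All values shown are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def colonies_per_month(assignments: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct families (colonies) observed in each month.

    `assignments` holds one (month, family_id) pair per drone, with the
    family_id taken from the family reconstruction of the pooled sample.
    """
    families_by_month: Dict[str, Set[str]] = defaultdict(set)
    for month, family_id in assignments:
        families_by_month[month].add(family_id)
    return {month: len(fams) for month, fams in families_by_month.items()}


# Hypothetical pooled result: family F1 is seen in October and December
# but not in November.
pooled_assignments = [
    ("2018-10", "F1"), ("2018-10", "F1"), ("2018-10", "F2"),
    ("2018-11", "F2"), ("2018-11", "F3"),
    ("2018-12", "F1"),
]

monthly_counts = colonies_per_month(pooled_assignments)
total_colonies = len({family for _, family in pooled_assignments})
print(monthly_counts, total_colonies)
```

Because family identifiers are shared across months, a colony that is sampled in several months is counted once in the pooled total but contributes to each monthly count.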

Second, we explored the effects of the alternative methods of estimating population allele frequencies (method (ii) in Figure 1). Here we used COLONY to calculate the number of families present in each monthly sample, but used the allele frequencies calculated by COLONY from the entire year’s data, rather than just the month in question.

Finally, we analysed each month’s genotype data separately. From these monthly data inputs, we obtained numbers of colonies per month using the allele frequencies that COLONY estimates by default from each individual month’s data (method (iii) in Figure 1). This kind of analysis (a single sample from a single site) is what is typically done in practice (Moritz et al. 2007; Utaipanon et al. 2019a).

Data were analysed using COLONY 2.0.6.5 with the following settings: long length runs, haplodiploid mating system, 3 replicate runs, updated allele frequency and full likelihood. We ran the analysis twice: once with sibship scaling and once without sibship scaling.

2.4 Data analysis

We compared the number of families estimated with and without sibship scaling and found that the numbers of families inferred from the second-year data at USYD and first-year data at BB changed by less than 2.6%, while the other two samples returned similar numbers of colonies. ‘Scaling-On’ is commonly applied when analysing data with COLONY because it is thought to provide greater precision (Wang 2013; Utaipanon et al. 2021). We therefore report our results below with the sibship scaling setting ‘on’.

We used Wilcoxon paired signed-rank tests to determine whether there were significant differences in the number of families inferred from each month using the three alternative sampling schemes. In addition, we evaluated the adequacy of sample size for each method using the criterion that the average number of drones (Di) per inferred family (NC) should be equal to or greater than 6 (Utaipanon et al. 2021). Finally, we used a Monte Carlo version of Fisher’s exact test to perform pairwise comparisons of allele frequencies of DCA populations across months. The tests were performed in R 3.6.2 (R Core Team 2016) using MASS (Venables and Ripley 2002) and stats packages (R Core Team 2016).
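For readers unfamiliar with the paired signed-rank test, the statistic V (the sum of the ranks of the positive paired differences) can be computed as sketched below. This is not the R code used in the study: it is a stand-alone illustration that drops zero differences, assigns average ranks to ties, and uses a normal approximation without continuity correction for the two-sided p-value, which may differ from the p-value reported by R. The monthly and yearly counts are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float],
                         y: Sequence[float]) -> Tuple[float, float]:
    """Return (V, approximate two-sided p) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    abs_sorted = sorted(abs(d) for d in diffs)

    # Average rank for each distinct absolute difference (handles ties).
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs_sorted[j + 1] == abs_sorted[i]:
            j += 1
        rank_of[abs_sorted[i]] = (i + j) / 2 + 1
        i = j + 1

    v = sum(rank_of[abs(d)] for d in diffs if d > 0)

    # Normal approximation with a correction of the variance for ties.
    mean = n * (n + 1) / 4
    tie_sizes = Counter(abs_sorted).values()
    var = n * (n + 1) * (2 * n + 1) / 24 - sum(t ** 3 - t for t in tie_sizes) / 48
    z = (v - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return v, p


# Hypothetical colony counts for the same six months under two schemes.
monthly = [10, 12, 8, 15, 9, 11]
yearly = [30, 25, 28, 40, 22, 35]
v, p = wilcoxon_signed_rank(monthly, yearly)
print(v, p)   # V is 0 when every monthly count is below its yearly counterpart
```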

3 Results

We collected 5074 drones at USYD over a 2-year period, and 4944 from BB over 2.5 years. Drone mating seasons ended and began at similar times at both locations. Generally, drones visited DCAs from the beginning of spring (September–October) until mid-autumn (March–April) each year. The number of drones caught each month was highly variable, ranging from a minimum of 3 to a maximum of 574 drones. This was despite the fact that we repeated sampling within 1 week whenever fewer than ca. 300 drones were caught in a given month (Table I). This suggests that the population of mature drones present in the area was very low during those months when we caught few drones, and that our low sample size during these months was a consequence not of abiotic factors but of an actual absence of drones.

The estimated number of colonies that contributed drones to the trap varied according to the season and the data set used in the analysis (Table I). When we used the annual dataset of drone genotypes (method (i) in Figure 1), there were 243 and 246 unique colonies found at USYD in the first (2017–2018) and second (2018–2019) years respectively. At BB, we inferred 101 colonies from the January–April 2018 samples, and 274 (September 2018–April 2019) and 257 (September 2019–April 2020) colonies from the full-year samples.

The number of inferred colonies present in each month’s samples based on the data from individual months (method (iii), Figure 1) was significantly smaller than that inferred using annual data sets (V = 0, df = 34, p < 0.001, Table II). In addition, the average number of drones per inferred family (Di:NC) (Utaipanon et al. 2021) in each month based on the monthly data was relatively low (Table II), whereas the Di:NC ratios were above six for all analyses based on yearly data. When allele frequencies based on the entire year’s data were used for the analysis of the monthly data (method (ii), Figure 1), the number of inferred colonies per month remained similar to that inferred based on monthly allele frequencies alone (method (iii), Figure 1) (Table III).

Table II Numbers of drones (Di) captured in each month and the average number of drones per inferred family (Di/NC) using different techniques. Greyed months indicate that no sampling was conducted
Table III Comparison of the number of inferred colonies (NC) using monthly data, based on yearly (f) allele frequencies and individual month allele frequencies

Allele frequencies of sampled drones at each DCA varied significantly between months within seasons (Fisher’s exact test, p < 0.001 (111 pairs)), though this was most likely due to the lower sample size per month rather than actual changes in allele frequency.

3.1 Estimates of non-sampling error

The number of non-sampled colonies can be estimated by fitting the distribution of family sizes to a truncated Poisson distribution (Baudry et al. 1998; Utaipanon et al. 2019b). We can calculate the total number of colonies, including the missing colonies (NT), from:

$${N}_T=\frac{N_d}{\hat{\lambda}}$$

where \(\hat{\lambda}\) is the maximum likelihood estimate of the expected number of drones per family (estimated from the observed mean number of drones per family), and Nd is the total number of sampled drones (Baudry et al. 1998; Utaipanon et al. 2019b). Based on this calculation, there were only 0 or 1 non-sampled colonies per month when the monthly data were used (Table I and Table S4). We report the number of colonies following the Poisson correction (NT) (Table I). However, since the distribution of family sizes was a poor fit to the Poisson distribution in all cases (P < 0.001, see Table S4), the reliability of this correction is unclear.
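The correction can be illustrated numerically. The sketch below assumes the standard zero-truncated Poisson formulation, in which the observed mean family size m (drones per detected family) satisfies m = λ / (1 − e^(−λ)); this is an assumption about how \(\hat{\lambda}\) is obtained, not a detail stated in the text. The equation is solved for λ by bisection, and NT is then Nd divided by the estimate. The function names and the example of 300 drones in 50 families are hypothetical.

```python
import math


def fit_truncated_poisson_lambda(mean_family_size: float,
                                 tol: float = 1e-10) -> float:
    """Solve m = lam / (1 - exp(-lam)) for lam by bisection (requires m > 1)."""
    if mean_family_size <= 1:
        raise ValueError("mean family size must exceed 1")

    def excess(lam: float) -> float:
        # Mean of a zero-truncated Poisson minus the observed mean.
        return lam / -math.expm1(-lam) - mean_family_size

    lo, hi = 1e-9, mean_family_size   # excess(lo) < 0 < excess(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def total_colonies(n_drones: int, n_families: int) -> float:
    """N_T = N_d / lambda-hat, including colonies missed by the sample."""
    lam = fit_truncated_poisson_lambda(n_drones / n_families)
    return n_drones / lam


# Hypothetical sample: 300 drones assigned to 50 families (mean of 6).
lam_hat = fit_truncated_poisson_lambda(300 / 50)
n_total = total_colonies(300, 50)
print(lam_hat, n_total, n_total - 50)   # last value: estimated missed colonies
```

When the mean family size is large, the estimate of λ is close to the observed mean and the correction adds only a fraction of a colony.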

4 Discussion

A risk of using drone samples that have been collected over a period of time is that allele frequencies in the drone population in the sampled area will have changed over time (Kraus et al. 2005a; Chapman et al. 2019). Variance in allele frequency over time could potentially lead to inaccurate family reconstruction (Kraus et al. 2005a). Here we have shown that the benefits to be gained from a larger sample size as a result of repeated sampling outweigh the possible drawback of temporal fluctuations in allele frequencies arising from real changes in DCA population structure over time. (If allele frequency fluctuations derive from low sample size, pooling is always advantageous.) Our results show that the number of families inferred using monthly data sets alone was lower than when the number of families was inferred from longer-term studies. Only when we used multiple samples did the average number of drones per inferred family (Di:NC) exceed 6, indicating that sample sizes had been adequate (Table II) (Utaipanon et al. 2021). We therefore conclude that the higher number of families detected per month using the annual data set was a result of an improvement in inference due to higher sample size and reduced non-sampling error (NSE) (Baudry et al. 1998; Utaipanon et al. 2019b). NSE arises when drones from a particular colony are visiting a sampled DCA, but by chance were not sampled on the sampling day. We know that non-sampling error was a real problem, since some colonies were identified months apart, without being seen in the intermediate months (data not shown). This emphasises that repeated samples at the same site improve the estimate of colony density, by increasing the sample size. Our analysis also suggests that fitting the distribution of family sizes to the Poisson distribution is unlikely to solve the problem of NSE if the sample size has been inadequate.

Small sample size can mean that some families are represented by just a few males, which can be erroneously allocated by COLONY into existing families with which they share alleles, especially when families are highly related, or the available genetic information is limited. This problem can be ameliorated by increasing the sample size. Alternatively, the precision of allocation of individuals to family groups can be improved by using more microsatellite or SNP markers.

Allele frequencies estimated from a DCA population can fluctuate significantly over a period of just 2–3 days (Moritz et al. 2007; Collet et al. 2009). Whether these differences are attributable to sampling error or real changes in population allele frequencies is an open question. We detected significant differences in allele frequency over the seasons evaluated. However, our data show that when allele frequencies estimated from the annual drone population are used a priori for the analysis of monthly data (method (ii) in Figure 1), the number of inferred families remained similar to that based on allele frequencies estimated from each month’s data (method (iii) in Figure 1) (Table III). This indicates that fluctuations in allele frequencies over time have no material impact on the numbers of colonies inferred. Thus, if repeated sampling is required because the sample size is low, data from the repeated samples can be pooled with confidence. In such cases, the research question should be taken into account to assess whether repeated sampling is warranted, and how long the study period should be.

Our study shows that drone sampling should not be conducted until at least 1 month after drones first appear at a DCA after winter, and at least 1 month before drones cease flying in winter. Although the number of families found in each month was variable (Table I), comets of drones were present from October to March at both locations in each year, whereas the presence of comets was uneven in September and April. Therefore, the time of the year that sampling is conducted is likely to have a material influence on the accuracy of the estimates of colony density that are obtained.

5 Conclusions

In an environment where the density of colonies is unknown, an a priori goal for sample size cannot be set, as it depends on the number of colonies present in the environment (Utaipanon et al. 2021). (Of course, sample size should always be as large as is practicable, within the constraints of a field trip.) Where the actual colony density is very low, small sample sizes (ca. 200 drones) are likely to be adequate. If the density of colonies is high, higher sample sizes are necessary (> 3000 drones) (Utaipanon et al. 2021). Our study suggests that there is no harm in collecting drones over a period of a week or even months; fluctuations in allele frequencies, whether real or due to sampling error, are not large enough to be cause for concern. This property of drone populations and maximum likelihood estimates of allele frequencies means that researchers can analyse an initial sample and return to the field to collect additional samples if the average number of drones per family turns out to be inadequate. Such a protocol will maximise the efficiency of the method. It is also possible to fit the data to a truncated Poisson distribution to estimate the number of non-sampled colonies (Baudry et al. 1998), but because the distribution of family size is often a poor fit to the Poisson distribution, we suggest that additional samples should be obtained wherever possible.