1 Introduction

The honey bee (Apis mellifera Linnaeus, 1758) plays a crucial role as the primary pollinator of numerous crops and wild flowering species, particularly in the Northern Hemisphere. The commercial pollination of specific crops relies on it, and it also helps safeguard ecosystem diversity (Hung et al. 2018; Papa et al. 2022). Beekeeping is also an important source of benefits for many rural livelihoods. According to the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services, the western honey bee is the most widespread pollinator worldwide, with more than 81 million hives producing approximately 1.6 million tons of honey per year (IPBES, 2016).

However, over the past decade, some regions of the world have experienced elevated loss rates in managed honey bee colonies, making it difficult for beekeepers to sustain their operations. In some places, losses exceed 40% (Australian Honeybee Health Survey 2019; Chejanovsky et al. 2021; Gray et al. 2020; Jacques et al. 2017; Laurent et al. 2016; USDA 2021; Meixner 2010). Spain is no exception to this trend. The European Pilot Surveillance Program of the Loss of Bee Colonies (Laurent et al. 2016), along with its continuation in Spain, has allowed the assessment of colony mortality in the European Union. According to the latest report from the program, winter mortality in Spain for the 2019–2020 period reached 19.2%, the highest recorded in the entire historical series of the program. However, this figure might have been underestimated, since the second spring visit could not be completed in 45 of the participating apiaries (30.2%) because of movement restrictions caused by the COVID-19 pandemic and the state of alarm declared in Spain in March (MAPA, 2021). This trend of increased losses is of great concern because Spain is one of the countries in the European Union with the largest number of colonies (3,097,647) and a high honey production (34,065 t) (MAPA 2023).

The reasons for these losses are complex and involve various factors either alone or in combination (Cavigli et al. 2016; Dussaubat et al. 2016; Goulson and Hughes 2015; Higes et al. 2010; Hristov et al. 2020; Insolia et al. 2022; Laurent et al. 2016; Rosenkranz et al. 2010; Thompson et al. 2014). Multiple drivers have been described including pathogens such as Varroa destructor, Nosema ceranae, and viruses like DWV (Dainat et al. 2012a, b; Di Prisco et al. 2016; Higes et al. 2010; Van Der Zee et al. 2015), pesticides (Yasrebi-de Kom et al. 2019), poor nutrition (DeGrandi-Hoffman et al. 2010; Goulson et al. 2015; London-Shafir et al. 2003), improper beekeeping practices (Jacques et al. 2017; Kulhanek et al. 2021) and adverse climatic conditions (Flores et al. 2019; Insolia et al. 2022). Given the intricate and multifaceted nature of colony decline, it is imperative to conduct a thorough assessment involving extensive data analysis and using advanced statistical methods, to identify the most influential factors in countries that practice professional beekeeping. Therefore, the aim of this work was to identify the main factors associated with hive mortality in Extremadura, a key beekeeping area in southwestern Spain, by determining the relative importance of each component in predicting the mortality.

2 Material and methods

2.1 Data collection

A total of 179 beehives were assessed between 2020 and 2021 in three different apiaries located in the traditional beekeeping areas of Extremadura (southwestern Spain), all situated in Cáceres province (A1, 39.524661, −6.320094; A2, 39.176452, −6.277247; A3, 39.156202, −6.265553). The climatic conditions in this region are Mediterranean, with oceanic influence. Summers are warm, dry, and mostly clear, and winters are cold and partly cloudy. Over the course of the year, the temperature generally varies from 1 to 34 °C and rarely drops below −4 °C or rises above 38 °C. The rainiest month occurs in early fall, with an average of 60 mm of rain. Summer is the season with the lowest rainfall, with an average of 4 mm of rain. The predominant vegetation in this zone consists of herbaceous vegetation, grassland, dryland cereal crops, stands of Lavandula stoechas and Olea sylvestri, and Populus alba and Salix spp. on the riverbanks. In addition, the pasture spaces of holm oak (Quercus rotundifolia) and cork oak (Quercus suber) stand out (MAPA 1991).

All the hives included in the study were examined and sampled twice: first, at the end of the productive season (July–August), when parameters related to the strength of the beehives and their nutritional and sanitary status were recorded; second, after the winter period (March), when the beehives were re-examined to assess overwintering mortality.

2.2 Monitoring of beehives

The strength parameters recorded for the beehives were the number of adult bees, operculated and open brood, and honey and pollen reserves, as described by Delaplane et al. (2013). The number of bees per colony was estimated by calculating the percentage of surface occupied on both sides of each frame. The remaining parameters (operculated brood, open brood, pollen, and honey reserves) were expressed as cm² of surface occupied by each one. To reduce bias, these parameters were estimated in duplicate by two different technicians, and the mean value was calculated. The estimates were carried out at the same time, to avoid greater variability in the number of bees outside the hive.

2.3 Disease diagnosis

Approximately 400 adult bees from each hive were sampled for diagnostic purposes, allowing the assessment of both the initial and final infestation levels of V. destructor, N. ceranae, Deformed Wing Virus (DWV), and Chronic Bee Paralysis Virus (CBPV). Sampled bees were collected from the outer frames to avoid damaging the queen or the brood in the middle frames of the hive and were taken to the laboratory under refrigeration. Once in the laboratory, approximately 30 bees were preserved at −20 °C for N. ceranae quantification, whereas another 30 bees were preserved at −80 °C for viral load determination. The remaining bees were refrigerated for V. destructor detection.

To estimate V. destructor infestation levels, about 300 bees were immersed in a 5% ethanol solution and shaken for 5 min. They were then passed through a sieve, and the total numbers of bees and mites were counted (Calatayud and Verdú 1992). Using this method, the number of phoretic V. destructor in each hive was determined and expressed as the number of mites per 100 bees, that is, as a percentage of parasitism.
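The conversion from wash counts to a percentage of parasitism is simple arithmetic, and can be sketched as follows. The function name and the example counts are hypothetical and are not taken from the study data.

```python
def mites_per_100_bees(mite_count: int, bee_count: int) -> float:
    """Express phoretic mite infestation as mites per 100 bees (% parasitism)."""
    if bee_count <= 0:
        raise ValueError("bee_count must be positive")
    return 100.0 * mite_count / bee_count


# Hypothetical wash result: 12 mites recovered from a sample of 300 bees
infestation = mites_per_100_bees(mite_count=12, bee_count=300)
print(f"{infestation:.1f} mites per 100 bees")
```

Counting the bees actually washed, rather than assuming exactly 300, keeps the rate comparable between hives when sample sizes differ.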

To assess the presence of N. ceranae, the digestive tract was extracted from 30 bees per hive. The guts were placed in 5 ml of sterile DNase-free water and homogenized in a stomacher bag for 5 min. Subsequently, 1 ml of each sample was incubated with 200 μl of germination solution (NaCl 0.5 M + NaHCO3 0.5 M) at 37 °C for 15 min. This final solution was used as a matrix for DNA extraction with NukEx Complete Mag RNA/DNA® (Gerbion GmbH and Co. KG, Germany) and KingFisher Flex (Thermo Fisher Scientific Inc.). The DNA concentration of each sample was measured using NanoDrop™ 2000 (Thermo Fisher Scientific Inc.), and quantitative PCR was performed as previously described (Bourgeois et al. 2010). Briefly, for each reaction, 10 μl of Premix Ex Taq™ (2 ×) (TaKaRa Bio, Japan), 0.2 μM of each primer and ROX Reference Dye, 0.8 μl of probe, and 2 μl of DNA sample (5 ng of DNA) were used in a total volume of 20 μl PCR reaction mixture. The PCR cycling parameters consisted of an initial denaturation at 95 °C for 30 s, followed by 40 cycles at 95 °C for 5 s and 60 °C for 34 s. Duplicate reactions were performed for template samples, standards, and non-template controls. The number of DNA copies present in each sample was estimated from a standard curve calculated using the thermocycler-specific software (Applied Biosystems 7300, Thermo Fisher Scientific Inc., USA) and expressed as the number of copies of N. ceranae DNA per ng of total DNA.

To determine the occurrence of DWV and CBPV, 20 bees per hive were placed in 5 ml of sterile PBS and homogenized for 5 min in a Blender Smasher™ (BioMérieux, Spain). RNA extraction was performed from 1 ml of homogenized mix per sample using NukEx Complete Mag RNA/DNA kit® (Gerbion GmbH and Co. KG, Germany) and KingFisher™ Flex (Thermo Fisher™ Scientific Inc.). Subsequently, reverse transcription of the samples was performed to obtain cDNA using the PrimeScript™ RT Reagent Kit (TaKaRa, Japan) following the manufacturer’s recommendations. Total RNA was quantified using a NanoDrop™ 2000 (Thermo Fisher™ Scientific Inc.). Finally, qPCR for DWV and CBPV was performed using Premix Ex Taq™ (TaKaRa Bio, Japan), as previously described (Schurr et al. 2019). The results were expressed as the number of copies of viral cDNA per ng of total DNA from each sample.

2.4 Data analysis

Data analyses were conducted using Python version 3.11.4. The objective of the statistical analysis was to obtain a machine learning model to predict the response variable “mortality” based on the proposed explanatory variables (Table I), following the principle of parsimony (maximum accuracy using the smallest number of variables).

Table I Potential explanatory variables to mortality

Before training the machine learning model, we performed a correlation analysis using Spearman’s correlation coefficient and generated a hierarchical dendrogram to visualize the similarity structure among the features. This process allowed us to identify highly correlated variables as redundant features, which were excluded when their correlation was greater than 0.95.
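A minimal sketch of this redundancy screen is given below. It assumes NumPy and SciPy, runs on synthetic data rather than the study dataset, and uses hypothetical feature names. The choice of 1 − |ρ| as the distance and of average linkage is also an assumption, since the text does not state which were used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def spearman_matrix(X):
    """Spearman correlation matrix: Pearson correlation of column-wise ranks."""
    ranks = np.apply_along_axis(rankdata, 0, X)
    return np.corrcoef(ranks, rowvar=False)


def redundant_pairs(corr, names, threshold=0.95):
    """Return feature pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j]))
    return pairs


# Synthetic stand-in data (NOT the study data); names are hypothetical
rng = np.random.default_rng(0)
x1 = rng.exponential(scale=100.0, size=60)
X = np.column_stack([
    x1,
    np.log1p(x1),          # monotone transform of x1, so rank correlation is 1
    rng.normal(size=60),   # unrelated feature
])
names = ["x1", "x1_monotone", "x2"]

corr = spearman_matrix(X)
pairs = redundant_pairs(corr, names)

# Linkage matrix for a dendrogram, using 1 - |rho| as the distance (assumption)
dist = np.clip(1.0 - np.abs(corr), 0.0, None)
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")
print(pairs)
```

The linkage matrix Z can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree; one member of each redundant pair is then dropped before model training.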

To predict bee mortality, we used the Random Forest (RF) algorithm, an ensemble model based on the aggregation of single decision trees, each trained on a subsample of the training dataset drawn with replacement. We assessed the performance of the initial model on both the training and test sets using the following metrics: accuracy, sensitivity, and specificity. Accuracy measures how well the model correctly predicts both positive and negative outcomes; specificity is the true negative rate, which measures the ability of the model to correctly identify negative cases; and sensitivity is the true positive rate, which measures the ability of the model to correctly identify positive cases.
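The three metrics can be written directly in terms of confusion-matrix counts, as sketched below. The counts are hypothetical: they were chosen only so that a 27-hive test set gives rounded values of 0.70, 0.46, and 0.93, and they are not taken from the study.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts; "positive" means the hive died during overwintering
metrics = classification_metrics(tp=6, fn=7, tn=13, fp=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```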

The dataset used in this study contained the features related to bee colony health and strength described above (provided in Supplementary Data 1). The original dataset was split into training (85%) and test (15%) subsets. The hyperparameters of the RF model were selected using K-fold cross-validation on the training dataset. We then used the best-performing random forest classifier with the optimal hyperparameters, combined with a high number of trees (500), to achieve better predictive performance on the data.
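The workflow described above could be sketched with scikit-learn as follows. This is an assumption about tooling, since the text names only Python. The data are synthetic, and the hyperparameter grid, the number of folds, and the random seeds are hypothetical choices that the text does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic stand-in for 179 hives and a binary "mortality" label (NOT study data)
rng = np.random.default_rng(42)
X = rng.normal(size=(179, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=179) < 0).astype(int)

# 85% / 15% split, as described in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# K-fold cross-validation on the training set only; grid values are hypothetical
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Refit with the selected hyperparameters and a high number of trees (500)
final_model = RandomForestClassifier(
    n_estimators=500, random_state=42, **search.best_params_
)
final_model.fit(X_train, y_train)
print(search.best_params_, final_model.score(X_test, y_test))
```

With 179 rows, a test fraction of 0.15 yields 27 test and 152 training samples, which matches the subset sizes reported in the Results.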

To compute the importance of the variables, we employed permutation importance (Molnar 2020), a technique used to assess the contribution of each feature to a model’s prediction, and repeated it twice. After this step, we kept only those features with an importance different from zero, after verifying that removing the discarded features did not change the accuracy of the model. Finally, we used Shapley values to explain individual predictions and generated plots to visualize the impact of each feature on the prediction (see Glossary for better comprehension).
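Permutation importance can be sketched with scikit-learn's permutation_importance function, assuming that library. The data below are synthetic, with a single informative feature. The feature names, the number of repeats, and the rule of keeping features with a positive mean importance are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data (NOT study data): only the first feature determines the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
names = ["informative", "noise_1", "noise_2", "noise_3"]

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Shuffle each feature in turn and record the resulting drop in accuracy
result = permutation_importance(
    model, X, y, n_repeats=10, random_state=0, scoring="accuracy"
)

# Keep the features whose mean importance is above zero
kept = [n for n, imp in zip(names, result.importances_mean) if imp > 0]
for n, imp in zip(names, result.importances_mean):
    print(f"{n}: {imp:.3f}")
```

In practice the importance is more informative when computed on held-out data, because a forest can memorize noise features in the training set.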

3 Results

In the entire data set, the mortality rate was 0.43. The training set contained 152 samples, each with 15 features. In 2020, 46 of 91 colonies died (51%); in 2021, 32 of 88 colonies died (36%). All the parameters were recorded as indicated in Table II.

Table II Summary of the situation of each apiary (A) with respect to the measured variables

Figure 1 shows the dendrogram obtained using Spearman’s correlation. CBPV_copies and CBPV showed a correlation greater than 0.95, so CBPV_copies was removed from the analysis.

Figure 1 Dendrogram of redundant features using Spearman’s correlation.

The dendrogram forms three distinct groups: the blue cluster comprises Nosema, DWV, and V. destructor, the green cluster includes Pollen and Place, and the orange cluster incorporates the remaining variables. After this step, the test set consisted of 27 samples and 15 features. The mean colony loss was approximately 0.428 in the training set and approximately 0.481 in the test set.

The results of the Random Forest permutation importance (Fig. 2) show that DWV, Place, Nosema, and CBPV had no significant effect, so we ran the definitive model discarding these variables.

Figure 2 Random Forest permutation importance plot. Each bar corresponds to a feature, and the height of the bar reflects the importance of that feature in influencing the model’s predictions after permutation. Those that had an importance below 0 were eliminated in the final model.

After removing these features, we obtained an accuracy of 0.83, a sensitivity of 0.65, and a specificity of 0.98 for the training data, and an accuracy of 0.70, a sensitivity of 0.46, and a specificity of 0.93 for the test data. The relationship between the different features and mortality was represented by partial dependence plots. An increase in the model’s average prediction was associated with higher mortality, whereas a decrease was associated with a higher probability of survival (Fig. 3). Notably, higher values of open brood, operculated brood, pollen, honey, bees, and also Nosema decreased the probability of death, while higher values of V. destructor and DWV copies increased average mortality (Fig. 3). Finally, Fig. 4 explains the individual predictions using Shapley values. Features with a point cloud concentrated around zero had a low impact on the prediction; conversely, a more spread-out cloud indicates a higher impact on the prediction. In this case, the most explanatory variables were as follows: open brood, operculated brood, V. destructor, honey, bees, pollen, DWV_copies, and Nosema_copies.
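The idea behind a one-dimensional partial dependence curve can be sketched in a few lines of NumPy: the feature of interest is fixed at each grid value for every hive, and the model predictions are averaged. The prediction function below is a hypothetical toy model with step thresholds, not the fitted Random Forest, and the data are synthetic.

```python
import numpy as np


def partial_dependence_1d(predict, X, feature_index, grid):
    """Average prediction when one feature is set to each grid value for all rows."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_index] = value
        averages.append(predict(X_mod).mean())
    return np.array(averages)


def toy_predict(X):
    """Hypothetical mortality probability with step thresholds (illustration only)."""
    brood_effect = np.where(X[:, 0] < 1500.0, 0.6, 0.0)   # column 0: brood area, cm2
    honey_effect = np.where(X[:, 1] < 3500.0, 0.2, 0.0)   # column 1: honey area, cm2
    return 0.1 + brood_effect + honey_effect


# Synthetic hives (NOT study data)
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 4000, 100), rng.uniform(0, 8000, 100)])

grid = np.array([500.0, 1000.0, 2000.0, 3000.0])
pd_brood = partial_dependence_1d(toy_predict, X, feature_index=0, grid=grid)
print(dict(zip(grid.tolist(), pd_brood.round(3).tolist())))
```

A flat curve on one side of a value and a jump on the other is the kind of pattern that can be read as a threshold for a feature.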

Figure 3 Partial dependence plots, displayed in a 3 × 3 grid, showing the influence of each feature and allowing both the marginal effect and the interaction effect of the features to be observed. The average mortality is shown in orange, and each plot shows how it increases or decreases with the feature. These plots aid in identifying trends, uncovering non-linear relationships, and validating our understanding of the model’s behavior.

Figure 4 Shapley values for all features in the training data. Red indicates high and blue low importance given by the model to each feature.

4 Discussion

Using machine learning methodology as a tool for data analysis, we conducted a comprehensive investigation of bee colony loss in Spain over a 2-year period. The union of modeling and visualization techniques has established a robust framework for identifying pivotal factors and understanding their influence on bee survival. This understanding is of paramount significance for the conservation of pollinators.

In this case, the model performed better on the training data than on the test data. The training metrics showed high accuracy and specificity, indicating that the model correctly identified most negative instances in the training data, whereas its medium sensitivity indicates that it identified positive instances less reliably. On the test data, the model likewise showed high accuracy, medium sensitivity, and high specificity. These parameters show that the model is more capable of predicting negative cases (hives that were alive at the end of the experiment) than positive cases.

It is common to use models to predict winter hive losses. Some studies used simple and multiple linear regression models and reported that 20% of the variability in winter mortality can be explained by weather conditions (Becsi et al. 2021). Others, using a generalized linear mixed model, found that winter colony mortality was significantly affected by operation size, year, and cluster membership (Kuchling et al. 2018). The main advantages of machine learning over conventional statistical methods include the ability to capture non-linear relationships and to handle a large number of predictor variables. This kind of analysis was first used by Calovi et al. (2021), but the accuracy of their overwintering survival model (65.7–73.3%) was lower than that reported here. Sensitivity and specificity should also be reported, because these metrics complement accuracy and give a more detailed picture of how well a model classifies each outcome, as described above. Other complex analytical approaches, such as deep learning, have also been shown to be useful in apiology for V. destructor detection (Divasón et al. 2023). It is noteworthy that both the partial dependence plots and the Shapley values indicate the same top seven factors associated with colony loss. This agreement between two different methods supports the ranking of these features and indicates that colony loss is multifactorial, with factors contributing to different degrees.

The most influential parameters for beehive survival were open and operculated brood, while the number of adult bees occupied the seventh position. This shows that optimal egg laying by the queen and a high bee population are crucial for the correct functioning of the beehive. These parameters are determined by the dynamics of the beehive population (size of the colony, reproductive potential of the queen, access to nectar, pollen, and water, and adequate space for brood rearing and food storage) and have been described previously as the best factors to predict the survival of the colony during the following winter (Harris 2010; Lee and Winston 1987).

Additionally, good brood status was associated with healthy colonies free of the various pathologies related to this stage (Aronstein and Murray 2010; Forsgren 2010; Genersch 2010). Our partial dependence plots also showed that, for both types of brood, when the area occupied by brood fell below 1500 cm², the chance of the hive dying increased greatly, while above 1500 cm² its influence was null. Thus, in addition to the global influence of each factor, these tools generated threshold values for some factors, which could be useful for beehive management.

We found that V. destructor played a significant role in colony loss, which is consistent with findings reported by several authors (Dainat et al. 2012a, b; Le Conte et al. 2010; Stahlmann-Brown et al. 2022). This mite poses a significant threat to Western honey bee colonies worldwide. It is considered the primary cause of colony mortality in the United States, leading to colony collapse and death if infestations are left untreated (Brodschneider et al. 2018; Guzmán-Novoa et al. 2010). Colony losses increase through winter, the season covered by this study, and similarly high colony losses due to V. destructor infestations have been observed in Europe, where regional surveys reported average winter losses of 20.9% in 2016–2017 and 16.7% in 2018–2019 (Brodschneider et al. 2018; Gray et al. 2020; Van Dooremalen et al. 2012). In Uruguay, studies showed an average winter colony loss of 18.3% in 2013–2014, with parasites and diseases, such as V. destructor, responsible for 61.5% of the reported losses (Antúnez et al. 2017). The overwintering state of a honey bee colony is characterized by changes in the behavior and physiology of individual bees, including an increase in the fat body that is necessary to survive this season (Doeke et al. 2015), making colonies more susceptible to death during this period (Dahle 2010; Dainat et al. 2012a, b; Guzmán-Novoa et al. 2010; Van Dooremalen et al. 2012). Based on the statistical results obtained using machine learning, this parasite plays a crucial role in colony mortality during the overwintering period.

Our findings highlight that Varroa destructor remains one of the primary pathogens affecting colonies. This ectoparasite acts primarily by feeding on the fat body (Ramsey et al. 2019). This predisposes bees to the entry of other external infectious agents, reduces the life expectancy of infested individuals, and transmits pathogens. Some existing products, such as coumaphos, tau-fluvalinate, flumethrin, and amitraz, can generate resistant populations (González-Cabrera et al. 2013; Hernández-Rodríguez et al. 2021; Maggi et al. 2009; Rodríguez-Dehaibes et al. 2005) and leave residues in the colony matrices (Bernal et al. 2010; Calatayud-Vernich et al. 2018; Kast et al. 2021; Marti et al. 2022; Wilmart et al. 2016; Xiao et al. 2022). These facts emphasize the importance of research into new products that are effective and sustainable for mite control. In addition, none of the existing management options has been able to completely eradicate V. destructor from infested colonies. A promising new tool is Integrated Pest Management (IPM), an ecologically based, sustainable approach to pest management that relies on a combination of control tactics to minimize environmental impacts. However, to date these strategies have only been successful in maintaining infestations at levels that are not critically damaging (Jack and Ellis 2021).

Furthermore, nutritional status is closely related to colony mortality, with pollen and honey loads being critical factors (third and fourth place). The nutritional requirements of honeybees are intricate and unique compared with those of other livestock or animals, considering their superorganism colony structure and caste systems. As honey bee workers age, they undergo a transition from high-essential amino acid diets to predominantly carbohydrate-based diets (Brodschneider and Crailsheim 2010; Leach and Drummond 2018). Nutritional stress resulting from insufficient pollen or monofloral sources can negatively affect hives (Branchiccela et al. 2019; Naug 2009) and contribute to disease incidence and losses in winter (DeGrandi-Hoffman and Chen 2015; Dolezal and Toth 2018). Providing an alternative floral resource can offer suitable nutrition and contribute to fulfilling their nutritional requirements (Crailsheim 1990; Di Pasquale et al. 2013). Similarly, supplementation can maintain and develop colonies and can be used by beekeepers (Oliveira et al. 2020a, b; DeGrandi-Hoffman et al. 2008; Borovšak et al. 2015b, a; Watkins de Jong et al. 2019).

For this reason, honey and pollen reserves had high importance in the model. Honey is essential for the nutrition of adult bees (Brodschneider and Crailsheim 2010). Starvation caused by carbohydrate deficiency, especially during long periods with no nectar input and usually after honey harvesting, is one of the most important reasons for beehive mortality during winter (Brodschneider and Crailsheim 2010). As with brood and the number of adult bees, the partial dependence plots show that when honey reserves fell below approximately 3500 cm², the probability of mortality increased greatly, while above this threshold honey reserves did not affect beehive survival.

Regarding pollen reserves, the model showed that beehive survival increased as pollen reserves increased. Pollen is the only source of protein for the colony (Brodschneider and Crailsheim 2010). In contrast with honey, bees store only small reserves of pollen in the hive, which decrease rapidly once pollen foraging ends (Schmickl and Crailsheim 2001, 2002). Insufficient pollen in the diet leads to adult bees with smaller body and wing size and shorter longevity (Alqarni 2006; Manning et al. 2007), to malformations during their development as pupae (Jay, 1964), and to cannibalism of the brood and cessation of egg laying. All these consequences of a lack of pollen greatly influence the survival of the colony (Schmickl and Crailsheim 2001, 2002).

Additionally, we observed that Nosema ceranae had an unexpected relationship with average mortality: higher loads decreased it. This obligate intracellular parasite infects the epithelial cells of the honey bee ventriculus after ingestion of spores and acts as a pathogen in bee colonies (Botías et al. 2013; Higes et al. 2007, 2010). The parasitic load of N. ceranae is strongly correlated with season and bee age, being higher in spring and in older bees (Emsen et al. 2020; Jabal-Uriel et al. 2022). Consequently, these variables may explain variations in Nosema loads, including the elevated parasitism observed in surviving colonies, since the measured load can vary with the age of the bees collected and the sampling season.

Moreover, viruses, particularly the Deformed Wing Virus (DWV), also play an important role in the death of colonies. DWV is a positive single-stranded RNA (+ssRNA) Picorna-like virus that belongs to the family Iflaviridae. It exists in four variants (A, B, C, and D) (Paxton et al. 2022) and is strongly associated with V. destructor (Piou et al. 2022). Some authors previously reported a link between variant A and colony loss (Kevill et al. 2019). Additionally, experiments on individual honey bees suggest that DWV-B may be more virulent than DWV-A, at least in adult hosts (McMahon et al. 2016). In our study, we examined DWV without distinguishing the variants; however, some authors have suggested that both DWV-A and DWV-B can cause honey bee colonies to die, which aligns with our findings. In addition, DWV and V. destructor have a strong relationship: DWV is the most common virus transmitted by this ectoparasite, and the mite is correlated with increased viral prevalence and viral loads in infested colonies (Piou et al. 2022). Another virus examined was CBPV, the causative agent of severe, usually fatal paralysis in adult honey bees (Ellis and Munn 2005). CBPV is an enveloped virus with a genome consisting of two single-stranded RNA molecules (Gisder and Genersch 2017). Previous studies have shown that CBPV can cause colony loss in association with Nosema and V. destructor (Dittes et al. 2020), but in this study it was not an influential factor, possibly because there were few cases (only 39 in the entire dataset).

Finally, the location of the apiary and the year showed very little influence on beehive mortality. We included location in the models to capture differences between zones, such as weather conditions, land use and vegetation types, or access to food and water sources, while the variable year mainly captured differences in weather conditions and flowering patterns between the years of the study. Although the apiaries were in different locations, their climate was very similar, which may explain why this effect was not very influential. More studies recording all of these parameters in different zones are necessary to better elucidate the influence of each one on beehive mortality.

These findings offer valuable information to beekeepers, enabling them to maintain an appropriate sanitary status within their colonies, ensure good nutritional levels, and understand the most relevant factors contributing to their colony losses, such as the strength of the beehive and V. destructor. Moreover, knowledge of how pollen loads and other nutritional aspects can influence colony mortality gives beekeepers insights for optimizing hive health. They can take measures to ensure a diverse and balanced diet for their honey bee colonies at every level, from larval to adult nutrition, addressing potential nutritional stressors that may arise from insufficient pollen resources or monofloral sources, and in this way promote the well-being and resilience of their colonies. Additionally, understanding the significance of a healthy brood status allows beekeepers to monitor and address brood infections promptly, using good sanitary beekeeping practices that prevent the transmission of pathogens between their apiaries and improve their sanitary status.

In future studies, it will be of utmost importance to include additional variables, such as the presence of pesticides and various climatic factors, to obtain a more comprehensive and holistic understanding of colony mortality in Spain. Overall, this study shows that machine learning models have the potential to positively impact the beekeeping industry by predicting the most influential factors contributing to colony loss.

5 Glossary

5.1 Main concepts used in “Sect. 2” and their definitions

Cross-validation K-fold

Cross-validation is a resampling technique used to validate machine learning models against a limited sample of data.

Machine learning model

It is a program that has been trained with an algorithm to learn specific patterns in datasets and to give insights and predictions from those patterns. When creating an ML model, the user defines the answer to be captured and sets the parameters within which the model works and learns.

Permutation importance

A model inspection technique that measures the contribution of each feature to a fitted model’s statistical performance on a given dataset.

Random Forest (RF) algorithm

Regression tree technique which uses bootstrap aggregation and randomization of predictors to achieve a high degree of predictive accuracy.

Shapley values

The (weighted) average of marginal contributions.
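As an illustration of this definition, the exact Shapley value of each player (feature) can be computed for a small toy game by weighting its marginal contribution to every coalition of the other players. The value function and the player names below are hypothetical. In practice, Shapley values for a fitted model are usually approximated by dedicated libraries rather than enumerated in this way.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, players):
    """Exact Shapley values: weighted average of marginal contributions."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition without p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical payoff: additive effects of a and b plus an a-b interaction."""
    v = 0.0
    if "a" in coalition:
        v += 2.0
    if "b" in coalition:
        v += 1.0
    if "a" in coalition and "b" in coalition:
        v += 1.0
    return v


phi = shapley_values(toy_value, ["a", "b", "c"])
print(phi)
```

The values sum to the payoff of the full coalition minus that of the empty one, and the interaction term is shared equally between the two players involved.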