Forecasting model of Corylus, Alnus, and Betula pollen concentration levels using spatiotemporal correlation properties of pollen count

The aim of the study was to create and evaluate models for predicting high levels of daily pollen concentration of Corylus, Alnus, and Betula using a spatiotemporal correlation of pollen count. For each taxon, a high pollen count level was established according to the first allergy symptoms during exposure. The dataset was divided into a training set and a test set, using a stratified random split. For each taxon and city, the model was built using a random forest method. Corylus models performed poorly. However, the study revealed the possibility of predicting with substantial accuracy the occurrence of days with high pollen concentrations of Alnus and Betula using past pollen count data from monitoring sites. These results can be used for building (1) simpler models, which require data only from aerobiological monitoring sites, and (2) combined meteorological and aerobiological models for predicting high levels of pollen concentration.


Introduction
Corylus L. (hazel), Alnus Mill. (alder), and Betula L. (birch) belong to the order Fagales Engl. and the family Betulaceae Gray (Bremer et al. 2009). These trees are very common in the Northern Hemisphere (Kornas and Medwecka-Kornas 2002). The dominant species from this family in Poland are Alnus glutinosa, Alnus incana, and Betula pendula; less common are Betula pubescens, Corylus avellana, and their cultivars. The start and length of the Corylus and Alnus pollen seasons are very variable from year to year. Their pollen season in Poland usually begins some time between early February and late March and lasts on average for 30 days (Corylus) and 26 days (Alnus). The Betula pollen season occurs between the middle of April and the middle of May and lasts for approximately 18 days. Its pollen season start and duration are less variable than those of Corylus and Alnus (Nowosad et al. 2015). Furthermore, the daily and annual pollen counts of Corylus, Alnus, and Betula vary greatly (Nowosad et al. 2015).
Corylus, Alnus, and Betula pollen are well known for their allergenic properties (Viander and Koivikko 1978), and the occurrence of allergic reactions is connected with pollen concentration levels. According to Rapiejko et al. (2007), the first allergy symptoms are seen during exposure to a concentration of 35 pollen=m 3 of air for Corylus, 45 pollen=m 3 of air for Alnus , and 20 pollen=m 3 of air for Betula. Allergy symptoms in all subjects were noted at concentrations of 80, 85, and 75 pollen=m 3 of air, respectively, for Corylus, Alnus, and Betula. Sensitization rates to tree species of the family Betulaceae in Poland are high: Corylus, 22.3 %; Alnus, 22.8 %; and Betula, 27.7 % (Heinzerling et al. 2009). The pollen allergens from Corylus, Alnus, and Betula are structurally and immunochemically similar. Thus, cross-reactions are very likely between allergens of the Betula family (Valenta et al. 1991).
One of aerobiology's objectives is to develop models enabling the prediction of pollen concentration in the air (Rodriguez-Rajo et al. 2006). Forecast models of pollen concentrations have many practical applications. They are highly important for allergy sufferers because predictions can allow them to undertake appropriate treatment. Models could also be useful in agriculture, forestry, and many fields of science. Most of the published results are based on the relationship between pollen season characteristics or on pollen count and meteorological conditions (Bringfelt et al. 1982;Cotos-Yáñez et al. 2004;Castellano-Méndez et al. 2005;Rodriguez-Rajo et al. 2006;Hilaire et al. 2012). Other models have been built based on an operational weather forecast system (Vogel et al. 2008) and the System for Integrated modeLling of Atmospheric coMposition (SILAM) (Sofiev et al. 2013). In Poland, Latałowa et al. (2002) delimited the major meteorological parameters as a basis for future forecast modeling of the atmospheric Betula pollen concentration in Gdańsk; Puc (2012) built an artificial neural network model of the relationship between Betula pollen and meteorological factors in Szczecin; and Myszkowska (2013) predicted Betula pollen season characteristics in Kraków.
Aerobiological surveys have been focused either on the statistical relationship between pollen count and meteorological variables or on a deterministic representation of pollen dispersion. A recent study showed that pollen counts are highly inert, temporally and spatially (Nowosad et al. 2015). Models based on this property would be relatively simple, because they do not require data other than that of pollen concentration. The forecasts would require data only from aerobiological monitoring sites and could be almost completely automated. The aim of this study was to create and evaluate Corylus, Alnus, and Betula pollen concentration level predictions based on previous pollen count values from given sites. To the best of our knowledge, there are no reports in the literature regarding how these kinds of models perform in practice.
Poland is a country in Central Europe, extending 649 km from north to south and 689 km from east to west. It is a lowland country with an average elevation of 173 m and a surface area of 312,679 km 2 , only 3.1 % of which is higher than an elevation of 500 m. There are five topographic zones in Poland. The order of the zones from north to south is as follows: the Baltic coastal plains, the lake region, the central lowlands, the uplands, and the mountains (Sudeten and Carpathian ranges). Agricultural land covers approximately 60 % of the surface area of the country, and forest, bush, and wooded land cover about 30 %. Built-up and urbanized areas occupy approximately 5 % of the total area (Dmochowska 2013). All of the studied cities are agglomerations surrounded mainly by forests and farmlands. However, the proportion of land use classes around each city is different. Gdańsk lies on the coast of the central Baltic Sea, with sea to the east and agricultural lands or forests to the west and south; Szczecin is located between Dąbie Lake to the northeast, forests to the southeast, and agricultural areas to the southwest; Poznań and Łódź are surrounded by forests and agricultural areas; Lublin is surrounded by agricultural areas; Rzeszów lies between forests to the north and agricultural areas to the south, east, and west; Kraków and Sosnowiec are distinguished by a larger proportion of urban areas.
Poland has a temperate continental climate. The effects of Atlantic masses of air and the proximity of the Baltic Sea are felt in Gdańsk and Szczecin. Poznań, Łódź, Sosnowiec, and Kraków are located in a transition zone between oceanic and continental air masses. The climate of Rzeszów and Lublin is influenced by continental air masses. In addition, Kraków, Sosnowiec, and Rzeszów lie near the Carpathian Mountains, which affect their climate (Blazejczyk 2006). A two-sample Kolmogorov-Smirnov test yielded no significant differences (D ¼ 0:08,

Aerobiological data
Daily average pollen counts were measured by a volumetric spore trap of the Hirst design (Hirst 1952), according to the recommendations of the European Aerobiology Society's Working Group on Quality Control Galán et al. (2014). Traps were located 12 m above ground level (Rzeszów) or higher. Two different pollen counting methods were used. Pollen grains were counted along 12 vertical transects using the methods outlined by Stach (2000) (Gdańsk and Rzeszów), or along four horizontal transects using the method recommended by the Spanish Aerobiology Network (Kraków, Lublin, Łódź, Poznań, Sosnowiec, Szczecin) (Galán et al. 2007). Cariñanos and Emberlin (2000) reported that both methods follow similar trends and provide close approximations to the pollen count from the entire slide. The pollen concentration was expressed as the number of grains=m 3 of air per 24 h (Comtois 1998).

Dataset creation
All the calculations were carried out using R software packages (R Core Team 2014; Kuhn 2014; Liaw and Wiener 2002). The pollen season limits of Corylus, Alnus, and Betula at each location and for each year were calculated using the 90 % method (Nilsson and Persson 1981). In this method, a season starts when 5 % of the total catch has been achieved and ends when 95 % has been reached. For each taxon, the earliest day of pollen season start and the latest day of pollen season end based on all of the data were used as the temporal scope. The aim of this work was to forecast the pollen count level of allergenic risk. For each taxon, two levels of concentration were distinguished: low and high. The ranges of concentration level values were different for each taxon. The concentration levels were as follows: Corylus, low 0-35 grains=m 3 and high [35 grains/m 3 ; Alnus, low 0-45 grains/m 3 and high [45 grains/m 3 ; and Betula, low 0-20 grains/m 3 and high[20 grains/m 3 (Table 2). Threshold values were based on first symptom values for patients allergic to these taxa (Rapiejko et al. 2007).

Statistical modeling
By using a stratified random split to divide the datasets, the distribution of the outcome in the training and test sets was preserved. Two subsets were created: • Training set, used for training a model and choosing its optimal parameters (2/3 of cases) • Test set, used only to evaluate the model on data not present during previous stages (1/3 of cases) Most of the machine learning algorithms expect equal instances of each class. Thus, the imbalance between classes can have a significant impact on the quality of the model. The dataset was slightly imbalanced in the case of Betula, and highly imbalanced for Corylus and Alnus (Table 2). Up-sampling was used to reduce this class imbalance. This technique imputes additional data to improve balance across the classes. Training sets were sampled with replacements to create equal class distribution. All of the original training data were left intact, and additional samples were added to the minority classes with replacements. However, the test sets were not changed, since they should reflect the class imbalance. This is important to obtain reliable estimates of a model's performance (Kuhn and Johnson 2013). Previous research in Poland showed that there is usually an increase-delayed on average by 1-3 days-in the correlation of pollen count between the pairs of monitoring sites (Nowosad et al. 2015). This is due mainly to prevailing winds from the west toward the east in this latitude zone of the Northern Hemisphere (Rossby waves) and the movement of atmospheric fronts. Levels of Corylus, Alnus, and Betula pollen concentration were used as the outcome data. The independent variables were the previous 4 days' pollen counts from all of the sites: Random forest (Breiman 2001) was used to predict the pollen concentration levels of Corylus, Alnus, and Betula. Preliminary studies showed that a random forest model's performance is comparable to other techniques, such as support vector machines and boosting trees. Furthermore, valuable information about random forest results could be obtained using, for example, variable importance. A random forest model has one tuning parameter: the number of randomly selected predictors to choose from at each split (m try ). A total of 100 iterations of the bootstrap were applied as the re-sampling scheme to select the optimal values of the model's tuning parameter. A series of models were fit to the training sets. For each model, the optimal parameter value was obtained based on specificity: the rate that days with high concentration were predicted correctly: where CP high is the number of correctly predicted days with high concentration and All high is the total number of days of high concentration. The general effect of predictors on each model was calculated. Variable importance was estimated by looking at the increase in prediction error when data for a given variable were changed, while all the other variables remained constant (Breiman 2002;Liaw and Wiener 2002). Afterward, the variable importance was scaled to values between 0 and 100.

Evaluation of the models' performance
The final 24 models (3 taxa Â 8 cities) were applied to generate predictions based on the test sets. Model predictions were then compared with the observed data in the test sets. A confusion matrix, unweighted Kappa statistic, sensitivity, specificity, and balanced accuracy were used to describe the performance of the models. The Kappa statistic is: where O is the observed accuracy and E is the accuracy expected to be achieved based on the marginal totals of the confusion matrix. The Kappa statistic ranges from À1 to 1. A value of 0 indicates no agreement between the observed and predicted classes, while a value of 1 indicates perfect agreement. Negative values rarely occur and indicate that ''the prediction is in the opposite direction of the truth'' (Kuhn and Johnson 2013). The sensitivity is defined as the rate that days with low concentration are predicted correctly: where CP low is the number of correctly predicted days with low concentration and All low is the total number of days with low concentration. Balanced accuracy helps to reduce the impact of imbalanced classes on a model's evaluation. It is defined as follows: Additionally, Mann-Whitney U test (Mann and Whitney 1947) was used to compare the weather conditions (precipitation and temperature) between true and false predictions of low and high pollen concentration levels of Corylus, Alnus, and Betula.

Model
The start of the earliest Corylus pollen season was on day 36 of the year, and its latest season end was on day 110 of the year. For Alnus, these start and end dates ranged from days 46 to 111 of the year, and for Betula they ranged from days 97 to 135 of the year. Data only from these periods were used for model creation and evaluation (Fig. 2). Twenty-four final models were created.

Variable importance
Variable importance for Corylus, Alnus, and Betula models shared similar temporal and spatial properties. The values of pollen counts from one day before were the most important variable, while the values from 4 days before were the least. Moreover, variables from the same site were the most important in the majority of the models.
For six of eight of the Corylus models, the most important variable was the pollen count from one day (Gdańsk, Lublin, Sosnowiec, and Łódź) or 2 days (Poznań) before. Only in the case of models of pollen concentration levels in Kraków and Rzeszów was the most important independent variable from a different city: Sosnowiec in both cases. In Corylus models, the low importance of pollen concentration inputs from 4 days before was noticeable (Figs. 3, 6).
Variable importance for Alnus models was the least uniform. Only in half of the models (Lublin, Poznań, Rzeszów, and Szczecin) was the most important variable from the same city as the output. Also, in four of eight models (Poznań, Łód_ z, Kraków, Sosnowiec), the most important variable was the pollen count in Poznań from one day before. Exceptionally, the most important variable for predicting pollen concentration levels in Gdańsk was data from Sosnowiec. The low impact of variables from 4 days before was also apparent (Figs. 4, 6).
Variable importance for Betula models showed some regularities. In six of eight models, the most important variables were pollen concentration from a day before at the same site as the outcome. In addition, the pollen count values at the same site from 2 to 3 days before had a visible influence. The order of variable importance was different in two models. The most important variable for the Betula model in Sosnowiec was the pollen concentration value in Kraków from one day before; and the pollen concentration value in Poznań from one day before was the most important variable for the Szczecin model. Moreover, models for Gdańsk and Lublin were built based mainly on the values from the same site. In the rest of the models, many independent variables were important. Inputs from Gdańsk also had a small impact on the rest of the models. In addition, for most of the models, variables from 4 days before had either little importance or no importance at all (Figs. 5, 6).

Performance of the models
A confusion matrix, Kappa statistic, sensitivity, specificity, and balanced accuracy were used to evaluate the models on the test sets ( Fig. 7; Table 4). Corylus models showed the lowest Kappa values. In two cases (Gdańsk and Poznań), Kappa statistics were equal to 0. Only the model of Lublin had a substantial Kappa value (0.68).; it correctly predicted 10 of 14 cases with a high pollen concentration ( Fig. 7; Table 4).
The Kappa statistics also proved important for Alnus models, with an average value of 0.7. The minimum Kappa value was for the Gdańsk model (0.6), and the maximum was for the Łódź model (0.86). However, most of the Alnus models had a specificity lower than the Betula models. Only models for Lublin and Łodź specificity exceeded 0.8 (0.88, 0.82, respectively). The lowest specificity was found for the model for Gdańsk: 0.54. For the test set, the model of pollen concentration group of Gdańsk predicted correctly only 7 of 13 cases of high pollen concentration ( Fig. 7; Table 4).
Based on th given criteria, models of Betula were the most reliable. Their average Kappa was 0.73, with a minimum value of 0.62 for Szczecin and a maximum of 0.81 for Kraków and Sosnowiec. Moreover, all of the specificity values for Betula models exceeded 0.8. In all of the Betula models, specificity values were higher than the Kappa statistic ( Fig. 7; Table 4).  True and false predictions were compared with temperature and precipitation. The results indicated that the rainfall was connected with a false prediction of high level in Alnus (p value ¼ 0:0018) and Betula (p value ¼ 0:0000002) models. However, this relation was not found in Corylus (p value ¼ 0:12) models. Additionally, true and false predictions were compared to the day-to-day changes in precipitation and temperature. The results showed that the final models were robust to changes in the precipitation; however, predictions of high level of Corylus (p value ¼ 0:0016), Alnus (p value ¼ 0:01), and Betula (p value ¼ 0:001) are sensitive to the changes in temperature.

Discussion
One of the main goals in aerobiological models is to predict pollen concentration levels which can trigger the onset of allergic symptoms. Stepwise multiple regression (Bringfelt et al. 1982;Myszkowska 2013), additive logistic models, partially linear models (Cotos-Yáñez et al. 2004), artificial neural networks (Castellano-Méndez et al. 2005), ARIMA models (Rodriguez-Rajo et al. 2006), and stochastic gradient boosting (Hilaire et al. 2012) have been used in aerobiological studies aimed at pollen concentration prediction. Most pollen predictive modeling studies have focused on the impact of meteorological variables (such as temperature, humidity, precipitation, wind direction, and speed), on pollen season start and duration, and on pollen concentration (Bringfelt et al. 1982;Cotos-Yáñez et al. 2004;Castellano-Méndez et al. 2005;Rodriguez-Rajo et al. 2006;Hilaire et al. 2012;Myszkowska 2013). Only Castellano-Méndez et al. (2005) attempted to forecast the level of allergenic risk associated with Betula using previous Betula pollen and meteorological information. However, to the best of our knowledge, empirical predictive models have not used pollen count values from other sites before now. Previous research shows that Corylus, Alnus, and Betula pollen concentration are correlated not only in time, but also in space (Nowosad et al. 2015). Nowadays, daily pollen concentration data are a result of manual pollen counting. Thus, information about the pollen count from the previous day is not available fast enough to be used for predicting levels of pollen concentration. However, there are many efforts to create a semiautomatic and automatic systems for counting airborne pollen (Boucher et al. 2002;Holt and Bennett 2014). As a result of these studies, it should be possible to obtain information about the pollen data from the day before quickly enough to be used by a forecast system.
In this study, random forest (Breiman 2001) was used to predict the pollen concentration level of Fig. 3 Scaled variable importance of each predictor (pollen count data with 1-day lag (t À 1), 2-day lag (t À 2), 3-day lag (t À 3), 4-day lag (t À 4) at given sites) for Corylus in each location. For better spatial relations recognition, a schematic map with measurement sites was provided Aerobiologia (2016) 32:453-468 461 Corylus, Alnus, and Betula using a spatiotemporal correlation of pollen count values at the given sites. The use of random forest presents a distinct advantage with respect to classical statistical methods: the ability to process complex, nonlinear relationships between predictors (Recknagel 2001). The algorithm of random forest is based on the ensemble of a large number of decision trees (Breiman 2001). Consequently, random forest has the advantage of tree-based models over artificial neural networks or support vector machines: interpretability (Geurts et al. 2009). A random forest model can be explained by visualization of decision trees or by using measures of variable importance.
The models of Alnus and Betula had, at the least, considerable values of model evaluation statistics. More than 81 % of events with high Betula pollen concentration could be predicted at each of the given Fig. 4 Scaled variable importance of each predictor (pollen count data with 1-day lag (t À 1), 2-day lag (t À 2), 3-day lag (t À 3), 4-day lag (t À 4) at given sites) for Alnus in each location. For better spatial relations recognition, a schematic map with measurement sites was provided sites. The models for Corylus showed low values of performance statistics in most cases. There are a few possible explanations for this. Firstly, there were insufficient events with high pollen concentration levels in the training/test sets and small values of Corylus pollen count generally. In the years 2003-2005 and 2009-2011, high Corylus pollen counts occurred only between 6 and 43 times (26 on average), and Corylus pollen concentration was lower than 19 grains/m 3 on 90 % of the analyzed days. Secondly, Corylus models could be highly overfit, possibly due to a relatively short time series. The start and course of Corylus pollen season strongly depends on the type of habitat. Corylus pollen season starts sooner in sunny locations and cities, where only a few degrees above zero are enough to start the pollination. On the Fig. 5 Scaled variable importance of each predictor (pollen count data with 1-day lag (t À 1), 2-day lag (t À 2), 3-day lag (t À 3), 4-day lag (t À 4) at given sites) for Betula in each location. For better spatial relations recognition, a schematic map with measurement sites was provided Aerobiologia (2016) 32:453-468 463 contrary, Corylus pollen season starts later in the sunless sites or forests. Therefore, its pollen season lasts longer than Alnus or Betula (Puc and Kasprzyk 2013). The pollen season of Alnus starts on average one weeks after Corylus in Poland (Nowosad et al. 2015). Alnus is also less common in the cities, and its start of flowering is less changeable between trees (Bugała 2000;Puc and Kasprzyk 2013). As a result, the relationship between Corylus pollen data in different cities is more complex than Alnus or Betula. Longer time series could reveal more information about Corylus spatiotemporal properties. Therefore, the variable importance of Corylus models should be interpreted with caution. In this study, the model for Betula correctly forecasted between 81 and 95 % of the days with high pollen concentrations in the test set. This model performance is similar to the previous work of Castellano-Méndez et al. (2005), who used artificial neural networks for predicting whether Betula pollen concentrations exceed certain thresholds-20, 30, 70, and 80 g/m 3 -using previous pollen and meteorological information. The artificial neural networks model predicted between 83 and 100 % of overlevel pollen days on the validation set (years 2000 and 2001). Thus, the models based on previous pollen counts from several sites could serve as an alternative to models based on pollen and meteorological data from a single analyzed site. The importance of independent variables showed a clear temporal and spatial dependency. In 23 of 24 models, variables from a day before had the largest impact. Moreover, input from the same site as the output was the most important in 16 of 24 models. In Alnus models, a high impact of variables from Poznań on Łódź, Sosnowiec, and Kraków was observed. This relationship was also found for Betula models. At the same time, in Betula models, aside from those at Szczecin and Sosnowiec, the input values from the nearest sites had clear importance; for example, Rzeszów and Sosnowiec for Kraków; Poznań for Łódź; Szczecin and Łódź for Poznań; and Poznań for Szczecin. One possible explanation is that neighboring stations have similar weather conditions and therefore  (Kotas et al. 2013). It should be noted that variable importance is not a measure of causation. Pollen production and dispersion is affected by many factors: regional flora, land use, vegetation structure, topoclimate, and weather conditions. Pollen Fig. 7 Confusion matrices and specificity/sensitivity for test sets of pollen concentration level prediction for Corylus, Alnus, and Betula at each location Aerobiologia (2016) 32:453-468 465 concentration in air cannot be described as a linear effect of the impact of these factors. Despite the fact that Alnus and Betula models had substantial prediction quality, some of the events of low or high pollen level were wrongly classified. This is connected mainly with unusual events, such as no pollen or low pollen concentration at monitoring sites on the days before and high pollen count at a given site. Or it may be the opposite: low pollen concentration at a given site and high pollen count at monitoring sites in the preceding days. The occurrence of these situations can be partially explained by the influence of atmospheric conditions. The pollen concentration level could be low during the rainfall and high on the next day with a dry weather. The rapid day-to-day temperature changes also can be the cause of the models errors. Additionally, the occurrence of wrongly classified cases could probably be explained by other factors, such as random local events or changes in scale smaller than those analyzed.
Gdańsk distinguished itself from the other sites. Days with high pollen concentration of Corylus, Alnus, and Betula were less frequent there. In most of the models, independent variables from Gdańsk had little or no importance. Moreover, the prediction quality for Gdańsk was the lowest in the Corylus and Alnus models. As has been reported previously (Nowosad et al. 2015), Gdańsk has different pollen characteristics from other Polish sites. Its northern, coastal location has an impact on the local climate, and the start of the growing season is usually delayed there.

Conclusions
In this study, data from eight Polish monitoring sites over six years were used. The final 24 models are not necessarily the best ones in terms of prediction quality. The Corylus models performed poorly, which could be a mixed result of (1) insufficient events with high pollen concentration level in the training/test set, (2) highly overfit models, and (3) fast change of pollen autocorrelation drop. On the other hand, the study has clearly shown that it is possible to predict the occurrence of days with high pollen concentration of Alnus and Betula using past pollen count data from monitoring sites. For these taxa, random forest models offer capabilities for forecasting pollen concentration levels, with substantial accuracy. The models are an alternative to pollen concentration models based on weather conditions, and they show promise as a useful source of information on high pollen concentration levels for allergists and their patients. It would thus be worthwhile to combine two groups of independent variables-meteorological and aerobiological-from several sites to improve models for predicting pollen concentrations which exceed threshold values. An analysis of longer time periods or a denser monitoring network could also result in better model quality, especially in the case of Corylus.