Spatiotemporal models for predicting high pollen concentration level of Corylus, Alnus, and Betula
Corylus, Alnus, and Betula trees are among the most important sources of allergic pollen in the temperate zone of the Northern Hemisphere and have a large impact on the quality of life and productivity of allergy sufferers. Therefore, it is important to predict high pollen concentrations, both in time and space. The aim of this study was to create and evaluate spatiotemporal models for predicting high Corylus, Alnus, and Betula pollen concentration levels, based on gridded meteorological data. Aerobiological monitoring was carried out in 11 cities in Poland and gathered, depending on the site, between 2 and 16 years of measurements. According to the first allergy symptoms during exposure, a high pollen count level was established for each taxon. An optimizing probability threshold technique was used for mitigation of the problem of imbalance in the pollen concentration levels. For each taxon, the model was built using a random forest method. The study revealed the possibility of moderately reliable prediction of Corylus and highly reliable prediction of Alnus and Betula high pollen concentration levels, using preprocessed gridded meteorological data. Cumulative growing degree days and potential evaporation proved to be two of the most important predictor variables in the models. The final models predicted not only for single locations but also for continuous areas. Furthermore, the proposed modeling framework could be used to predict high pollen concentrations of Corylus, Alnus, Betula, and other taxa, and in other countries.
Keywords: Allergenic pollen, Betulaceae, Predictive modeling, Spatiotemporal models, Machine learning, Random forest
Corylus L. (hazel), Alnus Mill. (alder), and Betula L. (birch) are considered to be among the most important sources of allergic pollen in the temperate zone of the Northern Hemisphere (D’Amato et al. 2007). According to Heinzerling et al. (2009), approximately 21–24 % of Europeans are sensitized to tree pollen from the Betulaceae family. These rates in Poland are 22.3, 22.8, and 27.7 %, respectively, for Corylus, Alnus, and Betula (Heinzerling et al. 2009). There are also high levels of cross-reactivity between Corylus, Alnus, and Betula (Ebner et al. 1995). As a consequence, Corylus and Alnus pollination can lead to more marked clinical symptoms during a Betula pollen season (D’Amato et al. 2007).
Pollen concentration in the air results from many factors with different temporal and spatial variability. The spatial distribution of the taxa and phytosociological and habitat relationships mainly affect the temporal variability and intensity of pollen seasons. Moreover, meteorological factors have an impact not only on the production and release but also on the dispersal of tree pollen grains. Previous studies found a relationship between the temperature in the preceding year and the annual pollen sum (Latałowa et al. 2002; Rasmussen 2002). The influence of air temperature on pollen concentration has often been reported (Rodríguez-Rajo et al. 2004; Puc 2007, 2012; Kizilpinar et al. 2011). The impact of other meteorological parameters, such as precipitation, wind speed, and humidity, has also been reported (Latałowa et al. 2002; Puc 2007, 2012). In addition, recent studies have shown that the temporal variations in Corylus, Alnus, and Betula pollen counts are related to three groups of factors. The temporal spans of these factors are (i) daily, (ii) approximately 3.5 days, and (iii) more than 15 days (Nowosad et al. 2015).
Spatial analyses in aerobiology mainly involve the following: the comparison between two or more different locations (Stach et al. 2008; Puc and Kasprzyk 2013; Sauliene et al. 2014); the description of spatial variation of pollen season properties or pollen concentrations (Emberlin et al. 2002; Rieux et al. 2008; Myszkowska et al. 2010; Nowosad et al. 2015); or the investigation of pollen transportation using back trajectories (Skjoth et al. 2008, 2009; Veriankaite et al. 2009; Rojo and Pérez-Badia 2015). There have been only a few studies in which spatial models of Betula pollen count were built. Vogel et al. (2008) included the parametrisation of the emissions of Betula pollen into a non-hydrostatic mesoscale model. Sofiev et al. (2013) used the SILAM dispersion model to create a Betula pollen emission model. To the author’s knowledge, spatial models of Corylus and Alnus pollen concentration have not been reported.
The main aim of this study was to develop spatiotemporal predictive models of Corylus, Alnus, and Betula pollen concentration levels, using preprocessed gridded meteorological data. Based on the final models, it is possible to predict pollen concentration levels, not only in aerobiological monitoring sites but also at unsampled locations.
Materials and methods
Population and area of the cities with the aerobiological monitoring sites; longitude, latitude, altitude, and the studied years of the aerobiological monitoring sites
Based on first symptom values for patients allergic to each taxon, two levels of concentration (low and high) were distinguished (Rapiejko et al. 2007). The limits were set at 35 grains/m³ for Corylus, 45 grains/m³ for Alnus, and 20 grains/m³ for Betula (Fig. 2).
AGRI4CAST Interpolated Meteorological Data (Baruth et al. 2007) were used as the main input data. The AGRI4CAST database is a collection of daily meteorological parameters from weather stations interpolated to a 25 × 25 km grid and contains data from 1975 to 2014. For the purpose of this study, grid data were restricted to the area of Poland and a zone of 200 km around Polish borders. The buffer value was based on the longest distance between the nearest aerobiological sites (Szczecin and Poznań): approximately 200 km.
Training set, which contained 2/3 of the data from eight cities in Poland (Gdańsk, Kraków, Lublin, Olsztyn, Poznań, Rzeszów, Siedlce, and Szczecin). The data were split randomly based on the dates available in this study.
First test set, which contained the remaining 1/3 of the data from the same eight cities (Gdańsk, Kraków, Lublin, Olsztyn, Poznań, Rzeszów, Siedlce, and Szczecin).
Second test set, which contained data from Bydgoszcz, Łódź, and Sosnowiec.
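The three-way split described above can be sketched as follows. This is a minimal illustration rather than the author's code: the record layout `(city, date, level)`, the function name `split_records`, and the fixed random seed are assumptions made for the example. Dates from the eight modeling cities are shuffled and two thirds of them go to the training set, while all records from the three held-out cities form the second test set.

```python
import random

# Cities reserved for the spatial (second) test set, as listed in the text
HELD_OUT_CITIES = {"Bydgoszcz", "Łódź", "Sosnowiec"}


def split_records(records, train_fraction=2 / 3, seed=0):
    """Split (city, date, level) records into training, first test, second test.

    The record layout and the seed are assumptions made for this sketch.
    """
    second_test = [r for r in records if r[0] in HELD_OUT_CITIES]
    remaining = [r for r in records if r[0] not in HELD_OUT_CITIES]

    # Split randomly by date, so a given date falls entirely in one set
    dates = sorted({r[1] for r in remaining})
    rng = random.Random(seed)
    rng.shuffle(dates)
    n_train = round(len(dates) * train_fraction)
    train_dates = set(dates[:n_train])

    train = [r for r in remaining if r[1] in train_dates]
    first_test = [r for r in remaining if r[1] not in train_dates]
    return train, first_test, second_test
```

Splitting by date rather than by individual record is one reading of "split randomly based on the dates"; other readings are possible.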
The average monthly temperatures for each month over the previous year for each site
Four- and 16-day averages, calculated for each of the meteorological parameters. The temporal span of these factors was based on a recent study which showed that the temporal variations in Corylus, Alnus, and Betula pollen counts are related to factors that change (i) diurnally, (ii) approximately every 3.5 days, and (iii) in more than 15 days (Nowosad et al. 2015). These values were then lagged by 1 day
Cumulated growing degree days (GDD), lagged by 1 day
Longitude, latitude, and altitude of grid cell
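The lagged 4- and 16-day averages in the list above could be computed along these lines. This is a hedged sketch: the function name `lagged_mean` and the exact window alignment (the window ends on the day before the day being predicted) are assumptions, since the text only states that the averages were lagged by 1 day.

```python
def lagged_mean(values, window, lag=1):
    """Mean of `window` consecutive daily values ending `lag` days before each day.

    Returns None for days that do not yet have enough history.
    """
    result = []
    for day in range(len(values)):
        end = day - lag + 1      # exclusive end index of the window
        start = end - window     # inclusive start index of the window
        if start < 0:
            result.append(None)
        else:
            result.append(sum(values[start:end]) / window)
    return result


# Example with six days of a hypothetical daily series
series = [1, 2, 3, 4, 5, 6]
print(lagged_mean(series, window=4))  # [None, None, None, None, 2.5, 3.5]
```

With `window=16` the same function gives the 16-day variant; the value for a given day never uses that day's own measurement.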
Explanation of the predictor variable abbreviations used in spatiotemporal modeling of Corylus, Alnus, and Betula pollen concentration levels
Predictor variable name
Average monthly temperature for January in the preceding year
Average monthly temperature for February in the preceding year
Average monthly temperature for March in the preceding year
Average monthly temperature for April in the preceding year
Average monthly temperature for May in the preceding year
Average monthly temperature for June in the preceding year
Average monthly temperature for July in the preceding year
Average monthly temperature for August in the preceding year
Average monthly temperature for September in the preceding year
Average monthly temperature for October in the preceding year
Average monthly temperature for November in the preceding year
Average monthly temperature for December in the preceding year
Average maximum temperature in preceding 4 days
Average maximum temperature in preceding 16 days
Average minimum temperature in preceding 4 days
Average minimum temperature in preceding 16 days
Average vapor pressure in preceding 4 days
Average vapor pressure in preceding 16 days
Average wind speed in preceding 4 days
Average wind speed in preceding 16 days
Average daily precipitation in the preceding 4 days
Average daily precipitation in the preceding 16 days
Average potential evaporation in the preceding 4 days
Average potential evaporation in the preceding 16 days
Average total global radiation in the preceding 4 days
Average total global radiation in the preceding 16 days
Cumulated growing degree days (GDD) lagged by one day
Grid cell longitude
Grid cell latitude
Average altitude of grid cell
GDDs were accumulated by adding the number of degree days that accumulated each day from January 1. The base temperature was designated as 5 °C, which is the standard threshold temperature for growth in temperate species (Dahl et al. 2013). If the daily maximum temperature is not higher than the base temperature, then no degree days accumulate.
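A minimal sketch of this accumulation is given below. The text does not state the daily degree-day formula, so the one used here (daily mean of maximum and minimum temperature minus the base, floored at zero) is an assumption, as is the function name `cumulative_gdd`. The base temperature of 5 °C and the rule that nothing accumulates when the daily maximum does not exceed the base come from the text.

```python
BASE_TEMPERATURE = 5.0  # degrees Celsius, as stated in the text


def cumulative_gdd(t_max, t_min, base=BASE_TEMPERATURE):
    """Cumulative growing degree days for daily series starting on January 1."""
    total = 0.0
    cumulative = []
    for high, low in zip(t_max, t_min):
        if high > base:
            # Assumed daily formula: mean temperature minus base, never negative
            total += max(0.0, (high + low) / 2 - base)
        cumulative.append(total)
    return cumulative


# Four hypothetical days: only the second and third contribute
print(cumulative_gdd([4, 10, 12, 3], [-2, 2, 6, 0]))  # [0.0, 1.0, 5.0, 5.0]
```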
Random forest (Breiman 2001) was used to spatiotemporally predict the pollen level of Corylus, Alnus, and Betula. For classification tasks, it is an ensemble of unpruned classification trees. The prediction is made by aggregating the prediction of the ensemble. The random forest algorithm uses two parameters: ntree (the number of trees) and mtry (the number of input variables randomly chosen at each split). In this study, ntree was set to 500, while optimal values of mtry were obtained by using 100 repetitions of ten-fold cross-validation on the training set.
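The resampling scheme used to tune mtry can be illustrated without any modeling library. In the sketch below, `repeated_kfold`, `select_mtry`, and the `score` callback are hypothetical names introduced for illustration; the callback stands in for fitting a 500-tree forest on the training indices and scoring it on the validation indices, and the tuning metric is left open because the text does not name it.

```python
import random


def repeated_kfold(n_samples, n_folds=10, n_repeats=100, seed=0):
    """Yield (train_indices, validation_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)                      # new random partition per repeat
        for fold in range(n_folds):
            validation = indices[fold::n_folds]   # every n_folds-th shuffled index
            held_out = set(validation)
            train = [i for i in indices if i not in held_out]
            yield train, validation


def select_mtry(candidates, n_samples, score, **cv_options):
    """Return the candidate mtry with the highest mean score over all splits.

    `score(mtry, train, validation)` is a hypothetical callback supplied by
    the caller; it is not part of any library.
    """
    mean_scores = {}
    for mtry in candidates:
        scores = [score(mtry, train, validation)
                  for train, validation in repeated_kfold(n_samples, **cv_options)]
        mean_scores[mtry] = sum(scores) / len(scores)
    return max(mean_scores, key=mean_scores.get)
```

With the defaults, each candidate value of mtry is evaluated on 1000 train/validation splits (100 repetitions of 10 folds).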
Evaluation of the models’ performance
Corylus, Alnus, and Betula models were evaluated on two test sets. Firstly, the models’ temporal performance was determined by comparing true pollen concentration levels with predictions on the first test set. Secondly, the models’ predictions were compared with true pollen concentration levels from Bydgoszcz, Łódź, and Sosnowiec (the second test set). Data from these cities were not used for model creation. Thus, this evaluation was used to determine the spatial quality of the models.
For each taxon, the final model had different probability thresholds. In the Corylus models, the probability threshold dividing low and high pollen concentration levels was optimized to 0.22. In the other taxa, the class imbalance was less severe, and thus the optimal probability threshold value was higher: 0.32 for Alnus and 0.42 for Betula (Fig. 3).
Performance of the models
Temporal performance of the models
The positive predictive value of the Corylus model on the first test set was 0.47. However, correctly predicting high pollen concentration levels is more important than avoiding the misclassification of low levels as high. The Corylus model performed reasonably well in predicting high levels of pollen concentration, with a sensitivity of 0.61. The Alnus model correctly predicted 203 out of 288 occurrences of days with high pollen concentration (sensitivity = 0.70). Moreover, the Alnus model’s positive predictive value was distinctly higher than that of Corylus. The Betula model showed the best performance on the first test set. The model correctly classified approximately 88 % of days with high pollen concentration levels. The Kappa statistic value was 0.83, indicating a very high fit for the model.
Spatial performance of the models
The spatial quality differed distinctly between the models of each taxon. The Corylus model showed the lowest predictive capability. The model correctly predicted high pollen concentration levels in 40 out of 81 days (sensitivity = 0.49). The same model incorrectly classified 40 cases as high levels of pollen. The Alnus model performed better on the second test set. Both the model’s sensitivity and positive predictive value were clearly higher: 0.61 and 0.59, respectively. The Alnus model correctly predicted 110 occurrences of high pollen concentration levels and misclassified 76 cases as high level. On the second test set, the performance of the Betula model was found to be the best. The model Kappa statistic was 0.80, the sensitivity was 0.87, and the positive predictive value was 0.81. High Betula pollen concentration levels were correctly predicted in 394 of 451 cases. At the same time, only 94 days were incorrectly classified as high level.
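The reported statistics follow directly from the counts given in this section. The short sketch below recomputes sensitivity and positive predictive value from those counts; the function names are illustrative. A Cohen's kappa helper is included for completeness, but it is not applied to the reported figures because the number of true negatives is not given in the text.

```python
def sensitivity(true_positives, actual_positives):
    """Share of truly high days that were predicted as high."""
    return true_positives / actual_positives


def positive_predictive_value(true_positives, false_positives):
    """Share of days predicted as high that were truly high."""
    return true_positives / (true_positives + false_positives)


def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2 x 2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected)


# Second test set, counts taken from the text
print(round(sensitivity(394, 451), 2))               # Betula: 0.87
print(round(positive_predictive_value(394, 94), 2))  # Betula: 0.81
print(round(positive_predictive_value(110, 76), 2))  # Alnus: 0.59
print(round(sensitivity(40, 81), 2))                 # Corylus: 0.49
```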
Corylus, Alnus, and Betula pollen have an enormous impact on the quality of life and the productivity of allergy sufferers. Therefore, these tree pollen are the origin of significant social and financial burdens. Many aerobiological studies have been conducted in response to this problem, some of which have tried to build predictive models of pollen concentration (Bringfelt et al. 1982; Cotos-Yáñez et al. 2004; Castellano-Méndez et al. 2005; Rodriguez-Rajo et al. 2006; Vogel et al. 2008; Hilaire et al. 2012; Puc 2012; Sofiev et al. 2013). As a result of these studies, it is possible to predict high pollen concentration levels with considerable accuracy in the analyzed sites. However, in many countries, the aerobiological network is not dense, and therefore, it is not possible to predict pollen counts in unsampled locations. In this study, gridded meteorological data were used as predictor variables to build a model of high Corylus, Alnus, and Betula pollen concentration levels for spatially continuous areas of Poland.
The days with high pollen concentration levels of the analyzed taxa occur rarely. This property should be taken into consideration when building predictive models. This study used a novel technique of obtaining an optimal threshold, by minimizing the distance between sensitivity, specificity, positive predictive value, and negative predictive value and the best possible performance. Preliminary studies showed that, in this two-class problem, the optimizing probability threshold technique surpasses other strategies for overcoming class imbalances, such as upsampling and downsampling.
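The threshold search described above can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: Euclidean distance to the ideal point (1, 1, 1, 1) is assumed as the distance measure, undefined ratios are treated as 0, and all function names are hypothetical.

```python
import math


def confusion_counts(probabilities, is_high, threshold):
    """Count TP, FP, TN, FN when 'high' is predicted for probability >= threshold."""
    tp = fp = tn = fn = 0
    for probability, actual in zip(probabilities, is_high):
        predicted = probability >= threshold
        if predicted and actual:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    # Assumption: an undefined metric (empty denominator) counts as 0
    return numerator / denominator if denominator else 0.0


def distance_to_ideal(tp, fp, tn, fn):
    """Euclidean distance between the four metrics and perfect performance."""
    metrics = (
        ratio(tp, tp + fn),  # sensitivity
        ratio(tn, tn + fp),  # specificity
        ratio(tp, tp + fp),  # positive predictive value
        ratio(tn, tn + fn),  # negative predictive value
    )
    return math.sqrt(sum((1 - m) ** 2 for m in metrics))


def best_threshold(probabilities, is_high, candidates):
    """Return the candidate threshold closest to perfect performance."""
    return min(
        candidates,
        key=lambda t: distance_to_ideal(*confusion_counts(probabilities, is_high, t)),
    )
```

A call such as `best_threshold(probs, labels, [i / 100 for i in range(1, 100)])` would scan thresholds from 0.01 to 0.99, where `probs` are the predicted probabilities of the high class and `labels` the observed levels.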
In the Corylus, Alnus, and Betula models, cumulated growing degree days lagged by 1 day proved to be one of the most important variables. Fully growth-competent buds need stimulation before they can burst; therefore, the occurrence of temperatures above a certain base level is required (Dahl et al. 2013). Secondly, most of the 16-day averages of meteorological factors (daily potential evaporation, total global radiation, vapor pressure, minimum temperature, and maximum temperature) showed high values of variable importance. Previous studies showed that the readiness to flower is dependent on, inter alia, light intensity and evaporation (Pacini and Hesse 2004; Dahl et al. 2013). In contrast, the preceding year’s average monthly temperatures and the grid cell longitude, latitude, and altitude had little impact on the models. Although the studies of Latałowa et al. (2002) and Rasmussen (2002) found a relationship between the annual pollen sum and the mean temperature in the preceding year, this relationship has a small influence on the daily pollen concentration.
The Corylus, Alnus, and Betula models varied in terms of predictive quality. The Corylus model predicted correctly approximately 55 % of the high pollen concentration levels on the test sets. This model misclassification could be connected with very rare (330 cases, about 2.5 % of the analyzed days) occurrences of high Corylus pollen concentration levels. Corylus inflorescences produce about two times fewer pollen grains than Alnus inflorescences (Piotrowska 2008). Thus, a dataset of longer time periods or a denser monitoring network could result in a more precise model. The Alnus model performed better, with correct prediction of approximately 2/3 of high pollen concentration levels on the test sets. The problem of class imbalance was less severe in the Alnus dataset. Nonetheless, Alnus (and Corylus) pollen seasons are highly changeable from year to year. In addition, the location of aerobiological monitoring sites influences the variability of the pollen count of these taxa (Nowosad et al. 2015). The Betula model had the best values of model evaluation statistics. Almost 88 % of high pollen concentration levels were correctly predicted on the test sets. The negative impact of class imbalance was modest due to the relatively frequent occurrence of high Betula pollen concentration levels. Moreover, the Betula pollination period is relatively short and less changeable, and therefore, the Betula pollen count is more predictable.
The predictive quality of the Betula model is comparable to previous work. Castellano-Méndez et al. (2005) created a neural network model for prediction of the risk of pollen concentration values exceeding a given level, using pollen and meteorological data. That model was built and validated in only one location: the city of Santiago, Spain. In contrast, in this study data from 11 aerobiological sites, as well as gridded meteorological data, were used in the process of model creation; therefore, model prediction should be verifiable in substantial areas surrounding the aerobiological monitoring sites.
Relationships between pollen concentration in the air and meteorological factors are complex and strongly nonlinear. Thus, classical statistical models, such as logistic regression or linear discriminant analysis, tend to perform poorly. Machine learning techniques can find patterns in nonlinear, noisy data and generate predictions with relatively high accuracy (Recknagel 2001). Some of the most often used machine learning methods include nonlinear classification models (e.g., neural networks and support vector machines) and tree-based models (e.g., classification trees and random forest) (Kuhn and Johnson 2013). Random forest proved to give more accurate predictions than single tree models (Breiman 2001). In addition, compared to neural networks and support vector machines, this technique requires minimal preprocessing of the data. However, no single modeling technique works best for every problem (Wolpert 1996). Therefore, it would be worthwhile to compare the performance of different machine learning models for predicting pollen concentration.
Prediction errors of the Corylus, Alnus, and Betula models are the result of a combination of numerous factors: (i) omission of some non-meteorological predictors, (ii) influence of medium- and long-range pollen transport, and (iii) temporal and spatial uncertainty of pollen data. Meteorological conditions are not the only factors that influence pollen concentration values. Under the same meteorological conditions, high or low pollen concentration levels of Corylus, Alnus, and Betula can be observed on different occasions. Pollen concentration in the air is a result of nonlinear interactions between many factors, such as the land cover, topography, and human impact (Piotrowska and Kubik-Komar 2012). Taking into account the proportion of the analyzed taxa in the local vegetation could positively influence model quality. Previous studies also showed that most of the recorded airborne pollen comes from local sources (Adams-Groom et al. 2002; Damialis et al. 2005). Nevertheless, medium- and long-range transport is also often recorded, as pollen grains are found hundreds or thousands of kilometers away from their source (Damialis et al. 2005; Ranta et al. 2006). In the Corylus, Alnus, and Betula models, the effect of long-range transport is not included. Moreover, uncertainty in the results of the models could be connected with several characteristics of the data. The results of aerobiological monitoring are not the exact values of the pollen concentration of the surrounding area but are subject to errors from various sources, such as the device, preparation of the sampling surface and slides, and slide analysis (Gottardini et al. 2009). In addition, there is diurnal variation in the number of pollen grains in the air (Galán et al. 1991; Skjoth et al. 2008). It is estimated that approximately 10 % of Corylus, Alnus, and Betula pollen count variations can be due to diurnal fluctuations and measurement errors (Nowosad et al. 2015).
Only 11 aerobiological monitoring sites, which are not randomly distributed in Poland, were used in this study. The sites are located mainly in large cities, where the local climate is modified by human activities. These cities are significantly warmer than the surrounding rural areas, on average by 0.8–1.3 °C (Szymanowski 2005). Furthermore, the local airflow and turbulence are affected by buildings and non-building structures (Emberlin and Norris-Hill 1991). As a result, the deposition patterns in cities are different from those in the countryside (Emberlin and Norris-Hill 1991; Gonzalo-Garijo et al. 2006). Moreover, given the lack of sites in mountainous areas, caution should be exercised when using prediction models in those areas. In the long term, it will be valuable to add monitoring sites in remote rural areas, as well as in mountainous areas.
The modeling framework used in this study can be used as the basis for further research. The models are built based on meteorological factors and could be easily implemented in other countries. Moreover, it would be worthwhile to analyze the possibility of improving the models’ quality by utilizing non-meteorological parameters, such as the distribution of tree species and local land use.
In this study, the probability of high pollen concentration levels of Corylus, Alnus, and Betula was predicted using preprocessed gridded meteorological data. The result of the models could be used for prediction in continuous areas rather than just in single locations
The models built allow moderately reliable predictions of high pollen concentration levels of Corylus and highly reliable predictions of high levels of Alnus and Betula pollen
Temporal verifiability was higher than spatial verifiability in each of the Corylus, Alnus, and Betula models
Average monthly temperatures for the preceding year were not very important for the results of the models
Cumulated growing degree days was one of the most important variables in the Corylus, Alnus, and Betula models. In addition, sixteen-day averages of potential evaporation, total global radiation, vapor pressure, minimum temperature, and maximum temperature were important variables for the models.
Spatial variables such as latitude, longitude, and altitude had little impact on the models
The modeling framework could be applied in predicting high pollen concentrations of the different pollen taxa in the study sites and also in other areas
This study was carried out within the framework of the project no. NN305 321936 financed by the Ministry of Science and Higher Education. The author is grateful to Kazimiera Chłopek, Łukasz Grewling, Idalia Kasprzyk, Małgorzata Latałowa, Barbara Majkowska-Wojciechowska, Dorota Myszkowska, Krystyna Piotrowska, Malgorzata Puc, Piotr Rapiejko, Tomasz Stosik, Agnieszka Uruska, and Elżbieta Weryszko-Chmielewska for providing pollen data. Thanks are also due to Alfred Stach for his valuable feedback.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.