Volume 34, Issue 3, pp 301–313

Statistical techniques for modeling of Corylus, Alnus, and Betula pollen concentration in the air

  • Jakub Nowosad
  • Alfred Stach
  • Idalia Kasprzyk
  • Kazimiera Chłopek
  • Katarzyna Dąbrowska-Zapart
  • Łukasz Grewling
  • Małgorzata Latałowa
  • Anna Pędziszewska
  • Barbara Majkowska-Wojciechowska
  • Dorota Myszkowska
  • Krystyna Piotrowska-Weryszko
  • Elżbieta Weryszko-Chmielewska
  • Małgorzata Puc
  • Piotr Rapiejko
  • Tomasz Stosik
Original Paper


Prediction of allergenic pollen concentration is one of the most important goals of aerobiology. Past studies have used a broad range of modeling techniques; however, the results cannot be directly compared owing to the use of different datasets, validation methods, and evaluation metrics. The main aim of this study was to compare nine statistical modeling techniques using the same dataset. An additional goal was to assess the importance of predictors for the best model. Aerobiological data for Corylus, Alnus, and Betula pollen counts were obtained from nine cities in Poland and covered between 5 and 16 years of measurements. Meteorological data from the AGRI4CAST project were used as predictor variables. The results of 243 final models (3 taxa \(\times\) 9 cities \(\times\) 9 techniques) were validated using repeated k-fold cross-validation and compared using relative and absolute performance statistics. Afterward, the variable importance of predictors in the best models was calculated and compared. Simple models performed poorly. In contrast, regression trees and rule-based models proved to be the most accurate for all of the taxa. Cumulative growing degree days proved to be the single most important predictor variable in the random forest models of Corylus, Alnus, and Betula. Finally, the study suggested potential improvements in aerobiological modeling, such as the application of robust cross-validation techniques and the use of gridded variables.
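The validation idea named in the abstract, repeated k-fold cross-validation applied identically to every technique, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the two toy models (a mean baseline and a one-predictor least-squares line) stand in for the nine techniques of the study, RMSE stands in for the study's performance statistics, and the helper names (`repeated_kfold_indices`, `cv_rmse`) are hypothetical.

```python
import math
import random
import statistics


def repeated_kfold_indices(n, k, repeats, seed=0):
    """Yield (train, test) index lists; the data are reshuffled for each repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k disjoint folds
        for i in range(k):
            test = folds[i]
            train = [j for m, fold in enumerate(folds) if m != i for j in fold]
            yield train, test


def fit_mean(x, y):
    """Baseline model: always predict the training mean."""
    mu = statistics.fmean(y)
    return lambda xi: mu


def fit_line(x, y):
    """Ordinary least squares with a single predictor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return lambda xi: intercept + slope * xi


def rmse(observed, predicted):
    """Root mean squared error of the predictions."""
    return math.sqrt(statistics.fmean((o - p) ** 2 for o, p in zip(observed, predicted)))


def cv_rmse(fit, x, y, k=5, repeats=3, seed=0):
    """Average test RMSE over all k * repeats splits."""
    scores = []
    for train, test in repeated_kfold_indices(len(x), k, repeats, seed):
        model = fit([x[i] for i in train], [y[i] for i in train])
        scores.append(rmse([y[i] for i in test], [model(x[i]) for i in test]))
    return statistics.fmean(scores)


# Synthetic data (an assumption, not the study's pollen or weather data).
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(100)]
y = [2.0 * xi + rng.gauss(0, 1) for xi in x]

# Both models see exactly the same splits because the seed is shared.
results = {name: cv_rmse(fit, x, y) for name, fit in
           {"mean": fit_mean, "line": fit_line}.items()}
```

Sharing the seed across models means every technique is scored on the same train/test partitions, which is what makes the resulting error estimates directly comparable.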


Keywords: Allergenic pollen; Pollen concentration in the air; Betulaceae; Regression models; Predictive modeling; Machine learning



This study was carried out within the framework of Project no. NN305 321936, financed by the Ministry of Science and Higher Education in Poland.



Copyright information

© Springer Science+Business Media B.V., part of Springer Nature 2018

Authors and Affiliations

  • Jakub Nowosad (1, corresponding author)
  • Alfred Stach (2)
  • Idalia Kasprzyk (3)
  • Kazimiera Chłopek (4)
  • Katarzyna Dąbrowska-Zapart (4)
  • Łukasz Grewling (5)
  • Małgorzata Latałowa (6)
  • Anna Pędziszewska (6)
  • Barbara Majkowska-Wojciechowska (7)
  • Dorota Myszkowska (8)
  • Krystyna Piotrowska-Weryszko (9)
  • Elżbieta Weryszko-Chmielewska (9)
  • Małgorzata Puc (10)
  • Piotr Rapiejko (11)
  • Tomasz Stosik (12)
  1. Space Informatics Lab, University of Cincinnati, Cincinnati, USA
  2. Institute of Geoecology and Geoinformation, Adam Mickiewicz University, Poznań, Poland
  3. Department of Ecology and Environmental Biology, University of Rzeszów, Rzeszów, Poland
  4. Faculty of Earth Sciences, University of Silesia, Sosnowiec, Poland
  5. Laboratory of Aeropalynology, Faculty of Biology, Adam Mickiewicz University, Poznań, Poland
  6. Department of Plant Ecology, University of Gdańsk, Gdańsk, Poland
  7. Department of Immunology, Rheumatology and Allergy, Faculty of Medicine, Medical University, Łódź, Poland
  8. Department of Clinical and Environmental Allergology, Jagiellonian University Medical College, Kraków, Poland
  9. Department of Botany, University of Life Sciences in Lublin, Lublin, Poland
  10. Department of Botany and Nature Conservation, Faculty of Biology, University of Szczecin, Szczecin, Poland
  11. Allergen Research Center, Warsaw, Poland
  12. Department of Botany and Ecology, University of Science and Technology, Bydgoszcz, Poland
