Statistical techniques for modeling of Corylus, Alnus, and Betula pollen concentration in the air

Abstract

Prediction of allergenic pollen concentration is one of the most important goals of aerobiology. Past studies have used a broad range of modeling techniques; however, the results cannot be directly compared owing to the use of different datasets, validation methods, and evaluation metrics. The main aim of this study was to compare nine statistical modeling techniques using the same dataset. An additional goal was to assess the importance of predictors for the best model. Aerobiological data for Corylus, Alnus, and Betula pollen counts were obtained from nine cities in Poland and covered between 5 and 16 years of measurements. Meteorological data from the AGRI4CAST project were used as predictor variables. The results of 243 final models (3 taxa \(\times\) 9 cities \(\times\) 9 techniques) were validated using repeated k-fold cross-validation and compared using relative and absolute performance statistics. Afterward, the variable importance of predictors in the best models was calculated and compared. Simple models performed poorly, whereas regression trees and rule-based models proved to be the most accurate for all taxa. Cumulative growing degree days proved to be the single most important predictor variable in the random forest models of Corylus, Alnus, and Betula. Finally, the study suggested potential improvements in aerobiological modeling, such as the application of robust cross-validation techniques and the use of gridded variables.
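The repeated k-fold cross-validation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`repeated_kfold_indices`, `cross_validate`, `fit_mean`, `predict_mean`), the values of `k` and `repeats`, the choice of RMSE as the error metric, and the toy data are all assumptions made for illustration. The sketch shuffles the samples anew in every repeat, splits them into k disjoint folds, holds each fold out once, and averages the error over all held-out folds.

```python
import random
from math import sqrt
from statistics import mean


def repeated_kfold_indices(n_samples, k=5, repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        folds = [order[i::k] for i in range(k)]  # k disjoint, near-equal folds
        for held_out in range(k):
            test_idx = folds[held_out]
            train_idx = [i for j, fold in enumerate(folds)
                         if j != held_out for i in fold]
            yield train_idx, test_idx


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(mean((o - p) ** 2 for o, p in zip(observed, predicted)))


def cross_validate(X, y, fit, predict, k=5, repeats=3, seed=0):
    """Return (mean RMSE over all held-out folds, number of folds evaluated)."""
    scores = []
    for train_idx, test_idx in repeated_kfold_indices(len(y), k, repeats, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        scores.append(rmse([y[i] for i in test_idx], preds))
    return mean(scores), len(scores)


# Hypothetical baseline "model": always predict the training mean.
def fit_mean(X_train, y_train):
    return mean(y_train)


def predict_mean(model, x):
    return model


# Toy data, invented for illustration only (not pollen measurements).
X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]

score, n_folds = cross_validate(X, y, fit_mean, predict_mean, k=5, repeats=3)
print(f"mean RMSE over {n_folds} held-out folds: {score:.2f}")
```

Because every technique is scored on the same folds when the same seed is reused, the resulting error estimates are directly comparable across models, which is the motivation the abstract gives for using one dataset and one validation scheme.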

Acknowledgements

This study was carried out within the framework of Project no. NN305 321936, financed by the Ministry of Science and Higher Education in Poland.

Author information

Corresponding author

Correspondence to Jakub Nowosad.

Cite this article

Nowosad, J., Stach, A., Kasprzyk, I. et al. Statistical techniques for modeling of Corylus, Alnus, and Betula pollen concentration in the air. Aerobiologia 34, 301–313 (2018). https://doi.org/10.1007/s10453-018-9514-x

Keywords

  • Allergenic pollen
  • Pollen concentration in the air
  • Betulaceae
  • Regression models
  • Predictive modeling
  • Machine learning