Cyanotoxin level prediction in a reservoir using gradient boosted regression trees: a case study

Abstract

Cyanotoxins are poisonous compounds produced by cyanobacteria, and they pose a health threat in waters that could be used for drinking or recreational purposes. Predicting their presence is therefore necessary to avoid risks. This paper presents a nonparametric machine learning approach using a gradient boosted regression tree (GBRT) model to predict cyanotoxin contents from cyanobacterial concentrations determined experimentally in a reservoir located in the north of Spain. GBRT models can yield good predictions in highly nonlinear problems such as the one treated here, where the studied variable combines low cyanotoxin concentrations with high concentration peaks. Two types of results have been obtained. First, the model allows the input variables to be ranked according to their importance in the model. Second, the high performance and the simplicity of the model make the gradient boosted tree method attractive compared to conventional forecasting techniques.
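To make the two ideas in the abstract concrete (boosting regression trees on a target with mostly low values and a few high peaks, then ranking the inputs by importance), the following is a minimal Python sketch. It is not the authors' implementation: it assumes squared-error loss, one-split trees (stumps), a fixed learning rate, and purely synthetic data. The function names (fit_stump, fit_gbrt, predict) and the three unnamed predictors are hypothetical and chosen only for illustration.

```python
import random

def fit_stump(X, r):
    """Find the single split that most reduces the squared error of residuals r."""
    n, p = len(X), len(X[0])
    total = sum(r)
    best = None  # (gain, feature, threshold, left_mean, right_mean)
    for j in range(p):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += r[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # no valid threshold between equal values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors produced by this split.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best

def fit_gbrt(X, y, n_trees=100, learning_rate=0.1):
    """Gradient boosting with squared loss: each stump fits the current residuals."""
    f0 = sum(y) / len(y)              # initial constant prediction
    pred = [f0] * len(y)
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        gain, j, thr, left, right = stump
        importance[j] += gain         # credit the error reduction to feature j
        lv, rv = learning_rate * left, learning_rate * right
        stumps.append((j, thr, lv, rv))
        pred = [pi + (lv if x[j] <= thr else rv) for pi, x in zip(pred, X)]
    norm = sum(importance) or 1.0
    return f0, stumps, [v / norm for v in importance]

def predict(model, x):
    f0, stumps, _ = model
    return f0 + sum(lv if x[j] <= thr else rv for j, thr, lv, rv in stumps)

# Synthetic data (an assumption, not the reservoir dataset): low baseline values
# with high peaks driven by predictor 0, a weak effect of predictor 1, and a
# pure-noise predictor 2.
rng = random.Random(0)
X = [[rng.random() for _ in range(3)] for _ in range(200)]
y = [0.5 + 20.0 * (x[0] > 0.8) + 0.5 * x[1] + rng.gauss(0.0, 0.3) for x in X]

model = fit_gbrt(X, y)
importance = model[2]
mean_y = sum(y) / len(y)
var_y = sum((v - mean_y) ** 2 for v in y) / len(y)
mse = sum((predict(model, x) - v) ** 2 for x, v in zip(X, y)) / len(y)
print("relative importance per predictor:", importance)
print("training MSE:", mse, "target variance:", var_y)
```

In this sketch the importance of a predictor is the share of the total squared-error reduction achieved by splits on it, which is one common way to rank inputs in boosted tree models. The error reported here is computed on the training data only; a real study would assess performance on held-out data, for example by cross-validation.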

Acknowledgments

The authors wish to acknowledge the Cantabrian Basin Authority (Ministry of Environment, Rural and Marine Affairs of Spain) for providing the dataset used in this research.

Author information

Corresponding author

Correspondence to Paulino José García Nieto.

Additional information

Responsible editor: Vitor Manuel Oliveira Vasconcelos

Supplementary materials: Appendix A

The dataset used in this paper can be downloaded from https://www.dropbox.com/s/5339rpzuvt0fcdd/Trasona_reservoir_dataset_ei.xls?dl=0.

About this article

Cite this article

García Nieto, P.J., García-Gonzalo, E., Sánchez Lasheras, F. et al. Cyanotoxin level prediction in a reservoir using gradient boosted regression trees: a case study. Environ Sci Pollut Res 25, 22658–22671 (2018). https://doi.org/10.1007/s11356-018-2219-4

  • DOI: https://doi.org/10.1007/s11356-018-2219-4

Keywords

  • Statistical machine learning techniques
  • Regression trees
  • Gradient boosting
  • Cyanotoxins
  • Cyanobacteria
  • Harmful algal blooms (HABs)