
Improving flood forecasting through feature selection by a genetic algorithm – experiments based on real data from an Amazon rainforest river


This paper addresses feature selection as a means of improving a flood forecasting model. The approach is evaluated in a case study that uses 18 time series covering thirty-five years of hydrological data to forecast the level of the Xingu River, in the Amazon rainforest in Brazil. We employ a Genetic Algorithm for feature selection and explore several settings of its genetic parameters to improve prediction accuracy. The features selected by the Genetic Algorithm are used as input to a Linear Regression model that performs the forecasting. A statistical analysis verifies that the final model predicts the river level with high accuracy, reaching a coefficient of determination of 0.988. Hence, the proposed Genetic Algorithm proved successful in selecting the most relevant features.
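The pipeline described above (a Genetic Algorithm choosing which input series feed a Linear Regression model) can be illustrated with a small, dependency-free sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the synthetic data, the binary-mask encoding, tournament selection, one-point crossover, bit-flip mutation, elitism, the parameter values, and the fitness (validation \(R^{2}\) minus a small penalty per selected feature). The names `fit_linear`, `fitness`, and `genetic_selection` are hypothetical.

```python
import random

def fit_linear(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    X = [[1.0] + list(r) for r in rows]            # prepend intercept column
    k = len(X[0])
    # Augmented system [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def fitness(mask, X_tr, y_tr, X_va, y_va, penalty=0.001):
    """Assumed fitness: validation R^2 minus a small cost per selected feature."""
    cols = [j for j, gene in enumerate(mask) if gene]

    def pick(X):
        return [[row[j] for j in cols] for row in X]

    beta = fit_linear(pick(X_tr), y_tr)
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], row))
            for row in pick(X_va)]
    return r2(y_va, pred) - penalty * len(cols)

def genetic_selection(X_tr, y_tr, X_va, y_va, pop_size=20, generations=30,
                      p_cross=0.8, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    n = len(X_tr[0])
    cache = {}                                     # fitness memo per mask

    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, X_tr, y_tr, X_va, y_va)
        return cache[key]

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    history = []                                   # best fitness per generation
    for _ in range(generations):
        elite = max(pop, key=score)
        history.append(score(elite))
        nxt = [elite[:]]                           # elitism: keep the best mask
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 2), key=score)     # binary tournament
            b = max(rng.sample(pop, 2), key=score)
            child = a[:]
            if rng.random() < p_cross:                 # one-point crossover
                cut = rng.randint(1, n - 1)
                child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=score), history

# Synthetic data (not from the paper): only features 0 and 2 drive the target.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(6)] for _ in range(200)]
y = [3 * r[0] - 2 * r[2] + data_rng.gauss(0, 0.1) for r in X]
# Chronological-style split: first 140 rows train, last 60 validate.
best, history = genetic_selection(X[:140], y[:140], X[140:], y[140:])
print("selected mask:", best, "fitness:", round(history[-1], 3))
```

Because the best mask is copied unchanged into each new generation, the recorded best fitness never decreases. The per-feature penalty is one simple way to discourage irrelevant inputs; other fitness definitions would work equally well in this sketch.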




  1. Equation for the Coefficient of Determination:

      \(R^{2} = 1 - \frac{\sum_{i=1}^{n}(y_{true} - y_{pred})^{2}}{\sum_{i=1}^{n}(y_{true} - \bar{y})^{2}}\), where \(y_{true}\) is the observed data, \(y_{pred}\) is the prediction, \(\bar{y}\) is the mean of the observed values, and \(n\) is the number of observations.

  2. Equation for the Root Mean Square Error:

      \(RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{true} - y_{pred})^{2}}\)

  3. Equation for the Mean Absolute Error:

      \(MAE = \frac{1}{n} \sum_{i=1}^{n} |y_{true} - y_{pred}|\)
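The three metrics above can be computed directly from paired observed and predicted values. The following is a minimal sketch using only the Python standard library; the function names and the sample river levels are illustrative assumptions, not data from the paper.

```python
import math

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# Made-up values for illustration only.
observed = [3.0, 4.0, 5.0, 6.0]
predicted = [2.5, 4.5, 5.0, 6.0]

print("R2  :", r2_score(observed, predicted))
print("RMSE:", rmse(observed, predicted))
print("MAE :", mae(observed, predicted))
```

RMSE squares each error before averaging, so it weights large misses more heavily than MAE, which is why the two are often reported together.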




The authors would like to thank the following colleagues for their help in revising the manuscript and for ideas that improved its organization: Bruno S. Faiçal, Leandro Y. Mano, Vinícius Gonçalves and Pedro H. Gomes. The authors would also like to thank Márcio Nirlando Gomes Lopes for his help in developing Figure 3. Dr. J. Ueyama would like to acknowledge FAPESP, process 2018/17335-9.

Author information



Corresponding author

Correspondence to Alen Costa Vieira.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Communicated by: H. Babaie


About this article


Cite this article

Vieira, A.C., Garcia, G., Pabón, R.E.C. et al. Improving flood forecasting through feature selection by a genetic algorithm – experiments based on real data from an Amazon rainforest river. Earth Sci Inform 14, 37–50 (2021).



  • River level forecasting
  • Genetic algorithm
  • Feature selection
  • Linear regression