Abstract
Accurate estimation of evaporation rates is essential for the proper planning and efficient operation of water resources projects and agricultural activities. Evaporation is affected by many driving forces characterized by nonlinearity, non-stationarity, and stochasticity, and these characteristics hinder the development of rigorous predictive models. This study evaluates the predictive capability of coupling the additive regression (AR) model with four ensemble machine-learning algorithms, namely random subspace (RSS), M5 pruned (M5P), reduced error pruning tree (REPTree), and bagging, for estimating pan evaporation rates. Meteorological data comprising maximum temperature, minimum temperature, mean temperature, relative humidity, and wind speed from three agroclimatic stations in Iraq (Baghdad, Mosul, and Basrah) were used as predictors. Regression analysis together with sensitivity analysis was employed to identify the best input combinations for the evaluated methods. The AR-M5P model estimated evaporation with higher accuracy than the other models when wind speed, relative humidity, and the minimum and mean temperatures were combined as input parameters. The AR-M5P model provided the best performance indicators, i.e., MAE = 33.82, RMSE = 45.05, RAE = 24.75, RRSE = 28.50, and r = 0.972 for Baghdad, and MAE = 25.82, RMSE = 35.95, RAE = 23.75, RRSE = 29.64, and r = 0.956 for Mosul. The outcomes of this study indicate the superior performance of the hybridized methods in addressing such intricate hydrological relationships; these methods could therefore be employed for other environmental problems.
Introduction
Evaporation is a physical process in which water molecules escape from a surface once they have absorbed enough energy to overcome the vapor pressure (Malik et al. 2021). Evaporation constitutes the majority of the losses that may occur in any water system and can thereby exacerbate water scarcity. This is especially true in arid to semi-arid areas, where the molecules have enough heat energy to escape. Therefore, accurate estimation of evaporation losses plays a pivotal role in better water resources management, estimation of crop water demands, and irrigation scheduling (Fan et al. 2016; Gong et al. 2021; Kushwaha et al. 2021).
In arid and semi-arid regions, high rates of evaporation are typically witnessed in the summer season. Water losses into the atmosphere from reservoirs, river basins, and natural lakes may therefore be exacerbated, leading to declining water levels (Boers et al. 1986; Sayl et al. 2016; Khan et al. 2019). As such, accurate quantification of losses from water bodies is crucial for the proper planning and management of any water resources project (Abd-Elaty et al. 2022; Moazenzadeh et al. 2018; Vishwakarma et al. 2022). Recently, climate change has exacerbated the influence of evaporation on the surface water balance (Sartori 2000), and global warming has a negative influence on the relationship between evaporation and water management (Eames et al. 1997; Kushwaha et al. 2016, b).
There are two ways to estimate evaporation: directly, as in the pan evaporation (PE) method, or indirectly, as in the mass transfer, water and energy balance (Lundberg 1993), and Penman methods (Zhao et al. 2013). The Class A pan method is used globally because it is well adapted to estimating evaporation levels in regions with different climatic characteristics (Masoner et al. 2008). Accordingly, the evaporation measured in 60- or 20-cm-diameter pans can be converted into the equivalent Class A pan amount. Nonetheless, the cost of the Class A pan method is the main obstacle to its application in many developing countries (Ashrafzadeh et al. 2019; Wu et al. 2020). To be reliably predictive, evaporation loss models should account for the driving meteorological variables, such as relative humidity (RH), sunshine hours (Sh), wind speed (WS), rainfall (RF), and minimum (Tmin), maximum (Tmax), and mean (Tmean) temperatures. Many empirical models have therefore been developed for predicting evaporation rates from meteorological variables, such as the Thornthwaite, Priestley–Taylor, and Penman–Monteith equations. However, the stochasticity, nonlinearity, and non-stationarity of the meteorological variables used in building a predictive model call for rigorous and reliable intelligent models that can cope with the stochasticity inherent in the relationship between evaporation and the meteorological variables (Elbeltagi et al. 2022; Kisi et al. 2017a; Kushwaha et al. 2022a; Salih et al. 2020; Khan et al. 2018; Naganna et al. 2019).
Owing to their capacity to tackle the complexity and highly stochastic features of many environmental problems (Chia et al. 2020), machine learning (ML) methods have recently been identified as a leading approach for addressing various aspects of the association between predictors and predictands. Many successful applications of ML methods have been reported in various topics of hydrology and climatology, for instance, rainfall (Salih et al. 2020; Adnan et al. 2021), streamflow (Parisouj et al. 2020; Feng and Tian 2020), drought (Malik et al. 2020a; Parisouj et al. 2020), surface water quality (Rezaie-Balf et al. 2020; Chen et al. 2020), groundwater (Mosavi et al. 2021; Rahman et al. 2020), evapotranspiration (Granata 2019; Ferreira and da Cunha 2020; Granata et al. 2020; Granata and Di Nunno 2021), and many others (Al-Mukhtar 2019; Ghaemi et al. 2019; Keshtegar et al. 2019; Majhi et al. 2020; Yang et al. 2020). As a case in point, Majhi et al. (2020) investigated the applicability of a deep neural network architecture with long short-term memory (Deep-LSTM) cells for estimating daily pan evaporation with minimal input features. To that end, they proposed a number of input combinations to predict daily evaporation rates in three areas of Chhattisgarh state in India. The investigation suggested that the proposed Deep-LSTM model could successfully model daily evaporation losses with improved accuracy compared with a multilayer artificial neural network and empirical methods (Hargreaves and Blaney–Criddle). In another study, Zhu et al. (2020) explored the performance of an extreme learning machine (ELM) model hybridized with particle swarm optimization (PSO) for estimating daily ETo in the arid region of Northwest China.
The comparison of the obtained results with those counterparts from the original ELM, artificial neural networks (ANN) and random forests (RF) models along with six empirical models indicated a superior performance of ELM-PSO models for estimating ETo more accurately than others. A comparative study on the performance of ElasticNet linear regression, extreme gradient boosting, long short‑term memory in addition to two empirical techniques, i.e., Stephens‑Stewart and Thornthwaite, was carried out by Abed et al. (2021) in two weather stations in Malaysia. They found that the ML models outperformed the empirical models with the same input configurations. The bat algorithm (Bat) was coupled with gradient boosting in a study by Dong et al. (2021). Its performance was compared with random forest and original CatBoost (CB) for forecasting daily pan evaporation in arid and semi-arid regions of northwest China. They pointed out that the hybrid model outperformed the other models and presented comprehensive performance results (seasonally and spatially) compared to CatBoost and random forest. Emadi et al. (2021) evaluated the applications of wavelet-hybrids artificial neural networks (WANN), adaptive neuro-fuzzy inference system (WANFIS), and gene expression programming (WGEP) to estimate monthly evaporation in a study area in the Northwest and central part of Iran. They compared their results with those standalone models. It was revealed that the WGEP method has superiority in terms of performance and accuracy in comparison with the others and single models.
The above-mentioned literature demonstrates that ML methods generally outperform other methods and that their performance varies with the input factors and the climatic conditions. The hybrid technique, in which two or more models are combined and coupled (Chia et al. 2020), has recently drawn more attention in climate and hydrology studies because of its capacity to capture the various patterns in data series by combining the features of several techniques in one algorithm (Ghaemi et al. 2019). Yet the generalization of these models over various climatic zones is arguable (Al-Mukhtar 2019), because each climatic region is associated with its own characteristics of stochasticity and non-stationarity. Therefore, it is essential to investigate newly developed models and explore their applicability to specific climatic features. Accordingly, the main aim of this study was to explore the predictive capability of new hybrid methods, i.e., bagging and random subspace-based reduced-error pruning tree (REPTree) algorithms, for modeling pan evaporation rates. The use of these methods in water-related subjects has rarely been reported in the literature. This study therefore presents a novel framework for increasing the prediction accuracy of machine learning when applied to such complex physical relationships.
Materials and methods
Study area and data
The study area encompasses three stations in Iraq: Baghdad, Basrah, and Mosul. The latitude, longitude, and altitude of these stations are 33° 20′ 26″ N, 44° 24′ 03″ E, 41 m a.s.l.; 30° 31′ 58″ N, 47° 47′ 50″ E, 6 m a.s.l.; and 36° 20′ 06″ N, 43° 07′ 08″ E, 228 m a.s.l., respectively. The stations are situated in the middle, south, and north of Iraq, respectively (Fig. 1). The meteorological parameters used to predict the monthly pan evaporation, namely Tmax, Tmin, Tmean, WS, and RH, were collected for the study areas. The collected data span the period 1990–2013 for the Baghdad and Mosul stations and 1990–2012 for Basrah, on a monthly time scale. The data were subdivided into two datasets, i.e., 75% for training and 25% for testing. The detailed statistical characteristics of the climatic parameters used in the modeling configurations are listed in Table 1. The climatic conditions at the selected stations are represented by the following ranges (Table 1): Tmin (0.7–32.9 °C), Tmax (12.9–47.70 °C), RH (20–82%), WS (1.4–2.5 m/s), and PE (48.70–624.80 mm) at Baghdad; Tmin (4.7–34.60 °C), Tmax (14.60–48.90 °C), RH (17–80%), WS (1.7–7.7 m/s), and PE (41.40–645.90 mm) at Basrah; and Tmin (−2.2 to 27.40 °C), Tmax (8.30–46.40 °C), RH (19–88%), WS (0.2–3.7 m/s), and PE (21.50–464.10 mm) at Mosul.
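The 75%/25% partition can be sketched as follows. The text does not state whether the split was chronological or random; this minimal example assumes a chronological split, and the function name `chronological_split` is hypothetical.

```python
def chronological_split(records, train_fraction=0.75):
    """Split an ordered sequence into a leading training part and a trailing testing part."""
    cut = int(round(len(records) * train_fraction))
    return records[:cut], records[cut:]

# 1990-2013 at monthly resolution gives 24 * 12 = 288 records (Baghdad and Mosul).
months = [(year, month) for year in range(1990, 2014) for month in range(1, 13)]
train, test = chronological_split(months)
print(len(train), len(test))
```

Under this assumption the first 18 years would be used for training and the last 6 years for testing.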
Methods
Five ML methods were evaluated in this study for forecasting PE: additive regression (AR), additive regression-random subspace (AR-RSS), AR-Bagging, additive regression-reduced error pruning tree (AR-REPTree), and AR-M5 pruned (AR-M5P) models. The evaluated methods were applied to data from three different climatic stations in Iraq. The methodology applied in this study is summarized in Fig. 2. A brief description of the individual methods is given below.
Additive regression AR
AR is a nonparametric regression method suggested by Friedman and Stuetzle (1981). In contrast to ordinary regression, AR uses a one-dimensional smooth function of each predictor to describe the relationship between predictors and predictand. As such, it can overcome the curse of dimensionality inherent in a general p-dimensional smoother. The additive model takes the following form (Eq. (1)):
\[ y_{i} = \beta_{o} + \sum\nolimits_{j = 1}^{p} {f_{j} \left( {x_{ij} } \right)} + \varepsilon_{i} \]
where \(y_{i}\) is the response for observation i, \(f_{j} \left( {x_{ij} } \right)\) are the smooth functions fitted from the data for each of the p predictors, \(\beta_{o}\) is the intercept, and \(\varepsilon_{i}\) is the error term.
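An additive model of this kind is commonly fitted by backfitting, in which each smooth function is re-estimated in turn from the residuals left by all the others. The sketch below is illustrative only: it uses a cubic polynomial as a stand-in smoother and synthetic data, and the function name `backfit_additive` is hypothetical. It is not the implementation used in the study.

```python
import numpy as np

def backfit_additive(X, y, degree=3, n_iter=20):
    """Fit y ~ beta0 + sum_j f_j(x_j) by backfitting with polynomial smoothers."""
    n, p = X.shape
    beta0 = y.mean()                      # intercept: mean of the response
    f = np.zeros((n, p))                  # f[:, j] holds f_j evaluated at the data
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the intercept and every component except f_j.
            partial_resid = y - beta0 - f.sum(axis=1) + f[:, j]
            coef = np.polyfit(X[:, j], partial_resid, degree)
            fj = np.polyval(coef, X[:, j])
            f[:, j] = fj - fj.mean()      # center each component for identifiability
    return beta0, f

# Synthetic example with two predictors (assumed data, not station records).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = 2.0 + np.sin(X[:, 0]) + X[:, 1] ** 2
beta0, f = backfit_additive(X, y)
fitted = beta0 + f.sum(axis=1)
rmse = float(np.sqrt(np.mean((y - fitted) ** 2)))
print(round(rmse, 4))
```

Centering each component keeps the intercept and the smooth functions separately identifiable.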
Random subspace (RSS)
RSS is an ensemble machine learning method that is commonly used with decision trees. In contrast to other tree-based methods, it constructs decision trees systematically, with each tree generated independently by a parallel learning algorithm (Ho 1998). As an ensemble learning method, it reduces the correlation between learners by randomly sampling subsets of the training points and features instead of using the entire dataset. The learners therefore produce different models that can be reliably averaged. For classification, the outputs of the learners built on the random subspaces are combined by majority voting, so that the final decision for a sample x is given by Eq. (2):
\[ \beta \left( x \right) = \mathop {\arg \max }\limits_{{y \in \left\{ { - 1,1} \right\}}} \sum\limits_{b} {\delta_{{{\text{sgn}}\left( {C^{b} \left( x \right)} \right),y}} } \]
where \(\delta\) is the Kronecker symbol, y ∈ {− 1, 1} is a decision (class label) of the classifier, and \(C^{b} \left( x \right)\) is the classifier of each subspace b (Skurichina and Duin 2002).
One attractive feature of RSS is that it is well suited to high-dimensional problems in which the number of features is much larger than the number of training points (Arabameri et al. 2021).
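For a regression target such as PE, the random subspace idea can be sketched by training each base learner on a random subset of the feature columns and averaging the predictions. This example is a simplification under stated assumptions: ordinary least squares stands in for the tree base learners, only features (not training points) are subsampled, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def fit_random_subspace(X, y, n_learners=10, subspace_size=3, seed=0):
    """Train one least-squares learner per random feature subset."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ensemble = []
    for _ in range(n_learners):
        cols = rng.choice(p, size=subspace_size, replace=False)  # random subspace
        A = np.column_stack([np.ones(n), X[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        ensemble.append((cols, coef))
    return ensemble

def predict_random_subspace(ensemble, X):
    """Average the predictions of all subspace learners."""
    preds = [coef[0] + X[:, cols] @ coef[1:] for cols, coef in ensemble]
    return np.mean(preds, axis=0)

# Five synthetic columns standing in for Tmax, Tmin, Tmean, RH, and WS.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = X @ np.array([1.0, 0.5, 0.8, -0.6, 0.3]) + rng.normal(scale=0.1, size=120)
ensemble = fit_random_subspace(X, y)
pred = predict_random_subspace(ensemble, X)
print(pred.shape)
```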
M5 pruned (M5P)
The M5 model tree method (Quinlan 1992) is a tree-based model used for solving predictor-predictand regression problems with numerical values (Kisi et al. 2017b). Two main steps are adopted in building the M5 model: splitting the data into subsets at each node, and constructing a multivariate regression for each node (Malik et al. 2020a). In the first step, the data are split by computing the standard deviation reduction (SDR) at each node (Eq. (3)). The standard deviation of the class values reaching a node is treated as an error index at that node. Subsequently, the expected reduction of that error is calculated by testing each attribute at that node. The data subsets are then selected by maximizing the expected error reduction (Al-Mukhtar 2021a).
\[ SDR = sd\left( T \right) - \sum\limits_{i} {\frac{{\left| {T_{i} } \right|}}{\left| T \right|}sd\left( {T_{i} } \right)} \]
where T is the set of cases in the data that reach a node, \(sd\) is the standard deviation, and \(T_{i}\) is the subset of cases that have the ith outcome of the potential test.
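The standard deviation reduction can be illustrated with a small numeric sketch. It assumes the usual form, sd(T) minus the size-weighted sum of sd(T_i), and uses the population standard deviation; the function name `sdr` and the toy values are hypothetical.

```python
import numpy as np

def sdr(parent, subsets):
    """Standard deviation reduction of splitting `parent` into `subsets`."""
    parent = np.asarray(parent, dtype=float)
    weighted = sum(len(s) / len(parent) * np.std(s) for s in subsets)
    return float(np.std(parent) - weighted)

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
# A split that separates low from high values removes most of the spread.
good = sdr(values, [values[:3], values[3:]])
# A split that mixes low and high values removes very little.
bad = sdr(values, [values[::2], values[1::2]])
print(round(good, 3), round(bad, 3))
```

The tree-growing step would choose the candidate split with the larger SDR, here the first one.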
In the second step, a multivariate linear regression model is constructed for each node of the model tree. The M5 model compares the accuracy of the constructed linear model with that of the subtree at this node to ensure the same level of information. The optimal model is selected as whichever of the linear model and the model subtree has the lower estimated error. To minimize the standard error, the linear models are simplified by a greedy search that removes variables that contribute little to the model. Finally, pruning and smoothing are applied to improve the prediction accuracy (Arabameri et al. 2021). The non-leaf nodes are examined starting from the bottom of the tree, and the subtree at a node is pruned depending on the least estimated error. If the linear model is chosen, the subtree at this node is pruned to a leaf. The prediction accuracy is ultimately enhanced by the smoothing process, in which the predicted value at the leaf is adjusted by Eq. (4):
\[ PV\left( S \right) = \frac{{n_{i} \,PV\left( {S_{i} } \right) + k\,M\left( S \right)}}{{n_{i} + k}} \]
where \(PV\left( {S_{i} } \right)\) is the predicted value at \(S_{i}\), \(n_{i}\) is the number of training cases at \(S_{i}\), \(S_{i}\) is the branch of subtree \(S\) followed by the case, \(k\) is the smoothing factor, and \(M\left( S \right)\) is the value given by the model at \(S\).
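The smoothing step can be written as a one-line function. The sketch assumes the commonly cited M5 form, in which the value passed up from the branch is averaged with the node model's value using weights n_i and k. The function name `smooth_prediction`, the numbers, and the choice k = 15 are illustrative assumptions, not values reported in the study.

```python
def smooth_prediction(pv_child, n_child, model_value, k=15.0):
    """Blend the child's prediction with the node model's value."""
    return (n_child * pv_child + k * model_value) / (n_child + k)

# A leaf predicting 100 from 45 training cases, under a node model predicting 120.
smoothed = smooth_prediction(pv_child=100.0, n_child=45, model_value=120.0)
print(smoothed)  # (45 * 100 + 15 * 120) / 60 = 105.0
```

Leaves supported by few training cases are pulled more strongly toward the parent model, which reduces discontinuities between adjacent leaves.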
Reduced error pruning tree (REPTree)
REPTree is a pruning algorithm used in machine learning that is characterized by its simplicity and computational speed (Quinlan 1987). It has commonly been used as a baseline for comparison with other pruning methods (Mohamed et al. 2012; Ganatra and Bhensdadia 2012). The algorithm is based on the principle of information gain and reduction of the variance error (Chen et al. 2019a, b). The data are split according to the information gain at each node, and the subtrees are then pruned by reduced-error pruning. Starting from every non-leaf subtree, the algorithm uses the test data to examine the change in misclassification that arises when the subtree is replaced by the most common class at that node (Quinlan 1987). The subtree is replaced by a leaf when the new tree has an error equal to or lower than before. The process is repeated until any further replacement would increase the variance error on the test data.
Bagging
Bagging, or bootstrap aggregating (Breiman 1996), is one of the most common ensemble learning methods and is used to improve the accuracy of forecasting models (Li et al. 2020) by reducing the variance that arises from noisy data. In bagging, multiple learners are combined to obtain better predictive performance, so that a stronger learner is constructed from weak ones (Saha et al. 2016). In contrast to other ensemble learning methods, bootstrap replicates of the original dataset are used to generate the training sets, and the base learners are trained on these sub-datasets (Al-Mukhtar 2021b). For regression problems, bagging then averages the resulting models (Tyralis et al. 2019).
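The bootstrap-and-average procedure can be sketched as follows. As a simplification, ordinary least squares stands in for the base learner, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def fit_bagging(X, y, n_learners=25, seed=0):
    """Fit one least-squares model per bootstrap replicate of the training set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    models = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)          # sample n rows with replacement
        coef, *_ = np.linalg.lstsq(A[idx], y[idx], rcond=None)
        models.append(coef)
    return np.array(models)

def predict_bagging(models, X):
    """Average the predictions of all bootstrap models (regression case)."""
    A = np.column_stack([np.ones(len(X)), X])
    return (A @ models.T).mean(axis=1)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
models = fit_bagging(X, y)
pred = predict_bagging(models, X)
print(models.mean(axis=0).round(2))
```

With an unstable base learner such as a decision tree, the averaging step is what reduces the variance; a linear learner is used here only to keep the example short.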
Statistical performance indicators
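Five statistical performance indicators, i.e., the mean absolute error, root mean square error, relative absolute error, root relative squared error, and correlation coefficient, were applied to assess the five predictive models. The mathematical expression of each indicator is given below (Moriasi et al. 2007):
Mean absolute error (MAE)
\[ MAE = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {PE_{p}^{i} - PE_{O}^{i} } \right|} \]
Root mean square error (RMSE)
\[ RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {\left( {PE_{p}^{i} - PE_{O}^{i} } \right)^{2} } } \]
Relative absolute error (RAE)
\[ RAE = \frac{{\sum\nolimits_{i = 1}^{N} {\left| {PE_{p}^{i} - PE_{O}^{i} } \right|} }}{{\sum\nolimits_{i = 1}^{N} {\left| {PE_{O}^{i} - \overline{{PE_{O} }} } \right|} }} \times 100 \]
Root relative squared error (RRSE)
\[ RRSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{N} {\left( {PE_{p}^{i} - PE_{O}^{i} } \right)^{2} } }}{{\sum\nolimits_{i = 1}^{N} {\left( {PE_{O}^{i} - \overline{{PE_{O} }} } \right)^{2} } }}} \times 100 \]
The correlation coefficient (r)
\[ r = \frac{{\sum\nolimits_{i = 1}^{N} {\left( {PE_{O}^{i} - \overline{{PE_{O} }} } \right)\left( {PE_{p}^{i} - \overline{{PE_{p} }} } \right)} }}{{\sqrt {\sum\nolimits_{i = 1}^{N} {\left( {PE_{O}^{i} - \overline{{PE_{O} }} } \right)^{2} } \sum\nolimits_{i = 1}^{N} {\left( {PE_{p}^{i} - \overline{{PE_{p} }} } \right)^{2} } } }} \]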
where N is the total number of data points, \(PE_{O}^{i}\) is the observed evaporation value, \(PE_{p}^{i}\) is the predicted evaporation value, and \(\overline{{PE_{O} }} \) and \(\overline{{PE_{p} }}\) are the mean values of the observed and predicted evaporation values, respectively.
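The five indicators can be computed together as in the following sketch. It assumes the standard definitions and expresses RAE and RRSE as percentages, which matches the magnitude of the values reported later; the function name `evaluate` and the toy observations are hypothetical.

```python
import numpy as np

def evaluate(obs, pred):
    """Return MAE, RMSE, RAE (%), RRSE (%), and Pearson r for paired series."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    err = pred - obs
    dev = obs - obs.mean()                      # deviation from the observed mean
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "RAE": float(100.0 * np.sum(np.abs(err)) / np.sum(np.abs(dev))),
        "RRSE": float(100.0 * np.sqrt(np.sum(err ** 2) / np.sum(dev ** 2))),
        "r": float(np.corrcoef(obs, pred)[0, 1]),
    }

scores = evaluate([100, 200, 300, 400], [110, 190, 320, 380])
print({name: round(value, 3) for name, value in scores.items()})
```

RAE and RRSE compare the model with a naive predictor that always returns the observed mean, so values below 100 indicate an improvement over that baseline.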
Results
The forecast results of the data-driven models for the Baghdad and Mosul stations are presented in the sections below. The predictions are based on the validation datasets of the evaporation time series, which are often used to characterize agricultural meteorological events such as droughts and for irrigation system design. The applied models were set up using data from three stations in Iraq with different climatic characteristics. The parameters of the machine learning algorithms used for modeling PE in these three regions are listed in Table 2.
Input selection using best subset model
The selection of the optimal input parameters is a critical step in obtaining the best performance from the chosen models. To find the optimal input combination, several combinations of meteorological parameters were tested. Tables 3 and 4 provide the statistical analysis of the five combinations examined in this research. At the Baghdad and Mosul stations, the optimal input combination was chosen using seven statistical criteria, namely MSE, R2, adjusted R2, Mallows' Cp, Akaike's AIC, Schwarz's SBC, and Amemiya's PC. In the subset linear regression analysis, the combination with the smallest values of MSE, Mallows' Cp, Akaike's AIC, Schwarz's SBC, and Amemiya's PC and the highest values (closest to 1) of R2 and adjusted R2 was considered the best input combination.
The bold blue row in Table 3 represents the optimum input combination at the Baghdad station because it has the lowest values of MSE (1733.377), Mallows' Cp (4.229), Akaike's AIC (2151.826), Schwarz's SBC (2166.478), and Amemiya's PC (0.103) and the highest values of R2 (0.934) and adjusted R2 (0.934) of all input combinations. Similarly, the bold blue row in Table 4 represents the optimum input combination at the Mosul station, with the lowest values of MSE (979.668), Mallows' Cp (5.00), Akaike's AIC (1988.474), Schwarz's SBC (2006.789), and Amemiya's PC (0.061) and the highest values of R2 (0.940) and adjusted R2 (0.939) among all input combinations.
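The reported AIC and SBC values can be cross-checked from the reported MSE. The sketch below rests on several assumptions that are not stated in the text: the criteria follow the common least-squares forms AIC = n ln(SSE/n) + 2p and SBC = n ln(SSE/n) + p ln(n), MSE = SSE/(n − p), n = 288 monthly records (1990–2013), and p counts the intercept plus three predictors at Baghdad and four at Mosul. The function name `aic_sbc` is hypothetical.

```python
import math

def aic_sbc(mse, n, p):
    """AIC and SBC for a least-squares fit with p parameters (intercept included)."""
    sse = mse * (n - p)                 # assumes MSE = SSE / (n - p)
    base = n * math.log(sse / n)
    return base + 2 * p, base + p * math.log(n)

n = 24 * 12                             # assumed: 24 years of monthly records
baghdad_aic, baghdad_sbc = aic_sbc(1733.377, n, p=4)
mosul_aic, mosul_sbc = aic_sbc(979.668, n, p=5)
print(round(baghdad_aic, 3), round(baghdad_sbc, 3))
print(round(mosul_aic, 3), round(mosul_sbc, 3))
```

Under these assumptions the computed values agree closely with the figures quoted for Tables 3 and 4, which is consistent with a three-predictor model at Baghdad and a four-predictor model at Mosul.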
Sensitivity analysis
The input combinations strongly influence the performance of the models. Some inputs may contribute positively to the accuracy of the chosen model, while others may contribute negatively. The most relevant factors were selected by sensitivity analysis to capture the optimal performance of the monthly PE models at the two stations in Iraq. Tables 5 and 6 and Figs. 3 and 4 show the findings of the regression analysis. The regression analysis on all input parameters revealed that Tmax, Tmin, Tmean, RH, and WS, with absolute standard error coefficients of 0.000, 0.000, 0.038, 0.039, and 0.018, respectively, were the most influential input parameters for the estimation of evaporation at Baghdad station, whereas the corresponding absolute standard error coefficients at Mosul station were 0.000, 0.172, 0.213, 0.061, and 0.015. The standardized coefficients of the input variables for the sensitivity analysis are shown in Figs. 3 and 4 for the Baghdad and Mosul stations, respectively.
Modeling of pan evaporation
In the present study, five machine learning models (AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P) were applied for forecasting monthly evaporation, and the results of the hybrid models were compared with those of the classic AR model to assess the accuracy improvement of the new methods. For this purpose, the MAE, RMSE, RAE, RRSE, and r measures were considered.
Training and testing the selected models at Baghdad station
The monthly PE at Baghdad station was estimated using the AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P models, and the MAE, RMSE, RAE, RRSE, and r values for the training and testing periods are presented in Table 7. During the training period, the AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P models provided MAE = 30.34, 31.12, 31.33, 33.313, and 29.65; RMSE = 41.19, 41.50, 40.14, 44.35, and 39.58; RAE = 21.18, 21.73, 21.87, 23.25, and 20.70; RRSE = 25.27, 25.47, 24.63, 27.21, and 24.29; and r = 0.968, 0.967, 0.970, 0.962, and 0.970, respectively. During the testing period, the same models provided MAE = 35.51, 45.72, 37.74, 37.81, and 33.82; RMSE = 48.68, 59.18, 47.35, 50.13, and 45.05; RAE = 25.98, 33.46, 27.62, 27.67, and 24.75; RRSE = 30.80, 37.44, 29.96, 31.72, and 28.50; and r = 0.966, 0.949, 0.962, 0.958, and 0.972, respectively. Table 7 shows that the AR-M5P model attained the most accurate simulation during the testing stage. The AR-M5P model was therefore the best-performing model according to the statistical criteria (i.e., minimum MAE, RMSE, RAE, and RRSE values and maximum r value) in the testing stage, followed closely by the AR-Bagging model.
The temporal variation, along with the scatter plots (right side), of the simulated versus observed monthly evaporation for the AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P models during the testing stage is plotted in Fig. 5a–e. In the scatter plots, the regression line yielded a coefficient of determination (R2) of 0.934 for the AR model, 0.902 for the AR-RSS model, 0.925 for the AR-Bagging model, 0.918 for the AR-REPTree model, and 0.944 for the AR-M5P model. The regression line (RL) and the 1:1 line were close to each other for all models. The RL was above the 1:1 line for all five models, which means that at Baghdad station the models slightly overpredict the monthly PE values. Radar charts of the MAE and RMSE of the five models during testing at Baghdad station are plotted in Fig. 6. As can be seen from Fig. 6, the applied models were very close to each other. Nevertheless, AR-REPTree was the furthest from the observed point, which identifies it as the worst model. Conversely, AR-M5P was the closest model to the observed point based on the standard deviation, correlation, and RMSE (Fig. 7). This demonstrates the superiority of the AR-M5P model in comparison with the others.
Training and testing the selected models at Mosul station
The monthly PE at Mosul station was estimated using the AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P models, and the MAE, RMSE, RAE, RRSE, and r values for the training and testing periods are given in Table 8. During the training period, the AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P models provided MAE = 23.62, 21.97, 22.46, 26.66, and 22.75; RMSE = 33.96, 29.38, 29.97, 38.09, and 29.21; RAE = 20.42, 19.00, 19.42, 23.05, and 19.67; RRSE = 26.24, 22.71, 23.16, 29.43, and 22.57; and r = 0.965, 0.974, 0.972, 0.957, and 0.974, respectively. During the testing period, the same models provided MAE = 29.68, 26.93, 27.37, 27.01, and 25.82; RMSE = 42.34, 37.31, 37.94, 38.62, and 35.95; RAE = 27.30, 24.77, 25.17, 24.84, and 23.75; RRSE = 34.92, 30.77, 31.29, 31.85, and 29.64; and r = 0.945, 0.959, 0.959, 0.962, and 0.956, respectively. Table 8 shows that the AR-M5P model outperformed the other models during the testing period in terms of the error criteria (i.e., minimum MAE, RMSE, RAE, and RRSE values), although its r value (0.956) was marginally lower than that of AR-REPTree (0.962). It was followed by the AR-RSS model.
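The testing-period rankings can be verified directly from the values quoted in the text for Tables 7 and 8. The snippet below is only a bookkeeping aid; the dictionary layout and the helper name `best_model` are hypothetical.

```python
MODELS = ["AR", "AR-RSS", "AR-Bagging", "AR-REPTree", "AR-M5P"]

# Testing-period values as quoted in the text (Tables 7 and 8).
TESTING = {
    "Baghdad": {
        "MAE": [35.51, 45.72, 37.74, 37.81, 33.82],
        "RMSE": [48.68, 59.18, 47.35, 50.13, 45.05],
        "RAE": [25.98, 33.46, 27.62, 27.67, 24.75],
        "RRSE": [30.80, 37.44, 29.96, 31.72, 28.50],
        "r": [0.966, 0.949, 0.962, 0.958, 0.972],
    },
    "Mosul": {
        "MAE": [29.68, 26.93, 27.37, 27.01, 25.82],
        "RMSE": [42.34, 37.31, 37.94, 38.62, 35.95],
        "RAE": [27.30, 24.77, 25.17, 24.84, 23.75],
        "RRSE": [34.92, 30.77, 31.29, 31.85, 29.64],
        "r": [0.945, 0.959, 0.959, 0.962, 0.956],
    },
}

def best_model(values, higher_is_better):
    """Return the model with the best value of one indicator."""
    pick = max if higher_is_better else min
    return MODELS[values.index(pick(values))]

best = {
    station: {name: best_model(vals, higher_is_better=(name == "r"))
              for name, vals in metrics.items()}
    for station, metrics in TESTING.items()
}
for station, winners in best.items():
    print(station, winners)
```

AR-M5P has the lowest value of every error indicator at both stations; for r it ranks first at Baghdad, while AR-REPTree has the highest r at Mosul.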
The temporal variation, along with the scatter plots (right side), of the simulated versus observed monthly evaporation for the AR, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P models during the testing stage is plotted in Fig. 8a–e. The coefficient of determination (R2) was 0.894 for the AR model, 0.921 for the AR-RSS model, 0.921 for the AR-Bagging model, 0.926 for the AR-REPTree model, and 0.915 for the AR-M5P model. The RL and the 1:1 line were close to each other for all applied models. The RL was above the 1:1 line for all five models, which implies that at Mosul station the models slightly overpredict the monthly PE values. Radar charts of the MAE and RMSE of the five models during testing at Mosul station are plotted in Fig. 9. All models were very close to each other; however, AR was the furthest from the observed point, while AR-M5P was the closest. This suggests that AR was the worst model and AR-M5P the best, based on the standard deviation, correlation, and RMSE (Fig. 10).
Validation best candidate model at Basrah station
The best model was then validated by predicting monthly evaporation at Basrah station. Because the AR-M5P model was the best algorithm at both Baghdad and Mosul, it was used for validation at Basrah station. The values of the MAE, RMSE, RAE, RRSE, and r criteria for the AR-M5P model during the validation period are presented in Table 9. For Basrah station, the AR-M5P model provided MAE = 47.23, RMSE = 67.23, RAE = 31.19, RRSE = 39.30, and r = 0.942.
The temporal variation, along with the scatter plot (right side), of the simulated versus observed monthly evaporation for the AR-M5P model during the validation stage is plotted in Fig. 11. In the scatter plot, the coefficient of determination (R2) was 0.887. The fitted RL and the 1:1 line were close to each other. The RL was above the 1:1 line for the AR-M5P model, which means that at Basrah station the model slightly overpredicts the monthly PE values.
Discussion
According to the results of the subset regression analysis, the best input combination for the Baghdad station was Tmean, RH, and WS, and the best input combination for the Mosul station was Tmin, Tmean, RH, and WS, indicating that all of these variables have an effect on pan evaporation. According to the relevant literature, all of these factors have a physical impact on pan evaporation, which supports the outcome of the subset regression analysis. The AR-M5P model outperformed the other models at both stations. As a result, the AR-M5P model was employed for validation at the Basrah station. The use of these models in other contexts may only be feasible after they have been calibrated with fresh data. All of the heuristic models tended to overestimate the pan evaporation values at the Baghdad, Mosul, and Basrah stations, with the exception of one. One possible explanation is the disparity between the training, testing, and validation data ranges at these stations. As a result, extrapolating the results of the applied models becomes challenging.
The results of this study were compared with other recent works (Chen et al. 2019a, b; Kumar et al. 2021; Kushwaha et al. 2021; Lin et al. 2013; Malik et al. 2020b; Vishwakarma et al. 2022) conducted on different continents. Lin et al. (2013) investigated the performance of two ML techniques (i.e., SVM and a backpropagation network) for estimating daily evaporation values. They demonstrated the superiority of the support vector machine in estimating daily PE values and showed that it can be used as a promising alternative for evaporation prediction. In a similar effort, Malik et al. (2020a) investigated the capability of five ML methods [i.e., multi-model artificial neural network (MM-ANN), MARS, SVM, multi-gene genetic programming (MGGP), and M5Tree] to predict monthly PE in India. They reported that the MM-ANN and MGGP algorithms were superior in prediction performance to the MARS, SVM, and M5Tree methods, as indicated by the lowest RMSE. Kushwaha et al. (2021) evaluated four ML algorithms (i.e., SVM, RT, REPTree, and RSS) under diverse climate conditions in Northern India. They concluded that SVM outperformed the other algorithms, as it had high values of the correlation coefficient and Willmott index and low values of MAE and RMSE. Similarly, Chen et al. (2019a, b) evaluated the prediction of monthly PE by SVM at six stations located along the Yangtze River in China. They showed that SVM was better than the traditional methods for estimating PE. In line with the above literature, the findings of this study confirmed that the AR-M5P hybrid algorithm was more accurate than the other applied algorithms in predicting the pan evaporation rates at the selected stations.
Overall, our findings indicate that hybrid models have a stronger predictive value in real-world situations and may be employed more effectively in watersheds with little data. In addition to predicting pan evaporation, these types of models may be used to forecast a wide range of hydrological and water resources phenomena, including ETo, suspended and bed sediment loads, rainfall, and groundwater contamination. Especially in developing countries, where technical skills and understanding of the processes occurring in the watershed are lacking, these algorithms could be used in data-poor watersheds or for estimating phenomena that are time-consuming or expensive to measure, such as suspended load or bed load in rivers, or nitrate and heavy metals in groundwater. Lastly, despite the superior performance of the hybrid algorithms demonstrated in the present study, several drawbacks and limitations hinder a generalized conclusion. Such limitations include the uncertainties inherent in the input datasets, the established scenarios, and modeling methods with large search spaces and many model parameters, all of which ultimately influence the applicability of the proposed methods. Hence, further studies should be carried out to validate the methods in different areas under different agroclimatic conditions and with various scenarios.
Conclusions and outlook
In this study, five machine learning models were applied for forecasting monthly evaporation, and the results of the hybrid models were compared with the classic AR to assess the accuracy improvement of the new methods. The developed models encompass additive regression, AR-RSS, AR-Bagging, AR-REPTree, and AR-M5P. Data from three regions in Iraq with different climatic characteristics were employed to evaluate the models using several statistical metrics (MAE, RMSE, RAE, RRSE, and r). The best input combination was determined based on the regression subset. As such, the optimal input combination for the Baghdad station was Tmean, RH, and WS, and the best input combination for the Mosul station was Tmin, RH, and WS, indicating that all of these variables affect pan evaporation. It was concluded that the hybrid models have a stronger predictive capability in real-world situations and may be employed more effectively in watersheds with little data. Among the evaluated methods, the AR-M5P performed best, as it showed the lowest error indices. The statistical indicators, i.e., the MAE, RMSE, RAE, RRSE, and r, obtained from AR-M5P in Baghdad were 33.82, 45.05, 24.75, 28.50, and 0.972, respectively, while those indicators in Mosul were 25.82, 35.95, 23.75, 29.64, and 0.956, respectively. The superior performance of AR-M5P highlights the effectiveness of AI methods in tackling complicated relationships, and such methods could be used for further data analysis in water resources and hydrology.
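For readers who wish to reproduce the five indicators quoted above, the following is a minimal sketch of how they are commonly computed. It assumes the conventional definitions, with RAE and RRSE expressed as percentages relative to a naive predictor that always outputs the observed mean; the function name `evaluation_metrics` and the sample values are hypothetical and are not taken from the study.

```python
import math


def evaluation_metrics(observed, predicted):
    """Return MAE, RMSE, RAE (%), RRSE (%), and Pearson r for two sequences.

    Assumed conventional definitions; not confirmed as the exact formulas
    used by the authors. Constant sequences would cause a division by zero.
    """
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least two values")

    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n

    # Absolute and squared errors of the model.
    abs_err = [abs(p - o) for o, p in zip(observed, predicted)]
    sq_err = [(p - o) ** 2 for o, p in zip(observed, predicted)]

    # Errors of the naive baseline that always predicts the observed mean.
    base_abs = sum(abs(o - mean_obs) for o in observed)
    base_sq = sum((o - mean_obs) ** 2 for o in observed)

    # Terms for the Pearson correlation coefficient.
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)

    return {
        "MAE": sum(abs_err) / n,
        "RMSE": math.sqrt(sum(sq_err) / n),
        "RAE": 100.0 * sum(abs_err) / base_abs,
        "RRSE": 100.0 * math.sqrt(sum(sq_err) / base_sq),
        "r": cov / math.sqrt(base_sq * var_pred),
    }


# Made-up illustrative values (mm/month), not data from the study.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = evaluation_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions, MAE and RMSE carry the units of the target variable, whereas RAE and RRSE are dimensionless, which makes the latter two easier to compare between stations with different evaporation ranges.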
Abbreviations
- Tmax: Maximum temperature
- Tmin: Minimum temperature
- Tmean: Average temperature
- RH: Relative humidity
- WS: Wind speed
- RF: Rainfall
- PET: Potential evapotranspiration
- PE: Evaporation
- AR: Additive regression
- RSS: Random subspace
- M5P: M5 pruning tree
- RF: Random forest
- RT: Random tree
- REPTree: Reduced error pruning tree
- MAE: Mean absolute error
- RMSE: Root mean square error
- RAE: Relative absolute error
- RRSE: Root relative squared error
- R: Correlation coefficient
- mm: Millimeter
- SDR: Standard deviation reduction factor
- SE: Standard error
- ML: Machine learning
- Deep-LSTM: Long short-term memory
- PSO: Particle swarm optimization
- CB: CatBoost method
- WANN: Wavelet-hybrids artificial neural networks
- WANFIS: Adaptive neuro-fuzzy inference system
- WGEP: Gene expression programming
- ELM: Extreme learning machine
- OSELM: Online sequential-ELM
- ANNs: Artificial neural networks
- SVR: Support vector regression
- MLP: Multilayer perceptron
References
Abd-Elaty I, Kushwaha NL, Grismer ME, Elbeltagi A, Kuriqi A (2022) Cost-effective management measures for coastal aquifers affected by saltwater intrusion and climate change. Sci Total Environ 836:155656. https://doi.org/10.1016/j.scitotenv.2022.155656
Abed M, Imteaz MA, Ahmed AN, Huang YF (2021) Application of long short-term memory neural network technique for predicting monthly pan evaporation. Sci Rep 11:1–19. https://doi.org/10.1038/s41598-021-99999-y
Adnan RM, Petroselli A, Heddam S, Santos CAG, Kisi O (2021) Comparison of different methodologies for rainfall–runoff modeling: machine learning vs conceptual approach. Nat Hazards 105:2987–3011. https://doi.org/10.1007/s11069-020-04438-2
Al-Mukhtar M (2019) Random forest, support vector machine, and neural networks to modelling suspended sediment in Tigris. Environ Monit Assess 191:673. https://doi.org/10.1007/s10661-019-7821-5
Al-Mukhtar M (2021a) Modeling of pan evaporation based on the development of machine learning methods. Theor Appl Climatol 146(3):961–979
Al-Mukhtar M (2021b) Modeling the monthly pan evaporation rates using artificial intelligence methods: a case study in Iraq. Environ Earth Sci. https://doi.org/10.1007/s12665-020-09337-0
Arabameri A, Pal SC, Rezaie F, Nalivan OA, Chowdhuri I, Saha A, Lee S, Moayedi H (2021) Modeling groundwater potential using novel GIS-based machine-learning ensemble techniques. J Hydrol Reg Stud 36:100848. https://doi.org/10.1016/j.ejrh.2021.100848
Ashrafzadeh A, Ghorbani MA, Biazar SM, Yaseen ZM (2019) Evaporation process modelling over northern Iran: application of an integrative data-intelligence model with the krill herd optimization algorithm. Hydrol Sci J 64(15):1843–1856
Boers TM, De Graaf M, Feddes RA, Ben-Asher J (1986) A linear regression model combined with a soil water balance model to design micro-catchments for water harvesting in arid zones. Agric Water Manag. https://doi.org/10.1016/0378-3774(86)90038-7
Breiman L (1996) Bagging predictors. Mach Learn 24:123–140. https://doi.org/10.1007/BF00058655
Chen J-L, Yang H, Lv M-Q, Xiao Z-L, Wu SJ (2019a) Estimation of monthly pan evaporation using support vector machine in Three Gorges Reservoir area, China. Theor Appl Climatol 138:1095–1107. https://doi.org/10.1007/s00704-019-02871-3
Chen W, Hong H, Li S, Shahabi H, Wang Y, Wang X, Ahmad BB (2019b) Flood susceptibility modelling using novel hybrid approach of reduced-error pruning trees with bagging and random subspace ensembles. J Hydrol 575:864–873. https://doi.org/10.1016/j.jhydrol.2019.05.089
Chen K, Chen H, Zhou C, Huang Y, Qi X, Shen R, Liu F, Zuo M, Zou X, Wang J, Zhang Y, Chen D, Chen X, Deng Y, Ren H (2020) Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Res 171:115454. https://doi.org/10.1016/j.watres.2019.115454
Chia MY, Huang YF, Koo CH (2020) Support vector machine enhanced empirical reference evapotranspiration estimation with limited meteorological parameters. Comput Electron Agric. https://doi.org/10.1016/j.compag.2020.105577
Dong L, Zeng W, Wu L, Lei G, Chen H, Kumar Srivastava A, Gaiser T (2021) Estimating the pan evaporation in Northwest China by coupling CatBoost with bat algorithm. Water 13:1–17. https://doi.org/10.3390/w13030256
Eames IW, Marr NJ, Sabir H (1997) The evaporation coefficient of water: a review. Int J Heat Mass Transf. https://doi.org/10.1016/S0017-9310(96)00339-0
Elbeltagi A, Kushwaha NL, Srivastava A, Zoof AT (2022) Artificial intelligent-based water and soil management. In: Poonia RC, Singh V, Nayak SR (eds) Deep learning for sustainable agriculture, cognitive data science in sustainable computing. Academic Press, pp 129–142. https://doi.org/10.1016/B978-0-323-85214-2.00008-2
Emadi A, Zamanzad-Ghavidel S, Fazeli S, Zarei S, Rashid-Niaghi A (2021) Multivariate modeling of pan evaporation in monthly temporal resolution using a hybrid evolutionary data-driven method (case study: Urmia lake and Gavkhouni basins). Environ Monit Assess. https://doi.org/10.1007/s10661-021-09060-8
Fan J, Wu L, Zhang F, Xiang Y, Zheng J (2016) Climate change effects on reference crop evapotranspiration across different climatic zones of China during 1956–2015. J Hydrol 542:923–937. https://doi.org/10.1016/j.jhydrol.2016.09.060
Feng K, Tian J (2020) Forecasting reference evapotranspiration using data mining and limited climatic data. Eur J Remote Sens 00:1–9. https://doi.org/10.1080/22797254.2020.1801355
Ferreira LB, da Cunha FF (2020) Multi-step ahead forecasting of daily reference evapotranspiration using deep learning. Comput Electron Agric 178(May):105728. https://doi.org/10.1016/j.compag.2020.105728
Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Am Stat Assoc 76:817–823
Ganatra A, Bhensdadia CK (2012) Improved decision tree induction algorithm with feature selection, cross validation, model complexity and reduced error pruning. J Compt Sci Inf Technol 3:3427–3431
Ghaemi A, Rezaie-Balf M, Adamowski J, Kisi O, Quilty J (2019) On the applicability of maximum overlap discrete wavelet transform integrated with MARS and M5 model tree for monthly pan evaporation prediction. Agric for Meteorol 278:107647. https://doi.org/10.1016/j.agrformet.2019.107647
Gong D, Hao W, Gao L, Feng Y, Cui N (2021) Extreme learning machine for reference crop evapotranspiration estimation: Model optimization and spatiotemporal assessment across different climates in China. Comput Electron Agric 187:106294. https://doi.org/10.1016/j.compag.2021.106294
Granata F (2019) Evapotranspiration evaluation models based on machine learning algorithms—a comparative study. Agric Water Manag 217(March):303–315. https://doi.org/10.1016/j.agwat.2019.03.015
Granata F, Di Nunno F (2021) Forecasting evapotranspiration in different climates using ensembles of recurrent neural networks. Agric Water Manag 255:107040. https://doi.org/10.1016/j.agwat.2021.107040
Granata F, Gargano R, de Marinis G (2020) Artificial intelligence based approaches to evaluate actual evapotranspiration in wetlands. Sci Total Environ 703:135653. https://doi.org/10.1016/j.scitotenv.2019.135653
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844
Keshtegar B, Heddam S, Sebbar A, Zhu SP, Trung NT (2019) SVR-RSM: a hybrid heuristic method for modeling monthly pan evaporation. Environ Sci Pollut Res 26:35807–35826. https://doi.org/10.1007/s11356-019-06596-8
Khan N, Shahid S, Ismail T bin, Wang X-J (2018) Spatial distribution of unidirectional trends in temperature and temperature extremes in Pakistan. Theoret Appl Climatol 136:899–913. https://doi.org/10.1007/s00704-018-2520-7
Khan N, Shahid S, Juneng L, Ahmed K, Ismail T, Nawaz N (2019) Prediction of heat waves in Pakistan using quantile regression forests. Atmos Res 221:1–11. https://doi.org/10.1016/j.atmosres.2019.01.024
Kisi O, Mansouri I, Hu JW (2017a) A new method for evaporation modeling: dynamic evolving neural-fuzzy inference system. Adv Meteorol. https://doi.org/10.1155/2017/5356324
Kisi O, Shiri J, Demir V (2017b) Hydrological time series forecasting using three different heuristic regression techniques, 1st edn. Elsevier Inc., Handbook of neural computation. https://doi.org/10.1016/B978-0-12-811318-9.00003-X
Kumar M, Kumari A, Kumar D, Al-Ansari N, Ali R, Kumar R, Kumar A, Elbeltagi A, Kuriqi A (2021) The superiority of data-driven techniques for estimation of daily pan evaporation. Atmosphere 12:701. https://doi.org/10.3390/atmos12060701
Kushwaha NL, Bhardwaj A, Verma VK (2016) Hydrologic response of Takarla-Ballowal watershed in Shivalik foot-hills based on morphometric analysis using remote sensing and GIS. J Indian Water Resour Soc 36:17–25
Kushwaha NL, Rajput J, Elbeltagi A, Elnaggar AY, Sena DR, Vishwakarma DK, Mani I, Hussein EE (2021) Data intelligence model and meta-heuristic algorithms-based pan evaporation modelling in two different agro-climatic zones: a case study from Northern India. Atmosphere 12:1654. https://doi.org/10.3390/atmos12121654
Kushwaha NL, Rajput J, Sena DR, Elbeltagi A, Singh DK, Mani I (2022a) Evaluation of data-driven hybrid machine learning algorithms for modelling daily reference evapotranspiration. Atmos Ocean 62:1–22. https://doi.org/10.1080/07055900.2022.2087589
Kushwaha NL, Rajput J, Shirsath PB, Sena DR, Mani I (2022b) Seasonal climate forecasts (SCFs) based risk management strategies: a case study of rainfed rice cultivation in India. J Agrometeorol 24:10–17. https://doi.org/10.54386/jam.v24i1.775
Li Z, Chen T, Wu Q, Xia G, Chi D (2020) Application of penalized linear regression and ensemble methods for drought forecasting in Northeast China. Meteorol Atmos Phys 132:113–130. https://doi.org/10.1007/s00703-019-00675-8
Lin G-F, Lin H-Y, Wu M-C (2013) Development of a support-vector-machine-based model for daily pan evaporation estimation. Hydrol Process 27:3115–3127. https://doi.org/10.1002/hyp.9428
Lundberg A (1993) Evaporation of intercepted snow - Review of existing and new measurement methods. J Hydrol. https://doi.org/10.1016/0022-1694(93)90239-6
Majhi B, Naidu D, Mishra AP, Satapathy SC (2020) Improved prediction of daily pan evaporation using Deep-LSTM model. Neural Comput Appl 32:7823–7838. https://doi.org/10.1007/s00521-019-04127-7
Malik A, Kumar A, Kim S, Kashani MH, Karimi V, Sharafati A, Ghorbani MA, Al-Ansari N, Salih SQ, Yaseen ZM, Chau KW (2020a) Modeling monthly pan evaporation process over the Indian central Himalayas: application of multiple learning artificial intelligence model. Eng Appl Compt Fluid Mech 14:323–338. https://doi.org/10.1080/19942060.2020.1715845
Malik A, Tikhamarine Y, Al-Ansari N, Shahid S, Sekhon HS, Pal RK, Rai P, Pandey K, Singh P, Elbeltagi A, Sammen SS (2021) Daily pan-evaporation estimation in different agro-climatic zones using novel hybrid support vector regression optimized by Salp swarm algorithm in conjunction with gamma test. Eng Appl Compt Fluid Mech 15:1075–1094. https://doi.org/10.1080/19942060.2021.1942990
Masoner JR, Stannard DI, Christenson SC (2008) Differences in evaporation between a floating pan and Class A pan on land. J Am Water Resour Assoc. https://doi.org/10.1111/j.1752-1688.2008.00181.x
Moazenzadeh R, Mohammadi B, Shamshirband S, Chau K (2018) Coupling a firefly algorithm with support vector regression to predict evaporation in northern Iran. Eng Appl Compt Fluid Mech 12:584–597. https://doi.org/10.1080/19942060.2018.1482476
Mohamed WNHW, Salleh MNM, Omar AH (2012) A comparative study of reduced error pruning method in decision tree algorithms. Proceedings - 2012 IEEE international conference on control system, computing and engineering, ICCSCE pp. 392–397. https://doi.org/10.1109/ICCSCE.2012.6487177
Moriasi DN, Arnold JG, Van Liew MV, Binger RL, Harmel RD, Veith TL (2007) Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Am Soc Agric Biol Eng 50(3):885–900
Mosavi A, Sajedi Hosseini F, Choubin B, Taromideh F, Ghodsi M, Nazari B, Dineva AA (2021) Susceptibility mapping of groundwater salinity using machine learning models. Environ Sci Pollut Res 28:10804–10817. https://doi.org/10.1007/s11356-020-11319-5
Naganna S, Deka P, Ghorbani M, Biazar S, Al-Ansari N, Yaseen Z (2019) Dew point temperature estimation: application of artificial intelligence model integrated with nature-inspired optimization algorithms. Water. https://doi.org/10.3390/w11040742
Parisouj P, Mohebzadeh H, Lee T (2020) Employing machine learning algorithms for streamflow prediction: a case study of four river basins with different climatic zones in the United States. Water Resour Manage 34:4113–4131. https://doi.org/10.1007/s11269-020-02659-5
Quinlan JR (1987) Simplifying decision trees. Int J Man Mach Stud 27:221–234
Quinlan JR (1992) Learning with continuous classes. Aust Jt Conf Artif Intell 92:343–348
Rahman ATMS, Hosono T, Quilty JM, Das J, Basak A (2020) Multiscale groundwater level forecasting: coupling new machine learning approaches with wavelet transforms. Adv Water Resour. https://doi.org/10.1016/j.advwatres.2020.103595
Rezaie-Balf M, Attar NF, Mohammadzadeh A, Murti MA, Ahmed AN, Fai CM, Nabipour N, Alaghmand S, El-Shafie A (2020) Physicochemical parameters data assimilation for efficient improvement of water quality index prediction: comparative assessment of a noise suppression hybridization approach. J Clean Prod 271:122576
Saha M, Mitra P, Nanjundiah RS (2016) Autoencoder-based identification of predictors of Indian monsoon. Meteorol Atmos Phys 128:613–628. https://doi.org/10.1007/s00703-016-0431-7
Salih SQ, Sharafati A, Ebtehaj I, Sanikhani H, Siddique R, Deo RC, Bonakdari H, Shahid S, Yaseen ZM (2020) Integrative stochastic model standardization with genetic algorithm for rainfall pattern forecasting in tropical and semi-arid environments. Hydrol Sci J 65(7):1145–1157
Sartori E (2000) A critical review on equations employed for the calculation of the evaporation rate from free water surfaces. Sol Energy. https://doi.org/10.1016/S0038-092X(99)00054-7
Sayl KN, Muhammad NS, Yaseen ZM, El-Shafie A (2016) Estimation the physical variables of rainwater harvesting system using integrated GIS-based remote sensing approach. Water Resour Manage 30:3299–3313. https://doi.org/10.1007/s11269-016-1350-6
Skurichina M, Duin R (2002) Bagging, boosting and the random subspace method for linear classifier. Pattern Anal Appl 5:121–135. https://doi.org/10.4028/www.scientific.net/msf.440-441.77
Tyralis H, Papacharalampous G, Langousis A (2019) A brief review of random forests for water scientists and practitioners and their recent history in water resources. Water. https://doi.org/10.3390/w11050910
Vishwakarma DK, Pandey K, Kaur A, Kushwaha NL, Kumar R, Ali R, Elbeltagi A, Kuriqi A (2022) Methods to estimate evapotranspiration in humid and subtropical climate conditions. Agric Water Manag 261:107378. https://doi.org/10.1016/j.agwat.2021.107378
Wu L, Huang G, Fan J, Ma X, Zhou H, Zeng W (2020) Hybrid extreme learning machine with meta-heuristic algorithms for monthly pan evaporation prediction. Comput Electron Agric 168:105115. https://doi.org/10.1016/j.compag.2019.105115
Yang X, Zhou J, Fang W, Wang Y (2020) An ensemble flow forecast method based on autoregressive model and hydrological uncertainty processer. Water 12:1–15. https://doi.org/10.3390/w12113138
Zhao L, Xia J, Xu C-Y, Wang Z, Sobkowiak L, Long C (2013) Evapotranspiration estimation methods in hydrological models. J Geog Sci 23:359–369. https://doi.org/10.1007/s11442-013-1015-9
Zhu B, Feng Y, Gong D, Jiang S, Zhao L, Cui N (2020) Hybrid particle swarm optimization with extreme learning machine for daily reference evapotranspiration prediction from limited climatic data. Comput Electron Agric 173:105430. https://doi.org/10.1016/j.compag.2020.105430
Acknowledgements
The authors would like to express their gratitude to the anonymous reviewer of this manuscript for the valuable comments.
Funding
Open access funding provided by Lulea University of Technology. This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Elbeltagi, A., Al-Mukhtar, M., Kushwaha, N.L. et al. Forecasting monthly pan evaporation using hybrid additive regression and data-driven models in a semi-arid environment. Appl Water Sci 13, 42 (2023). https://doi.org/10.1007/s13201-022-01846-6
DOI: https://doi.org/10.1007/s13201-022-01846-6