Abstract
Reference evapotranspiration (ET0) estimates are commonly used in hydrologic planning for water resources and agricultural applications. Over the last two decades, machine learning (ML) techniques have enabled scientists to develop powerful tools to study ET0 patterns in the ecosystem. This study investigated the feasibility and effectiveness of three ML techniques, namely the k-nearest neighbor algorithm, multigene genetic programming, and support vector regression (SVR), to estimate daily ET0 in Türkiye. In addition, different interpolation techniques, including ordinary kriging (OK), co-kriging, inverse distance weighted, and radial basis function, were compared to develop the most appropriate ET0 maps for Türkiye. All developed models were evaluated using the coefficient of determination (R2), root mean square error (RMSE), and mean absolute error (MAE) as performance indices. Taylor, violin, and scatter plots were also generated. Among the applied ML models, the SVR model provided the best results in determining ET0, with R2 = 0.961, RMSE = 0.327 mm, and MAE = 0.232 mm. The input variables of this SVR model were solar radiation, temperature, and relative humidity. The maps of the spatial distribution of ET0 were produced with the OK interpolation method, which provided the best estimates.
Introduction
Accurate estimation of crop water consumption (evapotranspiration, ET) and reference evapotranspiration (ET0) is critical for water resource management, especially in arid and semiarid regions (Banda et al. 2018; Tunca et al. 2022). ET0 is a complicated hydrologic cycle component crucial to agricultural water management. Knowledge of ET0 makes it possible to reduce water consumption, increase irrigation efficiency, and schedule irrigation properly (Sattari et al. 2021). ET0 can be determined by conducting laboratory experiments, or it can be estimated using mathematical models. Lysimeters, which provide accurate measurements of ET0, are typically used to develop and test other indirect ET0 measurement methods (López-Urrea et al. 2006). However, it is not always possible to measure ET0 with lysimeters because this method is costly, labor-intensive, and requires careful measurements (Allen et al. 1998). Therefore, estimating ET0 with mathematical models driven by meteorological data collected at weather stations is preferable (Ferreira et al. 2019).
In indirect methods, ET0 is calculated using equations calibrated with direct methods; ET0 is then converted to standard plant ET using the crop coefficient for different growing seasons. Generally, the equations used to calculate ET0 values are complex, nonlinear, contain random factors, and rely on several assumptions. According to the literature, there are about 20 methods for estimating ET0 based on meteorological variables. Of these methods, Penman (Penman 1948), Thornthwaite (Thornthwaite 1948), Blaney and Criddle (Blaney and Criddle 1950), Priestley and Taylor (Priestley and Taylor 1972), Hargreaves and Samani (Hargreaves and Samani 1985), and FAO Penman–Monteith (FAO56PM) (Allen et al. 1998) have been widely and successfully used to estimate ET0. Solving the equations of these mathematical methods requires rigorous optimization procedures, accurate spatiotemporal data, and knowledge of initial conditions (Prasad et al. 2017).
On the other hand, with recent developments in computer technology, machine learning (ML) techniques have gained the attention of researchers worldwide. ML techniques represent input–output relations without requiring an understanding of the physical process, making them effective tools for modeling nonlinear systems. They have provided an alternative for estimating ET0 based on a minimum number of weather input parameters (Krishnashetty et al. 2021). Recently, researchers have investigated the potential of various ML techniques, such as artificial neural networks (ANN), support vector machine (SVM), and genetic programming (GP), to determine ET0 (Abrishami et al. 2019; Mohammadrezapour et al. 2019; Chia et al. 2020; Ayaz et al. 2021; Jang et al. 2021; Achite et al. 2022; Dimitriadou and Nikolakopoulos 2022; Makwana et al. 2022; Tejada et al. 2022). It is worth noting that the inputs to these models lend themselves well to sensitivity analyses and to obtaining optimal structures, and that the models do not rely on mathematical relationships even for complex phenomena. Although these models have been trained and tested in the literature for many meteorological stations and climatic conditions, it is impractical to train them for every station in a large region. Therefore, regional models need to be created (Citakoglu et al. 2014).
Accurate field-level assessments of ET0 can be critical for understanding the spatiotemporal patterns of water demand and improving water resource management and agricultural practices, such as decisions about the timing and quantity of irrigation (Kustas et al. 2018). The spatial variation of ET0 is important for climate analyses and assessment of regional climate change scenarios such as droughts and floods (Güler 2014). In recent years, the geographic information system (GIS) has been extensively used to evaluate the spatial and temporal variation of ET0 (Mardikis et al. 2005; Vicente‐Serrano et al. 2007; Sabziparvar and Tabari 2010). Using GIS tools, spatial interpolation techniques can be applied to create ET0 maps by expanding ET0 values from points to areal/regional estimates. Interpolation techniques generally fall into two main categories: deterministic and geostatistical (stochastic). Deterministic interpolation generates surfaces from sample points using mathematical functions based either on the similarity of the points (inverse distance weighting, IDW) or on the degree of smoothing (radial basis functions, RBF). In contrast, stochastic methods (kriging) use mathematical and statistical functions to evaluate the uncertainty of estimates (Webster and Oliver 2007). These interpolation methods are easily computerized and can be combined with contour mapping of regional variables. Over the past two decades, a number of spatial interpolation methods for ET0 mapping have been tested with varying degrees of complexity. For instance, da Silva Júnior et al. (2019) evaluated methods for spatial interpolation of ET0 data in terms of precision and performance. Gentilucci et al. (2021) used geostatistical methods to calculate ET0 and calibrate the Hargreaves equation over the last 10 years in Italy. Dalezios et al. (2002) studied the optimal interpolation of ET0 over large regions of complex terrain in Greece. Hodam et al. (2017) investigated the potential of IDW and kriging to develop an appropriate method for spatial interpolation of point ET0.
The studies above have contributed significantly to the knowledge base on using ML and interpolation techniques in estimating ET0. However, the combined application of ML and different interpolation techniques to ET0 estimation is still limited, and the knowledge on this topic remains incomplete and fragmented. To fill this gap, the present study investigates the performance of ML and interpolation techniques and determines the best model for accurately predicting daily ET0 values. The main contributions of this study are as follows:
(i) The potential of the k-nearest neighbor algorithm (KNN), support vector regression (SVR), and multigene genetic programming (MGGP) for estimating daily ET0 is investigated. (ii) The accuracy, applicability, and reliability of the models are examined in comparison to the ASCE Penman–Monteith (ASCE-PM) method. (iii) Various interpolation techniques, including ordinary kriging (OK), co-kriging (COK), inverse distance weighted (IDW), and radial basis function (RBF), are compared to find the most suitable spatial maps of ET0.
Materials and methods
Research site and data sets
Türkiye is located in the northern hemisphere and is bounded by latitudes of 36–42° N and longitudes of 26–45° E. The country is closer to the equator than to the poles and lies in a temperate climate zone. It is approximately 1600 km long and 800 km wide, with a total geographic area of 783,562 km2. The total water area is 9820 km2. The lowest altitude is 0 m at the Mediterranean Sea, and the highest altitude is 5137 m at Mount Ararat. Daily meteorological measurement data for a 52-year period (between 1962 and 2014) were collected from 213 meteorological stations (Fig. 1). The ASCE Penman–Monteith method was used to calculate daily ET0 using meteorological data such as temperature (T), relative humidity (RH), wind speed (WS), and solar radiation (RS) from these meteorological stations. Information on the daily ET0 values and the locations of the stations representing different geographic regions and climatic conditions is presented in Table 1.
Method of ET0 estimation
To compute ET0 on a daily scale, the ASCE-PM equation (Allen et al. 2006) (Eq. 1) was employed.
where ET0 is the reference evapotranspiration (mm d−1), Rn is net solar radiation reaching the soil surface (MJ m−2 d−1), G is soil heat flux density (MJ m−2 d−1), T is mean daily air temperature at the height of 2 m (°C), u2 is the wind speed at the height of 2 m (m s−1), es is saturation vapor pressure (kPa), ea is actual vapor pressure (kPa), Δ is the slope of the vapor pressure curve (kPa °C−1), and γ is the psychrometric constant (kPa °C−1).
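Eq. (1) itself is not reproduced in the text. For orientation, the daily Penman–Monteith form for a short (grass) reference surface is commonly written as below, using the symbols defined above. The constants 0.408, 900, and 0.34 belong to that standard daily short-reference formulation and are assumed here, not taken from this paper.

```latex
ET_0 = \frac{0.408\,\Delta\,(R_n - G) + \gamma\,\dfrac{900}{T + 273}\,u_2\,(e_s - e_a)}
            {\Delta + \gamma\,(1 + 0.34\,u_2)}
```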
k-nearest neighbor algorithm (KNN)
The KNN is, in many cases, an effective nonparametric ML approach for classification and regression (Guo et al. 2003; Cheng et al. 2022). Because the KNN is a distance-based model, its basic logic is to calculate the distance between a test sample and all training points and to derive the output for the test sample from its closest neighbors. In the method, the letter “k” stands for the number of neighbors, and the success of the method depends heavily on this value. A simple way to determine the “k” value is to run the algorithm repeatedly with different values, select the value with the best performance on the training data, and then classify the unknowns (Guo et al. 2003). In classification, a sample is assigned to the class that occurs most frequently among its “k” nearest neighbors, as measured by a distance function. Although there are many distance functions, such as the Euclidean, Mahalanobis, and Minkowski distances and their variants, the Euclidean distance function has been used most frequently in the literature, and therefore this function was preferred in the present study (Mittal et al. 2019; Prasad et al. 2019; Xu et al. 2020; Bayram and Çıtakoğlu 2023). The Euclidean distance function can be expressed as shown in Eq. (2).
where xi and xj are the ith and jth variables, and n is the data number.
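Eq. (2) is not reproduced in the text. The usual Euclidean distance between two samples can be written as below. The explicit component index m is an assumed notation, and n is read here as the number of components being summed.

```latex
d(x_i, x_j) = \sqrt{\sum_{m=1}^{n} \left( x_{i,m} - x_{j,m} \right)^{2}}
```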
A major disadvantage of the KNN is that an exhaustive search of the records is required for each sample to identify the most relevant neighbors (Sahoo and Ghose 2022). On the other hand, the KNN performs classification in a simple, fast, easy-to-understand, and competitive way (Imandoust and Bolandraftar 2013).
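The distance-based logic described above can be sketched for the regression case as follows. This is a minimal illustration, not the implementation used in the study: it assumes the prediction is the unweighted mean of the k nearest training targets, and the function names and sample values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Predict a continuous target as the mean of the k nearest training targets."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training samples")
    # Exhaustive search: the distance from the query to every training sample.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = [target for _, target in ranked[:k]]
    return sum(nearest) / k


# Hypothetical, already normalized inputs (RS, T, RH) and ET0 targets in mm/day.
train_X = [(0.1, 0.2, 0.9), (0.2, 0.3, 0.8), (0.8, 0.7, 0.3), (0.9, 0.9, 0.2)]
train_y = [1.0, 1.4, 4.6, 5.2]
estimate = knn_predict(train_X, train_y, query=(0.85, 0.8, 0.25), k=2)
```

The full sort makes the cost of the exhaustive search explicit; tree-based neighbor searches reduce that cost on large data sets.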
Support vector regression (SVR)
SVR is widely used in linear and nonlinear regression for prediction and curve fitting. SVR is based on the elements of the SVM, in which the support vectors are the data points lying closest to the hyperplane generated in an n-dimensional feature space. SVR uses a part of the given data set to create a function estimator (Eq. 3).
where w ∈ Rn is a weighted feature vector, and x is the input vector in Rn. ϕ represents the mapping, and b is the intercept.
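Eq. (3) is not reproduced in the text. With the symbols defined above, the SVR function estimator is conventionally written as follows; the inner-product notation is an assumption about the original layout.

```latex
f(x) = \langle w, \phi(x) \rangle + b
```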
This study uses ε-SVR because the ET0 prediction model does not limit the number of support vectors (Ceperic et al. 2013). The optimization problem for the ε-insensitive problem is formulated using Eq. (4).
in which ½║ω║2 is the regularization term, C is the empirical error penalty factor, ξi and ξi* are the slack variables, and ε is the width of the insensitive zone of the loss function.
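Eq. (4) is not reproduced in the text. The standard primal form of the ε-insensitive problem is sketched below. Here ω denotes the same weight vector written as w in Eq. (3); the targets y_i and the number of training samples N are assumed symbols that the text does not define.

```latex
\begin{aligned}
\min_{\omega,\, b,\, \xi,\, \xi^{*}} \quad
  & \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^{*} \right) \\
\text{subject to} \quad
  & y_i - \langle \omega, \phi(x_i) \rangle - b \le \varepsilon + \xi_i, \\
  & \langle \omega, \phi(x_i) \rangle + b - y_i \le \varepsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \quad i = 1, \dots, N
\end{aligned}
```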
To solve this optimization problem, a Lagrangian dual function can be constructed. The performance of SVR depends heavily on the kernel functions since SVR is a kernel-based algorithm. Although there are several kernel functions, the RBF is one of the most recommended kernels in the literature (Hosseinzadeh et al. 2021). Its formulation is as defined in Eq. (5).
where ║x − y║2 is identified as the squared Euclidean distance between the two feature vectors, and σ is a free parameter.
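Eq. (5) is not reproduced in the text. A minimal sketch of the Gaussian RBF kernel follows, assuming the common normalization K(x, y) = exp(−║x − y║2 / (2σ2)). An equivalent parameterization uses a kernel coefficient equal to 1/(2σ2). The function name and sample vectors are hypothetical.

```python
import math


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Identical vectors give the maximum similarity of 1; distant vectors decay toward 0.
same = rbf_kernel((0.2, 0.5, 0.7), (0.2, 0.5, 0.7), sigma=1.0)
far = rbf_kernel((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), sigma=1.0)
```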
Despite its high accuracy and excellent generalizability, the main shortcoming of the SVR is that the kernel function is used only once to map the sample data to the high-dimensional features (Zhong et al. 2019).
Multigene genetic programming (MGGP)
GP is a supervised ML technique that searches a program space rather than a data space. Symbolic regression is performed over GP to develop trees. GP solves complex problems by constructing and modifying function trees (Gandomi et al. 2010). This technique has the advantage that one can create prediction equations without assuming existing relations. MGGP is a robust variant of the GP model designed to overcome shortcomings. MGGP produces mathematical “multigene” models of predictor response data, i.e., linear combinations of low-order nonlinear input variable transformations. Unlike MGGP, the traditional GP evaluates a single tree expression.
In the MGGP model, the input–output relation is determined by a random selection of functions and variables (Niazkar and Niazkar 2020). This model uses one or more gene trees and calibrates the coefficients using least-squares regression analysis (LSRA), a statistical method. Based on the MGGP, evolutionary processes are referred to as high-level transitions and mutations, as opposed to the processes described in the standard GP, which are referred to as low-level processes (Bayram and Çıtakoğlu 2023). In MGGP, a set of equations is generated by linearly combining nonlinear terms without a predefined functional structure. The method also provides a formal and explicit mathematical representation of the problem in the form of a tree structure or equations (Zhang et al. 2021). This study used the open-source code of MGGP, which was obtained from the literature (Searson et al. 2010). For the MGGP model, basic mathematical functions, such as +, −, ×, ÷, √, x2, exp, and ln, were used to obtain the optimum models. In the MGGP method, two important control parameters (multigene and tree-building parameters) are set by the user. The accuracy of the model can be increased by increasing the values of these two parameters. Detailed explanations of the MGGP parameters and operators can be found in the literature (Searson et al. 2010; Gandomi and Alavi 2012).
Spatial interpolation methods
Interpolation techniques can be divided into two main categories: deterministic and geostatistical (stochastic). Deterministic interpolation techniques use mathematical functions to generate surfaces from sample points based on the degree of similarity (inverse distance weighting, IDW) or smoothing (radial basis functions, RBF). On the other hand, stochastic techniques (kriging) use mathematical and statistical features to estimate uncertainty. Geostatistical methods are known as kriging. Semi-variograms are used in interpolation by kriging. OK is a common method used in geostatistical studies to predict the value of a variable at a single point or block. It is used when there are no trends in the data. The most common functions used to model semi-variograms are exponential, spherical, and Gaussian functions (Hodam et al. 2017; Küçüktopcu and Cemek 2022). COK is advantageous compared to kriging when working with variables that are expensive to sample (Orejuela et al. 2021).
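The three semi-variogram functions named above can be sketched as follows. This is an illustration under stated assumptions, not the parameterization used in the study: the argument names (nugget, partial_sill, a) are hypothetical, and the range parameter a is used without the scaling factors that some software applies to define a practical range.

```python
import math


def spherical(h, nugget, partial_sill, a):
    """Spherical model: reaches the sill (nugget + partial_sill) exactly at h = a."""
    if h <= 0:
        return 0.0  # the semi-variance at zero lag is zero by definition
    if h >= a:
        return nugget + partial_sill
    ratio = h / a
    return nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)


def exponential(h, nugget, partial_sill, a):
    """Exponential model: approaches the sill asymptotically."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / a))


def gaussian(h, nugget, partial_sill, a):
    """Gaussian model: parabolic near the origin, asymptotic sill."""
    if h <= 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-((h / a) ** 2)))


# Semi-variance at half the range for a hypothetical variogram with no nugget.
half_range = spherical(5.0, nugget=0.0, partial_sill=1.0, a=10.0)
```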
The deterministic methods, on the other hand, use mathematical functions for interpolation. IDW estimates the value at unsampled points from the values of sampled points using linear combinations of values weighted by an inverse function of the distances between the points of interest and the surrounding known values. Some basic parameters are considered when applying the interpolation methods. The accuracy of the IDW method is affected by the power (exponent) parameter (power 1, power 2, etc.) (Burrough and McDonnell 1998). The RBF methods are a series of exact interpolation methods in which the surface must pass through each measured sample value. The RBF method has five basic functions: thin plate spline (TPS), spline with tension (ST), completely regularized spline (CRS), multiquadric function (MQ), and inverse multiquadric function (IMQ) (Xie et al. 2011). The flowchart for selected ML methods and interpolation techniques of this study is presented in Fig. 2.
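The IDW weighting described above can be sketched as follows. This minimal version uses every sample rather than a search neighborhood, which is an assumption, and the station coordinates and values are hypothetical.

```python
import math


def idw(samples, target, power=2.0):
    """Inverse distance weighted estimate at target from ((x, y), value) samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in samples:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # exact interpolator: reproduce the sampled value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical stations with ET0 values in mm/day.
stations = [((0.0, 0.0), 2.0), ((2.0, 0.0), 4.0)]
midpoint = idw(stations, (1.0, 0.0), power=1.0)
closer_to_first = idw(stations, (0.5, 0.0), power=1.0)
```

Raising the power concentrates the weight on the nearest stations, which is why the exponent affects the accuracy of the result.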
Model performance evaluation
Statistical evaluation
Various statistical performance indicators, such as the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2), were considered (Cemek et al. 2022) and formulated in Eqs. (6–8).
in which Zi and Zp are the calculated and predicted values at the ith observation, Zi,avg is the mean of the calculated values, and n is the number of data points.
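Eqs. (6–8) are not reproduced in the text. The standard definitions are sketched below with the symbols defined above, the sums running over the n observations. The form shown for R2 is an assumption chosen because only the mean of the calculated values is defined; some studies report the squared Pearson correlation instead.

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n} \sum_{i=1}^{n} \left| Z_i - Z_p \right| \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Z_i - Z_p \right)^{2}} \\
R^{2} &= 1 - \frac{\sum_{i=1}^{n} \left( Z_i - Z_p \right)^{2}}
                  {\sum_{i=1}^{n} \left( Z_i - Z_{i,\mathrm{avg}} \right)^{2}}
\end{aligned}
```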
Graphical evaluation
For this study, in addition to the statistical analyses described above, we employed three different visualization methods, including scatter, violin, and Taylor plots to evaluate the effectiveness of the models examined.
Results and discussion
Machine learning results
In this study, a Pearson correlation analysis was performed between meteorological input variables including T, RH, RS, and WS, and the output variable (ET0). According to the correlation analysis, RS has the strongest effect on ET0, while WS has the weakest one (Fig. 3). Previous studies also support this argument (Gong et al. 2021; Ge et al. 2022). The highest linear relationships were found between RS, T, RH, and ET0; therefore, RS, T, and RH were used as the input parameters. Analyses were performed with the following seven combinations: (i) RS, (ii) T, (iii) RH, (iv) RS and T, (v) RS and RH, (vi) T and RH, and (vii) RS, T, and RH.
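The seven input combinations listed above are all the non-empty subsets of the three selected variables (2³ − 1 = 7). A short sketch of how they can be enumerated, with hypothetical variable names:

```python
from itertools import combinations

inputs = ("RS", "T", "RH")

# All non-empty subsets, ordered by size: three singles, three pairs, one triple.
input_sets = [combo
              for size in range(1, len(inputs) + 1)
              for combo in combinations(inputs, size)]
```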
To eliminate unit problems, the data were normalized between 0 and 1. Then, the data were randomly divided into two groups: training (70%) and testing (30%). After the hyperparameters of the models were determined using the training data, the estimation performances of the models were obtained using the test data. For KNN, “number of neighbors (k),” “leaf size (ls),” and “power parameter (p)” are the main hyperparameters for model generation. A grid search procedure was used for each model to find the best values for these parameters. For the parameters k, ls, and p, the ranges 1–50, 1–20, and 1–10 were considered, respectively. In all models examined, the parameters ls and p did not significantly improve model performance, so default values of 30 and 2, respectively, were chosen. On the other hand, the parameter k significantly improved the accuracy of the prediction results. For example, in the model with RS, T, and RH as inputs, k equal to 9 gave the best estimation result (Appendix 1a). The prediction performances of all KNN models are presented in Table 2. In the testing stage, the lowest RMSE and MAE values (0.424 and 0.309 mm, respectively) were obtained for the KNN7 model, in which the inputs were RS, T, and RH, whereas the highest values (1.284 and 1.015 mm, respectively) were obtained for the KNN3 model, in which the only input was RH.
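The normalization and random 70/30 split described above can be sketched as follows. This is an illustration only: min-max scaling is assumed as the normalization to the range 0–1, and the helper names, seed, and sample values are hypothetical.

```python
import random


def min_max_scale(values):
    """Rescale a sequence linearly so that its minimum maps to 0 and its maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        raise ValueError("cannot scale a constant sequence")
    return [(v - low) / (high - low) for v in values]


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and cut them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical daily mean temperatures in °C and ten row indices.
temperatures = [-5.0, 0.0, 10.0, 25.0, 35.0]
scaled = min_max_scale(temperatures)
train, test = train_test_split(range(10))
```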
The three parameters of the SVR, including the error term penalty parameter (C), radius (ε), and kernel coefficient (γ), are called the primary hyperparameters. According to Chang and Lin (2011), the parameter γ is equal to 1/K, where K is the number of inputs. The tuning parameters in the SVR model were determined using a grid search approach with tenfold cross-validation in this paper (Küçüktopcu 2023). For example, the best accuracy was achieved when C and ε in the SVR7 model were 500 and 0.0001, respectively (Appendix 1b). The predictive performances of all SVR models are also shown in Table 3. The SVR7 model achieved the best results (RMSE = 0.327 mm, MAE = 0.232 mm), while the worst estimate was obtained by the SVR3 model (RMSE = 1.237 mm, MAE = 1.029 mm) in the testing phase.
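A grid search with k-fold cross-validation, as described above, can be sketched generically as follows. The sketch makes several assumptions: the selection criterion is the pooled RMSE over the validation folds, the folds are contiguous blocks without shuffling, and `constant_model` is a hypothetical stand-in for fitting an SVR with candidate values of C and ε.

```python
import math


def k_fold_indices(n_samples, n_folds=10):
    """Yield (train, valid) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, valid
        start += size


def cv_rmse(fit_predict, X, y, params, n_folds=10):
    """Pooled RMSE over all validation folds for one parameter setting."""
    squared_errors = []
    for train, valid in k_fold_indices(len(X), n_folds):
        predictions = fit_predict([X[i] for i in train], [y[i] for i in train],
                                  [X[i] for i in valid], **params)
        squared_errors += [(p - y[i]) ** 2 for p, i in zip(predictions, valid)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def grid_search(fit_predict, X, y, grid, n_folds=10):
    """Return the parameter dict from grid with the lowest cross-validated RMSE."""
    return min(grid, key=lambda params: cv_rmse(fit_predict, X, y, params, n_folds))


def constant_model(train_X, train_y, valid_X, shrink):
    """Hypothetical stand-in model: predicts a scaled mean of the training targets."""
    level = shrink * sum(train_y) / len(train_y)
    return [level for _ in valid_X]


X = [[float(i)] for i in range(20)]
y = [2.0] * 20
best = grid_search(constant_model, X, y, [{"shrink": 0.5}, {"shrink": 1.0}])
```

Replacing `constant_model` with a function that fits a real regressor on the training fold and predicts on the validation fold leaves the search loop unchanged.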
With the MGGP, input and output data can be expressed nonlinearly. MGGP equations and performance criteria for different input models are presented in Table 4. The best results (RMSE = 0.391 mm, MAE = 0.298 mm) were obtained with the MGGP7 model, whereas the worst estimates were obtained with the MGGP3 model (RMSE = 1.797 mm, MAE = 1.535 mm) in the testing phase.
The comparisons of the three models of the best combination of input data (RS, T, RH) for the testing data were presented on the scatter, Taylor, and violin plots (Figs. 4 and 5). In scatter plots (Fig. 4a, b, and c), most data were close to the 1:1 line. As shown in Fig. 4, all three models proved successful in estimating ET0 values. Among the models examined, the SVR7 model performed best with respect to the R2, RMSE, and MAE criteria. In contrast, the KNN7 model lagged behind the other models (SVR7 and MGGP7) in estimation performance.
Figure 5a shows the violin plots for each model. The SVR7, KNN7, and MGGP7 models produced similar plots to the calculated model. As shown in Fig. 5a, the SVR7 model provided similar estimates of ET0 as the calculated values. The Taylor diagram, which includes statistical analysis, was also used in comparing the three ML techniques. The Taylor diagram evaluates the agreement of the estimated data with the reference data. By using the Taylor diagram, further comparisons of the models were obtained. As shown in Fig. 5b, the SVR7 model gave a closer estimate of the calculated value than the MGGP7 and KNN7 models.
Interpolation method results
Descriptive statistics for ET0 values calculated with the ASCE Penman–Monteith method are shown in Table 5. The highest ET0 value (5.37 mm day−1) was calculated for July and the lowest (0.85 mm day−1) for January. The coefficient of variation (CV) measures the variation of an attribute. A CV ≤ 15% indicates a low variation, 16–35% a moderate variation, and a CV ≥ 36% high variation (Güler et al. 2014). The CV of the reference evapotranspiration values was 16.10% in May and 33.85% in January; under the thresholds above, both values fall in the moderate class, with May close to its lower bound. The highest coefficient of skewness for ET0 occurred in November (1.28), January (1.01), and February (1.12), while the distribution was normal in the remaining months. Geostatistical methods are more reliable when the data exhibit a normal distribution. Therefore, data normality was checked before the initiation of geostatistical methods. In most of the cases, data showed a lognormal distribution.
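The CV thresholds quoted above can be applied as in the following sketch. Two assumptions are made: the CV is computed from the sample standard deviation, and values falling in the gaps between the quoted classes (between 15 and 16%, and between 35 and 36%) are assigned to the moderate class. The function names and sample values are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def classify_cv(cv_percent):
    """Map a CV in percent to the low / moderate / high variation classes."""
    if cv_percent <= 15.0:
        return "low"
    if cv_percent < 36.0:
        return "moderate"
    return "high"


# Hypothetical monthly ET0 values (mm/day) at a handful of stations.
station_values = [2.1, 2.4, 2.0, 2.6, 2.3]
variation_class = classify_cv(coefficient_of_variation(station_values))
```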
Semi-variogram models and the model parameters nugget (Co), sill (Co + C), and range for daily ET0 values are shown in Appendix 2. In addition, R2 and RSS (residual sum of squares), which indicate the suitability of the model, and nugget ratios, which indicate the level of spatial dependence, were determined. Semi-variograms for ET0 revealed that the exponential model was suitable for January, February, April, November, and December, and the spherical model was suitable for the other months. The highest r (0.98) and the lowest RSS (1.87 × 10−5) values of the spherical model were observed in March. The highest range value (31.41 km) occurred in December and the lowest (2.87 km) in April. Accurate estimation of variables depends on the density of observed data points, their spatial variation, and proper modeling of semi-variograms (Fotheringham et al. 2000).
The ratio of nugget semi-variance to total semi-variance is used to classify spatial dependence. A variable is classified as strongly spatially dependent when the ratio is ≤ 25%, moderately spatially dependent when it is 25–75%, and weakly spatially dependent when it is ≥ 75% (Cambardella et al. 1994). Experimental and theoretical semi-variograms for daily ET0 values are presented in Appendix 3. Spatial variation over short distances is generally related to a sampling scheme that does not capture the short-distance spatial variation and/or to experimental errors. Rehman and Ghori (2000) modeled the spatial dependency of global solar radiation with geostatistical techniques, fitting a spherical model to experimental monthly semi-variograms. Irmak et al. (2010) predicted minimum and maximum temperatures with OK using the monthly semi-variograms of these two variables.
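The nugget-ratio classification described above can be sketched as follows. The quoted classes overlap at their boundaries, so the sketch assumes that a ratio of exactly 25% counts as strong and a ratio of exactly 75% counts as weak. The function names and sample parameters are hypothetical.

```python
def nugget_ratio(nugget, sill):
    """Nugget-to-sill ratio in percent, where sill is the total semi-variance (Co + C)."""
    if sill <= 0:
        raise ValueError("sill must be positive")
    return 100.0 * nugget / sill


def spatial_dependence(ratio_percent):
    """Map a nugget ratio in percent to a spatial dependence class."""
    if ratio_percent <= 25.0:
        return "strong"
    if ratio_percent < 75.0:
        return "moderate"
    return "weak"


# Hypothetical fitted semi-variogram parameters.
dependence = spatial_dependence(nugget_ratio(nugget=0.05, sill=0.40))
```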
In the OK method, the Gaussian model outperformed others based on RMSE and MAE. Elevation was used as an auxiliary variable in COK, and the lowest RMSE and MAE values were with the exponential model. In IDW, the lowest RMSE and MAE values were found with linear interpolation (power = 1), and in the RBF method, where four different functions were evaluated, the lowest RMSE and MAE values were observed in the ST model (Appendix 4). Xu et al. (2006) interpolated seasonal and annual ET0 calculated with the Penman–Monteith method and pan evaporation and concluded that kriging and IDW were the best methods.
Nugget variance was generally high, suggesting fine-scale discontinuities in ET0. These could result from several factors, including the combined effect of differences in climate and topography, small-scale differences that the current sampling scheme could not capture, errors arising from the locations of the climate stations, and errors arising from the calculation of ET0 from climate data.
A greater nugget effect occurred for winter months, indicating a greater short-range variation of ET0 in the winter. The range values for winter months were far greater than the remaining nine months, which showed that ET0 was more spatially dependent in the winter than in other seasons. Sill values representing overall variation in ET0 were greater for winter months than the rest; this revealed that ET0 is more variable in winter than in the other seasons. All these showed that the spatial variation of ET0 in winter was considerably different from that of the rest of the year.
Spatial distribution maps for ET0 predicted by OK, COK, IDW, and RBF are presented in Fig. 6 for January (the coldest month in winter) and July (the hottest month in summer). In each case, the range of ET0 was divided into ten classes. The spatial pattern of ET0 predicted by the four techniques differed across the nation. Although the overall spatial patterns were similar, some significant differences occurred among predictions at some locations. The geostatistical techniques yielded similar spatial patterns, and similarly, IDW and RBF predicted similar spatial patterns of ET0 across Türkiye in January and July. Figure 6 further shows that the area covered by each ET0 class differs considerably among prediction methods in both cases (January and July). The validation results showed that OK outperformed the other three prediction methods, justifying its use for predicting ET0 in Türkiye. Kriging techniques offer several advantages: they produce error maps, reduce the effect of data clustering, and provide a basis for stochastic simulation of possible realizations of the estimated variables (Karamouz et al. 2012; Geleta et al. 2019). It is not always possible to obtain data at every location of interest. Therefore, values at unsampled locations can be estimated from data collected at the locations that can be sampled. Interpolation methods are important in bridging such gaps through spatial mapping of regionalized variables for ET0.
The spatial pattern of ET0 mapped with the OK–Gaussian method was similar in December and January; March, April, and May; July and August; and September and October (Fig. 7). ET0 values predicted by the OK agreed well with those calculated by the ASCE Penman–Monteith method. The OK consistently under-predicted ET0 in Izmir (Aegean coast), Antalya (southern Anatolia), and Diyarbakır (southeastern Anatolia) and over-predicted it in Sivas (Central Anatolia). In the remaining provinces, there was no consistent over- or under-prediction. In general, over-/under-prediction was more drastic in the summer months.
Due to the lack of knowledge of internal variables, researchers have developed various ML algorithms to predict evaporation because they provide straightforward solutions to nonlinear multivariate functions. For instance, Keskin et al. (2004) successfully applied fuzzy logic models as an alternative to classical evaporation estimation formulas for Lake Egirdir in the western region of Türkiye. Lu et al. (2018) proposed the gradient-boosting decision tree model because it was the most accurate and stable model for estimating evaporation in the Poyang Lake watershed. Kumar et al. (2021) recommended the radial function-based SVM model for estimating evaporation under the same climatic conditions and with the same meteorological parameters. Ahi et al. (2022) found that the ANN model had high performance in evaporation estimates with few input parameters. Furthermore, numerous recent studies have been published in the literature to estimate ET0 using independent variables from specific periods for one or more meteorological stations (Mehdizadeh et al. 2021; Mosre and Suárez 2021; Rashid Niaghi et al. 2021; Kadkhodazadeh et al. 2022; Kim et al. 2022; Zouzou and Citakoglu 2022). In this study, the daily data, including T, RH, WS, and RS, from 213 stations in Türkiye were used to estimate ET0. In the Pearson correlation analysis between the input parameters and the output parameter ET0, RS had the most significant influence on ET0, while WS had the least effect. The argument presented here is in line with previous research (Ge et al. 2022; Bayram and Çıtakoğlu 2023). This study used KNN, SVR, and MGGP models to estimate ET0. Based on the results, the SVR model with inputs RS, T, and RH had the best accuracy, while the KNN model had the least accurate results. Bayram and Çıtakoğlu (2023) found that the MGGP model is more robust than the M5Tree and KNN models in modeling monthly ET0. 
However, some authors have found that the performance of the KNN model is better (Yamaç and Todorovic 2020; Al-Mukhtar 2021).
In the second stage of this study, we also investigated the different interpolation techniques, including OK, COK, IDW, and RBF, to predict the daily ET0 values of Türkiye. The findings of our study showed that OK is the best interpolation method to predict ET0 in Türkiye. Different studies examining ET0 have been conducted for spatial distribution using different methods. Dalezios et al. (2002) used geostatistical techniques to study ET0 and identified spatial variations by kriging estimates over very complex and large Greek fields. Jadhav et al. (2017) used simple kriging (SK), OK, and IDW to determine the spatial distribution of monthly ET0 values for ten districts in western Maharashtra. The SK interpolation technique was found to be suitable to map ET0 for all months except February. Geleta et al. (2019) used the Penman–Monteith method to calculate ET0 from 1991 to 2015 in the Horro Guduru Wollega zone from meteorological data and utilized the OK interpolation technique to estimate temporal and spatial variability. Okechukwu (2020) employed IDW and kriging interpolation techniques for monthly, annual, and seasonal ET0 and obtained a positive correlation of prediction results by IDW (R: 0.83) and kriging (R: 0.50) for ET0.
Conclusions
This study consists of two stages. In the first stage, a daily ET0 forecast was conducted with different meteorological inputs. In this scenario, the capabilities and accuracy of three ML techniques, namely KNN, SVR, and MGGP, were compared. In the second stage of the study, different interpolation techniques, such as OK, COK, IDW, and RBF, were used to predict the daily ET0 values of Türkiye. Then, spatial maps of ET0 were generated using the interpolation technique that provided the best estimation results. The conclusions reported below were drawn from the results. RS, T, and RH proved to be the most influential input combination for ET0 estimation. Based on statistical and graphical criteria, SVR was the most successful model, followed by MGGP and KNN. The OK interpolation technique outperformed the other methods (COK, IDW, and RBF) in predicting ET0 in Türkiye.
The study also has some limitations: (1) Data from 213 meteorological stations covering a 52-year period (between 1962 and 2014) in Türkiye were used. (2) Four input parameters were considered for the prediction of ET0. (3) Three ML methods (KNN, SVR, and MGGP) and four interpolation techniques (OK, COK, IDW, and RBF) were investigated. Future studies should explore deep learning and hybrid techniques to further improve the accuracy of ET0 prediction.
References
Abrishami N, Sepaskhah AR, Shahrokhnia MH (2019) Estimating wheat and maize daily evapotranspiration using artificial neural network. Theor Appl Climatol 135:945–958
Achite M, Jehanzaib M, Sattari MT et al (2022) Modern techniques to modeling reference evapotranspiration in a semi-arid area based on ANN and GEP models. Water 14:1210
Ahi Y, Coşkun Dilcan Ç, Köksal DD, Gültaş HT (2022) Reservoir evaporation forecasting based on climate change scenarios using artificial neural network model. Water Resour Manag. https://doi.org/10.1007/s11269-022-03365-0
Allen RG, Pereira LS, Raes D, Smith M (1998) Crop evapotranspiration: guidelines for computing crop water requirements. In FAO irrigation and drainage paper 56. Food and agriculture organization of the United Nations, Rome
Allen RG, Pruitt WO, Wright JL et al (2006) A recommendation on standardized surface resistance for hourly calculation of reference ETo by the FAO56 Penman-Monteith method. Agric Water Manag 81:1–22
Al-Mukhtar M (2021) Modeling of pan evaporation based on the development of machine learning methods. Theor Appl Climatol 146:961–979
Ayaz A, Chandra S, Mandlecha P, Shaik R (2021) Modelling of reference evapotranspiration for semi-arid climates using artificial neural network. In: Majumder M, Kale GD (eds) Water and energy management in India. Springer, Cham, pp 141–160
Banda P, Cemek B, Küçüktopcu E (2018) Estimation of daily reference evapotranspiration by neuro computing techniques using limited data in a semi-arid environment. Arch Agron Soil Sci 64:916–929
Bayram S, Çıtakoğlu H (2023) Modeling monthly reference evapotranspiration process in Turkey: application of machine learning methods. Environ Monit Assess 195:1–23
Blaney HF, Criddle WD (1950) Determining water requirements in irrigated areas from climatological and irrigation data. In USDA Soil Conservation Service. SCS-TP-96
Burrough PA, McDonnell RA (1998) Creating continuous surfaces from point data. In: Burrough PA, McDonnell RA (eds) Principles of geographic information systems. Oxford University Press, Oxford, UK
Cambardella CA, Moorman TB, Novak JM et al (1994) Field-scale variability of soil properties in central Iowa soils. Soil Sci Soc Am J 58:1501–1511
Cemek B, Arslan H, Küçüktopcu E, Simsek H (2022) Comparative analysis of machine learning techniques for estimating groundwater deuterium and oxygen-18 isotopes. Stoch Environ Res Risk Assess 36:4271–4285
Ceperic E, Ceperic V, Baric A (2013) A strategy for short-term load forecasting by support vector regression machines. IEEE T Power Syst 28:4356–4364
Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27
Cheng S, Jin Y, Harrison SP et al (2022) Parameter flexible wildfire prediction using machine learning techniques: forward and inverse modelling. Remote Sens 14:3228
Chia MY, Huang YF, Koo CH (2020) Support vector machine enhanced empirical reference evapotranspiration estimation with limited meteorological parameters. Comput Electron Agric 175:105577
Citakoglu H, Cobaner M, Haktanir T, Kisi O (2014) Estimation of monthly mean reference evapotranspiration in Turkey. Water Resour Manag 28:99–113
da Silva Júnior JC, Medeiros V, Garrozi C et al (2019) Random forest techniques for spatial interpolation of evapotranspiration data from Brazilian’s Northeast. Comput Electron Agric 166:105017
Dalezios NR, Loukas A, Bampzelis D (2002) Spatial variability of reference evapotranspiration in Greece. Phys Chem Earth 27:1031–1038
Dimitriadou S, Nikolakopoulos KG (2022) Artificial neural networks for the prediction of the reference evapotranspiration of the Peloponnese Peninsula, Greece. Water 14:2027
Ferreira LB, da Cunha FF, de Oliveira RA, Fernandes Filho EI (2019) Estimation of reference evapotranspiration in Brazil with limited meteorological data using ANN and SVM–A new approach. J Hydrol 572:556–570
Fotheringham AS, Brunsdon C, Charlton M (2000) Quantitative geography: perspectives on spatial data analysis. Sage Publications, London
Gandomi AH, Alavi AH (2012) A new multi-gene genetic programming approach to nonlinear system modeling. Part I: materials and structural engineering problems. Neural Comput Appl 21:171–187
Gandomi AH, Alavi AH, Arjmandi P et al (2010) Genetic programming and orthogonal least squares: a hybrid approach to modeling the compressive strength of CFRP-confined concrete cylinders. J Mech Mater Struct 5:735–753
Ge J, Zhao L, Yu Z et al (2022) Prediction of greenhouse tomato crop evapotranspiration using XGBoost machine learning model. Plants 11:1923
Geleta CD, Bulto GO, Gemechu MG (2019) Spatiotemporal variation in reference evapotranspiration over Horro Guduru Wollega zone using kriging method. World Appl Sci J 37:250–258
Gentilucci M, Bufalini M, Materazzi M et al (2021) Calculation of potential evapotranspiration and calibration of the Hargreaves equation using geostatistical methods over the last 10 years in Central Italy. Geosciences 11:348
Gong X, Qiu R, Zhang B et al (2021) Energy budget for tomato plants grown in a greenhouse in northern China. Agric Water Manag 255:107039
Güler M (2014) A comparison of different interpolation methods using the geographical information system for the production of reference evapotranspiration maps in Turkey. J Meteorol Soc Jpn 92:227–240
Güler M, Arslan H, Cemek B, Erşahin S (2014) Long-term changes in spatial variation of soil electrical conductivity and exchangeable sodium percentage in irrigated mesic ustifluvents. Agric Water Manag 135:1–8
Guo G, Wang H, Bell D et al (2003) KNN model-based approach in classification. In: Meersman R, Tari Z, Schmidt DC (eds) On the move to meaningful internet systems 2003: CoopIS, DOA, and ODBASE. OTM 2003. Lecture Notes in Computer Science, vol 2888. Springer, Berlin
Hargreaves GH, Samani ZA (1985) Reference crop evapotranspiration from temperature. Appl Eng Agric 1:96–99
Hodam S, Sarkar S, Marak AGR et al (2017) Spatial interpolation of reference evapotranspiration in India: comparison of IDW and kriging methods. J Inst Eng (India) Ser A 98:511–524
Hosseinzadeh A, Moeinaddini A, Ghasemzadeh A (2021) Investigating factors affecting severity of large truck-involved crashes: Comparison of the SVM and random parameter logit model. J Safety Res 77:151–160
Imandoust SB, Bolandraftar M (2013) Application of k-nearest neighbor (knn) approach for predicting economic events: theoretical background. Int J Eng Res Appl 3:605–610
Irmak A, Ranade PK, Marx D et al (2010) Spatial interpolation of climate variables in Nebraska. Trans ASABE 53:1759–1771
Jadhav PB, Kadam SA, Gorantiwar SD (2017) Mapping of reference evapotranspiration using geostatistical analysis techniques. Agric Res J 54:197–201
Jang J-C, Sohn E-H, Park K-H, Lee S (2021) Estimation of daily potential evapotranspiration in real-time from GK2A/AMI data using artificial neural network for the Korean Peninsula. Hydrology 8:129
Kadkhodazadeh M, Valikhan Anaraki M, Morshed-Bozorgdel A, Farzin S (2022) A new methodology for reference evapotranspiration prediction and uncertainty analysis under climate change conditions based on machine learning, multi criteria decision making and Monte Carlo methods. Sustainability 14:2601
Karamouz M, Nazif S, Falahi M (2012) Hydrology and hydroclimatology: principles and applications. CRC Press, New York
Keskin ME, Terzi Ö, Taylan D (2004) Fuzzy logic model approaches to daily pan evaporation estimation in western Turkey. Hydrol Sci J 49:1001–1010
Kim S-J, Bae S-J, Jang M-W (2022) Linear regression machine learning algorithms for estimating reference evapotranspiration using limited climate data. Sustainability 14:11674
Krishnashetty PH, Balasangameshwara J, Sreeman S et al (2021) Cognitive computing models for estimation of reference evapotranspiration: a review. Cogn Syst Res 70:109–116
Küçüktopcu E (2023) Comparative analysis of data-driven techniques to predict heating and cooling energy requirements of poultry buildings. Buildings 13:142
Küçüktopcu E, Cemek B (2022) A comparison of deterministic and stochastic models for predicting air and litter properties in a broiler building. Int J Environ Sci Technol 19:12369–12384
Kumar M, Kumari A, Kumar D et al (2021) The superiority of data-driven techniques for estimation of daily pan evaporation. Atmosphere 12:701
Kustas WP, Anderson MC, Alfieri JG et al (2018) The grape remote sensing atmospheric profile and evapotranspiration experiment. Bull Am Meteorol Soc 99:1791–1812
López-Urrea R, de Santa Olalla FM, Fabeiro C, Moratalla A (2006) Testing evapotranspiration equations using lysimeter observations in a semi-arid climate. Agric Water Manag 85:15–26
Lu X, Ju Y, Wu L et al (2018) Daily pan evaporation modeling from local and cross-station data using three tree-based machine learning models. J Hydrol 566:668–684
Makwana JJ, Deora BS, Parmar BS et al (2022) Modelling of reference evapotranspiration using artificial neural network in semi-arid region of north Gujarat. J Agric Eng 59:193–200
Mardikis MG, Kalivas DP, Kollias VJ (2005) Comparison of interpolation methods for the prediction of reference evapotranspiration: an application in Greece. Water Resour Manag 19:251–278
Mehdizadeh S, Mohammadi B, Pham QB, Duan Z (2021) Development of boosted machine learning models for estimating daily reference evapotranspiration and comparison with empirical approaches. Water 13:3489
Mittal K, Aggarwal G, Mahajan P (2019) Performance study of K-nearest neighbor classifier and K-means clustering for predicting the diagnostic accuracy. Int J Inf Technol 11:535–540
Mohammadrezapour O, Piri J, Kisi O (2019) Comparison of SVM, ANFIS and GEP in modeling monthly potential evapotranspiration in an arid region (Case study: Sistan and Baluchestan Province, Iran). Water Supply 19:392–403
Mosre J, Suárez F (2021) Actual evapotranspiration estimates in arid cold regions using machine learning algorithms with in situ and remote sensing data. Water 13:870
Niazkar M, Niazkar HR (2020) Covid-19 outbreak: application of multi-gene genetic programming to country-based prediction models. Electron J Gen Med 17:em247
Okechukwu ME (2020) Spatial distribution of rainfall and reference evapotranspiration in southeast Nigeria. Agric Eng Int: CIGR J 22:1–8
Orejuela IP, González CL, Guerra XB et al (2021) Geoid undulation modeling through the Cokriging method: a case study of Guayaquil, Ecuador. Geod Geodyn 12:356–367
Penman HL (1948) Natural evaporation from open water, bare soil and grass. Proc R Soc Lond A 193:120–145
Prasad R, Deo RC, Li Y, Maraseni T (2017) Input selection and performance optimization of ANN-based streamflow forecasts in the drought-prone Murray Darling Basin region using IIS and MODWT algorithm. Atmos Res 197:42–63
Prasad D, Goyal SK, Sharma A et al (2019) System model for prediction analytics using k-nearest neighbors algorithm. J Comput Theor Nanosci 16:4425–4430
Priestley CHB, Taylor RJ (1972) On the assessment of surface heat flux and evaporation using large-scale parameters. Mon Weather Rev 100:81–92
Rashid Niaghi A, Hassanijalilian O, Shiri J (2021) Estimation of reference evapotranspiration using spatial and temporal machine learning approaches. Hydrology 8:25
Rehman S, Ghori SG (2000) Spatial estimation of global solar radiation using geostatistics. Renew Energy 21:583–605
Sabziparvar A-A, Tabari H (2010) Regional estimation of reference evapotranspiration in arid and semi-arid regions. J Irrig Drain Eng 136:724–731
Sahoo A, Ghose DK (2022) Imputation of missing precipitation data using KNN, SOM, RF, and FNN. Soft Comput 26:5919–5936
Sattari MT, Apaydin H, Band SS et al (2021) Comparative analysis of kernel-based versus ANN and deep learning methods in monthly reference evapotranspiration estimation. Hydrol Earth Syst Sci 25:603–618
Searson DP, Leahy DE, Willis MJ (2010) GPTIPS: an open source genetic programming toolbox for multigene symbolic regression. In: Proceedings of the International MultiConference of Engineers and Computer Scientists 2010 (IMECS 2010), Hong Kong, 17–19 March
Tejada AT Jr, Ella VB, Lampayan RM, Reaño CE (2022) Modeling reference crop evapotranspiration using support vector machine (SVM) and extreme learning machine (ELM) in region IV-A, Philippines. Water 14:754
Thornthwaite CW (1948) An approach toward a rational classification of climate. Geogr Rev 38:55–94
Tunca E, Köksal ES, Torres-Rua A et al (2022) Estimation of bell pepper evapotranspiration using two-source energy balance model based on high-resolution thermal and visible imagery from unmanned aerial vehicles. J Appl Remote Sens 16:022204
Vicente-Serrano SM, Lanjeri S, López-Moreno JI (2007) Comparison of different procedures to map reference evapotranspiration using geographical information systems and regression-based techniques. Int J Climatol 27:1103–1118
Webster R, Oliver MA (2007) Geostatistics for environmental scientists. Statistics in practice. Wiley, Chichester
Xie Y, Chen T-B, Lei M et al (2011) Spatial distribution of soil heavy metal pollution estimated by different interpolation methods: accuracy and uncertainty analysis. Chemosphere 82:468–476
Xu C-Y, Gong L, Jiang T et al (2006) Analysis of spatial distribution and temporal trend of reference evapotranspiration and pan evaporation in Changjiang (Yangtze River) catchment. J Hydrol 327:81–93
Xu D, Wang Y, Peng P et al (2020) Real-time road traffic state prediction based on kernel-KNN. Transp A: Transp Sci 16:104–118
Yamaç SS, Todorovic M (2020) Estimation of daily potato crop evapotranspiration using three different machine learning algorithms and four scenarios of available meteorological data. Agric Water Manag 228:105875
Zhang Q, Barri K, Jiao P et al (2021) Genetic programming in civil engineering: advent, applications and future trends. Artif Intell Rev 54:1863–1885
Zhong H, Wang J, Jia H et al (2019) Vector field-based support vector regression for building energy consumption prediction. Appl Energy 242:403–414
Zouzou Y, Citakoglu H (2022) General and regional cross-station assessment of machine learning models for estimating reference evapotranspiration. Acta Geophys 71(2):927–947
Funding
This research received no external funding.
Author information
Contributions
DY, EK, BC, and HS were responsible for conceptualization and methodology; EK and BC contributed to software; EK and HS carried out the formal analysis; DY and BC curated the data; EK and DY took part in writing—original draft preparation and visualization; and BC and HS participated in writing—review and editing and supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have no competing interests.
Ethical approval
Not applicable.
Informed consent
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yildirim, D., Küçüktopcu, E., Cemek, B. et al. Comparison of machine learning techniques and spatial distribution of daily reference evapotranspiration in Türkiye. Appl Water Sci 13, 107 (2023). https://doi.org/10.1007/s13201-023-01912-7
DOI: https://doi.org/10.1007/s13201-023-01912-7