Evaluation of multivariate linear regression for reference evapotranspiration modeling in different climates of Iran

The study aimed to evaluate the accuracy of empirical equations (Hargreaves-Samani; HS, Irmak; IR and Dalton; DT) and multivariate linear regression models (MLR1–6) for estimating reference evapotranspiration (ETRef) in different climates of Iran based on the Köppen method including arid desert (Bw), semiarid (Bs), humid with mild winters (C), and humid with severe winters (D). For this purpose, climatic data of 33 meteorological stations during 30 statistical years 1990–2019 were used with a monthly time step. Based on various meteorological data (minimum and maximum temperature, relative humidity, wind speed, solar radiation, extraterrestrial radiation, and vapor pressure deficit), in addition to 6 multivariate linear regression models and three empirical equations were used as MLR1, MLR2, and HS (temperature-based), MLR3 and IR (radiation-based), MLR4, MLR5 and DT (mass transfer-based), and MLR6 (combination-based) were also used to estimate the reference evapotranspiration. The results of these models were compared using the root mean square error (RMSE), mean absolute error (MAE), scatter index (SI), determination coefficient (R2), and Nash-Sutcliffe efficiency (NSE) statistical criteria with the evapotranspiration results of the FAO56 Penman-Monteith reference as target data. All MLR models gave better results than empirical equations. The results showed that the simplest regression model (MLR1) based on the minimum and maximum temperature data was more accurate than the empirical equations. The lowest and highest accuracy related to the MLR6 model and HS empirical equation with RMSE was 10.8–15.1 mm month−1 and 22–28.3 mm month−1, respectively. Also, among all the evaluated equations, radiation-based models such as IR in Bw and Bs climates with MAE = 8.01–11.2 mm month−1 had higher accuracy than C and D climates with MAE = 13.44–14.48 mm month−1. In general, the results showed that the ability of regression models was excellent in all climates from Bw to D based on SI < 0.2.


Introduction
Accurate estimation of reference evapotranspiration (ET Ref ) is one of the priorities for estimating the water requirement of agricultural products. The complexity of the evapotranspiration process and its dependence on meteorological variables, lack of access to all meteorological data, and the lack of generalizability of a model for different climates have made it difficult to accurately estimate this variable. Meanwhile, the development of simple models based on multivariate linear regression due to the simplicity and the ability to select different models according to the available meteorological variables can help in estimating the appropriate ET Ref in different climates. In general, the estimation of ET Ref can be done through direct and indirect methods and using mathematical models (Jing et al. 2019). Indirect methods for estimating ET Ref began with the development of empirical equations such as Penman-Monteith (PM) (Monteith 1965). Based on the PM method, FAO 56 published a standard method for estimating ET Ref . The PM method presented in FAO 56 has been used by many researchers as the base method in calculating ET Ref (Güçlü et al. 2017;Saggi and Jain 2019;Shiri et al. 2019).
In recent years, however, artificial intelligence-based methods such as the neural networks Gavili et al. 2018), the support vector machine (Tabari et al. 2013), the extreme learning machine (Abdullah et al. 2015), decision tree (Huang et al. 2019;Raza et al. 2020), and hybrid methods (Ehteram et al. 2019;Shiri et al. 2020;Tikhamarine et al. 2019;Zhu et al. 2020;Kim et al. 2014;Sanikhani et al. 2019;Mehdizadeh et al. 2017) have had many applications in estimating ET Ref , but among them, multivariate linear regression method has been compared with other empirical equations and soft computing, validated by many researchers (Reis et al. 2019;Kisi and Heddam 2019;Mattar and Alazba 2019;Tabari et al. 2012).
The appropriate method for estimating ET Ref in each region depends on climatic conditions, required data, and related costs (Sharafi et al. 2016). Accordingly, Unes et al. (2020) predicted daily ET Ref based on climatic conditions using empirical equations Ritchie,Turc,and Penman FAO 56), multilinear regression (MLR), and different data mining techniques (M5T, ANFIS, SVM). According to their results, Turc empirical formula (radiation-based) is found better than other empirical equations and the highest correlation coefficient is calculated for ANFIS, and the minimum errors are calculated for radial basis function SVM. Also, Chen et al. (2020) estimate daily ET Ref based on limited meteorological data using three deep learning methods, two classical machine learning methods, and seven empirical equations. Their results show that, when temperature-based features were available, the deep learning models performed markedly better than temperature-based empirical models, and when radiation-based or humiditybased features were available, all of the proposed deep and classical learning machine models outperformed radiationbased or humidity-based empirical equations beyond the study areas.
Yirga (2019) Karbasi (2018) investigated the Gaussian process regression (GPR) and Wavelet-GPR models to forecast multi-step ahead daily ET Ref at the synoptic station of Zanjan (Iran). The results of the Wavelet-GPR model showed that the performance of the model during the warmer season is better than its performance throughout the year.
dos Santos Farias et al. (2020) evaluated the performance of machine learning techniques and stepwise multiple linear regression method to estimate daily ET Ref with limited weather data in a Brazilian agricultural frontier. Their results showed that machine learning methods are robust in estimating ET Ref , even in the absence of some variables. On the other hand, the use of artificial intelligence models in estimating ET Ref with high accuracy has become prevalent in recent years, but the complexity of these models makes their application difficult for different regions. For this purpose, the present paper uses multivariate linear regression models with different data input to develop a simple comprehensive model with the minimum data required to estimate the ET Ref in different climates of Iran. Although in most of the previous researches, the accuracy of the results has been investigated based on meteorological stations (points), in this research, the accuracy of multivariate linear regression models has been investigated based on different climates in Iran. This innovation extends the results of this paper to similar climates around the world, and also reduces the uncertainty of results as a result of fluctuations in climate variables on a point-by-point basis.

Study area
Iran is located in the northern hemisphere between two longitudes of eastern 44°and 64°and two latitudes of northern 40°and 25°. Meteorological data of different synoptic stations, including different climates, are collected and analyzed. Some records of data input were incomplete, or not available for some stations; therefore, only stations with long statistical period length remained. Accordingly, in the present study, meteorological data of 33 synoptic stations in Iran over a statistical period of 30 years  are used. Figure 1 shows the location of the studied stations along with climatic classification based on the Köppen method (Köppen 1931) and according to the values of air temperature and precipitation. According to the study of Sharafi and Karim (2020) (the Köppen climate classification), 7 stations (Bandar Abbas, Abadan, Ahwaz, Bushehr, Bam, and Zabol), 7 stations (Zahedan, Semnan, Sabzevar, Kerman, Isfahan, Shahroud and Torbat-e Heydarieh), 13 stations (Shiraz, Gorgan, Tehran, Khorramabad, Bandar Anzali, Birjand, Rasht, Kermanshah, Mashhad, Arak, Qazvin, Sanandaj, and Tabriz), and 6 stations (Shahrekord, Khoy, Saqez, Urmia, Hamedan, and Zanjan) are located in the arid desert (Bw), semiarid (Bs), humid with very mild winters (C), and humid with severe winters (D) zones, respectively ( Fig. 1).
To calculate ET Ref values using empirical equations and regression models, the monthly data, parameters of minimum temperature (T min ), maximum temperature (T max ), relative humidity (RH), wind speed (u), solar radiation (R s ), extraterrestrial radiation (R a ), and vapor pressure deficit (VPD) were used in different stations. The statistical characteristics of these parameters are presented for different climates separately in Table 1.
According to the results in Table 1, the average monthly ET Ref values decreased from 123.3 to 90.5 mm month −1 , respectively, from arid desert (Bw) to humid climate with severe winters (D). Also, the highest and lowest coefficient of variation (CV%) of ET Ref changes in Bw and D climates was reported to be equal to 39.2% and 53.8%, respectively. The coefficients of variation of minimum and maximum temperatures similar to ET Ref increased from Bw to D climates (Table 1).
R, rainfall; RH, relative humidity; T min , minimum temperature; T max , maximum temperature; u, wind speed; R s , solar radiation; R a , extraterrestrial radiation; ET Ref , reference evapotranspiration by Penman-Monteith (Allen et al. 1998); Min, minimum; Max, maximum; Ave, average; SD, standard deviation; CV (%), coefficient of variation; Kur, kurtosis; Sk, skewness In Fig. 2, the Pearson correlation coefficient values between ET Ref and other meteorological variables (average, maximum and minimum temperature, vapor pressure deficient, wind speed, relative humidity, solar radiation, and extraterrestrial radiation) are shown for 33 study stations based on Köppen method. Pearson correlation coefficient values of + 1 and − 1 show the highest correlation with a direct and inverse relationship between the dependent variable (ET Ref ) and independent variable (meteorological variables), respectively. Figure 2 shows that ET Ref were directly related to all meteorological variables. The only correlation coefficient between ET Ref and relative humidity was negative, indicating an inverse relationship between

Empirical equations
In the present paper, the FAO 56 Penman-Monteith (ET Ref -PMF56) method was used as the target ET Ref values to compare with the results of empirical equations and MLR models, which can be calculated as: For more information on this method, you can refer to FAO 56 (Allen et al. 1998). The developed empirical equations can be categorized into the temperature-based, radiationbased, and mass transfer-based according to the data used (Liu et al. 2017;Zhang et al. 2018). In this research, the empirical equations of Hargreaves-Samani (temperature-based), (2) (Hargreaves-Samani), (3) (Irmak), and (4) (Dalton), respectively, to estimate ET Ref (Hargreaves and Samani 1985;Irmak et al. 2003;Dalton 1802).
In these equations, ET Ref , reference evapotranspiration (mm month −1 ); Δ, the slope of the saturation vapor pressure function (kPa°C); γ, psychometric constant (kPa°C); R n , net radiation (MJ m −2 day −1 ); G, soil heat flux density (MJ m −2 day −1 ); u, average wind speed at 2 m height (m s −1 ); e s , saturation vapor pressure (kPa); e a , actual vapor pressure; VPD, vapor pressure deficit; α, 1.26; λ, latent heat of the evaporation (MJ kg −1 ); R a , extraterrestrial radiation ( m m m o n t h − 1 ) ; R s , m o n t h l y s o l a r r a d i a t i o n (MJ m −2 month −1 ); RH, relative humidity (%); T a , average air temperature (°C); T max , maximum air temperature (°C); T min , minimum air temperature (°C); the values of a, b, and c are empirical coefficients.    Monthly ETo (mm month

Multivariate linear regression models
Multivariate linear regression is a method to model the relationship between several independent variables (average, maximum and minimum temperature, vapor pressure deficient, wind speed, relative humidity, solar radiation, and extraterrestrial radiation) with a dependent variable (ET Ref ). This method, based on the minimum of mean square error, performs the empirical coefficients of the linear relationship between the dependent variable and the independent variables in such a way that the linear model data has the best fit with the target data. In general, the form of multivariate linear regression equations is the following expression 5.
In this expression, b Y is a dependent variable, b 0 to b m are empirical coefficients and X 1 to X m are independent variables. Also, the accuracy of the results of multivariate linear regression models extremely depends on the number and type of input variables to the model. In the Fig. 5 SI criteria values using different empirical equations and MLR models in the studied stations present study, 6 multivariate linear regression models presented in Table 2 have been developed. Meteorological variables are referred separately for each empirical equations and multivariate linear regression models. On the other hand, MLR1, MLR2, and HS models are defined as temperature-based, MLR3, and IR as radiation-based, MLR4, MLR5, and DT as mass transfer-based, and MLR6 as combination models, respectively (Table 2).

Evaluation criteria
In this study, 5 statistical criteria include the following: RMSE, SI, MAE, NSE, and R 2 were used to compare the results of empirical equations and MLR with ET Ref -PMF56 based on Eqs. (6) to (10).

Root Mean Square Error RMSE
Mean Absolute Error MAE ð Þ MAE Nash−Sutcliffe Efficiency NSE ð Þ NSE   (Table 3). Kiafar et al. (2017) conducted the comparison of gene expression programming (GEP) models and empirical models led to the development of new equations for each climatic region. Their research showed that the models developed for very dry (Bw) and humid (C and D) climates had more accurate results. Also, based on their results, models that used the more climatic variables had a more accurate estimate of ET Ref .
Other researchers have reported similar results in the same climates (Traore et al. 2010;Ozkan et al. 2011;Huo et al. 2012).
To evaluate the relationship between ET Ref and meteorological variables, a t test with a 95% confidence level (α = 0.05) was used. The significance of the multivariate regression linear relationship between ET Ref and each independent variable will be significant at the 95% level if the t stat value is greater than 1.96 or less than − 1.96 (Mattar and Alazba 2019). Figure 3 shows the t stat parameter values for all stations studied by climates classified according to the Köppen method (Fig. 3).
According to the results of Fig. 3, in Bw climate, the two variables R a and R s had no significant linear relationship with ET Ref in none of the stations of this climate at a 95% confidence level. Also in this climate, changes in the T min variable had a significant relationship with ET Ref only in three stations of Bandar Abbas, Bushehr, and Yazd, and the relationship of this variable with ET Ref was not significant in four stations of Abadan, Ahvaz, Bam, and Zabol at 95% confidence level. Also, in Bs climate (Fig. 3b), there was only a significant linear relationship between the three variables T min , u, and VPD at 95% confidence level with ET Ref in most stations of this climate. For other variables including intercept width, T max , R a , R s , and RH, there was a significant linear relationship with ET Ref at a 95% level in two out of seven stations in this climate (Fig. 3b). Figure 3c shows that there is a significant linear relationship between ET Ref and most meteorological variables in station C climates. In other words, there is a significant linear relationship between ET Ref and most meteorological parameters in this climate at a 95% level (Fig. 3c). There is no significant linear relationship between ET Ref values and R a , R s , and RH variables in the most stations with D climate (Fig. 3d). Based on the results of Fig. 3, there is a significant linear relationship between ET Ref and VPD, u, and T min variables in 32, 28, and 25 stations of the 33 studied stations (α = 0.05%), respectively. Also, the least significant linear relationship between ET Ref and R a and R s variables was observed in 8 and 10 stations of all studied stations, respectively (Fig. 3). According to the results of Shiri et al. (2013) and Khanmohammadi et al. (2018), models in which the RH variable was used showed better results in C and D climates, because the effect of RH on the ET Ref is greater in these climates. Also, based on the results of Yassin et al. (2016), the Irmak model (radiation-based) was more accurate in Bw and Bs climates. RMSE values for the Irmak model in arid climates were 25.78% lower than humid climates. Table 4 shows the average values of RMSE and R 2 criteria in stations based on climates classified according to the Köppen method for empirical equations and MLR models. Also, the highest and lowest accuracy of the models in all climates was related to the MLR6 model and the empirical Hargreaves-Samani (HS) equation, respectively. The Irmak method had the best results in all climates in comparison with the other empirical equations. After the MLR6 model, which was estimated based on the linear relationship between ET Ref and all climatic variables, the MLR4 model (based on the minimum and maximum temperature, relative humidity, and wind speed) was more accurate in Bw and Bs climates. In C and D climates, the MLR3 model (based on minimum, maximum temperatures, and solar and extraterrestrial radiations) showed the best estimates of ET Ref between radiation-based models. This can be justified due to the higher correlation of ET Ref with radiation variables in C and D climates (Fig. 2 c and d).
Based on the results of this study, a significant linear relationship between ET Ref and VPD, u, and T min variables were reported in 32, 28, and 25 stations, but this linear relationship was only significant in three variables T min , u, and VPD with ET Ref in most semiarid climate stations (Bs). However, the linear relationship between ET Ref and most meteorological variables in C climate was significant.
Furthermore, according to the results of Table 4, results were more accurate of the simplest regression model (i.e., MLR1) in estimating the ET Ref compared to all empirical equations. Also, in terms of RMSE criteria, the lowest error values were reported for C climate and with changes in the range of 13.3-22.0 mm month −1 . On the other hand, the highest values of R 2 for this climate were around 0.88-0.96 (Table 4).
The results of Shiri et al. (2012) showed that the HS method has less accurate in estimating daily reference evapotranspiration in the north of Spain in comparison with the Priestley-Taylor empirical method, intelligent gene expression programming (GEP), and adaptive neuro-fuzzy inference system (ANFIS) methods. Their results corroborate the results of the present study due to the lower accuracy of the HS method in all four climates of Iran. Figure 4 shows the mean of the estimated ET Ref values for the best empirical method (Irmak) and the best regression model (MLR6) in different climates with two criteria NSE and MAE (Fig. 4) Also, Fig. 4 shows that in terms of NSE and MAE criteria, the Irmak empirical method is more accurate in arid and semiarid climates (Bw and Bs) than in humid climates (C and D).
According to the value of the SI criteria, the performance of a model can be divided into four levels: excellent to poor. Figure 5 shows the SI values at different stations for all empirical equations and the MLR models. Also, the MLR6 model had excellent estimates in 10 stations including Bushehr and Yazd stations (Bw); Zahedan, Kerman, and Isfahan (Bs); Shiraz, Gorgan, Birjand, and Qazvin (C); and Zanjan (D) (SI < 0.1) which in this regard had the highest accuracy in different climates. Then, MLR3 model in two stations in Bs climate (Zahedan and Kerman), in three stations in C climate (Gorgan, Birjand, and Qazvin), and one station in D climate (Zanjan) had the best estimate of ET Ref . In other words, the HS equation in four stations of Ahwaz, Isfahan, Birjand, and Rasht had the lowest SI values and ET Ref estimation using this method in these stations was poor (SI > 0.3). In general, Fig. 5 shows that except for the HS equation, the accuracy of estimation of other methods in most of the studied stations concerning SI values was good (0.1 < SI < 0.2). Figure 5 shows that the MLR6 model has the best estimation of ET Ref in almost of the studied station located on different climates. However, in a few stations such as Arak, a model other than MLR6 has had better results in ET Ref estimation based on the scatter index. According to Fig. 5, the Irmak method has been more accurate to estimate ET Ref among the three studied empirical equations in most stations located in different climates. According to Unes et al. (2020) results, radiation-based empirical equations (Turc and Irmak) were found better than other empirical equations, the same with our results about Irmak empirical equation (radiationbased).

Conclusion
Based on the results of the study, the MLR6 model was developed as the best multivariate linear regression model with a minimum error value between ET Ref -PMF56, and ET Ref . The results of the model showed a significant linear relationship between ET Ref as a dependent variable and other meteorological variables as independent variables, because of the MLR6 model. Also, the model had excellent estimates in 10 stations including Bushehr and Yazd stations (Bw), Zahedan, Kerman, and Isfahan (Bs); Shiraz, Gorgan, Birjand, and Qazvin (C); and Zanjan (D) (SI < 0.1), which in this regard had the highest accuracy of estimation in different climates. Therefore, the development of multivariate linear regression models provided the necessary preconditions for evaluating the function of the empirical equations of ET Ref in the study.
Generally, the MLR6 shows efficient results under various climatic conditions. However, intensive data is required for this method and in developing countries; numerous climatic data are not available readily for all the weather stations. Also, the data always lack reliable quality. Certainly, there is a need to develop some approaches that can estimate ET Ref precisely with available limited climatic data. The study demonstrated that all MLR models (even MLR1 with the T min and T max as input data) gave reliable results to estimation of ET Ref in different climates.  Italic values represent significant difference at p < 0.05 * Bw, arid desert; Bs, semiarid; C, humid with mild winters; D, humid with severe winters

Appendix
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.