Introduction

Widespread climatic change in the twenty-first century and its negative impacts on available water resources have become one of the most pressing contemporary environmental issues (Smith et al. 2012; UNEP 1990). Irrigated agriculture is the largest consumer of water in Egypt, accounting for 85% of the country's total available water share of 55.5 billion cubic meters. Water is a scarce resource, and the problem is likely to persist into the future. Drought is defined as a water shortage caused by a disparity between supply and demand (Xu et al. 2020). It arises from large changes in the climate that cause variations in air temperature, relative humidity, and solar radiation (Haskett et al. 2000). These climatic factors change evapotranspiration, which disturbs the hydrological cycle on a global scale. Predicting the reference evapotranspiration (ETo) helps to guide and evaluate the impact of climate change on agriculture and thus on food security (de Oliveira e Lucas et al. 2020). Climate change is also the most important issue in water resources studies (Misra 2014). In climatological studies, temperature, relative humidity, and precipitation are the most important factors for forecasting, making decisions, managing risks, and optimizing the use of water resources (Meshram et al. 2015).

An accurate estimate of ETo is essential for maintaining the hydrological cycle, crop yield simulation, water management, irrigation system design, and irrigation scheduling. Because reference evapotranspiration is difficult to measure directly, it is estimated from meteorological data such as wind speed, solar radiation, humidity, and temperature (Pereira et al. 2015). Indirect methods such as the FAO-56 Penman-Monteith (FAO56-PM) equation (Allen et al. 1998) have been used to estimate ETo. These methods depend on meteorological variables that are sometimes not available at or near the site, especially those needed to solve the aerodynamic term, namely wind speed and the water vapor pressure deficit of the air. Therefore, methods that estimate ETo as a function of fewer climatic elements, such as air temperature and extraterrestrial radiation, are simpler and more feasible (Hargreaves and Samani 1985; Samani 2000), and some have been tested and verified in many studies (Ahooghalandari et al. 2016; Almorox et al. 2016; Valiantzas 2018; Zanetti et al. 2019).

The main challenge in water scarcity research is to develop suitable methods or techniques to predict the factors that are affected by climate change and the consequences on water availability and ETo estimation. ETo is characterized by high nonlinearity and non-stationarity (Hernández et al. 2011), making its daily forecast difficult. However, timeseries analysis is a specific way of analyzing a sequence of datasets collected over a time interval that allows the development of a mathematical model to explain systematic patterns embedded in the data. The most apparent patterns appearing in timeseries data are trends and seasonality (Box et al. 2015). Moreover, the forecast of timeseries depends on the previous data, which are used to create relationships between the data that have continuous observations (Box et al. 2015). However, the abstraction of autocorrelation components from the timeseries data remains a challenge in timeseries analysis techniques (Box 2013). Recently, more effort has been dedicated to using stochastic models in hydrology and climatology (Mossad and Alazba 2016).

Timeseries analysis is a powerful statistical prediction tool that relies on the collection and analysis of past observations of a variable to create a model for future trends. Timeseries models do not presume knowledge of any structural relationships between the variables involved in the studied process, such as evapotranspiration rates and climatic variables. These models are stochastic because an observed timeseries is an actual realization of a stochastic process (Arca et al. 2004). The autoregressive integrated moving average (ARIMA) models (Box et al. 1995) are the most popular timeseries tools. In an ARIMA model, the forecast of a variable is described as a linear (additive) combination of the previous values of the variable (pure autoregressive component) and the previous forecast errors (pure moving average component). ARIMA models are commonly used to forecast ETo (Alireza and Hossein 2015; Arca et al. 2004; Gautam and Sinha 2016).

Recently, data-driven models such as support vector machines (SVMs) have shown high accuracy in estimating and forecasting ETo (Fan et al. 2019). The use of machine learning for ETo estimation has spread because these methods learn the relationships between the inputs, which are mainly meteorological data, and the ETo output, which gives higher accuracy in ETo modeling (Ferreira and Cunha 2020a, b; Kumar et al. 2011). Machine learning methods have been utilized successfully in recent years to estimate ETo with fewer meteorological data. These models can capture complicated interactions between the input and the output data, making them effective ETo modeling tools. Several models have been evaluated, such as artificial neural network (ANN) (Afzaal et al. 2020a, b; Alves et al. 2017a, b; Farooque et al. 2021; Traore et al. 2016), support vector machine (SVM) (Farooque et al. 2021; Ferreira et al. 2019a; Mehdizadeh et al. 2017; Traore et al. 2016), multivariate adaptive regression splines (MARS) (Ferreira et al. 2019b; Mehdizadeh 2018; Wu and Fan 2019), and random forest (RF). Feng et al. (2017) used meteorological parameters to estimate daily ETo with RF and generalized regression neural network (GRNN) methods for southwestern China. Although both methods were found to be acceptable, the RF method was found to be superior. Wang et al. (2019) used meteorological data from the Karst region in southwest China with RF and gene-expression programming (GEP) models to estimate ETo. The results showed that RF-based models performed better, but GEP-based models were recommended because they provide understandable equations and are simpler to use.
Moreover, long short-term memory (LSTM) was applied to assess the potential of machine learning models in forecasting the irrigation water requirements (IWR) of snap beans, using multiple scenarios of input parameters to determine the impact of meteorological, crop, and soil parameters on IWR in Egypt (Mokhtar et al. 2023).

Traditional machine learning models, such as ANN and SVM, have been employed for ETo forecasting as indicated above. In recent years, deep learning models have received a lot of attention and have been applied in a variety of fields, outperforming traditional machine learning models (Alibabaei et al. 2021; Chen et al. 2020; Ferreira and da Cunha 2020a, 2020b; Nagappan et al. 2020; Saggi and Jain 2019; Sattari et al. 2020; Tikhamarine et al. 2019). LSTM in particular has been employed for timeseries forecasting (Afzaal et al. 2020c; Alibabaei et al. 2021; Farooque et al. 2021; Son and Kim 2020; Tian et al. 2018; Zhou et al. 2019). Marndi et al. (2021) applied LSTM to predict rice yield using different input scenarios; the best LSTM model used rainfall as an input variable. Sultana and Khanam (2020) compared the performance of ARIMA and ANN on univariate timeseries data of yearly rice production from 1972 to 2013. In that study, the ARIMA model outperformed the ANN model because the estimated error of the ANN was significantly higher than that of ARIMA.

Some studies predict ETo using expected meteorological data, such as public weather forecasts (Cai et al. 2007; Perera et al. 2014; Traore et al. 2017; Yang et al. 2019). In this study, however, ETo is predicted using past meteorological data. In this way, no external data are needed, and the forecast relies solely on data collected from a local weather station. Given the importance of ETo forecasts, the objective of this study is to assess ETo under arid climate conditions in Egypt at the regional and local scales using ARIMA, RF, and LSTM models. In addition, we identify the appropriate model and input dataset for ETo estimation given a limited number of meteorological variables. This research is critical in determining the best approach (optimal model and input variables) that could serve as a simple, rapid, and inexpensive way to obtain timely and reliable ETo predictions at local and regional scales across Egypt. Furthermore, this paper aims to contribute to water-saving efforts by forecasting ETo across Egypt. To the best of our knowledge, the applied approaches are still poorly investigated for ETo, especially those based on different climate inputs at both regional and local levels. The novel contribution of this work is therefore to develop ARIMA and machine learning models and compare their results. This investigation thus presents a pioneering modeling strategy that could address deficiencies in ETo prediction, improve irrigation scheduling, and give decision-makers better solutions.

Material and methods

Study area and data collection

This study was conducted in four regions in Egypt (Fig. 1). The weather stations within the regions are located as follows: Station 1 is Cairo (30.1162 N, 31.4094 E), Station 2 is Ismailia (30.5567 N, 32.2652 E), Station 3 is Benisouif (28.9082 N, 30.9505 E), and Station 4 is Sohag (26.634 N, 31.6526 E). These regions were chosen to reflect the different climatic conditions of Egypt. Daily data for minimum and maximum temperature (Tmin and Tmax, °C), relative humidity (RH, %), and wind speed (WS, m/s) were obtained from 1982 to 2020. Wind speed measured at 10 m height was converted to 2 m values as presented by Allen et al. (1998). The meteorological data were obtained from two sources: (i) the Egyptian Meteorological Authority and (ii) NASA (2015) gridded daily data over the study regions. The daily reference evapotranspiration (ETo) was determined using the FAO ETo Calculator software (version 3.2) based on the standard FAO Penman-Monteith method described by Allen et al. (1998).
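The height conversion mentioned above can be sketched as follows. This is a minimal illustration that assumes the logarithmic wind profile relation of FAO-56 (Allen et al. 1998, Eq. 47); the function name `wind_speed_at_2m` is hypothetical and is not taken from the study's code.

```python
import math


def wind_speed_at_2m(u_z: float, z: float = 10.0) -> float:
    """Convert wind speed u_z (m/s) measured at height z (m) to the 2 m value.

    Assumes the FAO-56 logarithmic wind profile:
        u2 = u_z * 4.87 / ln(67.8 * z - 5.42)
    """
    arg = 67.8 * z - 5.42
    if arg <= 1.0:
        # The logarithm would be zero or negative, so the relation is not usable.
        raise ValueError("measurement height z is too low for this relation")
    return u_z * 4.87 / math.log(arg)


# A 10 m reading is scaled by roughly 0.75 to obtain the 2 m value.
factor_10m = wind_speed_at_2m(1.0, z=10.0)
```

For a sensor at 10 m the multiplier is about 0.75, so wind speeds used in the ETo calculation are lower than the raw station readings.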

Fig. 1 The location of the weather stations

With the LSTM and RF models, four combinations of input sets of (i) Tmin, Tmax, WS, and RH, (ii) Tmin and Tmax, (iii) Tmin, Tmax, and WS, and (iv) Tmin, Tmax, and RH were implemented as presented in Table 1. With the ARIMA model, the input was a relation between the date and the ETo calculated from the four collected weather parameters (Tmin, Tmax, WS, and RH) using the ETo Calculator. Moreover, the collected data were divided into three distinct timeframes for calibration/training (1982–2007), cross-validation (2008–2011), and validation (2012–2020) for forecasting ETo, which was compared to the FAO 56 PM observations at the weather station. A flowchart of the machine learning models and ARIMA model used in the study is shown in Fig. 2. The LSTM and RF models implemented in Python 3.8 and the ARIMA model implemented in MATLAB (2021a) were used.
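The four input sets and the three time periods described above can be written down as a small sketch. The record layout (a dictionary with a `date` key and one key per variable) and the helper names are assumptions made for illustration; the study's actual data handling is not described in this much detail.

```python
from datetime import date

# Input combinations (i)-(iv) used with the LSTM and RF models.
INPUT_SETS = {
    1: ("Tmin", "Tmax", "WS", "RH"),
    2: ("Tmin", "Tmax"),
    3: ("Tmin", "Tmax", "WS"),
    4: ("Tmin", "Tmax", "RH"),
}


def period_of(day: date) -> str:
    """Map a date to the period it belongs to in the study design."""
    if 1982 <= day.year <= 2007:
        return "training"
    if 2008 <= day.year <= 2011:
        return "cross_validation"
    if 2012 <= day.year <= 2020:
        return "validation"
    raise ValueError(f"{day} is outside the 1982-2020 study period")


def split_by_period(records):
    """Group daily records (dicts with a 'date' key) by period."""
    parts = {"training": [], "cross_validation": [], "validation": []}
    for rec in records:
        parts[period_of(rec["date"])].append(rec)
    return parts


def select_inputs(rec, set_id):
    """Return the feature values of one record for a given input set."""
    return [rec[name] for name in INPUT_SETS[set_id]]
```

Splitting by calendar year rather than at random keeps the validation period strictly after the training period, which is what a forecasting evaluation requires.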

Table 1 Combinations of the meteorological data used for the machine learning models

Fig. 2 Computational flowchart adopted for the ETo forecast: (a) machine learning models and (b) ARIMA model

Machine learning models

Long short-term memory

Long short-term memory (LSTM) models were introduced by Hochreiter and Schmidhuber (1997), and they are a type of recurrent neural network that can learn data dependencies over time. This is possible because the recurring module of the models is made up of four layers that interact with one another. To address short-term memory problems, LSTM blocks combine an input gate, an output gate (ot), and a forget gate (ft) with a hidden state (ht) and a cell state (Ct), all driven by the current input (Xt). The forget gate determines which information should be removed from the cell state, resulting in the (ft) value; in this way, it discards irrelevant information. The input gate determines which information in the cell state should be updated. The output gate is in charge of producing (ot), which is used to compute the hidden state (ht) from a filtered version of the cell state. Within the block, sigmoid functions scale the gate values and tanh functions scale the candidate values for further processing. The cell state (Ct) is updated using the element-wise product of the tanh and sigmoid outputs. Figure 3 depicts a more detailed overview of the LSTM information flow memory block.

Fig. 3 Long short-term memory (LSTM) neural network memory block
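For reference, the information flow described above is commonly written as the following set of equations. This is the standard LSTM formulation rather than a transcription of the study's implementation; the weight matrices \(W\), the biases \(b\), the input-gate symbol \(i_{t}\), and the candidate value \(\tilde{C}_{t}\) are notation introduced here, \(\sigma\) is the sigmoid function, and \(\odot\) is the element-wise product.

$$ f_{t} = \sigma \left( W_{f} [h_{t - 1} , X_{t} ] + b_{f} \right),\quad i_{t} = \sigma \left( W_{i} [h_{t - 1} , X_{t} ] + b_{i} \right),\quad o_{t} = \sigma \left( W_{o} [h_{t - 1} , X_{t} ] + b_{o} \right) $$

$$ \tilde{C}_{t} = \tanh \left( W_{C} [h_{t - 1} , X_{t} ] + b_{C} \right),\quad C_{t} = f_{t} \odot C_{t - 1} + i_{t} \odot \tilde{C}_{t} ,\quad h_{t} = o_{t} \odot \tanh \left( C_{t} \right) $$

Here \(h_{t}\) and \(C_{t}\) are the hidden and cell quantities of the memory block, and \([h_{t - 1} , X_{t} ]\) denotes the previous hidden value concatenated with the current input.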

Random forest (RF)

RF is a supervised algorithm, and it is one of the most widely used algorithms for both regression and classification. It is an improvement on the decision tree algorithm in that decision trees have a massive limitation of overfitting. In this algorithm, decision trees are fitted in different subsets of the training data. The number of trees in the algorithm has a direct relationship with the results it can achieve and has to be optimized. Although each individual tree is a weak learner (Fig. 4), RF combines the predictions of all trees (ensemble), resulting in a powerful model (Huang et al. 2019). This model has the advantage of requiring less hyperparameter adjustment in general. More information on RF can be found in Tyralis et al. (2019).

Fig. 4 The RF algorithm

A tree is created by selecting a random collection of variables that will be utilized to decide the outcome of the forecast. Two crucial parameters in the RF training process are the number of trees (ntree) and the number of variables available for selection in each split (mtry) (Houborg and McCabe 2018). The RF approach is made up of groups of the classification tree or the regression tree, depending on the situation. Repeated runs of the RF algorithm achieve the best model setup (García-Peñalvo et al. 2018).
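The role of ntree and mtry can be sketched with scikit-learn, where they correspond to the `n_estimators` and `max_features` arguments of `RandomForestRegressor`. The library choice, the synthetic data, the target relation, and the parameter values below are all assumptions for illustration; the source states only that the models were implemented in Python 3.8.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for daily inputs (Tmin, Tmax, WS, RH); NOT the study's data.
X = rng.uniform([5.0, 20.0, 1.0, 20.0], [25.0, 45.0, 6.0, 80.0], size=(500, 4))
# Hypothetical target with noise, used only so the example has something to fit.
y = 0.1 * X[:, 1] + 0.3 * X[:, 2] - 0.02 * X[:, 3] + rng.normal(0.0, 0.1, 500)

X_train, y_train = X[:400], y[:400]
X_test, y_test = X[400:], y[400:]

model = RandomForestRegressor(
    n_estimators=200,  # ntree: number of trees in the ensemble
    max_features=2,    # mtry: variables considered at each split
    random_state=0,
)
model.fit(X_train, y_train)

pred = model.predict(X_test)  # average of the individual tree predictions
rmse = float(np.sqrt(np.mean((pred - y_test) ** 2)))
```

In practice ntree and mtry would be tuned on the cross-validation period rather than fixed in advance as they are here.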

The ARIMA model

The ARIMA model is based on the Box–Jenkins methodology, which is fitted to a given timeseries, and it is parsimonious in terms of the number of model parameters. It depends on a three-step iterative process of model identification, estimation, and diagnostic checking to define the best parsimonious model. This three-step process is repeated several times until a satisfactory model is finally selected. Finally, this model is used to forecast future values of the timeseries (Box et al. 2015).

Model identification

The AR (autoregressive) part of the ARIMA model shows that the variable of interest is regressed on its own prior values. The MA (moving average) part shows that the regression error is a linear combination of error values occurring at various times in the past. The I (integrated) part shows the number of times differencing has been performed. The objective of combining adequate AR, I, and MA terms is to produce the most parsimonious model that fits the timeseries data (Box 2013). The model assumes the data to be a non-seasonal timeseries, and therefore, the data need to be de-seasonalized before modeling. A non-seasonal ARIMA model is generally denoted as ARIMA (p, d, q), where p is the order of AR, d is the order of differencing, and q is the order of MA. The ARIMA methodology is limited by its reliance on past values, and it works best for long and stable timeseries (Box 2013; Marco et al. 2012). The non-seasonal part of the ARIMA model can be expressed as:

$$ \phi (B)\nabla^{d} Z_{t} = \theta (B)a_{t} $$
(1)

where \(\phi (B)\) and \(\theta (B)\) are polynomials of order p and q, respectively.
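Written out under the usual Box–Jenkins sign convention, with \(B\) the backshift operator (\(BZ_{t} = Z_{t - 1}\)), \(\nabla = 1 - B\) the differencing operator, and \(a_{t}\) the white-noise error, the terms of Eq. (1) are as follows. The expansion is given for orientation only; some software packages use the opposite sign for the moving average coefficients.

$$ \phi (B) = 1 - \phi_{1} B - \phi_{2} B^{2} - \cdots - \phi_{p} B^{p} ,\quad \theta (B) = 1 - \theta_{1} B - \theta_{2} B^{2} - \cdots - \theta_{q} B^{q} ,\quad \nabla^{d} = (1 - B)^{d} $$

As a worked instance, an ARIMA (2,1,4) model reads:

$$ \left( {1 - \phi_{1} B - \phi_{2} B^{2} } \right)(1 - B)Z_{t} = \left( {1 - \theta_{1} B - \theta_{2} B^{2} - \theta_{3} B^{3} - \theta_{4} B^{4} } \right)a_{t} $$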

Model estimation

After the appropriate model has been identified in the first step, the model parameters have to be estimated. The parameters of the AR and MA parts were calculated using the procedure suggested by Box (2013) and should be tested for statistical significance.

Diagnostic checking

After the estimation of the model parameters, diagnostic checking has to be performed to verify the adequacy of the model. Several diagnostic statistics and plots of residuals were investigated to ascertain whether the residuals are correlated or white noise. The residuals analysis involves checking the autocorrelation function (ACF), the partial autocorrelation function (PACF), and the histograms of the residuals and the residual distribution around the mean (Box 2013; Mossad and Alazba 2016).

Performance evaluation of the applied models

The models were assessed for each meteorological station using the performance statistics of the mean absolute error (MAE), the Nash–Sutcliffe efficiency coefficient (NSE), the coefficient of determination (R2), and the root-mean-square error (RMSE), expressed as:

$$ {\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {O_{i} - P_{i} } \right|} $$
(2)
$$ {\text{RMSE}} = \sqrt {\frac{1}{n}\sum {\left( {P_{i} - O_{i} } \right)}^{2} } $$
(3)
$$ {\text{NSE}} = 1 - \frac{\sum\limits_{i = 1}^{n} \left( P_{i} - O_{i} \right)^{2}}{\sum\limits_{i = 1}^{n} \left( \overline{O} - O_{i} \right)^{2}} $$
(4)
$$ R^{2} = \left[ {\frac{{\sum\limits_{i = 1}^{n} {(O_{i} - \overline{O})(P_{i} - \overline{P})} }}{{\sqrt {\left( {\sum\limits_{i = 1}^{n} {(O_{i} - \overline{O}_{{}} )}^{2} } \right)\left( {\sum\limits_{i = 1}^{n} {(P_{i} - \overline{P})^{2} } } \right)} }}} \right]^{2} $$
(5)

where Oi and Pi are the observed and the predicted values, respectively, at time i, and n is the number of observations. \(\overline{O}\) and \(\overline{P}\) represent the average values of the observed and the predicted values, respectively.
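The four statistics can be computed directly from paired observed and predicted series. The sketch below uses their standard definitions with plain Python lists; the function names are illustrative and are not taken from the study's code.

```python
import math


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root-mean-square error."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 minus squared error over observed variance."""
    o_mean = sum(obs) / len(obs)
    num = sum((p - o) ** 2 for o, p in zip(obs, pred))
    den = sum((o_mean - o) ** 2 for o in obs)
    return 1.0 - num / den


def r2(obs, pred):
    """Coefficient of determination as the squared Pearson correlation."""
    o_mean = sum(obs) / len(obs)
    p_mean = sum(pred) / len(pred)
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(obs, pred))
    var_o = sum((o - o_mean) ** 2 for o in obs)
    var_p = sum((p - p_mean) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)
```

NSE and R2 coincide only when the predictions are unbiased, which is why the two can differ for the same model.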

Results and discussion

Basic statistics

Table 2 shows the mean and standard deviation of the climatic variables from 1982 to 2020 at the four weather stations. Across the stations, the mean maximum air temperature ranged from 28.4 to 30.9 °C with a standard deviation of 7.1–7.7 °C, and the mean minimum air temperature ranged from 14.8 to 15.7 °C with a standard deviation of 5.3–6.7 °C. The mean wind speed was 4.9–8.7 m/s with a standard deviation of 1.5–2.6 m/s, and the mean relative humidity was 30.1–54.1% with a standard deviation of 10.7–15.5%. The mean ETo computed with the FAO-56 method ranged from 4.33 to 4.82 mm/day with a standard deviation of 1.6–1.8 mm/day.

Table 2 The mean and the standard deviation values of the daily climate variables during the period from 1982 to 2020 for the four weather stations

The machine learning models

RF model

The results obtained with the RF model using the four input combinations datasets of (1) Tmin, Tmax, RH, WS; (2) Tmin, Tmax; (3) Tmin, Tmax, WS; and (4) Tmin, Tmax, RH, are presented in Fig. 5 for the regional scale. Using all input variables (1) yielded the best result having R2 = 0.85, RMSE = 0.69 mm/day, MAE = 0.51 mm/day, and NSE = 0.85, while the second set of input combinations was the worst performer with R2 = 0.80, RMSE = 0.80 mm/day, MAE = 0.62 mm/day, and NSE = 0.80 but its results are still considered as good.

Fig. 5 Performance statistics of LSTM and RF models based on the four input datasets

At the local scale, we trained and tested the RF model for each weather station to investigate how the climate variables impact evapotranspiration in the different climatic regions. Table 3 shows the heatmap of R2, RMSE, NSE, and MAE indices over the four weather stations. Of the four weather stations, the R2 of the input dataset 1 (4 inputs) ranged between 0.92 and 0.95, and the range for the RMSE was 0.38–0.46 mm/day. For the MAE, the range was 0.28–0.33 mm/day and for the NSE it was between 0.92 and 0.95. As was the case for the regional scale, the second input combination dataset recorded the lowest values of R2 (0.86–0.93) and NSE (0.86–0.93) and the highest values for RMSE (0.50–0.58 mm/day) and MAE (0.37–0.44 mm/day). Thus, for both the local and regional scale, the first input combination dataset (Tmin, Tmax, RH, WS) yielded the best results, stemming from the fact that it used the maximum number of input variables which increases the accuracy of evaporation prediction.

Table 3 Heatmap for the performance of the models (LSTM and RF) based on the four input combinations datasets for the four weather stations at the local scale

LSTM model

Figure 5 also shows the results of the LSTM model using the different input data combinations for the regional scale. The best result was obtained with the first input combination dataset (four input variables), yielding R2 = 0.86, RMSE = 0.68 mm/day, MAE = 0.52 mm/day, and NSE = 0.86. With this model too, the second input dataset exhibited the worst performance, with R2 = 0.84, RMSE = 0.73 mm/day, MAE = 0.5 mm/day, and NSE = 0.84. For the four weather stations (local scale) run with input combination dataset 1, R2 ranged from 0.92 to 0.95, RMSE from 0.37 to 0.43 mm/day, MAE from 0.28 to 0.33 mm/day, and NSE from 0.92 to 0.95. Here too, the second input combination dataset yielded the lowest values of R2 (0.87–0.93) and NSE (0.87–0.93) and the highest values of RMSE (0.47–0.56 mm/day) and MAE (0.38–0.47 mm/day) for all stations.

Overall evaluation

To compare LSTM and RF models, a boxplot was made based on the residuals (Fig. 6). For the regional scale, the two best input combination datasets with the LSTM and the RF models were selected to determine the best input combination dataset and model pair (Fig. 6a). As shown in Fig. 6a, the inter-quartile range (IQR) values of RF1, RF4, LSTM1, LSTM4 were 0.756, 0.801, 0.870, and 0.885, respectively. Furthermore, the RF model with input combination dataset 1 appears to be the best model having the lowest error in comparison with the other pairs. For the local scale (Fig. 6b), the IQR values of RF1-St1, RF1-St2, RF1-St3, and RF1-St4 were 0.442, 0.4, 0.447, and 0.409, respectively, while the corresponding values of RF4-St1, RF4-St2, RF4-St3, and RF4-St4 were 0.445, 0.520, 0.413, and 0.433, respectively. The LSTM1 was the best for all stations. The Q1 value of LSTM1-St1 was −0.149, while LSTM4-St3 was −0.205 (Fig. 6c). Moreover, a smaller IQR by LSTM1-St1 of 0.452 clearly shows that the distribution error of LSTM1-St1 is much better than the others because of the higher concentration around the mean.

Fig. 6 Boxplots showing the distribution of the estimation errors at the testing stage for the four input datasets for (a) regional scale, (b) RF for each station (local scale), and (c) LSTM for each station. Q1: lower quartile of errors, Q3: upper quartile of errors, IQR: interquartile range for each model
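The quartile statistics behind such boxplots can be sketched as follows. Two choices here are assumptions, since the text does not state them: the residual is taken as predicted minus observed, and quartiles use the "inclusive" interpolation rule of Python's `statistics.quantiles`.

```python
import statistics


def residual_quartiles(obs, pred):
    """Return (Q1, Q3, IQR) of the residuals pred - obs."""
    residuals = [p - o for o, p in zip(obs, pred)]
    q1, _median, q3 = statistics.quantiles(residuals, n=4, method="inclusive")
    return q1, q3, q3 - q1


# Toy values for illustration only; not results from the study.
q1, q3, iqr = residual_quartiles([0, 0, 0, 0, 0], [1, 2, 3, 4, 5])
```

A smaller IQR means the middle half of the errors is packed into a narrower band, which is the sense in which one model's error distribution is described as better than another's.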

The RF model performed well at both the local and the regional scales. Jeong et al. (2016) reported that the RF model may overfit the data because its algorithm consists of an ensemble of a large number of decision trees that may not be fully described mechanistically. RF may also lose accuracy when extreme values are expected or when responses fall outside the limits of the training dataset (Jeong et al. 2016). Our RF results for ETo forecasting were better, based on RMSE, than those obtained by Son and Kim (2020b). They were also better than the ETo forecasts based on ANN and SVR models in Iran (Maroufpoor et al. 2020), mainly because the input datasets are more closely related to ETo in our study area. Wu et al. (2019), who used the extreme learning machine model to forecast ETo in eight provinces in China, achieved better results (R2 = 0.99 and RMSE = 0.15 mm/day) than those obtained in this study; their work was conducted at the local scale with a shorter timeseries, which reduces the variation in the data and results in high model performance. Ferreira et al. (2019a) reported performance improvements when relative humidity was added to temperature in machine learning models developed for Brazil. Our study gave better results when wind speed was considered in addition to relative humidity and temperature. Furthermore, Afzaal et al. (2020b) reported performance improvements when relative humidity was added to temperature in the LSTM model developed for Canada. Moreover, Barzegar et al. (2020), Ferreira and da Cunha (2020c), Kim and Cho (2019), and Landeras et al. (2009) reported better performance of deep learning over traditional machine learning models. Marndi et al. (2021) showed that LSTM was a good model for rice production in India. Mokhtar et al. (2023) used the LSTM and RF models to predict the irrigation water requirements of the green bean crop in Egypt through an actual field experiment and obtained satisfactory results.

In the present study, however, the deep learning model provided marginally better results. Nevertheless, deep learning models generally have more hyperparameters to be adjusted, requiring more time for training. By contrast, Landeras et al. (2009) found that forecasting weekly ETo with ARIMA and ANN reduced RMSE with respect to weekly historical means by only 6–8%. ETo forecasting is a complex task because it is affected by several meteorological variables, which can vary widely from day to day.

ARIMA model

Identification of AR(p) and MA(q) components

The generated timeseries for daily ETo is illustrated in Fig. 7 for the four selected weather stations from 1982 to 2011. Figure 7 shows a nonlinear and a seasonal component in the original timeseries data, exhibiting an annual cycle. Therefore, a non-Gaussian is often used to evaluate the effectiveness of nonlinear models. Figure 7 also shows that there are no abnormal fluctuating trends in the timeseries. The same is observed by inspecting the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the original data displayed in Fig. 8. Because the data show non-stationarity, differencing was applied to make the timeseries stationary (Hyndman and Athanasopoulos 2018), and the ACF and PACF of the transformed data are plotted in Fig. 9. Correlations that fall outside the blue lines are significantly different from zero. For the regional scale, the average daily ETo for the four stations was used for estimating the ACF and PACF in Figs. 8 and 9.
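The differencing step and the sample ACF with its approximate 5% significance limits (the blue lines in such plots) can be sketched in plain Python. The synthetic series, the helper names, and the large-sample limit of 1.96/sqrt(n) are assumptions for illustration; the study itself carried out this analysis in MATLAB.

```python
import math


def difference(series, d=1):
    """Apply first-order differencing d times."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(v * v for v in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def significance_limit(n):
    """Approximate 5% limit for the autocorrelation of white noise."""
    return 1.96 / math.sqrt(n)


# Synthetic daily series with a weak trend and an annual cycle (not study data).
series = [4.5 + 0.001 * t + 2.0 * math.sin(2 * math.pi * t / 365.0) for t in range(730)]
diffed = difference(series, d=1)
rho = acf(diffed, max_lag=5)
limit = significance_limit(len(diffed))
significant_lags = [k for k in range(1, 6) if abs(rho[k]) > limit]
```

Lags whose autocorrelation exceeds the limit in absolute value are the ones that suggest candidate AR and MA orders.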

Fig. 7 Timeseries of daily ETo of the four selected study weather stations

Fig. 8 Autocorrelation function (ACF) and partial autocorrelation function (PACF) for the candidate ARIMA model with 5% significance limits at the local (four weather stations) and regional scales

Fig. 9 Autocorrelation function (ACF) and partial autocorrelation function (PACF) for the candidate ARIMA model after differencing with 5% significance limits at the local and regional scales

Estimation of the appropriate p, d, q values

Model selection can be based on the maximum likelihood used in the parameter estimation, as explained by Box (2013). The most appropriate model structure was selected through two information criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Generally, the model with the lowest AIC and BIC values is considered the best, as explained by Ampaw et al. (2013), Hyndman and Athanasopoulos (2018), and Marco et al. (2012).
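Both criteria penalize the maximized log-likelihood by the number of estimated parameters, which the following sketch makes explicit. The candidate orders, log-likelihood values, and sample size are made-up numbers for illustration and are not results from Table 4; counting the parameters as p + q + 1 (AR terms, MA terms, and the innovation variance) is also an assumption.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


def n_params(order):
    """Assumed parameter count for ARIMA(p, d, q): p + q + 1."""
    p, _d, q = order
    return p + q + 1


# Hypothetical maximized log-likelihoods for three candidate orders.
candidates = {(1, 1, 1): -1080.3, (2, 1, 2): -1052.4, (3, 1, 3): -1049.8}
n_obs = 1000

scores = {
    order: (aic(ll, n_params(order)), bic(ll, n_params(order), n_obs))
    for order, ll in candidates.items()
}
best_by_aic = min(scores, key=lambda order: scores[order][0])
best_by_bic = min(scores, key=lambda order: scores[order][1])
```

With these toy numbers the two criteria disagree: AIC selects the larger (3,1,3) model while BIC, whose penalty grows with the sample size, selects (2,1,2).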

According to the results shown in Table 4, the ARIMA (2,1,4) for station 1, ARIMA (2,1,3) for station 2, ARIMA (2,1,4) for station 3, and ARIMA (2,1,3) for station 4 models were identified as the best-fitted models among the 16 ARIMA models tested at the local scale. For the regional scale, only ARIMA (2,1,4) and ARIMA (2,1,3) were tested as they were the best-fitted models for the local scale yielding the minimum AIC and BIC values. The model with the smallest AIC value yields residuals that resemble white noise (Mossad and Alazba 2016), a confirmation of the appropriateness of the selected ARIMA model (Ord et al. 2017).

Table 4 The goodness of fit for the ARIMA (p,d,q) models

Estimation of the best ARIMA model

The selected ARIMA models needed to be validated to check their appropriateness, an essential last step before using the selected model for forecasting. The diagnostic checking step assures the reliability of the chosen models (Dimri et al. 2020; Mossad and Alazba 2016). One convenient way to validate the model is the graphical technique. Hence, several validation plots were investigated to check whether the residuals were white noise. Figure 10 shows the estimated ACF and PACF of the residuals for the candidate model at various lags, with 95% confidence limits, at the local and regional scales. Most of the ACF and PACF values were not significantly different from zero because they lie within the confidence limits. Therefore, there is no significant correlation between the residuals.

Fig. 10 Autocorrelation function (ACF) and partial autocorrelation function (PACF) of the residuals for the best selected ARIMA models with 5% significance limits at the local and regional scales

Model forecasting

Forecasting helps to predict future uncertainty based on the behavior of past and current observations. It was done using the best-fitted ARIMA models. Data from 2012 to 2020 (2987) were used in the forecast to ascertain the validity of the developed models. The performance statistics (R2, RMSE, MAE, NSE) were used to evaluate the agreement between the predicted and the observed timeseries at the local and regional scales (Table 5). For the four weather stations (the local scale), R2 ranged from 0.86 to 0.92, RMSE from 0.61 to 0.90 mm/day, MAE from 0.45 to 0.76 mm/day, and NSE from 0.80 to 0.90. For the regional scale, the performance statistics were R2 = 0.90, RMSE = 0.58 mm/day, MAE = 0.42 mm/day, and NSE = 0.93. Sultana and Khanam (2020) forecasted the production of rice in Bangladesh using ARIMA and ANN. Their results indicated that the ARIMA model outperformed the ANN model for predicting rice production based on the RMSE, MAE, and MAPE values. The ARIMA model can be a valuable tool for forecasting daily reference evapotranspiration. Accurate ETo forecasts are crucial for irrigation scheduling, drought monitoring, and water allocation. Several studies have successfully used ARIMA models to forecast ETo and climate variables in many countries, such as Saudi Arabia (Mossad and Alazba 2016), India (Dimri et al. 2020), Colombia (Martínez-Acosta et al. 2020), China (Chen et al. 2018), and Poland (Murat et al. 2018). These studies note that the performance of the ARIMA model may vary depending on the specific dataset, geographical location, and climatic conditions, so the appropriate ARIMA model changes with each climate zone. Nevertheless, they state that ARIMA models could be promising tools for forecasting ETo and climate variables.

Table 5 Performance statistics values at the local and regional scales

Conclusion

This study focused on modeling and forecasting the daily reference evapotranspiration (ETo) using two machine learning models [random forest (RF) and long short-term memory (LSTM)] and autoregressive integrated moving average (ARIMA) models. Four combinations of climatic input data were used with the machine learning models. The results of this study could assist policy and decision-makers in developing water resources strategies in Egypt, and in any arid region for that matter, which are very important nowadays due to water shortages resulting from climate change. The first part of the study focused on the possibility of forecasting ETo using the RF and LSTM models. The results showed that RF1 and LSTM1, i.e., the models using all the climatic input data, have the lowest error at both the regional and local scales. Further, RF1, having the lowest error, shows a much better error distribution than the others. The second part of the study focused on the possibility of forecasting ETo using the ARIMA modeling approach. Different ARIMA model structures were therefore proposed based on correlation methods (ACF and PACF) for the four stations in Egypt. The best ARIMA model structure was selected according to the lowest AIC and BIC values, and the selected models were ARIMA (2,1,4) for station 1, ARIMA (2,1,3) for station 2, ARIMA (2,1,4) for station 3, and ARIMA (2,1,3) for station 4. In addition, a high correlation was noticed between the four weather stations, and the ARIMA (2,1,4) model appeared to be a reasonable predictor of ETo for the combined data of the four stations, which constitute the regional data. These results are promising, and the proposed ARIMA model structure, RF, and LSTM could be considered for forecasting daily ETo within the study regions.