A novel application of transformer neural network (TNN) for estimating pan evaporation rate

For decision-making in farming, the operation of dams and irrigation systems, and other fields of water resource management and hydrology, evaporation, a key component of the global hydrological cycle, requires efficient techniques for measuring its variation. The main challenge in creating accurate and dependable predictive models is the non-stationary, nonlinear, and stochastic character of the evaporation process. This work examines, for the first time, a transformer-based deep learning architecture for evaporation prediction in four different Malaysian regions. The effectiveness of the proposed deep learning (DL) model, denoted TNN, is evaluated against two competitive reference DL models, namely Convolutional Neural Network and Long Short-Term Memory, with regard to various statistical indices, using the monthly-scale dataset collected from four Malaysian meteorological stations over the 2000–2019 period. Using a variety of input variable combinations, the impact of each meteorological variable on the Ep forecast is also examined. The performance assessment metrics demonstrate that, compared to the other benchmark frameworks examined in this work, the developed TNN technique was more precise in modelling monthly water loss owing to evaporation. In terms of predictive effectiveness, the proposed TNN model, enhanced with the self-attention mechanism, outperforms the benchmark models, demonstrating its potential for evaporation forecasting. In terms of application, the predictive model created for Ep projection offers a precise estimate of water loss due to evaporation and can thus be used in irrigation management, irrigation-based agricultural planning, and the reduction of fiscal and economic losses in farming and related industries where consistent monitoring and estimation of water are considered necessary for viable livelihoods and economies.


Introduction
Background
Evaporation, which converts liquid water at the earth's surface into vapour, is a crucial step in the hydrological cycle. Greater evaporation rates are a key marker of global warming (Yizhong Chen et al. 2018). Evaporation also results in considerable water loss, which affects lake and reservoir water levels as well as the water budget. Consequently, accurate measurement and estimation of evaporative water loss are crucial for effective management of water resources (Abtew & Melesse 2012). Both direct and indirect methods are used to estimate evaporation, including the evaporation pan, water balance, Penman method, energy balance, and mass transfer (L. Wu et al. 2020a). The evaporation pan technique is the most widely used since it is comparatively easy and inexpensive. The current work attempts to estimate pan evaporation (E p ) with an accuracy comparable to real evaporation, given the demonstration by Kahler & Brutsaert (2006) that the pan evaporation technique provides a precise measure of the real changes in evaporation. For E p estimation, techniques based on meteorological datasets, such as empirical evaporation equations, the energy budget, and the water budget, have been used (L. Wang et al. 2016). Because the intricate stochastic characteristics of the evaporative process are not sufficiently represented by linear modelling, the prediction errors of these techniques can be rather substantial (M. Abed et al. 2021). Furthermore, empirical models must have their coefficients calibrated before being applied to different agroclimatic zones because they behave differently under different conditions.

Literature review
Scientists have concentrated their efforts on machine learning approaches to estimate evaporative losses because of the low performance and practical difficulties of conceptual and measurement-based techniques. These Artificial Intelligence (AI) systems are easier to use, more reliable, and capable of accurately simulating intricate nonlinear processes (M. M. Abed et al. 2010; Kişi 2009; Sudheer et al. 2002). Numerous studies have been conducted on utilising AI to estimate various hydrological factors (Ashrafzadeh et al. 2019). Researchers suggest that ANN frameworks offer more accurate projections than traditional approaches (Ditthakit et al. 2022; Pham et al. 2022). As a result, AI-based modelling techniques have been effectively applied in a variety of engineering research fields. When comparing the Box & Jenkins approach with ANN, for instance, Castellano-Méndez et al. (2004) found that the latter offers more accurate runoff simulation. For estimating pan evaporation, numerous studies have also been carried out using ML techniques with various optimisation schemes (Ashrafzadeh et al. 2020; Malik et al. 2020). Goyal et al. (2014) tested LSSVR, Fuzzy Logic (FL), ANN, and ANFIS strategies for projecting daily E p , and the results were compared to those of the Stephens-Stewart (SS) and Hargreaves-Samani (HGS) empirical methods. The results demonstrated that the LSSVR and FL approaches are more effective than conventional methods for estimating daily evaporation. To calculate pan evaporation over monthly timeframes, Kişi (2013) developed evolutionary neural networks. The findings showed that the models were more accurate than empirical techniques. In the study of Deo et al. (2016) on monthly water loss from evaporation, meteorological factors were used as predictor variables, and RVM was found to be the most successful strategy. According to Sudheer et al. (2002), ANN approaches could be used to predict evaporation using weather data. They developed an ANN technique for modelling daily evaporation. Falamarzi et al. (2014) examined the application of wavelet ANN and ANN for daily evaporation forecasting. They used measurements of wind speed and temperature as model predictors. The outcomes showed that the two techniques provided accurate evaporation estimates. These shallow learning techniques have proven successful at forecasting E p for a variety of climatic situations. However, Deep Learning (DL) algorithm-based modelling has become increasingly popular in many engineering research disciplines to produce predictions that are more precise and trustworthy (Yunzhi Chen et al. 2022).
Since deep learning (DL) approaches, which use improved multi-layered neural networks, are attractive for time series applications, they may open new possibilities for E p estimation. They are gaining popularity among the artificial intelligence techniques used in both commercial and scientific fields owing to their increased precision (Hu et al. 2018). Recurrent neural networks (RNN), which form the foundation of DL approaches, are well suited to estimating and projecting time series data because of their capacity to maintain and use memory of past network states (Chang et al. 2016; Daliakopoulos et al. 2005). Although typical RNN structures can capture the patterns of time series data, they struggle to maintain the variables' longer-term dependence and suffer from vanishing and exploding gradients (Bengio et al. 1994). Because of these two fundamental flaws, training a typical RNN might result in unrealistic network weights that are either close to zero or too large. Practically speaking, remembering vital information and discarding redundant or unneeded information across network states are the two key factors for improved network training. Long Short-Term Memory (LSTM), an enhanced class of conventional RNN architectures, has been developed as a potent algorithm capable of overcoming the training shortcomings of RNNs (vanishing and exploding gradient issues) by retaining important information while preventing needless information from being passed to the following states.
LSTM has been effectively used in research on natural language processing (NLP), financial time series forecasting, travelling period predictions, traffic congestion, and many other areas. Despite its wide applicability in a variety of research domains, LSTM approaches have lately been employed in hydrologic time series forecasting (Hu et al. 2018). Zhang et al. 2018 employed an LSTM method for predicting water tables in rural areas. Moreover, the authors compared the outcome scheme using LSTM techniques with that of a standard ANN and noted that the former approach outperforms the ANN. Research was conducted by Majhi et al. 2020 to forecast evaporation with use of LSTM-based models. In this research, the LSTM-based approach was contrasted against Multilayer-ANN as well as empirical techniques such as Blaney-Criddle and Hargreaves, to demonstrate its superiority in predicting daily evaporative losses over selected benchmark schemes. Convolutional Neural Network (CNN), an alternative and powerful deep learning technique, has recently drawn widespread attention as a result of its varied application in a range of fields, including object recognition processing (Krizhevsky et al. 2017), time series categorisation (Z. Wang et al. 2017b), robotic haptic and visual data classification (Gao et al. 2016), weather forecasting (Liu et al. 2014), and audio signal classification (Lee et al. 2009). For instance, Ferreira and da Cunha (Ferreira & da Cunha 2020) examined one-dimensional Convolutional Neural Network (1D-CNN) plus a combination of LSTM-CNN, LSTM, as well as ML strategies (ANN and RF), for application in predicting multi step-ahead daily E p with use of data from weather stations located in Brazil. They established that the developed DL approaches achieved relatively better results than ML strategies. 
It is notable that numerous researchers have used CNN in several time series forecasting fields such as electrical load estimations, solar energy forecasting, and other modelling scenarios. CNN has largely shown performance superior to that of conventional machine learning models across many studies and achieves state-of-the-art performance in most cases.
Recently, attention-based models have been employed in time series forecasting with effectiveness. Transformer architecture is derived purely from self-attention (intra-attention) mechanisms and the approach has recently become more popular. Transformers were first used in machine translation applications by (Vaswani et al. 2017) and demonstrated an exceptional capability for generalising other key tasks, including sequence modelling and computer vision. In contrast to recurrent networks, a transformer has no vanishing gradient problem and can access all points in the past irrespective of distance. This feature enables the transformer to find long-running dependencies. Unlike with recurrent networks, a transformer forgoes sequential computation and so can run completely in parallel and at higher speeds. In sum, transformer mechanisms do not analyse inputs in a sequential manner for the architecture relies on a self-attention mechanism that overcomes certain issues in recurrent and convolutional sequence-to-sequence modelling. The transformer has been successfully employed in various tasks related to time series forecasting and it outperforms numerous forecasting methods. In this context, certain work has been carried out with the goal of improving recurrent DL models through use of self-attention mechanisms. As an example, a deep transformer model used in influenza-like illness forecasting was introduced in (N. Wu et al. 2020b) that outperforms sequence-to-sequence and LSTM architectures. The self-attention mechanism of transformer-based models performs better at forecasting than the linear-attention mechanism used in sequence-to-sequence models. Transformer methods therefore have great potential to simulate the complex dynamics found in time series data that are difficult for sequence schemes to handle. The approach may largely resolve the vanishing gradient problem that impedes Recurrent Neural Networks (RNNs) modelling of long-term predictions.
The review of literature confirmed that ANNs trained with appropriate learning techniques can properly simulate evaporation in different locations, with results superior to those of relatively complex traditional approaches. Nevertheless, identifying and developing efficient, reliable, and properly generalised estimation methods remains challenging due to the nonlinear and complex nature of evaporation processes. Of the various ANN techniques employed recently, the innovative DL model has great potential for resolving prediction problems and is known to outperform more complex techniques. In particular, the literature shows that of the various DL methods, CNN and LSTM offer the strongest performance potential and therefore will be considered as modelling benchmarks. In recent research, attention-based models have similarly been used in time series forecasting with much success in overcoming the problems found in standard RNN and convolutional sequence-to-sequence modelling. This research presents a novel approach in the field of evaporative losses, for it is the first to attempt use of a transformer model that relies on a self-attention mechanism for E p predictions. Successful development would be of high significance, particularly for water resources management towards the sustainability of farming.

Objectives
The current study aims to evaluate the applicability, predictability, and accuracy of Transformer Neural Network (TNN) schemes in predicting monthly Ep levels in four regions across Malaysia, using climatological data sets for the duration from 2000 to 2019. The TNN model performance is contrasted with two well-known deep learning approaches, Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN). Both methods are effective and compete well in DL modelling. The predictive accuracy of the models is investigated across a range of input combination scenarios to attain the highest possible levels of precision. Efficiency values for the model are analysed and evaluated using conventional statistical performance metrics that establish their suitability for forecasting evaporation rates. Moreover, adequate analysis was performed in this research to prove TNN modelling reliability with the objective of developing a reliable approach to forecasting evaporation, a task which is particularly essential to agricultural planning and water resource management. This study presents a novel approach in the field of evaporative water loss studies, for it is the first attempt in which transformer-based architecture is used for E p predictions.

Study area
Malaysia is in the tropics and thus receives a substantial amount of rain. Nonetheless, development has led to growing water demand in recent years. In addition, climate change appears to have extended the dry season while increasing evaporation rates in reservoirs. Many researchers believe that drought is a very complex and inadequately understood natural disaster, one which affects populations far more than other threats (Shaaban & Low 2003). Precise evaporation forecasting is therefore key to developmental efforts. This study's intent is to devise accurate schemes for forecasting E p that would be particularly beneficial in agricultural planning and water resource management. Monthly climate data are recorded at four meteorological sites located as follows: Alor Setar station (longitude 100° 24′ E, latitude 6° 12′ N, elevation 3.4 m), Kota Bharu station (longitude 102° 18′ E, latitude 6° 10′ N, elevation 4.4 m), KLIA Sepang station (longitude 101° 42′ E, latitude 2° 44′ N, elevation 16.1 m), and Kuantan station (longitude 103° 13′ E, latitude 3° 46′ N, elevation 15.2 m). These locations have been selected as a case study due to the existence of good-quality daily evaporation data as well as the importance of these cities in the region. All sites are run by the Malaysian Meteorological Department (MMD) and their outputs are relied on to both calibrate and validate the recommended prediction models. Climatic variables were gathered from a variety of regions across Malaysia, as shown in Fig. 1. In addition, Google Maps was utilised to map and describe the areas under study.

Data description
All proposed prediction models were constructed using seven meteorological variables, which include minimum, maximum, and mean air temperatures (T min , T max , T a ), relative humidity (RH), wind speed (S w ), solar radiation (R s ), and open pan evaporation (E p ). Data sets comprised some 19 years of daily statistics from 2000 until 2019. Various meteorological parameters logged every month that pertain to the quantified weather data, as collected by the four previously mentioned stations, are displayed in Table 1. Additionally, Fig. 2 displays average monthly variations for all meteorological parameters during the period from 2000 until 2019.
In Table 1, X min , X max , X mean , C x , C v , and S x denote the minimum, maximum, mean, skewness, coefficient of variation, and standard deviation of the modelled weather indicators, respectively. According to the table, the minimum value of E p was recorded at Kuantan station, while the greatest value was logged at KLIA Sepang station. This pattern might relate to site variation in relative humidity, which is inversely related to evaporation: Kuantan station recorded the highest relative humidity, whereas KLIA Sepang station recorded the lowest. On the other hand, the maximum skewness of E p was logged at KLIA Sepang station, whereas the minimum skewness was logged at Kuantan. Positive values of skewness imply that the data are asymmetric and do not follow a normal distribution.
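The summary statistics named above can be computed along the following lines. This is only a sketch: the function name `describe` is hypothetical, the sample values are made up, and the moment-based skewness estimator is an assumption, since the text does not state which estimator was used for C x.

```python
import numpy as np

def describe(series):
    """Summary statistics for one weather variable (illustrative helper)."""
    x = np.asarray(series, dtype=float)
    mean = x.mean()
    std = x.std(ddof=1)                 # sample standard deviation, S_x
    m2 = ((x - mean) ** 2).mean()       # second central moment
    m3 = ((x - mean) ** 3).mean()       # third central moment
    return {
        "X_min": float(x.min()),
        "X_max": float(x.max()),
        "X_mean": float(mean),
        "S_x": float(std),
        "C_v": float(std / mean),       # coefficient of variation
        "C_x": float(m3 / m2 ** 1.5),   # skewness (assumed moment estimator)
    }

# Made-up values: a symmetric sample and a right-skewed sample.
symmetric = describe([1.0, 2.0, 3.0, 4.0, 5.0])
skewed = describe([1.0, 2.0, 3.0, 4.0, 10.0])
print(symmetric["C_x"], skewed["C_x"])
```

A skewness near zero indicates a roughly symmetric sample, while the long right tail in the second sample gives a positive C x.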

Input combination scenarios
Selection of suitable predictors is a key step in the development of robust predictive models (Tofiq et al. 2022); various input sets of weather parameters were therefore considered to devise successful input-output schemes and enhance the predictive capability of the ML model. This approach allows a more practical understanding of how each input parameter influences evaporation estimates for a region (M. Abed et al. 2021). Input variables (predictors) were selected based on the Pearson correlation coefficient (Freedman et al. 2007). The Pearson correlation produces a test statistic that quantifies the statistical relationship, or association, between two continuous variables. It is recognised as one of the best means of measuring the association among the parameters under study, since it is based on the covariance method (Hauke & Kossowski 2011). The technique provides information on both the magnitude of the association and the direction of the relationship. Two variables can be positively or negatively associated, and no linear relationship between them can be determined if the correlation coefficient equals 0. Regarding the relevant features of the meteorological parameters used in estimating monthly E p , ranges and interpretations of the Pearson correlation coefficient were discussed in the previous research of M. Abed et al. (2022). The Pearson correlation coefficient was used to determine the meteorological parameters that most affect evaporation estimates, as listed in Table 2. The results displayed in Table 2 indicate that T max , T min , RH, R s , and S w were each related to some degree with E p , and therefore may have a key role in forecasting evaporation for all stations. In particular, RH and T max show the strongest association with E p at every site. RH and T max are therefore included in every input combination to strengthen E p estimation accuracy.
Previous studies have also implied that T max , T min , RH, R s, and S w are among the key predictors of evaporation (Dalkiliç et al. 2014;L. Wang et al. 2016).
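The correlation-based screening of predictors described above can be sketched as follows. The data are synthetic and purely illustrative (they are not the station records), and the helper names `pearson_r` and `rank_predictors` are hypothetical.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def rank_predictors(predictors, target):
    """Return (name, r) pairs sorted by |r|, strongest association first."""
    scores = {name: pearson_r(values, target) for name, values in predictors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Synthetic monthly data (assumed ranges, for illustration only).
rng = np.random.default_rng(0)
n_months = 240
rh = rng.uniform(70.0, 90.0, n_months)      # relative humidity, %
t_max = rng.uniform(30.0, 35.0, n_months)   # maximum air temperature, deg C
s_w = rng.uniform(1.0, 3.0, n_months)       # wind speed, m/s
# Toy target: falls with RH, rises with T_max, unrelated to wind speed.
e_p = 0.5 * t_max - 0.2 * rh + rng.normal(0.0, 0.3, n_months)

ranking = rank_predictors({"RH": rh, "T_max": t_max, "S_w": s_w}, e_p)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

The sign of r gives the direction of the association and its magnitude gives the strength, so sorting by the absolute value puts the most informative candidate predictors first.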
The current study also investigated the effect of using past E p values as an input parameter on the model's performance in evaporation prediction. In this context, data were selected based on correlations with previous records and their relation with the forecasted outcome. As depicted in Fig. 3, for every station, the autocorrelation assessment of the monthly E p time series indicated that the relationship weakened notably beyond the second lag period. This indicates that evaporation at a given time is influenced by the two preceding records. Consequently, using historical pan evaporation data together with the correlation analysis, the model was built using the most significant time lags, namely the past two records.
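A minimal sketch of this lag analysis is given below, assuming a plain sample autocorrelation estimator and a hypothetical helper `add_lagged_target` that appends the two previous target values to each input row. The seasonal series is synthetic and only stands in for a monthly E p record.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    xc = x - x.mean()
    denom = (xc ** 2).sum()
    return np.array([(xc[lag:] * xc[:len(x) - lag]).sum() / denom
                     for lag in range(max_lag + 1)])

def add_lagged_target(features, target, n_lags=2):
    """Append target lags 1..n_lags as extra input columns.

    Row t of the result holds features[t] followed by target[t-1], ...,
    target[t-n_lags]; the first n_lags rows are dropped because their
    history is incomplete.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    lagged = [target[n_lags - k: len(target) - k] for k in range(1, n_lags + 1)]
    X = np.column_stack([features[n_lags:]] + lagged)
    y = target[n_lags:]
    return X, y

# Synthetic 10-year monthly series with an annual cycle (illustration only).
months = np.arange(120)
series = 100.0 + 20.0 * np.sin(2.0 * np.pi * months / 12.0)
acf = autocorrelation(series, max_lag=12)

# Tiny example that makes the lag alignment easy to check by eye.
toy_target = np.arange(10.0)
toy_features = 10.0 * np.arange(10.0).reshape(-1, 1)
X, y = add_lagged_target(toy_features, toy_target, n_lags=2)
print(X[0], y[0])  # features at t=2, then target at t=1 and t=0
```

Inspecting `acf` shows how quickly the dependence on past values decays, which is the kind of evidence used above to justify keeping two lags.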
Therefore, the present study evaluated six distinct inputs for the proposed TNN framework (Table 3). Every climatic dataset was split as 80% and 20%, representing the training and testing (calibration and validation) sets. Initial years from the split dataset were used for training the model, while the remaining were used for testing.
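The chronological 80/20 split described above can be sketched as below. The function name `chronological_split` is hypothetical, and 240 monthly records are assumed purely for illustration.

```python
import numpy as np

def chronological_split(X, y, train_fraction=0.8):
    """Split time-ordered data without shuffling: early rows train, late rows test."""
    X = np.asarray(X)
    y = np.asarray(y)
    n_train = int(round(len(X) * train_fraction))
    return X[:n_train], X[n_train:], y[:n_train], y[n_train:]

# Assumed example: 240 monthly records indexed 0..239.
month_index = np.arange(240)
X_train, X_test, y_train, y_test = chronological_split(month_index, month_index)
print(len(X_train), len(X_test))
```

Keeping the order intact means every test record is later in time than every training record, so the test score reflects forecasting on unseen future data.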

Data pre-processing
Considering the time series aspect of this problem, the data for each predictor were normalised to the (0,1) range to remove scale differences between variables before the framework was devised and trained. Since this process comprises regression and forecasting, the max-min scaling technique is employed as per the following equation:
$$X_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$
where X i and x i denote the normalised and observed values, and x min and x max are the minimum and maximum observed values. The normalised predictor and predicted variable values were divided into training and validation sets. As specified previously, the training set comprised 80% of the observations, while the remaining 20% were used for testing.
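A short sketch of max-min scaling and its inverse follows. The inverse is what is needed later to map forecasts back to physical units. The function names and the sample values are hypothetical.

```python
import numpy as np

def minmax_fit(x):
    """Return the (x_min, x_max) pair that defines the scaling."""
    x = np.asarray(x, dtype=float)
    return float(x.min()), float(x.max())

def minmax_scale(x, x_min, x_max):
    """Map observed values to the 0-1 range: (x - x_min) / (x_max - x_min)."""
    return (np.asarray(x, dtype=float) - x_min) / (x_max - x_min)

def minmax_inverse(x_scaled, x_min, x_max):
    """Undo the scaling to recover values in the original units."""
    return np.asarray(x_scaled, dtype=float) * (x_max - x_min) + x_min

# Made-up monthly pan evaporation values in mm.
e_p = np.array([90.0, 120.0, 150.0, 105.0])
x_min, x_max = minmax_fit(e_p)
scaled = minmax_scale(e_p, x_min, x_max)
restored = minmax_inverse(scaled, x_min, x_max)
print(scaled)
```

Storing `x_min` and `x_max` alongside the model is a simple way to make sure the same transformation is applied to new inputs and reversed on the outputs.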

Machine learning algorithms
The proposed TNN model and the benchmark approaches (i.e. LSTM and CNN) were developed on a computer with an Intel Core i7-1195G7 CPU @ 2.90 GHz and 16 GB of RAM, using TensorFlow in Python 3.9.0.

Transformer neural network (TNN) model for E p prediction
The transformer framework is built on a self-attention (intra-attention) mechanism that addresses several concerns encountered in recurrent and convolutional sequence-to-sequence approaches. Self-attention retains only the critical information from the preceding tokens by selecting the details most relevant to the encoding of the present token. Put differently, the attention mechanism is used to compute the latent-space representations in both the encoder and the decoder. However, the loss of recurrence requires positional encoding to be added to the inputs and outputs: like the time step in a recurrent network, positional information gives the transformer the order of the inputs and outputs. The encoding layer comprises two components: multi-head self-attention (MSA) and a feed-forward layer. The attention mechanism creates direct one-to-one connections between all time steps. Attention layers were inspired by aspects of human attention; essentially, however, they perform a weighted mean reduction. Three inputs are fed to the attention layer: queries, keys, and values. Every sub-layer has a residual connection and is followed by layer normalisation. The purpose of multiple heads is often compared to that of multiple CNN filters, where every filter extracts latent features from the input. Likewise, each head in the multi-head attention approach extracts different latent features from the input. The outputs of all heads are combined by concatenation. In contrast to recurrent networks, the transformer approach is free from the vanishing gradient issue and can reference any previous point, regardless of its distance. This permits the transformer to identify long-term dependencies. Moreover, in contrast to recurrent systems, the transformer does not require sequential computation, allowing faster processing through parallelism.
Put differently, transformer inputs are not assessed sequentially. Hence, the vanishing gradient issue is inherently eliminated. On the other hand, Recurrent Neural Networks (RNNs) suffer from this issue for long-term forecasts. Figure 4 presents the fundamental difference concerning how information is handled in an RNN vs the self-attention system. Comparatively, transformers preserve direct associations to every previous timestamp so that data can be moved over extended sequences. Nevertheless, there is a new concern: the framework directly correlates with massive data inputs. The self-attention mechanism is used in the transformer framework to segregate non-essential information.
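To make the "weighted mean" view of attention concrete, the following is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The dimensions and the random projection matrices are assumptions for illustration; they are not the trained weights or the configuration of the model in this study.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (T, d_model).

    Returns the attended sequence (T, d_head) and the (T, T) weight matrix
    whose row t says how much time step t draws on every other time step.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores)
    return weights @ v, weights

# Assumed toy sizes: 12 monthly steps, 6 input features, head size 8.
rng = np.random.default_rng(1)
T, d_model, d_head = 12, 6, 8
x = rng.uniform(0.0, 1.0, size=(T, d_model))
w_q = rng.normal(0.0, 0.5, size=(d_model, d_head))
w_k = rng.normal(0.0, 0.5, size=(d_model, d_head))
w_v = rng.normal(0.0, 0.5, size=(d_model, d_head))

attended, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape, weights.sum(axis=-1))
```

Each output row is a weighted average of all value vectors, so the first and last time steps are connected in a single step rather than through a chain of recurrent updates.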
TNN model customization
Several studies have attempted to establish that DL frameworks are superior to other machine learning techniques in forecasting accuracy. Nevertheless, self-attention frameworks have not been reported in the literature for evaporation forecast modelling. Hence, this study aims to assess the efficacy of the transformer approach for evaporation estimation in terms of efficiency and accuracy. However, the transformer model designed for machine translation cannot be directly employed for time series estimation. The modifications applied to the transformer model to allow its use for time series prediction are as follows. The embedding layers used for NLP inputs are removed, and the time series value at each time step is provided as the input to the system. The soft-max classification layer at the output is also removed, and the output layer is modified to use a linear activation function. The regression-based mean square error (MSE) is employed as the loss function. Only the encoder of the original transformer framework is employed. Every encoding layer comprises two sub-layers: self-attention and a fully connected feed-forward layer. This study uses a one-dimensional convolutional layer as a substitute for the fully connected layer to identify high-  Such an arrangement provides convolutional layers additional versatility in learning data attributes. Briefly, the input to the encoder is a specific time series, as depicted in Fig. 5. The data are passed to the self-attention layer of the encoder, followed by layer normalisation. A feed-forward layer is then applied, followed by another layer normalisation. This study uses a TNN framework in which four identical encoding blocks are connected in a feed-forward manner.
Figure 5 presents the TNN framework setup. This work used an exhaustive search over the architecture and training hyperparameters to build the optimal structure for the studied TNN model. Consequently, many differently configured models were assessed to determine the optimal architecture. The optimal hyperparameters are listed in Table 4. Based on the outcomes presented in the table, the optimal TNN framework uses four identical transformer encoders whose output is processed by a one-dimensional Global Average Pooling layer. Global average pooling is beneficial because it is close to the convolutional architecture in that it enforces a correspondence between feature maps and categories. Hence, feature maps can be readily interpreted as category confidence indicators. Further, global average pooling has no parameters to optimise, thereby avoiding overfitting at this layer (Lin et al. 2013). Global average pooling is followed by a 128-neuron dense layer with the ELU activation function. A dropout layer is also introduced to control overfitting (Ferreira & da Cunha 2020). Lastly, a single-output fully connected linear layer produces the forecast E p value. Model training used a batch size of 16 and 200 epochs with the system configuration discussed above. Network weights were updated using the Adam algorithm (Kingma & Ba 2014) to minimise the loss function, with a learning rate of 1e-2. After the framework architecture was devised, the model was trained on the data in the training set. The subsequent step evaluated the model's forecasting ability on unseen data. The forecast values were denormalised to facilitate visual representation and then compared with the actual values. Figure 6 depicts the development steps of the TNN prediction framework.
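The data flow through the architecture described above can be illustrated with a forward pass written in plain NumPy. This is only a sketch under stated assumptions, not the TensorFlow implementation used in the study. The four encoder blocks and the 128-unit ELU layer follow the text, while the number of heads, head size, feed-forward width, kernel size of 1 for the convolutional feed-forward layer, input window length, and the random weights are all assumed. Dropout is omitted because it is inactive at inference time, and training with Adam and the MSE loss is not shown.

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def elu(z):
    return np.where(z > 0.0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def init_encoder(d_model, n_heads, d_head, d_ff):
    """Random weights for one encoder block (untrained, illustration only)."""
    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {
        "wq": w(n_heads, d_model, d_head),
        "wk": w(n_heads, d_model, d_head),
        "wv": w(n_heads, d_model, d_head),
        "wo": w(n_heads * d_head, d_model),
        "w1": w(d_model, d_ff),   # kernel-size-1 convolution = per-step linear map
        "w2": w(d_ff, d_model),
    }

def encoder_block(x, p):
    """Multi-head self-attention and feed-forward, each with residual + layer norm."""
    heads = []
    for wq, wk, wv in zip(p["wq"], p["wk"], p["wv"]):
        q, k, v = x @ wq, x @ wk, x @ wv
        heads.append(softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v)
    attn = np.concatenate(heads, axis=-1) @ p["wo"]   # concatenate head outputs
    x = layer_norm(x + attn)
    ff = np.maximum(x @ p["w1"], 0.0) @ p["w2"]        # ReLU between the two maps
    return layer_norm(x + ff)

def tnn_forward(x, encoders, head):
    """Encoder stack -> global average pooling -> dense ELU -> linear output."""
    for p in encoders:
        x = encoder_block(x, p)
    pooled = x.mean(axis=0)                            # average over time steps
    hidden = elu(pooled @ head["w_dense"] + head["b_dense"])
    return float(hidden @ head["w_out"] + head["b_out"])

N_FEATURES, N_STEPS = 6, 12                            # assumed input window
encoders = [init_encoder(N_FEATURES, n_heads=4, d_head=8, d_ff=16) for _ in range(4)]
head = {
    "w_dense": rng.normal(0.0, 0.1, size=(N_FEATURES, 128)),
    "b_dense": np.zeros(128),
    "w_out": rng.normal(0.0, 0.1, size=128),
    "b_out": 0.0,
}
window = rng.uniform(0.0, 1.0, size=(N_STEPS, N_FEATURES))  # normalised inputs
prediction = tnn_forward(window, encoders, head)
print(prediction)
```

Because the weights are random, the printed value has no physical meaning; the point is the shape of the computation, in which a (time steps, features) window is reduced to a single normalised E p estimate.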

Baseline models used for performances assessment of the proposed TNN model
This study benchmarked model performance based on Long Short-Term Memory Neural Networks (LSTM) and Convolutional Neural Networks (CNN) to contrast the proposed model's performance. These two benchmark frameworks have different architectures and belong to distinct families in the DL framework. The optimal hyperparameters concerning the CNN and LSTM models are listed in Table 5. The hyperparameters are values that control the learning process and determine the values of model parameters that a learning algorithm learns.
The LSTM network is an adapted and enhanced form of RNN that can learn extended correlations between several time-steps comprising a data sequence. LSTMs are appropriate for forecasting sequence information since they control the exploding and vanishing gradient challenges faced by conventional RNNs. These problems are addressed by implementing gating expressions and state information. Using a specifically devised structure, the LSTM approach exhibited enhanced modelling ability for distinct time series problems. The LSTM system comprises several memory blocks connected using layers that comprise multiple recurrently associated cells. The primary blocks of a simple LSTM system comprise an input layer for feeding sequences (i.e. time series data); the model layer is employed for training the system for long-term use of the sequence (time series) data. To address a fundamental regression issue, four layers are used in the LSTM system: the network originates using the 'sequence input layer' and the 'LSTM layer'. The terminal side comprises the 'fully connected layer' and the 'regression output layer'. Theoretical notions of the LSTM are detailed in (Hochreiter & Schmidhuber 1997).
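The gating mechanism mentioned above can be illustrated with a single LSTM time step in NumPy. The sketch follows the standard formulation with input, forget, and output gates; the sizes and random weights are assumptions and do not correspond to the benchmark configuration in Table 5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) biases.
    The four row blocks hold the input, forget, and output gates and the
    candidate cell update, in that order.
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to store
    f = sigmoid(z[H:2 * H])        # forget gate: how much old state to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much state to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values for the cell state
    c = f * c_prev + i * g         # additive update of the cell state
    h = o * np.tanh(c)
    return h, c

# Assumed toy sizes: 12 time steps, 6 features, 4 hidden units.
rng = np.random.default_rng(7)
T, D, H = 12, 6, 4
W = rng.normal(0.0, 0.3, size=(4 * H, D))
U = rng.normal(0.0, 0.3, size=(4 * H, H))
b = np.zeros(4 * H)
sequence = rng.uniform(0.0, 1.0, size=(T, D))

h, c = np.zeros(H), np.zeros(H)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h)
```

The additive cell update `c = f * c_prev + i * g` is what lets information and gradients persist across many time steps, in contrast to the repeated matrix multiplications of a plain RNN.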
The CNN approach belongs to the deep learning paradigm. This neural network is primarily unlike a traditional ANN (i.e. MLP) since it comprises convolutional layers (of filters). Automated feature identification is implemented in such layers, where critical input data features are mapped to the required input-output association. Hence, CNN can process raw data, eliminating pre-processing or manual feature identification. CNN is typically employed for image processing. Hence, two-dimensional (2D CNN) convolutional filters have been employed (representation matches an image). Nevertheless, sequential or temporal data evaluation uses single-dimension (1D CNN) convolutional filters (Li et al. 2017). Such filters crawl the inputs to record probable temporal patterns for time series sequences. Hence, this paper uses 1D CNN. The conceptual framework of the CNN is explained by LeCun et al. (LeCun et al. 1998).
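How a one-dimensional filter slides along a time series can be shown with a few lines of NumPy. The helper name `conv1d_valid`, the series, and the kernel are made up for illustration; as in most deep learning libraries, the kernel is applied without flipping and without padding.

```python
import numpy as np

def conv1d_valid(series, kernel):
    """Slide a 1-D kernel over a series and return one value per position."""
    x = np.asarray(series, dtype=float)
    k = np.asarray(kernel, dtype=float)
    n_out = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n_out)])

# A two-tap kernel that responds to the change between consecutive steps.
series = [1.0, 2.0, 4.0, 7.0, 11.0]
feature_map = conv1d_valid(series, kernel=[-1.0, 1.0])
print(feature_map)
```

In a trained 1D CNN the kernel values are learned, so each filter ends up detecting whichever local temporal pattern is most useful for the forecast.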
The primary objective in choosing these benchmark frameworks is to enhance the precision and validity of the performance assessment. The models' distinct architectures and commendable performance were the primary reasons for their selection; several recent papers have employed them for forecasting evaporation (M. Abed et al. 2021, 2022). The target TNN model's performance can be assessed from a wider perspective since these benchmark frameworks have distinct architectures and belong to a broad spectrum of deep learning approaches. This study does not employ empirical techniques as benchmarks. It is preferable to choose sophisticated machine learning benchmark frameworks that, according to the literature, perform better than empirical techniques. Moreover, benchmarking against models that perform worse than advanced machine learning approaches might produce large performance gaps and thus overstate the merit of the TNN framework.

Performance evaluation
It is crucial to choose suitable performance indicators, since each indicator has its own attributes, and knowing the properties of each statistical indicator helps in understanding how a model performs. Thus, in this research, the predictive performance of the models was assessed using several statistical indicators, which are described below:
(1) Coefficient of determination (R²): The coefficient of determination represents the relationship between the estimated and real outputs; its value ranges from 0 to 1 (including both limits). A value of zero signifies that the estimates are unrelated to the real outputs (a purely stochastic model), while a value of one signifies a perfect fit. R² is very widely used, which makes model comparison easier and more consistent. It assesses how well a prediction model fits a dataset, providing researchers with immediate feedback on the model's performance.
(2) Root mean square error (RMSE): RMSE is the square root of the mean of the squared errors between the estimated and real values. In regression model performance assessment, RMSE is more widely used than MSE. Moreover, RMSE is simple to compute. Additionally, RMSE penalises large errors more heavily, which often makes it the preferred choice.
(3) Mean absolute error (MAE): The MAE is the mean of the absolute differences between the estimated and actual outputs. Unlike RMSE, MAE does not disproportionately penalise large errors caused by outliers.
(4) Nash-Sutcliffe efficiency (NSE): NSE is a normalised metric that determines the relative magnitude of the residual variance (noise) compared to the variance of the measured data (information). It is still extensively used in hydrologic modelling, partly because it normalises model accuracy to a readily interpretable scale.
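The expressions for these four indicators are not reproduced in the text. Their commonly used forms are given here as an assumption. In particular, R² is written as the squared Pearson correlation between real and predicted outputs, which is one of several definitions in use and may differ from the one applied in this study; the symbol $\bar{\hat{y}}$ (the mean of the predicted values) is introduced only for that expression.

$$
R^2 = \left[ \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \right]^2
$$

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

$$
\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$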
where n represents the sample count, y represents the real output, ŷ represents the predicted values, and ȳ represents the average of the real outputs.
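The four indicators described above can be computed as in the following minimal sketch. It assumes the standard textbook definitions, with R² taken as the squared Pearson correlation (an assumption, since other definitions exist); the function names and the sample values are hypothetical and are not results from this study.

```python
import math


def mean(values):
    return sum(values) / len(values)


def rmse(obs, pred):
    """Root mean square error: square root of the mean squared difference."""
    return math.sqrt(mean([(o - p) ** 2 for o, p in zip(obs, pred)]))


def mae(obs, pred):
    """Mean absolute error: mean of the absolute differences."""
    return mean([abs(o - p) for o, p in zip(obs, pred)])


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - residual sum of squares / observed variability."""
    o_bar = mean(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - o_bar) ** 2 for o in obs)
    return 1.0 - sse / sst


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values (assumed definition)."""
    o_bar, p_bar = mean(obs), mean(pred)
    cov = sum((o - o_bar) * (p - p_bar) for o, p in zip(obs, pred))
    var_o = sum((o - o_bar) ** 2 for o in obs)
    var_p = sum((p - p_bar) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


# Hypothetical observed and predicted values (illustrative only).
observed = [3.0, 4.0, 5.0, 6.0]
predicted = [2.5, 4.5, 5.0, 7.0]
scores = {
    "R2": r_squared(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAE": mae(observed, predicted),
    "NSE": nse(observed, predicted),
}
```

For these sample values the squared correlation (about 0.96) is noticeably higher than the NSE (0.70), because a correlation-based R² is insensitive to systematic over- or under-prediction whereas NSE is not. This is one reason to report several indicators side by side.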

Results
To demonstrate that the TNN model for evaporation prediction is robust, this section provides a complete analysis of the empirical outcomes obtained with this model and a comparative performance assessment of the various models. In total, three models were used for the monthly Ep forecasting task: the proposed TNN model and two standard models, CNN and LSTM. The models were applied at four sites in Malaysia, namely Kuantan (103° 13′ E, 3° 46′ N), KLIA Sepang (101° 42′ E, 2° 44′ N), Kota Bharu (102° 18′ E, 6° 10′ N), and Alor Setar (100° 24′ E, 6° 12′ N). The performance of the TNN forecasting model was first studied under different input combinations to obtain the highest forecast accuracy. The best model was then compared against the other two standard models. All models, including the proposed model, were assessed using the following performance indicators.
The value of R² is used to evaluate the effectiveness of all models investigated in this research in terms of the degree of correlation between observed and forecasted Ep values. For each model, the best statistical metrics are displayed in bold. As can be seen in Table 6, the monthly Ep prediction accuracy differs significantly between input combinations. At all sites, the model achieved the best prediction accuracy when the entire meteorological dataset (RH, Ep, Tmax, Tmin, Rs, and Sw) was used, compared with input combinations based on more limited data. In general, this indicates that the accuracy of the prediction models improved as more input variables were included, which is consistent with the findings of previous research (Fan et al. 2016; L. Wang et al. 2017a). Four input combinations that did not include Rs or Sw were sufficient to achieve reasonable accuracy for monthly Ep prediction.
When only Tmax and RH data were available, the TNN model's prediction accuracy was found to be inadequate at all stations. This indicates that advanced capabilities, such as those of AI, may not enhance the predictive performance of an ML model when only a limited number of meteorological inputs is available. In addition, there was a slight improvement in prediction accuracy when Ep was used as an input.
Comparing the TNN with the benchmark models, the R² values of the TNN at the four studied stations were 0.977 for Alor Setar, 0.989 for Kota Bharu, 0.972 for KLIA Sepang, and 0.974 for Kuantan. At all stations, the R² values produced by the standard models considered in this study are lower than those of the TNN model (see Table 7). Thus, the TNN model shows the highest degree of correlation between estimated and observed Ep values. Moreover, the R² values produced by the TNN model at all stations are close to 1, the value that the coefficient of determination attains when a model performs ideally. Therefore, with respect to the R² determined for the estimated and observed values, the TNN model performs better than the benchmark models.
Fig. 7 displays radar plots of the RMSE for the TNN model and the benchmark models at all study sites. Moreover, the Nash-Sutcliffe efficiency (NSE), another metric employed to evaluate the efficacy of the proposed deep learning model, is close to one for all study regions. For instance, NSE ≥ 0.968 was obtained with the TNN model for all study regions. More importantly, these values are higher than those of the comparative models at all study stations, as presented in Table 7. Overall, the current investigation offers convincing evidence that the TNN model has significant potential for predicting monthly Ep, and that its performance is higher than that of the comparative models at all study sites in Malaysia.
It is worth noting that the TNN model showed lower predictive error than the benchmark models at all study stations in the testing phase. Figures 8 and 9 display the scatter plots and time series of observed versus predicted monthly Ep for the TNN model and the benchmark models in the testing phase. The scatter plots report the coefficient of determination (R²), which indicates how well the variability in observed Ep is reproduced by the modelled Ep; its value lies between 0 and 1. If the model fits well and the observed and forecasted Ep agree closely, R² approaches one. As per Figs. 8 and 9, all the models showed high R² values close to unity. However, the TNN model gave the highest R² value at all study sites. This demonstrates that the TNN model can predict Ep more accurately than the other models examined in this study and that the proposed DL model can be regarded as an important tool for predicting Ep. According to the results, the proposed TNN model outperformed all benchmark models at all sites, demonstrating the positive impact of the self-attention mechanism on prediction accuracy.
Figure caption: Time series of measured Ep versus predicted Ep of TNN and benchmark models at all stations

Discussion
In deep learning, transformers are regarded as a state-of-the-art approach. The transformer model has revolutionised the use of attention by dispensing with convolution and recurrence and relying exclusively on a self-attention mechanism. In time series forecasting, transformers do not process their input step by step; each position attends directly to every other position in the sequence. This helps avoid the vanishing gradient problem that hampers recurrent neural networks (RNNs) in long-term prediction. The present study is a novel application of self-attention algorithms to the evaporation prediction domain. The proposed TNN model performed well in Ep prediction at all four selected stations, whereas the relative ranking of the other models investigated in this study varied from site to site. The consistency of the proposed TNN model, which was the best Ep prediction model at all study sites, supports the positive impact of the self-attention mechanism. Thus, the transformer architecture, which is based solely on self-attention mechanisms (intra-attention), has demonstrated potential to enhance the performance of DL predictive models.
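The core computation behind the self-attention mechanism discussed above can be sketched as scaled dot-product attention. The example below makes a strong simplifying assumption: queries, keys, and values are all set equal to the input, whereas a full transformer uses learned projection matrices, multiple heads, and positional encoding. It is not the TNN architecture of this study, and the function names and input values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (simplifying assumption).

    x is a list of time steps, each a list of d features.
    Returns (outputs, weights); weights[t][s] is how strongly step t attends to step s.
    """
    d = len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in x:
        # Similarity of this time step with every time step in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all time steps.
        outputs.append([sum(w[s] * x[s][j] for s in range(len(x))) for j in range(d)])
    return outputs, weights


# Three hypothetical time steps with two features each (illustrative values).
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(sequence)
```

Every time step is compared with every other time step in a single pass, so no information has to be carried through a chain of recurrent updates. That is the property the discussion credits with sidestepping the vanishing gradient problem.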
Regarding the study's limitations, data were gathered and modelled for just four study regions in Malaysia (as a case study). While this pioneering research has yielded a new modelling structure for Ep prediction despite its restricted context, further research could cover a broader range of regions elsewhere, representing different weather conditions. Nevertheless, deep learning appears to have considerable implications for managing irrigation and other water resource systems by monitoring changes in monthly Ep.
One practical implication of this research is that the Ep modelling approach, which gives a close estimate of the actual water loss due to evaporation and its relation to water resource management, could be employed as a science-based strategy for irrigation and other agricultural tasks. When Ep values are multiplied by the surface area of the irrigation water resources, the amount of water lost to evaporation (a primary component of water loss from the existing water volume) can be evaluated. It thus becomes easier to estimate the total amount of water available for irrigation and to plan and implement a set of intelligent irrigation schedules. These schedules can also help avoid unnecessary water losses as irrigation practices become more flexible. The current TNN model for predicting Ep is therefore also expected to provide considerable economic advantages to agriculturists, especially in areas where farming is affected by water resource issues, droughts, and other hydrological imbalances. In addition, this study offers guidance for hydrologists on how to analyse the non-stationary and nonlinear behaviour of hydrological cycles effectively using soft computing.
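The multiplication of Ep by surface area described above amounts to a unit conversion, sketched below. The example assumes that Ep is expressed as a depth in millimetres per month and the surface area in square metres; these units, the function name, and the numbers are assumptions for illustration and are not values from the study.

```python
def evaporation_volume_m3(ep_mm, surface_area_m2):
    """Volume of water lost to evaporation: depth (mm converted to m) times area (m^2)."""
    if ep_mm < 0 or surface_area_m2 < 0:
        raise ValueError("depth and area must be non-negative")
    return (ep_mm / 1000.0) * surface_area_m2


# Hypothetical example: 120 mm of monthly evaporation over a 50,000 m^2 reservoir
# corresponds to a loss of about 6,000 m^3 of water in that month.
loss_m3 = evaporation_volume_m3(120.0, 50_000.0)
```

Subtracting such a monthly loss from the stored volume gives a first estimate of the water that remains available for irrigation scheduling.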

Conclusion
This study concentrated on developing a transformer-based architecture (TNN) model that can be employed in practice for predicting monthly Ep losses, and on providing a detailed comparison with benchmark DL models, namely the CNN and LSTM approaches. To evaluate the capabilities of the designed DL models, monthly data from four meteorological stations in Malaysia were used to predict Ep rates. Monthly time series of Tmin, Tmax, RH, Rs, Sw, and Ep for the years 2000-2019 were employed for training (calibration) and testing (validation) of the designed models. The input parameters (predictors) were chosen based on Pearson's correlation coefficient values to identify the most effective input combinations for the TNN model. The performance of each model and its efficacy in evaporation forecasting were evaluated using standard statistical measures.
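As a rough illustration of correlation-based predictor screening, the sketch below computes Pearson's correlation coefficient between each candidate input and the target and ranks the candidates by absolute correlation. The ranking rule, the function names, and the sample values are assumptions made for illustration; the exact selection procedure used in this study is not specified here.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length of at least 2")
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - x_bar) ** 2 for a in x))
    sy = math.sqrt(sum((b - y_bar) ** 2 for b in y))
    return cov / (sx * sy)


def rank_predictors(candidates, target):
    """Sort candidate predictors by |r| with the target, strongest first."""
    scored = [(name, pearson_r(values, target)) for name, values in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical monthly values (illustrative only, not the station records).
ep = [4.0, 5.0, 6.0, 7.0]
candidates = {
    "Tmax": [30.0, 31.0, 32.0, 33.0],  # rises in step with Ep
    "RH": [85.0, 80.0, 82.0, 70.0],    # tends to fall as Ep rises
}
ranking = rank_predictors(candidates, ep)
```

Ranking by the absolute value keeps strongly negative relationships, such as humidity against evaporation in this toy data, in consideration alongside positive ones.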
The investigation provided the following results:
• A high level of prediction accuracy for the monthly Ep was observed with the three developed DL models.
• The TNN model delivered better performance in predicting monthly Ep than the benchmark models at all study sites.
• Models that used the complete meteorological dataset (Tmin, Tmax, Sw, Rs, Ep, and RH) achieved the best prediction accuracy at all stations, compared with other combinations employing limited data input.
• As evident in the results, the performance of the developed TNN model was significantly better than that of the other benchmark models at all study sites. This supports the conclusion that the TNN model can be employed efficiently for predicting monthly Ep data series.
• In terms of application, the TNN model provides a precise estimate of water loss due to evaporation and can thus be used in irrigation management, irrigation-based agricultural planning, and the reduction of fiscal and economic losses in farming and related industries where consistent supervision and estimation of water are considered necessary for viable living and economy.
• In the future, the applicability of the proposed technique can be tested for different areas in Malaysia or elsewhere using different data sets to develop a reliable, generalisable model that can predict evaporation.