1 Introduction

Accurate power generation data from renewable sources is crucial for several reasons. Renewable energy generation from wind and photovoltaic sources is inherently variable (Li et al. 2023), meaning that output fluctuates with weather conditions and other factors. Accurate power generation data enables grid operators and energy companies to better anticipate these fluctuations and manage the overall power supply to ensure a stable and reliable grid (Coppitters and Contino 2023). It is also essential for billing purposes, as it ensures that energy providers are properly compensated for the power they generate. Finally, it is important for policy-making and research (Awerbuch and Berger 2003), as it helps to inform decisions about energy infrastructure investment, environmental impact, and renewable energy technology development. In short, accurate power generation data is critical for ensuring the efficiency, reliability, and sustainability of our energy systems.

Historical power generation data could prove a useful indicator for future forecasting (Shi et al. 2012). However, several interconnected parameters affect power generation, all of which are prone to volatile changes, making prediction very difficult due to signal complexity. Decomposition techniques such as Variational Mode Decomposition (VMD) (Rehman and Aftab 2019) and Empirical Mode Decomposition (EMD) (Rehman and Mandic 2010) have the potential to deal with this complexity by breaking a complex signal down into simpler, more easily analyzed components. These techniques are particularly useful for signals that exhibit non-stationary and nonlinear behavior. Both decompose a signal into a set of intrinsic mode functions (IMFs) or variational modes, respectively, that capture the underlying frequency components of the signal. Each IMF or variational mode represents a distinct frequency component, with the highest-frequency modes capturing the most rapid and transient changes and the lowest-frequency modes capturing the slower, more persistent ones. By analyzing the IMFs or variational modes separately, researchers can gain insights into the underlying patterns and dynamics of the signal that would otherwise be obscured by its complexity.

Emerging artificial intelligence (AI) techniques have the potential to accurately forecast production from wind sources in several ways. Models can be trained on large datasets of historical weather and wind data to better predict wind speed and direction, which are critical factors in wind energy production. This can be leveraged to refine the accuracy of production forecasting and help energy providers better anticipate fluctuations in wind power output. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) neural networks are a type of recurrent neural network designed to model sequences and time series data. LSTMs are well suited to time series because they can capture and remember long-term dependencies and patterns in the data over time while avoiding the vanishing gradient problem that affects traditional recurrent neural networks. LSTMs use a system of “gates” to moderate the transmission of data and control the memory of the network, allowing them to selectively forget or remember information from previous time steps. This makes LSTMs highly effective for modeling time series with complex temporal dynamics. However, like many algorithms, they present a set of hyperparameters that require adequate adjustment to attain desirable outcomes.

Hyperparameters are an essential aspect of forecasting AI models, as they can significantly influence model performance (Probst et al. 2019). These parameters are not learned by the model during training but are set by the user beforehand. Choosing the right hyperparameters is critical, as selecting the wrong values can lead to poor performance, slow convergence, or overfitting. Hyperparameter tuning involves selecting optimal values for these hyperparameters so as to attain the best possible outcomes from the model. This is typically done through a combination of trial and error and automated techniques: the model is trained with different hyperparameter values, its performance is evaluated on a validation set, and the hyperparameter values that produce an optimized model are then selected.

Metaheuristic algorithms present a powerful class of optimization algorithms that may be applied to hyperparameter tuning in forecasting AI models (Bacanin et al. 2023b). These algorithms iteratively traverse and analyze the search space of possible solutions without making assumptions about the objective function or the structure of the problem. They are particularly well suited to hyperparameter tuning, as they can handle high-dimensional and non-convex search spaces with many local optima, and they can help overcome some of the limitations of traditional optimization methods. By leveraging the power of iterative search and exploration, metaheuristic algorithms can identify optimal hyperparameter values and improve the accuracy and performance of forecasting AI models. Swarm intelligence metaheuristics are a group of optimization algorithms inspired by the collective behavior of social insects and animals. The concept of decentralized self-organization is at their core: a group of individuals mutually interact to reach a shared goal, and by following simple sets of rules, complex behaviors emerge on a global scale. These algorithms can address non-deterministic polynomial-time hard (NP-hard) optimization problems within reasonable time and resource budgets, something that often poses a problem for traditional methods. That being said, in accordance with the no free lunch (NFL) theorem (Wolpert and Macready 1997), no single methodology is best for all problems; instead, an individual approach is preferred. Therefore, extensive investigation is needed to further improve techniques and methods.

A notably well-performing metaheuristic algorithm used in this research is the reptile search algorithm (RSA) (Abualigah et al. 2022). It is a nature-inspired metaheuristic that mathematically models the hunting behavior of crocodiles, alternating between an encircling (exploration) phase and a hunting (exploitation) phase over the course of the run. By balancing these two mechanisms, the RSA aims to explore the search space widely while progressively exploiting its most promising regions.

A motivation for the conducted research was to further explore and expand the understanding of the novel RSA and its potential for hyperparameter tuning. Additionally, a key motivator was to determine if this already admirably performing metaheuristic can be further improved through hybridization with other well-known powerful optimizers. Finally, this research hopes to introduce a robust AI-based method improved by the introduced metaheuristic in order to better address the pressing real-world issue of wind power generation forecasting.

With this in mind, this work proposes a novel method for forecasting power generated by wind farms based on a time series of meteorological and historical factors. To account for the complexities caused by the volatility of these predictors, two signal decomposition techniques are utilized: VMD and EMD. The processed data is formulated as a time series, and six input lags are utilized to train LSTM neural networks that produce forecasts three steps ahead. To optimize the performance of the models, metaheuristic algorithms are applied for hyperparameter selection. Several metaheuristics are evaluated, and a new modified version of the RSA is introduced. The performance of new and modified algorithms is usually first evaluated on a set of standardized benchmark functions; accordingly, the modified metaheuristic was evaluated on a wide range of standard bound-constrained CEC2019 benchmark functions before being applied to a real-world challenge. Following these evaluations, each metaheuristic-optimized, decomposition-aided LSTM approach is evaluated on two real-world datasets covering two wind power plants in different parts of the world. The best-performing models are interpreted using SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017) to determine the factors that have the highest influence on model predictions.

The central contributions of the presented research work are summarised as:

  • A proposal for a modified variation of the recently introduced RSA that improves on the commendable performance of the original

  • An introduction of a decomposition-aided metaheuristic optimized methodology for wind power generation prognosis

  • An interpretation of the best-performing models using SHAP analysis to better understand the factors that contribute the most to wind power generation

The remainder of the work follows the structure presented here: preceding research that lays the foundation for this work is presented in Sect. 2. The utilized methods and the newly introduced metaheuristic are presented in Sect. 3. The capabilities of the introduced algorithm on bound-constrained functions are shown and discussed in Sect. 4. The experimental setup and the attained results on the two real-world datasets, along with a discussion of those results, are presented in Sects. 5 and 6, respectively. Finally, the conclusions of the work and proposals for future research are given in Sect. 7.

2 Overview of research background and literature review

Research interest in LSTM neural networks for forecasting wind power generation has recently been renewed. Several studies have demonstrated the effectiveness of LSTM-based models in accurately predicting wind power output over short-term and long-term horizons. One such study (Shahid et al. 2021) proposed an LSTM-based model for short-term wind power forecasting that incorporated both meteorological and power data. The authors demonstrated that the model outperformed traditional time series models and other machine learning (ML) algorithms, achieving high accuracy and robustness across different locations and weather conditions.

Another study (Liu et al. 2019) focused on long-term wind power forecasting using LSTM networks. The authors introduced a hybrid model that combined LSTM with the wavelet transform and principal component analysis to capture both the temporal and spatial variations in wind power data. The results showed that the proposed model outperformed traditional statistical models and other ML algorithms, with improved accuracy and robustness over longer forecasting horizons.

In addition to LSTM neural networks, other techniques have been explored to augment wind power forecasting precision, including decomposition techniques such as VMD and EMD. Researchers (Zhang et al. 2016) proposed a VMD-based approach to improve the accuracy of wind power forecasting by decomposing the time series into different frequency components and applying separate models to each component. The authors demonstrated that the VMD-based approach outperformed traditional time series models and other ML algorithms, achieving high accuracy and robustness across different locations and weather conditions. While other approaches to time-series forecasting exist, their potential has already been sufficiently explored in the literature.

Hyperparameter tuning is a paramount aspect of ML and has been tested in the context of wind power forecasting using metaheuristic algorithms. Previous work (Shao et al. 2021) proposed a fireworks algorithm-based approach to optimize the hyperparameters of LSTM neural networks for wind power forecasting. The authors demonstrated that the proposed approach achieved higher forecasting accuracy compared to other optimization techniques, indicating the importance of hyperparameter tuning for accurate wind power forecasting. Moreover, metaheuristics have been applied to optimization across several fields with admirable results, including crude oil price forecasting (Jovanovic et al. 2022a) and environmental sciences (Jovanovic et al. 2023a).

Another approach to augment the interpretability of wind electricity production forecasting models is the use of SHAP (Lundberg and Lee 2017) values. These values provide a way to interpret the reasoning behind ML model decisions by determining the contribution each available feature makes toward the final outcome. SHAP values can be used to understand the factors that influence wind power forecasting accuracy in LSTM-based models, and they provide a more comprehensive and intuitive understanding of model behavior than traditional evaluation metrics alone.

2.1 Motivation

Renewable energy plays a crucial role in addressing our planet’s pressing environmental challenges (Akella et al. 2009; Dinçer et al. 2023; Yüksel et al. 2024). By harnessing sources like solar and wind, we can significantly reduce greenhouse gas emissions and combat climate change (Razmjoo et al. 2021). Embracing renewable energy not only safeguards the environment but also promotes energy security, job creation, and a healthier future for generations to come.

However, methods for generating energy renewably are still developing, and many challenges need to be overcome to facilitate large-scale adoption (Züttel et al. 2022). One major challenge is reliability: being able to forecast the available power can improve system reliability in the long run, increasing the viability of renewable systems.

The potential of metaheuristic algorithms for hyperparameter optimization is well established (Tayebi and El Kafhali 2022); however, it has not yet been explored for wind power generation forecasting using LSTM networks. With novel and more powerful techniques constantly being developed (Mattos Neto et al. 2021; Belotti et al. 2020), evaluation and innovation are required to improve the body of work available to researchers tackling optimization problems. The potential of the recently introduced RSA (Abualigah et al. 2022) has not yet been explored and implemented in energy forecasting. Furthermore, as a relatively recent approach, this algorithm has great potential for improvement through hybridization.

Decomposition offers yet another technique that can improve model training; with new methods continually being developed, experimentation is required to determine their potential for addressing this increasingly pressing problem. Model interpretation techniques (Dwivedi et al. 2023) are increasingly important for building more reliable models and robust systems. This work aims to address the observed research gap and improve the body of available techniques for renewable energy forecasting.

2.2 Variational mode decomposition (VMD)

Variational mode decomposition (VMD) (Rehman and Aftab 2019) is a methodology for decomposing a signal into non-reducible base components. It is based on Wiener filtering and the Hilbert transform (Zhang et al. 2021). This adaptive signal decomposition method decomposes a given signal f(t) into several component signals \(u_k(t)\), each band-limited around a center frequency \(\omega _k\), according to Eq. (1) (Wang and Li 2023).

$$\begin{aligned} \min \limits _{\{u_k\},\{\omega _k\}}\left\{ \sum _{k=1}^{K}\Big \Vert \partial _t\Bigl [\Bigl ( \delta (t)+\frac{j}{\pi t}\Bigr )*u_k(t)\Bigr ] e^{-j\omega _kt}\Big \Vert _2^2\right\} , \quad \text {s.t.} \quad \sum _{k=1}^{K}u_k=f(t) \end{aligned}$$
(1)

where K is the number of decomposed modes, \(\{u_k\} = \{u_1, u_2, \dots , u_K\}\) are the modal components with center frequencies \(\{\omega _k\} = \{\omega _1, \omega _2, \dots , \omega _K\}\), and \(\partial _t\) is the partial derivative with respect to time. \(\delta (t)\) represents the Dirac distribution, f(t) is the original input signal, \(u_k(t)\) is the k-th sub-signal of f(t), and * denotes the convolution operator.

A Lagrange multiplier \(\lambda\) and a quadratic penalty factor \(\alpha\) are incorporated to convert the constrained variational problem into an unconstrained one, as follows:

$$\begin{aligned} L(\{u_k\},\{\omega _k\},\lambda )&=\alpha \sum _{k=1}^{K}\Big \Vert \partial _t\Bigl [\Bigl ( \delta (t) +\frac{j}{\pi t}\Bigr )*u_k(t)\Bigr ]e^{-j\omega _kt}\Big \Vert _2^2 \\&\quad +\left\| f(t)-\sum _{k=1}^{K}u_k(t)\right\| _2^2+ \left\langle \lambda (t),f(t)-\sum _{k=1}^{K}u_k(t)\right\rangle \end{aligned}$$

The Lagrange function is transformed from the time domain to the frequency domain, and the alternating direction method of multipliers (ADMM) is utilized to solve the optimization problem. The modes \(u_k\) and their center frequencies \(\omega _k\) are updated using Eqs. (2) and (3), respectively.

$$\begin{aligned} {\hat{u}}_k^{n+1}(\omega ) = \frac{{\hat{f}}(\omega )-\sum _{i\ne k}{\hat{u}}_i(\omega )+\frac{{\hat{\lambda }}(\omega )}{2}}{1+2\alpha (\omega -\omega _k)^2} \end{aligned}$$
(2)
$$\begin{aligned} \omega _k^{n+1} = \frac{\displaystyle \int _0^\infty \omega \big |{\hat{u}}_k(\omega )\big |^2\,d\omega }{\displaystyle \int _0^\infty \big |{\hat{u}}_k(\omega )\big |^2\,d\omega } \end{aligned}$$
(3)

in which n is the iteration number and \({\hat{\lambda }}\) is the Lagrange multiplier, updated according to Eq. (4).

$$\begin{aligned} {\hat{\lambda }}^{n+1}(\omega )={\hat{\lambda }}^n(\omega )+\tau \left[ {\hat{f}}(\omega )-\sum _{k=1}^{K}{\hat{u}}_k^{n+1}(\omega )\right] \end{aligned}$$
(4)

The iterative process is repeated until the convergence condition given by Eq. (5) is met.

$$\begin{aligned} \frac{\displaystyle \sum _{k=1}^{K}\Big \Vert {\hat{u}}_k^{n+1}(\omega )-{\hat{u}}_k^{n}(\omega )\Big \Vert _2^2}{\Big \Vert {\hat{u}}_k^{n}(\omega )\Big \Vert _2^2}<\epsilon \end{aligned}$$
(5)
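To make the procedure concrete, the following is a minimal NumPy sketch of the ADMM updates in Eqs. (2)–(5). It is an illustrative implementation under simplifying assumptions (no signal mirroring, even-length input, fixed K and \(\alpha\)), not the reference VMD code used in the experiments.

```python
import numpy as np

def vmd(f, K=3, alpha=2000.0, tau=0.1, tol=1e-7, max_iter=500):
    """Sketch of VMD via the frequency-domain updates of Eqs. (2)-(5)."""
    T = len(f)                                   # assumes even length
    f_hat = np.fft.fftshift(np.fft.fft(f))
    freqs = np.arange(T) / T - 0.5               # centered frequency axis
    u_hat = np.zeros((K, T), dtype=complex)      # mode spectra
    omega = np.linspace(0.05, 0.45, K)           # initial center frequencies
    lam = np.zeros(T, dtype=complex)             # multiplier, Eq. (4)
    half = slice(T // 2, T)                      # positive-frequency half
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Eq. (2): Wiener-filter-style mode update
            u_hat[k] = (f_hat - others + lam / 2) / \
                       (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Eq. (3): center frequency as the spectral centroid
            p = np.abs(u_hat[k, half]) ** 2
            omega[k] = (freqs[half] @ p) / (p.sum() + 1e-12)
        # Eq. (4): dual ascent on the reconstruction constraint
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        # Eq. (5): relative-change convergence test
        diff = sum(np.linalg.norm(u_hat[k] - u_prev[k]) ** 2 /
                   (np.linalg.norm(u_prev[k]) ** 2 + 1e-12) for k in range(K))
        if diff < tol:
            break
    modes = np.real(np.fft.ifft(np.fft.ifftshift(u_hat, axes=-1), axis=-1))
    return modes, omega
```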

2.3 Empirical mode decomposition (EMD)

Empirical mode decomposition (EMD) (Rehman and Mandic 2010) is a signal decomposition approach for reducing the amount of noise in non-stationary time series data (Devi et al. 2020). Unlike Fourier and wavelet analysis, it requires no predefined basis functions. The technique gives good results when analyzing wind speed data series. Using the EMD algorithm, complex time series data can be decomposed into a limited number of Intrinsic Mode Functions (IMFs) (Wang et al. 2022b).

The EMD process proceeds as follows:

  1. Determine the local maxima and minima of the processed signal x(t). Record \(h_1(t)\) as the difference between x(t) and the mean value \(m_1(t)\) of the upper and lower envelopes, according to Eq. (6):

    $$\begin{aligned} h_1(t)=x(t)-m_1(t) \end{aligned}$$
    (6)
  2. \(h_1(t)\), filtered out of the original signal, tends to contain the signal’s highest-frequency component. The difference signal \(r_1(t)\) is obtained by separating \(h_1(t)\) from x(t); in this way, the high-frequency component is removed. The filtering steps are then repeated, with \(r_1(t)\) as the new signal, until the residual signal at the n-th stage becomes a monotonic function. This is shown by Eq. (7).

    $$\begin{aligned} {\left\{ \begin{array}{ll} r_1(t)=x(t)-h_1(t)\\ r_2(t)=r_1(t)-h_2(t)\\ \quad \vdots \quad \quad \quad \vdots \\ r_n(t)=r_{n-1}(t)-h_n(t) \end{array}\right. } \end{aligned}$$
    (7)

    in which x(t) can be represented as a sum of n IMFs and a single residual according to Eq. (8).

    $$\begin{aligned} x(t) = \sum _{j=1}^{n} h_j(t)+r_n(t) \end{aligned}$$
    (8)

    where \(r_n(t)\) is the residual denoting the signal’s average trend, and \(h_j(t)\), \(j = 1, 2,\ldots , n\), is the j-th IMF, representing the signal components ordered from high to low frequencies.

The sifting stopping criterion is the standard deviation (SD) given by Eq. (9), typically set between 0.2 and 0.3.

$$\begin{aligned} SD = \sum _{j=1}^{n}\frac{\big | h_j(t)-h_{j-1}(t)\big |^2}{\big |h_{j-1}(t)\big |^2} \end{aligned}$$
(9)
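In practice the sifting loop above is rarely reimplemented by hand. The sketch below assumes the third-party PyEMD package (distributed on PyPI as EMD-signal) and a synthetic test signal, and caps the number of IMFs at four, as is done later in Sect. 5.2.

```python
import numpy as np
from PyEMD import EMD  # assumes the EMD-signal package is installed

rng = np.random.default_rng(42)
t = np.linspace(0, 10, 1000)
# synthetic non-stationary signal: slow and fast oscillations plus noise
x = np.sin(2 * np.pi * t) + 0.5 * np.sin(8 * np.pi * t) \
    + 0.1 * rng.standard_normal(t.size)

emd = EMD()
emd.emd(x, max_imf=4)                        # sifting, capped at four IMFs
imfs, residue = emd.get_imfs_and_residue()   # components of Eq. (8)
print(f"extracted {imfs.shape[0]} IMFs; residue is the trend r_n(t)")
```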

2.4 Long short-term memory (LSTM)

Recurrent neural networks (RNNs) (Amalou et al. 2022) are a network architecture specialized for processing sequential data. However, RNNs suffer from vanishing or exploding gradients. To overcome this, researchers proposed the Long Short-Term Memory (LSTM) neural network (Hochreiter and Schmidhuber 1997). This type of RNN can learn long-term dependencies (Liu et al. 2020) and can relate current information to long-term information in the time sequence. The hidden layer of the traditional RNN is replaced with a memory cell (Wang et al. 2022a), which comprises forget, input, and output gates, as shown in Fig. 1 (Liu et al. 2020; Wang et al. 2022a).

Fig. 1 The architecture of the LSTM cell

The basic structural unit of an LSTM is the memory block. This unit contains a cell used to store data, controlled by three gates: forget, input, and output. All gate units have the same structure: a sigmoid activation function whose output lies in the range [0, 1], followed by element-wise multiplication, which determines the amount of information passing through. The tanh activation function, in contrast, outputs values in the range [−1, 1] (Wang et al. 2022b).

The LSTM input consists of the previous hidden state \(h_{t-1}\) and the current input \(x_t\). The forget gate determines which values in the cell state \(C_{t-1}\) are to be discarded, as defined by Eq. (10) (Fu et al. 2019).

$$\begin{aligned} f_t = \sigma \left( W_f x_t+U_f h_{t-1}+b_f\right) \end{aligned}$$
(10)

where \(f_t\) is an output vector with values in the range [0,1], \(\sigma\) is the sigmoid function, \(W_f\) and \(U_f\) are the weight matrices and \(b_f\) is the bias vector. The input gate updates the information using the result of the sigmoid layer \(i_t\) according to Eq. (11).

$$\begin{aligned} i_t = \sigma \left( W_i x_t + U_i h_{t-1} +b_i\right) \end{aligned}$$
(11)

where \(W_i\), \(U_i\) are the weight matrices and \(b_i\) is the bias vector.

The new candidate values of the cell state vector \({\tilde{C}}_t\) are generated by the tanh function, as determined by Eq. (12).

$$\begin{aligned} \tilde{C_t} =tanh\left( W_c x_t+U_c h_{t-1}+b_c\right) \end{aligned}$$
(12)

\(C_t\) is obtained by multiplying the old cell state \(C_{t-1}\) by \(f_t\) (to forget the unwanted information) and adding the new candidate information \(i_t \otimes \tilde{C_t}\), as given by Eq. (13)

$$\begin{aligned} C_t =f_t \otimes C_{t-1}+i_t \otimes \tilde{C_t} \end{aligned}$$
(13)

with \(\otimes\) indicating element-wise multiplication (Wang et al. 2022a; Fan et al. 2020). The output gate’s value \(o_t\) is obtained according to Eq. (14)

$$\begin{aligned} o_t = \sigma \left( W_o x_t+U_o h_{t-1}+b_o\right) \end{aligned}$$
(14)

where \(W_o\) and \(U_o\) are the weight matrices and \(b_o\) is the bias vector. The output value \(h_t\) is calculated using Eq. (15).

$$\begin{aligned} h_t =o_t \otimes tanh(C_t) \end{aligned}$$
(15)

With the tanh function, the cell state \(C_t\) is scaled to the range [−1, 1]; multiplying it by the output of the output gate \(o_t\) yields the new output value \(h_t\).
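As an illustration of Eqs. (10)–(15) at the network level, the following is a minimal Keras sketch of the kind of two-layer LSTM forecaster tuned later in this work (Sect. 5.3): six input lags, a three-step-ahead output, and dropout between layers. The layer sizes and learning rate here are illustrative placeholders, not the tuned values.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_features, lags=6, horizon=3,
               units=128, units2=128, dropout=0.1, lr=1e-3):
    """Two-layer LSTM regressor: (lags, n_features) -> horizon outputs."""
    model = keras.Sequential([
        layers.Input(shape=(lags, n_features)),
        layers.LSTM(units, return_sequences=True),  # pass sequence onward
        layers.Dropout(dropout),
        layers.LSTM(units2),                        # final hidden state h_t
        layers.Dropout(dropout),
        layers.Dense(horizon),                      # one value per step ahead
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse")                       # the tuning objective
    return model

model = build_lstm(n_features=8)
model.summary()
```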

2.5 Metaheuristics optimization

Stochastic algorithms, including metaheuristics, are often necessary in computer science to tackle NP-hard challenges for which deterministic algorithms are not practical. Metaheuristic algorithms may be classified into categories according to the natural processes they emulate to guide the search. For instance, some methods are inspired by evolution and natural selection, others by the collective behavior of birds and insects (Stegherr et al. 2020; Emmerich et al. 2018; Fausto et al. 2020). The most prominent groups of metaheuristic approaches include nature-inspired algorithms, consisting of genetic algorithms and swarm intelligence, as well as algorithms based on physical phenomena (e.g., storms, gravitational and electromagnetic fields). Other approaches mimic facets of human behavior, such as teaching and learning, brainstorming, or social media activity, or are derived from fundamental mathematical laws that pilot the search, e.g. through the use of trigonometric function oscillations.

Swarm intelligence methods are based on the cooperative actions of groups composed of relatively simple individuals, such as swarms of insects or flocks of birds, which can exhibit astonishingly coordinated and sophisticated behavior patterns while performing fundamental survival tasks such as hunting, foraging, mating, or migrating (Beni 2020; Abraham et al. 2006). These methods have demonstrated significant potential for tackling real-world NP-hard problems, although they are not guaranteed to succeed in every case. Popular swarm intelligence algorithms include particle swarm optimization (PSO) (Kennedy and Eberhart 1995), ant colony optimization (ACO) (Dorigo et al. 2006), the firefly algorithm (FA) (Yang 2009), and the bat algorithm (BA) (Yang 2010; Yang and Gandomi 2012). In the past few years, a highly effective group of metaheuristics has been developed based on mathematical functions and their behavior patterns to guide the search process, the most notable examples being the sine-cosine algorithm (SCA) (Mirjalili 2016) and the arithmetic optimization algorithm (AOA) (Abualigah et al. 2021).

The NFL theorem is the main reason such a variety of optimization methodologies exists. It states that no single algorithm can be universally superior for every optimization task: an algorithm that performs well on one problem may fail entirely on another, highlighting the need for diverse metaheuristic methods and for selecting the most appropriate method for each individual optimization task.

Population-based algorithms have lately become a usual choice for addressing different real-world problems. These algorithms are useful for many fields, such as prediction of COVID-19 cases (Zivkovic et al. 2021a, b), organizing on demand computational services (Bacanin et al. 2019; Bezdan et al. 2020a, b; Zivkovic et al. 2021c), optimizing wireless sensors and IoT (Zivkovic et al. 2020, 2021d), feature selection (Bezdan et al. 2021; Bacanin et al. 2023a), processing and classifying medical images (Bezdan et al. 2020c; Zivkovic et al. 2022), addressing global optimization problems (Strumberger et al. 2019; Preuss et al. 2011), identifying credit card fraud (Jovanovic et al. 2022b; Petrovic et al. 2022), monitoring and forecasting air pollution (Bacanin et al. 2022a; Jovanovic et al. 2023a), detecting network and computer system intrusions (Bacanin et al. 2022b; Stankovic et al. 2022), predicting power generation and energy load (Bacanin et al. 2023b; Stoean et al. 2023), and optimizing different ML models (Salb et al. 2022; Milosevic et al. 2021; Gajic et al. 2021; Bacanin et al. 2022c, d; Jovanovic et al. 2022a, 2023b; Bukumira et al. 2022).

2.6 SHapley Additive exPlanations (SHAP)

SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017) is a method for interpreting the output of ML models, particularly those that are black-box or otherwise difficult to interpret. The methodology of SHAP is rooted in the Shapley value concept from cooperative game theory, which is used to explain the role of each available feature and its impact on decisions. In SHAP, the Shapley value of a feature represents the average influence of that feature on the model’s output across all possible subsets of features.

To calculate the Shapley value for a feature, we first define a reference value for that feature. This reference value could be the average value of that feature in the dataset, or it could be a user-defined value. We then create a set of all possible feature subsets that include the feature we are interested in. For example, if we are interested in the Shapley value of feature A, we might create subsets that include just A, or A and B, or A and C, and so on.

For each subset, we calculate the contribution of the feature to the model’s output relative to the reference value. This contribution may be positive or negative, depending on whether the feature’s value in the subset increases or decreases the model’s output. We then average the feature’s contribution across all possible subsets, which gives us the Shapley value.

Mathematically, the Shapley value for a feature j is determined by:

$$\begin{aligned} \phi _j = \sum _{S\subseteq M {\setminus } \{j\}} \frac{|S|!\,(|M|-|S|-1)!}{|M|!}\left( f(S\cup \{j\})-f(S)\right) \end{aligned}$$
(16)

Here, M is the set of all features, and f(S) is the model’s output for a given subset of features S. The term inside the summation calculates the marginal contribution of feature j to the subset S, and the summation calculates the average contribution across all possible subsets.

In practice, we can estimate the Shapley values for a model using a technique called Kernel SHAP. This involves generating a set of “background” instances that are representative of the dataset and then using these instances to estimate the expected model output for each subset of features. The Shapley values can then be calculated using these expected outputs.

Once we have the Shapley values for all features, we can use them to create a SHAP summary plot, which shows the contribution of each feature to the model’s output. This plot can help us identify which features are most important for the model’s predictions, and how each feature contributes to those predictions.
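The following self-contained sketch shows this workflow with the shap package (Lundberg and Lee 2017) on a toy regressor; the model, data, and sample sizes are illustrative, not those of the experiments in Sect. 6.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

background = shap.sample(X, 50)                # "background" instances
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])    # per-feature contributions, Eq. (16)

# summary plot ranks features by mean |SHAP value|
shap.summary_plot(shap_values, X[:20],
                  feature_names=[f"f{i}" for i in range(4)])
```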

3 Methods

3.1 The original reptile search algorithm (RSA)

The RSA (Abualigah et al. 2022) is an optimization algorithm that mathematically simulates the predatory techniques of crocodiles. The algorithm models two main strategies: encircling and hunting. The details of both are outlined in the original work (Abualigah et al. 2022).

3.2 Initialization phase

Optimization is initialized with a stochastically generated set of candidate solutions (X), as shown in Eq. (17).

$$\begin{aligned} X = \begin{bmatrix} x_{1,1} & \ldots & x_{1,j} & x_{1,n-1} & x_{1,n} \\ x_{2,1} & \ldots & x_{2,j} & \ldots & x_{2,n} \\ \ldots & \ldots & x_{i,j} & \ldots & \ldots \\ \vdots & & \vdots & \vdots & \vdots \\ x_{N-1,1} & \ldots & x_{N-1,j} & \ldots & x_{N-1,n}\\ x_{N,1} & \ldots & x_{N,j} & x_{N,n-1} & x_{N,n} \end{bmatrix} \end{aligned}$$
(17)
$$\begin{aligned} x_{i,j} = rand\times (UB-LB)+LB, \quad j=1,2,\ldots ,n \end{aligned}$$
(18)

in which \(x_{i,j}\) denotes the \(j\)-th position of the \(i\)-th agent, N is the number of candidate agents, and n is the dimensionality of the given problem. In Eq. (18), rand is a random value from a uniform distribution, and LB and UB are the lower and upper bounds of the given problem. During independent testing, multiple distributions were considered, and the uniform distribution yielded the best results.
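A one-line NumPy sketch of Eqs. (17)–(18) follows, generating the uniform random population matrix X; the bounds are assumed to be per-dimension arrays.

```python
import numpy as np

def init_population(N, n, lb, ub, rng=None):
    """Eq. (18): N uniform random agents within [LB, UB] per dimension."""
    if rng is None:
        rng = np.random.default_rng()
    return lb + rng.random((N, n)) * (ub - lb)   # rows form the matrix X
```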

3.2.1 Encircling phase (exploration)

During the encircling phase, crocodiles exhibit two kinds of movement: high walking and belly walking. The RSA alternates between two search phases, encircling (exploration) and hunting (exploitation); the switch among the four behaviors is determined by the current iteration. The position update for the exploration phase is given by Eq. (19).

$$\begin{aligned} x_{(i,j)} (t+1) = {\left\{ \begin{array}{ll} Best_j (t)\times -\eta _{(i,j)} (t)\times \beta -R_{(i,j)} (t)\times rand, & t \le \frac{T}{4} \\ Best_j (t)\times x_{(r_1,j)} \times ES(t)\times rand, & \frac{T}{4} < t \le \frac{T}{2} \end{array}\right. } \end{aligned}$$
(19)

in which \(Best_j (t)\) represents the \(j\)-th position of the current optimal agent, rand is a random number in the range [0, 1], t is the current iteration, and T is the maximum number of iterations. The value \(\eta _{(i,j)}\), calculated by Eq. (20), describes the hunting operator for the \(j\)-th position of the \(i\)-th solution. The parameter \(\beta\) is used to control sensitivity. \(R_{(i,j)}\) is a reduction function (a value used to shrink the search area), given by Eq. (21). The value \(r_1\) is randomly selected from [1, N], and \(x_{(r_1,j)}\) denotes the \(j\)-th position of the randomly chosen \(r_1\)-th solution. The evolutionary sense, denoted ES(t), is a probability ratio taking randomly decreasing values in the interval [−2, 2], calculated using Eq. (22).

$$\begin{aligned} \eta _{(i,j)} = Best_j (t)\times P_{(i,j)} \end{aligned}$$
(20)
$$\begin{aligned} R_{(i,j)} = \frac{ Best_j (t)- x_{(r_2,j)}}{Best_j (t)+\epsilon } \end{aligned}$$
(21)
$$\begin{aligned} ES(t) = 2\times r_3\times \left( 1-\frac{1}{T}\right) \end{aligned}$$
(22)

in which \(P_{(i,j)}\) denotes the percentage difference between the \(j\)-th position of the current solution and the \(j\)-th position of the best-attained solution, determined via Eq. (23). The variable \(\epsilon\) stores a minimal value used to avoid division by zero. The values \(r_2\) and \(r_3\) are random values drawn from [1, N] and \([-1,1]\), respectively.

$$\begin{aligned} P_{(i,j)} = \alpha + \frac{ x_{(i,j)}- M(x_i)}{Best_j (t)\times (UB_{(j)}-LB_{(j)})+\epsilon } \end{aligned}$$
(23)

in which \(M(x_i)\) represents the mean position of the \(i\)-th agent, determined per Eq. (24). \(LB_{(j)}\) and \(UB_{(j)}\) are the lower and upper constraints for the \(j\)-th position, and the parameter \(\alpha\) is used to regulate sensitivity.

$$\begin{aligned} M(x_i) = \frac{1}{n} \sum _{j=1}^{n} x_{(i,j)} \end{aligned}$$
(24)

3.2.2 Hunting phase (exploitation)

During the hunting phase, the exploitation mechanisms of the RSA come into play. While hunting, crocodiles perform either hunting coordination or hunting cooperation. Accordingly, the position update for the exploitation phase is given by Eq. (25).

$$\begin{aligned} x_{(i,j)} (t+1) = {\left\{ \begin{array}{ll} Best_j (t)\times - P_{(i,j)} (t)\times rand, & \frac{T}{2} < t \le \frac{3T}{4} \\ Best_j (t)-\eta _{(i,j)} (t)\times \epsilon -R_{(i,j)} (t)\times rand, & \frac{3T}{4} < t \le T \end{array}\right. } \end{aligned}$$
(25)

in which \(Best_j (t)\) represents the \(j\)-th position of the best-obtained agent thus far, \(\epsilon\) is a small value, and \(P_{(i,j)}\), \(\eta _{(i,j)}\), and \(R_{(i,j)}\) are given by Eqs. (23), (20), and (21), respectively.
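For clarity, the four position-update rules of Eqs. (19)–(25) are condensed into the single NumPy sketch below. Fitness evaluation, greedy selection, and boundary handling beyond simple clipping are omitted, and the parameter defaults are illustrative; this is a sketch of the update rules, not the reference implementation of Abualigah et al. (2022).

```python
import numpy as np

def rsa_update(X, best, t, T, lb, ub, alpha=0.1, beta=0.01,
               eps=1e-10, rng=None):
    """One RSA position update over population X given the best agent."""
    if rng is None:
        rng = np.random.default_rng()
    N, n = X.shape
    M = X.mean(axis=1, keepdims=True)               # Eq. (24)
    P = alpha + (X - M) / (best * (ub - lb) + eps)  # Eq. (23)
    eta = best * P                                  # Eq. (20)
    r2 = rng.integers(N, size=N)
    R = (best - X[r2]) / (best + eps)               # Eq. (21)
    ES = 2 * rng.uniform(-1, 1) * (1 - 1 / T)       # Eq. (22)
    rand = rng.random((N, n))
    if t <= T / 4:                                  # high walking
        X_new = best * -eta * beta - R * rand
    elif t <= T / 2:                                # belly walking
        r1 = rng.integers(N, size=N)
        X_new = best * X[r1] * ES * rand
    elif t <= 3 * T / 4:                            # hunting coordination
        X_new = best * -P * rand
    else:                                           # hunting cooperation
        X_new = best - eta * eps - R * rand
    return np.clip(X_new, lb, ub)
```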

The pseudocode for the described RSA is shown in Algorithm 1.

Algorithm 1 Pseudo-code of the RSA

3.3 Proposed modified RSA approach

The prioritization of exploitation over exploration in the RSA leads to a lack of variety in the population and premature convergence. This implies that the starting positions of the solutions have a significant influence on the final outcomes. The objective of this research is to enhance the RSA algorithm by tackling the problem of limited exploration, ensuring adequate population diversity both at initialization and during execution. To accomplish this, two adjustments are implemented in the elementary RSA metaheuristic: a new approach to initialization and a mechanism for preserving diverse solutions during the execution of the algorithm.

3.3.1 New initialization scheme

The method introduced in this study utilizes a traditional initialization equation to produce the set of individuals in the initial population:

$$\begin{aligned} X_{i,j} = lb_{j} + \psi \cdot (ub_{j}-lb_{j}), \end{aligned}$$
(26)

in which \(X_{i,j}\) denotes the j-th component of the i-th solution, \(lb_{j}\) and \(ub_{j}\) represent the lower and upper bounds of component j, and \(\psi\) is a pseudo-random value drawn uniformly from [0, 1].

Still, Rahnamayan et al. (2007) have shown that applying the quasi-reflection-based learning (QRL) approach to the population produced by Eq. (26) allows a wider search area to be explored. Consequently, for every component j of a solution (\(X_{j}\)), a quasi-reflected opposite component (\(X_{j}^{qr}\)) is produced in the following way:

$$\begin{aligned} X_{j}^{qr}=\text {rnd}\bigg (\dfrac{lb_{j}+ub_{j}}{2},x_{j}\bigg ), \end{aligned}$$
(27)

where rnd selects a uniformly random number within the \(\bigg [\dfrac{lb_{j}+ub_{j}}{2},x_{j}\bigg ]\) limits.

Following the QRL procedure, the proposed initialization approach does not affect the complexity of the algorithm in terms of fitness function evaluations (FFEs), as the standard equation produces only half of the entire population (NP/2), with the other half generated by reflection. The initialization procedure used in this research is outlined in Algorithm 2 and sketched in code below.

Algorithm 2 Proposed initialization procedure that incorporates the QRL method
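A compact NumPy sketch of Algorithm 2 follows: half the population is drawn by Eq. (26) and the other half by quasi-reflection, Eq. (27). The population size NP is assumed even, and the helper name is illustrative.

```python
import numpy as np

def qrl_init(NP, n, lb, ub, rng=None):
    """QRL-aided initialization: NP/2 uniform solutions plus their
    quasi-reflected opposites (Eqs. (26)-(27))."""
    if rng is None:
        rng = np.random.default_rng()
    half = NP // 2
    X = lb + rng.random((half, n)) * (ub - lb)      # Eq. (26)
    mid = (lb + ub) / 2.0
    # Eq. (27): uniform draw between the domain midpoint and each component
    low, high = np.minimum(mid, X), np.maximum(mid, X)
    X_qr = rng.uniform(low, high)
    return np.vstack([X, X_qr])
```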

Extensive experiments have demonstrated that this initialization scheme exhibits two important advantages. First, it enhances the diversity of the starting population, which can improve the algorithm’s outcomes at the beginning of the run. Second, it enables the algorithm to cover a wider search area with the same population size, giving the search procedure an initial boost as well.

3.3.2 Procedure for preserving population diversity

To evaluate whether the algorithm’s search mechanism is converging or diverging, one method is to assess the diversity of the population, as explained in Cheng and Shi (2011). That study employs a new definition of population diversity based on the L1 norm, which accounts for diversity arising from two factors: the solutions generated by the population and the problem’s dimensionality.

Furthermore, Cheng and Shi (2011) emphasize the importance of the dimension-wise component of the L1 norm, which can be utilized to evaluate the search mechanism of the algorithm under study.

Suppose m represents the number of solutions in the population, and n denotes the problem’s dimensionality. The L1 norm can be calculated as presented in Eqs. (28) to (30):

$$\begin{aligned} \overline{x_{j}} = \frac{1}{m}\sum _{i=1}^{m}x_{ij} \end{aligned}$$
(28)
$$\begin{aligned} \Theta ^p_{j} = \frac{1}{m}\sum _{i=1}^{m}\Big \vert x_{ij} - {\overline{x}}_j \Big \vert \end{aligned}$$
(29)
$$\begin{aligned} \Theta ^p = \frac{1}{n}\sum _{j=1}^{n} \Theta ^p_j \end{aligned}$$
(30)

In this context, \({\overline{x}}\) refers to the array containing the average positions of the solutions across each dimension, \(\Theta ^p_{j}\) represents the array of position diversities of the individuals calculated using the L1 norm, and \(\Theta ^p\) denotes the overall diversity of the population as a scalar.

During the initial rounds of the algorithm’s execution, the population’s diversity is high, since the solutions are generated using the standard initialization equation (26). As the method converges towards an optimal or sub-optimal solution in later rounds, the diversity decreases. To regulate this, the enhanced RSA algorithm proposed in this study uses the L1 norm to monitor the population’s diversity throughout the run, via a dynamic diversity threshold control parameter, represented by \(\Theta _t\).

A technique is introduced to preserve population diversity via an extra control factor, referred to as nrs, that specifies the number of individuals to be substituted. The approach functions in the following way: at the outset of the algorithm, the dynamic diversity threshold, labeled \(\Theta _{t0}\), is established. During each round of execution, the current population diversity, \(\Theta ^P\), is assessed and compared to the dynamic diversity threshold \(\Theta _t\). If the condition \(\Theta ^P<\Theta _t\) is met, indicating that the population’s diversity is insufficient, the worst nrs individuals are replaced with random solutions created using a method comparable to the one used for initialization. A short code sketch of this mechanism follows.
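The sketch below expresses the mechanism in a few lines of NumPy: Eqs. (28)–(30) give the diversity measure, Eq. (32) the threshold decay, and the worst nrs agents are resampled when diversity drops below the threshold. Minimization of the objective is assumed, and the helper names are illustrative.

```python
import numpy as np

def population_diversity(X):
    """L1-norm diversity of Eqs. (28)-(30)."""
    xbar = X.mean(axis=0)                        # Eq. (28)
    theta_j = np.abs(X - xbar).mean(axis=0)      # Eq. (29)
    return theta_j.mean()                        # Eq. (30)

def decay_threshold(theta_t, t, T):
    """Eq. (32): linear reduction of the dynamic diversity threshold."""
    return theta_t - theta_t * t / T

def enforce_diversity(X, fitness, theta_t, nrs, lb, ub, rng=None):
    """Replace the worst nrs agents with random ones when diversity
    falls below theta_t (objective is assumed to be minimized)."""
    if rng is None:
        rng = np.random.default_rng()
    if population_diversity(X) < theta_t:
        worst = np.argsort(fitness)[-nrs:]       # largest objective = worst
        X[worst] = lb + rng.random((nrs, X.shape[1])) * (ub - lb)
    return X
```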

After conducting empirical simulations and theoretical analysis, the formula for computing \(\Theta _{t0}\) can be expressed in the following manner.

$$\begin{aligned} \Theta _{t0} = \sum _{j=1}^{NP}\frac{(ub_{j}-lb_{j})}{2 \cdot NP} \end{aligned}$$
(31)

As the algorithm progresses, the population is expected to gradually approach the optimal search region. Thus, the dynamic diversity threshold \(\Theta _{t}\) must be lowered from its starting value \(\Theta _{t0}\), which is calculated by applying Eq. (31). This reduction is accomplished with a linearly decreasing function, as shown in Eq. (32), with T denoting the maximum number of rounds and \(\Theta _{t0}\) representing the initial diversity threshold.

$$\begin{aligned} \Theta _{t+1} = \Theta _{t} - \Theta _{t}\cdot \frac{t}{T}, \end{aligned}$$
(32)

Here, t and \(t+1\) denote the current and next rounds, and T represents the iteration maximum. As the algorithm continues, \(\Theta _t\) is dynamically reduced until, eventually, the mechanism described above is no longer triggered and \(\Theta ^P\) is disregarded.

3.3.3 Inner workings of the suggested algorithm

The introduced modified RSA algorithm improves on the admirable performance of the basic RSA and was therefore named the enhanced RSA (ERSA); its internal structure is provided in Algorithm 3. Looking at the pseudo-code, it can be noted that the suggested modifications are integrated into the basic variant of the RSA described in Algorithm 1 (Abualigah et al. 2022).

Algorithm 3 ERSA pseudocode

3.4 Evaluation metrics

The observed models’ simulation outcomes have been validated by applying a collection of traditional ML measurements, namely the mean squared error (MSE) calculated by Eq. (33), the root mean squared error (RMSE) obtained by Eq. (34), the mean absolute error (MAE) specified by Eq. (35), and finally the coefficient of determination (\(R^2\)) determined by Eq. (36).

$$\begin{aligned} MSE = \frac{1}{N}\sum _{i=1}^{N}\left( \hat{p_i} - p_i\right) ^{2} \end{aligned}$$
(33)
$$\begin{aligned} RMSE = \sqrt{\frac{1}{N}\sum _{i=1}^{N}\left( \hat{p_i} - p_i\right) ^{2}} \end{aligned}$$
(34)
$$\begin{aligned} MAE = \frac{1}{N}\sum _{i=1}^{N}\left| \hat{p_i}-p_i\right| \end{aligned}$$
(35)
$$\begin{aligned} R^2 = 1- \frac{\sum _{i = 1}^{N} \left( p_{i} - \hat{p_{i}} \right) ^2}{\sum _{i = 1}^{N}\left( p_{i} - {\bar{p}} \right) ^2} \end{aligned}$$
(36)

where \(p_{i}\) and \(\hat{p_i}\) mark the arrays of observed and predicted values, respectively, both containing N entries, and \({\bar{p}}\) is the mean of the observed values. This research employs the MSE as the objective function to be minimized. A direct computation of these indicators is sketched below.
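For reference, the four indicators of Eqs. (33)–(36) can be computed directly, as in this short NumPy sketch.

```python
import numpy as np

def regression_metrics(p, p_hat):
    """Eqs. (33)-(36) over observed p and predicted p_hat."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    mse = np.mean((p_hat - p) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(p_hat - p))
    r2 = 1 - np.sum((p - p_hat) ** 2) / np.sum((p - p.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```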

4 Experiments with standard bound-constrained functions

Before evaluating the performance of the ERSA metaheuristic on practical RNN tuning for wind energy time-series forecasting, the proposed method was first tested on standard bound-constrained (unconstrained) benchmarks, in compliance with well-established practices from the modern literature.

The CEC2019 test suite was chosen for two reasons: this set of functions is more challenging and complex than other benchmark suites (e.g. standard test instances and CEC2017), and the basic RSA was also evaluated on it when the approach was first introduced (Abualigah et al. 2022). The suite consists of ten functions, whose details (name, dimension, search space constraints, global optimum) are given in Table 1.

The CEC2019 simulations were conducted with both the introduced ERSA and the original RSA. Additionally, to make the comparative analysis more comprehensive, other cutting-edge metaheuristics were also considered: particle swarm optimization (PSO) (Kennedy and Eberhart 1995), artificial bee colony (ABC) (Karaboga and Basturk 2008), the firefly algorithm (FA) (Yang 2009), Harris hawks optimization (HHO) (Heidari et al. 2019), the whale optimization algorithm (WOA) (Mirjalili and Lewis 2016), and the chimp optimization algorithm (ChOA) (Khishe and Mosavi 2020). This particular set of algorithms was chosen to strike a balance between traditional methods like PSO and more recent ones such as ChOA.

Experimental conditions similar to those in Abualigah et al. (2022) were adopted for this research, with the maximum number of iterations (T) and the population size (N) set to 500 and 30, respectively. Additionally, due to the methods’ stochastic behavior, experiments were repeated in 30 independent runs, with the best, worst, mean, median, and standard deviation metrics captured.

All evaluated methods were implemented specifically for this research, and results for the approaches also tested in Abualigah et al. (2022) were not carried over. However, it should be pointed out that the simulations conducted for this work produced results for the basic RSA, WOA, and PSO very similar to those reported in Abualigah et al. (2022).

Table 1 Review of CEC2019 benchmark function problems details

The performance metrics comparison between the competing algorithms is given in Table 2, with the best-attained outcomes highlighted in bold text.

Table 2 Results of the RSA using the CEC2019 test functions

The outcomes of the CEC2019 function testing indicate that, on average, the proposed ERSA demonstrates the best performance, although it is apparent that this is not the case on every test function. Nevertheless, this is to be expected: per the NFL theorem of optimization, no single algorithm is equally effective across all applications, and experimentation is required to determine the algorithms best suited to a given problem.

Further analysis helps clarify the strengths and weaknesses of the introduced algorithm in comparison to contemporary methods. Interestingly, on certain functions such as F9, while the algorithm did not display the best performance, it attained the best STD score, indicating that, although the introduced algorithm misses the best solution by a small margin, it is the most robust and reliable option. This observation is somewhat mirrored for function F4, where despite attaining a good score for both the best and average run, the algorithm does not demonstrate the highest reliability, further emphasizing the importance of extensive experimentation and the NFL theorem.

A more direct improvement can be observed on F2, where despite identical scores for the best, average, and worst runs between the RSA and the introduced ERSA, the introduced algorithm attained a significant improvement in robustness, as demonstrated by a decrease in the STD. On F10, both algorithms performed slightly less favorably than the ABC algorithm, with the original RSA performing slightly better in the worst run, while the introduced ERSA attained better performance in the best run, evening out average performance. On test function F3, most metaheuristics performed similarly well. In most other cases, the introduced algorithm achieved a performance increase compared to the original and also outperformed the competing metaheuristics.

Finally, the metaheuristics were ranked according to their average performance scores, with the best-performing algorithms receiving the lower ranks and less effective algorithms progressively higher ones. The proposed algorithm attained the best ranking in the majority of cases, closely followed by the original RSA. Average rankings across all functions are also shown: the introduced ERSA attained an average rank of 1.5, followed by the original RSA with an average rank of 2.4. Based on these rankings, the introduced ERSA performs very well in comparison with contemporary algorithms while also improving on the strong performance of the RSA. It is also worth noting that all algorithms were independently implemented and tested, and the attained results are in line with previous works (Abualigah et al. 2022) that similarly evaluated the original RSA against several state-of-the-art algorithms.

Finally, to visualize the performance of the evaluated methods, convergence speed graphs of the mean run for arbitrarily chosen instances are shown in Fig. 2. On the F2, F4, and F8 benchmarks, the ERSA converges relatively fast; on the F10 test, the proposed method exhibits relatively stable, but not the best, performance.

Fig. 2 Convergence speed graphs for arbitrarily chosen CEC2019 functions

5 Utilized datasets and basic experimental setup

5.1 Overview of wind generation datasets

As already noted in the Introduction, two datasets were employed to evaluate the performance of the LSTM models for multivariate time-series forecasting. This part of the manuscript provides a brief description of both. For ease of reference, the first dataset is labeled the “wind dataset”, while the second is referred to as the “wind farm dataset”.

5.1.1 Wind dataset

The hourly energy demand generation and weather dataset, available online, was compiled from two primary sources. The first source covers electrical generation and consumption data for Spain, provided by the ENTSO-E public portal for Transmission System Operator (TSO) data. It covers several sources of power, such as biomass, geothermal, wind, solar, and fossil power. The second source includes relevant meteorological data for Valencia, Spain, provided by a weather API available online. It covers hourly-resolution data for temperature, pressure, humidity, wind speed, wind direction, and rainfall.

The compiled dataset covers a total of 4 years of weather, load, and generation data for Spain at hourly resolution, for the years 2015–2018. The available information makes this compiled dataset an excellent candidate for forecasting power generation from meteorological data. In this research, the onshore wind power generation data is used as the target variable, while the available weather data is used as input.

However, since this is a relatively large dataset, only the period from January 1, 2018 to December 31, 2018 was used for simulations, comprising 8759 observations. The training-validation-testing split of the dataset, demonstrated on the target feature, can be seen in Fig. 3.

Fig. 3 Wind dataset split shown on the target feature

5.1.2 Wind farm dataset

Originally a competition dataset from the GEFCom2012 challenge (Hong et al. 2014), and for a time available on Kaggle, the wind farm dataset has been repurposed for use in wind energy generation forecasting. It was introduced with the aim of improving forecasting practices and their utility across industries, bridging academic research and industry practice. The dataset continues that legacy in this work, where it is used to assess the proposed models’ performance on the complex task of wind power generation forecasting.

The dataset encompasses weather and generation data for 7 anonymized wind farms in mainland China. The power generation data of the farms has been normalized to ensure anonymity. The wind farm data is accompanied by 24-hour forecasts of relevant wind meteorological data issued every 12 hours. Wind speed and wind direction are provided alongside the zonal and meridional wind components for each wind farm (Fig. 4).

Fig. 4 Wind farm dataset split shown on the target feature

For experimental purposes, the meteorological data was trimmed to 12 hours of predictions per forecast, creating a continuous hourly resolution. This was then combined with the available normalized real-world wind generation data for each respective wind farm on an hourly basis. The original dataset covered four and a half years of hourly-resolution data, with the final half year reserved for testing; however, that final half year of data was never made available.

Due to the large number of observations available in this dataset, as well as missing values in its later parts, a reduced portion of the available data was used during experimentation; the sheer volume of data makes training and evaluating models very resource-intensive. Therefore, the experimental dataset used in the simulations covers 2 years of data (from January 1, 2009 to December 31, 2010) for a single anonymized wind farm. The final version of the utilized dataset contains a total of 13176 instances for wind farm 2. The training-validation-testing split, demonstrated on the target feature, can be seen in Fig. 4.

5.2 Decomposition

The initial stage of experimentation involves applying decomposition techniques to the available input features. This is done to divide the complex input feature signals into a series of simpler sub-signals that can later be used for forecasting. The process does, however, increase the number of features the network needs to handle.

The two techniques tested in this work are VMD (Dragomiretskiy and Zosso 2013) and EMD (Huang et al. 1998). VMD was carried out with K set to three, resulting in a total of four signals: three signals represent the attained modes, while the fourth holds the residual values not encompassed by any mode. All VMD modes and residuals for the wind and wind farm datasets are shown in Figs. 5 and 6, respectively.

Fig. 5 Wind dataset input feature VMD modes and residuals

Fig. 6 Wind farm dataset input feature VMD modes and residuals

Similarly, the number of IMFs for the EMD technique was limited to a maximum of four. It is important to note that once a new IMF can no longer be extracted, EMD terminates the process early, resulting in fewer components. In addition to the IMF components, a residual component is added, consisting of the signal content that could not be assigned to any IMF. This results in a maximum of 5 sub-signals per input feature. All EMD IMFs and residuals for the wind and wind farm datasets are shown in Figs. 7 and 8, respectively.

Fig. 7 Wind dataset input feature EMD IMFs and residuals

Fig. 8 Wind farm dataset input feature EMD IMFs and residuals

5.3 Experimental setup

The experimental process involves two stages. First, the available data for both datasets was subjected to decomposition. The resulting signal components and residual signals were then fed into LSTM models tasked with forecasting. Each model was given six points of input data and challenged with making forecasts three steps ahead; a minimal windowing sketch is given after Fig. 9. A flowchart of the described process is shown in Fig. 9.

Fig. 9 Flowchart of the forecasting process
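The windowing step can be sketched as follows, assuming the decomposed features are stacked into a (timesteps, features) array with the target in column 0; the helper name and placeholder data are illustrative.

```python
import numpy as np

def make_windows(series, lags=6, horizon=3):
    """Slice a (timesteps, features) array into supervised pairs:
    six input lags mapped to three-step-ahead targets (column 0)."""
    X, y = [], []
    for i in range(len(series) - lags - horizon + 1):
        X.append(series[i:i + lags])
        y.append(series[i + lags:i + lags + horizon, 0])
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)   # placeholder for the decomposed features
X, y = make_windows(data)
print(X.shape, y.shape)         # (92, 6, 5) (92, 3)
```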

Several state-of-the-art metaheuristics were challenged to optimize the hyperparameters of the prediction models in order to improve performance. The evaluated optimizers include the introduced ERSA as well as the original RSA. Additionally, several well-known optimization algorithms were included in the comparative analysis: PSO (Kennedy and Eberhart 1995), ABC (Karaboga and Basturk 2008), FA (Yang 2009), HHO (Heidari et al. 2019), WOA (Mirjalili and Lewis 2016), and ChOA (Khishe and Mosavi 2020). The evaluated optimization algorithms were assigned a population size of five agents and allowed eight iterations to improve solutions. Additionally, to account for the intrinsic randomness of metaheuristic algorithms, results were gathered over 30 independent executions to help attain objective evaluations.

All tested algorithms were tasked with selecting LSTM hyperparameters. The parameter subset selected for optimization was chosen for its high impact on model performance. The optimized parameters and their ranges are as follows: the neuron count in the first layer in the range [100, 300], the learning rate in [0.0001, 0.01], the number of training epochs in [300, 600], the dropout rate in [0.05, 0.2], the total number of network layers in [1, 2], and the number of neurons in the second network layer in [100, 300]. The parameters and their respective constraints are summarized in Table 3, and a decoding sketch follows it. Finally, early stopping has been implemented to help prevent model over-fitting, with a threshold empirically determined as \(\frac{epochs}{3}\): if a model does not improve for \(\frac{epochs}{3}\) epochs, training is terminated early. An added benefit of this approach is a reduction in wasted computational resources. The utilized ranges were determined empirically to give the best outcomes considering the computational costs of optimization.

Table 3 The LSTM hyperparameters included in the optimization and their respective constraints

Experiments were carried out using the Python programming language. Standard ML and AI libraries were utilized, including Keras, TensorFlow, and Scikit-learn. The visuals were generated using the Seaborn and Matplotlib libraries. To facilitate experimentation, a machine with an Intel i9-11900K CPU, 128 GB of RAM, and an RTX 4070 GPU was employed.

6 Achieved results, comparative analysis, and discussion

In this section the experimental outcomes on the two observed datasets, wind and wind farm, are presented. First, the results without decomposition are shown, followed by the results with VMD and EMD employed. Last but not least, this section also provides a SHAP analysis of the best-performing model on each of the two observed datasets. In all tables that contain experimental results, the best outcomes in every regarded category are marked in bold text.

6.1 Wind dataset experimental results

The following presents the outcomes on the wind dataset without decomposition. Table 4 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The LSTM-ERSA model accomplished the best results in terms of the best and median values. The second-best result was scored by the LSTM-ABC method, while LSTM-WOA attained the third-best score. Meanwhile, LSTM-ChOA achieved the best results for the worst and mean metrics. Finally, LSTM-HHO established the best standard deviation and variance scores, suggesting that it provided the steadiest results across the runs.

Table 4 Wind dataset overall metrics for best, worst, mean, and median run without decomposition

Table 5 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix L is used to denote that LSTM is used. It can be noted that the suggested LSTM-ERSA attained the best results for two-samples-forward, three-samples-forward, and overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\), MAE (except for two-samples-forward), and RMSE. The best scores for one-sample-forward predictions were achieved by the LSTM-ABC approach. Looking at the overall results, the second-best algorithm was LSTM-ABC, in front of the LSTM-WOA and LSTM-ChOA methods. The observed LSTM-ERSA attained the best overall MSE value of 0.006592, in front of LSTM-ABC with an MSE value of 0.006605.

Table 5 Wind dataset detailed metrics for each prediction step of the best run without decomposition

The best set of LSTM parameters produced by the top-performing run of each metaheuristic is shown in Table 6. The proposed LSTM-ERSA established the LSTM structure as follows: 140 neurons in the first layer, a learning rate of 0.000894, 517 epochs, a dropout value of 0.108587, and 170 neurons in the second layer.

Table 6 Parameters selected by metaheuristics for best-performing wind prediction models without decomposition

Aiming to provide better insight into the results, visualizations are provided in Figs. 10, 11, and 12. Figure 10 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 11. It can be noted that at the beginning, the RSA and WOA metaheuristics converge faster, but are overtaken by the proposed ERSA in the final rounds of execution. Lastly, the Kernel Density Estimation (KDE) diagram and swarm plot are shown in Fig. 12. The KDE diagram shows the probability density function and can indicate whether or not the results come from a normal distribution. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm.
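Diagnostic figures of this kind can be reproduced with Seaborn along the following lines; the MSE values below are synthetic placeholders for the per-run results, and only three of the eight algorithms are shown for brevity.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
algos = ["ERSA", "RSA", "PSO"]
runs = pd.DataFrame({
    "algorithm": np.repeat(algos, 30),
    "mse": np.concatenate([rng.normal(m, 0.02, 30)
                           for m in (0.50, 0.55, 0.60)]),  # placeholder values
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.violinplot(data=runs, x="algorithm", y="mse", ax=axes[0])  # distribution
sns.kdeplot(data=runs, x="mse", hue="algorithm", ax=axes[1])   # density check
sns.swarmplot(data=runs, x="algorithm", y="mse", ax=axes[2])   # run diversity
plt.tight_layout()
plt.show()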

Fig. 10
figure 10

Wind dataset objective function and \(R^2\) distribution plots for each metaheuristic without decomposition

Fig. 11
figure 11

Wind dataset objective function and \(R^2\) convergence plots for each metaheuristic without decomposition

Fig. 12
figure 12

Wind dataset objective swarm and KDE plots for each metaheuristic without decomposition

6.1.1 Wind dataset with VMD

This section presents the experimental outcomes on the wind dataset when VMD has been applied. Table 7 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The VMD-LSTM-ERSA model accomplished the best results in terms of the best, mean, and median values. The second-best result was scored by the VMD-LSTM-ChOA method, while VMD-LSTM-RSA attained the third-best score. Meanwhile, VMD-LSTM-HHO achieved the best result for the worst metric. Finally, VMD-LSTM-HHO also established the best standard deviation and variance scores, suggesting that it provided the steadiest results over 30 independent runs.

Table 7 Wind dataset overall metrics for best, worst, mean, and median run using VMD

Table 8 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix VL is used to denote that VMD-LSTM is used. It can be noted that the suggested VMD-LSTM-ERSA attained superior results for one-sample-forward and overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\), MAE, and RMSE. The best scores for two-samples-forward predictions were achieved by the VMD-LSTM-WOA approach, while VMD-LSTM-HHO attained the best outcomes for three-samples-forward predictions. Looking at the overall results, the second-best algorithm was VMD-LSTM-ChOA, in front of the VMD-LSTM-RSA method. The observed VMD-LSTM-ERSA attained the best overall MSE value of 0.001704, in front of VMD-LSTM-ChOA with an MSE value of 0.001726.

Table 8 Wind dataset detailed metrics for each prediction step of the best run using VMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic is shown in Table 9. The proposed VMD-LSTM-ERSA established the LSTM structure as follows: 100 neurons in the first layer, a learning rate of 0.010000, 563 epochs, a dropout value of 0.200000, and 100 neurons in the second layer.

Table 9 Parameters selected by metaheuristics for best-performing wind prediction models using VMD

Aiming to provide better insight into the results, visualizations are provided in Figs. 13, 14, and 15. Figure 13 depicts the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 14. It can be noted that in this case, the proposed ERSA exhibits the fastest convergence from beginning to end. Lastly, the KDE diagram and swarm plot for the objective function are shown in Fig. 15. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm, and it can be noted that all the solutions of the proposed ERSA are in close proximity to the optimum in this case.

Fig. 13
figure 13

Wind dataset objective function and \(R^2\) distribution plots for each metaheuristic with VMD

Fig. 14
figure 14

Wind dataset objective function and \(R^2\) convergence plots for each metaheuristic with VMD

Fig. 15
figure 15

Wind dataset objective swarm and KDE plots for each metaheuristic with VMD

6.1.2 Wind dataset with EMD

This section presents the experimental outcomes on the wind dataset when EMD has been applied. Table 10 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The EMD-LSTM-ERSA model accomplished superior results in terms of the best, mean, and median scores. The second-best result was attained by the EMD-LSTM-PSO method, while EMD-LSTM-FA obtained the third-best score. Meanwhile, EMD-LSTM-PSO achieved the best result for the worst metric. Finally, EMD-LSTM-RSA established the best standard deviation and variance scores, suggesting that it provided the steadiest results over 30 independent runs in this scenario.

Table 10 Wind dataset overall metrics for best, worst, mean, and median run using EMD

Table 11 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix EL is used to denote that EMD-LSTM is used. In this scenario, it can be noted that the suggested EMD-LSTM-ERSA attained the best outcomes across the board: for one-, two-, and three-samples-forward as well as overall results, in terms of the objective (MSE) and all other important indicators, namely \(R^2\), MAE, and RMSE. Looking at the overall results, the second-best algorithm was EMD-LSTM-PSO, in front of the EMD-LSTM-FA method. The proposed EMD-LSTM-ERSA attained the best overall MSE value of 0.004831, in front of EMD-LSTM-PSO, which achieved an MSE value of 0.004994.

Table 11 Wind dataset detailed metrics for each prediction step of the best run using EMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for the scenario with EMD employed is shown in Table 12. The proposed EMD-LSTM-ERSA established the LSTM structure as follows: 108 neurons in the first layer, a learning rate of 0.007837, 600 epochs, a dropout value of 0.050000, and 151 neurons in the second layer.

Table 12 Parameters selected by metaheuristics for the respective best-performing wind prediction models using EMD

Aiming to provide better insight into the results, visualizations are provided in Figs. 16, 17, and 18. Figure 16 depicts the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 17. It can be noted that in this case, the proposed ERSA exhibits the fastest convergence from beginning to end. Lastly, the KDE diagram and swarm plot for the objective function are shown in Fig. 18. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm.

Fig. 16
figure 16

Wind dataset objective function and \(R^2\) distribution plots for each metaheuristic with EMD

Fig. 17
figure 17

Wind dataset objective function and \(R^2\) convergence plots for each metaheuristic with EMD

Fig. 18
figure 18

Wind dataset objective swarm and KDE plots for each metaheuristic with EMD

6.1.3 Comparison with other models on the Wind dataset

To demonstrate the comparative performance improvements gained by coupling the optimizers with decomposition techniques in the introduced methodology, the best outcomes of each approach have been compared to several contemporary prediction models. The attained objective (MSE) and \(R^2\) indicator values are shown in Table 13.

Table 13 Comparison of the best performing methods with other contemporary prediction models applied to the wind dataset

As demonstrated in Table 13, the outcomes attained by applying VMD followed by LSTM networks optimized via the introduced metaheuristic demonstrate notable improvements compared to the other approaches applied to the same task.

6.2 Wind farm dataset experimental results

This section presents the results on the wind farm dataset without decomposition. Table 14 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The LSTM-ERSA model accomplished the best results in terms of the best, worst, and mean values. The second-best result was scored by the LSTM-HHO method, while LSTM-ABC attained the third-best score. Meanwhile, LSTM-ABC achieved the best result for the median metric. Last but not least, LSTM-RSA established the best standard deviation and variance scores, suggesting that it provided the steadiest results across the runs.

Table 14 Wind farm dataset overall metrics for best, worst, mean, and median run without decomposition

Table 15 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix L is used to denote that LSTM is used. It can be noted that the suggested LSTM-ERSA attained the best results for two-samples-forward and overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\) and RMSE (though not MAE). The best scores for one-sample-forward predictions were achieved by the LSTM-HHO approach, while the best outcomes for three-samples-forward were attained by LSTM-ChOA. Looking at the overall results, the second-best algorithm was LSTM-HHO, in front of the LSTM-ABC and LSTM-ChOA methods. The observed LSTM-ERSA attained the best overall MSE value of 0.020566, in front of LSTM-HHO with an MSE value of 0.020608.

Table 15 Wind farm dataset detailed metrics for each prediction step of the best run without decomposition

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for this scenario is shown in Table 16. The proposed LSTM-ERSA established the LSTM structure as follows: 180 neurons in the first layer, a learning rate of 0.005919, 439 epochs, a dropout value of 0.165101, and 200 neurons in the second layer for this particular scenario.

Table 16 Parameters selected by metaheuristics for best-performing wind farm prediction models without decomposition

Aiming to provide better insight into the results, visualizations are provided in Figs. 19, 20, and 21. Figure 19 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 20. It can be noted that HHO exhibited slightly faster convergence at one point early on; however, it was overtaken by the proposed ERSA in the final rounds of execution. Finally, the KDE diagram and swarm plot are shown in Fig. 21. The KDE diagram shows the probability density function and can indicate whether or not the outcomes originate from a normal distribution. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm.

Fig. 19
figure 19

Wind farm dataset objective function and \(R^2\) distribution plots for each metaheuristic without decomposition

Fig. 20
figure 20

Wind farm dataset objective function and \(R^2\) convergence plots for each metaheuristic without decomposition

Fig. 21
figure 21

Wind farm dataset objective swarm and KDE plots for each metaheuristic without decomposition

6.2.1 Wind farm dataset with VMD

This section presents the results on the wind farm dataset with VMD employed. Table 17 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The VMD-LSTM-ERSA model accomplished the best results in terms of the best, worst, mean, and median values. The second-best result was scored by the VMD-LSTM-FA method, while VMD-LSTM-PSO attained the third-best score. VMD-LSTM-ERSA also achieved the best result for the standard deviation. Last but not least, VMD-LSTM-WOA established the best variance score.

Table 17 Wind farm dataset overall metrics for best, worst, mean, and median run using VMD

Table 18 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix VL is used to denote that VMD-LSTM is used. It can be noted that the suggested VMD-LSTM-ERSA attained the best scores for overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\) and RMSE (though not MAE). The best scores for one-sample-forward predictions were achieved by the VMD-LSTM-HHO approach, the best results for two-samples-forward were obtained by VMD-LSTM-RSA, while the best outcomes for three-samples-forward were attained by VMD-LSTM-FA. Looking at the overall results, the second-best algorithm was VMD-LSTM-FA, in front of VMD-LSTM-PSO. The proposed VMD-LSTM-ERSA attained the best overall MSE value of 0.006702, in front of VMD-LSTM-FA with an MSE value of 0.006747.

Table 18 Wind farm dataset detailed metrics for each prediction step of the best run using VMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for this scenario is shown in Table 19. The proposed VMD-LSTM-ERSA established the LSTM structure as follows: 142 neurons in the first layer, a learning rate of 0.009502, 600 epochs, a dropout value of 0.151008, and 127 neurons in the second layer for this particular scenario.

Table 19 Parameters selected by metaheuristics for best-performing wind farm prediction models using VMD

Aiming to provide better insight into the results, visualizations are provided in Figs. 22, 23, and 24. Figure 22 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 23. It can be noted that FA exhibited slightly faster convergence at the beginning; however, it was overtaken by the proposed ERSA in the final rounds of execution. Finally, the KDE diagram and swarm plot are shown in Fig. 24. The KDE diagram shows the probability density function and can indicate whether or not the results come from a normal distribution. The swarm plot shows the diversity of the solutions during the last iteration of the best-performing run of each algorithm.

Fig. 22
figure 22

Wind farm dataset objective function and \(R^2\) distribution plots for each metaheuristic with VMD

Fig. 23
figure 23

Wind farm dataset objective function and \(R^2\) convergence plots for each metaheuristic with VMD

Fig. 24
figure 24

Wind farm dataset objective swarm and KDE plots for each metaheuristic with VMD

6.2.2 Wind farm dataset with EMD

This section presents the results on the wind farm dataset with EMD employed. Table 20 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The EMD-LSTM-ERSA model accomplished the best results in terms of the best and mean values. The second-best result was scored by the EMD-LSTM-RSA method, while EMD-LSTM-PSO attained the third-best score. Meanwhile, EMD-LSTM-FA obtained the best result for the worst metric, as well as for the standard deviation and variance. Finally, the best median value was achieved by the EMD-LSTM-RSA method.

Table 20 Wind farm dataset overall metrics for best, worst, mean, and median run using EMD

Table 21 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix EL is used to denote that EMD-LSTM is used. It can be noted that the suggested EMD-LSTM-ERSA attained the best scores for overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\) and RMSE (though not MAE). Also, EMD-LSTM-ERSA attained the best results for two-samples-forward predictions, for all observed indicators. The best scores for one-sample-forward predictions were achieved by the EMD-LSTM-FA approach, while the best outcomes for three-samples-forward were attained by EMD-LSTM-ChOA. Looking at the overall results, the second-best algorithm was EMD-LSTM-RSA, in front of EMD-LSTM-PSO. The proposed EMD-LSTM-ERSA attained the best overall MSE value of 0.020351, in front of EMD-LSTM-RSA with an MSE value of 0.020556.

Table 21 Wind farm dataset detailed metrics for each prediction step of the best run using EMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for this scenario is shown in Table 22. The proposed EMD-LSTM-ERSA established the LSTM structure as follows: 100 neurons in the initial layer, a learning rate of 0.010000, 600 epochs, a dropout value of 0.200000, two layers, and 100 neurons in the second layer for this particular scenario with EMD employed.

Aiming to provide better insight into the results, visualizations are provided in Figs. 25, 26, and 27. Figure 25 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 26. It can be noted that ChOA, PSO, and RSA exhibited slightly faster convergence at the beginning; however, all of them were overtaken by the proposed ERSA in the final rounds of execution. Finally, the KDE diagram and swarm plot are shown in Fig. 27. The KDE diagram shows the probability density function and can indicate whether or not the outcomes originate from a normal distribution. The swarm plot shows the diversity of the solutions during the last iteration of the best-performing run of each algorithm. In this scenario, it can be noted that all outcomes of ERSA were grouped near the best solution at the end of the run.

Fig. 25
figure 25

Wind farm dataset objective function and \(R^2\) distribution plots for each metaheuristic with EMD

Fig. 26
figure 26

Wind farm dataset objective function and \(R^2\) convergence plots for each metaheuristic with EMD

Fig. 27
figure 27

Wind farm dataset objective swarm and KDE plots for each metaheuristic with EMD

6.2.3 Comparison with other models on the Wind farm dataset

To demonstrate the comparative performance improvements gained by coupling the optimizers with decomposition techniques in the introduced methodology, the best outcomes of each approach have been compared to several contemporary prediction models. The attained objective (MSE) and \(R^2\) indicator values are shown in Table 23.

Table 22 Parameters selected by metaheuristics for best-performing wind farm prediction models using EMD
Table 23 Comparison of the best performing methods with other contemporary prediction models applied to the wind farm dataset

As can be observed in Table 23, VMD demonstrated the best performance when applied alongside the metaheuristic-optimized LSTM models.

6.3 Validation: statistical tests

Modern computer science research requires researchers to determine whether introduced improvements are statistically significant, since experimental outcomes alone are usually inadequate to declare that one algorithm outperforms its competitors. This research tested eight methods, including the proposed ERSA metaheuristic, for tuning LSTM networks on two wind power generation datasets. The comparison was therefore conducted among eight methods over six problem instances, utilized for multi-problem analysis as per Eftimov et al. (2016).

Literature recommendations by Eftimov et al. (2016) and Derrac et al. (2011) suggest that statistical tests in such scenarios should build a representative sample of outcomes for each method by taking the average objective value over several independent executions on each problem. Nevertheless, this methodology may not be ideal in the presence of outliers that render the distribution non-normal, and may thus lead to misleading conclusions. Whether the mean objective function value is appropriate for statistical comparison of stochastic methods remains an open question, as per the literature survey cited by Eftimov et al. (2016). Accordingly, before adopting this approach, the suitability of averaging the objective function over the 30 independent runs of the eight methods across the six problem instances was first verified.

The decision was made after performing the Shapiro–Wilk test (Shapiro and Francia 1972) for single-problem analysis using the following procedure: a data sample was constructed for each algorithm and every problem by gathering the results of each run, and the corresponding p-values were computed for all method-problem combinations. The resulting p-values are presented in Table 24.

Table 24 Shapiro–Wilk test scores for the single-problem analysis

Since the p-values in Table 24 are all below \(\alpha =0.05\), the null hypothesis of normality can be rejected. Consequently, the data samples for all method-problem combinations do not originate from a Gaussian distribution, meaning it is not acceptable to utilize the average objective value in further statistical tests. As a result, this study utilized the best values for further statistical analysis.
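The per-combination normality check reduces to repeated calls of scipy.stats.shapiro; the sketch below uses a synthetic sample in place of the 30 real run results behind Table 24.

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
# One sample of 30 objective values per (method, problem) combination;
# a skewed synthetic sample stands in for the real results.
results = {("ERSA", "wind"): rng.lognormal(mean=-5.0, sigma=0.3, size=30)}

for (method, problem), sample in results.items():
    stat, p = shapiro(sample)
    verdict = "reject normality" if p < 0.05 else "cannot reject normality"
    print(f"{method} on {problem}: W={stat:.3f}, p={p:.4f} -> {verdict}")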

Next, the multi-problem, multiple-method statistical analysis was employed, using data samples constructed from the best objective function value over the 30 individual runs of each algorithm on every problem instance. To ensure the valid use of parametric tests, the following conditions were verified: independence, normality, and homoscedasticity of the data, as described by LaTorre et al. (2021). Independence was confirmed because each run began by generating a collection of random solutions. To assess normality, the data samples were subjected to the Shapiro–Wilk test, and the results for each algorithm are presented in Table 25.

Table 25 Shapiro–Wilk test scores for the multiple problem analysis

After conducting the Shapiro–Wilk test for all methods, the resulting p-values were all substantially below \(\alpha =0.05\), as indicated in Table 25. This shows that the conditions for the justified use of parametric tests were not met, and non-parametric tests were employed instead. The suggested ERSA method was set as the control algorithm in all conducted non-parametric tests.

As a result, the Friedman test (Friedman 1937, 1940), a two-way analysis of variance by ranks, was utilized to determine whether the performance level of the proposed ERSA was significantly superior to the other contenders. The application of this type of test, accompanied by the Holm post-hoc procedure, has been proposed by Derrac et al. (2011). The Friedman test scores are provided in Table 26, accompanied by the scores of the Friedman-aligned test, provided in Table 27.

Table 26 Friedman test scores
Table 27 Friedman aligned test scores

The results presented in Table 26 indicate that the suggested ERSA method attained a superior level of performance compared to the rest of the methods in the comparative analysis, scoring the average ranking value of 1.00. In these experiments, the second-best result was obtained by the ChOA algorithm with an average ranking of 4.00, in front of FA in third place with an average ranking of 4.33. The basic implementation of the RSA attained an average ranking of 4.83, showing the clear superiority of the suggested ERSA over the elementary version of the algorithm. Moreover, the Friedman statistic (\(\chi ^2_r = 17.22\)) is greater than the \(\chi ^2\) critical value with seven degrees of freedom (14.07) at the significance level \(\alpha = 0.05\). As the Friedman p-value is \(1.24\times 10^{-9}\), the results vary significantly across the observed methods. This allows the rejection of the null hypothesis (\(H_0\)) and confirms that the proposed ERSA method attained performance that was statistically significantly different from the other contenders. Similar conclusions can be drawn from the Friedman-aligned test scores given in Table 27.
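In scipy, the test reduces to friedmanchisquare over per-problem measurements; the values below are placeholders for the best objective results that underlie Table 26, and only three of the eight methods are shown for brevity.

from scipy.stats import friedmanchisquare

# One list per method: best objective value on each of the six problems
# (placeholder values for illustration only).
ersa = [0.10, 0.21, 0.15, 0.30, 0.12, 0.25]
rsa  = [0.12, 0.24, 0.18, 0.33, 0.15, 0.28]
pso  = [0.13, 0.23, 0.17, 0.34, 0.14, 0.29]

stat, p = friedmanchisquare(ersa, rsa, pso)
print(f"chi2_r = {stat:.2f}, p = {p:.4g}")  # reject H0 when p < alpha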

Last but not least, as discussed by Sheskin (2020), Iman and Davenport's test (1980) may provide more sensitive results than the \(\chi ^2\) approximation, and this particular test was executed as well. The obtained Iman and Davenport score is 3.47, which is greater than the F-distribution's critical value of 2.28, allowing the conclusion that this test rejects \(H_0\) as well.
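The Iman–Davenport statistic follows directly from the Friedman \(\chi ^2_r\); with \(n = 6\) problem instances and \(k = 8\) methods, the reported value can be reproduced as

\[ F_{ID} = \frac{(n-1)\,\chi ^2_r}{n(k-1)-\chi ^2_r} = \frac{(6-1)\times 17.22}{6\times (8-1)-17.22} = \frac{86.10}{24.78} \approx 3.47. \]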

The non-parametric post-hoc Holm's step-down procedure has been employed since both conducted tests reject the null hypothesis, with the findings reported in Table 28. This procedure sorts the regarded methods by their p-values in ascending order of rank and evaluates the i-th method's p-value against \(\alpha /(k-i)\), where k denotes the degrees of freedom (\(k=7\) in this study) and i is the method's position in the sorted order. This study utilized \(\alpha\) threshold values of 0.05 and 0.1. The results reported in Table 28 again clearly imply that the introduced ERSA method statistically significantly outscored all contenders at both regarded significance levels, 0.05 and 0.1.
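The step-down logic can be expressed compactly as follows; the p-values passed in are hypothetical, since the real ones are those reported in Table 28.

import numpy as np

def holm_step_down(p_values, alpha=0.05, k=7):
    """Compare the i-th smallest p-value against alpha / (k - i);
    stop at the first comparison that fails to reject."""
    order = np.argsort(p_values)
    rejected = []
    for i, idx in enumerate(order):
        if p_values[idx] < alpha / (k - i):
            rejected.append(idx)
        else:
            break  # step-down: all remaining hypotheses are retained
    return rejected

print(holm_step_down([0.001, 0.004, 0.012, 0.020, 0.030, 0.210, 0.400]))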

Table 28 Holm’s step-down procedure

6.4 Best model interpretation and SHAP analysis

This section brings forward the interpretation of the top-performing model on each of the two regarded datasets. In the previously discussed method, SHAP values were used to estimate feature importance. These values compare the model's predictions with and without each feature, demonstrating the impact of the feature on the observed model's output. Because the order of features can influence predictions, feature importance is calculated over every possible ordering to ensure fair feature comparisons (García and Aznarte 2020). This study developed separate models to assess wind energy generation, and the importance and impact of the features were evaluated for each model. Specifically, the analysis focused on how each predictor variable affected the model's prediction for each observation.
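A minimal sketch of this workflow with the shap package follows, using a tiny stand-in LSTM and random windows where the trained forecaster and real data would go; GradientExplainer is an assumption here, as the paper does not name the explainer variant it used.

import numpy as np
import shap
from tensorflow import keras

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6, 4)).astype("float32")  # 4 dummy features
X_test = rng.normal(size=(50, 6, 4)).astype("float32")

model = keras.Sequential([
    keras.layers.LSTM(16, input_shape=(6, 4)),
    keras.layers.Dense(3),                     # three-steps-ahead output
])

explainer = shap.GradientExplainer(model, X_train[:100])
shap_values = explainer.shap_values(X_test)
# Depending on the shap version, multi-output models yield a list of arrays
# (one per forecast step); take the first step either way.
sv = shap_values[0] if isinstance(shap_values, list) else shap_values[..., 0]

# Mean absolute SHAP value per input feature, aggregated over the lookback
# window, yields a global importance ranking like the one in the figures.
importance = np.abs(sv).mean(axis=(0, 1))
print("mean |SHAP| per feature:", importance)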

6.4.1 Wind dataset SHAP analysis

Figure 28 shows the impacts of each feature for the LSTM-ERSA model on the wind dataset, while Fig. 29 presents the impacts of the best-performing model with VMD, namely VMD-LSTM-ERSA. A closer look at the waterfall plots (left part of each figure) indicates that the most important feature in this scenario is temperature, followed by onshore wind generation and pressure.

Fig. 28
figure 28

Wind dataset without decomposition best performing LSTM model feature impact determined through SHAP analysis

Fig. 29
figure 29

Wind dataset with VMD best performing LSTM model feature impact determined through SHAP analysis

One interesting observation is that a significant shift in feature influence occurs following the application of VMD. The top-performing model trained without decomposition places the highest value on temperature. However, following the application of VMD, the first three decomposed modes of onshore wind generation become the most significant features. This is likely due to the noise associated with onshore wind generation data. In its raw form, onshore wind generation is a complex and noisy data sequence, while temperature is a smoother, more predictable sequence less prone to sudden shifts. Following decomposition, the onshore wind generation modes become more reliable, predictable, and less noisy, allowing these features to contribute more significantly to the model, even more than temperature. It is also interesting to note that the residual components, the portions of the data that could not be assigned to a specific mode and that mostly contain noise, have the lowest influence on model predictions.

6.4.2 Wind farm dataset SHAP analysis

In Fig. 30 the impact of each observed feature for the best-performing LSTM-ERSA model can be seen for the wind farm dataset. Following this, Fig. 31 likewise demonstrates feature impacts after data decomposition using VMD, for the VMD-LSTM-ERSA model. Further details can be observed in the accompanying waterfall diagrams of each figure.

Fig. 30
figure 30

Wind farm dataset without decomposition best performing LSTM model feature impact determined through SHAP analysis

Fig. 31
figure 31

Wind farm dataset with VMD best performing LSTM model feature impact determined through SHAP analysis

The findings of the SHAP analysis indicate a strong influence of the wind power component, followed closely by wind speed. These findings are further reinforced by the analysis of the models working with decomposed data, where wind power modes, followed by wind speed modes, show a significant influence. It can also be noted that the influence of the residual components, consisting mostly of noise, is very low for most features. Nevertheless, the wind power residual does show a noticeable influence on the prediction.

7 Conclusion

This study employed and tuned the LSTM ML model with the aim of optimizing the prediction of power production from renewable sources. First, an improved variant of the swarm intelligence RSA metaheuristic was proposed that surpasses the known deficiencies of the originally introduced algorithm. The introduced algorithm, named ERSA, was then used to adjust the hyperparameters of the LSTM network for wind energy production problems.

The introduced methodology was assessed on two wind production datasets, in three scenarios for each dataset: without decomposition, with VMD, and with EMD employed. The predictions were executed up to three samples forward, and the outcomes were contrasted against those attained by competing metaheuristics applied in identical experimental setups. The obtained results clearly indicate the superiority of the proposed LSTM-ERSA model in all regarded scenarios. Statistical analysis concluded that the improvements attained by the proposed method are statistically significant. Additionally, the influence of proper parameter selection is clearly demonstrated, as the performance of the optimized networks is significantly improved. Finally, SHAP analysis was performed on the best-performing model on each dataset, aiming to assess the impact of the features on each model.

The conducted research has shown a great deal of potential for using hybrid ML models tuned by metaheuristic algorithms for wind power production forecasting. A task such as estimating the expected production of a wind farm is crucial, as the power grid must be capable of balancing power production and consumption at all times. Future research on this important topic will explore developing even more accurate models optimized by various metaheuristic algorithms, as well as emerging decomposition algorithms and their optimization. Additionally, the introduced modified metaheuristic will be applied to emerging challenges.