1 Introduction

Accurate power generation data from renewable sources is crucial for several reasons. Renewable energy generation from wind and photovoltaic sources is inherently variable (Li et al. 2023), meaning that output fluctuates with weather conditions and other factors. Accurate power generation data enables grid operators and energy companies to better anticipate these fluctuations and manage the overall power supply to ensure a stable and reliable grid (Coppitters and Contino 2023). It is also essential for billing purposes, as it ensures that energy providers are properly compensated for the power they generate. Finally, it is important for policy-making and research (Awerbuch and Berger 2003), as it helps to inform decisions about energy infrastructure investment, environmental impact, and renewable energy technology development. In short, accurate power generation data is critical for ensuring the efficiency, reliability, and sustainability of our energy systems.

Historical power generation data could prove a useful indicator for future forecasting (Shi et al. 2012). However, several interconnected parameters affect power generation, all of which are prone to volatile changes, making prediction very difficult due to signal complexity. Decomposition techniques such as Variational Mode Decomposition (VMD) (Rehman and Aftab 2019) and Empirical Mode Decomposition (EMD) (Rehman and Mandic 2010) have the potential to deal with this complexity by breaking a complex signal down into simpler, more easily analyzed components. These techniques are particularly useful for signals that exhibit non-stationary and nonlinear behavior. Both decompose a signal into a set of intrinsic mode functions (IMFs) or variational modes, respectively, that capture the underlying frequency components of the signal. Each IMF or variational mode represents a distinct frequency component, with the highest-frequency modes capturing the most rapid and transient changes and the lowest-frequency modes capturing the slower, more persistent ones. By analyzing the IMFs or variational modes separately, researchers can gain insights into the underlying patterns and dynamics of the signal that would otherwise be obscured by its complexity.

Emerging artificial intelligence (AI) techniques have the potential to accurately forecast production from wind sources in several ways. Models can be trained on large datasets of historical weather and wind data to better predict wind speed and direction, which are critical factors in wind energy production. This can be leveraged to refine the accuracy of production forecasting and help energy providers better anticipate fluctuations in wind power output. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) neural networks are a type of recurrent neural network designed to model sequences and time series data. LSTMs are well suited to time series because they can capture and remember long-term dependencies and patterns in the data over time while avoiding the vanishing gradient problem that affects traditional recurrent neural networks. LSTMs use a system of “gates” to moderate the transmission of data and control the memory of the network, allowing them to selectively forget or remember information from previous time steps. This makes LSTMs highly effective for modeling time series with complex temporal dynamics. However, like many algorithms, they present a set of hyperparameters that require adequate adjustment to attain desirable outcomes.

Hyperparameters are an essential aspect of forecasting AI models, as they can significantly influence model performance (Probst et al. 2019). These parameters are not learned by the model during training but are set by the user beforehand. Choosing the right hyperparameters is critical, as selecting the wrong values can lead to poor performance, slow convergence, or overfitting. Hyperparameter tuning involves selecting optimal values for these hyperparameters so as to attain the best possible outcomes from the model. This is typically done through a combination of trial and error and automated techniques: the model is trained with different hyperparameter values, its performance is evaluated on a validation set, and the hyperparameter values that produce an optimized model are then selected.

Metaheuristic algorithms present a powerful class of optimization algorithms that may be applied to hyperparameter tuning in forecasting AI models (Bacanin et al. 2023b). These algorithms iteratively traverse and analyze the search space of possible solutions without making assumptions about the objective function or the structure of the problem. They are particularly well suited to hyperparameter tuning, as they can handle high-dimensional and non-convex search spaces with many local optima, and they can help overcome some of the limitations of traditional optimization methods. By leveraging the power of iterative search and exploration, metaheuristic algorithms can identify optimal hyperparameter values and improve the accuracy and performance of forecasting AI models. Swarm intelligence metaheuristics are a group of optimization algorithms inspired by the collective behavior of social insects and animals. The concept of decentralized self-organization is at their core: a group of individuals mutually interact to reach a shared goal, and by following simple sets of rules, complex behaviors emerge on a global scale. These algorithms can address non-deterministic polynomial-time hard (NP-hard) optimization problems within reasonable time and resource budgets, something that often poses a problem for traditional methods. That being said, in accordance with the no free lunch (NFL) theorem (Wolpert and Macready 1997), no single methodology is best for all problems; instead, an individual approach is preferred. Therefore, extensive investigation is needed to further improve techniques and methods.

A notably well-performing metaheuristic algorithm used in this research is the reptile search algorithm (RSA) (Abualigah et al. 2022). It is a nature-inspired metaheuristic that mathematically models the hunting behavior of crocodiles, alternating between an encircling (exploration) phase and a hunting (exploitation) phase over the course of the run. By balancing these two mechanisms, the RSA aims to explore the search space widely while progressively exploiting its most promising regions.

A motivation for the conducted research was to further explore and expand the understanding of the novel RSA and its potential for hyperparameter tuning. Additionally, a key motivator was to determine if this already admirably performing metaheuristic can be further improved through hybridization with other well-known powerful optimizers. Finally, this research hopes to introduce a robust AI-based method improved by the introduced metaheuristic in order to better address the pressing real-world issue of wind power generation forecasting.

With this in mind, this work proposes a novel method for forecasting power generated by wind farms based on a time series of meteorological and historical factors. To account for the complexities caused by the volatility of these predictors, two signal decomposition techniques are utilized: VMD and EMD. The processed data is formulated as a time series, and six input lags are utilized to train LSTM neural networks that produce forecasts three steps ahead. To optimize the performance of the models, metaheuristic algorithms are applied for hyperparameter selection. Several metaheuristics are evaluated, and a new modified version of the RSA is introduced. The performance of new and modified algorithms is usually first evaluated on a set of standardized benchmark functions; accordingly, the modified metaheuristic was evaluated on a wide range of standard bound-constrained CEC2019 benchmark functions before being applied to a real-world challenge. Following these evaluations, each metaheuristic-optimized, decomposition-aided LSTM approach is evaluated on two real-world datasets covering two wind power plants in different parts of the world. The best-performing models are interpreted using SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017) to determine the factors that have the highest influence on model predictions.

The central contributions of the presented research work are summarised as:

  • A proposal for a modified variation of the recently introduced RSA that improves on the commendable performance of the original

  • An introduction of a decomposition-aided metaheuristic optimized methodology for wind power generation prognosis

  • An interpretation of the best-performing models using SHAP analysis to better understand the factors that contribute the most to wind power generation

The remainder of the work follows the structure presented here: preceding research that lays the foundation for this work is presented in Sect. 2. The utilized methods and the newly introduced metaheuristic are presented in Sect. 3. The capabilities of the introduced algorithm on bound-constrained functions are shown and discussed in Sect. 4. The experimental setup and the attained results on the two real-world datasets, along with a discussion of those results, are presented in Sects. 5 and 6, respectively. Finally, the conclusions of the work and proposals for future research are given in Sect. 7.

2 Overview of research background and literature review

Research interest in LSTM neural networks for forecasting wind power generation has recently been renewed. Several studies have demonstrated the effectiveness of LSTM-based models in accurately predicting wind power output over short-term and long-term horizons. One such study (Shahid et al. 2021) proposed an LSTM-based model for short-term wind power forecasting that incorporated both meteorological and power data. The authors demonstrated that the model outperformed traditional time series models and other machine learning (ML) algorithms, achieving high accuracy and robustness across different locations and weather conditions.

Another study (Liu et al. 2019) focused on long-term wind power forecasting using LSTM networks. The authors introduced a hybrid model that combined LSTM with the wavelet transform and principal component analysis to capture both the temporal and spatial variations in wind power data. The results showed that the proposed model outperformed traditional statistical models and other ML algorithms, with improved accuracy and robustness over longer forecasting horizons.

In addition to LSTM neural networks, other techniques have been explored to augment wind power forecasting precision, including decomposition techniques such as VMD and EMD. Researchers (Zhang et al. 2016) proposed a VMD-based approach to improve the accuracy of wind power forecasting by decomposing the time series into different frequency components and applying separate models to each component. The authors demonstrated that the VMD-based approach outperformed traditional time series models and other ML algorithms, achieving high accuracy and robustness across different locations and weather conditions. While other approaches to time-series forecasting exist, their potential has already been sufficiently explored in the literature.

Hyperparameter tuning is a paramount aspect of ML and has been tested in the context of wind power forecasting using metaheuristic algorithms. Previous work (Shao et al. 2021) proposed a fireworks algorithm-based approach to optimize the hyperparameters of LSTM neural networks for wind power forecasting. The authors demonstrated that the proposed approach achieved higher forecasting accuracy compared to other optimization techniques, indicating the importance of hyperparameter tuning for accurate wind power forecasting. Moreover, metaheuristics have been applied to optimization across several fields with admirable results, including crude oil price forecasting (Jovanovic et al. 2022a) and environmental sciences (Jovanovic et al. 2023a).

Another approach to augment the interpretability of wind electricity production forecasting models is the use of SHAP (Lundberg and Lee 2017) values. These values provide a way to interpret the reasoning behind ML model decisions by determining the contribution each available feature makes toward the final outcome. SHAP values can be used to understand the factors that influence wind power forecasting accuracy in LSTM-based models, and they provide a more comprehensive and intuitive understanding of model behavior than traditional evaluation metrics alone.

2.1 Motivation

Renewable energy plays a crucial role in addressing our planet’s pressing environmental challenges (Akella et al. 2009; Dinçer et al. 2023; Yüksel et al. 2024). By harnessing sources like solar and wind, we can significantly reduce greenhouse gas emissions and combat climate change (Razmjoo et al. 2021). Embracing renewable energy not only safeguards the environment but also promotes energy security, job creation, and a healthier future for generations to come.

However, methods for generating energy renewably are still developing, and many challenges need to be overcome to facilitate large-scale adoption (Züttel et al. 2022). One major challenge is reliability: being able to forecast the available power can improve system reliability in the long run, increasing the viability of renewable systems.

The potential of metaheuristic algorithms for hyperparameter optimization is well established (Tayebi and El Kafhali 2022); however, it has not yet been explored for wind power generation forecasting using LSTM networks. With novel and more powerful techniques constantly being developed (Mattos Neto et al. 2021; Belotti et al. 2020), evaluation and innovation are required to improve the body of work available to researchers tackling optimization problems. The potential of the recently introduced RSA (Abualigah et al. 2022) has not yet been explored and implemented in energy forecasting. Furthermore, as a relatively recent approach, this algorithm has great potential for improvement through hybridization.

Decomposition offers yet another technique that can improve model training; with new methods continually being developed, experimentation is required to determine their potential for addressing this increasingly pressing problem. Model interpretation techniques (Dwivedi et al. 2023) are increasingly important for building more reliable models and robust systems. This work aims to address the observed research gap and improve the body of available techniques for renewable energy forecasting.

2.2 Variational mode decomposition (VMD)

Variational mode decomposition (VMD) (Rehman and Aftab 2019) is a methodology for decomposing a signal into non-reducible base components. It is based on Wiener filtering and the Hilbert transform (Zhang et al. 2021). This adaptive signal decomposition method decomposes a given signal f(t) into several component signals \(u_k(t)\), each band-limited around a center frequency \(\omega _k\), according to Eq. (1) (Wang and Li 2023).

$$\begin{aligned} \min \limits _{\{u_k\},\{\omega _k\}}\left\{ \sum _{k=1}^{K}\Big \Vert \partial _t\Bigl [\Bigl ( \delta (t)+\frac{j}{\pi t}\Bigr )*u_k(t)\Bigr ] e^{-j\omega _kt}\Big \Vert _2^2\right\} , \quad \text {s.t.} \quad \sum _{k=1}^{K}u_k=f(t) \end{aligned}$$
(1)

where K is the number of decomposed modes, \(\{u_k\} = \{u_1, u_2, \dots , u_K\}\) are the modal components with center frequencies \(\{\omega _k\} = \{\omega _1, \omega _2, \dots , \omega _K\}\), and \(\partial _t\) is the partial derivative with respect to time. \(\delta (t)\) represents the Dirac distribution, f(t) is the original input signal, \(u_k(t)\) is the k-th sub-signal of f(t), and * denotes the convolution operator.

A Lagrange multiplier \(\lambda\) and a quadratic penalty factor \(\alpha\) are incorporated to convert the constrained variational problem into an unconstrained one, as follows:

$$\begin{aligned} L(\{u_k\},\{\omega _k\},\lambda )&=\alpha \sum _{k=1}^{K}\Big \Vert \partial _t\Bigl [\Bigl ( \delta (t) +\frac{j}{\pi t}\Bigr )*u_k(t)\Bigr ]e^{-j\omega _kt}\Big \Vert _2^2 \\&\quad +\left\| f(t)-\sum _{k=1}^{K}u_k(t)\right\| _2^2+ \left\langle \lambda (t),f(t)-\sum _{k=1}^{K}u_k(t)\right\rangle \end{aligned}$$

The Lagrange function is transformed from the time domain to the frequency domain, and the alternating direction method of multipliers (ADMM) is utilized to solve the optimization problem. The modes \(u_k\) and their center frequencies \(\omega _k\) are updated using Eqs. (2) and (3), respectively.

$$\begin{aligned} {\hat{u}}_k^{n+1}(\omega ) = \frac{{\hat{f}}(\omega )-\sum _{i\ne k}{\hat{u}}_i(\omega )+\frac{{\hat{\lambda }}(\omega )}{2}}{1+2\alpha (\omega -\omega _k)^2} \end{aligned}$$
(2)
$$\begin{aligned} \omega _k^{n+1} = \frac{\displaystyle \int _0^\infty \omega \big |{\hat{u}}_k(\omega )\big |^2\,d\omega }{\displaystyle \int _0^\infty \big |{\hat{u}}_k(\omega )\big |^2\,d\omega } \end{aligned}$$
(3)

in which n is the iteration number and \({\hat{\lambda }}\) is the Lagrange multiplier, updated according to Eq. (4).

$$\begin{aligned} {\hat{\lambda }}^{n+1}(\omega )={\hat{\lambda }}^n(\omega )+\tau \left[ {\hat{f}}(\omega )-\sum _{k=1}^{K}{\hat{u}}_k^{n+1}(\omega )\right] \end{aligned}$$
(4)

The iterative process is repeated until the convergence condition given by Eq. (5) is met.

$$\begin{aligned} \frac{\displaystyle \sum _{k=1}^{K}\Big \Vert {\hat{u}}_k^{n+1}(\omega )-{\hat{u}}_k^{n}(\omega )\Big \Vert _2^2}{\Big \Vert {\hat{u}}_k^{n}(\omega )\Big \Vert _2^2}<\epsilon \end{aligned}$$
(5)
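To make the procedure concrete, the following is a minimal NumPy sketch of the ADMM updates in Eqs. (2)–(5). It is an illustrative implementation under simplifying assumptions (no signal mirroring, even-length input, fixed K and \(\alpha\)), not the reference VMD code used in the experiments.

```python
import numpy as np

def vmd(f, K=3, alpha=2000.0, tau=0.1, tol=1e-7, max_iter=500):
    """Sketch of VMD via the frequency-domain updates of Eqs. (2)-(5)."""
    T = len(f)                                   # assumes even length
    f_hat = np.fft.fftshift(np.fft.fft(f))
    freqs = np.arange(T) / T - 0.5               # centered frequency axis
    u_hat = np.zeros((K, T), dtype=complex)      # mode spectra
    omega = np.linspace(0.05, 0.45, K)           # initial center frequencies
    lam = np.zeros(T, dtype=complex)             # multiplier, Eq. (4)
    half = slice(T // 2, T)                      # positive-frequency half
    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            others = u_hat.sum(axis=0) - u_hat[k]
            # Eq. (2): Wiener-filter-style mode update
            u_hat[k] = (f_hat - others + lam / 2) / \
                       (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Eq. (3): center frequency as the spectral centroid
            p = np.abs(u_hat[k, half]) ** 2
            omega[k] = (freqs[half] @ p) / (p.sum() + 1e-12)
        # Eq. (4): dual ascent on the reconstruction constraint
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        # Eq. (5): relative-change convergence test
        diff = sum(np.linalg.norm(u_hat[k] - u_prev[k]) ** 2 /
                   (np.linalg.norm(u_prev[k]) ** 2 + 1e-12) for k in range(K))
        if diff < tol:
            break
    modes = np.real(np.fft.ifft(np.fft.ifftshift(u_hat, axes=-1), axis=-1))
    return modes, omega
```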

2.3 Empirical mode decomposition (EMD)

Empirical mode decomposition (EMD) (Rehman and Mandic 2010) is a signal decomposition approach for reducing the amount of noise in non-stationary time series data (Devi et al. 2020). Unlike Fourier and wavelet analysis, it requires no predefined basis functions. The technique gives good results when analyzing wind speed data series. Using the EMD algorithm, complex time series data can be decomposed into a limited number of Intrinsic Mode Functions (IMFs) (Wang et al. 2022b).

The EMD process proceeds as follows:

  1. Determine the local maxima and minima of the processed signal x(t). Record \(h_1(t)\) as the difference between x(t) and the mean value \(m_1(t)\) of the upper and lower envelopes, according to Eq. (6):

    $$\begin{aligned} h_1(t)=x(t)-m_1(t) \end{aligned}$$
    (6)
  2. \(h_1(t)\), filtered out of the original signal, tends to contain the signal’s highest-frequency component. The difference signal \(r_1(t)\) is obtained by separating \(h_1(t)\) from x(t); in this way, the high-frequency component is removed. The filtering steps are then repeated, with \(r_1(t)\) as the new signal, until the residual signal at the n-th stage becomes a monotonic function. This is shown by Eq. (7).

    $$\begin{aligned} {\left\{ \begin{array}{ll} r_1(t)=x(t)-h_1(t)\\ r_2(t)=r_1(t)-h_2(t)\\ \quad \vdots \quad \quad \quad \vdots \\ r_n(t)=r_{n-1}(t)-h_n(t) \end{array}\right. } \end{aligned}$$
    (7)

    in which x(t) can be represented as a sum of n IMFs and a single residual according to Eq. (8).

    $$\begin{aligned} x(t) = \sum _{j=1}^{n} h_j(t)+r_n(t) \end{aligned}$$
    (8)

    where \(r_n(t)\) is the residual denoting the signal’s average trend, and \(h_j(t)\), \(j = 1, 2,\ldots , n\), is the j-th IMF, representing the signal components ordered from high to low frequencies.

The sifting stopping criterion is the standard deviation (SD) given by Eq. (9), typically set between 0.2 and 0.3.

$$\begin{aligned} SD = \sum _{j=1}^{n}\frac{\big | h_j(t)-h_{j-1}(t)\big |^2}{\big |h_{j-1}(t)\big |^2} \end{aligned}$$
(9)
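In practice the sifting loop above is rarely reimplemented by hand. The sketch below assumes the third-party PyEMD package (distributed on PyPI as EMD-signal) and a synthetic test signal, and caps the number of IMFs at four, as is done later in Sect. 5.2.

```python
import numpy as np
from PyEMD import EMD  # assumes the EMD-signal package is installed

rng = np.random.default_rng(42)
t = np.linspace(0, 10, 1000)
# synthetic non-stationary signal: slow and fast oscillations plus noise
x = np.sin(2 * np.pi * t) + 0.5 * np.sin(8 * np.pi * t) \
    + 0.1 * rng.standard_normal(t.size)

emd = EMD()
emd.emd(x, max_imf=4)                        # sifting, capped at four IMFs
imfs, residue = emd.get_imfs_and_residue()   # components of Eq. (8)
print(f"extracted {imfs.shape[0]} IMFs; residue is the trend r_n(t)")
```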

2.4 Long short-term memory (LSTM)

Recurrent neural networks (RNNs) (Amalou et al. 2022) are a network architecture specialized for processing sequential data. However, RNNs suffer from vanishing or exploding gradients. To overcome this, researchers proposed the Long Short-Term Memory (LSTM) neural network (Hochreiter and Schmidhuber 1997). This type of RNN can learn long-term dependencies (Liu et al. 2020) and can relate current information to long-term information in the time sequence. The hidden layer of the traditional RNN is replaced with a memory cell (Wang et al. 2022a), which comprises forget, input, and output gates, as shown in Fig. 1 (Liu et al. 2020; Wang et al. 2022a).

Fig. 1 The architecture of the LSTM cell

The basic structural unit of an LSTM is the memory block. This unit contains a cell used to store data, controlled by three gates: forget, input, and output. All gate units have the same structure: a sigmoid activation function whose output lies in the range [0, 1], followed by element-wise multiplication, which determines the amount of information passing through. The tanh activation function, in contrast, outputs values in the range [−1, 1] (Wang et al. 2022b).

The LSTM input consists of the previous hidden state \(h_{t-1}\) and the current input \(x_t\). The forget gate determines which values in the cell state \(C_{t-1}\) are to be discarded, as defined by Eq. (10) (Fu et al. 2019).

$$\begin{aligned} f_t = \sigma \left( W_f x_t+U_f h_{t-1}+b_f\right) \end{aligned}$$
(10)

where \(f_t\) is an output vector with values in the range [0,1], \(\sigma\) is the sigmoid function, \(W_f\) and \(U_f\) are the weight matrices and \(b_f\) is the bias vector. The input gate updates the information using the result of the sigmoid layer \(i_t\) according to Eq. (11).

$$\begin{aligned} i_t = \sigma \left( W_i x_t + U_i h_{t-1} +b_i\right) \end{aligned}$$
(11)

where \(W_i\), \(U_i\) are the weight matrices and \(b_i\) is the bias vector.

The new candidate values of the cell state vector \({\tilde{C}}_t\) are generated by the tanh function, as determined by Eq. (12).

$$\begin{aligned} \tilde{C_t} =tanh\left( W_c x_t+U_c h_{t-1}+b_c\right) \end{aligned}$$
(12)

\(C_t\) is obtained by multiplying the old cell state \(C_{t-1}\) by \(f_t\) (to forget the unwanted information) and adding the new candidate information \(i_t \otimes \tilde{C_t}\), as given by Eq. (13)

$$\begin{aligned} C_t =f_t \otimes C_{t-1}+i_t \otimes \tilde{C_t} \end{aligned}$$
(13)

with \(\otimes\) indicating element-wise multiplication (Wang et al. 2022a; Fan et al. 2020). The output gate’s value \(o_t\) is obtained according to Eq. (14)

$$\begin{aligned} o_t = \sigma \left( W_o x_t+U_o h_{t-1}+b_o\right) \end{aligned}$$
(14)

where \(W_o\) and \(U_o\) are the weight matrices and \(b_o\) is the bias vector. The output value \(h_t\) is calculated using Eq. (15).

$$\begin{aligned} h_t =o_t \otimes tanh(C_t) \end{aligned}$$
(15)

With the tanh function, the cell state \(C_t\) is scaled to the range [−1, 1]; multiplying it by the output of the output gate \(o_t\) yields the new output value \(h_t\).
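As an illustration of Eqs. (10)–(15) at the network level, the following is a minimal Keras sketch of the kind of two-layer LSTM forecaster tuned later in this work (Sect. 5.3): six input lags, a three-step-ahead output, and dropout between layers. The layer sizes and learning rate here are illustrative placeholders, not the tuned values.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lstm(n_features, lags=6, horizon=3,
               units=128, units2=128, dropout=0.1, lr=1e-3):
    """Two-layer LSTM regressor: (lags, n_features) -> horizon outputs."""
    model = keras.Sequential([
        layers.Input(shape=(lags, n_features)),
        layers.LSTM(units, return_sequences=True),  # pass sequence onward
        layers.Dropout(dropout),
        layers.LSTM(units2),                        # final hidden state h_t
        layers.Dropout(dropout),
        layers.Dense(horizon),                      # one value per step ahead
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="mse")                       # the tuning objective
    return model

model = build_lstm(n_features=8)
model.summary()
```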

2.5 Metaheuristics optimization

Stochastic algorithms, including metaheuristics, are often necessary in computer science to tackle NP-hard challenges for which deterministic algorithms are not practical. Metaheuristic algorithms may be classified into categories according to the natural processes they emulate to guide the search. For instance, some methods are inspired by evolution and natural selection, others by the collective behavior of birds and insects (Stegherr et al. 2020; Emmerich et al. 2018; Fausto et al. 2020). The most prominent groups of metaheuristic approaches include nature-inspired algorithms, consisting of genetic algorithms and swarm intelligence, as well as algorithms based on physical phenomena (e.g., storms, gravitational and electromagnetic fields). Other approaches mimic facets of human behavior, such as teaching and learning, brainstorming, or social media activity, or are derived from fundamental mathematical laws that pilot the search, e.g. through the use of trigonometric function oscillations.

Swarm intelligence methods are based on the cooperative actions of groups composed of relatively simple individuals, such as swarms of insects or flocks of birds, which can exhibit astonishingly coordinated and sophisticated behavior patterns while performing fundamental survival tasks such as hunting, foraging, mating, or migrating (Beni 2020; Abraham et al. 2006). These methods have demonstrated significant potential for tackling real-world NP-hard problems, although they are not guaranteed to succeed in every case. Popular swarm intelligence algorithms include particle swarm optimization (PSO) (Kennedy and Eberhart 1995), ant colony optimization (ACO) (Dorigo et al. 2006), the firefly algorithm (FA) (Yang 2009), and the bat algorithm (BA) (Yang 2010; Yang and Gandomi 2012). In the past few years, a highly effective group of metaheuristics has been developed based on mathematical functions and their behavior patterns to guide the search process, the most notable examples being the sine-cosine algorithm (SCA) (Mirjalili 2016) and the arithmetic optimization algorithm (AOA) (Abualigah et al. 2021).

The NFL theorem is the main reason such a variety of optimization methodologies exists. It states that no single algorithm can be universally superior for every optimization task: an algorithm that performs well on one problem may fail entirely on another, highlighting the need for diverse metaheuristic methods and for selecting the most appropriate method for each individual optimization task.

Population-based algorithms have lately become a usual choice for addressing different real-world problems. These algorithms are useful for many fields, such as prediction of COVID-19 cases (Zivkovic et al. 2021a, b), organizing on demand computational services (Bacanin et al. 2019; Bezdan et al. 2020a, b; Zivkovic et al. 2021c), optimizing wireless sensors and IoT (Zivkovic et al. 2020, 2021d), feature selection (Bezdan et al. 2021; Bacanin et al. 2023a), processing and classifying medical images (Bezdan et al. 2020c; Zivkovic et al. 2022), addressing global optimization problems (Strumberger et al. 2019; Preuss et al. 2011), identifying credit card fraud (Jovanovic et al. 2022b; Petrovic et al. 2022), monitoring and forecasting air pollution (Bacanin et al. 2022a; Jovanovic et al. 2023a), detecting network and computer system intrusions (Bacanin et al. 2022b; Stankovic et al. 2022), predicting power generation and energy load (Bacanin et al. 2023b; Stoean et al. 2023), and optimizing different ML models (Salb et al. 2022; Milosevic et al. 2021; Gajic et al. 2021; Bacanin et al. 2022c, d; Jovanovic et al. 2022a, 2023b; Bukumira et al. 2022).

2.6 SHapley Additive exPlanations (SHAP)

SHapley Additive exPlanations (SHAP) (Lundberg and Lee 2017) is a method for interpreting the output of ML models, particularly those that are black-box or otherwise difficult to interpret. The methodology of SHAP is rooted in the Shapley value concept from cooperative game theory, which is used to explain the role of each available feature and its impact on decisions. In SHAP, the Shapley value of a feature represents the average influence of that feature on the model’s output across all possible subsets of features.

To calculate the Shapley value for a feature, we first define a reference value for that feature. This reference value could be the average value of that feature in the dataset, or it could be a user-defined value. We then create a set of all possible feature subsets that include the feature we are interested in. For example, if we are interested in the Shapley value of feature A, we might create subsets that include just A, or A and B, or A and C, and so on.

For each subset, we calculate the contribution of the feature to the model’s output relative to the reference value. This contribution may be positive or negative, depending on whether the feature’s value in the subset increases or decreases the model’s output. We then average the feature’s contribution across all possible subsets, which gives us the Shapley value.

Mathematically, the Shapley value for a feature j is determined by:

$$\begin{aligned} \phi _j = \sum _{S\subseteq M {\setminus } \{j\}} \frac{|S|!\,(|M|-|S|-1)!}{|M|!}\left( f(S\cup \{j\})-f(S)\right) \end{aligned}$$
(16)

Here, M is the set of all features, and f(S) is the model’s output for a given subset of features S. The term inside the summation calculates the marginal contribution of feature j to the subset S, and the summation calculates the average contribution across all possible subsets.

In practice, we can estimate the Shapley values for a model using a technique called Kernel SHAP. This involves generating a set of “background” instances that are representative of the dataset and then using these instances to estimate the expected model output for each subset of features. The Shapley values can then be calculated using these expected outputs.

Once we have the Shapley values for all features, we can use them to create a SHAP summary plot, which shows the contribution of each feature to the model’s output. This plot can help us identify which features are most important for the model’s predictions, and how each feature contributes to those predictions.
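The following self-contained sketch shows this workflow with the shap package (Lundberg and Lee 2017) on a toy regressor; the model, data, and sample sizes are illustrative, not those of the experiments in Sect. 6.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2 * X[:, 0] + X[:, 1] ** 2 + 0.1 * rng.normal(size=200)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

background = shap.sample(X, 50)                # "background" instances
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X[:20])    # per-feature contributions, Eq. (16)

# summary plot ranks features by mean |SHAP value|
shap.summary_plot(shap_values, X[:20],
                  feature_names=[f"f{i}" for i in range(4)])
```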

3 Methods

3.1 The original reptile search algorithm (RSA)

The RSA (Abualigah et al. 2022) is an optimization algorithm that mathematically simulates the predatory techniques of crocodiles. The algorithm models two main strategies: encircling and hunting. The details of both are outlined in the original work (Abualigah et al. 2022).

3.2 Initialization phase

Optimization is initialized with a stochastically generated set of candidate solutions (X), as shown in Eq. (17).

$$\begin{aligned} X = \begin{bmatrix} x_{1,1} & \ldots & x_{1,j} & x_{1,n-1} & x_{1,n} \\ x_{2,1} & \ldots & x_{2,j} & \ldots & x_{2,n} \\ \ldots & \ldots & x_{i,j} & \ldots & \ldots \\ \vdots & & \vdots & \vdots & \vdots \\ x_{N-1,1} & \ldots & x_{N-1,j} & \ldots & x_{N-1,n}\\ x_{N,1} & \ldots & x_{N,j} & x_{N,n-1} & x_{N,n} \end{bmatrix} \end{aligned}$$
(17)
$$\begin{aligned} x_{i,j} = rand\times (UB-LB)+LB, \quad j=1,2,\ldots ,n \end{aligned}$$
(18)

in which \(x_{i,j}\) denotes the \(j\)-th position of the \(i\)-th agent, N is the number of candidate agents, and n is the dimensionality of the given problem. In Eq. (18), rand is a random value from a uniform distribution, and LB and UB are the lower and upper bounds of the given problem. During independent testing, multiple distributions were considered, and the uniform distribution yielded the best results.
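A one-line NumPy sketch of Eqs. (17)–(18) follows, generating the uniform random population matrix X; the bounds are assumed to be per-dimension arrays.

```python
import numpy as np

def init_population(N, n, lb, ub, rng=None):
    """Eq. (18): N uniform random agents within [LB, UB] per dimension."""
    if rng is None:
        rng = np.random.default_rng()
    return lb + rng.random((N, n)) * (ub - lb)   # rows form the matrix X
```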

3.2.1 Encircling phase (exploration)

During the encircling phase, crocodiles exhibit two kinds of movement: high walking and belly walking. The RSA alternates between two search phases, encircling (exploration) and hunting (exploitation); the switch among the four behaviors is determined by the current iteration. The position update for the exploration phase is given by Eq. (19).

$$\begin{aligned} x_{(i,j)} (t+1) = {\left\{ \begin{array}{ll} Best_j (t)\times -\eta _{(i,j)} (t)\times \beta -R_{(i,j)} (t)\times rand, & t \le \frac{T}{4} \\ Best_j (t)\times x_{(r_1,j)} \times ES(t)\times rand, & \frac{T}{4} < t \le \frac{T}{2} \end{array}\right. } \end{aligned}$$
(19)

in which \(Best_j (t)\) represents the \(j\)-th position of the current optimal agent, rand is a random number in the range [0, 1], t is the current iteration, and T is the maximum number of iterations. The value \(\eta _{(i,j)}\), calculated by Eq. (20), describes the hunting operator for the \(j\)-th position of the \(i\)-th solution. The parameter \(\beta\) is used to control sensitivity. \(R_{(i,j)}\) is a reduction function (a value used to shrink the search area), given by Eq. (21). The value \(r_1\) is randomly selected from [1, N], and \(x_{(r_1,j)}\) denotes the \(j\)-th position of the randomly chosen \(r_1\)-th solution. The evolutionary sense, denoted ES(t), is a probability ratio taking randomly decreasing values in the interval [−2, 2], calculated using Eq. (22).

$$\begin{aligned} \eta _{(i,j)} = Best_j (t)\times P_{(i,j)} \end{aligned}$$
(20)
$$\begin{aligned} R_{(i,j)} = \frac{ Best_j (t)- x_{(r_2,j)}}{Best_j (t)+\epsilon } \end{aligned}$$
(21)
$$\begin{aligned} ES(t) = 2\times r_3\times \left( 1-\frac{1}{T}\right) \end{aligned}$$
(22)

in which \(P_{(i,j)}\) denotes the percentage difference between the \(j\)-th position of the current solution and the \(j\)-th position of the best-attained solution, determined via Eq. (23). The variable \(\epsilon\) stores a minimal value used to avoid division by zero. The values \(r_2\) and \(r_3\) are random values drawn from [1, N] and \([-1,1]\), respectively.

$$\begin{aligned} P_{(i,j)} = \alpha + \frac{ x_{(i,j)}- M(x_i)}{Best_j (t)\times (UB_{(j)}-LB_{(j)})+\epsilon } \end{aligned}$$
(23)

in which \(M(x_i)\) represents the mean position of the \(i\)-th agent, determined per Eq. (24). \(LB_{(j)}\) and \(UB_{(j)}\) are the lower and upper constraints for the \(j\)-th position, and the parameter \(\alpha\) is used to regulate sensitivity.

$$\begin{aligned} M(x_i) = \frac{1}{n} \sum _{j=1}^{n} x_{(i,j)} \end{aligned}$$
(24)

3.2.2 Hunting phase (exploitation)

During the hunting phase, the exploitation mechanisms of the RSA come into play. While hunting, crocodiles perform either hunting coordination or hunting cooperation. Accordingly, the position update for the exploitation phase is given by Eq. (25).

$$\begin{aligned} x_{(i,j)} (t+1) = {\left\{ \begin{array}{ll} Best_j (t)\times - P_{(i,j)} (t)\times rand, & \frac{T}{2} < t \le \frac{3T}{4} \\ Best_j (t)-\eta _{(i,j)} (t)\times \epsilon -R_{(i,j)} (t)\times rand, & \frac{3T}{4} < t \le T \end{array}\right. } \end{aligned}$$
(25)

in which \(Best_j (t)\) represents the \(j\)-th position of the best-obtained agent thus far, \(\epsilon\) is a small value, and \(P_{(i,j)}\), \(\eta _{(i,j)}\), and \(R_{(i,j)}\) are given by Eqs. (23), (20), and (21), respectively.
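For clarity, the four position-update rules of Eqs. (19)–(25) are condensed into the single NumPy sketch below. Fitness evaluation, greedy selection, and boundary handling beyond simple clipping are omitted, and the parameter defaults are illustrative; this is a sketch of the update rules, not the reference implementation of Abualigah et al. (2022).

```python
import numpy as np

def rsa_update(X, best, t, T, lb, ub, alpha=0.1, beta=0.01,
               eps=1e-10, rng=None):
    """One RSA position update over population X given the best agent."""
    if rng is None:
        rng = np.random.default_rng()
    N, n = X.shape
    M = X.mean(axis=1, keepdims=True)               # Eq. (24)
    P = alpha + (X - M) / (best * (ub - lb) + eps)  # Eq. (23)
    eta = best * P                                  # Eq. (20)
    r2 = rng.integers(N, size=N)
    R = (best - X[r2]) / (best + eps)               # Eq. (21)
    ES = 2 * rng.uniform(-1, 1) * (1 - 1 / T)       # Eq. (22)
    rand = rng.random((N, n))
    if t <= T / 4:                                  # high walking
        X_new = best * -eta * beta - R * rand
    elif t <= T / 2:                                # belly walking
        r1 = rng.integers(N, size=N)
        X_new = best * X[r1] * ES * rand
    elif t <= 3 * T / 4:                            # hunting coordination
        X_new = best * -P * rand
    else:                                           # hunting cooperation
        X_new = best - eta * eps - R * rand
    return np.clip(X_new, lb, ub)
```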

The pseudocode for the described RSA is shown in Algorithm 1.

Algorithm 1 Pseudo-code of the RSA

3.3 Proposed modified RSA approach

The prioritization of exploitation over exploration in the RSA leads to a lack of variety in the population and premature convergence. This implies that the starting positions of the solutions have a significant influence on the final outcomes. The objective of this research is to enhance the RSA algorithm by tackling the problem of limited exploration, ensuring adequate population diversity both at initialization and during execution. To accomplish this, two adjustments are implemented in the elementary RSA metaheuristic: a new approach to initialization and a mechanism for preserving diverse solutions during the execution of the algorithm.

3.3.1 New initialization scheme

The method introduced in this study utilizes a traditional initialization equation to produce the set of individuals in the initial population:

$$\begin{aligned} X_{i,j} = lb_{j} + \psi \cdot (ub_{j}-lb_{j}), \end{aligned}$$
(26)

in which \(X_{i,j}\) denotes the j-th component of the i-th solution, \(lb_{j}\) and \(ub_{j}\) represent the lower and upper bounds of component j, and \(\psi\) is a pseudo-random value drawn uniformly from [0, 1].

Still, Rahnamayan et al. (2007) have shown that applying the quasi-reflection-based learning (QRL) approach to the population produced by Eq. (26) allows a wider search area to be explored. Consequently, for every component j of a solution (\(X_{j}\)), a quasi-reflected opposite component (\(X_{j}^{qr}\)) is produced in the following way:

$$\begin{aligned} X_{j}^{qr}=\text {rnd}\bigg (\dfrac{lb_{j}+ub_{j}}{2},x_{j}\bigg ), \end{aligned}$$
(27)

where rnd selects a uniformly random number within the \(\bigg [\dfrac{lb_{j}+ub_{j}}{2},x_{j}\bigg ]\) limits.

Following the QRL procedure, the proposed initialization approach does not affect the complexity of the algorithm in terms of fitness function evaluations (FFEs), as the standard equation produces only half of the entire population (NP/2), with the other half generated by reflection. The initialization procedure used in this research is outlined in Algorithm 2 and sketched in code below.

Algorithm 2 Proposed initialization procedure that incorporates the QRL method
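A compact NumPy sketch of Algorithm 2 follows: half the population is drawn by Eq. (26) and the other half by quasi-reflection, Eq. (27). The population size NP is assumed even, and the helper name is illustrative.

```python
import numpy as np

def qrl_init(NP, n, lb, ub, rng=None):
    """QRL-aided initialization: NP/2 uniform solutions plus their
    quasi-reflected opposites (Eqs. (26)-(27))."""
    if rng is None:
        rng = np.random.default_rng()
    half = NP // 2
    X = lb + rng.random((half, n)) * (ub - lb)      # Eq. (26)
    mid = (lb + ub) / 2.0
    # Eq. (27): uniform draw between the domain midpoint and each component
    low, high = np.minimum(mid, X), np.maximum(mid, X)
    X_qr = rng.uniform(low, high)
    return np.vstack([X, X_qr])
```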

Extensive experiments have demonstrated that this initialization scheme exhibits two important advantages. First, it enhances the diversity of the starting population, which can improve the algorithm’s outcomes at the beginning of the run. Second, it enables the algorithm to cover a wider search area with the same population size, giving the search procedure an initial boost as well.

3.3.2 Procedure for preserving population diversity

To evaluate whether the algorithm’s search mechanism is converging or diverging, one method is to assess the diversity of the population, as explained in Cheng and Shi (2011). That study employs a new definition of population diversity based on the L1 norm, which accounts for diversity arising from two factors: the solutions generated by the population and the problem’s dimensionality.

Furthermore, Cheng and Shi (2011) emphasize the importance of the dimension-wise component of the L1 norm, which can be utilized to evaluate the search mechanism of the algorithm under study.

Suppose m represents the number of solutions in the population, and n denotes the problem’s dimensionality. The L1 norm can be calculated as presented in Eqs. (28) to (30):

$$\begin{aligned} \overline{x_{j}} = \frac{1}{m}\sum _{i=1}^{m}x_{ij} \end{aligned}$$
(28)
$$\begin{aligned} \Theta ^p_{j} = \frac{1}{m}\sum _{i=1}^{m}\Big \vert x_{ij} - {\overline{x}}_j \Big \vert \end{aligned}$$
(29)
$$\begin{aligned} \Theta ^p = \frac{1}{n}\sum _{j=1}^{n} \Theta ^p_j \end{aligned}$$
(30)

In this context, \({\overline{x}}\) refers to the array containing the average positions of the solutions across each dimension, \(\Theta ^p_{j}\) represents the array of position diversities of the individuals calculated using the L1 norm, and \(\Theta ^p\) denotes the overall diversity of the population as a scalar.

During the initial rounds of the algorithm’s execution, the population’s diversity is high, since the solutions are generated using the standard initialization equation (26). As the method converges towards an optimal or sub-optimal solution in later rounds, the diversity decreases. To regulate this, the enhanced RSA algorithm proposed in this study uses the L1 norm to monitor the population’s diversity throughout the run, via a dynamic diversity threshold control parameter, represented by \(\Theta _t\).

A technique is introduced to preserve population diversity via an extra control factor, referred to as nrs, that specifies the number of individuals to be substituted. The approach functions in the following way: at the outset of the algorithm, the dynamic diversity threshold, labeled \(\Theta _{t0}\), is established. During each round of execution, the current population diversity, \(\Theta ^P\), is assessed and compared to the dynamic diversity threshold \(\Theta _t\). If the condition \(\Theta ^P<\Theta _t\) is met, indicating that the population’s diversity is insufficient, the worst nrs individuals are replaced with random solutions created using a method comparable to the one used for initialization. A short code sketch of this mechanism follows.
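The sketch below expresses the mechanism in a few lines of NumPy: Eqs. (28)–(30) give the diversity measure, Eq. (32) the threshold decay, and the worst nrs agents are resampled when diversity drops below the threshold. Minimization of the objective is assumed, and the helper names are illustrative.

```python
import numpy as np

def population_diversity(X):
    """L1-norm diversity of Eqs. (28)-(30)."""
    xbar = X.mean(axis=0)                        # Eq. (28)
    theta_j = np.abs(X - xbar).mean(axis=0)      # Eq. (29)
    return theta_j.mean()                        # Eq. (30)

def decay_threshold(theta_t, t, T):
    """Eq. (32): linear reduction of the dynamic diversity threshold."""
    return theta_t - theta_t * t / T

def enforce_diversity(X, fitness, theta_t, nrs, lb, ub, rng=None):
    """Replace the worst nrs agents with random ones when diversity
    falls below theta_t (objective is assumed to be minimized)."""
    if rng is None:
        rng = np.random.default_rng()
    if population_diversity(X) < theta_t:
        worst = np.argsort(fitness)[-nrs:]       # largest objective = worst
        X[worst] = lb + rng.random((nrs, X.shape[1])) * (ub - lb)
    return X
```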

After conducting empirical simulations and theoretical analysis, the formula for computing \(\Theta _{t0}\) can be expressed in the following manner.

$$\begin{aligned} \Theta _{t0} = \sum _{j=1}^{NP}\frac{(ub_{j}-lb_{j})}{2 \cdot NP} \end{aligned}$$
(31)

As the algorithm progresses, the population is expected to gradually approach the optimal search region. Thus, the dynamic diversity threshold \(\Theta _{t}\) must be lowered from its starting value \(\Theta _{t0}\), which is calculated by applying Eq. (31). This reduction is accomplished with a linearly decreasing function, as shown in Eq. (32), with T denoting the maximum number of rounds and \(\Theta _{t0}\) representing the initial diversity threshold.

$$\begin{aligned} \Theta _{t+1} = \Theta _{t} - \Theta _{t}\cdot \frac{t}{T}, \end{aligned}$$
(32)

Here, t and \(t+1\) denote the current and next rounds, and T represents the iteration maximum. As the algorithm continues, \(\Theta _t\) is dynamically reduced until, eventually, the mechanism described above is no longer triggered and \(\Theta ^P\) is disregarded.

3.3.3 Inner workings of the suggested algorithm

The introduced modified RSA algorithm improves on the admirable performance of the basic RSA and was therefore named the enhanced RSA (ERSA); its internal structure is provided in Algorithm 3. Looking at the pseudo-code, it can be noted that the suggested modifications are integrated into the basic variant of the RSA described in Algorithm 1 (Abualigah et al. 2022).

Algorithm 3 ERSA pseudocode

3.4 Evaluation metrics

The observed models’ simulation outcomes have been validated by applying a collection of traditional ML measurements, namely the mean squared error (MSE) calculated by Eq. (33), the root mean squared error (RMSE) obtained by Eq. (34), the mean absolute error (MAE) specified by Eq. (35), and finally the coefficient of determination (\(R^2\)) determined by Eq. (36).

$$\begin{aligned} MSE = \frac{1}{N}\sum _{i=1}^{N}\left( \hat{p_i} - p_i\right) ^{2} \end{aligned}$$
(33)
$$\begin{aligned} RMSE = \sqrt{\frac{1}{N}\sum _{i=1}^{N}\left( \hat{p_i} - p_i\right) ^{2}} \end{aligned}$$
(34)
$$\begin{aligned} MAE = \frac{1}{N}\sum _{i=1}^{N}\left| \hat{p_i}-p_i\right| \end{aligned}$$
(35)
$$\begin{aligned} R^2 = 1- \frac{\sum _{i = 1}^{N} \left( p_{i} - \hat{p_{i}} \right) ^2}{\sum _{i = 1}^{N}\left( p_{i} - {\bar{p}} \right) ^2} \end{aligned}$$
(36)

where \(p_{i}\) and \(\hat{p_i}\) mark the arrays of observed and predicted values, respectively, both containing N entries, and \({\bar{p}}\) is the mean of the observed values. This research employs the MSE as the objective function to be minimized. A direct computation of these indicators is sketched below.
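For reference, the four indicators of Eqs. (33)–(36) can be computed directly, as in this short NumPy sketch.

```python
import numpy as np

def regression_metrics(p, p_hat):
    """Eqs. (33)-(36) over observed p and predicted p_hat."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    mse = np.mean((p_hat - p) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(p_hat - p))
    r2 = 1 - np.sum((p - p_hat) ** 2) / np.sum((p - p.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

print(regression_metrics([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```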

4 Experiments with standard bound-constrained functions

Before evaluating the performance of the ERSA metaheuristic on practical RNN tuning for wind energy time-series forecasting, the proposed method was first tested on standard bound-constrained (unconstrained) benchmarks, in compliance with well-established practices from the modern literature.

The CEC2019 test suite was chosen for two reasons: this set of functions is more challenging and complex than other benchmark suites (e.g. standard test instances and CEC2017), and the basic RSA was also evaluated on it when the approach was first introduced (Abualigah et al. 2022). The suite consists of ten functions, whose details (name, dimension, search space constraints, global optimum) are given in Table 1.

The CEC2019 simulations were conducted with both the introduced ERSA and the original RSA. Additionally, to make the comparative analysis more comprehensive, other cutting-edge metaheuristics were also considered: particle swarm optimization (PSO) (Kennedy and Eberhart 1995), artificial bee colony (ABC) (Karaboga and Basturk 2008), the firefly algorithm (FA) (Yang 2009), Harris hawks optimization (HHO) (Heidari et al. 2019), the whale optimization algorithm (WOA) (Mirjalili and Lewis 2016), and the chimp optimization algorithm (ChOA) (Khishe and Mosavi 2020). This particular set of algorithms was chosen to strike a balance between traditional methods like PSO and more recent ones such as ChOA.

Experimental conditions similar to those in Abualigah et al. (2022) were adopted for this research, with the maximum number of iterations (T) and the population size (N) set to 500 and 30, respectively. Additionally, due to the methods’ stochastic behavior, experiments were repeated in 30 independent runs, with the best, worst, mean, median, and standard deviation metrics captured.

All evaluated methods were implemented specifically for this research, and results for the approaches also tested in Abualigah et al. (2022) were not carried over. However, it should be pointed out that the simulations conducted for this work produced results for the basic RSA, WOA, and PSO very similar to those reported in Abualigah et al. (2022).

Table 1 Review of CEC2019 benchmark function problems details

The performance metrics comparison between the competing algorithms is given in Table 2, with the best-attained outcomes highlighted in bold text.

Table 2 Results of the RSA using the CEC2019 test functions

The outcomes of the CEC2019 function testing indicate that, on average, the proposed ERSA demonstrates the best performance, although it is apparent that this is not the case on every test function. Nevertheless, this is to be expected: per the NFL theorem of optimization, no single algorithm is equally effective across all applications, and experimentation is required to determine the algorithms best suited to a given problem.

Further analysis helps clarify the strengths and weaknesses of the introduced algorithm in comparison to contemporary methods. Interestingly, on certain functions such as F9, while the algorithm did not display the best performance, it attained the best STD score, indicating that, although the introduced algorithm misses the best solution by a small margin, it is the most robust and reliable option. This observation is somewhat mirrored for function F4, where despite attaining a good score for both the best and average run, the algorithm does not demonstrate the highest reliability, further emphasizing the importance of extensive experimentation and the NFL theorem.

A more direct improvement can be observed on F2, where despite identical scores for the best, average, and worst runs between the RSA and the introduced ERSA, the introduced algorithm attained a significant improvement in robustness, as demonstrated by a decrease in the STD. On F10, both algorithms performed slightly less favorably than the ABC algorithm, with the original RSA performing slightly better in the worst run, while the introduced ERSA attained better performance in the best run, evening out average performance. On test function F3, most metaheuristics performed similarly well. In most other cases, the introduced algorithm achieved a performance increase compared to the original and also outperformed the competing metaheuristics.

Finally, the metaheuristics were ranked according to their average performance scores, with the best-performing algorithms receiving the lower ranks and less effective algorithms progressively higher ones. The proposed algorithm attained the best ranking in the majority of cases, closely followed by the original RSA. Average rankings across all functions are also shown: the introduced ERSA attained an average rank of 1.5, followed by the original RSA with an average rank of 2.4. Based on these rankings, the introduced ERSA performs very well in comparison with contemporary algorithms while also improving on the strong performance of the RSA. It is also worth noting that all algorithms were independently implemented and tested, and the attained results are in line with previous works (Abualigah et al. 2022) that similarly evaluated the original RSA against several state-of-the-art algorithms.

Finally, to visualize the performance of the evaluated methods, convergence speed graphs of the mean run for arbitrarily chosen instances are shown in Fig. 2. On the F2, F4, and F8 benchmarks, the ERSA converges relatively fast; on the F10 test, the proposed method exhibits relatively stable, but not the best, performance.

Fig. 2 Convergence speed graphs for arbitrarily chosen CEC2019 functions

5 Utilized datasets and basic experimental setup

5.1 Overview of wind generation datasets

As already noted in the Introduction, two datasets were employed to evaluate the performance of the LSTM models for multivariate time-series forecasting. This part of the manuscript provides a brief description of both. For ease of reference, the first dataset is labeled the “wind dataset”, while the second is referred to as the “wind farm dataset”.

5.1.1 Wind dataset

The hourly energy demand generation and weather dataset, available online, was compiled from two primary sources. The first source covers electrical generation and consumption data for Spain, provided by the ENTSO-E public portal for Transmission System Operator (TSO) data. It covers several sources of power, such as biomass, geothermal, wind, solar, and fossil power. The second source includes relevant meteorological data for Valencia, Spain, provided by a weather API available online. It covers hourly-resolution data for temperature, pressure, humidity, wind speed, wind direction, and rainfall.

The compiled dataset covers a total of 4 years of weather, load, and generation data for Spain at hourly resolution, for the years 2015–2018. The available information makes this compiled dataset an excellent candidate for forecasting power generation from meteorological data. In this research, the onshore wind power generation data is used as the target variable, while the available weather data is used as input.

However, since this is a relatively large dataset, only the period from January 1, 2018 to December 31, 2018 was used for simulations, comprising 8759 observations. The training-validation-testing split of the dataset, demonstrated on the target feature, can be seen in Fig. 3.

Fig. 3 Wind dataset split shown on the target feature

5.1.2 Wind farm dataset

Originally a competition dataset from the GEFCom2012 challenge (Hong et al. 2014), and for a time available on Kaggle, the wind farm dataset has been repurposed for use in wind energy generation forecasting. It was introduced with the aim of improving forecasting practices and their utility across industries, bridging academic research and industry practice. The dataset continues that legacy in this work, where it is used to assess the proposed models’ performance on the complex task of wind power generation forecasting.

The dataset encompasses weather and generation data for 7 anonymized wind farms in mainland China. The power generation data of the farms has been normalized to ensure anonymity. The wind farm data is accompanied by 24-hour forecasts of relevant wind meteorological data issued every 12 hours. Wind speed and wind direction are provided alongside the zonal and meridional wind components for each wind farm (Fig. 4).

Fig. 4 Wind farm dataset split shown on the target feature

For experimental purposes, the meteorological data was trimmed to 12 hours of predictions per forecast, creating a continuous hourly resolution. This was then combined with the available normalized real-world wind generation data for each respective wind farm on an hourly basis. The original dataset covered four and a half years of hourly-resolution data, with the final half year reserved for testing; however, that final half year of data was never made available.

Due to the large number of observations available in this dataset, as well as missing values in its later parts, a reduced portion of the available data was used during experimentation; the sheer volume of data makes training and evaluating models very resource-intensive. Therefore, the experimental dataset used in the simulations covers 2 years of data (from January 1, 2009 to December 31, 2010) for a single anonymized wind farm. The final version of the utilized dataset contains a total of 13176 instances for wind farm 2. The training-validation-testing split, demonstrated on the target feature, can be seen in Fig. 4.

5.2 Decomposition

The initial stage of experimentation involves applying decomposition techniques to the available input features. This is done to divide the complex input feature signals into a series of simpler sub-signals that can later be used for forecasting. The process does, however, increase the number of features the network needs to handle.

The two techniques tested in this work are VMD (Dragomiretskiy and Zosso 2013) and EMD (Huang et al. 1998). VMD was carried out with K set to three, resulting in a total of four signals: three signals represent the attained modes, while the fourth holds the residual values not encompassed by any mode. All VMD modes and residuals for the wind and wind farm datasets are shown in Figs. 5 and 6, respectively.

Fig. 5 Wind dataset input feature VMD modes and residuals

Fig. 6 Wind farm dataset input feature VMD modes and residuals

Similarly, the number of IMFs for the EMD technique was limited to a maximum of four. It is important to note that once a new IMF can no longer be extracted, EMD terminates the process early, resulting in fewer components. In addition to the IMF components, a residual component is added, consisting of the signal content that could not be assigned to any IMF. This results in a maximum of 5 sub-signals per input feature. All EMD IMFs and residuals for the wind and wind farm datasets are shown in Figs. 7 and 8, respectively.

Fig. 7 Wind dataset input feature EMD IMFs and residuals

Fig. 8 Wind farm dataset input feature EMD IMFs and residuals

5.3 Experimental setup

The experimental process involves two stages. First, the available data for both datasets was subjected to decomposition. The resulting signal components and residual signals were then fed into LSTM models tasked with forecasting. Each model was given six points of input data and challenged with making forecasts three steps ahead; a minimal windowing sketch is given after Fig. 9. A flowchart of the described process is shown in Fig. 9.

Fig. 9 Flowchart of the forecasting process
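The windowing step can be sketched as follows, assuming the decomposed features are stacked into a (timesteps, features) array with the target in column 0; the helper name and placeholder data are illustrative.

```python
import numpy as np

def make_windows(series, lags=6, horizon=3):
    """Slice a (timesteps, features) array into supervised pairs:
    six input lags mapped to three-step-ahead targets (column 0)."""
    X, y = [], []
    for i in range(len(series) - lags - horizon + 1):
        X.append(series[i:i + lags])
        y.append(series[i + lags:i + lags + horizon, 0])
    return np.array(X), np.array(y)

data = np.random.rand(100, 5)   # placeholder for the decomposed features
X, y = make_windows(data)
print(X.shape, y.shape)         # (92, 6, 5) (92, 3)
```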

Several state-of-the-art metaheuristics were challenged to optimize the hyperparameters of the prediction models in order to improve performance. The evaluated optimizers include the introduced ERSA as well as the original RSA. Additionally, several well-known optimization algorithms were included in the comparative analysis: PSO (Kennedy and Eberhart 1995), ABC (Karaboga and Basturk 2008), FA (Yang 2009), HHO (Heidari et al. 2019), WOA (Mirjalili and Lewis 2016), and ChOA (Khishe and Mosavi 2020). The evaluated optimization algorithms were assigned a population size of five agents and allowed eight iterations to improve solutions. Additionally, to account for the intrinsic randomness of metaheuristic algorithms, results were gathered over 30 independent executions to help attain objective evaluations.

All tested algorithms were tasked with selecting LSTM hyperparameters. The parameter subset selected for optimization was chosen for its high impact on model performance. The optimized parameters and their ranges are as follows: the neuron count in the first layer in the range [100, 300], the learning rate in [0.0001, 0.01], the number of training epochs in [300, 600], the dropout rate in [0.05, 0.2], the total number of network layers in [1, 2], and the number of neurons in the second network layer in [100, 300]. The parameters and their respective constraints are summarized in Table 3, and a decoding sketch follows it. Finally, early stopping has been implemented to help prevent model over-fitting, with a threshold empirically determined as \(\frac{epochs}{3}\): if a model does not improve for \(\frac{epochs}{3}\) epochs, training is terminated early. An added benefit of this approach is a reduction in wasted computational resources. The utilized ranges were determined empirically to give the best outcomes considering the computational costs of optimization.

Table 3 The LSTM hyperparameters included in the optimization and their respective constraints

Experiments were carried out using the Python programming language. Standard ML and AI libraries were utilized, including Keras, TensorFlow, and Scikit-learn. The visuals were generated using the Seaborn and Matplotlib libraries. To facilitate experimentation, a machine with an Intel i9-11900K CPU, 128 GB of RAM, and an RTX 4070 GPU was employed.

6 Achieved results, comparative analysis, and discussion

In this section the experimental outcomes on the two observed datasets, wind and wind farm, are presented. First, the results without decomposition are shown, followed by the results with VMD and EMD employed. Last but not least, this section also provides a SHAP analysis of the best-performing model on each of the two observed datasets. In all tables that contain experimental results, the best outcomes in every regarded category are marked in bold text.

6.1 Wind dataset experimental results

The following presents the outcomes on the wind dataset without decomposition. Table 4 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The LSTM-ERSA model accomplished the best results in terms of the best and median values. The second-best result was scored by the LSTM-ABC method, while LSTM-WOA attained the third-best score. Meanwhile, LSTM-ChOA achieved the best results for the worst and mean metrics. Finally, LSTM-HHO established the best standard deviation and variance scores, suggesting that it provided the steadiest results across the runs.

Table 4 Wind dataset overall metrics for best, worst, mean, and median run without decomposition

Table 5 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix L is used to denote that LSTM is used. It can be noted that the suggested LSTM-ERSA attained the best results for two-samples-forward, three-samples-forward, and overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\), MAE (except for two-samples-forward), and RMSE. The best scores for one-sample-forward predictions were achieved by the LSTM-ABC approach. Looking at the overall results, the second-best algorithm was LSTM-ABC, in front of the LSTM-WOA and LSTM-ChOA methods. The observed LSTM-ERSA attained the best overall MSE value of 0.006592, in front of LSTM-ABC with an MSE value of 0.006605.

Table 5 Wind dataset detailed metrics for each prediction step of the best run without decomposition

The best set of LSTM parameters produced by the top-performing run of each metaheuristic is shown in Table 6. The proposed LSTM-ERSA established the LSTM structure as follows: 140 neurons in the first layer, a learning rate of 0.000894, 517 epochs, a dropout value of 0.108587, and 170 neurons in the second layer.

Table 6 Parameters selected by metaheuristics for best-performing wind prediction models without decomposition

Aiming to provide better insight into the results, visualizations are provided in Figs. 10, 11, and 12. Figure 10 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 11. It can be noted that at the beginning, the RSA and WOA metaheuristics converge faster, but are overtaken by the proposed ERSA in the final rounds of execution. Lastly, the Kernel Density Estimation (KDE) diagram and swarm plot are shown in Fig. 12. The KDE diagram shows the probability density function and can indicate whether or not the results come from a normal distribution. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm.
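Diagnostic figures of this kind can be reproduced with Seaborn along the following lines; the MSE values below are synthetic placeholders for the per-run results, and only three of the eight algorithms are shown for brevity.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
algos = ["ERSA", "RSA", "PSO"]
runs = pd.DataFrame({
    "algorithm": np.repeat(algos, 30),
    "mse": np.concatenate([rng.normal(m, 0.02, 30)
                           for m in (0.50, 0.55, 0.60)]),  # placeholder values
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
sns.violinplot(data=runs, x="algorithm", y="mse", ax=axes[0])  # distribution
sns.kdeplot(data=runs, x="mse", hue="algorithm", ax=axes[1])   # density check
sns.swarmplot(data=runs, x="algorithm", y="mse", ax=axes[2])   # run diversity
plt.tight_layout()
plt.show()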

Fig. 10
figure 10

Wind dataset objective function and \(R^2\) distribution plots for each metaheuristic without decomposition

Fig. 11
figure 11

Wind dataset objective function and \(R^2\) convergence plots for each metaheuristic without decomposition

Fig. 12
figure 12

Wind dataset objective swarm and KDE plots for each metaheuristic without decomposition

6.1.1 Wind dataset with VMD

This section presents the experimental outcomes on the wind dataset when VMD has been applied. Table 7 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The VMD-LSTM-ERSA model accomplished the best results in terms of the best, mean, and median values. The second-best result was scored by the VMD-LSTM-ChOA method, while VMD-LSTM-RSA attained the third-best score. Meanwhile, VMD-LSTM-HHO achieved the best result for the worst metric. Finally, VMD-LSTM-HHO also established the best standard deviation and variance scores, suggesting that it provided the steadiest results over 30 independent runs.

Table 7 Wind dataset overall metrics for best, worst, mean, and median run using VMD

Table 8 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix VL is used to denote that VMD-LSTM is used. It can be noted that the suggested VMD-LSTM-ERSA attained superior results for one-sample-forward and overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\), MAE, and RMSE. The best scores for two-samples-forward predictions were achieved by the VMD-LSTM-WOA approach, while VMD-LSTM-HHO attained the best outcomes for three-samples-forward predictions. Looking at the overall results, the second-best algorithm was VMD-LSTM-ChOA, in front of the VMD-LSTM-RSA method. The observed VMD-LSTM-ERSA attained the best overall MSE value of 0.001704, in front of VMD-LSTM-ChOA with an MSE value of 0.001726.

Table 8 Wind dataset detailed metrics for each prediction step of the best run using VMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic is shown in Table 9. The proposed VMD-LSTM-ERSA established the LSTM structure as follows: 100 neurons in the first layer, a learning rate of 0.010000, 563 epochs, a dropout value of 0.200000, and 100 neurons in the second layer.

Table 9 Parameters selected by metaheuristics for best-performing wind prediction models using VMD

Aiming to provide better insight into the results, visualizations are provided in Figs. 13, 14, and 15. Figure 13 depicts the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 14. It can be noted that in this case, the proposed ERSA exhibits the fastest convergence from beginning to end. Lastly, the KDE diagram and swarm plot for the objective function are shown in Fig. 15. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm, and it can be noted that all the solutions of the proposed ERSA are in close proximity to the optimum in this case.

Fig. 13
figure 13

Wind dataset objective function and \(R^2\) distribution plots for each metaheuristic with VMD

Fig. 14
figure 14

Wind dataset objective function and \(R^2\) convergence plots for each metaheuristic with VMD

Fig. 15
figure 15

Wind dataset objective swarm and KDE plots for each metaheuristic with VMD

6.1.2 Wind dataset with EMD

This section presents the experimental outcomes on the wind dataset when EMD has been applied. Table 10 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The EMD-LSTM-ERSA model accomplished superior results in terms of the best, mean, and median scores. The second-best result was attained by the EMD-LSTM-PSO method, while EMD-LSTM-FA obtained the third-best score. Meanwhile, EMD-LSTM-PSO achieved the best result for the worst metric. Finally, EMD-LSTM-RSA established the best standard deviation and variance scores, suggesting that it provided the steadiest results over 30 independent runs in this scenario.

Table 10 Wind dataset overall metrics for best, worst, mean, and median run using EMD

Table 11 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix EL is used to denote that EMD-LSTM is used. In this scenario, it can be noted that the suggested EMD-LSTM-ERSA attained the best outcomes across the board: for one-, two-, and three-samples-forward as well as overall results, in terms of the objective (MSE) and all other important indicators, namely \(R^2\), MAE, and RMSE. Looking at the overall results, the second-best algorithm was EMD-LSTM-PSO, in front of the EMD-LSTM-FA method. The proposed EMD-LSTM-ERSA attained the best overall MSE value of 0.004831, in front of EMD-LSTM-PSO, which achieved an MSE value of 0.004994.

Table 11 Wind dataset detailed metrics for each prediction step of the best run using EMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for the scenario with EMD employed is shown in Table 12. The proposed EMD-LSTM-ERSA established the LSTM structure as follows: 108 neurons in the first layer, a learning rate of 0.007837, 600 epochs, a dropout value of 0.050000, and 151 neurons in the second layer.

Table 12 Parameters selected by metaheuristics for the respective best-performing wind prediction models using EMD

Aiming to provide better insight into the results, visualizations are provided in Figs. 16, 17, and 18. Figure 16 depicts the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 17. It can be noted that in this case, the proposed ERSA exhibits the fastest convergence from beginning to end. Lastly, the KDE diagram and swarm plot for the objective function are shown in Fig. 18. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm.

Fig. 16
figure 16

Wind dataset objective function and \(R^2\) distribution plots for each metaheuristic with EMD

Fig. 17
figure 17

Wind dataset objective function and \(R^2\) convergence plots for each metaheuristic with EMD

Fig. 18
figure 18

Wind dataset objective swarm and KDE plots for each metaheuristic with EMD

6.1.3 Comparison with other models on the Wind dataset

To demonstrate the comparative performance improvements gained by coupling the optimizers with decomposition techniques in the introduced methodology, the best outcomes of each approach have been compared to several contemporary prediction models. The attained objective (MSE) and \(R^2\) indicator values are shown in Table 13.

Table 13 Comparison of the best performing methods with other contemporary prediction models applied to the wind dataset

As demonstrated in Table 13, the outcomes attained by applying VMD followed by LSTM networks optimized via the introduced metaheuristic demonstrate notable improvements compared to the other approaches applied to the same task.

6.2 Wind farm dataset experimental results

This section presents the results on the wind farm dataset without decomposition. Table 14 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The LSTM-ERSA model accomplished the best results in terms of the best, worst, and mean values. The second-best result was scored by the LSTM-HHO method, while LSTM-ABC attained the third-best score. Meanwhile, LSTM-ABC achieved the best result for the median metric. Last but not least, LSTM-RSA established the best standard deviation and variance scores, suggesting that it provided the steadiest results across the runs.

Table 14 Wind farm dataset overall metrics for best, worst, mean, and median run without decomposition

Table 15 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix L is used to denote that LSTM is used. It can be noted that the suggested LSTM-ERSA attained the best results for two-samples-forward and overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\) and RMSE (though not MAE). The best scores for one-sample-forward predictions were achieved by the LSTM-HHO approach, while the best outcomes for three-samples-forward were attained by LSTM-ChOA. Looking at the overall results, the second-best algorithm was LSTM-HHO, in front of the LSTM-ABC and LSTM-ChOA methods. The observed LSTM-ERSA attained the best overall MSE value of 0.020566, in front of LSTM-HHO with an MSE value of 0.020608.

Table 15 Wind farm dataset detailed metrics for each prediction step of the best run without decomposition

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for this scenario is shown in Table 16. The proposed LSTM-ERSA established the LSTM structure as follows: 180 neurons in the first layer, a learning rate of 0.005919, 439 epochs, a dropout value of 0.165101, and 200 neurons in the second layer for this particular scenario.

Table 16 Parameters selected by metaheuristics for best-performing wind farm prediction models without decomposition

Aiming to provide better insight into the results, visualizations are provided in Figs. 19, 20, and 21. Figure 19 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 20. It can be noted that HHO exhibited slightly faster convergence at one point early on; however, it was overtaken by the proposed ERSA in the final rounds of execution. Finally, the KDE diagram and swarm plot are shown in Fig. 21. The KDE diagram shows the probability density function and can indicate whether or not the outcomes originate from a normal distribution. The swarm plot shows the diversity of the solutions during the final round of the best-performing run of each algorithm.

Fig. 19
figure 19

Wind farm dataset objective function and \(R^2\) distribution plots for each metaheuristic without decomposition

Fig. 20
figure 20

Wind farm dataset objective function and \(R^2\) convergence plots for each metaheuristic without decomposition

Fig. 21
figure 21

Wind farm dataset objective swarm and KDE plots for each metaheuristic without decomposition

6.2.1 Wind farm dataset with VMD

This section presents the results on the wind farm dataset with VMD employed. Table 17 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The VMD-LSTM-ERSA model accomplished the best results in terms of the best, worst, mean, and median values. The second-best result was scored by the VMD-LSTM-FA method, while VMD-LSTM-PSO attained the third-best score. VMD-LSTM-ERSA also achieved the best result for the standard deviation. Last but not least, VMD-LSTM-WOA established the best variance score.

Table 17 Wind farm dataset overall metrics for best, worst, mean, and median run using VMD

Table 18 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix VL is used to denote that VMD-LSTM is used. It can be noted that the suggested VMD-LSTM-ERSA attained the best scores for overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\) and RMSE (though not MAE). The best scores for one-sample-forward predictions were achieved by the VMD-LSTM-HHO approach, the best results for two-samples-forward were obtained by VMD-LSTM-RSA, while the best outcomes for three-samples-forward were attained by VMD-LSTM-FA. Looking at the overall results, the second-best algorithm was VMD-LSTM-FA, in front of VMD-LSTM-PSO. The proposed VMD-LSTM-ERSA attained the best overall MSE value of 0.006702, in front of VMD-LSTM-FA with an MSE value of 0.006747.

Table 18 Wind farm dataset detailed metrics for each prediction step of the best run using VMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for this scenario is shown in Table 19. The proposed VMD-LSTM-ERSA established the LSTM structure as follows: 142 neurons in the first layer, a learning rate of 0.009502, 600 epochs, a dropout value of 0.151008, and 127 neurons in the second layer for this particular scenario.

Table 19 Parameters selected by metaheuristics for best-performing wind farm prediction models using VMD

Aiming to provide better insight into the results, visualizations are provided in Figs. 22, 23, and 24. Figure 22 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 23. It can be noted that FA exhibited slightly faster convergence at the beginning; however, it was overtaken by the proposed ERSA in the final rounds of execution. Finally, the KDE diagram and swarm plot are shown in Fig. 24. The KDE diagram shows the probability density function and can indicate whether or not the results come from a normal distribution. The swarm plot shows the diversity of the solutions during the last iteration of the best-performing run of each algorithm.

Fig. 22
figure 22

Wind farm dataset objective function and \(R^2\) distribution plots for each metaheuristic with VMD

Fig. 23
figure 23

Wind farm dataset objective function and \(R^2\) convergence plots for each metaheuristic with VMD

Fig. 24
figure 24

Wind farm dataset objective swarm and KDE plots for each metaheuristic with VMD

6.2.2 Wind farm dataset with EMD

This section presents the results on the wind farm dataset with EMD employed. Table 20 shows the overall metrics in terms of the best, worst, mean, and median values, accompanied by the standard deviation and variance values over 30 separate runs of each regarded algorithm. The EMD-LSTM-ERSA model accomplished the best results in terms of the best and mean values. The second-best result was scored by the EMD-LSTM-RSA method, while EMD-LSTM-PSO attained the third-best score. Meanwhile, EMD-LSTM-FA obtained the best result for the worst metric, as well as for the standard deviation and variance. Finally, the best median value was achieved by the EMD-LSTM-RSA method.

Table 20 Wind farm dataset overall metrics for best, worst, mean, and median run using EMD

Table 21 brings forward the detailed metrics of every prediction step regarding the best run of each algorithm. The prefix EL is used to denote that EMD-LSTM is used. It can be noted that the suggested EMD-LSTM-ERSA attained the best scores for overall results, in terms of the objective (MSE), but also for other important indicators, namely \(R^2\) and RMSE (though not MAE). Also, EMD-LSTM-ERSA attained the best results for two-samples-forward predictions, for all observed indicators. The best scores for one-sample-forward predictions were achieved by the EMD-LSTM-FA approach, while the best outcomes for three-samples-forward were attained by EMD-LSTM-ChOA. Looking at the overall results, the second-best algorithm was EMD-LSTM-RSA, in front of EMD-LSTM-PSO. The proposed EMD-LSTM-ERSA attained the best overall MSE value of 0.020351, in front of EMD-LSTM-RSA with an MSE value of 0.020556.

Table 21 Wind farm dataset detailed metrics for each prediction step of the best run using EMD

The best set of LSTM parameters produced by the top-performing run of each metaheuristic for this scenario is shown in Table 22. The proposed EMD-LSTM-ERSA established the LSTM structure as follows: 100 neurons in the initial layer, a learning rate of 0.010000, 600 epochs, a dropout value of 0.200000, two layers, and 100 neurons in the second layer for this particular scenario with EMD employed.

Aiming to provide better insight into the results, visualizations are provided in Figs. 25, 26, and 27. Figure 25 shows the violin plots for the objective function (MSE), accompanied by the box plot of the \(R^2\) indicator, over all 30 runs. After that, the convergence diagrams of the objective function and \(R^2\) for the best run of each algorithm are given in Fig. 26. It can be noted that ChOA, PSO, and RSA exhibited slightly faster convergence at the beginning; however, all of them were overtaken by the proposed ERSA in the final rounds of execution. Finally, the KDE diagram and swarm plot are shown in Fig. 27. The KDE diagram shows the probability density function and can indicate whether or not the outcomes originate from a normal distribution. The swarm plot shows the diversity of the solutions during the last iteration of the best-performing run of each algorithm. In this scenario, it can be noted that all outcomes of ERSA were grouped near the best solution at the end of the run.

Fig. 25
figure 25

Wind farm dataset objective function and \(R^2\) distribution plots for each metaheuristic with EMD

Fig. 26
figure 26

Wind farm dataset objective function and \(R^2\) convergence plots for each metaheuristic with EMD

Fig. 27
figure 27

Wind farm dataset objective swarm and KDE plots for each metaheuristic with EMD

6.2.3 Comparison with other models on the Wind farm dataset

To demonstrate the comparative performance improvements gained by coupling the optimizers with decomposition techniques in the introduced methodology, the best outcomes of each approach have been compared to several contemporary prediction models. The attained objective (MSE) and \(R^2\) indicator values are shown in Table 23.

Table 22 Parameters selected by metaheuristics for best-performing wind farm prediction models using EMD
Table 23 Comparison of the best performing methods with other contemporary prediction models applied to the wind farm dataset

As can be observed in Table 23, VMD demonstrated the best performance when applied alongside the metaheuristic-optimized LSTM models.

6.3 Validation: statistical tests

Modern computer science research requires researchers to determine whether introduced improvements are statistically significant, since experimental outcomes alone are usually inadequate to declare that one algorithm outperforms its competitors. This research tested eight methods, including the proposed ERSA metaheuristic, for tuning LSTM networks on two wind power generation datasets. The comparison was therefore conducted among eight methods over six problem instances, utilized for multi-problem analysis as per Eftimov et al. (2016).

Literature recommendations by Eftimov et al. (2016) and Derrac et al. (2011) suggest that statistical tests in such scenarios should build a representative sample of outcomes for each method by taking the average objective value over several independent executions on each problem. Nevertheless, this methodology may not be ideal in the presence of outliers that render the distribution non-normal, and may thus lead to misleading conclusions. Whether the mean objective function value is appropriate for statistical comparison of stochastic methods remains an open question, as per the literature survey cited by Eftimov et al. (2016). Accordingly, before adopting this approach, the suitability of averaging the objective function over the 30 independent runs of the eight methods across the six problem instances was first verified.

The decision was made after performing the Shapiro–Wilk test (Shapiro and Francia 1972) for single-problem analysis using the following procedure: a data sample was constructed for each algorithm and every problem by gathering the results of each run, and the corresponding p-values were computed for all method-problem combinations. The resulting p-values are presented in Table 24.

Table 24 Shapiro–Wilk test scores for the single-problem analysis

Since the p-values in Table 24 are all below \(\alpha =0.05\), the null hypothesis of normality can be rejected. Consequently, the data samples for all method-problem combinations do not originate from a Gaussian distribution, meaning it is not acceptable to utilize the average objective value in further statistical tests. As a result, this study utilized the best values for further statistical analysis.
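The per-combination normality check reduces to repeated calls of scipy.stats.shapiro; the sketch below uses a synthetic sample in place of the 30 real run results behind Table 24.

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
# One sample of 30 objective values per (method, problem) combination;
# a skewed synthetic sample stands in for the real results.
results = {("ERSA", "wind"): rng.lognormal(mean=-5.0, sigma=0.3, size=30)}

for (method, problem), sample in results.items():
    stat, p = shapiro(sample)
    verdict = "reject normality" if p < 0.05 else "cannot reject normality"
    print(f"{method} on {problem}: W={stat:.3f}, p={p:.4f} -> {verdict}")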

Next, the multi-problem, multiple-method statistical analysis was employed, using data samples constructed from the best objective function value over the 30 individual runs of each algorithm on every problem instance. To ensure the valid use of parametric tests, the following conditions were verified: independence, normality, and homoscedasticity of the data, as described by LaTorre et al. (2021). Independence was confirmed because each run began by generating a collection of random solutions. To assess normality, the data samples were subjected to the Shapiro–Wilk test, and the results for each algorithm are presented in Table 25.

Table 25 Shapiro–Wilk test scores for the multiple problem analysis

After conducting the Shapiro–Wilk test for all methods, the resulting p-values were all substantially below \(\alpha =0.05\), as indicated in Table 25. This shows that the conditions for the justified use of parametric tests were not met, and non-parametric tests were employed instead. The suggested ERSA method was set as the control algorithm in all conducted non-parametric tests.

As a result, the Friedman test (Friedman 1937, 1940), a two-way analysis of variance by ranks, was utilized to determine whether the performance level of the proposed ERSA was significantly superior to the other contenders. The application of this type of test, accompanied by the Holm post-hoc procedure, has been proposed by Derrac et al. (2011). The Friedman test scores are provided in Table 26, accompanied by the scores of the Friedman-aligned test, provided in Table 27.

Table 26 Friedman test scores
Table 27 Friedman aligned test scores

The results presented in Table 26 indicate that the suggested ERSA method attained a superior level of performance compared to the rest of the methods in the comparative analysis, scoring the average ranking value of 1.00. In these experiments, the second-best result was obtained by the ChOA algorithm with an average ranking of 4.00, in front of FA in third place with an average ranking of 4.33. The basic implementation of the RSA attained an average ranking of 4.83, showing the clear superiority of the suggested ERSA over the elementary version of the algorithm. Moreover, the Friedman statistic (\(\chi ^2_r = 17.22\)) is greater than the \(\chi ^2\) critical value with seven degrees of freedom (14.07) at the significance level \(\alpha = 0.05\). As the Friedman p-value is \(1.24\times 10^{-9}\), the results vary significantly across the observed methods. This allows the rejection of the null hypothesis (\(H_0\)) and confirms that the proposed ERSA method attained performance that was statistically significantly different from the other contenders. Similar conclusions can be drawn from the Friedman-aligned test scores given in Table 27.
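In scipy, the test reduces to friedmanchisquare over per-problem measurements; the values below are placeholders for the best objective results that underlie Table 26, and only three of the eight methods are shown for brevity.

from scipy.stats import friedmanchisquare

# One list per method: best objective value on each of the six problems
# (placeholder values for illustration only).
ersa = [0.10, 0.21, 0.15, 0.30, 0.12, 0.25]
rsa  = [0.12, 0.24, 0.18, 0.33, 0.15, 0.28]
pso  = [0.13, 0.23, 0.17, 0.34, 0.14, 0.29]

stat, p = friedmanchisquare(ersa, rsa, pso)
print(f"chi2_r = {stat:.2f}, p = {p:.4g}")  # reject H0 when p < alpha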

Last but not least, as discussed by Sheskin (2020), Iman and Davenport's test (1980) may provide more sensitive results than the \(\chi ^2\) approximation, and this particular test was executed as well. The obtained Iman and Davenport score is 3.47, which is greater than the F-distribution's critical value of 2.28, allowing the conclusion that this test rejects \(H_0\) as well.
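The Iman–Davenport statistic follows directly from the Friedman \(\chi ^2_r\); with \(n = 6\) problem instances and \(k = 8\) methods, the reported value can be reproduced as

\[ F_{ID} = \frac{(n-1)\,\chi ^2_r}{n(k-1)-\chi ^2_r} = \frac{(6-1)\times 17.22}{6\times (8-1)-17.22} = \frac{86.10}{24.78} \approx 3.47. \]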

The non-parametric post-hoc Holm's step-down procedure has been employed since both conducted tests reject the null hypothesis, with the findings reported in Table 28. This procedure sorts the regarded methods by their p-values in ascending order of rank and evaluates the i-th method's p-value against \(\alpha /(k-i)\), where k denotes the degrees of freedom (\(k=7\) in this study) and i is the method's position in the sorted order. This study utilized \(\alpha\) threshold values of 0.05 and 0.1. The results reported in Table 28 again clearly imply that the introduced ERSA method statistically significantly outscored all contenders at both regarded significance levels, 0.05 and 0.1.
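The step-down logic can be expressed compactly as follows; the p-values passed in are hypothetical, since the real ones are those reported in Table 28.

import numpy as np

def holm_step_down(p_values, alpha=0.05, k=7):
    """Compare the i-th smallest p-value against alpha / (k - i);
    stop at the first comparison that fails to reject."""
    order = np.argsort(p_values)
    rejected = []
    for i, idx in enumerate(order):
        if p_values[idx] < alpha / (k - i):
            rejected.append(idx)
        else:
            break  # step-down: all remaining hypotheses are retained
    return rejected

print(holm_step_down([0.001, 0.004, 0.012, 0.020, 0.030, 0.210, 0.400]))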

Table 28 Holm’s step-down procedure

6.4 Best model interpretation and SHAP analysis

This section brings forward the interpretation of the top-performing model on each of the two regarded datasets. In the previously discussed method, SHAP values were used to estimate feature importance. These values compare the model's predictions with and without each feature, demonstrating the impact of the feature on the observed model's output. Because the order of features can influence predictions, feature importance is calculated over every possible ordering to ensure fair feature comparisons (García and Aznarte 2020). This study developed separate models to assess wind energy generation, and the importance and impact of the features were evaluated for each model. Specifically, the analysis focused on how each predictor variable affected the model's prediction for each observation.
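A minimal sketch of this workflow with the shap package follows, using a tiny stand-in LSTM and random windows where the trained forecaster and real data would go; GradientExplainer is an assumption here, as the paper does not name the explainer variant it used.

import numpy as np
import shap
from tensorflow import keras

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 6, 4)).astype("float32")  # 4 dummy features
X_test = rng.normal(size=(50, 6, 4)).astype("float32")

model = keras.Sequential([
    keras.layers.LSTM(16, input_shape=(6, 4)),
    keras.layers.Dense(3),                     # three-steps-ahead output
])

explainer = shap.GradientExplainer(model, X_train[:100])
shap_values = explainer.shap_values(X_test)
# Depending on the shap version, multi-output models yield a list of arrays
# (one per forecast step); take the first step either way.
sv = shap_values[0] if isinstance(shap_values, list) else shap_values[..., 0]

# Mean absolute SHAP value per input feature, aggregated over the lookback
# window, yields a global importance ranking like the one in the figures.
importance = np.abs(sv).mean(axis=(0, 1))
print("mean |SHAP| per feature:", importance)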

6.4.1 Wind dataset SHAP analysis

Figure 28 shows the impacts of each feature for the LSTM-ERSA model on the wind dataset, while Fig. 29 presents the impacts of the best-performing model with VMD, namely VMD-LSTM-ERSA. A closer look at the waterfall plots (left part of each figure) indicates that the most important feature in this scenario is temperature, followed by onshore wind generation and pressure.

Fig. 28
figure 28

Wind dataset without decomposition best performing LSTM model feature impact determined through SHAP analysis

Fig. 29
figure 29

Wind dataset with VMD best performing LSTM model feature impact determined through SHAP analysis

One interesting observation is that a significant shift in feature influence occurs following the application of VMD. The top-performing model trained without decomposition places the highest value on temperature. However, following the application of VMD, the first three decomposed modes of onshore wind generation become the most significant features. This is likely due to the noise associated with onshore wind generation data. In its raw form, onshore wind generation is a complex and noisy data sequence, while temperature is a smoother, more predictable sequence less prone to sudden shifts. Following decomposition, the onshore wind generation modes become more reliable, predictable, and less noisy, allowing these features to contribute more significantly to the model, even more than temperature. It is also interesting to note that the residual components, the portions of the data that could not be assigned to a specific mode and that mostly contain noise, have the lowest influence on model predictions.

6.4.2 Wind farm dataset SHAP analysis

In Fig. 30 the impact of each observed feature for the best-performing LSTM-ERSA model can be seen for the wind farm dataset. Following this, Fig. 31 likewise demonstrates feature impacts after data decomposition using VMD, for the VMD-LSTM-ERSA model. Further details can be observed in the accompanying waterfall diagrams of each figure.

Fig. 30
figure 30

Wind farm dataset without decomposition best performing LSTM model feature impact determined through SHAP analysis

Fig. 31
figure 31

Wind farm dataset with VMD best performing LSTM model feature impact determined through SHAP analysis

The findings of the SHAP analysis indicate a strong influence of the wind power component, followed closely by wind speed. These findings are further reinforced by the analysis of the models working with decomposed data, where wind power modes, followed by wind speed modes, show a significant influence. It can also be noted that the influence of the residual components, consisting mostly of noise, is very low for most features. Nevertheless, the wind power residual does show a noticeable influence on the prediction.

7 Conclusion

This study employed and tuned the LSTM ML model with the aim of optimizing the prediction of power production from renewable sources. First, an improved variant of the swarm intelligence RSA metaheuristic was proposed that surpasses the known deficiencies of the originally introduced algorithm. The introduced algorithm, named ERSA, was then used to adjust the hyperparameters of the LSTM network for wind energy production problems.

The introduced methodology was assessed on two wind production datasets, in three scenarios for each dataset: without decomposition, with VMD, and with EMD employed. The predictions were executed up to three samples forward, and the outcomes were contrasted against those attained by competing metaheuristics applied in identical experimental setups. The obtained results clearly indicate the superiority of the proposed LSTM-ERSA model in all regarded scenarios. Statistical analysis concluded that the improvements attained by the proposed method are statistically significant. Additionally, the influence of proper parameter selection is clearly demonstrated, as the performance of the optimized networks is significantly improved. Finally, SHAP analysis was performed on the best-performing model on each dataset, aiming to assess the impact of the features on each model.

The conducted research has shown a great deal of potential for using hybrid ML models tuned by metaheuristic algorithms for wind power production forecasting. A task such as estimating the expected production of a wind farm is crucial, as the power grid must be capable of balancing power production and consumption at all times. Future research on this important topic will explore developing even more accurate models optimized by various metaheuristic algorithms, as well as emerging decomposition algorithms and their optimization. Additionally, the introduced modified metaheuristic will be applied to emerging challenges.