Introduction

Improvements in network technology coupled with the wide adoption of the internet have led to a paradigm shift in computing [1]. Demanding computational tasks have shifted from personal machines toward more affordable online service options. While presenting many advantages for users, cloud computing does not come without certain challenges [2]. Managing systems to achieve the best possible utilization is a complex task. Many factors need to be considered and accounted for to make responsible decisions. Overestimating demand can lead to excessive resource allocation, while underestimation can lead to reduced quality of services or even a temporary loss in service.

With this in mind, it is evident that a robust system for forecasting cloud service load is needed. One possible approach is the application of novel artificial intelligence (AI) techniques that are not only capable of observing many factors and determining nonlinear correlations between data features but are also capable of accounting for temporal factors. The problem of cloud load forecasting can thus be formulated as a time-series forecasting challenge. Originally introduced to allow artificial neural networks (ANN) [3] to account for sequences of data, recurrent neural networks (RNN) [4] make use of recurrent connections that allow previous outputs to affect future predictions. However, when tackling complex data sequences, models may struggle to determine subtle relations. To resolve this, attention mechanisms [5] are integrated into networks. Furthermore, decomposition techniques can be applied; this work uses variational mode decomposition (VMD) [6] to preprocess the data used for analysis.

The role of cloud computing has grown in recent years, with many services relying on cloud infrastructure for their respective businesses. This has, in turn, increased the demand for intelligent approaches to predicting demand. The use of machine learning (ML) is one potential approach for tackling this complex nonlinear problem. Researchers have explored the potential of various ML methods for tackling this task [7] and have demonstrated promising outcomes. However, as many ML algorithms rely on proper hyperparameter selection, the use of metaheuristics to optimize performance could potentially yield improvements.

Research into workload scheduling has provided much insight into the demands of cloud providers [8], and the advantages of various approaches are outlined in that study. One promising approach is the application of deep neural networks (DNN) [9]. Recurrent versions of these networks have also been explored in preceding works [10] and have demonstrated remarkable outcomes. However, recent advances in recurrent networks have highlighted that attention mechanisms show great potential for improving performance further.

Metaheuristics have been established as powerful tools for scheduling tasks in cloud environments [11]. Furthermore, metaheuristics are a popular choice for hyperparameter optimization [12]. There is great potential for improving forecasting methodologies through the application of optimization metaheuristics. It is also important to note that, as per the NFL theorem [13], experimentation is needed to determine the best combination of methods. The use of decomposition techniques for time-series forecasting also deserves note, as it has not yet been sufficiently explored: optimization techniques can yield better-performing decompositions, and those decompositions can in turn help forecasting algorithms make more accurate prognoses.

The field of cloud-computing forecasting faces various challenges, including the dynamic and unpredictable nature of workloads, necessitating accurate predictions for optimal resource allocation. However, the potential of coupling RNNs with attention mechanisms and integrating them with optimization metaheuristic algorithms for hyperparameter tuning remains largely unexplored in existing literature. Exploring the combination of RNNs with attention layers and metaheuristic algorithms for hyperparameter optimization holds great promise for enhancing the accuracy and efficiency of cloud load forecasting models, paving the way for improved resource management in cloud environments.

Further, cloud forecasting applications based on ML have yet to be extensively evaluated on the particularly challenging dataset used in this work. Due to several peak demand spikes occurring in the dataset, it forms a more realistic representation of the demands often faced in cloud-computing environments, which also makes demand more difficult to forecast accurately.

This work proposes an RNN [14] with an attention mechanism-based approach for tackling cloud-load forecasting. In addition, a modified version of the well-known particle swarm optimization (PSO) [15] metaheuristic is introduced to tune the hyperparameters of the network. Decomposition techniques are applied to help the models deal with data complexity. The proposed approach has been tested on a real-world cloud workload dataset. Two experiments have been conducted. The first experiment tested the combined potential of VMD with RNN, and the second evaluated VMD combined with RNN with attention. The modified metaheuristic has been compared to the original as well as several other state-of-the-art metaheuristics based on their ability to improve the performance of RNN and RNN with attention through hyperparameter optimization. The introduced approach demonstrates superior performance compared to contemporary ML techniques and shows great potential when applied to cloud-load forecasting, provided hyperparameter tuning is properly performed. Finally, the best-performing models have been subjected to SHapley Additive exPlanations (SHAP) [16] to determine the impact each feature has on the model decisions.

Based on everything stated so far, the main contributions of this work may be summarized as follows:

  1.

    A suggestion is made for an RNN-attention-based method that leverages decomposition to improve time-series forecasting. This approach aims to enhance the practical applications of time-series-based prognosis. The potential of the attention mechanism has yet to be fully explored as applied to cloud-load forecasting in combination with decomposition techniques, and this work seeks to fill this research gap.

  2.

    A modified version of the well-known PSO metaheuristic is introduced. This adaptation specifically targets the optimization of hyperparameters in the RNN-attention-based model. The proposed modification aims to improve the practical implementation and effectiveness of the model. The optimized models generated by the introduced modified algorithm show clear statistically significant improvements over the original algorithm as well as several contemporary optimizers.

  3.

    The suggested approach is applied to address real-world cloud-load forecasting challenges. By leveraging the RNN-attention-based model with decomposition, this application seeks to enhance the accuracy and reliability of cloud-load predictions in practical scenarios.

  4.

    The best model is interpreted using SHAP analysis, providing insights into the influence of each feature on cloud load. This analysis technique holds great potential as a valuable tool for cloud providers, offering practical insights into optimizing cloud services based on feature importance.

The remainder of this work is structured as follows: Sect. 2 presents preceding works that helped inspire, motivate, and provide a basis for this work. In Sect. 3, the introduced methods are described in detail. The experimental setup, utilized dataset, and evaluation metrics are described in Sect. 4, while the attained results are shown and discussed in Sect. 5. Finally, Sect. 6 gives a concluding word on the work covered in this research and proposes potential future work in this field.

Literature review and background

This section presents preceding research that helped inspire and guide this work. The decomposition technique is also described, followed by a detailed description of RNN and the utilized attention mechanism. Finally, a brief overview of metaheuristic optimization is given.

Preceding works

The concepts that form the foundation of cloud computing [1] have existed for some time. However, in the early 2000s, computational resources became both more available and more in demand. As the delivery of computational services over the Internet became viable, the paradigm behind cloud computing was formed. While the shift toward the cloud brings many advantages, this approach is not without its challenges [2].

One major challenge present in cloud computing is security [17]. Securely exchanging data between providers and users, maintaining data integrity, and doing so quickly are essential for cloud infrastructure to support operations. Researchers have applied several state-of-the-art approaches for intrusion detection within cloud computing. While traditional models for maintaining security, such as firewalls and block lists, exist, significant improvements have been made by integrating powerful AI algorithms into security systems.

An emerging paradigm that has emphasized the importance of the cloud is the adoption of the Internet of Things (IoT) [18]. With IoT, several smaller, less computationally powerful devices can be used as sensors to handle complex tasks such as environmental monitoring. These devices are by design capable of connecting to a network and primarily operate over the Internet. These interactions have led to a modern-day revolution across many industries, dubbed the fourth industrial revolution, or Industry 4.0 [19, 20].

A pressing issue to be considered by cloud service providers is maximizing available hardware utilization [21]. One major advantage of using cloud-based computing is that users can access significant computational resources without owning and maintaining the hardware [22]. This hardware is nonetheless owned and maintained by the service provider. In order to cover costs, these resources need to be utilized efficiently.

One further reason why efficient management is important is user satisfaction [23]. Computational resources need to be available conveniently and without long load times in order to attract and maintain a stable consumer base. Further, significant delays in hardware access can end up being costly in the long run.

Preceding research [24] has tackled cloud-load forecasting through the application of deep learning. However, deep learning methods significantly increase computational demands and can, therefore, be expensive when applied in practice. Further, standard deep networks cannot address temporal sequences present in the data. These factors combined lead to shorter periods being addressed in research, as well as reduced longer-term accuracy. The application of decomposition has also not been explored in this context, with deep networks alone often struggling to detect existing trends in data. Nevertheless, introduced modifications may to some degree allow these networks to address temporal components, and researchers have made significant strides in improving the associated methodologies [25, 26]. However, the potential of these methods is yet to be fully explored when applied to cloud-load forecasting.

Further works [27] explored the use of various ML approaches for efficient resource management in cloud environments and introduced an approach for further improving accuracy. While the outcomes demonstrate decent accuracy and a substantial improvement in CPU utilization, the forecasts are relatively short-term, limiting the applicability of the approach to longer-term forecasting. In addition, the advantages of trend detection using decomposition techniques, as well as their coupling with hyperparameter optimization techniques, have not been explored in that work, leaving a research gap.

Fig. 1 Dataset training, validation and testing split plot for target feature

Fig. 2 Input features decomposed modes

Table 1 Network parameters considered for optimization with constraints

Other works have explored the potential of optimization coupled with ML for cloud-load forecasting [28]. Nevertheless, with many promising optimization algorithms, and a clear precedent set by the NFL theorem [13], further experimentation is needed to establish the viability of various metaheuristics for the given tasks, enabling better-informed decisions when selecting an algorithm for a specific application.

There is significant potential in recurrent, as opposed to deep, methods for forecasting. While deep methods are limited in their ability to model temporal components of data, recurrent models can adjust future outputs based on previous predictions. As forecasting is an important topic in modern research, several techniques have been developed. While recurrent networks are a popular choice, challenges with vanishing and exploding gradients persist in more complex architectures. Methods such as long short-term memory (LSTM) networks have been developed that help mitigate some of these training issues [29].

Few works in the literature explore the potential of optimizing decomposition techniques to improve performance, and as far as the authors are aware, no works currently in the literature couple the potential of metaheuristics with decomposition techniques applied to cloud-load forecasting. With all this in mind, this work explores the potential of coupling optimization and decomposition algorithms to form the basis of robust systems capable of estimating the computational load on cloud resources. As more advanced techniques become available, it is important to explore their potential to further improve the techniques available for dealing with increasingly pressing topics such as cloud-load forecasting.

Variational mode decomposition (VMD)

VMD [6] is a signal decomposition technique used in signal processing. The primary goal of such techniques is to separate a set of modes with varying frequencies from the original signal. VMD, in principle, locates a set of mutually orthogonal modes with localized frequency content. The decomposition is achieved by progressively optimizing according to Eq. (1):

$$\begin{aligned} E(V) = \int \left( \frac{1}{2} \Vert V'(t) \Vert _2^2 + \mu U(V(t)) \right) dt \end{aligned}$$
(1)

in which V(t) defines the modes of the signal, and \(V'(t)\) is a derivative of V(t) with respect to time. The regularization parameter is represented as \(\mu \) and balances mode smoothness and sparsity. Accordingly, function U(V(t)) promotes sparsity.

To determine adequate modes, the algorithm alternates between updating the penalty function and solving for the modes. Modes are determined by minimizing the energy function with respect to V(t). A Lagrange multiplier \(\alpha (t)\) is also introduced, giving Eq. (2):

$$\begin{aligned} E(V) = \int \left( \frac{1}{2} \Vert V'(t) \Vert _2^2 + \mu U(V(t)) + \alpha (t) \sum _{k=1}^K V_k(t)^2 \right) dt \end{aligned}$$
(2)

in which \(V_k(t)\) represents the k-th mode of a given signal. To update the penalty function, the energy function is minimized with respect to \(\alpha (t)\). This process involves setting the derivative of E(V) with respect to \(\alpha (t)\) to zero. The resulting function is shown in Eq. (3):

$$\begin{aligned} \frac{d}{dt} \alpha (t) = \mu \sum _{k=1}^K V_k(t)^2 - \lambda \end{aligned}$$
(3)

with the \(\lambda \) constraint defining the overall mode energy.
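
To make the procedure concrete, a minimal sketch of a VMD decomposition is given below. It assumes the third-party vmdpy package (the implementation used in the original experiments is not specified), uses a random placeholder signal, and borrows the parameter values reported later in this work.

```python
# Minimal VMD sketch, assuming the third-party "vmdpy" package; the signal
# is a placeholder and the parameter values mirror Table 4 (K=3, alpha=1062).
# This is illustrative, not the authors' implementation.
import numpy as np
from vmdpy import VMD

signal = np.random.rand(1024)   # stand-in for a CPU-load series

K = 3           # number of modes to extract
alpha = 1062    # bandwidth constraint on each mode
tau = 0.0       # noise tolerance (no strict reconstruction enforced)
DC, init, tol = 0, 1, 1e-7

# u holds the K modes; omega holds the estimated center frequencies
u, u_hat, omega = VMD(signal, alpha, tau, K, DC, init, tol)

# The residual forms the additional component used later in this work
residual = signal - u.sum(axis=0)
```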

Recurrent neural networks with attention mechanisms

An RNN [14] is a variation on the classical neural network, expanded to tackle sequences of data. While many traits of a neural network are retained, such as neurons and connections, an RNN is capable of repeating a certain operation over sequential inputs through the use of recurrent connections. This effectively lets the RNN retain a memory of processed values that can be used alongside future inputs. Given an input sequence \(I = \{i_1, i_2, i_3,\ldots , i_T\}\), for each step t the network repeats the operation described in Eq. (4):

$$\begin{aligned} \begin{bmatrix} \hat{o}_t \\ h_t \end{bmatrix} = \phi _W (i_t, h_{t-1}) \end{aligned}$$
(4)

where \(\hat{o}_t\) and \(h_t\) represent the output and hidden state at time t, respectively. Further, \(\phi _W\) represents a neural network parameterized by weights W. Said network takes the t-th input \(i_t\) as well as the previous hidden state \(h_{t-1}\) as inputs. The structure of an RNN is quite flexible and is, therefore, suitable for addressing many complex problems.
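
As a minimal illustration of Eq. (4), the sketch below performs one Elman-style recurrent step; the weight matrices and the tanh activation are assumptions made for demonstration only.

```python
# One recurrent step mirroring Eq. (4): the new hidden state depends on
# the current input and the previous hidden state. All weights are assumed.
import numpy as np

def rnn_step(i_t, h_prev, W_ih, W_hh, W_ho, b_h, b_o):
    h_t = np.tanh(W_ih @ i_t + W_hh @ h_prev + b_h)  # hidden state update
    o_t = W_ho @ h_t + b_o                           # output at step t
    return o_t, h_t
```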

The introduction of attention mechanisms into RNN architectures improves memory retention over longer sequences and thus improves performance. While there are several different ways of implementing attention, this work uses an implementation of Luong attention [30].

At each timestep t, the Luong attention mechanism computes weights \(w_t\) over an encoded source sequence, subject to Eq. (5):

$$\begin{aligned} \sum _s w_t(s) = 1, \;\;\; \text {and} \;\;\; \forall s w_t(s) \ge 0 \end{aligned}$$
(5)
Fig. 3 Utilized experimental framework

Table 2 Pre-experimental simulations—VMD overall metric evaluation outcomes

The output values predicted in said timestep represent a function of the RNN hidden state \(h_t\) and the weights of the encoded hidden states according to Eq. (6):

$$\begin{aligned} \sum _s w_t(s) * \hat{h}_s \end{aligned}$$
(6)

Major differences between attention mechanisms are in the way that \(w_t\) values are determined. The Luong attention model applies the softmax function to compute sequence scores over an entire sequence as described in Eq. (7):

$$\begin{aligned} w_t(s) \leftarrow \frac{\exp (\beta _t \cdot score(h_t, \hat{h}_s))}{\sum _{s'} \exp (\beta _t \cdot score(h_t, \hat{h}_{s'}))} \end{aligned}$$
(7)

where \(\beta \) represents a scaling control parameter of the attention mechanism. Score values for each sequence can be determined using the dot product of each RNN hidden state \(h_t\) and encoder hidden state \(\hat{h}_s\) transformed via matrix \(W_\alpha \) as shown in Eq. (8):

$$\begin{aligned} score(h_t, \hat{h}_s) \leftarrow h_t^T W_{\alpha } \hat{h}_s \end{aligned}$$
(8)

where \(h_t^T\) denotes the transpose of the hidden state \(h_t\).
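
A compact numpy sketch of Eqs. (5)-(8) follows, using the "general" score variant of Eq. (8); shapes, names, and the stability trick are illustrative assumptions.

```python
# Luong-style attention following Eqs. (5)-(8): scores via a learned matrix
# W_a, softmax weights over the encoder states, and their weighted sum.
import numpy as np

def luong_attention(h_t, H_enc, W_a, beta=1.0):
    """h_t: (d,) decoder state; H_enc: (S, d) encoder states; W_a: (d, d)."""
    scores = H_enc @ (W_a @ h_t)                 # Eq. (8) for every source s
    e = np.exp(beta * (scores - scores.max()))   # numerically stable softmax
    w_t = e / e.sum()                            # Eq. (7); satisfies Eq. (5)
    context = w_t @ H_enc                        # Eq. (6): weighted sum
    return context, w_t
```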

Metaheuristic optimization

Metaheuristic algorithms have proven capable of producing good approximate solutions to non-deterministic polynomial-time hard (NP-hard) problems. Among the most successful groups of these algorithms is swarm intelligence. This group applies swarm behaviors observed in nature, such as the ways animals hunt, forage, or mate, to algorithmic problem solving.

Some notable representatives from this group of metaheuristic algorithms include the firefly algorithm (FA) [31] and the artificial bee colony (ABC) [32]. Evolutionary algorithms represent another widely applied category of metaheuristic algorithms whose key traits are the exploitation of selection, recombination, and mutation. Examples of such solutions include the genetic algorithm (GA) [33]. Mathematics is a strong inspiration to this field and some of the higher performing algorithms belong to this category, such as the sine cosine algorithm (SCA) [34], and the arithmetic optimization algorithm (AOA) [35].

Metaheuristic algorithms have seen great success when applied to complex real-world problems. Some interesting examples include optimizing time-series forecasting for crude oil and stock forecasting [36]. Other relevant applications relate to forecasting applications in environmental sciences [37, 38].

Methods

This section describes the original PSO algorithm followed by the modified version of the metaheuristic. In addition, the introduced mechanisms used to make the modification are described in detail.

Table 3 Pre-experimental simulations—detailed metrics of the best VMD optimized by each metaheuristic

Original particle swarm optimization algorithm

The PSO [15] algorithm makes use of velocity vectors to update agent positions and has established itself as a powerful optimizer. Locations are updated based on a set of social rules that the population of agents follows and thus agents are directed toward more promising regions of the search space. This procedure has a stochastic nature and relies on particle memory as well as population experience.

The optimization process begins by generating a population of particles. These particles are randomly distributed across the search space based on Eq. (9) and assigned random velocities based on Eq. (10):

$$\begin{aligned} x^i_0 = x_{min} + r_1(x_{max} - x_{min}) \end{aligned}$$
(9)
$$\begin{aligned} v^i_0 = \frac{x_{min} + r_2 (x_{max} - x_{min})}{\Delta t} \end{aligned}$$
(10)

where \(x^i_0\) and \(v^i_0\) define the initial position and velocity of the i-th agent. Two independent random variables \(r_1\) and \(r_2\) between 0 and 1 are used to introduce randomness in the initialization stage. The lower and upper bounds of the vector are defined by \(x_{min}\) and \(x_{max}\).

The initially defined population is updated through several algorithm iterations in order to explore more promising areas within the search space. Particle positions are updated according to Eq. (11):

$$\begin{aligned} x^i_{k+1} = x^i_k + v^i_{k+1} \Delta t \end{aligned}$$
(11)

in which \(x^i_{k+1}\) defines the position of the i-th agent in the \(k+1\) iteration, \(v^i_{k+1}\) defines the corresponding agent’s velocity vector and \(\Delta t\) defines the timestep. Particle velocities are updated according to Eq. (12):

$$\begin{aligned} v^i_{k+1} = wv^i_k + c_1r_3 \frac{(p^i - x^i_k)}{\Delta t} + c_2r_4 \frac{(p^g_k - x^i_k)}{\Delta t} \end{aligned}$$
(12)

in which \(r_3\) and \(r_4\) are likewise independent arbitrary values from a uniform distribution from the range [0, 1], and the best so far determined position of the i-th agent is denoted as \(p^i\). The position of the best agent in the population is denoted as \(p^g_k\) in iteration k, while \(\Delta t\) is the current timestep. Three additional control parameters are present in this formula, the inertia parameter w, and two parameters \(c_1\), \(c_2\) that define agent and swarm confidence, respectively. Larger values for parameter w emphasize global exploration.
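
For clarity, a single PSO iteration implementing Eqs. (11) and (12) is sketched below, with the timestep \(\Delta t\) taken as 1; the control parameter values are illustrative defaults rather than tuned settings.

```python
# One PSO iteration per Eqs. (11)-(12), with dt = 1; (w, c1, c2) assumed.
import numpy as np

def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0):
    """x, v, p_best: (N, D) arrays; g_best: (D,) population-best position."""
    r3 = np.random.rand(*x.shape)
    r4 = np.random.rand(*x.shape)
    v_new = w * v + c1 * r3 * (p_best - x) + c2 * r4 * (g_best - x)  # Eq. (12)
    x_new = x + v_new                                                # Eq. (11)
    return x_new, v_new
```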

Modified PSO algorithm

The original PSO is an efficient algorithm for tackling optimization tasks. However, one drawback observed through extensive empirical testing on standard CEC benchmarking functions [39] is a tendency to get stuck in local minima, suggesting that the PSO lacks exploration potential.

Fig. 4 Pre-experimental simulations—objective function outcome plots for metaheuristic optimized VMD

Fig. 5 Pre-experimental simulations—indicator function outcome plots for metaheuristic optimized VMD

The modification introduced in this work alters the internal workings of the original PSO. The equation that determines particle relocation and relies on particle velocity is disregarded. Instead, a randomized linear combination of the agent's personal best \(x_{pBest}\) and global best \(x_{gBest}\) positions is used, as described in Eq. (13):

$$\begin{aligned} x_{ij}(t+1) = c_1 \times r_1 \times x_{pBest, ij}(t) + c_2 \times r_2 \times x_{gBest, j}(t) \end{aligned}$$
(13)

in which \(r_1\) and \(r_2\) are combination weights from the range [0, 1], and \(c_1\) and \(c_2\) are attraction parameters. By carefully tuning these parameters, a more suitable balance between exploration and exploitation can be established.

To boost exploration even further, the quasi-reflexive learning (QRL) [40] method is also incorporated into the algorithm. This process involves producing quasi-reflexive-opposite solutions to a given agent. In this way, if the current optimal solution is not within a truly promising region, the chances of the reflexive agent being in a better position are increased.

The quasi-reflexive-opposite individual \(x^{qr}\) of the solution x can be generated using Eq. (14), applied to each component of x:

$$\begin{aligned} x^{qr} = rnd\bigg (\frac{LB + UB}{2}, x\bigg ) \end{aligned}$$
(14)

where \(rnd\bigg (\frac{LB + UB}{2}, x\bigg )\) denotes the generation of a random value from a uniform distribution between \(\frac{LB + UB}{2}\) and x, with LB and UB denoting the lower and upper boundaries, respectively.

Due to the described modifications, the introduced algorithm is dubbed the modified PSO (MPSO). The pseudocode for the introduced algorithm is shown in Algorithm 1.

Algorithm 1 Pseudocode of the introduced MPSO algorithm
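
As a compact illustration of the two mechanisms appearing in the pseudocode, the velocity-free update of Eq. (13) and the QRL opposites of Eq. (14), a hedged sketch is given below; it omits bookkeeping such as fitness evaluation and boundary handling, and all shapes and parameter values are assumptions.

```python
# The two MPSO ingredients in isolation; shapes and parameters assumed.
import numpy as np

def mpso_position(x_pbest, x_gbest, c1=2.0, c2=2.0):
    """Randomized linear combination of personal and global bests, Eq. (13)."""
    r1 = np.random.rand(*x_pbest.shape)
    r2 = np.random.rand(*x_pbest.shape)
    return c1 * r1 * x_pbest + c2 * r2 * x_gbest

def quasi_reflexive(x, lb, ub):
    """QRL opposite per Eq. (14): uniform between the domain midpoint and x."""
    mid = (lb + ub) / 2.0
    low, high = np.minimum(mid, x), np.maximum(mid, x)
    return np.random.uniform(low, high)
```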

It is important to note that, because the newly generated solution in the introduced MPSO algorithm is not evaluated upon inception, the computational complexity remains unchanged. In other words, the computational complexity of the MPSO remains the same as that of the original PSO, shown in Big O notation in Eq. (15):

$$\begin{aligned} O(N) = N + T \cdot N \end{aligned}$$
(15)

where N denotes the number of solutions, and T the maximum number of iterations.

Experiments

In this section, the conducted experimentation is described in detail. The utilized dataset is presented, followed by descriptions of the experimental framework and solution encoding schemes.

Utilized dataset

To assess the proposed method for forecasting cloud load, the publicly available real-world GWA-T-12 Bitbrains dataset is utilized. The dataset contains traces of 1750 virtual machines running in a distributed service provider's data centers, which service major clients such as banks, credit card operators, and insurance companies.

The data are separated into several files, each containing information on a single virtual machine. The available features for each virtual machine include the timestamp of when the data was sampled, the number of virtual CPU cores provisioned, CPU capacity, CPU usage in MHz, and CPU usage in percentage. In addition, memory and disk features are available, including memory capacity in KB, memory usage in KB, disk read throughput in KB/s, and disk write throughput in KB/s. Finally, two network features are also available: network received in KB/s and network transmitted in KB/s. Data for machine 2 were used, with CPU usage in percentage as the prediction target.

Due to the high computational demands of optimization research, only data from one virtual machine have been subjected to analysis. To further reduce computational workload, only a selected subset of the most relevant features is used, including the timestamp, CPU load, disk write throughput, network transmitted throughput, CPU capacity, and memory usage. The data cover a period of 1 month at a 5-minute resolution and consist of 8635 samples. The available data have been separated with 70% used for training, 10% for validation, and the remaining 20% for testing. This split is shown for the target feature in Fig. 1.

Table 4 Parameters selected by each metaheuristic for the respective best-performing VMD
Table 5 Overall objective function values of RNN models optimized by each metaheuristic

Experimental setup, solution encoding and simulation framework

For the purpose of experimentation, two experiments have been carried out. In the first experiment, VMD-decomposed outputs were used as inputs for a standard RNN (VMD-RNN). In the second experiment, the same decomposed VMD outputs were used as inputs for an RNN with an attention mechanism (VMD-RNN-ATT). These acronyms are used hereinafter to distinguish between the methods in the results section.

VMD was applied to the dataset with a K value of three, which was empirically determined, in the process described in the following section, to give the best results on this dataset. This yields a total of four components per feature: three VMD modes and one residual component. The resulting modes for each input feature are shown in Fig. 2.

The resulting features were then formulated as a time series. A total of six lags were provided to each network as inputs, and the networks were tasked with forecasting load three steps ahead.
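
A sketch of this windowing step is shown below; the function and variable names are assumptions, but the six-lag inputs and three-step-ahead targets follow the setup just described.

```python
# Shape decomposed features into supervised samples: six input lags and
# three-step-ahead targets. Names are illustrative assumptions.
import numpy as np

def make_windows(features, target, n_lags=6, horizon=3):
    """features: (T, F) array; target: (T,) series. Returns X, y."""
    X, y = [], []
    for t in range(n_lags, len(target) - horizon + 1):
        X.append(features[t - n_lags:t])   # previous six observations
        y.append(target[t:t + horizon])    # next three load values
    return np.array(X), np.array(y)
```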

Several state-of-the-art metaheuristics including the introduced MPSO algorithm, the original PSO [15], GA [33], ABC [32], FA [31], RSA [41], ChOA [42] as well as advanced metaheuristics such as COLSHADE [43], and SASS [44] have been evaluated on their ability to select the best possible parameters for these algorithms.

The parameters optimized by these metaheuristics include the learning rate, dropout, number of RNN layers, and the number of neurons in each respective layer. In addition, the number of training epochs has also been optimized for each type of network. The parameters' respective constraints are shown in Table 1. Networks utilizing attention were given more potential training epochs to compensate for the larger network structure, which often requires additional training. Early stopping is also utilized to help address overtraining: should a network not improve for \(\frac{epochs}{3}\) consecutive epochs, training is terminated early.

Each metaheuristic was assigned a population of five individuals, each representing a set of potential network parameters, and a total of six iterations to improve the solutions. Solution lengths vary depending on the number of layers in the network: the shortest potential solution has a length of four (learning rate, dropout, number of layers, epochs) plus the number of layers. In addition, when working with an RNN with an attention mechanism, the encoding is one element longer due to the added attention layer.
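
To illustrate this encoding, a hypothetical decoding routine is sketched below; the ordering of elements follows the description above and Table 1, but the exact layout used in the experiments is an assumption.

```python
# Hypothetical decoding of an agent into hyperparameters: four fixed entries
# plus one neuron count per layer (one extra layer with attention).
def decode_solution(agent, with_attention=False):
    params = {
        "learning_rate": agent[0],
        "dropout": agent[1],
        "n_layers": int(round(agent[2])),
        "epochs": int(round(agent[3])),
    }
    n_counts = params["n_layers"] + (1 if with_attention else 0)
    params["neurons"] = [int(round(v)) for v in agent[4:4 + n_counts]]
    params["patience"] = params["epochs"] // 3   # early-stopping rule above
    return params
```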

A slightly lower number of iterations was allowed for each metaheuristic due to extensive computational demands. However, despite fewer iterations being allocated for optimization, a greater number of runs has been carried out to account for the stochastic nature of the optimization algorithms: the simulations are repeated over 30 independent runs to provide objective evaluations.

Finally, a visualization of the described experimental framework is shown in Fig. 3.

Evaluation metrics

For performance evaluation, several standard metrics [45] have been selected: mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (\(\text {R}^2\)), shown, respectively, in Eqs. (16), (17), (18) and (19). In all equations, \(\hat{y}_i\) represents the load forecast, \(y_i\) is the actual value, and \(\bar{y}\) is the arithmetic mean of the actual values. Finally, N represents the total number of data samples:

$$\begin{aligned} MSE = \frac{1}{N} \sum _{i=1}^{N}\left( \hat{y}_{i}-y_{i}\right) ^{2} \end{aligned}$$
(16)
$$\begin{aligned} RMSE = \sqrt{\frac{1}{N} \sum _{i=1}^{N}\left( \hat{y}_{i}-y_{i}\right) ^{2}} \end{aligned}$$
(17)
$$\begin{aligned} MAE = \frac{1}{N} \sum _{i=1}^{N} |\hat{y}_{i} - y_{i} | \end{aligned}$$
(18)
$$\begin{aligned} R^2 = 1 - \frac{\sum _{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum _{i=1}^{N} (y_i - \bar{y})^2} \end{aligned}$$
(19)
Table 6 Detailed metrics of the best RNN models optimized by each metaheuristic

To further assess model performance, an additional metric is introduced. The Index of Agreement (IA) is calculated as described in Eq. (20):

$$\begin{aligned} IA = 1 - \frac{\sum _{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum _{i=1}^{N} \left( |\hat{y}_i - \bar{y} | + |y_i - \bar{y} |\right) ^2} \end{aligned}$$
(20)

where \(\hat{y}_i\) represents the forecast value, \(y_i\) is the actual value, and \(\bar{y}\) is the arithmetic mean of actual values.

To guide the optimization procedure, the selected objective function is MSE. As the goal of the optimization is to attain results that are as accurate as possible, this task is formulated as a minimization problem.
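
The metrics above translate directly into code; a straightforward numpy sketch is given below, with the returned MSE doubling as the objective function value.

```python
# Direct implementations of Eqs. (16)-(20); MSE serves as the objective.
import numpy as np

def evaluate(y_true, y_pred):
    err = y_pred - y_true
    mse = np.mean(err ** 2)                                    # Eq. (16)
    rmse = np.sqrt(mse)                                        # Eq. (17)
    mae = np.mean(np.abs(err))                                 # Eq. (18)
    ss_res = np.sum((y_true - y_pred) ** 2)
    r2 = 1.0 - ss_res / np.sum((y_true - y_true.mean()) ** 2)  # Eq. (19)
    denom = np.sum((np.abs(y_pred - y_true.mean())
                    + np.abs(y_true - y_true.mean())) ** 2)
    ia = 1.0 - ss_res / denom                                  # Eq. (20)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2, "IA": ia}
```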

Results and discussion

In this section, the outcomes of the two conducted experiments are presented and discussed. First, the process of selecting optimal VMD parameters using metaheuristics is performed and explored. Then, experiments with the best resulting VMD outcomes in combination with RNN are presented, followed by the experiments carried out with VMD and RNN with attention. Thereafter, a comparison between contemporary ML algorithms and the introduced methods is showcased. Following that, the outcomes are subjected to rigorous statistical analysis to determine the significance of the improvements made. Finally, the outcomes of SHAP analyses of the best-performing model are presented.

Pre-experimental simulations—VMD tuning

The initial step in experimentation is determining optimal decomposition parameters, as the best possible decomposition is required to facilitate further testing. Metaheuristics were leveraged to determine the optimal control parameters of VMD, including the K parameter (dictating the number of output modes) and the \(\alpha \) parameter (the bandwidth constraint of the decomposition). The outcomes attained by each method in terms of MSE are provided in Table 2.

As demonstrated by the outcomes in Table 2, optimizations performed by the introduced algorithm yielded the best outcomes. This is further confirmed by the detailed metrics in Table 3, where the introduced method attained the best results for one-step-ahead, two-step-ahead, and overall forecasts, being only slightly outperformed by the PSO algorithm for three-step-ahead forecasts. Nevertheless, this behavior is to be expected; as per the NFL theorem, no single approach is equally suited to all test cases.

Convergence rates, stability comparisons, and KDE plots for the objective and indicator functions are provided in Figs. 4 and 5, respectively.

Finally, the parameters selected by each metaheuristic for decomposition are provided in Table 4 to facilitate experimental repeatability.

As a result of the VMD experimental process, the parameters selected for VMD are \(K=3\) and \(\alpha =1062\), as these selections yielded the best outcomes during experimentation.

Fig. 6 Objective function outcome plots for metaheuristic optimized VMD-RNN method

Fig. 7 R\(^2\) function outcome plots for metaheuristic optimized VMD-RNN method

Experiment I (VMD-RNN)

In Table 5, the outcomes of the normalized objective function over 30 independent runs are shown. This is done to account for randomness intrinsic to metaheuristic algorithms. The results for the best, worst, mean and median are shown alongside outcome standard deviation and variance.

As the results in Table 5 indicate, the proposed approach outperformed competing metaheuristics. However, the FA has shown impressive stability across runs. Detailed metrics for the best-performing models are shown in Table 6.

As shown in Table 6, the introduced algorithm attained the best performance for two and three steps ahead, as well as the best overall performance. However, it is also important to note that the SASS approach demonstrated admirable performance for one-step-ahead forecasts. These findings are interesting but to be expected, as they further confirm the NFL [13] theorem. The improvements in convergence rates and outcome distributions are shown for the objective function in Fig. 6 and for the R\(^2\) function in Fig. 7 for easier observation.

Parameter selections for the best-performing models selected by each metaheuristic are shown in Table 7.

Table 7 Parameters selected by each metaheuristic for the respective best-performing RNN model

Experiment II (VMD-RNN-ATT)

In Table 8, the outcomes of the normalized objective function over 30 independent runs are shown. This is done to account for randomness intrinsic to metaheuristic algorithms. The results for the best, worst, mean and median are shown alongside outcome standard deviation and variance.

Table 8 Overall objective function values of RNN models with attention optimized by each metaheuristic

As the results in Table 8 indicate, the proposed approach outperformed competing metaheuristics. Notable exceptions are the worst-case outcome, where the SASS algorithm attained the best results, and the ChOA, which demonstrated admirable stability across runs. Detailed metrics for the best-performing models are shown in Table 9.

Table 9 Detailed metrics of the best RNN models with attention optimized by each metaheuristic

As shown in Table 9, the introduced algorithm attained the best results for overall R\(^2\), MSE and RMSE. However, the GA demonstrated the best results for two- and three-step-ahead forecasts, while COLSHADE demonstrated admirable performance for one-step-ahead forecasts. These findings are interesting but to be expected, as they further confirm the NFL [13] theorem. The improvements in convergence rates and outcome distributions are shown for the objective function in Fig. 8 and for the R\(^2\) function in Fig. 9 for easier observation.

Fig. 8 Objective function outcome plots for metaheuristic optimized VMD-RNN-ATT method

Fig. 9 R\(^2\) function outcome plots for metaheuristic optimized VMD-RNN-ATT method

The forecasts made by the best-performing model in comparison to the real expected values are shown in Fig. 10.

Parameter selections for the best-performing models selected by each metaheuristic are shown in Table 10.

An interesting observation is that the RNN model performed slightly better than the competing attention models. This is likely caused by the slower learning process of models that utilize attention layers. With added attention layers, networks require additional training and additional data. This comes with higher computational demands but also the potential to attain better results.

Comparison with other ML models

To further establish the advantages of the proposed approach, several well-established methods and different deep neural network (DNN) structures have also been subjected to testing under the same conditions. Each algorithm was tasked with determining upcoming cloud load based on an identical number of historical samples. Further, each algorithm was provided with the best attained VMD dataset. Tested methods include support vector machines (SVM), decision trees (DT), extreme gradient boosting (XGBoost), and kernel extreme learning machines (KELM), alongside several deep learning and recurrent network architectures. The methods were tested with the default suggested parameters. Implementations rely on the well-established Python ML and AI libraries Scikit-learn and TensorFlow.

Simulations for DNN and RNN were carried out with 500 training epochs and two- and three-layer architectures. In addition, 300 neurons were allocated per layer, with the early stopping criterion set at one-third of the total number of training epochs. Overall metrics for each approach are shown in Table 11.

As can be deduced from the experimental outcomes in Table 11, while other methods achieved admirable results, the best performance is demonstrated by the optimized model. These outcomes are closely followed by the three-layer attention networks, AdaBoost, and XGBoost methods. Nevertheless, the utilization of VMD with metaheuristic tuning demonstrates a clear superiority over existing approaches applied under identical conditions to the challenge of cloud-load forecasting.

Statistical validation

In modern computer science research, it is important to assess the statistical significance of improvements. Simply observing outcomes is inadequate for determining the superiority of one algorithm over others. In this study, nine metaheuristics, including the proposed MPSO, were evaluated for their effectiveness in optimizing RNN models with and without the attention mechanism for cloud-load prediction on a real-world cloud-computing dataset.

Fig. 10 Predictions made by the best-performing metaheuristic optimized RNN-ATT model

Table 10 Parameters selected by each metaheuristic for the respective best-performing RNN model with attention

According to previous research [46], statistical evaluations should only be conducted after each method has been adequately sampled. This involves determining objective averages through multiple independent executions for each problem. However, this approach may not always be conclusive, as it can result in misleading conclusions if the samples do not follow a normal distribution. In addition, there is ongoing debate among researchers regarding the appropriateness of taking the average objective function value when comparing stochastic methods [47].

To establish the statistical significance of the results obtained in this research, the best values obtained from 30 separate runs of each algorithm for both simulation scenarios (RNN with and without attention mechanism) were selected for generating results samples for each method and experiment. However, before applying an appropriate statistical test from the parametric or non-parametric family, it is necessary to investigate conditions for the safe use of parametric tests by examining the independence, normality, and homoscedasticity of the data variances, as recommended by [48].

The independence criterion is fulfilled, as each run is executed independently with its own pseudo-random number seed. However, the normality condition is not fulfilled, as the obtained samples do not stem from a normal distribution. This is evident from the KDE plots and is further supported by the Shapiro–Wilk test for single-problem analysis [49]. By performing the Shapiro–Wilk test, p values are generated for each method–problem combination, and these outcomes are presented in Table 12.

Table 11 Objective function outcomes attained by several state-of-the-art ML models

Standard threshold values, \(\alpha = 0.05\) and \(\alpha = 0.1\), indicate that the null hypothesis (H0) could be rejected, suggesting that none of the samples (for any of the problem-method pairs) originate from a normal distribution. As a result, the normality requirement for the safe use of parametric tests was not fulfilled, and it was deemed unnecessary to verify the homoscedasticity constraint.

Therefore, since the requirements for safe use of parametric tests were not met, it was decided to proceed with statistical analysis by performing non-parametric tests. In this study, the Wilcoxon signed-rank test, a non-parametric statistical test [50], was executed between the MPSO method and all other techniques for both problems (experiments). The same data samples as in the previous test (Shapiro–Wilk for checking normality condition) were taken for each method. The findings of this analysis are presented in Table 13, with p values above the \(\alpha =0.05\) threshold highlighted in bold.
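This two-stage procedure can be reproduced with scipy.stats; the sketch below is illustrative, assuming each argument holds the best objective values collected from 30 independent runs of one experiment.

```python
# Normality check (Shapiro-Wilk) followed by the non-parametric Wilcoxon
# signed-rank comparison against the MPSO sample; inputs are assumed to be
# 30-run result vectors for a single experiment.
from scipy.stats import shapiro, wilcoxon

def compare(mpso_runs, other_runs, alpha=0.05):
    _, p_norm = shapiro(other_runs)              # normality rarely holds here,
    _, p_diff = wilcoxon(mpso_runs, other_runs)  # hence the non-parametric test
    return p_norm < alpha, p_diff < alpha        # (non-normal?, significant?)
```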

Table 12 Shapiro–Wilk scores for the single-problem analysis for testing normality condition
Table 13 Wilcoxon signed-rank test findings
Fig. 11 Feature impacts in the best-performing load forecasting model

Table 13 displays the p values generated by the Wilcoxon signed-rank test, which demonstrate that the proposed MPSO method significantly outperformed all other techniques in both experiments, except the SASS metaheuristic in the simulation where the RNN with an attention layer was optimized. In this case, the generated p value was slightly above the threshold of 0.05 (bold value in Table 13), indicating that the MPSO exhibited performance similar to SASS in this test. The p values for all other methods were substantially lower than 0.05. Therefore, the MPSO method demonstrated robustness and efficiency as an optimizer in this computationally intensive experiment.

As a final remark based on statistical analysis, it can be concluded that the MPSO method performed significantly better than most of the other metaheuristics analyzed in both experiments.

Feature impact interpretation

SHapley Additive exPlanations (SHAP) [16] is an approach for explaining the outputs of ML models based on game theory. By subjecting well-performing models to SHAP analysis, the impact of real-world factors can be determined. To better understand the factors that play the largest role in cloud-load forecasting, the best-performing model, VMD-RNN-MPSO, has been subjected to SHAP analysis. The outcomes are shown in Fig. 11.
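
A minimal sketch of such an analysis with the shap package is shown below; the model, data, and feature names are placeholders, KernelExplainer is one model-agnostic choice, and inputs are assumed flattened to two dimensions for the explainer.

```python
# Illustrative SHAP workflow; "model", "X_train", "X_test", and
# "feature_names" are placeholders for the trained forecaster and its data.
import shap

background = X_train[:100]                      # reference sample for baselines
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test[:200])

# Global per-feature impact view, analogous to Fig. 11
shap.summary_plot(shap_values, X_test[:200], feature_names=feature_names)
```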

As demonstrated in Fig. 11, memory usage plays an important role in forecasting CPU load. The decomposed modes represent regularly occurring tasks that take place on a scheduled basis, with the lowest modes representing tasks that occur with the lowest frequency. Tasks that occur regularly but with low frequency are likely to have high computational demands. Therefore, the lowest decomposition mode of memory usage has the highest impact on forecasts. Interestingly, this mode is closely followed by the memory usage decomposition residuals, probably due to the sporadic nature of cloud service utilization. Memory most likely has the highest impact on CPU load, as it is directly related to the amount of data that needs to be moved into working memory, as well as the amount of data the processor must handle in upcoming cycles.

Conclusion

The increasing adoption of the IoT and various online services has driven demand for remote computation, and cloud computing has seen increased demand in recent years. Therefore, accurately estimating computational load can make a significant difference in the service quality and profitability of cloud services. This work proposes one possible approach to accurately forecasting computational load using time-series forecasting based on RNN with attention layers. The performance of this approach is further improved through hyperparameter optimization with metaheuristic algorithms. In addition, a modified version of the well-known PSO algorithm is introduced specifically for this research. Optimized networks have been evaluated on a real-world dataset and have shown promising outcomes. The best models have been subjected to SHAP analysis to determine feature impacts on forecasts. Taken together, the introduction of the RNN-ATT approach aided by VMD, coupled with the MPSO algorithm for hyperparameter selection, demonstrated notable improvements over existing machine learning techniques. These enhancements contribute to the growing body of evidence supporting the effectiveness and practical value of the proposed methodology in tackling time-series forecasting challenges.

Extensive computational demands impose certain limits on this research. Smaller metaheuristic populations with fewer optimization iterations have been used. In addition, other forms of RNN such as LSTM, bidirectional LSTM (BiLSTM) and gated recurrent unit (GRU) networks have not been tested in this work. Future work will focus on finding additional applications of the described approach to pressing real-world problems concerning time-series forecasting. Attention mechanisms also play an important role in popular AI subfields such as natural language processing where the potential of optimization techniques is yet to be fully applied and explored. Further, in future works, other forms of RNN, such as LSTM, BiLSTM and GRU, will be explored further and applied to load forecasting, and also optimized using variations of well-known optimizers to further improve performance.