Introduction

Recently, there has been a pronounced shift towards using individual virtual servers in large-scale cloud data centers with thousands of high-performance servers. For instance, cloud services provide elastic computing to end users at low cost based on virtualization technology [16, 68]. Virtual machine (VM) facilities allow cloud end users to scale up/down or relinquish their resource demands (e.g., CPUs/GPUs, memory, storage, etc.) and pay accordingly. Such frequent variations in this dynamic environment lead to a tradeoff between the service provider's profit and the end user's quality of service (QoS). More specifically, an underutilized server wastes resources and power, whereas an overutilized server degrades performance. Consequently, service providers need efficient techniques for optimal resource management [33, 68]. Managing and improving the provided services in such distributed systems poses several challenges. One major challenge is observing and monitoring these distributed systems to make accurate resource allocation decisions [58]. In particular, observability has become a critical prerequisite to guarantee stable services for end-user applications and maximize the profit for the service provider.

In general, there are two approaches for resource allocation: reactive and proactive [77]. The reactive approach offloads resources from overutilized servers to underutilized servers, and the offloading decisions rely on the current end-user utilization. Nevertheless, this causes unnecessary migrations because of sharp workload peaks. Hence, researchers exert continuous effort to improve the accuracy of proactive resource allocation techniques, where the VM migration decision depends on future workloads [71]. Most researchers focus on predicting the CPU utilization of servers [24, 54] or of individual VMs [55]. The motivation for focusing on CPU utilization stems from the fact that the CPU of a server incurs the most power consumption, and the relationship between energy consumption and CPU utilization is linear [15].

Focusing on proactive resource allocation approaches, we need an accurate forecasting technique. To that end, classical time-series (TS) techniques aim to model short-term forecasts. As CPU utilization data is time-series data, ARIMA models have been widely used for CPU utilization forecasting [57]. For example, researchers have used ARIMA models as a baseline against which to compare more sophisticated techniques [41]. The main drawback of classical time-series forecasting models is that they merely capture linear relationships. In addition, TS models require the input data to be stationary (whether in its raw form or as differenced data). Unfortunately, this assumption rarely holds: the authors in [55] performed the popular Kwiatkowski-Phillips-Schmidt-Shin (KPSS) stationarity test [40] for each VM and concluded that almost 70% of the tested PlanetLab VMs [60] are not stationary. Consequently, classical TS models cannot accurately predict their future CPU utilization. As a result, they used machine learning (ML) models to predict the CPU utilization, using lagged values of each time series as inputs to the model. Hence, in recent years many machine learning models, such as artificial neural networks (ANNs) [55, 66, 67] and support vector machines (SVMs) [6, 37, 55], have been proposed for modeling CPU utilization.

Deep learning (DL) methods have attracted remarkable attention during the artificial intelligence revolution of recent years. Deep-learning-based prediction models outperform traditional machine learning models in several applications, especially cloud workload prediction [48]. Thus, the accuracy of CPU utilization prediction could increase using a recurrent neural network (RNN), which maps inputs to target vectors while retaining a history of previous inputs. Nevertheless, RNNs suffer from the vanishing gradient problem with long sequences [57]. The long short-term memory (LSTM) network, proposed by Hochreiter and Schmidhuber [35], is an effective solution to the vanishing gradient problem. LSTM achieves a considerable improvement in capturing long-term temporal dependencies. Thus, LSTM can accurately predict highly fluctuating time-series data [59, 76]. Recently, the generative adversarial network (GAN), proposed by Goodfellow [30], has achieved remarkable improvements in different research areas. In particular, GANs have been used for the prediction of highly volatile cloud traces, as in [85]. This motivates our interest in investigating the performance of GANs for workload prediction. GANs employ two deep learning networks, namely, the generator and the discriminator. The generator generates artificial data samples that mimic the distribution of the actual data. The discriminator, in turn, tries to differentiate between the actual data samples and the samples artificially generated by the generator. By providing a feedback signal from the discriminator to the generator, the generator improves its data generation model.

Moreover, many forecasting studies have investigated the selection of technical indicators (TIs) as inputs to machine learning/deep learning models in order to extract additional features [74]. Many efforts study how to determine the optimal combinations of TIs or their parameters.

The main challenge in cloud prediction is the need for an effective nonlinear model that tracks the cloud workload [45, 79]. Furthermore, the workload value frequently undergoes excessive changes [62]. This motivates our interest in recasting the over-utilized server detection problem as a workload trend prediction rather than a value prediction. In other words, the system migrates VMs from over-utilized servers only if the future workload trend is "up". This idea is inspired by stock price prediction, where researchers have demonstrated that framing trend prediction as a classification problem can improve prediction accuracy using machine learning and deep learning models [23, 70].

Therefore, the principal contribution of this paper is a novel nonlinear prediction model, named value trend generative adversarial network (VTGAN), that deals with the high frequency and volatility of cloud workloads. Additionally, this paper presents a novel classification approach to predict the trend of workload data. In our proposed VTGAN prediction model, we use a GAN in which a long short-term memory (LSTM) or gated recurrent unit (GRU) model is the generator and a convolutional neural network (CNN) model is the discriminator. The proposed system makes the following research contributions:

  • We use GAN models to build cloud workload prediction models. Moreover, GANs were not applied before in cloud data centers, whether in simulated or real environments, making our model one of the pioneers in cloud workload prediction.

  • In addition, we compare the results of the proposed models with state-of-the-art time series, ML, and DL models, such as ARIMA, SVR, LSTM, and GRU.

  • We propose a classification approach to predict the trend instead of the value of the cloud workload.

  • We study the effect of using common technical indicators.

  • We also study and test the window input size and multi-step prediction using our model.

The structure of this paper is as follows: Section “Related work” presents the related work. Section “Proposed architecture” introduces the mathematical model. Section “Experimental configuration and evaluation methodology” shows the experimental set-up and the methodology of the evaluation conducted in this work. Section “Results and discussions” analyzes the performance results. Section “Conclusions and future works” summarizes our concluding remarks.

Related work

During the last decade, machine learning and deep learning approaches have revolutionized the scientific and industrial communities. In the sequel, we focus on enumerating research works concerning the time-series prediction area. Figure 1 illustrates a taxonomy of time-series prediction models. Classically, most works deal with workload forecasting as a value prediction problem (a.k.a. regression). We classify the regression models into four main categories: (i) Traditional time series models, (ii) Machine Learning models, (iii) Deep learning models, and (iv) Hybrid Techniques. Nevertheless, in this work, we will introduce a trend prediction approach (a.k.a. classification), where we focus on predicting the sign of workload change.

Fig. 1: Taxonomy of cloud workload prediction models

Traditional time series approaches

As cloud workload data is naturally temporal, researchers have used different time-series forecasting models for predicting workload traces. The autoregressive moving average (ARMA), a traditional time-series forecasting model, is used in [17] to predict cloud workload for resource allocation. The authors reported that this approach is unsuitable for most cloud workload traces, particularly for highly volatile workloads. Also, Vazquez et al. [81] applied several time-series prediction models, such as AR, MA, simple exponential smoothing (SES), double exponential smoothing (DES), error trend seasonal exponential smoothing (ETS), and ARIMA, to forecast cloud workloads. They evaluated the forecasting accuracy of each model on two real cloud workloads, namely, Google cluster data and Intel Netbatch logs. The authors concluded that no model is consistently superior to the others for all datasets.

Vashistha and Verma [80] presented a cloud workload prediction survey based on time series models, where some researchers applied AR [37,38,39, 46], MA [37, 38, 81], and ARIMA [7, 17, 18, 28, 38, 46, 81]. In addition, other researchers proposed extended versions of the ARIMA model for workload prediction, such as autoregressive moving average with exogenous inputs (ARMAX) [88], cumulative moving average (CMA), weighted moving average (WMA) [29], difference model (DM), and median model (MM) [38].

Although such traditional time-series approaches were ubiquitous in the last decade, these models are not appropriate for long-term time-series data [47]. Moreover, these models assume that the input data is stationary, which is not a valid assumption for most cloud workload traces [55]. Therefore, ML approaches appear to be a natural alternative to traditional time-series models and a step toward more accurate cloud workload prediction.

Machine learning approaches

ML models have been widely used as an alternative to traditional time-series forecasting. Thus, researchers have proposed several ML prediction models for cloud applications. Farahnakian et al. [25] proposed a linear regression (LR) algorithm to predict the CPU utilization of servers in the context of proactive server overload detection. In follow-up work, they used a K-nearest neighbor (KNN) regression model instead of the linear regression model. They demonstrated that this approach is superior in terms of energy consumption and system performance [26].

Patel et al. [63] proposed the support vector regression (SVR) and ARIMA models to predict VM memory during live migration in order to calculate the migration time. The SVR model has limited capability to improve prediction accuracy because it consists of a single hidden layer. Cortez et al. [21] used gradient boosting tree and random forest models to predict the resource management of VMs allocated on the Azure cloud platform. They used a dynamically linked library (DLL) to collect the result after each estimation process and then decided whether the prediction was trusted based on the DLL score.

Nguyen et al. [34] used a multiple linear regression (MLR) method to predict overutilized and underutilized servers. They integrated their prediction technique with traditional consolidation frameworks to reduce energy consumption.

Moghaddam et al. [55] proposed different ML algorithms for overload detection in the VM consolidation framework. They developed several ML prediction algorithms for individual VMs to predict the most suitable time for migration from overutilized servers. They implemented their approach on PlanetLab traces using the CloudSim simulation tool [60]. Their framework was compared against LR-MMT-PBFD, the baseline used in most publications. Nevertheless, they did not measure the prediction accuracy of the proposed ML models and instead implemented them directly in the VM consolidation framework. Thus, in this paper, we evaluate the accuracy of our approaches before integrating them with the whole system in future work.

Despite their reasonably fast prediction of cloud workloads, ML approaches do not achieve high prediction accuracy on highly dispersed workloads because of the non-linearity and complexity of cloud workloads. Hence, the third direction was deep learning (DL) approaches, which aim at higher prediction accuracy.

Deep learning approaches

Due to the recent success of DL in various applications, several works employed DL approaches for time-series analysis and prediction [27]. Specifically, the recurrent neural network (RNN) has outstanding sequential processing capabilities. Therefore, authors in [24, 36, 87] proposed an RNN-based model to predict the future workloads in cloud data centers. However, previous research showed that traditional RNNs struggle to capture long-term dependencies due to the vanishing gradient problem [14, 82]. To solve this issue, LSTM [31] and GRU [20] were developed for better dealing with long-term dependencies [19, 42]. Consequently, Song et al. [76] used the LSTM network for workload prediction to improve their previous RNN-based work [84]. GRU is much less computationally intensive than LSTM due to its ability to converge with fewer parameters [20]. Nevertheless, there is little research work based on GRU networks [19, 32] for workload prediction in the cloud environment.

Focusing on convolutional neural networks (CNNs), Mozo et al. [56] used a CNN to predict short-term network traffic in data centers. [56] is considered the only work that uses a pure CNN approach for prediction in the cloud environment, because CNNs are also unsuitable for long-term dependencies: CNN models fundamentally focus on extracting features and inter-dependencies from the input sequence and do not use any historical data during the learning process [69].

The nature of cloud workloads is always dynamic and complex. Thus, none of the previous approaches achieved acceptable prediction accuracy due to the long-term dependencies, complexity, and non-linearity of cloud workload traces. As a result, researchers have recently turned to hybrid approaches rather than single models.

Hybrid approaches

Finally, hybrid approaches are an amalgamation of various time-series algorithms aiming to forecast complex time-series traces [85]. Liu et al. [52] proposed a hybrid prediction model that combines ARIMA with LSTM models. Their results illustrated that their model improved the prediction accuracy by 6% and 66% compared to the pure LSTM and pure ARIMA models, respectively. Also, Shuvo et al. [73] proposed a hybrid prediction model, namely LSRU, that combines GRU with LSTM. They showed that LSRU achieves better accuracy than a pure LSTM or GRU model. Bi et al. [13] proposed a hybrid prediction model integrating bi-directional and grid long short-term memory networks (BG-LSTM) for high accuracy.

The combination of CNNs and LSTMs is one of the popular hybrid schemes for time-series prediction [85]. Regarding cloud environments, Ouhame et al. [59] proposed a hybrid prediction model that combines a CNN with an LSTM model. This combination helps to extract complex features of the VM usage components, in addition to modeling temporal information of irregular trends that may arise in the time series. Their results illustrated that this hybrid model is more accurate than the VAR-MLP, VAR-GRU, and ARIMA-LSTM hybrid models.

Recently, the invention of GANs revolutionized DL, achieving remarkable improvements in several fields, such as computer vision and audio. Goodfellow et al. developed GANs in 2014 [30]. Until now, few works have considered GANs for time-series cloud workload prediction. The first approach for cloud workload value prediction, E2LG, was proposed by Yazdanian and Sharifan [85]. They combined LSTM networks as a generator and CNNs as a discriminator. This hybrid model can effectively capture the long-term nonlinear dependencies of time series and is suitable for high-frequency data. E2LG improved prediction accuracy significantly in the cloud environment. Also, Lin et al. [51] proposed a GAN-based method for realistic cloud workload generation to capture the data distribution and generate high-quality workloads. Generated workloads are useful to mimic real data. In addition, their model can easily generate specific kinds of workloads according to the input. However, their model aimed to generate synthetic data that has a distribution similar to the real data. Unlike their approach, we aim to predict near-future utilization by considering recent historical data in order to deal with unexpected changes instantaneously.

Table 1 summarizes publications on previous cloud workload prediction approaches. These publications are classified according to their learning category, method, dataset, and weakness.

Table 1 Comparison of cloud workload prediction models

In this paper, we use a modified version of GAN to predict the trend rather than the value. Therefore, the resource allocation decision will be based on the trend. This approach is a pioneering one in cloud workload prediction. Also, we study the effect of using technical indicators (TIs), Fourier transforms, and wavelet transforms on the performance of our regression and classification models.

Proposed architecture

We propose a modified version of GAN to predict future workload values. The proposed model is a step towards a proactive overload detection technique in the resource management framework for cloud data centers. This technique prevents unnecessary migrations by making migration decisions from the over-utilized server based on the predicted CPU utilization value. In addition, we present an alternative solution to make the migration decision based on the future trend of the cloud workload. For this trend prediction, we cast the prediction problem as trend classification (in contrast to the regression problem corresponding to the workload value prediction).

In our suggested workload prediction system, we use a GAN network. In our proposed GAN architecture, the GRU or LSTM model represents the generator, which learns to generate workload values that are consistent with the statistical distribution of the actual workload. In addition, our GAN model includes a 1D-CNN model as the discriminator, which learns to differentiate between actual and artificially generated workloads. Through the interaction between the generator and discriminator, the workload prediction accuracy improves. The LSTM and GRU are suitable for predicting time-series data. To further enhance the prediction accuracy in multi-step-ahead prediction, our proposed system uses technical indicators (TIs) as feature extraction mechanisms. Moreover, we apply and test Fourier and wavelet transform functions as additional TIs that remove redundant data.

Data preprocessing

To improve the predictive performance of our model, we pre-process the data to highlight oscillations and trends in the workload trace. To that end, we study the use of seven technical indicators (TIs) as additional features. We note that the works [9] and [22] used a subset of these TIs. We extend some of the TIs in [43] to include short-term and long-term moving averages (MAs). These MAs smooth the workload trace, discard short-term fluctuations, and highlight overall trends and/or cycles of the workload time series. In the sequel, we enumerate the full list of our proposed TIs (a short computation sketch follows the list):

  • Moving averages (MAs): MAs often capture trends by smoothing a CPU utilization series using a lag factor of order n. Long MA indicators illustrate changes in CPU utilization that are less sensitive to recent utilization movements than short MAs, because the longer the MA is, the smoother and less accurate the output is. We calculate the MA by Eq. (1), where \(p_t\) is the CPU utilization value at time t.

    $$\begin{aligned} MA(p_{t},n)=\frac{p_{t}+p_{t-1}+\cdots +p_{t-(n-1)}}{n} =\frac{1}{n} \sum \limits _{i=0}^{n-1} p_{t-i} \end{aligned}$$
    (1)
  • Exponential Moving Average (EMA): EMA is a particular moving average indicator, which exponentially averages historic CPU utilization. Unlike simple MAs, EMA can place more weight on recent CPU utilization. More specifically, the influence of previous CPU utilization samples decreases exponentially fast in the EMA indicator. Hence, it reflects directly on the immediate trend [22]. We calculate EMA according to (2),

    $$\begin{aligned} EMA(p_{t},s)=\frac{p_{t}+\alpha p_{t-1}+\cdots +\alpha ^{t} p_{0}}{1+\alpha +\cdots +\alpha ^{t}} \end{aligned}$$
    (2)

    where s is a tuning parameter to control the importance of the recent past, and \(\alpha\) is a weighting term (\(\alpha =\frac{s-1}{s+1}\)).

  • Moving Average Convergence Divergence (MACD): It gives insight into workload convergence, divergence, and crossover [22]. It reflects the difference between a short-term (fast) EMA and a long-term (slow) EMA, capturing the second derivative of a CPU utilization series. We calculate MACD according to (3),

    $$\begin{aligned} MACD(p_{t},s_1,s_2)=EMA(p_{t},s_1) - EMA(p_{t},s_2), \quad s_2 > s_1 \end{aligned}$$
    (3)
  • Moving Standard Deviation (MSD): MSD measures the volatility (i.e., the rate of change) of CPU utilization over the last n time slots. It is helpful in predicting the magnitude of future CPU utilization changes, since low-volatility periods are often followed by high-volatility periods. We calculate MSD according to (4),

    $$\begin{aligned} MSD(p_{t},n)=\sqrt{\frac{1}{n} \sum \limits _{i=0}^{n-1} (p_{t-i}-MA(p_{t},n))^2} \end{aligned}$$
    (4)
  • Bollinger Bands (BBANDs): Bollinger Bands are indicators plotted a number of standard deviations above and below a simple moving average. BBANDs consist of the upper band (\(BBAND^{+}\)) and the lower band (\(BBAND^{-}\)) [22]. Bollinger Bands are useful for comparing volatility against relative CPU utilization levels over a period of time. We calculate \(BBAND^{+}\) and \(BBAND^{-}\) by Eqs. (5) and (6).

    $$\begin{aligned} BBAND^{+}(p_{t},n)&= MA(p_{t},n)+2 \times MSD(p_{t},n) \end{aligned}$$
    (5)
    $$\begin{aligned} BBAND^{-}(p_{t},n)&= MA(p_{t},n)-2 \times MSD(p_{t},n) \end{aligned}$$
    (6)
  • Momentum (MOM): MOM measures CPU utilization differences over relatively short periods to follow the speed of the changes in utilization. We use log momentum to center the values around zero; it is often used to predict reversals [9]. We calculate MOM using (7) as,

    $$\begin{aligned} MOM(p_{t},n)=\log (p_{t}-p_{t-n}) \end{aligned}$$
    (7)
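The minimal sketch below illustrates, under stated assumptions, how these indicators can be computed from a CPU utilization series with pandas. The window lengths are illustrative choices, pandas' exponential weighting is used as a stand-in for the \(\alpha=\frac{s-1}{s+1}\) form of Eq. (2), and a signed-log variant of Eq. (7) is used so that negative differences remain defined.

```python
import numpy as np
import pandas as pd

def technical_indicators(util: pd.Series, n: int = 7, s1: int = 12, s2: int = 26) -> pd.DataFrame:
    """Compute the TIs above for a CPU utilization series (illustrative window sizes)."""
    ti = pd.DataFrame(index=util.index)
    ti["MA_short"] = util.rolling(n).mean()              # Eq. (1), short-term MA
    ti["MA_long"] = util.rolling(3 * n).mean()           # long-term MA
    ti["EMA_fast"] = util.ewm(span=s1).mean()            # Eq. (2)-style EMA (pandas weighting)
    ti["EMA_slow"] = util.ewm(span=s2).mean()
    ti["MACD"] = ti["EMA_fast"] - ti["EMA_slow"]         # Eq. (3)
    ti["MSD"] = util.rolling(n).std(ddof=0)              # Eq. (4)
    ti["BBAND_up"] = ti["MA_short"] + 2 * ti["MSD"]      # Eq. (5)
    ti["BBAND_low"] = ti["MA_short"] - 2 * ti["MSD"]     # Eq. (6)
    diff = util - util.shift(n)                          # Eq. (7): log momentum, here as a
    ti["MOM"] = np.sign(diff) * np.log1p(diff.abs())     # signed log to handle negative differences
    return ti

# Example usage: ti = technical_indicators(pd.Series(cpu_trace)), where cpu_trace
# holds the 5-minute CPU utilization samples of one VM or server.
```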

In summary, Fig. 2 plots the selected TIs after applying them to the PlanetLab dataset (200 time slots), which is described in Section "Dataset".

Fig. 2: Selected technical indicators after being applied to the used dataset (200 time slots)

Then, we study applying and testing Fourier and wavelet transforms as additional features, since Fourier and wavelet transforms remove redundant data and retain the most relevant information [8]. Therefore, these approximation tools could help the deep learning network predict trends more accurately.
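As a rough illustration of how such transforms can act as denoising feature extractors, the sketch below keeps only the strongest Fourier components and the coarse wavelet approximation. The number of retained components, the 'db4' wavelet, and the decomposition level are assumptions for illustration, and PyWavelets is assumed to be available.

```python
import numpy as np
import pywt  # PyWavelets

def fourier_approximation(x: np.ndarray, k: int = 10) -> np.ndarray:
    """Smooth a trace by keeping only the k strongest FFT components."""
    spec = np.fft.fft(x)
    keep = np.argsort(np.abs(spec))[-k:]        # indices of the k strongest components
    filtered = np.zeros_like(spec)
    filtered[keep] = spec[keep]
    return np.real(np.fft.ifft(filtered))

def wavelet_approximation(x: np.ndarray, wavelet: str = "db4", level: int = 3) -> np.ndarray:
    """Smooth a trace by dropping the wavelet detail coefficients."""
    coeffs = pywt.wavedec(x, wavelet, level=level)
    coeffs[1:] = [np.zeros_like(c) for c in coeffs[1:]]   # keep only the coarse approximation
    return pywt.waverec(coeffs, wavelet)[: len(x)]
```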

VTGAN models

We use the GAN network to predict the value and trend of future CPU utilization, i.e., to predict future samples of the time series corresponding to the CPU utilization. Figure 3 illustrates the essential components of the proposed VTGAN architecture. The generator produces CPU traces that have a distribution similar to the original CPU traces. The discriminator, however, is responsible for classifying the input trace as either an actual CPU utilization trace or a predicted (i.e., artificially generated) trace. The generator and discriminator losses are added together and fed back to the generator so that it becomes better at generating CPU utilization traces that mimic the actual data statistics. This process continues until the discriminator is no longer able to differentiate actual CPU utilization data from generated data.

Fig. 3: The proposed VTGAN architecture

Recently, some researchers have built the generator and the discriminator from LSTM and CNN layers for better learning in several applications. GAN differs from other deep learning techniques in that it tries to strike a balance between the two sides (generator and discriminator) [85].

Figure 4 illustrates the proposed system using the GAN model. In this work, we use an RNN as the generator. Specifically, we employ one of the following recurrent neural networks for generating CPU traces: (i) LSTM or (ii) GRU. As described in Subsection "Deep learning approaches", an RNN can map generated data from the history of previous inputs; therefore, it is suitable for sequential data. For the discriminator, we utilize a multi-layer 1D-CNN. We choose a CNN for the discriminator because it can extract temporal features and information from series data. In the numerical results section, we compare the performance of the two RNNs and select the better generator network.

Fig. 4: The proposed VTGAN model
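Below is a minimal Keras sketch of these two building blocks, a stacked LSTM/GRU generator and a 1D-CNN discriminator. The layer widths, kernel sizes, and LeakyReLU slope are placeholder assumptions; the configuration actually used is listed in Table 4, and the discriminator here simply scores a raw utilization sequence as real or generated.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_generator(window: int, n_features: int, horizon: int = 1, cell: str = "lstm") -> tf.keras.Model:
    """Stacked LSTM/GRU generator: a window of past samples -> future utilization value(s)."""
    RNN = layers.LSTM if cell == "lstm" else layers.GRU
    inp = layers.Input(shape=(window, n_features))
    x = RNN(64, return_sequences=True)(inp)      # stacked recurrent layers
    x = RNN(32)(x)
    out = layers.Dense(horizon)(x)               # one-step or p-step-ahead prediction
    return models.Model(inp, out, name=f"generator_{cell}")

def build_discriminator(seq_len: int) -> tf.keras.Model:
    """1D-CNN discriminator: utilization sequence -> probability of being an actual trace."""
    inp = layers.Input(shape=(seq_len, 1))
    x = layers.Conv1D(32, kernel_size=3, padding="same")(inp)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv1D(64, kernel_size=3, padding="same")(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return models.Model(inp, out, name="discriminator_cnn")
```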

Regression and classification approaches

Generally, the main goal of forecasting CPU utilization as a time-series forecasting problem is to estimate the value of the next time slot. In this work, we focus on CPU utilization value prediction (a regression problem) and on the trend direction of CPU utilization (a trend classification problem).

A mandatory preliminary step in this approach is to build a dataset suited to a classification problem. Next, we associate each past observation from the time series with a symbolic label describing the predicted trend (i.e., we label the trend as upward or downward).

Consequently, we split the dataset into sub-sequences using the sliding window technique as input for our models. This technique selects every n consecutive samples as inputs and the \((n+1)\)th sample as the output: its value for regression, or its symbolic label for trend classification, in one-step prediction.
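A sketch of this windowing step is shown below, matching the structure of Eqs. (8), (9), (13), and (14). The function and variable names are illustrative, and ties (no change) are labeled as "down" here.

```python
import numpy as np

def make_windows(series: np.ndarray, n: int, p: int = 1, task: str = "regression"):
    """Build sliding-window inputs plus value targets (regression) or up/down labels (classification)."""
    X, y = [], []
    for k in range(len(series) - n - p + 1):
        window = series[k : k + n]                         # n past samples
        future = series[k + n : k + n + p]                 # next p samples
        X.append(window)
        if task == "regression":
            y.append(future)                               # value(s) to predict
        else:
            prev = np.concatenate(([window[-1]], future[:-1]))
            y.append((future > prev).astype(int))          # 1 = up, 0 = down (ties -> down)
    return np.asarray(X), np.asarray(y)
```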

Value regression approach

In this approach, we only focus on predicting the value of CPU utilization and not its trend direction. The CPU utilization value prediction problem has been the traditional approach for proactive resource management in cloud data centers [85]. We use the sliding window technique. In this technique, we use the last n samples as an input to our regression technique, i.e., the VTGAN model, to predict future samples. We consider two versions of our scheme, namely, one-step-ahead prediction and p-step-ahead prediction. In the one-step-ahead version, the regression procedure aims to predict the immediate future sample (i.e., one sample only as an output). This is in contrast to the p-step-ahead version, where the regression procedure outputs p future samples.

More specifically, let the input \(I_{reg}\) be the CPU utilization time-series samples. The kth row of \(I_{reg}\) contains n actual data points (actual CPU utilization), namely, \(\{i_k, i_{k+1}, \cdots , i_{n+k-1}\}\), where \(k=1,2, \cdots , l-n\). We denote the corresponding output by \(O_{reg}\), which contains the predicted value(s). The kth row of \(O_{reg}\) is the predicted CPU utilization at the \((n+k)\)th time slot, \(\hat{i}_{n+k}\), for one-step-ahead prediction, and the predicted values \(\{\hat{i}_{n+k},\hat{i}_{n+k+1}, \cdots , \hat{i}_{n+k+p-1}\}\) for p-step-ahead prediction, as shown in Eqs. (8) and (9), respectively.

$$\begin{aligned} I_{reg} = \left( \begin{array}{cccc} i_{1} & i_{2} & \dots & i_{n} \\ i_{2} & i_{3} & \dots & i_{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{r} & i_{r+1} & \dots & i_{n+r-1} \end{array}\right) , \qquad O_{reg} = \left( \begin{array}{c} \hat{i}_{n+1}\\ \hat{i}_{n+2}\\ \vdots \\ \hat{i}_{n+r} \end{array}\right) , \qquad r=l-n+1 \end{aligned}$$
(8)
$$\begin{aligned} I_{reg} = \left( \begin{array}{cccc} i_{1} & i_{2} & \dots & i_{n} \\ i_{2} & i_{3} & \dots & i_{n+1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{r} & i_{r+1} & \dots & i_{n+r-1} \end{array}\right) , \qquad O_{reg} = \left( \begin{array}{cccc} \hat{i}_{n+1} & \hat{i}_{n+2} & \dots & \hat{i}_{n+p}\\ \hat{i}_{n+2} & \hat{i}_{n+3} & \dots & \hat{i}_{n+p+1}\\ \vdots & \vdots & \ddots & \vdots \\ \hat{i}_{n+r} & \hat{i}_{n+r+1} & \dots & \hat{i}_{n+r+p-1} \end{array}\right) , \qquad r=l-n-p+1 \end{aligned}$$
(9)

where \(i_j\) denotes the actual CPU utilization at time slot j, \(\hat{i}_j\) denotes the predicted CPU utilization at time slot j, n is the sliding window length, and l is the input sequence length.

Trend classification: 2-class approach

In this section, we describe our proposed algorithm for forecasting the trend of CPU utilization. In this case, we classify the direction of the change of the future CPU utilization, whether it is upward or downward. An upward trend of CPU utilization implies that we predict the future CPU utilization to be higher than the current CPU utilization. A downward trend, however, entails that the future CPU utilization is lower than the current CPU utilization. In many practical applications, it is more important to know the trend of the workload rather than its actual value (e.g., in stock prediction).

Specifically, this approach predicts the CPU utilization trend based on two classes: (i) upward and (ii) downward. The movement of each time slot is associated with a label in the set \(L=\{up, down\}\), which is determined by comparing the current CPU utilization value to that of the previous time slot. We obtain the class \(L_m\) at the mth time slot as follows:

Upward class:

$$\begin{aligned} \hat{i}_m - i_{m-1} > 0 \Rightarrow L_m=up \end{aligned}$$
(10)

Downward class:

$$\begin{aligned} \hat{i}_m - i_{m-1} < 0 \Rightarrow L_m=down \end{aligned}$$
(11)

where \(i_{m-1}\) is the sample of a time series representing the actual value of the CPU utilization at the \((m-1)\)th time slot, and \(\hat{i}_m\) is the predicted future sample at the mth time slot.

Similar to the CPU utilization value prediction problem, in this approach we use the sliding window technique in the training procedure to predict the next output trend. We perform the trend prediction in either a one-step-ahead or a p-step-ahead fashion. The trend prediction of the kth time slot can be calculated based on W past observations of the CPU utilization values. We obtain this prediction using the so-called embedding technique (i.e., a numeric input vector represents a word), by which the vector \(I_k\) of past samples is defined as:

$$\begin{aligned} I_k=\left( \begin{array}{ccccc} i_{k-W+1}&i_{k-W+2}&\dots&i_{k-1}&i_{k} \end{array}\right) \end{aligned}$$
(12)

where W denotes the window size, i.e., the number of data points used to obtain a prediction.

The trend classifier aims at finding a function \(f(\cdot )\) that maps the CPU utilization vector \(I_k\) into a binary decision \(L_{k+1}=\{up,down\}\), i.e., \(L_{k+1} = f(I_k)\), where \(L_{k+1}\) denotes the predicted trend label at the \((k+1)\)th time slot. As CPU utilization time series usually have complex behavior, we propose to employ the VTGAN as a classifier (i.e., for identifying upward or downward trends). Consequently, we capture the non-linear and non-stationary behavior of time series by learning the ML model parameters using data-driven techniques. The input \(I_{class}\) is the CPU utilization time-series samples. Each row of \(I_{class}\) corresponds to a window of W samples. We organize the samples in a sliding window fashion as in the regression model. The corresponding output \(O_{class}\) represents the predicted class value(s), as shown in Eqs. (13) and (14) for one-step-ahead and p-step-ahead prediction, respectively.

$$\begin{aligned} I_{class} = \left( \begin{array}{cccc} i_{1} & i_{2} & \dots & i_{W} \\ i_{2} & i_{3} & \dots & i_{W+1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{r} & i_{r+1} & \dots & i_{W+r-1} \end{array}\right) , \qquad O_{class} = \left( \begin{array}{c} L_{W+1}\\ L_{W+2}\\ \vdots \\ L_{W+r} \end{array}\right) , \qquad r=l-W+1 \end{aligned}$$
(13)
$$\begin{aligned} I_{class} = \left( \begin{array}{cccc} i_{1} & i_{2} & \dots & i_{W} \\ i_{2} & i_{3} & \dots & i_{W+1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{r} & i_{r+1} & \dots & i_{W+r-1} \end{array}\right) , \qquad O_{class} = \left( \begin{array}{cccc} L_{W+1} & L_{W+2} & \dots & L_{W+p}\\ L_{W+2} & L_{W+3} & \dots & L_{W+p+1}\\ \vdots & \vdots & \ddots & \vdots \\ L_{W+r} & L_{W+r+1} & \dots & L_{W+r+p-1} \end{array}\right) , \qquad r=l-W-p+1 \end{aligned}$$
(14)

For instance, Fig. 5 illustrates a label association example using a three-sample window (W=3). The embedded vector at the 5th time slot is as follows:

$$\begin{aligned} I_5= \left( \begin{array}{ccc} 55&52&41 \end{array}\right) \end{aligned}$$
(15)

The relative variation from time slot 5 to time slot 6 is:

$$\begin{aligned} 22-41=-19 < 0 , \end{aligned}$$
(16)

and so, the trend label of time slot 6 is \(L_{6} = down\).
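In code, the same worked example (using the sample values shown above) reads:

```python
# Labeling rule with W = 3: the embedded vector at time slot 5 is (55, 52, 41)
# and the next observed value is 22, so the label of time slot 6 is "down".
I_5 = [55, 52, 41]                # Eq. (15): samples at slots 3, 4, 5
next_value = 22
delta = next_value - I_5[-1]      # Eq. (16): 22 - 41 = -19 < 0
L_6 = "up" if delta > 0 else "down"
print(I_5, delta, L_6)            # [55, 52, 41] -19 down
```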

Fig. 5: Label assignment example for the trend classification approach

Experimental configuration and evaluation methodology

This section describes the experimental setting used to assess our proposed prediction models. Our evaluation includes one-step-ahead and p-step-ahead results. We limit the number of prediction steps p to 5 (specifically, we consider \(p=1, 3, 5\)). For \(p>5\), the prediction accuracy diminishes, so the prediction outcomes would be less beneficial in practical applications. We compare the accuracy of our proposed VTGAN models against ARIMA, SVR, LSTM, and GRU benchmarks, which appear in the most recent related works.

Dataset

In our experimental study, we use the PlanetLab traces [60]. These traces contain CPU utilization values collected every five minutes from more than 500 places around the world [4]. We show a visual representation of the behavior in Fig. 6, where six days are considered. In particular, CPU utilization values are the inputs used to predict the value and label of the next time slot. In all experiments, we use \(80\%\) of the workload data for training the model and predict the remaining data.
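A hedged loading and splitting sketch is shown below. The path is a placeholder, and the assumed file layout (one utilization percentage per line, one value per 5-minute slot) follows the PlanetLab traces distributed with CloudSim.

```python
import numpy as np

def load_trace(path: str, train_ratio: float = 0.8):
    """Load one VM trace and split it chronologically into 80% train / 20% test."""
    utilization = np.loadtxt(path)                 # one value per 5-minute slot (assumed layout)
    split = int(train_ratio * len(utilization))
    return utilization[:split], utilization[split:]

# train, test = load_trace("planetlab/20110303/<vm_trace_file>")   # placeholder path
```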

Fig. 6: Six days of CPU utilization data

Performance evaluation metrics

We investigate various accuracy metrics to evaluate the proposed VTGAN algorithm. For the CPU utilization value prediction problem, we study the RMSE, MAPE, Theil's coefficient, ARV, POCID, and \(R^2\) coefficient as prediction accuracy metrics (equivalently, as measures of the prediction error). We summarize the formal definitions of the aforestated metrics in Table 2. For the CPU utilization trend classification problem, we consider the precision, the recall, and the \(F_1\) score as classification accuracy metrics. We summarize the formal definitions of the classification accuracy metrics in Table 3. In addition, we use the confusion matrix as a visual evaluation reflecting the classifier's recognition ability for each class. We show the confusion matrix in terms of a 2-class approach (upward and downward) for the trend classification problem, while we use 10 quantized classes for the regression problem. Specifically, we quantize the CPU utilization percentage into 10 classes (in steps of \(10\%\)). Hence, we have classes \(0, 1, 2, \cdots , 9\) representing the CPU utilization percentages of \(> 90\%\), \(80-90\%\), \(70-80\%\), \(\cdots\), \(0-10\%\).

Table 2 Selected regression evaluation metrics, their formulas, and symbols
Table 3 Selected classification evaluation metrics and their formulas

We select RMSE, MAPE, MAE, and ARV as regression evaluation metrics to measure the deviation between the predicted and actual values. In all these metrics, taking the absolute (or squared) value of the error prevents positive and negative errors from canceling each other out. The MAPE metric, in particular, has the added benefit of allowing prediction accuracy comparison of time series with different value scaling.

Theil's coefficient is a relative accuracy measure that compares the predicted results with the actual values, giving more weight to large errors by squaring the deviations. Its acceptable range is from 0 (corresponding to no forecasting error) to 1 (corresponding to no predictive ability); values above 1 indicate predictions worse than guessing [80, 83].

POCID measures the capability of predicting whether future values will increase or decrease. It is superior to MAPE in that it measures the prediction accuracy based on the change direction. Therefore, it is a powerful metric during the decision-making stage. A POCID value closer to 100 is better [11].

\(R^2\) is the coefficient of determination, which measures how closely the values fit the regression line. An \(R^2\) value equal to 1 means that the model perfectly explains all variability; therefore, an \(R^2\) value closer to 1 is better [11].

For the classification problem, we evaluate the accuracy of the proposed model using the precision, the recall, and the \(F_1\) score.
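The sketch below implements the regression metrics following their common definitions (the exact formulas adopted here are the ones listed in Table 2, which may differ in detail), together with the 10-class quantization used for the regression confusion matrices; the boundary handling in the quantizer and the naive-forecast form of Theil's coefficient are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Common definitions of the regression metrics discussed above."""
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err / y_true))          # assumes y_true != 0
    # Theil's coefficient relative to a naive "no-change" forecast (values > 1: worse than naive)
    naive_err = y_true[1:] - y_true[:-1]
    theil = np.sqrt(np.mean(err[1:] ** 2)) / np.sqrt(np.mean(naive_err ** 2))
    # POCID: percentage of correctly predicted change directions
    d_true = y_true[1:] - y_true[:-1]
    d_pred = y_pred[1:] - y_true[:-1]
    pocid = 100.0 * np.mean(d_true * d_pred > 0)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "Theil": theil, "POCID": pocid, "R2": r2}

def utilization_class(u: float) -> int:
    """Map a CPU utilization percentage to classes 0..9 (class 0: >90%, class 9: 0-10%)."""
    return 9 - min(int(u // 10), 9)
```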

Experiment configuration

We perform all experiments on an Intel Xeon Gold 6248 processor with a 2.5 GHz clock speed, 128 GB of memory, and a Tesla V100 GPU with 32 GB of RAM. We implement all deep learning models using the Keras framework with a TensorFlow backend and CuDNN kernels. Table 4 illustrates the architecture of the proposed models.

For the training phase, we set the batch size and the maximum number of epochs to 32 and 3000, respectively. For the hybrid CNN-LSTM/CNN-GRU and stacked LSTM/GRU models, the early stopping technique is used with a 20% validation rate. This technique finds the best point to halt the optimizer (Root Mean Squared Propagation, RMSprop) once the model performance stops improving [53]. We configure the stacked LSTM/GRU network structures as the generator configurations of the VTGAN models. Also, the loss function for the generator is the mean squared error, chosen by trial and error. We run each model three times and then calculate the average and the standard deviation.
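A minimal training-configuration sketch consistent with this description is given below. The early-stopping patience is an assumption, and `model`, `X_train`, and `y_train` stand for any of the compared Keras models and their windowed data.

```python
import tensorflow as tf

def train(model: tf.keras.Model, X_train, y_train):
    """Train with RMSprop, MSE loss, batch size 32, up to 3000 epochs, and early stopping."""
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=20, restore_best_weights=True)  # patience value is an assumption
    model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss="mse")
    return model.fit(
        X_train, y_train,
        validation_split=0.2,   # 20% validation rate
        batch_size=32,
        epochs=3000,
        callbacks=[early_stop],
        verbose=0)
```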

Table 4 The structure of VTGAN models

Results and discussions

This section presents the regression and classification accuracy results of the proposed VTGAN models. Subsections “One-step-ahead regression and classification accuracy results”, “Regression and classification accuracy results using technical indicators”, and “Multistep-ahead regression and classification accuracy results for different sliding window size” show the experimental results of the proposed algorithm compared to traditional models in recent publications such as CNN-LSTM/CNN-GRU and stacked-LSTM/GRU. Also, Section “Bitbrains dataset comparison” illustrates an additional evaluation study with another real cloud dataset (Bitbrains).

One-step-ahead regression and classification accuracy results

In this section, we assess the performance of the VTGAN models for one-step-ahead regression and classification. We optimize the window size such that it achieves maximum accuracy. Tables 5 and 6 illustrate the overall accuracy performance of the VTGAN models compared to other models for the regression and classification approaches, respectively. In addition, these tables show the optimal values for window size, stopped training epochs, and training time for the best-observed performance of each model. In all tables, the best-observed model in each approach is in bold.

Table 5 Comparison of regression results
Table 6 Comparison of classification results

As we can see from the experimental results, the VTGAN (LSTM-based) model is superior to all other prediction models, for both the regression and classification approaches, regarding all performance metrics presented in Section "Performance evaluation metrics". The stacked LSTM model performs the worst among all DL techniques. Nevertheless, its results remain acceptable since its Theil value does not exceed one. Although the SVR model achieves a higher POCID value, it does not exceed the maximum value of VTGAN (LSTM-based) after adding the standard deviation.

Focusing on the sliding window size (from Tables 5 and 6, \(W=3\), which is equivalent to 15 minutes), the VTGAN models achieve higher performance with small sliding window sizes, whether using LSTM or GRU as a generator. This result agrees with the observation that small window sizes are more suitable for drifting data such as cloud workloads, while larger window sizes are more appropriate for noisy data [78]. Nevertheless, since the LSTM and GRU techniques capture long-term dependencies [19, 42], the regression and classification accuracy of the LSTM/GRU models improves with longer window sizes relative to the VTGAN models.

Hybrid and deep-learning-based models are usually more complex and require more computation for model training. Nevertheless, for all tested models, the training time is acceptable for data center resource management applications because overload/underload detection processes often occur every 5 minutes, as in [12, 33]. As shown in Tables 5 and 6, the CNN-GRU model achieves the lowest training time and number of epochs for both the regression and classification approaches (see the underlined values in Tables 5 and 6).

We note that the complexity difference between models is a consequence of using the early stopping technique. Also, Tables 5 and 6 show that the GRU-based models record lower training times and fewer epochs than the LSTM-based models. This observation is consistent with the fact that GRU-based models are much less computationally intensive, due to their ability to converge with fewer parameters [20]. However, the prediction accuracy of the VTGAN (LSTM-based) model is superior to that of the VTGAN (GRU-based) model among all tested models.

Figures 7 and 8 illustrate the confusion matrices of all models for the regression and classification results, respectively; we use this comparison to visually examine the behavior of the VTGAN models relative to the others. Also, Fig. 9 illustrates a part of the actual CPU utilization compared to the values predicted by all models. The interval length is 5 minutes.

Fig. 7: Confusion matrices for the regression approach. Classes \(0, 1, 2, \cdots , 9\) represent the CPU utilization percentages of \(> 90\%\), \(80-90\%\), \(70-80\%\), \(\cdots\), \(0-10\%\), respectively

Fig. 8: Confusion matrices for the classification approach

Fig. 9: Actual and predicted CPU utilization values for the regression approach

The confusion matrix results of the regression models in Fig. 7 illustrate the predictive capability within every CPU utilization interval. Figure 7 shows that the VTGAN (LSTM-based) model is superior in overall prediction accuracy and achieves accurate prediction in every CPU utilization range. In contrast, the prediction accuracy of the other models, particularly the ARIMA, SVR, and CNN-LSTM models, degrades for very low or very high CPU utilization values, as shown in Fig. 9.

The confusion matrix results of the classification models in Fig. 8 signify the classification accuracy for predicting upward or downward trends. As shown in Fig. 8, the VTGAN (LSTM-based) model achieves the best performance, followed by the VTGAN (GRU-based) and stacked GRU models, which record slightly lower accuracy. The strength of the classification approach is that it is easy to make direct decisions based on the classifier results. For instance, we can flag a server as overloaded if its CPU utilization exceeds a specific threshold and the predicted trend is upward. This solution will reduce unnecessary migrations in resource management frameworks. In particular, the false downward detection probability with the VTGAN (LSTM-based) model is low (\(\approx 4\%\)).

Regression and classification accuracy results using technical indicators

This section analyzes the impact of adding technical indicators (TIs) to the feature set of our workload traces. Repeating the experiments of Section "One-step-ahead regression and classification accuracy results", Tables 7 and 8 illustrate the overall accuracy performance of the VTGAN models using the TI strategy compared to other models for the regression and classification approaches, respectively.

Table 7 Comparison of regression results
Table 8 Comparison of classification results

In general, adding the TIs diminishes the regression and classification performance of all tested models for one-step-ahead prediction. This result could be due to over-fitting caused by adding dependent features. Nevertheless, the VTGAN models are still the superior models for the regression and classification approaches.

The VTGAN (GRU-based) model outperforms the other models (bold results). In contrast, the CNN-LSTM/GRU models perform the worst. In this case, the regression becomes useless, as the Theil values of these models exceed one, as shown in Table 7.

Figures 10 and 11 compare the confusion matrices of all models using the TI strategy, to visually examine the behavior of the VTGAN models relative to the others.

Fig. 10: Confusion matrices for the regression approach using TIs. Classes \(0, 1, 2, \cdots , 9\) represent the CPU utilization percentages of \(> 90\%\), \(80-90\%\), \(70-80\%\), \(\cdots\), \(0-10\%\), respectively

Fig. 11: Confusion matrices for the classification approach using TIs

Focusing on the training speed of the models, we note that the only benefit of using the TI strategy for one-step-ahead prediction is faster training. Specifically, the training time and the number of epochs decrease for the CNN-LSTM/GRU and stacked LSTM/GRU models, for both the regression and classification approaches, compared to the results in Subsection "One-step-ahead regression and classification accuracy results". For instance, the training epochs and time decrease from 576 and 74.8 seconds in Table 5 to 235 and 35 seconds using the TI strategy for the CNN-LSTM model in Table 7.

Multistep-ahead regression and classification accuracy results for different sliding window size

This section studies the performance of multi-step-ahead prediction. We also assess the effect of changing the sliding window size and of adding TI features to the input of the prediction algorithm. The following subsections analyze the impact of the sliding window size, multi-step-ahead prediction, and the TI strategy, respectively.

Sliding window size analysis

This section analyzes the effect of changing the sliding window size. Figures 12 and 13 illustrate the MAPE and \(F_1\) score values against the sliding window size for all tested models. The sub-figures in each row correspond to the step-ahead size (\(p=1, 3, 5\)), and the second column shows the results after adding the TI indicators.

Fig. 12: MAPE values using different window-size inputs for one-step and multi-step-ahead prediction

Fig. 13: \(F_1\) score values using different window-size inputs for one-step and multi-step-ahead prediction

Figures 12 and 13 show that the VTGAN models' performance declines significantly when the sliding window size increases. In contrast, the performance of the other models oscillates within a reasonable range. Fortunately, the accuracy of the VTGAN models outperforms the other models with small window sizes. This is a considerable benefit when running our model in a real-time resource management framework as in [33]: as soon as the model collects three CPU utilization data points (i.e., within 15 minutes), it can successfully predict future samples.

Technical indicators effect on multi-step-ahead prediction

This section analyzes the impact of using TIs for all tested scenarios with different sliding windows and step-ahead sizes. Figures 14 and 15 illustrate MAPE and \(F_1\) score values, respectively. Solid and striped bars represent the pure models and models using the TIs, respectively, with various sliding window sizes (3, 5, 10, 15, and 20) and step-ahead sizes (\(p=1, 3, 5\)).

Fig. 14: MAPE values for different window sizes and step-ahead values

Fig. 15: \(F_1\) score values for different window sizes and step-ahead values

In general, all models fail to maintain their performance as the prediction step size increases, for all tested configurations, as shown by the solid bars in Figs. 14 and 15. This result agrees with the findings in [61, 85], which confirmed that most deep-learning and hybrid models perform poorly in long-term prediction. That is because of the complexity and non-linearity of CPU utilization data, which makes the models difficult to fit over long horizons.

Regarding one-step-ahead prediction, the use of the TI strategy negatively affects the regression and classification performance, except for the VTGAN (LSTM-based) model. It achieves a significant improvement for a window size of 10 (Figs. 14(g) and 15(g)) and a slight improvement in regression performance for window sizes of 15 and 20 (Fig. 14(j) and (m)).

Regarding multi-step-ahead regression, the use of the TI strategy achieves a significant improvement with stacked LSTM/GRU models (Fig. 14(columns 2 and 3)).

Regarding multi-step-ahead classification, the use of the TI strategy achieves a slight improvement with the stacked LSTM model and most CNN-LSTM/GRU models (Fig. 15(columns 2 and 3)).

Table 9 lists the best configurations based on the number of prediction steps for the regression and classification approaches. Service providers can choose the model and adjust the configuration based on the required number of prediction steps. For one-step-ahead prediction, the VTGAN (LSTM-based) model outperforms the other models with a window size of 3 (15 minutes) for both regression and classification. For multi-step-ahead prediction, the stacked LSTM/GRU and CNN-LSTM models with TIs outperform the other models for the regression and classification approaches, respectively.

Table 9 Best configuration based on the number of step-ahead prediction sizes

In general, the TI strategy is powerful for long-term prediction with some models. Unfortunately, this is not suitable for real-time resource management frameworks in cloud data centers, possibly because adding dependent features leads to over-fitting. Nevertheless, this issue is promising to investigate and could be improved using ensemble and hybrid strategies as in [86].

Bitbrains dataset comparison

To confirm the performance evaluation of the proposed models, we perform experiments using another real cloud dataset, namely Bitbrains [72]. This dataset is published online in the Grid Workloads Archive [10]. It is a large-scale, long-term trace of real data, spanning 5,446,811 CPU hours over 1750 VMs, with 23,214 GB of memory and 5,501 cores. For comparison purposes, we perform the same preprocessing steps as [44]. Then, we evaluate our proposed models against the models of the authors in [44] for the regression approach only, as trend classification is a novel approach in the field of cloud workload forecasting.

Table 10 shows the MAPE of CPU utilization prediction using the same variable values as in [44], such as window size and train/test ratio. Also, Table 11 shows the lowest MAPE value for each model with the optimal window size and split ratio, obtained from all combinations shown in Table 10.

We can see that our proposed models achieve the highest prediction accuracy compared to the state-of-the-art prediction models in [44]. The lowest MAPE is obtained by our VTGAN (GRU-based) model for a window size of 60 and a split ratio of 80:20. The split ratio remains the same for our VTGAN (LSTM-based) model, but the history window size changes to 30.

Table 10 Prediction performance of Bitbrains dataset for the proposed models compared to the models in [44]
Table 11 Summary of lowest MAPE values of our proposed models compared to the models in [44]

Table 12 shows the percentage improvement or degradation of our proposed models compared to the state-of-the-art prediction models. We calculate it, as in [44], using Eq. (17), where \(Y_p\) and \(Y_c\) denote the MAPE values of our proposed model and of the compared model, respectively. We only take into consideration the best combination of window size and split ratio for each model.

$$\begin{aligned} X_c=\frac{(Y_c - Y_p)*100}{Y_p} \end{aligned}$$
(17)
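For clarity, Eq. (17) in code; the example call in the comment uses placeholder names, not results from the tables:

```python
def mape_increase(mape_compared: float, mape_proposed: float) -> float:
    """Percentage increase of a compared model's MAPE with respect to ours, Eq. (17)."""
    return (mape_compared - mape_proposed) * 100.0 / mape_proposed

# e.g., mape_increase(Y_c, Y_p) for each baseline row in Table 12
```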
Table 12 MAPE percentage increase/decrease of the compared models in [44] with respect to our proposed model

For this comparison study, we use ARIMA, LSTM, GRU, Bi-LSTM, and BHyPreC as the baseline models. A positive percentage denotes the percentage increase in the MAPE value of the compared model with respect to our proposed models. We clearly see that the MAPE value increases for all models compared to our proposed models.

As we can see, our proposed models considerably reduce the MAPE in predicting CPU utilization. Therefore, our models are not only superior to the classical model (ARIMA) but also perform much better than the other deep learning approaches presented in this paper.

Conclusions and future works

In recent years, the workload prediction process has become a key stage towards efficient resource allocation and management approaches in cloud computing environments. Due to the non-linearity of cloud workloads, this issue faces enormous challenges. Therefore, this paper proposes a novel direction in the cloud workload prediction field by considering the future movement direction in a modern classification structure. In addition, it presents novel VTGAN models, which are based on a GAN network with stacked LSTM or GRU as a generator and 1D CNN as a discriminator. The main benefit of VTGAN models is their ability to deal effectively with long-term nonlinear dependencies of cloud workloads.

In this paper, we study the proposed models under different configurations on a highly volatile real cloud workload trace. We also present the impact of tuning the sliding window size and of the multi-step-ahead strategy. In addition, we study the use of technical indicators, Fourier transforms, and wavelet transforms to increase the number of input features. We apply all of these studies to the VTGAN models in comparison with the stacked LSTM/GRU and CNN-LSTM/GRU models.

The experimental results demonstrate that the VTGAN models are superior for cloud workload prediction, whether using LSTM or GRU as a generator. The results also illustrate the effectiveness of transforming the problem into classifying the trend instead of predicting the value of the future workload for all tested models. Significantly, the upward classification accuracy reaches 96.6%. Proactive overload detection is a critical stage in resource management techniques, as it prevents unnecessary migrations that violate the service level agreement for end users. The results are not promising regarding the multi-step-ahead prediction and technical indicator strategies. Thus, one-step-ahead prediction is more suitable for a real-time cloud environment. In addition, the technical indicator approach may be extended further by proposing a solution to optimize the prediction and classification error.

As an additional suggestion for future work, a dynamic scaling method can be applied rather than setting a fixed value to improve the prediction and classification accuracy. Another future direction is to implement these prediction models in an actual resource management framework for cloud data centers through the CloudSim simulation tool, to evaluate the proposed models in a large-scale simulated cloud environment. Hence, the resource allocation decision will be based on the trend. In addition, we will extend the classification approach so that the CPU utilization trend is predicted based on three classes: (i) upward trend, (ii) hold, and (iii) downward trend.

As a further promising direction for future research, our contribution opens research areas concerning next-generation computing, such as Edge AI [75]. In particular, a hybrid solution could be presented by processing real-time applications on edge devices and training models on the cloud [50, 65]. Our trend classification approach could be helpful in this edge-to-cloud integration by offloading the training process to the cloud and allocating it to the best host, depending on the future workload of the servers. This approach could be considered and implemented in most resource allocation frameworks, such as mobile edge computing and fog computing platforms for Internet of Things (IoT) purposes [49]. Such an approach increases computational performance and reduces the total energy consumed and the processing times for mobile or edge devices. Moreover, edge computational resources suffer from QoS degradation due to overloading and inconsistency. Therefore, an intelligent proactive workload management framework could be presented to guarantee load balancing between edge resources using our classification approach.