1 Introduction

With the development of cloud computing technology, most enterprises move their services into the cloud to achieve better performance and security. To ensure the reliability and stability of the cloud platform, operators collect a large amount of monitoring data from different levels of the platform, forming real-time KPI (Key Performance Indicator) curves that describe the running state of the platform's key components, and use an anomaly detection model that analyses historical KPI data to build a prediction model of the KPI curve under normal conditions. In practice, an enterprise adjusts its marketing strategy according to the market, which changes the data characteristics of the KPI curves monitored in the cloud; this causes the anomaly detection model to raise a large number of false alarms and produce an alarm storm.

To solve the above problem, this paper proposes an intelligent KPI data anomaly detection strategy. First, differencing and the autocorrelation function are used to construct a perceptron for changes in KPI data characteristics. Then, to handle such changes, an intelligent regulator of the time series model is built on reinforcement learning, removing the dependence on manual work and labeled data. Finally, three typical KPI data sets from a real environment are tested. The experimental results show that the proposed method can adapt to the constant change of KPI data characteristics in the cloud environment, accurately judge the characteristics of the KPI curve, and intelligently adjust the corresponding anomaly detection algorithm to ensure effective anomaly detection for the cloud platform.

The remainder of this paper is organized as follows. Section 2 introduces the core idea of this work. Section 3 describes the intelligent anomaly detection framework constructed in this paper. Section 4 presents comparative anomaly detection experiments on three typical KPI data sets, Section 5 discusses related work, and Section 6 concludes the paper and outlines future work.

2 The Core Idea

In the cloud environment, changes in the characteristics of the KPI data that describe the running state of system modules are uncontrollable. To realize automatic anomaly detection of KPI curves in the cloud environment, the data characteristics of the KPI curves must first be perceived. This paper compares the global trend of the KPI data with its local trend using differencing and the autocorrelation function, thereby perceiving whether the data characteristics of the KPI curve have changed. Different time series algorithms have different parameter ranges, so it is important to abstract the adjustment of different time series algorithms into a unified mode. We use reinforcement learning to transform the adjustment of different time series algorithms into the process of finding optimal parameters. Because reinforcement learning produces data by interacting with the environment, it removes the dependence on historical data and can quickly adjust the anomaly detection model. On this basis, an anomaly detection framework for intelligent operation and maintenance is constructed, as shown in Fig. 1.

Fig. 1. Anomaly detection framework

The specific modules are: (1) Data monitor: collects KPI data from different cloud products at different levels of the cloud environment and stores them persistently; (2) Perceptron: compares the global fluctuation trend of the KPI data with the local fluctuation trend using differencing and the autocorrelation function to judge whether the characteristics of the streaming KPI data have changed; (3) Adjuster: based on the interaction between reinforcement learning and the external environment, transforms the adjustment of different time series models into an automated Markov decision process, freeing the adjustment from manual participation and enabling self-healing recovery; (4) Decision maker: uses various time series algorithms to predict the KPI curve and determines whether the KPI curve is anomalous by comparing the relative deviation between the true value and the predicted value against a set threshold.
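As a minimal illustration, the interaction of these four modules could be wired together as in the following sketch; the class and method names (monitor, perceptron, adjuster, decision_maker and their methods) are hypothetical, and this is not the authors' implementation.

```python
def detection_loop(monitor, perceptron, adjuster, decision_maker, threshold=0.2):
    """One pass of the framework over newly collected KPI data (hypothetical interfaces)."""
    window = monitor.fetch_latest_window()           # (1) Data monitor: persisted KPI points
    if perceptron.characteristics_changed(window):   # (2) Perceptron: differencing + autocorrelation
        adjuster.readjust_model(window)              # (3) Adjuster: RL-driven parameter adjustment
    predicted = decision_maker.predict_next(window)  # (4) Decision maker: time series forecast
    actual = monitor.fetch_latest_point()
    relative_deviation = abs(actual - predicted) / max(abs(predicted), 1e-9)
    return relative_deviation > threshold            # True -> report an anomaly
```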

3 Automatic Anomaly Detection Method

3.1 KPI Data Feature Perception

KPI data are essentially continuous time series, and their characteristics fall into three types: periodic, stationary, and non-stationary. For the periodicity determination of a monitoring data set DS, this paper applies differencing to the global data and compares the global variance before and after differencing. If the monitoring data set is periodic, the global variance \( V_{g} \left( {DS} \right) \) before differencing will be far greater than the global variance \( V_{g}^{{\prime }} \left( {diff\left( {DS} \right)} \right) \) after differencing, so Formula 1 can be used to determine whether the monitoring data set is periodic.

$$ \frac{{V_{g} \left( {DS} \right)}}{{V_{g}^{{\prime }} \left( {diff\left( {DS} \right)} \right) + V_{g} \left( {DS} \right)}} \approx 1 $$
(1)
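As an illustration, the periodicity check of Formula 1 can be sketched as follows; this is a minimal NumPy example, and the tolerance eps is an assumed value, since the text only requires that the ratio approach 1 for periodic data.

```python
import numpy as np

def is_periodic(ds, eps=0.05):
    """Periodicity check based on Formula 1: for periodic data the variance of the
    raw series dominates the variance after differencing, so the ratio approaches 1.
    The tolerance eps is an assumption, not a value given in the paper."""
    v_g = np.var(ds)              # global variance before differencing
    v_diff = np.var(np.diff(ds))  # global variance after first-order differencing
    return v_g / (v_diff + v_g) > 1 - eps
```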

For the determination of stationarity, we calculate the autocorrelation function \( \widehat{{\rho_{k} }} \) of the monitoring data set, as shown in Formula 2,

$$ \widehat{{\rho_{k} }} = \frac{{\mathop \sum \nolimits_{t = 1}^{T - k} \left( {p_{t} - \overline{p} } \right)\left( {p_{t + k} - \overline{p} } \right)}}{{\mathop \sum \nolimits_{t = 1}^{T} \left( {p_{t} - \overline{p} } \right)^{2} }} $$
(2)

The autocorrelation function identifies whether the time series is stationary: if the autocorrelation of the KPI curve does not decay rapidly to 0 as the lag between time points increases, the KPI curve is non-stationary, and vice versa.
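A minimal sketch of this check, computing the sample autocorrelation of Formula 2 directly, is given below; the lag max_lag and the decay cutoff are assumed values, since the text only requires that the autocorrelation decay rapidly to 0.

```python
import numpy as np

def acf(p, k):
    """Sample autocorrelation at lag k (Formula 2)."""
    p = np.asarray(p, dtype=float)
    p_bar = p.mean()
    num = np.sum((p[:len(p) - k] - p_bar) * (p[k:] - p_bar))
    den = np.sum((p - p_bar) ** 2)
    return num / den

def is_stationary(p, max_lag=20, cutoff=0.2):
    """Treat the series as stationary if the autocorrelation has decayed close to 0
    within max_lag lags; max_lag and cutoff are assumptions, not values from the paper."""
    return abs(acf(p, max_lag)) < cutoff
```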

3.2 Automatic Adjustment of Time Series Model

The Q-Learning [11] algorithm is the main method for solving model-free reinforcement learning problems. Its basic idea is to record, in a function table, the utility value of each action in each state, that is, the action-state value. The action-state value represents the validity and value of selecting that action in the current state and also serves as the basis for the next strategy's action selection; the action-state value of the current state is updated through the action-state value of the next state, as shown in Fig. 2 (the data in the diagram are for demonstration only):

Fig. 2. \( Q\left( {s,a} \right) \) function

Figure 2(a) shows the initial value of the function table \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \). Within one strategy, an action with a non-negative value is selected at random in the initial state \( s_{0} \); in Fig. 2(b), \( a_{2} \) is selected, so the state becomes \( s_{1} \) and the utility value is updated to 0.1 by Formula 3, where \( r \) is the immediate reward given by the reward function and \( Q\left( {s_{t + 1} ,a_{t + 1} } \right) \) is the utility value of the next state (0 in Fig. 2). The function table is updated at the end of a strategy, as shown in Fig. 2(c).

$$ Q\left( {s_{t} ,a_{t} } \right) = r + \gamma \left( {{ \hbox{max} }\left( {Q\left( {s_{t + 1} ,a_{t + 1} } \right)} \right)} \right) $$
(3)

In the action selection process, the Q-Learning algorithm chooses among the actions whose entries in the \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \) function table are non-negative. Taking Fig. 2(c) as the basis for the next policy, the optional actions at \( {\text{s}}_{0} \) are \( \left\{ {{\text{a}}_{1} ,{\text{a}}_{2} ,{\text{a}}_{3} ,{\text{a}}_{4} } \right\} \) and the optional actions at \( {\text{s}}_{1} \) are \( \left\{ {{\text{a}}_{2} ,{\text{a}}_{3} ,{\text{a}}_{4} } \right\} \). Each executed strategy updates the \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \) function table until it converges as in Fig. 3. The optimal strategy is then the one with the maximum cumulative return, i.e., the action with the maximum utility value is selected for each state; for example, the optimal strategy in Fig. 3 is the sequence \( \uptau = \left\{ {a_{4} ,a_{2} ,a_{4} ,a_{1} ,a_{1} } \right\} \).
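The table update of Formula 3 and the restriction to non-negative entries can be sketched as follows; this is a minimal tabular Q-Learning fragment in which the state/action encoding and the discount factor are assumptions.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed value, not given in the paper)

def q_update(Q, s, a, r, s_next):
    """Formula 3: Q(s_t, a_t) = r + gamma * max_a' Q(s_{t+1}, a')."""
    Q[s, a] = r + GAMMA * np.max(Q[s_next])

def select_action(Q, s, rng):
    """Choose randomly among the actions whose utility is still non-negative,
    as described for the function table in Fig. 2."""
    candidates = np.flatnonzero(Q[s] >= 0)
    return int(rng.choice(candidates))
```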

Fig. 3. Convergent function table \( Q\left( {s,a} \right) \)

In the Q-Learning algorithm, the reward function is static: it gives a reward or a penalty depending on whether the current action makes the model state better than the initial state or better than the previous state. However, a situation like state \( s_{1} \) in Fig. 3 then arises: while the function table has not yet converged, there are always multiple actions to choose from in this state, and most of them do not improve the current model, which introduces many invalid iteration steps into the search for the optimal strategy. We want to reduce, across the overall adjustment process, the steps that cannot move the model state to a better one, so that the function table converges faster. This article therefore sets the dynamic reward function as Formula 4:

$$ R = \frac{{F_{t} }}{{F_{T} }} \cdot \left( {F_{t} - F_{max} } \right) $$
(4)

\( F_{t} \) is the F-Score obtained by the anomaly detection model under the current parameter adjustment action, that is, the current state value, and \( F_{T} \) is the target state value. The factor \( \frac{{F_{t} }}{{F_{T} }} \) makes the reward larger as the current state value gets closer to the target value. \( F_{max} \) is the maximum state value reached during the execution of a policy; the term \( F_{t} - F_{max} \) ensures that a reward is given only when the adjustment action yields a better state value than any previous one, and a penalty otherwise, which helps a strategy reach the optimal state faster. Based on the above, the pseudo code for obtaining the best policy with the Q-Learning algorithm is shown in Table 1:

Table 1. Optimal strategy seeking algorithm
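As a rough illustration of how the dynamic reward of Formula 4 might drive the Q-Learning search for the best parameters, a sketch is given below; the environment interface (env.reset, env.step), the episode limit, and the state encoding are assumptions rather than the authors' implementation.

```python
import numpy as np

GAMMA = 0.9  # discount factor (assumed value)

def dynamic_reward(f_t, f_target, f_max):
    """Formula 4: R = (F_t / F_T) * (F_t - F_max)."""
    return (f_t / f_target) * (f_t - f_max)

def find_best_policy(env, n_states, n_actions, f_target, episodes=100, seed=0):
    """Q-Learning loop with the dynamic reward; env is a hypothetical wrapper around
    the anomaly detection model whose step() applies one parameter adjustment action
    and returns (next_state, current F-Score, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, f_max = env.reset()                      # initial state and its F-Score
        done = False
        while not done:
            a = int(rng.choice(np.flatnonzero(Q[s] >= 0)))
            s_next, f_t, done = env.step(a)
            Q[s, a] = dynamic_reward(f_t, f_target, f_max) + GAMMA * np.max(Q[s_next])
            f_max = max(f_max, f_t)
            s = s_next
    return Q.argmax(axis=1)                         # greedy adjustment action per state
```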

Using the algorithm above, we compare the action selection process with that of the static reward function, as shown in Fig. 4.

Fig. 4. \( Q\left( {s,a} \right) \) table comparison

When the state \( s_{1} \) is updated to \( s_{2} \) in Fig. 4(a), action \( a_{4} \) makes the model state better than the previous state, so under the static reward function it receives a reward even though the model as a whole is not optimized. Under the dynamic reward function presented in this paper, the same action receives a penalty, as shown in Fig. 4(b). In the next policy execution, four actions can be attempted in state \( s_{2} \) in table (a), while only three remain in table (b); therefore table (b) reaches the next state faster, and this advantage becomes more obvious as strategies accumulate. Compared with the convergent function table \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \) in Fig. 3, under the proposed strategy the function table \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \) converges as shown in Fig. 5.

Fig. 5. Convergent function table \( Q\left( {s,a} \right) \) under the dynamic reward function

Compared with the static reward function, the dynamic reward function is stricter in selecting the best strategy, so more utility values are negative, and the adjustment actions corresponding to negative values will not be selected again; the function table \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \) therefore converges faster. Each row of the function table keeps at least one non-negative value, so no state is left without an optional adjustment action. In the experiments, we verify that the function table \( {\text{Q}}\left( {{\text{s}},{\text{a}}} \right) \) converges faster under the dynamic reward function.

4 Experiment

4.1 Experimental Design

To verify the effectiveness of the proposed strategy, this paper carries out experiments on the publicly available desensitized data set [10] collected from the real environment of the Baidu Inc. search data center. The physical environment of this experiment is six servers, each with an 8-core CPU and 32 GB of memory. The programming environment is Anaconda 3.6. The comparison objects are the supervised-learning decision tree strategy, the unsupervised-learning clustering strategy K-means, and the parameter selection and estimation strategy in literature [3]. To verify the effectiveness of the strategy in anomaly detection, we observe and analyse how the recall and precision of the anomaly detection results change and compare the F-Score values of the different anomaly detection models. To verify the optimization of the iterative process, the number of iterations per adjustment is compared with that of the original Q-Learning algorithm.

4.2 Evaluation Index

(1) Recall: the ratio of the true anomalous points detected to the total number of true anomalous points, as shown in Formula 5.

$$ {\text{recall}} = \frac{\# \,of\,true\,anomalous\,points\,detected}{\# \,of\,true\,anomalous\,points} $$
(5)

(2) Precision: the ratio of the true anomalous points detected to the total number of points detected as anomalous, as shown in Formula 6.

$$ precision = \frac{\# \,of\,true\,anomalous\,points\,detected}{\# \,of\,anomalous\,points\,detected} $$
(6)

(3) F-Score: a comprehensive measure of recall and precision, as shown in Formula 7:

$$ {\text{F}} - Score = \frac{2 \cdot recall \cdot precision}{recall + precision} $$
(7)
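For concreteness, the three metrics can be computed as in the small self-contained example below; the representation of anomalies as sets of time indices is an assumption.

```python
def evaluate(detected, truth):
    """Recall, precision and F-Score (Formulas 5-7) from two sets of time indices:
    points detected as anomalous and true anomalous points."""
    detected, truth = set(detected), set(truth)
    true_detected = len(detected & truth)
    recall = true_detected / len(truth) if truth else 0.0
    precision = true_detected / len(detected) if detected else 0.0
    f_score = (2 * recall * precision / (recall + precision)
               if recall + precision > 0 else 0.0)
    return recall, precision, f_score

# 3 of the 4 true anomalies found among 5 detections:
# recall = 0.75, precision = 0.6, F-Score ~ 0.667
print(evaluate({3, 8, 15, 21, 40}, {3, 8, 15, 30}))
```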

4.3 Experimental Results

4.3.1 Verification of Anomaly Detection Effect

In this strategy, we set F-Score > 0.70 and F-Score > 0.80 as the target states of the anomaly detection model, select Holt-Winters, ARIMA, EWMA, and Wavelet as the time series models, and, taking one day as the measurement unit over 30 days of data, compare against the decision tree algorithm, the clustering algorithm, and the parameter selection and estimation method; the recall, precision, and overall F-Score values are obtained.

From Fig. 6, it can be seen that during anomaly detection the recall and precision of the manual adjustment method drop further each time the data characteristics change, while the other strategies maintain good anomaly detection after adjustment. Regarding the recovery speed of the anomaly detection model, the fourth, tenth, eighteenth, twenty-third and twenty-eighth days in Fig. 6 show that the strategy proposed in this paper (RL) and the parameter selection method recover fastest in the face of changes in data characteristics. Although both the decision tree strategy and the k-means method rely heavily on the characteristics of the new data, the decision tree strategy uses label updates from the expert system, so its anomaly detection effect recovers faster than that of k-means. In terms of overall detection effect, Fig. 6 shows that the decision tree strategy is relatively stable but its overall average is low, whereas RL, k-means, and the parameter selection method fluctuate more but have a higher overall average.

Fig. 6. Comparison of recall and precision

4.3.2 Optimization Verification of Iterative Process

To compare the number of iterations needed by the original Q-Learning algorithm and the optimized algorithm of this paper when adjusting the parameters, we add an iteration counter to record the number of iterations in each adjustment. After the anomaly detection process, the data shown in Fig. 7 are obtained:

Fig. 7. Iteration number comparison

The original Q-Learning algorithm rewards every parameter adjustment action that brings an improvement, whereas the dynamic reward function proposed in this paper rewards only the currently optimal parameter adjustment action, so fewer actions receive rewards and the number of optional adjustment actions in the next adjustment is also reduced. As can be seen from Fig. 7, five anomaly detection model adjustments occur during the anomaly detection process, and in each adjustment the optimized strategy of this paper needs fewer iterations than the original Q-Learning algorithm. This verifies the effectiveness of the proposed strategy in optimizing the iterative process of obtaining the best parameters.

5 Related Work

In the cloud environment, many researchers have studied anomaly detection algorithms. Some detect anomalies based on the data distribution, using inconsistency tests to compare the probability distribution of the detected data with a presumed probability distribution, such as literature [1]; others are based on deviation, such as the ARIMA algorithm in literature [2], the Holt-Winters algorithm in literature [3], and the Wavelet algorithm in literature [5]. However, these algorithms do not handle changes in data characteristics well and rely on manual re-adjustment to reach the desired detection performance.

To address changing data characteristics, researchers have studied adaptive detection models. Some are based on supervised learning, such as literature [8, 9]; others on unsupervised learning, such as literature [6, 7]. However, these approaches usually need an extra expert system to label anomalous data and depend heavily on historical data. In literature [4], two strategies are used for parameter configuration: one reduces the parameter sample space and enumerates a limited set of spare parameters in advance; the other uses a targeted parameter estimation algorithm to obtain appropriate parameters. However, this method cannot guarantee that the reduced sample space contains the optimal parameters for every data characteristic, and for complex anomaly detection algorithms a corresponding parameter estimation method must be designed and tested for each algorithm.

Building on the above work, this paper constructs an adaptive anomaly detection framework using reinforcement learning. By perceiving changes in the data characteristics, it automatically triggers the adjustment of the anomaly detection model as a Markov decision process, and it provides a strategy for selecting parameter adjustment actions for different anomaly detection algorithms. This realizes automatic adjustment of the anomaly detection model when data characteristics change and ensures a good anomaly detection effect in the cloud environment.

6 Conclusion

Anomaly detection is an important technology for ensuring the stability of the system services of the cloud platform. However, because of the complexity of data changes in the cloud environment, the anomaly detection model needs constant adjustment. In this paper, we introduce an adaptive detection method based on reinforcement learning, which automatically triggers the transformation of anomaly detection model adjustment into a Markov decision process by perceiving changes in the characteristics of the monitoring data. We also put forward a selection strategy for parameter adjustment actions and an optimized algorithm for obtaining the best parameters, realizing automatic adjustment of the anomaly detection model when the data characteristics change. In future work, we will further optimize the iterative process of the Markov decision process parameters, reduce the time of the parameter selection process, and improve the adaptability and sensitivity of the model in the anomaly detection process.