1 Introduction

Nowadays, system state monitoring has become critical at multiple layers of a system, from the physical network infrastructure to services. Monitoring plays a crucial role in malfunction detection, in predicting and preventing system downtimes, and in performance logging, and it is vital in security aspects such as the detection of malware penetration.

As the number of interconnected devices and the volume of network traffic increase exponentially, it has become an ever-greater challenge to attain a dependable, sound, and prompt solution for network infrastructure monitoring. It is pivotal to understand the details of complex processes and to identify their influence on each other and on the whole infrastructure. The concept of network telemetry is one step in this direction, allowing automated, quick, and concurrent transmission and recording of a wide range of time-series data from numerous network devices. However, processing vast amounts of data brings challenges, especially regarding scalability and time constraints.

One of the most promising applications of telemetry data processing is anomaly detection. Should the monitored system behave abnormally, unusual patterns, i.e., anomalies, occur in the data. The identification of such anomalous patterns is called anomaly detection, which is especially useful when performed in real time.

Machine learning can process and analyse infrastructure behaviour even at large data volumes. The use of machine learning techniques in anomaly detection is emerging; however, many obstacles are yet to be tackled.

In our state-of-the-art study, we found ReRe [1], a long short-term memory (LSTM) [2]-based machine learning algorithm, to be one of the most promising approaches for real-time anomaly detection on network time-series data. In the previous work of our research team, we proposed an improved version of ReRe called Alter-Re\(^2\) [3]. This algorithm offers high anomaly detection accuracy with a low amount of false positives and retrainings.

However, it still has shortcomings, such as the manual parameter setting and a constant prediction offset that can result in incorrect predictions.

In this paper, we propose AREP (Alter-Re\(^2\)+), our further improved version of Alter-Re\(^2\). AREP introduces automatic tuning of its two key parameters (detection window size and ageing power coefficient) and includes an offset compensation component, which executes a model retrain when an offset is present.

Unfortunately, we have observed that AREP and its predecessors perform well only on datasets showing specific patterns. Based on this observation, we propose a data type classification method to identify patterns with distinct characteristics. Thanks to this classification, when AREP is used offline or in a semi-real-time manner, it can be applied for anomaly detection on the datasets it suits best, while other, more appropriate anomaly detection algorithms can be chosen for data streams showing different patterns.

Moreover, we use an extended range of metrics in our performance evaluations, including area under the curve (AUC). AUC computation is based on receiver operating characteristic (ROC) curves. However, in the case of AREP and its predecessors, generating the ROC curves is challenging due to the algorithms’ inherent adaptive threshold technique. Thus, we also introduce a novel ROC generation approach that can be applied to AREP/Alter-Re\(^2\)/ReRe and to other algorithms using an adaptive threshold technique.

We have assessed the efficacy of AREP through rigorous experiments using four different Numenta benchmark datasets from NAB [4] and achieved promising results. First, we compared the performance of AREP to its predecessors and then to seven other state-of-the-art approaches. We show that, on time series following specific data patterns, AREP outperforms its predecessors and performs similarly to or even better than the other investigated algorithms.

In summary, our contributions are threefold:

  • We propose AREP, an improved version of Alter-Re\(^2\) anomaly detection algorithm. It introduces automatic and adaptive tuning of its two key parameters (detection window size and ageing power coefficient) and includes an offset compensation procedure to eliminate prediction offset efficiently.

  • We propose a data type classification method to identify datasets with specific characteristics on which AREP and its predecessors can be applied efficiently in off-line or semi real-time scenarios.

  • We introduce a novel ROC curve generation approach for algorithms that inherently rely on adaptive detection threshold as a key part of their function, such as AREP and its predecessors.

The rest of the paper is organised as follows: In Sect. 2, we discuss related works in the field of anomaly detection. Section 3 introduces Alter-Re\(^2\), then we present its improved version, AREP, in Sect. 4. In Sect. 5, we introduce our data type classification method. Section 6 discusses our evaluation metrics and presents our novel ROC curve generation approach. Section 7 presents our experiments to investigate AREP’s performance when compared on the one hand to ReRe and Alter-Re\(^2\), and on the other hand to other state-of-the-art anomaly detection approaches. Finally, Sect. 8 concludes the paper and discusses further research implications.

2 Related work

In this section, we briefly overview related anomaly detection approaches and identify room for improvement.

Putina et al. [5] at Cisco developed a streaming telemetry-based anomaly detection engine for BGP anomalies. It uses core and outlier micro-clusters with the DenStream [6] clustering algorithm. This way, clusters of arbitrary shapes can be dynamically constructed. However, DenStream requires the calculation of certain parameters, which cannot be done in real time. Moreover, domain knowledge is also required, and the algorithm is limited to BGP data.

A more generic real-time anomaly detection approach not primarily aimed at network telemetry is the Anomaly Detection Package developed by Twitter [7]. This approach uses the Seasonal Hybrid ESD algorithm to detect statistically significant local and global anomalies. The variant working on time series is called AnomalyDetectionTs (ADT), while the variant working on a vector of numerical observations is called AnomalyDetectionVec (ADV). However, the approach’s applicability to streaming time-series data is limited, as the algorithm requires a massive number of data points.

Another approach, ContextOSE [8], is based on the contextual anomaly detection (CAD) method. CAD uses contextual, i.e., local information of a time series instead of a global view. This method is unsupervised and uses a set of centroid values calculated from a subset of similar time series. Predictions are made using these centroid values, among other time-series features. Although this approach is promising and applicable to our problem, setting the parameters manually requires domain knowledge.

Skyline [9] is a real-time anomaly detection system built for passive monitoring of hundreds of thousands of metrics without the need to configure a model or thresholds manually for each one. The method uses different expert approaches in an ensemble. These approaches vote independently, and the final anomaly score is calculated from the votes in multiple manners. Researchers at Numenta [10] modified this approach to better fulfil the constraints set by real-time data stream processing. This was achieved by removing the computationally expensive detectors. Skyline is directly applicable to our problem, but its computational resource requirements are still considerable.

Adams et al. [11] propose a Bayesian changepoint detection algorithm for online inference. Changepoint detection aims to identify abrupt changes in the generative parameters of sequential data. Their approach calculates the probability that the current record is part of the data stream for various lengths based on previous data points. If the maximum probability corresponds to a stream length of zero, a changepoint is detected. The authors use the size of the whole dataset for initialisation. To make this method work in real time and thus applicable to our problem, the team at Numenta replaced this with a recursively overwritten array. This approach is promising; it can provide fast results, but setting the correct parameter values can be challenging.

One of the most promising trends of time-series analysis, in the network telemetry domain and beyond, is machine learning-based approaches. Kaiafas et al. [12] take advantage of ensemble learning to combine multiple unsupervised machine learning algorithms. Using this method, they can omit ground truth by constructing outlier ensembles on the dataset. A drawback is that the data need normalisation, and the algorithm is devised for offline stored data, making it unsuitable for real-time use cases. Shi et al. [13] propose UADAIN, a novel unsupervised anomaly detection approach based on an artificial immune network, which consists of unsupervised clustering, cluster partitioning and anomaly detection. It is an unsupervised method but works only in offline mode.

Lavin et al. proposed a hierarchical temporal memory-based (HTM) [14] anomaly detection approach [4, 15]. HTM is a continuous learning system derived from the theory of the neocortex and is well suited for real-time applications. Numenta and NumentaTM are two variants of the proposed HTM approach. They provide promising results; however, it is unclear how well they can adapt to changing patterns in time series. The authors have also made available a controlled, open-source test environment called the Numenta Anomaly Benchmark (NAB), a valuable research resource. Similarly, Pilinszki-Nagy et al. [16] proposed an HTM-based approach to model and predict sequential data, including time-series data. The HTM model is trained in an offline, unsupervised way; however, it is solely used for prediction and does not address the issue of anomaly detection.

In contemporary literature, the applicability of deep learning to anomaly detection has been explored. Munir et al. [17] propose a convolutional neural network-based (CNN-based) time-series predictor and anomaly detector called DeepAnT, which is able to detect anomalies using a predefined threshold. Although this approach is applicable to our problem, the predefined threshold can degrade detection accuracy when data patterns change. Flusser et al. [18] propose a surrogate neural network based on an auxiliary training set to approximate existing anomaly detectors with high accuracy and with application-enabling inference speed. The presented results show good performance; however, the method works only in batch mode, and it is unclear how it can dynamically adapt to changing data patterns. Jiang et al. [19] propose an anomaly detection method for industrial multi-sensor signals based on enhanced spatiotemporal features. In this method, the signal is first preprocessed; then, a stack of LSTM [2] and autoencoder [20]-based feature extractors is applied, and finally, a high-dimensional unsupervised cluster is used to detect the abnormal signals. The method is effective but works only in offline mode.

As the prior art overview above shows, the number of existing works aimed at anomaly detection specifically in the streaming network telemetry domain is limited but growing. However, there is still room for research on anomaly detectors that are applicable to real-time streaming data; can adapt to data pattern changes in an unsupervised manner; can apply automatic parameter settings; and can tune their resource demands adaptively. The Greenhouse algorithm [21] is a good starting point to achieve these goals. It combines state-of-the-art machine learning and data management techniques for anomaly detection. It only needs normal samples and trains on a relatively small dataset. Prediction is made using an LSTM model. It has received substantial research attention recently, as multiple algorithms, such as RePAD [22], ReRe [1], and Alter-Re\(^2\) [3], are gradual improvements on it and on each other. RePAD eliminates the need for normal training data, ReRe aims to mitigate false-positive detections, and Alter-Re\(^2\) reduces the resource requirements while improving detection performance. As this family of algorithms is mostly in line with our goals, we base our approach for streaming telemetry anomaly detection on Alter-Re\(^2\) and develop it further.

3 Background and motivation

In this section, we first introduce Alter-Re\(^2\), the predecessor of AREP, and then discuss its main limitations.

3.1 Alter-Re\(^2\)

Alter-Re\(^2\) [3], also proposed by our research team, is a state-of-the-art anomaly detection approach for streaming time-series data that can process the data in real time. The approach is based on RePAD [22] and ReRe [1] by Lee et al.

The algorithm uses the ‘look back, predict forward’ method, taking short-term historical data points to predict the upcoming value. The approach uses LSTM, a type of recurrent neural network (RNN), for data prediction. Then, similarly to its predecessors RePAD [22] and ReRe [1] and to other approaches, the predicted value is compared to the actual measured value of the data stream to decide whether an anomaly is likely in the near future. Alter-Re\(^2\) can set the detection threshold dynamically; therefore, it is apt for detecting changing data patterns and anomalies. Two advantages of the approach are that it converges quickly, i.e., it is ready for detection shortly after start-up, and that it is unsupervised, i.e., it does not require any labelled dataset but can learn from the raw data itself.

The novelty of Alter-Re\(^2\) compared to its predecessors is the introduction of a sliding window and an ageing mechanism. The sliding window is a traditional technique, utilised in Alter-Re\(^2\) to minimise the algorithm’s memory requirements and to phase out older data points that may negatively influence detection by slowly increasing the detection threshold. Similarly, ageing is a traditional mechanism, used in Alter-Re\(^2\) to assign different weights to the data points, thus giving higher importance to the recent past.

In accordance with the ‘look back, predict forward’ method, Alter-Re\(^2\) takes the previous b data points (where b is the look-back parameter) and uses them to predict the next f data points (where f is the predict-forward parameter). In Alter-Re\(^2\), f is set to 1. The algorithm uses the following key equations:

$$\begin{aligned} {{\textit{AARE}}}_t=\frac{1}{t-W+1}\cdot \sum _{y=W}^{t} {C_y\cdot \frac{|v_y-\widehat{v_y}|}{v_y}}, \end{aligned}$$
(1)

where

  • \({\textit{AARE}}_t\) is the Average Absolute Relative Error at timestep t;

  • t is the current timestep, starts from \(t=0\);

  • b is the look-back parameter;

  • \(v_y\) is the data point at timestep y;

  • \(\widehat{v_y}\) is the predicted data point for timestep y;

  • W is the beginning timestep of the window;

  • \(C_y\) is the ageing coefficient at timestep y;

    $$\begin{aligned} \mu _{{\textit{AARE}},t}=\frac{1}{t-W+1}\cdot \sum _{y=W}^{t} {{\textit{AARE}}_y}, \end{aligned}$$
    (2)

    where

  • \(\mu _{{\textit{AARE}},t}\) is the average of \({\textit{AARE}}_y\) values at timestep t;

    $$\begin{aligned} \sigma _{{\textit{AARE}},t}=\sqrt{\frac{\sum _{y=W}^{t} {({\textit{AARE}}_y-\mu _{{\textit{AARE}},t})^2}}{t-W+1}}, \end{aligned}$$
    (3)

    where

  • \(\sigma _{{\textit{AARE}},t}\) is the standard deviation of \({\textit{AARE}}_y\) values at timestep t;

    $$\begin{aligned} {{\textit{thd}}}_t=\mu _{{\textit{AARE}},t}+3\cdot \sigma _{{\textit{AARE}},t}, \end{aligned}$$
    (4)

    where

  • \({\textit{thd}}_t\) is the threshold value at timestep t.

AARE is a well-known measure for determining the accuracy of predictions. A low AARE value indicates that the forecast value is close to the observed value [22]. The \({\textit{thd}}_t\) values are calculated using the Three-Sigma Rule [23] in Eq. (4), which is commonly used in anomaly detection algorithms to determine the detection threshold.
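To illustrate how Eqs. (1)–(4) fit together, the following minimal sketch computes the windowed AARE and the adaptive threshold. It assumes the series are stored in NumPy arrays indexed by timestep; the function names and array layout are ours, not part of the original algorithm description.

```python
import numpy as np

def aare(values, predictions, ageing, W, t):
    """Average Absolute Relative Error over the window [W, t] (Eq. 1).
    `ageing[y]` plays the role of the C_y coefficients that weight
    recent data points more heavily."""
    y = np.arange(W, t + 1)
    rel_err = np.abs(values[y] - predictions[y]) / values[y]
    return np.sum(ageing[y] * rel_err) / (t - W + 1)

def adaptive_threshold(aare_history, W, t, ts=3.0):
    """Detection threshold thd_t (Eqs. 2-4): mean plus TS times the
    standard deviation of the AARE values in the window. TS = 3
    corresponds to the Three-Sigma Rule."""
    window = np.asarray(aare_history[W:t + 1])
    return window.mean() + ts * window.std()  # ddof=0 matches Eq. (3)
```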

Alter-Re\(^2\) implements two LSTM models to provide two levels of detection sensitivity just like ReRe does. These models are deployed as two separate detectors working in tandem as described in [3].

3.2 Limitations of Alter-Re\(^2\)

3.2.1 Manual parameter setting

Although Alter-Re\(^2\) improved the performance of ReRe [1], it introduced two new variables, the \({\textit{WS}}\) (window size) and the \({\textit{AP}}\) (age power) parameters, which have to be set manually. Even though the Alter-Re\(^2\) paper [3] provides valuable tips on selecting proper values for them, each dataset requires slightly different settings. Moreover, if data patterns change mid-operation, there is no option for further adjustments.

In AREP, we introduce an automatic parameter tuning component capable of adjusting both the \({\textit{WS}}\) and \({\textit{AP}}\) parameters to improve performance and eliminate the need for manual parameter setting.

3.2.2 Constant prediction offset

While Alter-Re\(^2\) improves the performance of ReRe in many aspects, it still possesses some problems inherent from its internal design, such as the possibility of a constant prediction offset.

Figure 1 depicts that Alter-Re\(^2\) signals a ‘pattern change’ at timestep 1906. This triggers an LSTM retrain. The newly trained model predicts data points until the next LSTM retrain is performed at timestep 3837.

Fig. 1: Anomalies and pattern changes

Incidentally, as shown in Fig. 2, this model makes inaccurate predictions, resulting in a visible offset (a close-to-constant difference) between the original (green) and the predicted (red) values. This offset is most likely due to the training data handed to the LSTM model by the Alter-Re\(^2\) algorithm. Patterns can only be inferred from these data points; if they do not represent the patterns normally residing in the dataset, the LSTM model consistently predicts erroneously. The offset is present until timestep 3837, where another LSTM retrain is performed due to the detected pattern change. After that, predictions are correct again, as the training data handed to the model accurately represent the remaining dataset’s characteristics.

Fig. 2: Offset between the original and predicted values

Unfortunately, the presence of an offset influences prediction accuracy negatively, increasing the average error. This may result in fewer detected anomalies. As this problem can arise after any LSTM retrain, the algorithm may lose its detection capability for an unpredictable duration.

To address this issue, we introduce in AREP an offset compensation component able to detect the offset and trigger an LSTM retrain. Since the phenomenon occurs rarely, a few additional retrains suffice.

4 AREP

In this section, we propose AREP, an improved version of Alter-Re\(^2\), which extends the original algorithm with two new procedures: the automatic tuning of the \({\textit{WS}}\) and \({\textit{AP}}\) parameters, and offset compensation.

4.1 Automatic tuning of WS and AP parameters

4.1.1 Tuning algorithm

We have devised a tuning algorithm capable of automatically adjusting the \({\textit{WS}}\) and \({\textit{AP}}\) parameters in real time. The algorithm (its flowchart is presented in Fig. 3) starts with initialisation and setting the default parameter values. Fortunately, AREP is not sensitive to the default settings, as the automated tuning converges to the appropriate values. Then, it passes the preparation period, which is \(2b-1\) timesteps long. To avoid too frequent adjustments, \({\textit{WS}}\) and \({\textit{AP}}\) cannot be changed more often than every b timesteps. At every timestep, these two conditions are checked, and automatic tuning starts only if both are true.

Fig. 3: Flowchart of automatic \({\textit{WS}}\) and \({\textit{AP}}\) parameter tuning

[Algorithm 1: adjustment of the \({\textit{WS}}\) and \({\textit{AP}}\) parameters]

The algorithm has three operation modes, denoted by the op variable. When \(op=0\), the \({\textit{WS}}\) parameter is tuned. First, the two Boolean variables WS_MIN and WS_MAX are evaluated. The algorithm detects certain criteria associated with extreme values (cf. Sect. 4.1.2) of \({\textit{WS}}\) and behaves accordingly. When \(op=1\), the \({\textit{AP}}\) parameter is adjusted similarly. When \(op=2\), neither parameter is tuned in the current timestep. The adjustment component is presented in Algorithm 1. As a result, both the \({\textit{WS}}\) and \({\textit{AP}}\) parameters tend towards an optimal value, as illustrated in Fig. 4. Note that in the extreme situation when the time-series dataset changes its behaviour entirely after a certain time, a restart of the tuning algorithm is recommended.

Fig. 4: Adjustment algorithm in operation
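For concreteness, the following sketch shows how such a mode-based adjustment round can look. The dispatch on op and the flag semantics follow the description above; the concrete update steps (\(\pm b\) for \({\textit{WS}}\), a factor of 1.1 for \({\textit{AP}}\)) are illustrative placeholders, not the exact increments of Algorithm 1.

```python
def tune_step(op, ws, ap, flags, b):
    """One tuning round (simplified sketch of Algorithm 1).
    `flags` holds the Booleans derived from the Table 1 criteria."""
    if op == 0:                      # tune the window size WS
        if flags["WS_MAX"]:          # window considers too many points
            ws = max(b, ws - b)      # shrink, never below the look-back b
        elif flags["WS_MIN"]:        # window considers too few points
            ws += b
    elif op == 1:                    # tune the age power AP
        if flags["AP_MIN"]:          # ageing not aggressive enough
            ap *= 1.1
        elif flags["AP_MAX"]:        # ageing too aggressive
            ap /= 1.1
    # op == 2: neither parameter is changed in this timestep
    return ws, ap
```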

4.1.2 Extreme parameter value detection

To facilitate the detection of extreme parameter values, we use a separate database that stores only three pieces of information: the timestep, whether an anomaly was detected (true or false), and whether a pattern change was detected (true or false). The size of this database, \(L_{db}\), is determined by the b parameter as \(L_{db} = b^2\).

Table 1 summarises how extreme parameter values (too small or large) for \({\textit{WS}}\) or \({\textit{AP}}\) are determined from certain criteria.

Table 1 Mapping criteria to extreme parameter values

Note that the WS_MAX and AP_MIN cases are equivalent, just like the WS_MIN and AP_MAX cases. To capture the impact of the \({\textit{WS}}\) and \({\textit{AP}}\) parameters, we compute A, the area under the \(C_y\) curve. The first two criteria in Table 1 are determined by comparing this area to two hard thresholds based on the b parameter. The algorithm functions properly only if the \({\textit{WS}}\) parameter is greater than or equal to the b parameter, so using linear ageing, the smallest legal window size equals b. Therefore, the smallest allowed area is \(A_{{\textit{min}}} = (b-1)/2\). Similarly, the other hard threshold, the largest allowed area, is \(A_{{\textit{max}}}=b^2\). The ‘Too small/Too large area’ criteria are true if the current area is smaller/larger than \(A_{{\textit{min}}}\)/\(A_{{\textit{max}}}\).

The other three criteria are evaluated with regard to certain phenomena that result from too many or too few data points considered by the algorithm. When the area falls between \(A_{{\textit{min}}}\) and \(A_{{\textit{max}}}\), these phenomena help fine-tune the \({\textit{WS}}\) and \({\textit{AP}}\) parameters.

When the window is too large or ageing is not aggressive enough, too many data points are considered. In this case, anomaly signals arrive later after an anomaly occurs and stay active longer. We found through rigorous experiments that if the number of continuous timesteps of an anomaly signal (\(n_{{\textit{anom}}}\)) is higher than \(2.5 \cdot b\), we can interpret it as a too large \({\textit{WS}}\) or a too small \({\textit{AP}}\). Thus, the ‘Too long anomaly’ criterion holds if \(n_{{\textit{anom}}}>2.5 \cdot b\).

On the other hand, if the algorithm considers too few data points, it does not have enough information to determine long-term patterns and thresholds reliably, and its operation becomes unstable. This implies two distinct yet similar effects. The first one is called ‘Too frequent signals’, which indicates that anomaly signals turn on and off rapidly. To detect this effect, we calculate a threshold and compare this threshold to the ratio of signals. The other effect is the ‘Anomaly flapping’ phenomenon, which refers to an anomaly signal quickly followed by another one. We measure this by comparing the length of anomaly (\(n_{{\textit{anom}}}\)) and no-signal periods (\(n_{no}\)) in between them. Based on experiments, we found that if the no-signal period is shorter than 1.5 times the previous anomaly signal, i.e., \(n_{no} < 1.5 \cdot n_{{\textit{anom}}}\), it is a clear sign of ‘Anomaly flapping’.

Table 2 summarises the criteria evaluation discussed above.

Table 2 Criteria evaluation
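A minimal sketch of these checks follows, assuming the quantities named above are already tracked per timestep. The ‘Too frequent signals’ criterion is omitted here, as it relies on a separately calculated signal-ratio threshold; the dictionary layout is ours.

```python
def evaluate_criteria(area, anomaly_run, no_signal_run, b):
    """Evaluate the area- and signal-length-based criteria of Tables 1-2.

    area          : current area A under the ageing curve C_y
    anomaly_run   : length n_anom of the latest continuous anomaly signal
    no_signal_run : length n_no of the no-signal gap following it
    """
    a_min = (b - 1) / 2        # smallest allowed area (linear ageing, WS = b)
    a_max = b ** 2             # largest allowed area
    return {
        "too_small_area":   area < a_min,
        "too_large_area":   area > a_max,
        "too_long_anomaly": anomaly_run > 2.5 * b,
        "anomaly_flapping": no_signal_run < 1.5 * anomaly_run,
    }
```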

4.2 Offset compensation

Below we present the procedure proposed to address the offset issue between the original and predicted values. We discuss the details of this offset compensation component and its automatic tuning.

4.2.1 Details of the offset compensation component

At every timestep, after performing the regular operation (see Sect. 3.1), AREP calculates the mean values of the original and predicted data points (\({\textit{VM}}_t\) and \({\textit{PM}}_t\)) in the sliding offset window and takes the difference of these values (\(\delta _t\)). The result is then compared to an offset threshold (\({\textit{thd}}_{{\textit{off}}, t}\)), calculated as the standard deviation of the relevant original data. If the ratio of above-threshold cases is higher than a specified value (set manually), we interpret that as an offset. If the offset still holds after an offset-window-long delay and no ‘anomaly’ or ‘pattern change’ was signalled, an LSTM model retrain is triggered.

The following equations define the formal computations used by our offset compensation component:

$$\begin{aligned} {\textit{OW}}= {\left\{ \begin{array}{ll} t-{\textit{OWS}}+1 &{} \text {if}\;t>2b-1+{\textit{OWS}} \\ 2b-1 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(5)

where

  • \({\textit{OW}}\) is the beginning of the sliding offset window;

  • \({\textit{OWS}}\) is the offset window size parameter (limited by \({\textit{WS}}\));

    $$\begin{aligned} {\textit{VM}}_t=\frac{1}{t-{\textit{OW}}+1}\cdot \sum _{y={\textit{OW}}}^{t} v_y, \end{aligned}$$
    (6)

    where

  • \(v_y\) is the original data value at timestep y;

  • \({\textit{VM}}_t\) is the original values’ mean in the offset window at timestep t;

    $$\begin{aligned} {\textit{PM}}_t=\frac{1}{t-{\textit{OW}}+1}\cdot \sum _{y={\textit{OW}}}^{t} \widehat{v_y}, \end{aligned}$$
    (7)

    where

  • \(\widehat{v_y}\) is the predicted data value at timestep y;

  • \({\textit{PM}}_t\) is the predicted values’ mean in the offset window at timestep t;

    $$\begin{aligned} \delta _t=|{\textit{VM}}_t-{\textit{PM}}_t |, \end{aligned}$$
    (8)

    where

  • \(\delta _t\) is the difference of the original values’ mean and the predicted values’ mean in the offset window at timestep t;

    $$\begin{aligned} {\textit{thd}}_{\textit{off}, t}=\sqrt{\frac{\sum _{y={\textit{OW}}}^{t} {(v_y-{\textit{VM}}_t)^2}}{t-{\textit{OW}}+1}}, \end{aligned}$$
    (9)

    where

  • \({\textit{thd}}_{\textit{off}, t}\) is the offset threshold (standard deviation of original data) in the offset window at timestep t.

The pseudo-code of the offset compensation component is presented in Algorithm 2.

[Algorithm 2: offset compensation]
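The following sketch condenses one timestep of the offset test from Eqs. (5)–(9). The persistence check (waiting an offset-window-long delay and verifying that no ‘anomaly’ or ‘pattern change’ was signalled before retraining) is left out; the variable names and history bookkeeping are ours.

```python
import numpy as np

def offset_step(values, predictions, above_history, t, b, ows, p_op):
    """One timestep of the offset test (sketch of Algorithm 2).
    `above_history` records, per timestep, whether delta_t exceeded the
    offset threshold; `p_op` is the manually set ratio above which the
    situation is interpreted as an offset."""
    ow = t - ows + 1 if t > 2 * b - 1 + ows else 2 * b - 1     # Eq. (5)
    v = np.asarray(values[ow:t + 1], dtype=float)
    p = np.asarray(predictions[ow:t + 1], dtype=float)
    delta = abs(v.mean() - p.mean())                           # Eqs. (6)-(8)
    thd_off = v.std()                                          # Eq. (9)
    above_history.append(delta > thd_off)
    recent = above_history[-ows:]
    return sum(recent) / len(recent) > p_op                    # offset present?
```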

Our experiments confirmed that the offset compensation performs as intended: it eliminates the constant offset between the predicted and original values, which occurs in some cases after retrains. However, it may somewhat increase the resource demand and the operation time of a given timestep. Another drawback is the introduction of two additional hyperparameters (\(P_{{\textit{OWS}}}\) and \(P_{{\textit{OP}}}\)). As these parameters determine the component’s sensitivity, effectiveness, and resource demands, they need to be set carefully.

4.2.2 Automatic parameter tuning

In order to address the two issues above, we have devised an automatic tuning procedure for the hyperparameters \(P_{{\textit{OWS}}}\) and \(P_{{\textit{OP}}}\) which also alleviates the impact on resource demands and operation time.

This procedure is based on time constraints. Hence, we introduce an intuitive hyperparameter, \(T_{{\textit{max}}}\), which specifies the average time limit on the duration of one timestep. By setting this single parameter manually, one can control the algorithm’s running time without struggling with the more sophisticated \({\textit{OWS}}\) and \({\textit{OP}}\) parameters.

We measure the average duration of timesteps with and without LSTM model retrains using the following equations:

$$\begin{aligned} \overline{\Delta t}_{{\textit{LSTM}}, y+1}&=\left( n_{{\textit{LSTM}}, y}\cdot \overline{\Delta t}_{{\textit{LSTM}}, y}+{\Delta t}_{y}\right) \nonumber \\&\quad \cdot \frac{1}{n_{{\textit{LSTM}}, y}+1}, \end{aligned}$$
(10)
$$\begin{aligned} \overline{\Delta t}_{{\textit{NORM}}, y+1}&=\left( (t-n_{{\textit{LSTM}}, y})\cdot \overline{\Delta t}_{{\textit{NORM}}, y}+{\Delta t}_y\right) \nonumber \\&\quad \cdot \frac{1}{t-n_{{\textit{LSTM}}, y}+1}, \end{aligned}$$
(11)

where

  • \(\Delta t_y\) is the duration of timestep y;

  • \(\overline{\Delta t}_{{\textit{LSTM}}, y}\) is the average duration of timesteps (up to timestep y) when an LSTM model retrain occurred;

  • \(\overline{\Delta t}_{{\textit{NORM}}, y}\) is the average duration of timesteps (up to timestep y) when no LSTM model retrain occurred;

  • \(n_{{\textit{LSTM}}, y}\) is the number of timesteps (up to timestep y) when an LSTM model retrain occurred.

Automatic offset compensation parameter tuning works as follows. First, we observed in our experiments that the window size has only a negligible effect on the algorithm’s stability and performance. Thus, we chose to set \({\textit{OWS}}\) to a fixed value (\({\textit{OWS}}=50\)). Second, we used the parameter \(T_{{\textit{max}}}\) to set a maximum limit on the average timestep duration within the offset window. \(T_{{\textit{OWS}}}\), the total duration of the window, can be well approximated by Eq. (12) as

$$\begin{aligned} T_{{\textit{OWS}}}&= n_{{\textit{LSTM}},y}\cdot \overline{\Delta t}_{{\textit{LSTM}}, y}\nonumber \\&\quad + ({\textit{OWS}}-1-n_{{\textit{LSTM}},y})\cdot \overline{\Delta t}_{{\textit{NORM}}, y}. \end{aligned}$$
(12)

We introduce \({\textit{RP}}\) (retrain percentage) defined in Eq. (13) that indicates the ratio of LSTM model retrain timesteps in the window

$$\begin{aligned} {\textit{RP}}=\frac{n_{{\textit{LSTM}}}}{{\textit{OWS}}}. \end{aligned}$$
(13)

\({\textit{RP}}_{{\textit{max}}}\), the maximum allowed ratio of LSTM model retrain timesteps can be computed according to Eq. (14) as

$$\begin{aligned} {\textit{RP}}_{{\textit{max}}}=\frac{{\textit{OWS}}-1}{{\textit{OWS}}}\cdot \frac{T_{{\textit{max}}}-\overline{\Delta t}_{{\textit{NORM}}, y}}{\overline{\Delta t}_{{\textit{LSTM}}, y}-\overline{\Delta t}_{{\textit{NORM}}, y}}. \end{aligned}$$
(14)

Based on these equations, the automatic tuning of the offset compensation parameters is presented in Algorithm 3. The ratio of retrains is limited by the manually set \(T_{{\textit{max}}}\) parameter and never exceeds \({\textit{RP}}_{{\textit{max}}}\).

[Algorithm 3: automatic tuning of the offset compensation parameters]
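A minimal sketch of the bookkeeping behind Eqs. (10)–(14) follows: the running averages are folded incrementally, and a retrain is permitted only while the retrain ratio stays below \({\textit{RP}}_{{\textit{max}}}\). The function names are ours, and the gating condition is our reading of Algorithm 3.

```python
def update_avg(avg, count, dt):
    """Incremental running mean used in Eqs. (10)-(11): fold the duration
    `dt` of the current timestep into the existing average."""
    return (count * avg + dt) / (count + 1)

def rp_max(ows, t_max, dt_norm, dt_lstm):
    """Maximum allowed ratio of retrain timesteps in the window (Eq. 14)."""
    return ((ows - 1) / ows) * (t_max - dt_norm) / (dt_lstm - dt_norm)

def retrain_allowed(n_lstm, ows, t_max, dt_norm, dt_lstm):
    """Permit an LSTM retrain only while RP (Eq. 13) stays below RP_max,
    keeping the average timestep duration under the user-set T_max."""
    return n_lstm / ows < rp_max(ows, t_max, dt_norm, dt_lstm)
```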

5 Pattern-based data classification

In this section, we detail the performance differences of AREP on different types of datasets and propose a data classification method based on patterns, such as periodicity and spikiness, to facilitate the decision whether AREP is the suitable anomaly detector.

5.1 Data patterns

We observed that, when used offline, AREP performed significantly worse on certain datasets. These datasets shared some common characteristics, such as periodicity or spikiness. AREP managed to detect most of the anomalies in aperiodic data with a constant average, where anomalies appeared either as sudden changes in the average (see Fig. 5) or as spikes rising from the neighbouring normal data. On the other hand, it regularly achieved worse performance on periodic data, in which anomalies appeared as spikes with a different amplitude or out of phase (see Fig. 6).

Fig. 5: Example of aperiodic, not spiked data

Fig. 6: Example of periodic, spiked data

We believe that classifying the datasets along some simple patterns can be beneficial. Based on this classification, when used offline we can determine in advance whether or not AREP is the suitable detector for the current dataset.

5.2 Classification

We classify datasets according to periodicity and spikiness. These two properties require different approaches.

5.2.1 Periodicity

One can investigate periodicity, often called seasonality, by generating and analysing a periodogram [24]. It estimates the power spectral density (PSD) of the dataset and is calculated in two steps [25]. First, we compute the discrete Fourier transform (DFT) of the original data. Then, we square the magnitude of each Fourier coefficient, which gives the points of the periodogram.

Algorithm 4 shows our periodogram-based method to classify datasets.

[Algorithm 4: periodogram-based periodicity classification]

The periodogram is generated according to Eq. (15) as

$$\begin{aligned} {\mathcal {P}}\left( f_{\frac{k}{N}}\right) = \left\| {\mathcal {F}}\left\{ x(t)\right\} \left( f_{\frac{k}{N}}\right) \right\| ^2, \end{aligned}$$
(15)

where

  • \({\mathcal {P}}(f)\) is a periodogram point at frequency f;

  • \(\Vert {\mathbf {v}}\Vert\) is the length of vector \({\mathbf {v}}\);

  • \({\mathcal {F}}\left\{ x(t)\right\} \left( f\right)\) is the DFT of the original data values at frequency f;

  • x(t) is the original data value at timestep t;

  • \(f_{\frac{k}{N}}\) are the discrete frequencies at which the DFT returns values, \(k=0,1,\dots ,\lceil \frac{N-1}{2}\rceil\); N is the number of points used in an N-point DFT.
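To make the two-step computation concrete, here is a minimal periodicity check built on Eq. (15). The decision rule (comparing the dominant peak to the total spectral power) and the power_ratio value are illustrative assumptions; the exact criterion is the one in Algorithm 4.

```python
import numpy as np

def is_periodic(x, power_ratio=0.2):
    """Periodogram-based periodicity check (sketch of Algorithm 4).
    Flags the series as periodic when one frequency dominates the PSD."""
    x = np.asarray(x, dtype=float) - np.mean(x)  # remove the DC component
    coeffs = np.fft.rfft(x)                      # DFT of the data
    psd = np.abs(coeffs) ** 2                    # squared magnitudes, Eq. (15)
    psd = psd[1:]                                # skip the zero frequency
    return psd.max() / psd.sum() > power_ratio
```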

5.2.2 Spikiness

We investigate spikiness by Algorithm 5.

[Algorithm 5: spikiness classification]

We determine a spike threshold \({\textit{thd}}_{spike}\) (the choice of 6 as the coefficient of std(diffs) is based on our experiments). We consider the dataset spiked if the ratio of spikes is larger than a pre-set \(P_{spike}\) parameter (also chosen based on experiments).
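A compact sketch of this check follows. The diff-based spike definition mirrors the description above; the default \(P_{spike}\) value is an illustrative placeholder, as its experimentally chosen value is not stated here.

```python
import numpy as np

def is_spiked(x, p_spike=0.01):
    """Spikiness check (sketch of Algorithm 5): a point is a spike when
    the first difference exceeds thd_spike = 6 * std(diffs)."""
    diffs = np.diff(np.asarray(x, dtype=float))
    thd_spike = 6 * diffs.std()
    spike_ratio = np.mean(np.abs(diffs) > thd_spike)
    return spike_ratio > p_spike
```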

6 Evaluation metrics

In this section, we first overview the metrics we used in our performance evaluations and then introduce a novel ROC curve generation approach for anomaly detectors, like ReRe, Alter-Re\(^2\) and AREP, that rely on adaptive detection threshold techniques.

6.1 Metrics

Comparing anomaly detectors is far from trivial. Well-known metrics like Precision, Recall or F-score [26] rely on the confusion matrix, whose elements are the True Positives (\({\textit{TP}}\)), False Positives (\({\textit{FP}}\)), True Negatives (\({\textit{TN}}\)) and False Negatives (\({\textit{FN}}\)). We also use these elements that are calculated by mapping anomaly detections to ground truth in the following way.

We designate anomaly windows of size K around the ground-truth anomaly labels, similarly to [27] or [28] (the ground-truth anomaly labels need to be known for the complete time series). For instance, if the ground-truth anomaly label is at timestep \(y_{{\textit{label}}}\), the window is the interval \([y_{{\textit{label}}}-K; y_{{\textit{label}}}+K]\). Inside an anomaly window, only the first anomaly signal is counted as a \({\textit{TP}}\). If there is no signal in the whole window, \({\textit{FN}}\) is incremented by one. On the other hand, all anomaly signals outside an anomaly window are considered \({\textit{FP}}\), which is then normalised by the anomaly window size (\({2 \cdot K - 1}\)). \({\textit{TN}}\) is counted by subtracting \({\textit{TP}}\), \({\textit{FP}}\) and \({\textit{FN}}\) from the number of data points of the entire dataset, normalised by the anomaly window size.
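The mapping reads as follows in code; this is a direct sketch of the rules above, with list-based inputs assumed for the signal and label timesteps.

```python
def confusion_matrix(signals, labels, K, n_points):
    """Map anomaly signals to (TP, FP, TN, FN) using anomaly windows of
    size K around each ground-truth label."""
    tp = fn = 0
    win = 2 * K - 1                      # normalisation factor
    covered = set()
    for y in labels:
        window = set(range(y - K, y + K + 1))
        covered |= window
        if any(s in window for s in signals):
            tp += 1                      # only the first signal counts
        else:
            fn += 1
    outside = sum(1 for s in signals if s not in covered)
    fp = outside / win                   # FPs normalised by window size
    tn = n_points / win - tp - fp - fn   # normalised dataset length
    return tp, fp, tn, fn
```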

In our experiments, we primarily used the Precision, Recall and F-score metrics for performance evaluation. Precision represents the fraction of positive detections that are actually positive. Recall represents the fraction of actual positives that have been correctly detected as positive. F-score is the weighted harmonic mean of Precision and Recall.

Moreover, besides other simple metrics like mean square error (MSE), we wanted to compare the performance of the ReRe family algorithms also by generating ROC (receiver operating characteristic) curves and computing the corresponding AUC (area under the curve) metrics [29]. The ROC curve is an effective visual method, which depicts on a graph the relationship between the false-positive rate (\({\textit{FPR}}={\textit{FP}}/({\textit{FP}}+{\textit{TN}})\)) on the x-axis and the true-positive rate (\({\textit{TPR}}={\textit{TP}}/({\textit{TP}}+{\textit{FN}})\)) on the y-axis.

The points on the curve correspond to the detector’s behaviour at different probability threshold values, with two extreme points: at point (0, 0), all detections are negative, while at point (1, 1), all detections are positive. The more a detector’s ROC curve approaches the perfect detection point (0, 1), the better it is. For quantitative comparison, the AUC metric can be used, which is calculated simply by integrating the area under the ROC curve.

Nonetheless, to generate ROC curves, we need a threshold variable that we can adjust in a range [29], thereby influencing the detector’s \({\textit{FPR}}\) and \({\textit{TPR}}\). Ideally, when this threshold takes its minimum value, \({\textit{FPR}}={\textit{TPR}}=0\), and when it takes its maximum value, \({\textit{FPR}}={\textit{TPR}}=1\). Probabilistic anomaly detectors satisfy this criterion. However, in the case of deterministic detectors, when the detector provides only an ‘anomaly’ or ‘no anomaly’ signal without a probability score, it is not trivial to identify such a threshold variable.

Unfortunately, the ReRe family algorithms—including AREP—are deterministic. Although they use a decision threshold (\({\textit{thd}}\)), it is not a fixed value but adapts to data pattern changes and prediction errors. Thus, just replacing it with a fixed score and adjusting this score would fundamentally alter the operation of these detectors. Therefore, we had to devise a new ROC curve generation method keeping these detectors’ specifics in mind. Our proposed method is detailed in the following section.

6.2 ROC curve generation method

As discussed above, the traditional ROC curve generation method cannot be directly applied to deterministic anomaly detectors, as they lack a probabilistic threshold variable to be adjusted. Maxion et al. [29] list the basic criteria for drawing ROC curves in their comprehensive book as follows:

– Source data must be classifiable into two categories, signals and noises [...]

– The “ground truth” label for each data event is available, i.e., knowledge about what is really a signal, and what is really a noise [...]

Since these criteria are satisfied by the deterministic ReRe family algorithms using adaptive decision thresholds, we propose our own ROC curve generation method devised specially for this family of anomaly detectors.

In these detectors, the adaptive \({\textit{thd}}\) value is calculated using the three-sigma rule (see Eq. (4)). We propose replacing the fixed coefficient of sigma with an adjustable parameter \({\textit{TS}}\) (threshold strength). This way, the \({\textit{TS}}\) parameter can influence the operation of the algorithm without interfering with its adaptability and basic principles. The modified threshold calculation is shown in Eq. (16) as

$$\begin{aligned} {\textit{thd}}_t = \mu _{{\textit{AARE}}, t} + {\textit{TS}} \cdot \sigma _{{\textit{AARE}}, t}. \end{aligned}$$
(16)

Traditionally, the limits of the threshold variable are simply 0% and 100%. This cannot be applied directly in our case. We conducted rigorous experiments to determine the boundaries (\({\textit{TS}}_{{\textit{min}}}\) and \({\textit{TS}}_{{\textit{max}}}\)) within which the \({\textit{TS}}\) parameter needs to be adjusted. While tuning \({\textit{TS}}\), we found that too small an interval resulted in points on the ROC curve that did not extend to the edges properly. Too large an interval, on the contrary, meant wasted computational resources and points concentrated around the (0, 0) and (1, 1) coordinates. Eventually, we concluded that \({\textit{TS}}_{{\textit{min}}} = -7\) and \({\textit{TS}}_{{\textit{max}}} = 7\) is the most appropriate setting, providing the desired endpoints of the ROC curve.
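Procedurally, the generation reduces to sweeping \({\textit{TS}}\) over \([-7, 7]\) and re-running the detector at each setting. The sketch below assumes a hypothetical run_detector(ts) callback that runs the full detector with the Eq. (16) threshold and returns the resulting (FPR, TPR) pair; the step count is arbitrary.

```python
import numpy as np

def roc_points(run_detector, ts_min=-7.0, ts_max=7.0, steps=29):
    """Generate ROC points by sweeping the threshold strength TS of
    Eq. (16) instead of a probability threshold."""
    points = [run_detector(ts) for ts in np.linspace(ts_min, ts_max, steps)]
    return sorted(points)                # sort by FPR for plotting and AUC

def auc(points):
    """Trapezoidal area under the sorted ROC points."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```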

Although the ROC curves generated by the traditional method and ours share the same fundamentals, using them as equivalent metrics is not advised. Therefore, we use these ROC curves to compare only the ReRe family detectors to each other and not in the comparison of AREP to other state-of-the-art algorithms.

7 Experiments

In this section, after the preliminaries, we first assess the performance of AREP compared to ReRe [1] and Alter-Re\(^2\) [3], which share common roots. Then, we compare AREP’s performance to seven other state-of-the-art approaches, namely ContextOSE [8], Bayesian Changepoint [11], DeepAnT [17], HTM [14], Numenta TM [4], Skyline [9], and Twitter ADVec [7]. Finally, we discuss our results (the raw results data are also publicly available on GitHub) and the potential role of AREP among the anomaly detectors.

7.1 Preliminaries

7.1.1 Implementation

Since we have not found an official, publicly available source code of ReRe, our research team implemented it in Python. Similarly, our research team implemented Alter-Re\(^2\) and AREP in Python as well (AREP’s source code is publicly available on GitHub).

Moreover, for the other investigated algorithms, we used the source codes from [30, 31].

7.1.2 Parameter settings

For evaluating the ReRe family algorithms, we followed the parameter settings proposed in [3] and our experiments (in the case of Alter-Re\(^2\) and AREP). Table 3 summarises these parameter settings.

Table 3 Parameter settings of the ReRe family algorithms

In the case of the other investigated algorithms, we followed the settings proposed in the literature. Hence, for ContextOSE, Skyline, Bayesian Changepoint, Twitter ADVec, HTM and Numenta TM, we used the parameter settings and optimal detection thresholds suggested by Lavin et al. from Numenta [30]. For DeepAnT, we set the history window size to 35, the median of the values shortlisted by Munir et al. [17]. We used the first 40% part of each dataset as the training data following the authors’ recommendations. We set the rest of the parameters according to the re-implementation of Singh [31].

7.1.3 Anomaly window

An anomaly window needs to be designated to evaluate the algorithms. In the comparison of the ReRe family algorithms, we set, based on our experiments, the anomaly window size (K parameter) to 20. In the comparison of AREP to the other state-of-the-art approaches, we used the method devised by Lavin et al. [4]. Thus, we counted 10% of the original dataset’s length and divided this value by the number of anomalies. Then, we rounded the result up to the nearest integer to determine the anomaly window size.
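As a hypothetical worked example of the Lavin et al. rule: a 4,000-point dataset with three labelled anomalies yields a window size of \(\lceil 400/3 \rceil = 134\). In code:

```python
import math

def anomaly_window_size(n_points, n_anomalies):
    """Window size per Lavin et al. [4]: 10% of the dataset length,
    divided by the number of anomalies, rounded up."""
    return math.ceil(0.10 * n_points / n_anomalies)

print(anomaly_window_size(4000, 3))  # -> 134 (hypothetical example)
```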

7.1.4 Datasets

We used the benchmark datasets from NAB [4]. This benchmark consists of one-dimensional time series from various domains. The authors included noisy data and several types of streaming data anomalies in these datasets. Anomaly labels are always provided in a separate file in the NAB GitHub repository [30]. The labels are assigned either based on the time point of the problem’s known root cause or manually following a labelling procedure described in [32, 33].

Table 4 Summary of datasets used in our evaluations

The NAB database contains 58 datasets altogether. For our investigations, we selected the four best-fitting aperiodic, not spiked datasets, on which AREP performs best. The details of these four datasets are summarised in Table 4.

7.2 Comparison of the ReRe family algorithms

This section details our findings when comparing AREP to its predecessors, ReRe and Alter-Re\(^2\). The metrics used for this comparison are the mean square error (MSE) and the traditional Precision, Recall and F-score metrics, extended with the ROC curve and the AUC metric.

First, we investigated the performance of the ReRe family algorithms with respect to the prediction MSE. This metric captures the LSTM model’s prediction accuracy well. Figure 7 depicts the results together with bars representing the standard deviation. As expected, AREP produced the lowest MSE, 0.001, compared to 0.002 and 0.004 of Alter-Re\(^2\) and ReRe, respectively. This improvement confirms the usefulness of the offset compensation component introduced in AREP (cf. Sect. 4.2).

Fig. 7: Prediction MSE produced by the ReRe family algorithms

Furthermore, we depict in Fig. 8 example ROC curves of all three ReRe family algorithms, generated by our newly developed method (cf. Sect. 6.2). The dataset (grok_asg_anomaly) contained three anomalies; therefore, the true-positive rate takes only four distinct values, namely 0.00, 0.33, 0.67 and 1.00. On the other hand, the false-positive rate can have a significantly higher granularity, as the number of false positives in the worst case is approximately the length of the dataset. These reasons explain the stepped shape of the ROC curves.

Fig. 8: ROC curves of the ReRe family algorithms using one of our evaluation datasets

Finally, we compared ReRe, Alter-Re\(^2\) and AREP using the traditional Precision, Recall and F-score metrics. We have also created ROC curves and calculated the AUC metric for all four datasets used in our evaluations (cf. Table 4). The results are presented in Fig. 9.

Fig. 9: Comparison of the ReRe family algorithms

The figure depicts the metrics averaged over all four datasets, with bars representing the standard deviation. As expected, Alter-Re\(^2\) outperformed ReRe, and AREP outperformed both other algorithms on all metrics in these experiments. It is also promising that we observed a significant improvement of AREP over Alter-Re\(^2\): more than double the Precision, 4.4 times higher Recall, 3.3 times higher F-score and an 11% higher AUC value on average.

7.3 Comparison of AREP to other state-of-the-art approaches

In the second part of our evaluations, we focused on comparing AREP to seven other state-of-the-art approaches (ContextOSE, Bayesian Changepoint, DeepAnT, HTM, Numenta TM, Skyline, Twitter ADVec). First, we computed and compared the Precision, Recall and F-score metrics for each investigated algorithm. The average of each metric across all four datasets, together with the standard deviation values, is depicted in Fig. 10.

Fig. 10: Comparison of AREP with state-of-the-art algorithms

AREP achieved 0.99 average Precision on these datasets, similarly to Bayesian Changepoint, Skyline, Numenta TM and Twitter ADVec. DeepAnT also showed good performance with a 0.87 value, while the remaining two algorithms produced a bit lower Precision, around 0.75. Regarding Recall, AREP outperformed all the other algorithms by achieving 0.92 on average. Its closest competitors were Bayesian Changepoint and DeepAnT, with 0.88 and 0.83, respectively. All the other algorithms achieved lower values (between 0.63 and 0.75). Also, from the viewpoint of F-score, AREP seems to be the optimal choice for these datasets with a 0.95 average F-score value. The two runner-ups were again Bayesian Changepoint and DeepAnT, with 0.91 and 0.84, respectively. Both Numenta TM and Skyline produced 0.83. All the other algorithms achieved a value below 0.80.

Then, we compared the time performance of AREP to the other approaches, using the average timestep duration measured over all four datasets as the metric. In our results, all algorithms showed comparable performance on average, except the Skyline algorithm, which was 1.83 times slower than AREP. The two fastest algorithms, ContextOSE and Bayesian Changepoint, did outperform AREP, but they took only 8.6% and 8% less time than our approach, respectively.

7.4 Discussion and implications

We believe, and our experiments support this, that the introduced components (automatic parameter tuning, offset compensation) have improved the performance of Alter-Re\(^2\); thus, AREP, the new version of the algorithm, can replace its predecessors in every regard. AREP also performs well compared to other state-of-the-art algorithms: in our experiments, it produced similar or even better performance than its competitors.

In fitting scenarios and on fitting datasets, AREP is a good choice. It yields better results with less domain knowledge and experimental setup than most other approaches. For example, Bayesian Changepoint, which we found to be AREP’s closest competitor, requires selecting the dataset’s probability distribution in advance, according to Adams et al. [11], along with setting prior parameters such as the mean and standard deviation of the dataset. This is not straightforward in offline usage, and it is an even greater challenge in a real-time scenario with constantly changing dataset patterns.

Furthermore, AREP can run with its default parameter settings immediately. Thus, there is no need to spend time experimenting with parameter setup upfront or to consult experts. AREP also allows intuitive control over its resource demands by tuning the \(T_{{\textit{max}}}\) parameter. In real-time scenarios with limited resources, setting \(T_{{\textit{max}}}\) low is a viable solution. When more resources are available, we can give AREP more operation time between consecutive timesteps by increasing \(T_{{\textit{max}}}\).

Another advantage of AREP is its adaptability to changes. This characteristic stems from its ReRe roots using the pattern change signal and the LSTM model retrain and has also been improved with the newly introduced automatic tuning component. Thus, one can set AREP to be more robust and react slower to sudden changes or to be more sensitive and react faster.

However, AREP has some limitations. It can be run only on univariate time series and cannot handle multi-dimensional input data at the moment. Moreover, AREP’s real-time applicability is limited as the whole time series is required for data pattern classification. Furthermore, although AREP has not shown worse performance than the other investigated algorithms in our experiments, this is not always the case. As discussed above, AREP’s performance greatly depends on the dataset’s pattern. Thus, if the dataset does not fit well with AREP, e.g., it is periodic or spiked, we advise using a more suitable anomaly detector.

8 Conclusions

In this paper, we proposed AREP, an adaptive, LSTM-based machine learning approach for real-time anomaly detection on network time series. AREP is an improved version of its direct predecessor algorithm, Alter-Re\(^2\). It extends Alter-Re\(^2\) with two components: the automatic tuning of its key parameters and an offset compensation procedure to increase accuracy.

As AREP performs well only on datasets showing specific patterns (aperiodic, not spiked), we also proposed a data type classification method to identify fitting patterns.

We devised a novel ROC curve generation approach for anomaly detectors using adaptive threshold techniques like AREP and its predecessors. Thus, one can also use the AUC metric to compare the performance of these algorithms.

We showed through rigorous experiments that, on time series following specific data patterns, AREP outperforms its predecessors and can compete with other state-of-the-art algorithms.

In future work, inspired by the results of our experiments, we plan to develop a framework that constantly analyses and classifies the incoming data stream based on more sophisticated patterns and adaptively chooses the most appropriate anomaly detection engine according to this classification. Moreover, we want to investigate the applicability of our anomaly detection method to multivariate data from two directions. First, apply our approach to every dimension of the multivariate data in parallel, correlate the results and derive a decision. The alternative is to conduct dimension reduction first, then apply our method to the reduced data space.