Introduction

The detection of anomalies in time series has received considerable attention in both the statistics (Chen and Liu 1993) and machine learning (Chandola et al. 2009) literature. This is no surprise given the broad range of applications, from fraud detection (Ferdousi and Maeda 2006) to fault detection (Theissler 2017; Zhao et al. 2018), that this area lends itself to. In recent years, the proliferation of sensors within the internet of things (IoT) has led to the emergence of real time detection of anomalies in streaming (high frequency) data as an important new challenge.

Anomalies can be classified in a number of different ways (Chandola et al. 2009). In this work, following the definitions of Fisch et al. (2022a), we distinguish between point and collective anomalies. Point anomalies, also known as outliers, global anomalies or contextual anomalies (Chandola et al. 2009), are single observations that are anomalous with regards to their local or global data context. Conversely, collective anomalies, also known as abnormal regions (Bardwell and Fearnhead 2017), or epidemic changepoints (Bruce and Jennie 1985), are sequences of contiguous observations which are not necessarily anomalous when compared to either their local or the global data context, but together form an anomalous pattern. Figure 1 provides several examples. In this paper, collective anomalies and epidemic changepoints are used interchangeably.

Fig. 1
figure 1

Time series containing collective and point anomalies. Typical data shown in grey, anomalous segments in red and point anomalies shown in blue

The epidemic changepoint model assumes that data follows some baseline, or typical distribution, everywhere except for some anomalous time windows during which it follows another distribution. The detection of epidemic changes in mean was first studied by Bruce and Jennie (1985) with applications to epidemiology. Since then, research in this area has been driven by various applications including the detection of copy number variants in DNA (Bardwell and Fearnhead 2017; Jeng et al. 2013; Olshen et al. 2004) and the analysis of brain imaging data (Aston and Kirch 2012; Stoehr et al. 2019). In particular, much of the pertinent literature has concentrated on the epidemic change in mean setting. See Yao (1993) and Gut and Steinebach (2005) for details.

More recently, the detection of joint epidemic changes in mean and variance as well as point anomalies was considered by Fisch et al. (2022a). In parallel, there has also been some work on detecting anomalies within the online setting. Gut and Steinebach (2005) consider the problem of detecting epidemic changepoints sequentially while Wang et al. (2011) and Ahmad et al. (2017) propose methods for the online detection of point anomalies. Other recent contributions include the MOSUM work by Kirch and collaborators (see, e.g., Eichinger and Kirch (2018)) that permits the online detection of changepoints, and is thus able to identify collective anomalies. Various outlier-robust Kalman filter approaches have also been proposed in recent years. See for example Ting et al. (2007), Agamennoni et al. (2011), Ruckdeschel et al. (2014), Chang (2014) and references therein. Fearnhead and Rigaill (2019) have introduced a computationally efficient changepoint detection approach that is robust to the presence of anomalies. Similarly the OSTAD package, developed by Iturria et al. (2020), can be used online to identify contextual anomalies.

The main contribution of this paper is to extend the offline Collective And Point Anomaly (CAPA) algorithm of Fisch et al. (2022a) to the online setting to detect both collective and point anomalies in streaming data, formalising early heuristic ideas appearing in Bezahaf et al. (2019). We call this algorithm Sequential-CAPA (SCAPA). We explore various practical aspects of working in this online setting, including (i) computational and storage costs, (ii) the typical (baseline) parameter estimation and (iii) penalty selection. In addition, we provide various theoretical and empirical guarantees to the resulting costs and accuracy trade-offs that arise from our focus on the online setting.

The article is organised as follows. In Sect. 2 we introduce the literature on offline detection of anomalous time series regions, particularly focusing on the recently proposed CAPA approach. Sect. 3 proceeds to extend this methodology to the online setting, introducing the Sequential Collective and Point Anomaly (SCAPA) algorithm. Theoretical properties of the proposed methodology are investigated in Sect. 4. Further results, together with a set of simulation studies is given in Sect. 5, indicating how these can be used to inform practitioners on how to select the hyper-parameters of SCAPA. Finally, we apply SCAPA to the monitoring of a sensor on a publically available, industrial machine-level data in Sect. 6. All proofs can be found in the supplementary material.

Background

CAPA, introduced by Fisch et al. (2022a), seeks to jointly detect and distinguish between point and collective anomalies within an offline, univariate time series setting. The heart of the approach is founded upon an epidemic changepoint model. To this end, consider a stochastic process \(x_t \sim {\mathcal {D}}(\theta (t))\), drawn from some distribution, \({\mathcal {D}}\), indexed by a set of model parameters, \(\theta (t)\). Collective anomalies can then be modelled as epidemic changes of the set of parameters \(\theta (t)\). I.e., time windows in which \(\theta (t)\) deviates from the typical, and potentially unknown, set of parameters \(\theta _0\). Formally,

$$\begin{aligned} \theta (t) = {\left\{ \begin{array}{ll} \theta _1 &{} s_1< t \le e_1 \\ &{}\vdots \\ \theta _K &{} s_K < t \le e_K \\ \theta _0 &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$

Here K denotes the number of collective anomalies, while \(s_i\), \(e_i\), and \(\theta _i\) correspond to the start point, end point and the unknown parameter(s) of the ith collective anomaly respectively.

The number and locations of collective anomalies are estimated by choosing \(K, (s_1,e_1) , \ldots , (s_k,e_K)\), and \(\theta _0\) such that they minimise the penalised cost

$$\begin{aligned} \sum _{t \notin \cup [s_i+1,e_i] } {\mathcal {C}}(x_t,\theta _0) + \sum _{j=1}^{K} \left[ \min _{\theta _j} \left( \sum _{t=s_j +1}^{e_j} {\mathcal {C}}(x_t,\theta _j) \right) + \beta _C \right] . \end{aligned}$$
(2.1)

\({\mathcal {C}}(\cdot ,\cdot )\) is a cost function, e.g., twice the negative log-likelihood, and \(\beta _C\) is a penalty term for introducing a collective anomaly, which seeks to prevent overfitting. A minimum segment length, l, can be imposed by adding the constraint \(e_k - s_k \ge l\) for \(k = 1,2, \ldots ,K\), if collective anomalies of interest are assumed to be of length at least \(l \ge 1\).

Minimising the cost function (2.1) exactly by solving a dynamic programme like the PELT method (Killick et al. 2012) is not possible. This is because the parameter of the typical distribution, \(\theta _0\), is shared across segments, and introduces dependence. Fisch et al. (2022a) suggest removing this dependence in \(\theta _0\) by obtaining a robust estimate \({\hat{\theta }}_0\) over the whole data and then minimising

$$\begin{aligned} \sum _{t \notin \cup [s_i+1,e_i] } {\mathcal {C}}(x_t,{\hat{\theta }}_0) + \sum _{j=1}^{K} \left[ \min _{\theta _j} \left( \sum _{t=s_j +1}^{e_j} {\mathcal {C}}(x_t,\theta _j) \right) + \beta _C \right] , \end{aligned}$$
(2.2)

as an approximation to (2.1) over just the number and location of collective anomalies. The main focus of Fisch et al. (2022a) was on the case where anomalies are characterised by an atypical mean and or variance. In this case, the authors suggest minimising

subject to a minimum segment length l of at least 2. The above expression arises from setting the cost function to twice the negative log-likelihood of the Gaussian. The robust estimates for mean and variance, \({\hat{\mu }}_0\) and \({\hat{\sigma }}_0\), can be obtained from the median and the inter-quartile range.

The main weakness of the above penalised cost is that point anomalies will be fitted as collective anomalies in a segment of length l. To remedy this, point anomalies are modelled as epidemic changes of length one in variance (only). The set of point anomalies is denoted as O. To infer both collective and point anomalies we minimise

(2.3)

with respect to \(K, (s_1,e_1) , \ldots , (s_k,e_K)\), and O, subject to the constraint \(e_k - s_k \ge l \ge 2\) for \(k = 1,2, \ldots , K\). Here, \(\beta _O\) corresponds to a penalty for a point anomaly.

The CAPA algorithm then minimises the cost in (2.3) by solving the dynamic programme

taking \(C(0) = 0\).

In practice, as is common in many time series settings, some form of pre-processing of the series may be required to ensure it is of a suitable form for the CAPA framework. For example, some form of deseasonalisation may be appropriate.

Sequential CAPA

We now introduce our Sequential CAPA procedure. In extending CAPA to the online setting three main challenges arise. Specifically, any approach developed should be mindful of the following: (i) that the computational and storage cost of the dynamic programme increase with time; (ii) the typical (baseline) parameters have to be learned online and (iii) penalty selection. We address each of these three challenges in turn, proposing solutions in the following sections, prior to formally introducing the SCAPA algorithm in Sect. 3.3.

Increasing Computational and Storage Cost

As noted in Sect. 2, CAPA infers collective and point anomalies by solving a set of dynamic programme recursions. However both the computational cost of each recursion, and the storage cost, increase linearly in the total number of observations. This is unsuitable for the online setting in which both storage and computational resources are finite.

In practice, this problem can be surmounted by imposing a maximum length m for collective anomalies. This can be achieved by adding the set of constraints

$$\begin{aligned} e_i - s_i \le m \forall i = 1,2, \ldots ,K \end{aligned}$$
(3.1)

to the optimisation problem in equation (2.3). The resulting problem can then be solved using the following dynamic programme

As a consequence of restriction (3.1), each recursion only requires a finite number of calculations. Moreover, only a finite number of the optimal costs, C(t), need to be stored in memory. The practical implications of this additional constraint are likely to be limited. Within this setting collective anomalies encompassing fewer than m observations will be detected as before. However, for those scenarios where an anomaly encompasses more than m observations, these will be fitted as a succession of collective anomalies each of length less than m, provided that their signal strength (cf Sect. 5.1 for a definition) is large enough. As one might anticipate, within this setting long anomalous segments with low signal strength would not be detectable any more as a result of the approximation.

Fig. 2
figure 2

a) Example time series with collective and point anomalies as well as the b) median and c) IQR estimated sequentially over time using different methods: The quantile filter by Justusson (1981) (Filter), the \(p^2\)-method of Jain and Chlamtac (1985) (P squared) and the Stochastic Approximation based method by Tierney (1983) (SA)

Sequential estimation of the typical parameters

As described in Sect. 2, the dynamic programme used by CAPA requires robust estimates of the set of typical parameters \(\theta _0 = (\mu _0,\sigma _0)\). Fisch et al. (2022a) estimate \(\mu _0\) and \(\sigma _0\) on the full data using the median and inter-quartile range respectively. In an online setting, however, these quantiles have to be learnt as the data is observed.

A range of methods have been proposed that aim to estimate the cumulative distribution function (CDF) of the data sequentially and use it to estimate quantiles. For example, Tierney (1983) proposed a method based on techniques from Stochastic Approximation (SA) to estimate the \(\alpha \)th quantile \(x_{(\alpha )}\) of an unknown distribution function. Moreover, Tierney (1983) also established that, in the i.i.d. setting, the resulting sequential estimates \({\hat{x}}_{(\alpha ),n} \rightarrow x_{(\alpha )}\) almost surely as the number of observations \(n \rightarrow \infty \). Under the same assumptions, they also showed that \(\sqrt{n}( {\hat{x}}_{(\alpha ),n} - x_{(\alpha )})\) converges in distribution to a Normal distribution. These consistency results are important for an online implementation of CAPA, as Fisch et al. (2022a) showed that the consistency of CAPA requires the robustly estimated mean and variance to be within \(O_p\left( \sqrt{\frac{\log (n)}{n}}\right) \) of the true typical mean and variance.

The memory required to obtain the SA-estimate is finite and small. Moreover, the standard errors of the SA-estimate and sample quantiles are close even for relatively small sample sizes, as can be seen from Fig. 2. Further, we note that these estimates tend to be considerably more accurate than those of other commonly used methods such as the quantile filter (Justusson 1981) and the \(p^2\)-algorithm (Jain and Chlamtac 1985). This is due to the fact that the quantile filter is not consistent, and that the \(p^2\)-algorithm is not robust with respect to outliers, thus losing a critical property of quantile estimators.

Pseudo-code for the SA-based method is given in Algorithm 1. Using a burn in period to stabilise the quantile estimates is recommended, as even the exact order statistics take some time to initially converge. SA-based methods can also be used to calculate other important statistics in an online fashion. For example, Sharia (2010) applied SA-techniques to learn auto-regressive parameters sequentially. Such estimators can be used to inflate the penalties used to account for deviations from the i.i.d. assumptions. This is discussed in more detail in Sect. 5.

Penalty selection

We now turn to the important question of penalty selection. In the offline setting, penalties are typically chosen to control false positives under the null hypothesis. For example, Fisch et al. (2022a) suggested using penalties

$$\begin{aligned} \beta _C(a,\lambda ) = 2\frac{a}{a-1} \left( 1 + \lambda + \sqrt{2\lambda } \right) ,\quad \beta _O(\lambda ) = 2\lambda , \end{aligned}$$
(3.2)

indexed by a single parameter, \(\lambda \), for CAPA when considering the change in mean and variance setting. Here, the penalty for collective anomalies, \(\beta _C\), depends on the length a of the putative collective anomaly. The motivation for these penalties is to ensure that the estimates for the number of collective anomalies and the set of point anomalies, \({\hat{K}}\) and \({\hat{O}}\), satisfy

$$\begin{aligned} \Pr ( {\hat{K}} = 0 , {\hat{O}} = \emptyset ) \ge 1 - C_1ne^{-\lambda } - C_2(ne^{-\lambda })^2, \end{aligned}$$
(3.3)

under the null hypothesis that no point or collective anomaly is present in the data. Consequently, setting \(\lambda = \log (n)\) asymptotically controls the number of false positives of a time series of length n.

In the online setting, however, the concept of the length of a time series does not exist. Consequently, fixed constants are used for the penalties instead. This means that, unless the errors are bounded, false positives will be observed eventually. In common with Lorden (1971) and Pollak (1985), we suggest choosing \(\lambda \) to be as small as possible, to maximise power against anomalies, whilst maintaining the average run length (ARL), the average time between false positives, at an acceptable level. Practical guidance on the choice of \(\lambda \) can be taken from Proposition 1, which provides an asymptotic result for the relationship between the log-ARL and \(\lambda \), under a certain model form. This relationship is empirically verified for other models using simulations in Sect. 5.

Sequential Collective And Point Anomaly  Given the above solutions to the three identified challenges, we are able to extend CAPA to an online setting. We call the resultant approach Sequential Collective And Point Anomaly (SCAPA). The basic steps of the algorithm are as follows: When an observation comes in, it is used to update the sequential estimates of the typical parameters. The observation is then standardised using the typical mean and variance \((\mu _0, \sigma ^2_0)\), before being passed to the finite horizon dynamic programme. Detailed pseudocode can be found in Algorithm 2 of the supplementary material

The sequential nature of SCAPA’s analysis is displayed in Fig. 3 across three plots, each representing the output of the analysis at different time points. Note how a collective anomaly is detected, initially, as a sequence of point anomalies until the number of observations equals the minimum segment length.

Fig. 3
figure 3

The evolution in the detection of a collective anomaly with a minimum segment length of \(l = 5\). The times shown are a) \(t = 100\) just prior to the anomalous observations, b) \(t = 104\) where the observations \(x_{101:104}\) have been labelled as point anomalies and c) \(t = 105\) where the observations \(x_{101:105}\) have been labelled as a collective anomaly

Theory

We now turn to consider the theoretical properties of SCAPA. In particular, we investigate the average run length (ARL) and the average detection delay (ADD). Here, the ARL corresponds to the expected number of baseline datapoints SCAPA processes before detecting a false positive. Conversely, the ADD corresponds to the expected number of observations between the onset of a collective anomaly and the time at which a collective anomaly is first detected. We will place a particular emphasis on the effects of the maximum segment length, m on the ADD, as the results following from that analysis provide practical guidance on how to choose m. Proofs of all propositions presented below may be found in the Appendix.

For simplicity of exposition, we will restrict our attention to the change in mean setting, in which the penalised cost is

$$\begin{aligned}&\sum _{t \notin \cup [s_i+1,e_i] \cup O }\left( \frac{x_t-{\mu }_0}{{\sigma }_0} \right) ^2 + \sum _{t \in O} \left[ 0 + \beta _O \right] \\&\qquad + \sum _{j=1}^{K} \left[ \sum _{t=s_j + 1}^{e_j} \left( \frac{ x_t - {\bar{x}}_{(s_j + 1):e_j}}{{\sigma }_0} \right) ^2 + \beta _{C} \right] \end{aligned}$$

Recalling the penalty selection approach outlined in Sect. 3.3. In this setting, the ARL of SCAPA can be related to the penalty constant, \(\lambda \), via the following result:

Proposition 1

Assume we observe a data sequence with typical mean, \(\mu _0\), and the typical variance, \(\sigma _0^2\), both known. Then the ARL of SCAPA on i.i.d. \(N(\mu _0,\sigma _0^2)\)-distributed observations \(x_1,x_2,\ldots \) then satisfies

$$\begin{aligned} \log (ARL) \sim \lambda /2 \end{aligned}$$

as \(\lambda \rightarrow \infty \).

As a consequence of the above, the probability of false alarm is proportional to \(\exp (-\lambda /2)\). As discussed in the previous section, this can be used to inform the choice of penalty in practice if an acceptable probability of false alarm is given. For comparison, we also provide additional simulation results for the log-average run length in a selection of Laplace and t-distributed settings (see Fig. 4).

Fig. 4
figure 4

Simulation results for the log-average run length in a selection of (a) t-distributed and (b) Laplace distributed noise settings for different values of \(\lambda \)

We now turn to investigate the effects of the maximum segment length, m, on the ADD. To simplify the exposition of these results, we assume that the collective anomaly begins at time \(\tau = 0\). Formally, consider the series

$$\begin{aligned} x_1, x_2, \ldots {\mathop {\sim }\limits ^{i.i.d.}} N(\mu ,1) \end{aligned}$$
(4.1)

and assume that the typical mean, \(\mu _0\), is equal to 0 and known. For a maximum segment length m, we then define \(ADD_m\) to be the ADD of SCAPA with a maximum segment length m. Additionally, we define \(ADD_\infty \) to be the ADD of SCAPA without maximum segment length. The following proposition shows that imposing a maximum segment length does not affect the ADD, provided that the maximum segment length increases at a rate faster than the penalty.

Proposition 2

Let \(x_1,x_2,.. \) follow the distribution specified in (4.1). Moreover, let the known baseline mean and variance be 0 and 1 respectively. Then, if \(m > \frac{\lambda }{\mu ^2}(1+\epsilon )\) for some \(\epsilon >0\),

$$\begin{aligned} ADD_m = ADD_\infty + o(1) \end{aligned}$$

as \(\lambda \rightarrow \infty \).

Given Proposition 2, it is natural to consider what happens in the converse setting. I.e. what happens if the maximum segment length increases at a slower rate than the penalty.

Proposition 3

Let \(x_1,x_2,.. \) follow the distribution specified in (4.1). Moreover, let the known typical mean and variance be 0 and 1 respectively. Then, if \(1 \le m < \lambda ^{1-\epsilon }\) for some \(\epsilon >0\)

$$\begin{aligned} \log (ADD_m) \sim \lambda /2 \end{aligned}$$

as \(\lambda \rightarrow \infty \).

In other words, the log-ADD has the same exponential rate as the log-ARL on non-anomalous data.

As previously discussed, limits on the number of possible interventions often determine a tolerable probability of false alarm in practice. Proposition 1 therefore provides a mechanism to determine a suitable penalty constant \(\lambda \). Further, Propositions 2 and 3 can be used to help inform an appropriate choice of maximum segment length, m. Specifically m should be at least of magnitude \(\frac{\lambda }{\mu ^2}\), where \(\mu \) is the smallest change in mean of interest to ensure power.

Simulation study

We now turn to examine the performance of SCAPA in various simulated settings. We start by considering the case where a single collective anomaly is present to evaluate SCAPA via its ARL and ADD performance in Sect. 5.1. The effect of auto-correlation is also examined. This is followed by a comparison with CAPA on time series containing multiple anomalies in Sect. 5.2.

Fig. 5
figure 5

The solid line shows the log-ARL for SCAPA as a function of \(\lambda \). The grey shaded region is a pointwise 95% bootstrapped confidence interval. Results shown from 500 replications

Fig. 6
figure 6

The lines show the ADD for SCAPA as a function of \(\lambda \) for different strengths of collective anomaly (\(\Delta = 0.05\), 0.1 and 0.2). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

A single anomaly

Prior to describing our first simulation scenario, we begin by noting that the ARL and ADD are functions of \(\beta _C\) and \(\beta _O\). Further, as we have seen in equation (3.2) these are a function of a single parameter \(\lambda \). The aim of our simulation study, therefore, is to inform the choice of \(\lambda \) that gives a suitable ARL/ADD trade off. In particular, ceteri paribus, a weaker change gives rise to a larger delay than a stronger change. In other words, we must control for the strength of change when investigating the ADD. To do so, we take the definition of signal strength from Fisch et al. (2022a).

Fig. 7
figure 7

The lines show the log-ARL for SCAPA as a function of \(\lambda \) where the simulated time series are AR(1) processes with differing lag-1 auto-correlation (\(\phi = 0\), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, \(\beta _C(\lambda )\) and \(\beta _O(\lambda )\) are the same as in the i.i.d. case (Fig. 5). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Fig. 8
figure 8

The lines show the ADD for SCAPA as a function of \(\lambda \) for different strengths of collective anomaly a) \(\Delta = 0.05\), b) \(\Delta = 0.1\) and c) \(\Delta = 0.2\). In each case the simulated residuals are AR(1) processes with differing lag-1 auto-correlation (\(\phi = 0\), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, \(\beta _C(\lambda )\) and \(\beta _O(\lambda )\) are the same as in the i.i.d. case (Fig. 6). The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

For a collective anomaly with mean \(\mu \) and variance \(\sigma ^2\) the strength, \(\Delta \), of a change is defined as

$$\begin{aligned}&\Delta = \log \left( 1 + \frac{1}{2}\Delta _\sigma ^2 + \frac{1}{4}\Delta _\mu ^2 \right) \;\;\;\;\;\; \;\;\;\;\;\; \Delta _\mu ^2 = \frac{ (\mu _0 - \mu )^2 }{\sigma _0\sigma } \\&\Delta _\sigma ^2 = \frac{\sigma _0}{\sigma } + \frac{\sigma }{\sigma _0} - 2. \end{aligned}$$

Here, \(\mu _0\) and \(\sigma _0\) are the parameters of the typical distribution, while \(\Delta _\mu \) and \(\Delta _\sigma \) denote the strengths of the change in mean and variance respectively.

To simplify the simulations, we assume that the standard deviation remains unaffected by collective anomalies, i.e. \(\sigma = \sigma _0\). Without loss of generality, we then set \(\mu _0=0\) and \(\sigma _0 = 1\). Consequently, the strength of the change only depends on the mean, \(\mu \), of the collective anomaly and is given by

$$\begin{aligned} \Delta = \log \left( 1 + \frac{\mu ^2}{4} \right) . \end{aligned}$$
(5.1)

We investigate a number of differing strengths \(\Delta = \{0.05,0.1,0.2\}\) corresponding to mean changes of \(\mu = \{0.45,0.65,0.94\}\).

In all the simulations reported below we set the minimum segment length to be \(l=2\), the maximum segment length to be \(m=1000\) and used a burn-in period of \(n_0= 1000\) time points. To estimate the ARL, data from the typical regime was simulated and SCAPA ran until the first anomaly was (erroneously) detected. To estimate the ADD, \(n_0\) observations were simulated from the typical regime followed by simulated observations from a distribution with an altered mean. We ran SCAPA on this data and calculated the detection delay as being the number of observations after \(n_0\) when the anomaly was detected.

Case 1: IID gaussian errors

For our initial simulations, we simulated from the assumed model with standard Gaussian errors. Figure 5 depicts the log-ARL over a range of values for the penalties (3.2) indexed by the parameter \(\lambda \) as in (3.2) along with a bootstrapped 95% confidence interval. Similarly, Fig. 6 shows the relationship between \(\lambda \) and the ADD over a range of different values for the mean change of the collective anomaly.

Note that the log-ARL increases linearly with \(\lambda \). This structure is reminiscent of the theoretical exponential relationship between \(\lambda \) and the ARL derived by Cao and Xie (2017), even though these results were derived for known pre and post change behaviour.

Fig. 9
figure 9

The lines show the log-ARL for SCAPA as a function of \(\lambda \) where the simulated time series are AR(1) processes with differing lag-1 auto-correlation (\(\phi = 0\), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, \(\beta _C(\lambda )\) and \(\beta _O(\lambda )\) are inflated by a function of \({\hat{\phi }}\), the estimate obtained from the burn-in period using a robust M-estimator. The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Fig. 10
figure 10

The lines show the ADD for SCAPA as a function of \(\lambda \) for different strengths of collective anomaly a) \(\Delta = 0.05\), b) \(\Delta = 0.1\) and c) \(\Delta = 0.2\). In each case the simulated residuals are AR(1) processes with differing lag-1 auto-correlation (\(\phi = 0\), 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.85). The two penalties, \(\beta _C(\lambda )\) and \(\beta _O(\lambda )\) are inflated by a function of \({\hat{\phi }}\), the estimate obtained from the burn-in period using a robust M-estimator. The grey shaded regions are pointwise 95% bootstrapped confidence intervals. Results shown from 500 replications

Similarly, the ADD, increases linearly in \(\lambda \), as can be seen in Fig. 6. This is consistent with Proposition 2.

Fig. 11
figure 11

ROC curves for CAPA, SCAPA, RPOP, and MOSUM from 100 replications. A red dot indicates the behaviour under default parameters. Graphic curtailed at maximum 20 false positives for ease of viewing

Fig. 12
figure 12

A comparison of a) CAPA to b) SCAPA on an example time series. Segments in red show inferred collective anomalies. Dashed lines below the x-axis show the position of the true collective anomalies in the data

Case 2: temporal dependence

Whilst the i.i.d. data setting is appealing theoretically, many observed time series are not independent (in time). Instead many data sequences display serial auto-correlation. To assess the robustness of SCAPA to temporal dependence we simulated an AR(1) error process as the typical distribution, \(x_t\), with standard normal errors \(\epsilon _t\),

$$\begin{aligned} x_t = \phi x_{t-1} + e_t. \end{aligned}$$

This process was simulated for a range of values of \(\phi \). As can be seen in Fig. 7, the presence of auto-correlation in the residuals leads to the spurious detection of collective anomalies at a higher rate than for independent residuals. This is due to the fact that the cost functions of Sect. 2 assumed i.i.d. data. However, Bardwell et al. (2019) gave some empirical evidence that changepoints could be recovered even when auto-correlation is present by applying a correction or inflation factor to the penalty. This factor is the sum of the auto-correlation function for the residuals from \(-\infty \) to \(\infty \). This is equal to the long run variance, \((1+\phi )/(1-\phi )\), for the AR(1) model. A similar correction exists for MA processes. We repeated the simulations using this correction.

The results in Fig. 9 show that the log-ARL of SCAPA with appropriately inflated penalties is almost identical to that of the i.i.d. case. On the other hand, the ADD now depends on the auto-correlation due to the inflated penalty (see Fig. 10).

Performing this correction requires knowledge of the AR(1) parameter \(\phi \). In these simulations, the estimate of \(\phi \), \({\hat{\phi }}\), was estimated from the burn-in period using a robust M-estimator for the lag-1 autocorrelation of Rocke (1996) from the R package robust (Wang et al. 2017).

Multiple anomalies

A natural comparison to make when investigating an online method is to compare its performance to its offline counterpart. We therefore compare SCAPA and CAPA for the detection of multiple anomalies using ROC curves in this section. In addition to this, we also compare SCAPA to a robust changepoint detection algorithm proposed by (Fearnhead and Rigaill 2019), which we refer to as RPOP. We also compare to the MOSUM implementation provided by (Meier et al. 2021), an online though not robust changepoint detection algorithm.

Fig. 13
figure 13

ROC curves over 100 replications for a) CUSUM and b) SCAPA. Point anomalies were generated from a t-distribution with varying degrees of freedom (\(\nu \) = 2, 5 or 10 respectively)

Fig. 14
figure 14

A comparison of a) CUSUM to b) SCAPA on an example time series. Segments in red show inferred collective anomalies. Dashed lines below the x-axis show the position of the true collective anomalies in the data

Fig. 15
figure 15

Machine temperature data

Fig. 16
figure 16

Machine temperature data. The burn-in period is shaded in blue. Anomalies detected by SCAPA and CAPA are shaded in red and green respectively, and brown when they overlap. Dashed vertical lines show the hand labelled anomalies given by an engineer working on the machine

To this end, we simulated time series with a total length of 10,000 observations with a number of point and collective anomalies. The length of stay for the typical state and for collective anomalies were sampled from a \(\text {NB}(5,0.01)\) distribution and a \(\text {NB}(5,0.03)\) distribution respectively. Observations in the typical state were sampled from an N(0, 1) distribution, while observations from the kth collective anomaly were sampled from an \(N(\mu _k,\sigma _k^2)\) distribution, where \(\mu _1,\ldots ,\mu _K \sim N(0,2^2)\) and \(\sigma _1,\ldots \sigma _K \sim \Gamma (1,1)\). Point anomalies occurred in the typical state independently with probability \(p = 0.01\) and were drawn from a t-distribution with 2 degrees of freedom.

The ROC curve resulting from this simulation can be found in Fig. 11, alongside an example time series in Fig. 12 shown segmented by both CAPA and SCAPA. As expected CAPA, which has access to the whole data, outperforms SCAPA. However, the gap in performance accuracy is small, in particular for typical choices of \(\lambda \) as described in Sect. 4.

CUSUM comparison

A natural comparison that can be made to assess SCAPA’s simulated performance is with the widely used online change point detection method CUSUM (Page 1954). Both methods use a test statistic based on the log-likelihood ratio and can be configured with known typical mean and variance. The difference between the two methods is that in SCAPA, collective and point anomalies are detected jointly with separate penalties whereas CUSUM is not designed to be robust to point anomalies. In our simulations, the CUSUM approach is implemented by setting the penalty for point anomalies in SCAPA to an arbitrarily large value (\(\beta _O = 10^{12}\)) so that no point anomalies are detected.

Table 1 Labelled anomalies from the NAB obtained from https://github.com/numenta/NAB/blob/master/labels/combined_windows.json along with the time it was detected (in bold)

The data was simulated in a similar way to that of Sect. 5.2 with the difference being that 20% of points in the typical state were point anomalies, simulated from a t-distribution with degree of freedom \(\nu \in \{2,5,10\}\). The ROC curve resulting from this simulation can be found in Fig. 13, alongside an example time series in Fig. 14 shown segmented by both methods. Between them, these plots highlight SCAPA’s ability to distinguish explicitly between point anomalies and collective anomalies, whereas an approach such as CUSUM suffers. Note in particular that, as expected, CUSUM gives a higher number of false positives than SCAPA when point anomalies are from distributions with heavier tails.

Machine temperature data

The Numenta Anomaly Benchmark (NAB) (Lavin and Ahmad 2015; Ahmad et al. 2017) provides a number of data sets that can be used to compare different anomaly detection approaches. The data can be obtained from https://github.com/numenta/NAB.

One example consists of heat sensor data from an internal component of a large industrial machine. The data is displayed in Fig. 15. There are \(n = 22,695\) observations spanning 2nd December 2013 - 19th February 2014 sampled every five minutes.

Lavin and Ahmad (2015) use an initial, or burn-in, period to allow their algorithms to learn about the data. In line with their approach, we set the burn in period to be the first 15% of the data (2nd December 2013 until the 14th December 2013, as shown by the blue shaded area in Fig. 16). We used the burn in to obtain a robust M-estimator for the lag-1 autocorrelation of the observations after standardisation by the sequential mean and variance estimate. Using the robust M estimator of Rocke (1996) from the R package robust (Wang et al. 2017), we obtained an autocorrelation estimate \({\hat{\phi }}=0.974\). In line with the approach taken in Sect. 5.1.2, we therefore set the penalties to:

$$\begin{aligned} \beta _C = 2 \times \frac{1 + {\hat{\phi }} }{1 - {\hat{\phi }} } \times \log (n),\quad \beta _O = 2 \times \frac{1 + {\hat{\phi }} }{1 - {\hat{\phi }} } \times \log (n). \end{aligned}$$

Figure 16 shows the three anomalies SCAPA detected shaded in red. These corresponded to a set of hand labelled anomalous regions given by an engineer working on the machine shown by the dashed vertical lines. The positions of these are given in Table 1. It should be noted that the data labels in the NAB consist of anomalous periods, rather than points. However, all approaches applied to the data as part of the NAB only return points of anomalous behaviour, highlighting SCAPA’s potential to provide new insights into the data.

Table 2 Collective anomalies found using CAPA

The detection of the more subtle second anomaly in a timely fashion is important as this was claimed in the NAB literature to be the cause of the catastrophic system failure (third anomaly). We can see that the time at which SCAPA first detected it in Table 1. If users of the system deemed this to be too long of a delay the penalties used above could be decreased, however, as noted elsewhere in this paper this would increase the frequency of false alarms.