Adaptive Bernstein Change Detector for High-Dimensional Data Streams

Change detection is of fundamental importance when analyzing data streams. Detecting changes both quickly and accurately enables monitoring and prediction systems to react, e.g., by issuing an alarm or by updating a learning algorithm. However, detecting changes is challenging when observations are high-dimensional. In high-dimensional data, change detectors should not only be able to identify when changes happen, but also in which subspace they occur. Ideally, one should also quantify how severe they are. Our approach, ABCD, has these properties. ABCD learns an encoder-decoder model and monitors its accuracy over a window of adaptive size. ABCD derives a change score based on Bernstein’s inequality to detect deviations in terms of accuracy, which indicate changes. Our experiments demonstrate that ABCD outperforms its best competitor by at least 8% and up to 23% in F1-score on average. It can also accurately estimate changes’ subspace, together with a severity measure that correlates with the ground truth.


Introduction
Data streams are open-ended, ever-evolving sequences of observations from some process. They pose unique challenges for analysis and decision-making. One crucial task is to detect changes, i.e., shifts in the observed data, that may indicate a change in the underlying process. Change detection has been an active research area. However, the high-dimensional setting, in which observations contain a large number of simultaneously measured quantities, has not received enough attention. Yet, it may yield useful insights in environmental monitoring (de Jong and Bosman, 2019), human activity recognition (Vrigkas et al, 2015), network traffic monitoring (Naseer et al, 2020), automotive (Liu et al, 2019), predictive maintenance (Zhao et al, 2018), and biochemical engineering (Mowbray et al, 2021).
Example (Biofuel production). The production of fuel from biomass is a complex process comprising many interdependent process steps. Those include pyrolysis, synthesis, distillation, and separation. Many steps rely on (by-)products of other steps as reactants, leading to a highly interconnected system with many process parameters. A monitoring system tracks the process parameters to detect failures in the plant: (i) The system must detect changes in a large (i.e., high-dimensional) vector of process parameters, which may indicate failures. (ii) The system must find out which process parameters are affected by the change to allow for a targeted reaction. Since the system is very complex and has many interconnected components, change is often evident only when considering correlations between process parameters. An example would be the correlation between temperature and concentration fluctuations. So it is insufficient to monitor each process parameter in isolation. (iii) There can exist slight changes which only require minor adjustments and more severe ones that require immediate intervention to avoid a shutdown of the plant. The monitoring system should provide an estimate of the severity of change.
The example illustrates three requirements for modern change detectors:
• R1: Change point. The primary task of change detectors is to identify that the data stream has changed and when the change occurred.
• R2: Change subspace. A change may only concern a subset of dimensions (the change subspace). Change detectors for high-dimensional data streams should be able to identify such subspaces.
• R3: Change severity. Quantifying relative change severity to distinguish between changes of different importance is essential to react appropriately.
Prior works already acknowledge the relevance of the above requirements (Lu et al, 2019; Webb et al, 2018). However, fulfilling R1-R3 in combination remains challenging since they depend on each other: on the one hand, detecting changes in high-dimensional data is difficult because changes typically only affect few dimensions. Unaffected dimensions "dilute" a change (i.e., a change occurring in a subspace appears to be less severe in the full space). This might make changes harder to detect in all dimensions. On the other hand, detecting the change subspace should occur after detecting a change, since monitoring all possible subspaces is intractable. Last, one should restrict computation of change severity to the change subspace to eliminate dilution.
Existing methods for change detection, summarized in Table 1, are either univariate (UV), multivariate (MV), or specifically designed for high-dimensional data (HD); the latter claim efficiency w.r.t. high dimensionality or resilience against the "curse of dimensionality". However, they do not fulfill R1-R3 in combination sufficiently well, as Section 2 describes.
Thus, we propose the Adaptive Bernstein Change Detector (ABCD), which addresses R1-R3 in combination. We articulate our contributions as follows:
(i) Problem Definition: We formalize the problem of detecting changes in high-dimensional data streams such that R1-R3 can be tackled in combination.
(ii) Adaptive Bernstein Change Detector: We present ABCD, a change detector for high-dimensional data that satisfies R1-R3. It monitors the loss of an encoder-decoder model using an adaptive window size and statistical testing. Adaptive windows enable ABCD to detect severe changes quickly and, over a longer period, identify hard-to-detect changes that would typically require a large window size.
(iii) Bernstein change score: Our approach applies a statistical test based on Bernstein's inequality. This limits the probability of false alarms.
(iv) Online computation: We propose an efficient method for computing the change score in adaptive windows and discuss design choices leading to constant time and memory.
(v) Benchmarking: We conduct experiments on 10 data streams based on real-world and synthetic data with many dimensions and compare ABCD with recent approaches. The results indicate that ABCD outperforms its competitors consistently w.r.t. R1-R3, is robust to high-dimensional data, and is useful in domains including human activity recognition, gas detection, and image processing. We also study ABCD's parameter sensitivity. Our code follows the popular Scikit-Multiflow API (Montiel et al, 2018), so it is easy to use in future research.

Change detector types
Most existing change detectors are supervised, i.e., they focus on detecting changes in the relationship between input data and a target variable (Iwashita and Papa, 2019). However, class labels are rarely available in reality, which limits their applicability. In contrast, unsupervised change detectors aim to detect changes only in the input data. Our approach belongs to this category, so we restrict our review to unsupervised approaches.
Most existing approaches detect changes whenever a measure of discrepancy between newer observations (the current window) and older observations (the reference window) exceeds a threshold. Some approaches, e.g., D3 (Gözüaçık et al, 2019) or PCA-CD (Qahtan et al, 2015), implement the reference and current window as two contiguous sliding windows. Other approaches, such as IBDD (de Souza et al, 2020), IKS (dos Reis et al, 2016), or WATCH (Faber et al, 2021), use a fixed reference window. A major problem is choosing the appropriate window size; thus, Bifet and Gavaldà (2007) propose windows of adaptive size that grow while the stream remains unchanged and shrink otherwise. Several works leverage this principle, e.g., (Sun et al, 2016; Khamassi et al, 2015; Fouché et al, 2019; Suryawanshi et al, 2022).
We also use adaptive windows to lower the number of parameters of ABCD.

Multivariate change detection
To detect changes in multivariate (MV) data, some approaches apply univariate algorithms in each dimension of the stream. Faithfull et al (2019) propose to use one ADWIN detector per dimension (with k dimensions). They declare a change whenever a certain fraction of the detectors agree. We call this approach AdwinK later on. Similarly, IKS (dos Reis et al, 2016) uses an incremental variant of the Kolmogorov-Smirnov test deployed in each dimension. Unlike AdwinK, IKS issues an alarm if at least one dimension changes.
There also exist approaches specifically designed for multivariate (Jaworski et al, 2020; Ceci et al, 2020; Qahtan et al, 2015; Gözüaçık et al, 2019; Dasu et al, 2006), or even high-dimensional (HD) data (Faber et al, 2021; de Souza et al, 2020). Similar to ABCD, Jaworski et al (2020) and Ceci et al (2020) use dimensionality-reduction methods to capture the relationships between dimensions. However, our approach is computationally more efficient, limits the probability of false alarms, identifies the change subspace, and estimates change severity. D3 (Gözüaçık et al, 2019) uses the AUC-ROC score of a discriminative classifier that tries to distinguish the data in two sliding windows. It reports a change if the AUC-ROC score exceeds a pre-defined threshold. PCA-CD (Qahtan et al, 2015) first maps observations in two windows to fewer dimensions using PCA. Then the approach estimates the KL-divergence between both windows for each principal component. PCA-CD detects a change if the maximum observed KL-divergence exceeds a threshold. However, Goldenberg and Webb (2019) point out that this technique is limited to linear transformations and ignores combined change in multiple dimensions. LDD-DSDA (Liu et al, 2017) measures the degree of local drift that describes regional density changes in the input data. The approach proposed by Dasu et al (2006) structures observations from two windows (sliding or fixed) in a kdq-tree. For each node, they measure the KL-divergence between observations from both windows. However, Qahtan et al (2015) show experimentally that this approach is not suitable for high-dimensional data.
IBDD (de Souza et al, 2020) and WATCH (Faber et al, 2021) specifically address challenges arising from high-dimensional data. The former monitors the mean squared deviation between two equally sized windows. The latter monitors the Wasserstein distance between a reference and a sliding window. However, neither can detect change subspaces or measure severity.

Offline change point detection
Offline change point detection, also known as signal segmentation, divides time series of a given length into K homogeneous segments (Truong et al, 2020). Many of the respective algorithms are not suitable for data streams: some require specifying K a priori (Bai and Perron, 2003; Harchaoui and Cappe, 2007; Lung-Yut-Fong et al, 2015); others (Killick et al, 2012; Lajugie et al, 2014; Matteson and James, 2014; Chakar et al, 2017; Garreau and Arlot, 2018) scale superlinearly with time. WATCH (Faber et al, 2021), discussed above, is the state-of-the-art extension of offline change point detection to data streams.

Change subspace
The notion of a change subspace is different from the existing notion of change region (Lu et al, 2019). The former describes a subset of dimensions that changed; the latter identifies density changes in some local region, e.g., a hyperrectangle or cluster (Liu et al, 2017). Our definition of change subspaces is related to marginal change magnitude (Webb et al, 2018), but is more general since it can also accommodate changes in a subspace's joint distribution.
Because high-dimensional spaces are typically sparse (due to the curse of dimensionality), identifying density changes in them is not effective. On the other hand, knowing that a change affected a specific set of dimensions can help identify the cause of the change, as we have motivated in our introductory example. Thus, we focus on detecting change subspaces in this work.
In the domain of statistical process control, some approaches extend well-known methods, such as Cusum (Page, 1954) or Shewhart charts (Shewhart, 1930), to multiple dimensions. They address the problem of identifying change subspaces to some extent. However, they often make unrealistic assumptions: they focus on Gaussian or sub-Gaussian data (Chaudhuri et al, 2021; Xie et al, 2020), require that different dimensions are initially independent (Chaudhuri et al, 2021), require subspace changes to be of low rank (Xie et al, 2020), or assume that the size of the change subspace is known a priori (Jiao et al, 2018).
Of the approaches reviewed in Section 2.3, only AdwinK and IKS identify the corresponding change subspace. However, neither approach finds changes that hide in subspaces, e.g., correlation changes, because they monitor each dimension in isolation. In contrast, our approach aims to learn the relationships between different dimensions so that it can detect such changes. In addition, AdwinK cannot identify subspaces with fewer than k dimensions.

Change severity
According to Lu et al (2019), change severity is a positive measure of the discrepancy between the data observed before and after the change. One can either measure the divergence between distributions directly, as done by kdq-Tree (Dasu et al, 2006), LDD-DSDA (Liu et al, 2017), and WATCH (Faber et al, 2021), or indirectly with a score that correlates with change severity, as done by D3 (Gözüaçık et al, 2019). Following this reasoning, an approach that satisfies R3 should compute a score that depends on the change severity (Gözüaçık et al, 2019; Dasu et al, 2006; de Souza et al, 2020; Qahtan et al, 2015; Faber et al, 2021), i.e., the higher the score, the higher the severity. Finally, hypothesis-testing-based approaches, such as ADWIN (Bifet and Gavaldà, 2007), SeqDrift2 (Pears et al, 2014), AdwinK (Faithfull et al, 2019), or IKS (dos Reis et al, 2016), do not quantify change severity: a slight change observed over a longer time can lead to the same p-value as a severe change observed over a shorter time, hence the p-value is not informative about change severity.

Pattern based change detection
A related line of research, pattern-based change detection, deals with identifying changes in temporal graphs (Loglisci et al, 2018; Impedovo et al, 2019, 2020a,b). In particular, Loglisci et al (2018) detect changes in the graph, identify the affected subgraphs, and quantify the amount of change for these subgraphs. This is similar to our methodology. However, these methods target graph data, whereas we deal with vector data. To apply these methods in our context, one would need to create a graph, e.g., by representing each dimension as a node and indicating pairwise correlations with edges. However, constructing such a graph becomes impractical for high-dimensional observations because of the exponentially growing number of subspaces.

Competitors
In our experiments, we compare to AdwinK, IKS, D3, IBDD, and WATCH. IBDD, WATCH, and D3 are recent change detectors for multivariate and high-dimensional data that fulfill R3. AdwinK extends the ADWIN algorithm to the multivariate case and fulfills R2. Finally, IKS is the only approach employing a non-parametric two-sample test for change detection while also satisfying R2.

Preliminaries
We are interested in finding changes in the last $t$ observations $S = (x_1, x_2, \dots, x_t)$ from a stream of data. Each $x_i$ is a $d$-dimensional vector independently drawn from an (unknown) distribution $F_i$. We assume without loss of generality that each vector coordinate is bounded in $[0, 1]$, i.e., $x_i \in [0, 1]^d$.
Definition 1 (Change). A change occurs at time point $t^*$ if the data-generating distribution changes after $t^*$: $F_{t^*} \neq F_{t^*+1}$.
In high-dimensional data, changes typically affect only a subset of dimensions, which we call the change subspace. Let $D = \{1, 2, \dots, d\}$ be the set of dimensions and $F_i^{D'}$ be the joint distribution of $F_i$ observed in the subspace $D' \subseteq D$ at time step $i$. We define the change subspace as follows:
Definition 2 (Change subspace). The change subspace $D^*$ at time $t^*$ is the union of all $D' \subseteq D$ in which the joint distribution $F^{D'}$ changed and which do not contain a subspace $D''$ for which $F_{t^*}^{D''} \neq F_{t^*+1}^{D''}$:
$$D^* = \bigcup \left\{ D' \subseteq D \;:\; F_{t^*}^{D'} \neq F_{t^*+1}^{D'} \;\wedge\; \nexists\, D'' \subset D' : F_{t^*}^{D''} \neq F_{t^*+1}^{D''} \right\}$$
If the dimensions in $D^*$ are uncorrelated, then changes will be visible on the marginal distributions, i.e., all $D'$ are of size 1. However, changes may only be detectable w.r.t. the joint distribution of $D^*$ or the union of its subspaces of size greater than 1, which our definition accommodates. Note that the definition can also handle multiple co-occurring changes and considers them as one single change. Last, change severity measures the difference between $F_{t^*}^{D^*}$ and $F_{t^*+1}^{D^*}$:
Definition 3 (Change severity). The severity of a change is a positive function $\Delta$ of the mismatch between $F_{t^*}^{D^*}$ and $F_{t^*+1}^{D^*}$.
Since we do not know the true distributions $F_{t^*}$ and $F_{t^*+1}$, the best we can do is to detect changes and their characteristics based on the observed data.

Principle of ABCD
Direct comparison of high-dimensional distributions is impractical as it requires many samples (Gretton et al, 2012). Yet the number of variables required to describe such data with high accuracy is often much smaller than d (Lee and Verleysen, 2007). Dimensionality reduction techniques let us encode observations in fewer dimensions. The more information encodings retain, the better one can reconstruct (decode) the original data. However, if the distribution changes, the reconstruction will degrade and produce higher errors.
We leverage this principle in ABCD by monitoring the reconstruction loss of an encoder-decoder model $\psi \circ \phi$ for some encoder function $\phi$ and decoder function $\psi$. Figure 1 illustrates this. Specifically, we first learn $\phi : [0, 1]^d \to [0, 1]^{d'}$ (with $d' < d$), mapping the data to fewer dimensions, and $\psi : [0, 1]^{d'} \to [0, 1]^d$. Then, we monitor the loss $L_t$ between each $x_t$ and its reconstruction $\hat{x}_t = (\psi \circ \phi)(x_t)$. We hypothesize that distribution changes lead to outdated encoder-decoder models; see, for example, (Jaworski et al, 2020) for empirical evidence. Hence, we assume that changes in the reconstruction affect the mean $\mu_{t^*+1}$ of the loss, because the model can no longer accurately reconstruct the input:
$$F_{t^*} \neq F_{t^*+1} \;\Rightarrow\; \mu_{t^*} \neq \mu_{t^*+1} \tag{2}$$
We can now replace the definition of change in high-dimensional data with an easier-to-evaluate, univariate proxy:
$$\mu_{t^*} \neq \mu_{t^*+1} \tag{3}$$
It allows detecting arbitrary changes in the original (high-dimensional) distribution as long as they affect the average reconstruction loss of the encoder-decoder. Since the true $\mu_{t^*}$ and $\mu_{t^*+1}$ are unknown, we estimate them from the stream:
$$\hat{\mu}_{1,t^*} = \frac{1}{t^*} \sum_{i=1}^{t^*} L_i, \qquad \hat{\mu}_{t^*+1,t} = \frac{1}{t - t^*} \sum_{i=t^*+1}^{t} L_i \tag{4}$$
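As a minimal sketch of this proxy, the snippet below computes a per-instance loss and the two window means on either side of a candidate change point. It assumes the loss is the mean squared error over all dimensions, which matches the per-dimension squared error used later for subspace detection; the function names and the toy numbers are illustrative and not taken from ABCD's code base.

```python
def reconstruction_loss(x, x_hat):
    """Mean squared error between an instance and its reconstruction (assumed loss)."""
    if len(x) != len(x_hat):
        raise ValueError("instance and reconstruction must have equal length")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def window_means(losses, t_star):
    """Average loss over L_1..L_{t*} and over L_{t*+1}..L_t."""
    before, after = losses[:t_star], losses[t_star:]
    if not before or not after:
        raise ValueError("t_star must leave both windows non-empty")
    return sum(before) / len(before), sum(after) / len(after)


# Toy loss sequence: the average loss doubles after the third observation.
losses = [0.010, 0.012, 0.011, 0.021, 0.023, 0.022]
mu_before, mu_after = window_means(losses, t_star=3)
```

A clear gap between `mu_before` and `mu_after` is what the statistical test in the next section evaluates.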

Detecting the change point
ABCD detects a change at $t^*$ if $\hat{\mu}_{1,t^*}$ differs significantly from $\hat{\mu}_{t^*+1,t}$. To quantify this, we derive a test based on Bernstein's inequality (Bernstein, 1924). It is often tighter than more general alternatives like Hoeffding's inequality (Boucheron et al, 2013). Let $\hat{\mu}_1, \hat{\mu}_2$ be the averages of two independent samples from two univariate random variables. One wants to evaluate if both random variables have the same expected values: the null hypothesis $H_0$ is $\mu_1 = \mu_2$. Based on the two samples, one rejects $H_0$ if $\Pr(|\hat{\mu}_1 - \hat{\mu}_2| \geq \epsilon) \leq \delta$, where $\delta$ is a preset significance level. The following theorem allows evaluating Equation (3) based on Bernstein's inequality.
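Theorem 1 (Bound on $\Pr(|\hat{\mu}_1 - \hat{\mu}_2| \geq \epsilon)$). Given two independent samples $X_1, X_2$ of size $n_1$ and $n_2$ from two random variables with unknown expected values $\mu_1, \mu_2$ and variances $\sigma_1^2, \sigma_2^2$. Let $\hat{\mu}_1, \hat{\mu}_2$ denote the sample means and let $|x - \mu_i| \leq M$ for all $x \in X_i$, $i \in \{1, 2\}$. Assuming $\mu_1 = \mu_2$, we have, for all $\kappa \in [0, 1]$:
$$\Pr(|\hat{\mu}_1 - \hat{\mu}_2| \geq \epsilon) \leq 2 \exp\left\{ -\frac{n_1 (\kappa\epsilon)^2}{2\left(\sigma_1^2 + \frac{1}{3}\kappa M \epsilon\right)} \right\} + 2 \exp\left\{ -\frac{n_2 ((1-\kappa)\epsilon)^2}{2\left(\sigma_2^2 + \frac{1}{3}(1-\kappa) M \epsilon\right)} \right\} \tag{5}$$
Proof. We follow the same steps as in (Bifet and Gavaldà, 2007; Pears et al, 2014). Recall Bernstein's inequality: Let $x_1, \dots, x_n$ be independent random variables with sample mean $\hat{\mu} = \frac{1}{n} \sum_i x_i$, expected value $\mu$, and variance $\sigma^2$ s.th. $\forall x_i : |x_i - \mu| \leq M$. Then, for all $\epsilon > 0$,
$$\Pr(|\hat{\mu} - \mu| \geq \epsilon) \leq 2 \exp\left\{ -\frac{n \epsilon^2}{2\left(\sigma^2 + \frac{1}{3} M \epsilon\right)} \right\}$$
We apply the union bound to $\Pr(|\hat{\mu}_1 - \hat{\mu}_2| \geq \epsilon)$. For all $\kappa \in [0, 1]$, with $\mu = \mu_1 = \mu_2$, we have:
$$\Pr(|\hat{\mu}_1 - \hat{\mu}_2| \geq \epsilon) \leq \Pr(|\hat{\mu}_1 - \mu| \geq \kappa\epsilon) + \Pr(|\hat{\mu}_2 - \mu| \geq (1-\kappa)\epsilon)$$
Substituting above with Bernstein's inequality completes the proof. □
With regard to change detection, one can use Equation (5) to evaluate for a time point $k$ if a change occurred. The question is, however, how to choose $\epsilon$ to limit the probability of false alarm at any time $t$ to a maximum $\delta$.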
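The bound of Theorem 1 can be evaluated directly. The sketch below assumes the usual two-term form that results from applying Bernstein's inequality to each sample after the union bound; the function name, the default `kappa=0.5`, and the handling of a zero deviation are illustrative choices rather than part of the paper.

```python
import math


def bernstein_two_sample_bound(eps, n1, n2, var1, var2, M=1.0, kappa=0.5):
    """Upper bound on Pr(|mean1 - mean2| >= eps) under H0 (equal expected values).

    eps is split into kappa*eps and (1-kappa)*eps (union bound) and Bernstein's
    inequality is applied to each sample. The result lies in (0, 4].
    """
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")

    def one_sample_term(n, var, e):
        if e == 0.0:  # zero deviation: exp(0) = 1, so the term equals 2
            return 2.0
        return 2.0 * math.exp(-n * e * e / (2.0 * (var + M * e / 3.0)))

    return (one_sample_term(n1, var1, kappa * eps)
            + one_sample_term(n2, var2, (1.0 - kappa) * eps))


# The bound shrinks as the windows grow while the observed gap stays the same.
small = bernstein_two_sample_bound(0.1, 10, 10, 0.01, 0.01)
large = bernstein_two_sample_bound(0.1, 1000, 1000, 0.01, 0.01)
```

Because each term is at most 2, the value is an upper bound on a probability rather than a probability itself and can exceed 1.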
Our approach is to set $\epsilon$ to the observed $|\hat{\mu}_{1,k} - \hat{\mu}_{k+1,t}|$ and to set $n_1 = k$, $n_2 = t - k$. The result bounds the probability of observing $|\hat{\mu}_{1,k} - \hat{\mu}_{k+1,t}|$ between two independent samples of sizes $k$ and $t - k$ under $H_0$. If this probability is very low, the distributions must have changed at $k$. Then, we search for changes at multiple time points $k$ in the current window. Hence, we obtain multiple such probability estimates; our change score is their minimum: $p = \min_k p_k$, where $p_k$ denotes the right-hand side of Equation (5) evaluated for the split at $k$. The corresponding change point $t^* = \arg\min_k p_k$ splits $(L_1, L_2, \dots, L_t)$ into the two subwindows with the statistically most different mean.
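The search over split points can be sketched as follows. This is a deliberately naive version that recomputes means and variances for every split; it assumes the two-term Bernstein bound of Theorem 1, a fixed `kappa=0.5`, and a hypothetical `min_size` guard, none of which are prescribed by the text above.

```python
import math


def _bound(eps, n1, n2, var1, var2, M, kappa):
    """Two-sample Bernstein bound (sum of two one-sample terms)."""
    def term(n, var, e):
        return 2.0 if e == 0.0 else 2.0 * math.exp(-n * e * e / (2.0 * (var + M * e / 3.0)))
    return term(n1, var1, kappa * eps) + term(n2, var2, (1.0 - kappa) * eps)


def _mean_var(values):
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def change_score(losses, M=1.0, kappa=0.5, min_size=2):
    """Return (p, t_star): the smallest bound over all splits and the split achieving it."""
    best_p, best_k = float("inf"), None
    for k in range(min_size, len(losses) - min_size + 1):
        m1, v1 = _mean_var(losses[:k])
        m2, v2 = _mean_var(losses[k:])
        p_k = _bound(abs(m1 - m2), k, len(losses) - k, v1, v2, M, kappa)
        if p_k < best_p:
            best_p, best_k = p_k, k
    return best_p, best_k


# Toy loss sequence whose mean jumps from 0.01 to 0.2 after 200 observations.
losses = [0.01] * 200 + [0.2] * 200
p, t_star = change_score(losses)
change_detected = p < 0.05  # delta = 0.05 is an illustrative significance level
```

On this toy sequence the minimum is attained exactly at the true change point, since any other split mixes both regimes in one window and reduces the observed gap.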

Choice of parameter κ
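The bound in Equation (5) holds for any $\kappa \in [0, 1]$. A good choice, however, provides a tighter estimate, resulting in faster change detection for a given rate of allowed false alarms $\delta$. Bifet and Gavaldà (2007) suggest choosing $\kappa$ s.th. $\Pr(|\hat{\mu}_1 - \mu| \geq \kappa\epsilon) = \Pr(|\hat{\mu}_2 - \mu| \geq (1-\kappa)\epsilon)$, which approximately minimizes the upper bound. Substituting both sides with Bernstein's inequality, we get
$$\frac{n_1 (\kappa\epsilon)^2}{\sigma_1^2 + \frac{1}{3}\kappa M \epsilon} = \frac{n_2 ((1-\kappa)\epsilon)^2}{\sigma_2^2 + \frac{1}{3}(1-\kappa) M \epsilon}$$
Setting $n_1 = r n_2$ and simplifying, we have
$$r \kappa^2 \left(\sigma_2^2 + \tfrac{1}{3}(1-\kappa) M \epsilon\right) = (1-\kappa)^2 \left(\sigma_1^2 + \tfrac{1}{3}\kappa M \epsilon\right)$$
To solve for $\kappa$, note that $|\hat{\mu}_{1,k} - \hat{\mu}_{k+1,t}| \approx 0$ for large enough $k$ and $t - k$ while there is no change. This leads to a change score $p \gg \delta$ for any choice of $\kappa$. Hence, choosing $\kappa$ optimally is irrelevant while there is no change.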
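To get a feeling for how κ affects the tightness of the bound, one can minimise the bound numerically. The sketch below is a plain grid search and only a stand-in for a closed-form choice of κ; it assumes the two-term Bernstein bound of Theorem 1, and the function names and grid resolution are illustrative.

```python
import math


def bound(eps, n1, n2, var1, var2, M, kappa):
    """Two-sample Bernstein bound (sum of two one-sample terms)."""
    def term(n, var, e):
        return 2.0 if e == 0.0 else 2.0 * math.exp(-n * e * e / (2.0 * (var + M * e / 3.0)))
    return term(n1, var1, kappa * eps) + term(n2, var2, (1.0 - kappa) * eps)


def best_kappa(eps, n1, n2, var1, var2, M=1.0, steps=999):
    """Grid search for the kappa in (0, 1) that minimises the bound."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda kappa: bound(eps, n1, n2, var1, var2, M, kappa))


# Equal sizes and variances make the problem symmetric, so kappa = 0.5 is optimal.
kappa_sym = best_kappa(eps=0.1, n1=500, n2=500, var1=0.01, var2=0.01)
# A much smaller first window calls for assigning more of eps to that window.
kappa_small_n1 = best_kappa(eps=0.1, n1=50, n2=5000, var1=0.01, var2=0.01)
```

The second case illustrates why a fixed κ can be loose when the two subwindows differ strongly in size.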

Minimum sample sizes and outlier sensitivity
This section investigates the conditions under which ABCD detects changes. We derive a minimum size of the first window above which ABCD detects a change. It is based on the fact that the number of observations before an evaluated time point $k$ remains fixed while the number of observations after $k$ grows with $t$. Those counts are $n_1 = k$ and $n_2 = t - k$ in Equation (5). Also, since we consider bounded random variables, their variance is bounded as well. Hence, the second term in Equation (5) approaches 0 for any $\epsilon > 0$ and any fixed $\kappa < 1$. With this, solving Equation (5) for $n_1$ with $\kappa \to 1$ yields:
$$n_1 \geq \frac{2}{\epsilon^2} \left(\sigma_1^2 + \tfrac{1}{3} M \epsilon\right) \ln\frac{2}{\delta}$$
By setting $\epsilon = |\hat{\mu}_1 - \hat{\mu}_2|$ we see that the required size of the first window decreases the larger the change in the average reconstruction error. For example, with $M = 1$, $\epsilon = \sigma_1 = 0.1$, and $\delta = 0.05$ our approach requires $n_1 \geq 32$.
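The worked example can be checked numerically. The sketch below assumes the one-sample Bernstein term with all of ε assigned to the first window (κ close to 1), which reproduces the stated value of 32; the function name is illustrative.

```python
import math


def min_first_window(eps, var1, M=1.0, delta=0.05):
    """Smallest n1 with 2 * exp(-n1 * eps^2 / (2 * (var1 + M * eps / 3))) <= delta."""
    return math.ceil(2.0 * (var1 + M * eps / 3.0) / eps ** 2 * math.log(2.0 / delta))


# Values from the example: M = 1, eps = sigma_1 = 0.1, delta = 0.05.
n1 = min_first_window(eps=0.1, var1=0.1 ** 2, M=1.0, delta=0.05)
# A larger observed change needs fewer observations in the first window.
n1_large_change = min_first_window(eps=0.2, var1=0.1 ** 2, M=1.0, delta=0.05)
```

The unrounded value for the example is about 31.97, hence the requirement of at least 32 observations.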
Since ABCD detects changes in the average reconstruction loss of a bounded vector, it is stable with respect to outliers as long as they are reasonably rare. To see this, assume w.l.o.g. that window 1 contains $n_{out}$ outliers and that $\epsilon > 0$. One can show that the average of the outliers, $\hat{\mu}_{out}$, must exceed the average of the remaining inliers, $\hat{\mu}_{in}$, by $n_1 \epsilon / n_{out}$. In the example above, a single outlier would thus have to exceed $\hat{\mu}_{in}$ by $n_1 \epsilon = 3.2$. This, however, is impossible because $M = 1$ bounds the reconstruction loss.
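One way to see where the factor $n_1 \epsilon / n_{out}$ comes from is the following short derivation. It assumes that, without the outliers, both windows would have the same average $\hat{\mu}_{in}$, so that the outliers alone must shift the mean of window 1 by at least $\epsilon$.

```latex
\begin{align*}
\hat{\mu}_1
  &= \frac{n_{out}\,\hat{\mu}_{out} + (n_1 - n_{out})\,\hat{\mu}_{in}}{n_1}
   = \hat{\mu}_{in} + \frac{n_{out}}{n_1}\left(\hat{\mu}_{out} - \hat{\mu}_{in}\right), \\
\hat{\mu}_1 - \hat{\mu}_{in} \geq \epsilon
  &\iff \hat{\mu}_{out} - \hat{\mu}_{in} \geq \frac{n_1\,\epsilon}{n_{out}}.
\end{align*}
```

With $n_1 = 32$, $\epsilon = 0.1$, and $n_{out} = 1$, the right-hand side equals 3.2, which exceeds the bound $M = 1$.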

Detecting the change subspace
After detecting a change, we identify the change subspace. Restricting the encoding size to d′ < d forces the model to learn relationships between different input dimensions. As a result, the loss observed for dimension j contains not only information about the change in that dimension (i.e., the marginal distribution in j changes), but also about correlations influencing dimension j. Hence, we can detect changes in the marginal and joint distributions by evaluating in which dimensions the loss changed the most.
Algorithm 1 describes how we identify change subspaces. For each dimension $j$, we compute the average reconstruction loss (the squared error in dimension $j$) before and after $t^*$, denoted $\hat{\mu}^j_{1,t^*}$, $\hat{\mu}^j_{t^*+1,t}$ (lines 5 and 6), and the standard deviation $\sigma^j_{1,t^*}$, $\sigma^j_{t^*+1,t}$ (lines 6 and 7). We then evaluate Equation (5), returning an upper bound on the p-value in the range $(0, 4]$ for dimension $j$ (line 9). If $p_j < \tau \in [0, 4]$, an external parameter for which we give a recommendation later on, we add $j$ to the change subspace (lines 10 and 11).
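The per-dimension test can be sketched as below. This is an illustrative reading of the description above, not a transcription of Algorithm 1: it assumes the two-term Bernstein bound with a fixed `kappa=0.5`, uses variances (squared standard deviations) in the bound, and takes the per-dimension squared errors as a list of rows.

```python
import math


def _bound(eps, n1, n2, var1, var2, M, kappa=0.5):
    """Two-sample Bernstein bound (sum of two one-sample terms), in (0, 4]."""
    def term(n, var, e):
        return 2.0 if e == 0.0 else 2.0 * math.exp(-n * e * e / (2.0 * (var + M * e / 3.0)))
    return term(n1, var1, kappa * eps) + term(n2, var2, (1.0 - kappa) * eps)


def _mean_var(values):
    mean = sum(values) / len(values)
    return mean, sum((v - mean) ** 2 for v in values) / len(values)


def change_subspace(sq_errors, t_star, tau=2.5, M=1.0):
    """sq_errors[i][j]: squared reconstruction error of instance i in dimension j."""
    before, after = sq_errors[:t_star], sq_errors[t_star:]
    subspace = []
    for j in range(len(sq_errors[0])):
        m1, v1 = _mean_var([row[j] for row in before])
        m2, v2 = _mean_var([row[j] for row in after])
        p_j = _bound(abs(m1 - m2), len(before), len(after), v1, v2, M)
        if p_j < tau:  # dimension j changed significantly
            subspace.append(j)
    return subspace


# Toy example: only dimension 1 shows a higher squared error after the change.
before = [[0.01, 0.01, 0.01] for _ in range(100)]
after = [[0.01, 0.30, 0.01] for _ in range(100)]
subspace = change_subspace(before + after, t_star=100)
```

Dimensions whose loss does not move yield the maximum bound of 4 and are therefore never added for any threshold `tau` below 4.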

Quantifying change severity
ABCD provides a measure of change severity in the affected subspace, based on the assumption that the loss in the change subspace increases with severity. Hence, we compute the average reconstruction loss observed in D* before and after the change.
Algorithm 1: Identification of change subspaces.

Working with windows
In contrast to most approaches, ABCD evaluates multiple possible change points within an adaptive time interval [1, . . . , t]. This frees the user from choosing the window size a priori and allows detecting changes at variable time scales. Next, we discuss how to efficiently evaluate those time points.

Maintaining loss statistics online
To avoid recomputing average reconstruction loss values and their variance for multiple time points every time new observations arrive, we store Welford aggregates $A_{1,k}$ summarizing the stream in the interval $[1, \dots, k]$. Each aggregate $A_{1,k}$ is a tuple containing the average reconstruction loss $\hat{\mu}_{1,k} = k^{-1} \sum_{j=1}^{k} L_j$ and the sum of squared differences $ssd_{1,k} = \sum_{j=1}^{k} (L_j - \hat{\mu}_{1,k})^2$. We store these aggregates for the time interval $[1, \dots, t]$.
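A minimal sketch of such aggregates is given below. It uses the standard Welford update and the standard formula for combining two aggregates, solved for the second part, to obtain the statistics of the interval [k+1, ..., t] from the aggregates of [1, ..., k] and [1, ..., t]. The tuple layout `(n, mean, ssd)` and the function names are assumptions for illustration.

```python
def welford_update(agg, value):
    """Return the aggregate (n, mean, ssd) after adding one loss value."""
    n, mean, ssd = agg
    n += 1
    delta = value - mean
    mean += delta / n
    ssd += delta * (value - mean)
    return n, mean, ssd


def suffix_aggregate(prefix, total):
    """Aggregate of [k+1..t], derived from the aggregates of [1..k] and [1..t]."""
    n1, mean1, ssd1 = prefix
    n, mean, ssd = total
    n2 = n - n1
    if n2 <= 0:
        raise ValueError("total must cover more observations than prefix")
    mean2 = (n * mean - n1 * mean1) / n2
    ssd2 = ssd - ssd1 - (mean2 - mean1) ** 2 * n1 * n2 / n
    return n2, mean2, max(ssd2, 0.0)  # clamp tiny negative rounding errors


losses = [0.1, 0.2, 0.15, 0.4, 0.5, 0.45]
aggregates = []
agg = (0, 0.0, 0.0)
for value in losses:
    agg = welford_update(agg, value)
    aggregates.append(agg)  # aggregates[k - 1] summarises [1..k]

# Statistics of the window after split k = 3, without revisiting the raw losses.
n2, mean2, ssd2 = suffix_aggregate(aggregates[2], aggregates[-1])
variance2 = ssd2 / n2
```

With one stored aggregate per time point, every candidate split can be evaluated in constant time.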

Implementation
One can implement ABCD as a recursive algorithm, see Algorithm 2, which restarts every time a change occurs. We keep a data structure W that contains the aggregates, instances, and reconstructions. W can either be empty, or, in the case of a recursive execution, already contain data from the previous run.
Prior to execution, our algorithm must first obtain a model of the current data from an initial sample of size $n_{\min}$. If necessary, ABCD allows enough instances to arrive (lines 5-7). Larger choices of $n_{\min}$ allow for better approximations of the current distribution but delay change detection. Hence our recommendation is to set $n_{\min}$ as small as possible to still learn the current distribution; a default of $n_{\min} = 100$ has worked well for us.
Afterwards, the algorithm trains the model using the instances in W (lines 8-9). ABCD can in principle work with various encoder-decoder models; thus we deal with tuning the model only on a high level. Nonetheless, we give recommendations in our sensitivity study later on.
Once ABCD detects a change, it identifies the corresponding subspace and evaluates its severity (lines 21-22). Then it adapts W by dropping the outdated part of the window (line 23), including all information obtained with the outdated model. At last, we restart ABCD with the adapted window (line 24).

Discussion
In the worst case our approach consumes linear time and memory because W grows linearly with $t$. However, we can simply restrict the size of W to $n_{\max}$ items for constant memory or evaluate only $k_{\max}$ window splits for constant runtime. In the latter case we split W at every $(t / k_{\max})$-th time point. Regarding $n_{\max}$, it is beneficial that the remaining aggregates still contain information about all observations in $(1, \dots, t)$. Hence, ABCD considers the entire past since the last change even though one restricts the size of W.
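One possible reading of the split restriction is sketched below: candidate split points are spaced evenly so that at most k_max of them are evaluated, regardless of how long the window has grown. The rounding rule and the function name are assumptions; the text does not fix them.

```python
def candidate_splits(t, k_max):
    """At most k_max evenly spaced split points k with 1 <= k < t."""
    if t < 2:
        return []
    step = -(-t // k_max)  # ceil(t / k_max) using integer arithmetic
    return list(range(step, t, step))


splits = candidate_splits(t=1000, k_max=20)
short_window = candidate_splits(t=10, k_max=20)  # every position is evaluated
```

The number of evaluated splits stays bounded as the window grows, which is what yields a constant cost per arriving instance.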
ABCD can work with any encoder-decoder model, such as deep neural networks. However, handling an influx of new observations that exceeds the model's processing capability can be challenging. Assuming that evaluating $\psi \circ \phi$ takes time in $O(g(d))$ for some function $g$ of dimensionality $d$, the processing time of a single instance during serial execution is in $O(g(d) + k_{\max})$. Nevertheless, both the deep architecture components and the computation of the change score (cf. Equation 8) can be executed in parallel using specialized hardware.
Dimensionality reduction techniques are often already present in data stream mining pipelines, for example as a preprocessing step to improve the accuracy of a classifier (Yan et al, 2006). Reusing an existing dimensionality reduction model makes it easy to integrate ABCD into an existing pipeline.
Bernstein's inequality holds for zero-centered bounded random variables whose absolute values are at most $M$ almost surely. While $M = 1$ serves as a theoretical upper limit of the zero-centered reconstruction error $L_t - E[L_t]$ for $x_t \in [0, 1]^d$, we observe that this theoretical limit is very conservative in practice (cf. Appendix A.3). In fact, observing an error of 1 corresponds to an instance and reconstruction of $x = [0]^d$ and $\hat{x} = [1]^d$. This leads us to use $M = 0.1$ in our experiments.

Experiments
This section describes our experiments and results. We first describe the experimental setting (Section 5.1). Then we analyze ABCD's change detection performance (Section 5.3), its ability to find change subspaces and quantify change severity (Section 5.4), and its parameter sensitivity (Section 5.5).

Algorithms
We evaluate ABCD with different encoder-decoder models: (1) Principal Component Analysis (PCA) ($d' = \eta d$), (2) Kernel-PCA ($d' = \eta d$, RBF-kernel), and (3) a standard fully-connected autoencoder model with one hidden ReLU layer ($d' = \eta d$) and an output layer with sigmoid activation. For (1) and (2), we rely on the default scikit-learn implementations. We implement the autoencoder (3) in PyTorch and train it through gradient descent using $E$ epochs and an Adam optimizer with default parameters according to Kingma and Ba (2015); see Appendix A.1 for pseudocode of the autoencoder training procedure. We compare ABCD with AdwinK, IKS, IBDD, WATCH, and D3 (cf. Section 2). We evaluate for each approach a large grid of parameters, shown in Table 2. Whenever possible, the evaluated grids of hyperparameters for competitors are based on recommendations in the respective papers. Otherwise, we choose them based on preliminary experiments. For ABCD, we evaluate larger and smaller values for $\delta$, $\eta$, and $E$ to observe our approach's sensitivity to those parameters. The choice of $\tau = 2.5$ is our recommended default based on our sensitivity study in Section 5.5. Last, we set $n_{\min} = 100$ and $k_{\max} = 20$, minimum values that have worked well in preliminary experiments.
Notes to Table 2: * used or recommended in the respective papers; † only relevant for autoencoders; ‡ authors did not recommend parameters for their approach.

Datasets
There are not many public benchmark data streams for change detection. Thus we generate our own from seven real-world (rw) and synthetic (syn) classification datasets, similar to (Faber et al, 2021; Faithfull et al, 2019). We simulate changing data streams by sorting the data by label, unless stated otherwise. If the label changes, a change has occurred. In real-world data streams, the number of observations between changes depends on each dataset, reported below. In the synthetic streams, we introduce changes every 2000 observations, which is a relatively large interval, to assess whether some approaches generate many false alarms. The generators are based on the following datasets:
• HAR (rw): The dataset Human Activity Recognition with Smartphones (Anguita et al, 2013) (d = 561) is based on smartphone accelerometer and gyroscope readings for different actions a person performs. A change occurs on average every 1768 observations.
• GAS (rw): This dataset (Vergara et al, 2011) (d = 128) contains data from 16 sensors exposed to 6 gases at various concentrations. A change occurs on average every 2265 observations.
• LED (syn): The LED generator samples instances representing a digit on a seven-segment display. It contains 17 additional random dimensions. We add changes by varying the probability of bit-flipping in the relevant dimensions.
• RBF (syn): The RBF generator (Bifet et al, 2010) starts by drawing a fixed number of centroids. For each new instance, the generator chooses a centroid at random and adds Gaussian noise. To create changes, we increment the seed of the generator, resulting in different centroids. We then use samples from the new generator in a subspace of random size.
Changes can occur rapidly ("abrupt" or "sudden") or in time intervals ("gradual" or "incremental"). The shorter the interval, the more sudden the change. We vary the interval size between 1 and 300 unless stated otherwise. Real-world and image data do not have a ground truth for change subspaces and severity. Thus we generate three additional data streams:
• HSphere (syn): This generator draws from a d*-dimensional hypersphere bound to [0, 1] and adds d − d* random dimensions. We vary the radius and center of the hypersphere to introduce changes. The change subspace contains those dimensions that define the hypersphere.
• Normal-M/V (syn): These generators sample from a d*-dimensional normal distribution and add d − d* random dimensions. For type M, changes affect the distribution's mean; for type V, we change the distribution's variance.
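The label-sorting procedure used to turn a classification dataset into a changing stream can be sketched as follows. The function name and the convention of reporting the 0-based index of the first instance after each change are illustrative assumptions.

```python
def stream_from_labels(instances, labels):
    """Sort instances by label and record the positions where the label changes."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])  # stable sort
    stream = [instances[i] for i in order]
    sorted_labels = [labels[i] for i in order]
    change_points = [
        i for i in range(1, len(sorted_labels))
        if sorted_labels[i] != sorted_labels[i - 1]
    ]
    return stream, change_points


# Tiny labelled dataset with three classes.
X = [[0.1], [0.9], [0.2], [0.8], [0.5]]
y = [0, 1, 0, 1, 2]
stream, change_points = stream_from_labels(X, y)
```

Each recorded position serves as a ground-truth change point when evaluating a detector on the resulting stream.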

Change point detection
We use precision, recall, and F1-score to evaluate the performance of the approaches at detecting changes. We define true positives (TP), false positives (FP), and false negatives (FN) as follows:
• TP: A change was detected before the next change.
• FN: A change was not detected before the next change.
• FP: A change was detected although no change occurred.
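The definitions above can be turned into a small evaluation routine. The following is a minimal sketch, not the evaluation code of the experiments; the function name and the convention that repeated alarms within one segment count as false positives are assumptions. It also records the delay of each true positive, i.e., the number of instances between a change and its first detection.

```python
def evaluate_detections(change_points, detections, stream_length):
    """Count TP/FP/FN for a list of detection times (illustrative helper)."""
    change_points = sorted(change_points)
    detections = sorted(detections)
    tp = fn = 0
    delays = []
    # Alarms before the first change cannot be attributed to any change.
    first = change_points[0] if change_points else stream_length
    fp = sum(1 for t in detections if t < first)
    for i, cp in enumerate(change_points):
        nxt = change_points[i + 1] if i + 1 < len(change_points) else stream_length
        hits = [t for t in detections if cp <= t < nxt]
        if hits:
            tp += 1                      # detected before the next change
            delays.append(hits[0] - cp)  # instances until first detection
            fp += len(hits) - 1          # assumption: extra alarms are FPs
        else:
            fn += 1                      # missed before the next change
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mean_delay = sum(delays) / len(delays) if delays else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision,
            "recall": recall, "f1": f1, "mean_delay": mean_delay}
```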
Also, we report the mean time until detection (MTD), indicating the average number of instances until a change is detected. Figure 2 shows F1-score, precision, recall, and MTD for all datasets and algorithms, as well as a column "Average" that summarizes across datasets. Each box contains the results for the grid of hyperparameters shown in Table 2. We see that our approach outperforms its competitors w.r.t. F1-score and precision. It is also competitive in terms of recall, though it loses against IKS, IBDD, and WATCH. These approaches seem overly sensitive. The results also indicate that ABCD works well for a wide range of hyperparameters. One reason is that ABCD uses adaptive windows, thereby eliminating the effect of a window size parameter (demonstrated in Section 5.6). Another reason is that ABCD detects changes in reconstruction loss irrespective of the actual quality of the reconstructions. For instance, Kernel PCA and PCA produce reconstructions of different accuracy in our experiments. However, for both models, the average accuracy changes when the stream changes, which is what our algorithm detects. Refer to Appendix A.3 for an illustration of the models' reconstruction loss over time. Hence, our reported results do not yield information about the actual accuracy of the underlying encoder-decoder models.
ABCD has a higher MTD than D3, IBDD, and IKS, i.e., it requires more data to detect changes. However, those competitors are much less conservative and detect many more changes than exist in the data. Hence they have low precision but high recall; this leads to a lower MTD.
Table 3 reports the results of all approaches with their best hyperparameters. WATCH and D3 achieve relatively high F1-score and precision. In fact, those approaches are our strongest competitors, although we still outperform them by at least 3%. Further, WATCH has an MTD of 626, which is higher than that of ABCD, while D3 and ABCD have a comparable MTD.
ABCD has much higher precision than its competitors. We assume this is because ABCD (1) leverages the relationships between dimensions, in comparison to AdwinK, IKS, or IBDD, and (2) learns those relationships more effectively than, say, D3 or WATCH. For example, we observed in our experiments that WATCH was frequently unable to accurately approximate the Wasserstein distance in high-dimensional data.
ABCD has lower recall than most competitors, partly due to their oversensitivity. In this regard, our approach might benefit from application-specific encoder-decoder models that leverage structure in the data, such as spatial relationships between the pixels of an image, more effectively.

Change subspace and severity
We now evaluate change subspace identification and change severity estimation. We set d ∈ {24, 100, 500} and vary the change subspace size d* randomly in [1, d] (except for LED, where the subspace always contains dimensions 1-7). We set the ground truth for the severity to the absolute difference between the parameters that define the concepts, e.g., the hypersphere radius in HSphere before and after the change. We report an approach's subspace detection accuracy (SAcc.), where true positives (true negatives) represent those dimensions that were correctly classified as being a member (not being a member) of the change subspace. We use Spearman's correlation between the detected severity and the ground truth. We also report the F1-score for detecting change points. Figure 3 shows our results. As before, each box summarizes the results for the grid of evaluated hyperparameters. Comparing the two approaches that monitor each dimension separately, AdwinK and IKS, we see that the former can only detect changes that affect the mean of the marginal distributions (i.e., on Norm-M, LED). In contrast, the latter can also detect other changes (e.g., changes in variance). This is expected since AdwinK compares the mean in two windows while IKS compares the empirical distributions.
Regarding subspace detection, our approach achieves an accuracy of 0.72 for PCA, 0.78 for autoencoders, and 0.79 for Kernel PCA. AdwinK performs similarly well when changes affect the mean of the marginal distributions. Except on LED, IKS performs worse than ABCD and AdwinK, presumably because IKS issues an alarm as soon as a single dimension changes.
The severity estimates of our approach correlate more strongly with the ground truth than those of competitors, with an average of 0.31 for PCA, 0.36 for Kernel PCA, and 0.37 for autoencoders. However, we expect more specialized models to perform better than our tested models. On LED, PCA-based models appear to struggle to separate patterns from noise, resulting in poor noise level estimates and low correlation scores.

Parameter sensitivity of ABCD
Sensitivity to η
Figure 4a plots F1 for different datasets over η. We observe that the size of the bottleneck does not significantly impact the change detection performance of ABCD (ae) and ABCD (kpca). For PCA, however, too large bottlenecks seem to inhibit change detection on CIFAR, GAS, and MNIST. For those datasets, we assume that the change occurs along the retained principal components, rendering it undetectable; see Appendix A.2 for an illustration. Figure 4b shows the subspace detection accuracy and Spearman's ρ. The influence of η on both metrics is low. As mentioned earlier, we assume that a change in reconstruction loss, rather than the quality of reconstruction itself, is crucial for ABCD. An exception is the LED dataset, on which PCA and Kernel PCA are unable to provide a measure that positively correlates with change severity. We hypothesize that those methods struggle to separate patterns from noise, resulting in poor noise level estimates and low correlation scores.
Sensitivity to E
Figure 4c plots our approach's performance for different choices of E. Overall, our approach seems to be robust to the choice of E. On LED, however, larger choices of E lead to substantial improvements in F1-score. The reason may be that the autoencoder does not converge to a proper representation of the data for small E. To avoid this, we recommend choosing E ≥ 50 and increasing the value if the model has not yet converged sufficiently.

Sensitivity to τ
Figure 4d investigates how the choice of τ affects the performance of ABCD at detecting subspaces. Since the change score in Equation (5) provides an upper bound on the probability that a change occurred, the function can return values greater than 1, i.e., in the range (0, 4]. Hence we vary τ in that range and record the obtained subspace detection accuracy. For all approaches we achieve optimal accuracy at τ ≈ 2.5. This is probably because some dimensions could change more severely than others, resulting in variations of the change scores observed in the different dimensions of the change subspace. Based on our findings, we recommend τ = 2.5 as default.
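To make the role of τ concrete, the following sketch shows one way a Bernstein-type score with values in (0, 4] can arise and how a threshold selects a subspace. It does not reproduce Equation (5): we assume a two-sided Bernstein bound applied to each of the two windows and combined by a union bound, a hypothetical weighting parameter `kappa`, and the convention that dimensions with a score below τ belong to the change subspace. M denotes the bound on the deviation of the loss from its expected value.

```python
import math


def bernstein_tail(n, eps, var, M=1.0):
    """Two-sided Bernstein bound on P(|sample mean - expectation| >= eps)
    for n bounded observations with variance var."""
    if eps == 0.0:
        return 2.0  # the bound is trivial when no deviation is observed
    return 2.0 * math.exp(-n * eps ** 2 / (2.0 * (var + M * eps / 3.0)))


def change_score(mean1, var1, n1, mean2, var2, n2, M=1.0, kappa=0.5):
    """Assumed score: split the observed difference between both windows
    (share kappa vs. 1 - kappa) and add the two tail bounds."""
    eps = abs(mean1 - mean2)
    return (bernstein_tail(n1, kappa * eps, var1, M)
            + bernstein_tail(n2, (1.0 - kappa) * eps, var2, M))


def change_subspace(scores_per_dimension, tau=2.5):
    """Assumed rule: dimensions whose score falls below tau form the subspace."""
    return [j for j, p in enumerate(scores_per_dimension) if p < tau]
```

Under these assumptions, two identical windows yield the maximum score of 4, while a large difference between well-populated windows drives the score toward 0.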

Ablation study on window types
Next, we investigate the effect of different window types on change detection performance. We evaluate those commonly found in change detection literature (and in our competitors) and couple them with encoder-decoder models and the probability bound in Equation (5). In particular, we compare: (1) adaptive windows (AW), as in ADWIN, AdwinK, and our approach, (2) fixed reference windows (RW), as in IKS, (3) sliding windows (SW), as in WATCH, and (4) jumping windows (JW), as in D3. The latter "jump" every ρ|W| instances.
We evaluate the hyperparameters mentioned in Table 2. For example, because D3 uses jumping windows, we include the evaluated hyperparameters for D3 in our evaluation of jumping windows. In addition, we extend the grid with other reasonable choices since we already preselected those in Table 2 for our competitors in a preliminary study. For ABCD we use η = 0.5 and E = 50.
Table 4 reports the average over all hyperparameter combinations. AWs yield higher F1-score and recall than other techniques, while precision remains high (≥ 0.95). SWs have a lower MTD than AWs and hence seem to require fewer instances until they detect a change. This is expected: in contrast to sliding windows, adaptive windows allow the detection of even slight changes after a longer period of time, resulting in both higher MTD and recall.

(d) Subspace detection accuracy of ABCD depending on τ.
Figure 4: Sensitivity of our approach to its hyperparameters.

Comparison with competitors
Figure 5a shows the mean time per observation (MTPO) of ABCD and its competitors for d ∈ {10, 100, 1000, 10000} running single-threaded. The results are averaged over all evaluated parameters (Table 2). ABCD (id) replaces the encoder-decoder model with the identity, which does not cause overhead. This allows measuring how much the encoder-decoder model influences ABCD's runtime. The results confirm that the runtime of ABCD alone, i.e., without the encoding-decoding process, remains unaffected by a stream's dimensionality. We observe that our approach is able to process around 10,000 observations per second for d ≤ 100. This is faster than IKS, WATCH, and AdwinK (except at d = 10) but slower than D3 and IBDD. The reason is that our approach evaluates $k_{\max}$ possible change points in each time step. In high-dimensional data, our competitors' MTPO grows faster than that of ABCD with PCA or KPCA; in fact, ABCD (pca) is second fastest after D3 for d ≥ 1000. An exception is WATCH at d = 10000. This is due to an iteration cap for approximating the Wasserstein distance, which bounds the approach's MTPO.

Runtime depending on window size
Next, we investigate ABCD's runtime for different choices of $k_{\max}$ and η. We run this experiment on a single CPU thread. For all three evaluated models, the encoding-decoding of an observation has a time complexity of $O(\eta d^2)$; hence, ABCD's processing time for one instance is in $O(\eta d^2 + k_{\max})$. We therefore expect a quadratic increase in execution time with dimensionality and a linear increase with η and $k_{\max}$ when running on a single core.
The results in Figure 5b show the influence of $k_{\max}$ on the execution time: $k_{\max}$ effectively restricts the MTPO as soon as $|W| = k_{\max}$. Afterwards, MTPO remains unaffected by $|W|$. This also confirms that one can evaluate different possible change points in constant time using the proposed aggregates.
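The constant-time evaluation of a candidate change point can be sketched as follows. We assume that each aggregate is a triple (count, sum, sum of squared distances from the mean) stored for every prefix of the window; the statistics of the remaining suffix then follow from the aggregate of the whole window and that of the prefix via the standard pairwise-merge identity. The function names are hypothetical and the code is an illustration, not the implementation used in the experiments.

```python
def push(agg, x):
    """Extend a prefix aggregate (n, T, ssd) by one value in O(1)."""
    n, total, ssd = agg
    if n == 0:
        return (1, x, 0.0)
    mean_old = total / n
    n, total = n + 1, total + x
    mean_new = total / n
    return (n, total, ssd + (x - mean_old) * (x - mean_new))


def suffix_from_prefix(agg_ab, agg_a):
    """Aggregate of B given those of AB = A ∪ B and A, in O(1).
    Uses ssd_AB = ssd_A + ssd_B + m*n/(m+n) * (mean_A - mean_B)^2."""
    n_ab, t_ab, ssd_ab = agg_ab
    m, t_a, ssd_a = agg_a
    n = n_ab - m
    t_b = t_ab - t_a
    mean_a, mean_b = t_a / m, t_b / n
    ssd_b = ssd_ab - ssd_a - (m * n / (m + n)) * (mean_a - mean_b) ** 2
    return (n, t_b, ssd_b)


# Usage: one prefix aggregate per position of the window.
losses = [1.0, 2.0, 3.0, 4.0, 10.0, 12.0]
prefixes, agg = [], (0, 0.0, 0.0)
for value in losses:
    agg = push(agg, value)
    prefixes.append(agg)

# Candidate change point after the 4th observation: A = first 4, B = last 2.
n_b, t_b, ssd_b = suffix_from_prefix(prefixes[-1], prefixes[3])
```

Each candidate split thus costs a constant number of arithmetic operations, independent of the window size.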
We show the runtime for different choices of bottleneck size η in Figure 5c. η has little influence on the runtime of ABCD with PCA and Kernel PCA. However, coupled with an autoencoder (implemented in PyTorch), we observe the expected linear increase in execution time from 0.1 ms for η = 0.3 to 0.3 ms for η = 0.7. Considering that change detection performance has been shown to remain stable even for smaller choices of η, we recommend η ≤ 0.5 as default.

Conclusion
We presented a change detector for high-dimensional data streams, called ABCD, that monitors the reconstruction loss of an encoder-decoder model in an adaptive window with a change score based on Bernstein's inequality.
Our approach identifies changes and change subspaces, and provides a severity measure that correlates with the ground truth. Since encoder-decoder models are already used in many domains (Rani et al, 2022), our approach is widely applicable. In the future, it would thus be interesting to test ABCD with application- or data-specific encoder-decoder models. For example, one might observe even better performance on streams of image data when applying convolutional autoencoders. Last, ABCD could also benefit from a theoretical analysis of the relationship between changes in data distribution and the loss of different encoder-decoder models.

2: X_train ← {x_i ∀ (−, −, x_i) ∈ W}
3:
6: Return ϕ, ψ

A.2 Detectable and undetectable change for ABCD (pca)
This section illustrates under which conditions one can use principal component analysis to detect change. Figure 6 shows data from two distributions: black points (e.g., before the change) plus the associated first principal component, and blue points (e.g., after the change). On the left, the change affects the correlation between Dim. 1 and Dim. 2. This leads to an increased reconstruction error for the points highlighted in blue. On the right, the change occurs along the first principal component, i.e., the variance along the first principal component has increased. This kind of change is undetectable by ABCD (pca) as the reconstruction error remains unchanged.
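The two cases can be reproduced numerically with a small two-dimensional sketch. The data below are synthetic and chosen by us for illustration (a dominant direction along the diagonal with small orthogonal noise); they are not the data shown in Figure 6. The model keeps only the first principal component, fitted on the data before the change.

```python
import math
import random


def fit_first_pc(points):
    """Center and first principal direction of 2-D points (closed form)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the major axis
    return (mx, my), (math.cos(theta), math.sin(theta))


def mean_reconstruction_error(points, center, direction):
    """Mean squared residual after projecting onto the retained component."""
    (cx, cy), (ux, uy) = center, direction
    total = 0.0
    for x, y in points:
        dx, dy = x - cx, y - cy
        proj = dx * ux + dy * uy
        total += (dx - proj * ux) ** 2 + (dy - proj * uy) ** 2
    return total / len(points)


def sample(n, axis, spread, noise, rng):
    """Points with std `spread` along `axis` and std `noise` orthogonal to it."""
    ax, ay = axis
    points = []
    for _ in range(n):
        t, s = rng.gauss(0.0, spread), rng.gauss(0.0, noise)
        points.append((t * ax - s * ay, t * ay + s * ax))
    return points


rng = random.Random(0)
r = math.sqrt(0.5)
before = sample(2000, (r, r), 1.0, 0.05, rng)
center, pc = fit_first_pc(before)

after_along = sample(2000, (r, r), 3.0, 0.05, rng)   # more variance along the PC
after_corr = sample(2000, (r, -r), 1.0, 0.05, rng)   # correlation flipped

err_before = mean_reconstruction_error(before, center, pc)
err_along = mean_reconstruction_error(after_along, center, pc)
err_corr = mean_reconstruction_error(after_corr, center, pc)
```

In this sketch, `err_along` stays close to `err_before` although the variance has tripled in standard deviation, whereas `err_corr` is larger by orders of magnitude.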

A.3 Reconstruction loss over time
Figure 7 shows the reconstruction loss of the evaluated encoder-decoder models over the length of the stream. We observe that the reconstruction loss indeed decreases with increasing bottleneck size (controlled by η) and with increasing number of training epochs E (first three columns). Further, we see that regardless of E, η, or the type of model, the reconstruction loss typically changes after a change point. After the change was detected, ABCD learns the new concept, which mostly leads to a decrease in reconstruction loss. Last, we observe that the theoretical limit M = 1 for the absolute difference between the reconstruction loss and its expected value is overly conservative. A value of M = 0.1 seems to be a more realistic choice.
Derivation. Consider two non-overlapping samples $A = \{x_1, \dots, x_m\}$ and $B = \{x_1, \dots, x_n\}$ of a real random variable. Let $T_A = \sum_{i=1}^{m} x_i$ and $T_B = \sum_{i=1}^{n} x_i$ be the sums of the samples, and let $\mathrm{ssd}_A = \sum_{i=1}^{m} (x_i - m^{-1} T_A)^2$ and $\mathrm{ssd}_B = \sum_{i=1}^{n} (x_i - n^{-1} T_B)^2$ be the sums of squared distances from the mean. For the union of both sets, $AB = A \cup B$, we have $T_{AB} = T_A + T_B$, which is equivalent to $(m + n)\mu_{AB} = m\mu_A + n\mu_B$. Solving for $\mu_B$ gives $\mu_B = \frac{(m + n)\mu_{AB} - m\mu_A}{n}$.

• MNIST, FMNIST, and CIFAR (syn): These data generators sample from the image recognition datasets MNIST (LeCun et al, 1998), Fashion MNIST (FMNIST) (Xiao et al, 2017) (d = 784), and CIFAR (Krizhevsky et al, 2009) (d = 1024, grayscale).

Figure 2: Change Point Detection: Results for different algorithms and datasets; each box contains the results for the evaluated grid of parameters.

Figure 3: Results for evaluating change subspace and severity.
Figure 4b: Influence of η on the estimation of change subspaces and severity.

Figure 6: Illustration of detectable and undetectable change using PCA.

Figure 7: Reconstruction loss over the length of the stream.

Table 2: Evaluated approaches and their parameters.

Table 3: Results of approaches with their best hyperparameter configuration w.r.t. F1-score, averaged over all datasets.

Table 4: Ablation: Using encoder-decoder models with different window types.
A.1 Training of autoencoder
Algorithm 3 describes the training of the autoencoder model as done in our experiments. First, we collect the training data from the current window W (line 2). Afterwards, we perform gradient descent on X_train for E epochs at a learning rate of lr.

Require: W, learning rate lr, number of training epochs E
1: procedure TrainAE(W, lr, E)
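The training procedure described in A.1 can be sketched in a few lines. The version below is a deliberate simplification and not the PyTorch model used in the experiments: it assumes a linear encoder and decoder without bias terms, a bottleneck of size η·d, plain stochastic gradient descent on the squared reconstruction error, and window entries stored as triples whose third element is the observation. All names are hypothetical.

```python
import random


def encode_decode(x, enc, dec):
    """Apply the linear encoder (k x d) and decoder (d x k) to one observation."""
    z = [sum(w * v for w, v in zip(row, x)) for row in enc]
    x_hat = [sum(w * v for w, v in zip(row, z)) for row in dec]
    return z, x_hat


def train_ae(window, lr=0.01, epochs=50, eta=0.5, seed=0):
    """Collect the training data from the window, then run `epochs` passes of
    stochastic gradient descent on the squared reconstruction error."""
    rng = random.Random(seed)
    x_train = [item[2] for item in window]  # third tuple element holds x_i
    d = len(x_train[0])
    k = max(1, int(eta * d))                # assumed bottleneck size
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(k)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(d)]
    for _ in range(epochs):
        for x in x_train:
            z, x_hat = encode_decode(x, enc, dec)
            residual = [a - b for a, b in zip(x_hat, x)]
            # Gradient w.r.t. the code z, computed before the decoder changes.
            grad_z = [2.0 * sum(residual[j] * dec[j][i] for j in range(d))
                      for i in range(k)]
            for j in range(d):
                for i in range(k):
                    dec[j][i] -= lr * 2.0 * residual[j] * z[i]
            for i in range(k):
                for l in range(d):
                    enc[i][l] -= lr * grad_z[i] * x[l]
    return enc, dec


def reconstruction_loss(x, enc, dec):
    """Mean squared reconstruction error of one observation."""
    _, x_hat = encode_decode(x, enc, dec)
    return sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
```

The returned pair plays the role of the encoder and decoder that ABCD subsequently monitors through the reconstruction loss.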