Learning and Adaptation to Detect Changes and Anomalies in High-Dimensional Data

The problem of monitoring a datastream and detecting whether the data-generating process changes from normal to novel, and possibly anomalous, conditions has relevant applications in many real scenarios, such as health monitoring and quality inspection of industrial processes. A general approach often adopted in the literature is to learn a model describing normal data and to detect as anomalous those data that do not conform to the learned model. However, several challenges have to be addressed to make this approach effective in real-world scenarios, where acquired data are often high dimensional and feature complex structures (such as signals and images). We address this problem from two perspectives corresponding to different modeling assumptions on the data-generating process. At first, we model data as realizations of random vectors, as is customary in the statistical literature. In this setting we focus on the change-detection problem, where the goal is to detect whether the datastream permanently departs from normal conditions. We theoretically prove the intrinsic difficulty of this problem when the data dimension increases and propose a novel non-parametric and multivariate change-detection algorithm. In the second part, we focus on data having complex structure and adopt dictionaries yielding sparse representations to model normal data. We propose novel algorithms to detect anomalies in such datastreams and to adapt the learned model when the process generating normal data changes.


Introduction
The general problem we address here is the monitoring of datastreams to detect changes in the data-generating process. This problem has to be faced in several applications, since a change could indicate an issue that has to be promptly signaled and solved. In particular, we consider two meaningful examples. Figure 5.1a shows a Scanning Electron Microscope (SEM) image acquired by an inspection system that monitors the quality of nanofibrous materials produced by a prototype machine. In normal conditions, the produced material is composed of tiny filaments whose diameter ranges up to 100 nanometers. However, several issues might affect the production process and introduce small defects among the fibers, such as the clots in Fig. 5.1a. These defects have to be promptly detected to improve the overall production quality.
The second scenario we consider is the online and long-term monitoring of ECG signals using wearable devices. This is a very relevant problem, as it would ease the transition from hospital to home/mobile health monitoring. In this case the data we analyze are the heartbeats. As shown in Fig. 5.1b, normal heartbeats feature a specific morphology, while the shape of anomalous heartbeats, which might be due to potentially dangerous arrhythmias, is characterized by a large variability. Since the morphology of normal heartbeats depends on the user and on the position of the device [16], the anomaly-detection algorithm has to be configured every time the user places the device.
Monitoring these kinds of datastreams raises three main challenges. At first, data are characterized by complex structures and high dimension, and there is no analytical model able to describe them; therefore, it is necessary to learn models directly from data. Moreover, only normal data can be used during learning, since acquiring anomalous data can be difficult, if not impossible (e.g., in ECG monitoring, acquiring arrhythmias might be dangerous for the user). Secondly, we have to carefully design indicators and rules to assess whether incoming data fit the learned model or not. Finally, we have to face the domain-adaptation problem: normal conditions might change over time and the learned model might become unable to describe incoming normal data, thus it has to be adapted accordingly. For example, in ECG monitoring the model is learned over a training set of normal heartbeats acquired at low heart rate, but the morphology of normal heartbeats changes when the heart rate increases, see Fig. 5.2b.

[Fig. 5.2: a The content of these images is different, although they are perceptually similar. b Examples of heartbeats acquired at different heart rates. We report the names of the ECG waveforms [13].]

We investigate these challenges following two directions, adopting two different modeling assumptions on the data-generating process. At first, we assume that data can be described by a smooth probability density function, as customary in the statistics literature. In this setting we focus on the change-detection problem, namely the problem of detecting permanent changes in the monitored datastream. We investigate the intrinsic difficulty of this problem, in particular when the data dimension increases. Then, we propose QuantTree, a novel change-detection algorithm based on histograms that enables the non-parametric monitoring of multivariate datastreams.
We theoretically prove that the distribution of any statistic computed over histograms defined by QuantTree does not depend on the data-generating process.
In the second part we focus on data having a complex structure, and address the anomaly detection problem. We propose a very general anomaly-detection algorithm based on a dictionary yielding sparse representation learned from normal data. As such, it is not able to provide sparse representation to anomalous data, and we exploit this property to design low dimensional indicators to assess whether new data conform or not to the learned dictionary. Moreover, we propose two domain adaptation algorithms to make our anomaly detector effective in the considered application scenarios.
The chapter is structured as follows: Sect. 5.2 presents the most relevant related literature, and Sect. 5.3 formally states the problem we address. Section 5.4 focuses on the contributions of the first part of the thesis, where we model data as random vectors, while Sect. 5.5 is dedicated to the second part, where we consider data having complex structures. Finally, Sect. 5.6 presents the conclusions and future work.

Related Works
The first algorithms [22,24] addressing the change-detection problem were proposed in the statistical process control literature [15] and consider only univariate datastreams. Thanks to their simplicity, these algorithms are well studied and several of their properties have been proved. Their main drawback is that they require knowledge of the data-generating distribution. Non-parametric methods, namely those that can operate when this distribution is unknown, typically employ statistics based on the natural order of the real numbers, such as the Kolmogorov-Smirnov [25] and Mann-Whitney [14] statistics. Extending these methods to multivariate datastreams is far from trivial, since no natural order is defined on R^d. The general approach to monitor multivariate datastreams is to learn a model that approximates φ_0 and monitor a univariate statistic based on the learned model. One of the most popular non-parametric approximations of φ_0 is given by Kernel Density Estimation [17], which however becomes intractable when the data dimension increases.
All these methods assume that data can be described by a smooth probability density function (pdf). However, complex data such as signals and images lie close to a low-dimensional manifold embedded in a higher-dimensional space [3] and do not admit a smooth pdf. In these cases, it is necessary to learn meaningful representations of data to perform any task, from classification to anomaly detection. Here we consider dictionaries yielding sparse representations, which were originally proposed to address image-processing problems such as denoising [1], but have also been employed in supervised tasks [19], in particular classification [20,23]. The adaptation of dictionaries yielding sparse representations to different domains has been investigated in particular to address the image classification problem [21,27]. In this scenario, training images are acquired under different conditions than the test ones, e.g., different lighting and view angles, and therefore live in a different domain.

Problem Formulation
Let us consider a datastream {s_t}_{t=1,...}, where s_t ∈ R^d is drawn from a process P_N in normal conditions. We are interested in detecting whether s_t ∼ P_A, i.e., whether s_t is drawn from an alternative process P_A representing anomalous conditions. Both P_N and P_A are unknown, but we assume that a training set of normal data is available to approximate P_N. In what follows we describe in detail the specific problems we consider in the thesis.
Change Detection. At first, we model s_t as a realization of a continuous random vector, namely we assume that P_N and P_A admit smooth probability density functions φ_0 and φ_1, respectively. This assumption is not too strict, as it is usually met after a feature-extraction process.
We address the problem of detecting abrupt changes in the data-generating process. More precisely, our goal is to detect whether there is a change point τ ∈ N in the datastream such that s_t ∼ φ_0 for t ≤ τ and s_t ∼ φ_1 for t > τ. For simplicity, we analyze the datastream in batches W = {s_1, ..., s_ν} of ν samples and detect changes by performing the following hypothesis test:

H_0: W ∼ φ_0   vs.   H_1: W ∼ φ_1.   (5.1)

The null hypothesis is rejected when T(W) > γ, where T is a statistic typically defined upon a model that approximates the density φ_0, and γ is set to guarantee a desired false positive rate α, i.e.,

P(T(W) > γ | W ∼ φ_0) = α.
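As a minimal illustration of this batch-wise scheme, the following sketch monitors a univariate toy stream; the statistic (the absolute batch mean) and the threshold are hand-picked placeholders for the model-based statistics and the principled choice of γ discussed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def monitor(stream, statistic, gamma, nu=64):
    """Scan the stream in non-overlapping batches of nu samples and return
    the index of the first batch W with statistic(W) > gamma (else None)."""
    for i in range(0, len(stream) - nu + 1, nu):
        W = stream[i:i + nu]
        if statistic(W) > gamma:
            return i // nu
    return None

# Univariate toy stream: the mean changes from 0 to 3 at t = 640.
stream = np.concatenate([rng.normal(0.0, 1.0, 640), rng.normal(3.0, 1.0, 640)])
T = lambda W: abs(W.mean())            # placeholder statistic
print(monitor(stream, T, gamma=1.0))   # detects the change at batch 10
```

In a real deployment T would be defined upon the learned model of φ_0 and γ calibrated to the target false positive rate.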
Anomaly Detection. The anomaly-detection problem is strictly related to change detection. The main difference is that in anomaly detection we analyze each s_t independently, to determine whether it is drawn from P_N or P_A, without taking into account the temporal correlation (for this reason we will omit the subscript t). In this setting we consider data having complex structure, such as heartbeats or patches, i.e., small regions extracted from an image. We adopt dictionaries yielding sparse representations to describe s ∼ P_N, namely we assume that s ≈ Dx, where D ∈ R^{d×n} is a matrix called dictionary, which has to be learned from normal data, and the coefficient vector x ∈ R^n is sparse, namely it has few nonzero components. To detect anomalies, we have to define a decision rule, i.e., a function T : R^d → R and a threshold γ such that

s is detected as anomalous when T(s) > γ,   (5.2)

where T(s) is defined using the sparse representation x of s.
Domain Adaptation. The process P_N generating normal data may change over time. Therefore, a dictionary D learned on training data (i.e., in the source domain) might not be able to describe normal data during testing (i.e., in the target domain). To avoid a degradation of the anomaly-detection performance, D has to be adapted as soon as P_N changes. For example, in the case of ECG monitoring the morphology of normal heartbeats changes when the heart rate increases, while in the case of SEM images the magnification level of the microscope may change, which modifies the qualitative content of the patches, as shown in Fig. 5.2.

Data as Random Vectors
In this section we consider data modeled as random vectors. At first, we investigate the detectability loss phenomenon, showing that the change-detection performance is heavily affected by the data dimension. Then, we propose a novel change-detection algorithm that employs histograms to describe normal data.

Detectability Loss
As described in Sect. 5.3, we assume that both P_N and P_A admit smooth pdfs φ_0, φ_1 : R^d → R. For simplicity, we also assume that φ_1 can be expressed as

φ_1(s) = φ_0(Qs + v),

where v ∈ R^d and Q ∈ R^{d×d} is an orthogonal matrix. This is quite a general model, as it includes changes in the mean as well as in the correlations among the components of s_t. To detect changes, we consider the popular approach that monitors the loglikelihood w.r.t. the distribution φ_0 generating normal data:

L(s) = log φ_0(s).   (5.3)

In practice, we reduce the multivariate datastream {s_t} to a univariate one, {L(s_t)}.
Since φ_0 is unknown, we should preliminarily estimate it from data and use the estimate in (5.3) in place of φ_0. However, in what follows we consider the true φ_0, since this makes it easier to investigate how the data dimension affects the change-detection performance.
We now introduce two measures that we will use in our analysis. The change magnitude assesses how much φ_1 differs from φ_0 and is defined as sKL(φ_0, φ_1), the symmetric Kullback-Leibler divergence between φ_0 and φ_1 [12]. In practice, large values of sKL(φ_0, φ_1) make the change very apparent, as shown by Stein's Lemma [12]. The change detectability assesses how perceivable the change is when monitoring the datastream {L(s_t)} and is defined as the signal-to-noise ratio of the change φ_0 → φ_1:

SNR(φ_0 → φ_1) = (E_{φ_0}[L(s)] − E_{φ_1}[L(s)])^2 / (var_{φ_0}[L(s)] + var_{φ_1}[L(s)]),   (5.4)

where E and var denote the expected value and the variance, respectively. The following theorem proves detectability loss on Gaussian datastreams (the proof is reported in [2]).

Theorem 1 ([2]) Let φ_0 be a Gaussian pdf over R^d and let φ_1 be defined as above. Then SNR(φ_0 → φ_1) is upper bounded by a quantity that depends only on the change magnitude sKL(φ_0, φ_1) and is inversely proportional to the data dimension d.
The main consequence of Theorem 1 is that the change detectability decreases when the data dimension increases, as long as the change magnitude is kept fixed. Remarkably, this result does not depend on how the change affects the datastream (i.e., it is independent of Q and v), but only on the change magnitude. Moreover, it does not depend on estimation errors, since in (5.4) we have considered the true and unknown distribution φ_0. However, detectability loss becomes even more severe when the loglikelihood is computed w.r.t. an estimate of φ_0, as we showed in [2].
Finally, we remark that detectability loss also holds for more general distributions φ_0, such as Gaussian mixtures, and on real data. In these cases the problem cannot be treated analytically, but we empirically show similar results using Controlling Change Magnitude (CCM) [6], a framework to inject changes of a given magnitude into real-world datastreams.
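Detectability loss can also be observed numerically. The following sketch estimates the SNR in (5.4) by Monte Carlo for a Gaussian mean shift of fixed magnitude (for φ_0 = N(0, I_d) and φ_1 = N(v, I_d), sKL(φ_0, φ_1) = ‖v‖²): as d grows with the magnitude fixed, the SNR shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def snr_of_change(d, skl, n=50_000):
    """Monte Carlo estimate of the SNR in (5.4) for phi_0 = N(0, I_d) and
    phi_1 = N(v, I_d) with ||v||^2 = skl (the symmetric KL divergence of a
    Gaussian mean shift), monitored via the loglikelihood L(s) = log phi_0(s)."""
    v = np.zeros(d)
    v[0] = np.sqrt(skl)                # the direction is irrelevant (Theorem 1)
    s0 = rng.normal(size=(n, d))       # samples from phi_0
    s1 = rng.normal(size=(n, d)) + v   # samples from phi_1
    # additive constants of log phi_0 cancel in both numerator and denominator
    L = lambda s: -0.5 * (s ** 2).sum(axis=1)
    L0, L1 = L(s0), L(s1)
    return (L0.mean() - L1.mean()) ** 2 / (L0.var() + L1.var())

snrs = [snr_of_change(d, skl=4.0) for d in (2, 8, 32, 128)]
print([round(x, 3) for x in snrs])   # shrinks as d grows: detectability loss
```

For this mean-shift family the SNR can also be computed in closed form as (skl/2)² / (d + skl), which the Monte Carlo estimates match closely.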

QuantTree
In this section we present QuantTree, a novel change-detection algorithm that computes a histogram to approximate φ_0. A histogram h is defined as h = {(B_k, π̂_k)}_{k=1,...,K}, where the bins {B_k} form a partition of the data space and the {π̂_k} are the probabilities associated with the bins. In practice, we estimate h to approximate φ_0, and π̂_k is an estimate of the probability that s ∼ φ_0 falls inside B_k. As described in Sect. 5.3, we monitor the datastream in a batch-wise manner and consider statistics T_h(W) defined over the histogram h, namely statistics that depend only on the number y_k of samples of W that fall in the bin B_k, for k = 1, ..., K. Examples of such statistics are the Total Variation distance and the Pearson statistic.
The proposed QuantTree algorithm takes as input target probability values {π_k} and generates a partition {B_k} such that the corresponding probabilities {π̂_k} are close to {π_k}. QuantTree is an iterative splitting scheme that generates a new bin at each iteration k. The bin is defined by splitting along a component chosen at random among the d available ones. The splitting point is selected so that the bin contains exactly round(π_k N) samples of the training set, where N is the number of training samples. An example of a partition computed by QuantTree is shown in Fig. 5.3.
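The splitting scheme can be sketched as follows. This is a simplified reading of QuantTree with uniform target probabilities π_k = 1/K; whether each cut keeps the lower or the upper tail is randomized here as an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def quanttree(S, K):
    """Simplified sketch of the QuantTree splitting scheme with uniform
    targets pi_k = 1/K. Each iteration cuts off one bin, along a random
    component, containing exactly round(N/K) of the remaining samples."""
    N = len(S)
    remaining = S.copy()
    rules = []
    for _ in range(K - 1):
        n_k = round(N / K)
        j = int(rng.integers(remaining.shape[1]))   # random component
        vals = np.sort(remaining[:, j])
        low = bool(rng.random() < 0.5)              # cut the lower or upper tail
        thr = vals[n_k - 1] if low else vals[-n_k]
        mask = remaining[:, j] <= thr if low else remaining[:, j] >= thr
        rules.append((j, thr, low))
        remaining = remaining[~mask]
    rules.append(None)                              # last bin: whatever is left
    return rules

def bin_counts(rules, W):
    """Counts y_k: assign each sample to the first rule it satisfies."""
    counts = np.zeros(len(rules), dtype=int)
    for s in W:
        for i, rule in enumerate(rules):
            if rule is None or (s[rule[0]] <= rule[1] if rule[2]
                                else s[rule[0]] >= rule[1]):
                counts[i] += 1
                break
    return counts

S = rng.normal(size=(1024, 4))     # training set, N = 1024
rules = quanttree(S, K=16)
print(bin_counts(rules, S))        # each bin holds exactly 1024/16 = 64 samples
```

Because each cut isolates exactly round(N/K) training samples, the π̂_k on the training set match the uniform targets by construction.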
The main property of QuantTree is that its peculiar splitting scheme makes the distribution of any statistic T_h independent of the data-generating distribution φ_0. This property is formally stated in the following theorem, which we proved in [4].

Theorem 2 Let T_h(·) be defined over a histogram h computed by QuantTree. When W ∼ φ_0, the distribution of T_h(W) depends only on ν, N and {π_k}_k.

In practice, the distribution of any statistic T_h depends only on the cardinalities of the training set and of the window W, and on the target probabilities {π_k}. The main consequence of Theorem 2 is that the threshold γ that guarantees a given false positive rate α does not depend on φ_0. Therefore, we can precompute γ by estimating the distribution of T_h over synthetically generated samples through Monte Carlo simulations. To the best of our knowledge, QuantTree is one of the first algorithms that perform non-parametric monitoring of multivariate datastreams.
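Following Theorem 2, γ can be computed once by Monte Carlo on any convenient distribution. The sketch below simulates univariate uniform data with a one-dimensional quantile partition as a simplified stand-in for a QuantTree histogram, and thresholds the Pearson statistic.

```python
import numpy as np

rng = np.random.default_rng(1)

def pearson_stat(y, pi, nu):
    """Pearson statistic over the bin counts y_k of a batch of nu samples."""
    expected = nu * np.asarray(pi)
    return ((y - expected) ** 2 / expected).sum()

def montecarlo_threshold(N, nu, K, alpha, n_rep=5000):
    """Estimate gamma for a target false positive rate alpha. By Theorem 2
    the distribution of the statistic is the same for any phi_0, so we can
    simulate univariate uniform data; the bins are the 1-d quantile partition
    with uniform targets pi_k = 1/K (a simplified stand-in for QuantTree)."""
    pi = np.full(K, 1.0 / K)
    stats = np.empty(n_rep)
    for r in range(n_rep):
        train = np.sort(rng.random(N))
        # bin edges at the empirical round(k * N / K)-th order statistics
        edges = train[np.round(np.arange(1, K) * N / K).astype(int) - 1]
        W = rng.random(nu)
        y = np.bincount(np.searchsorted(edges, W, side='right'), minlength=K)
        stats[r] = pearson_stat(y, pi, nu)
    return np.quantile(stats, 1.0 - alpha)

gamma = montecarlo_threshold(N=1024, nu=64, K=16, alpha=0.05)
print(round(float(gamma), 2))   # gamma can then be reused for any phi_0, any d
```

The same γ controls the false positive rate for any data dimension and distribution, which is precisely what makes the precomputation worthwhile.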
In [4] we compare the histograms computed by QuantTree with other partitioning schemes from the literature through experiments on Gaussian datastreams and real-world datasets. Histograms computed by QuantTree yield a larger power and are the only ones that allow to properly control the false positive rate. We remark that QuantTree also suffers from detectability loss, confirming the generality of our results on this phenomenon.

Data Featuring Complex Structures
In this section we employ dictionaries yielding sparse representations to model data having complex structures. At first we present our general anomaly detection algorithm, then we introduce our two domain-adaptation solutions, specifically designed for the application scenarios described in Sect. 5.1.

Sparsity-Based Anomaly Detection
Our modeling assumption is that normal data s ∼ P_N can be well approximated as s ≈ Dx, where x ∈ R^n is sparse, namely it has only a few nonzero components, and the dictionary D ∈ R^{d×n} approximates the process P_N. The sparse representation x is computed by solving the sparse coding problem:

x = argmin_{x ∈ R^n} ‖s − Dx‖_2^2 + λ‖x‖_1,

where the ℓ1 norm ‖x‖_1 is used to enforce sparsity in x, and the parameter λ ∈ R controls the tradeoff between the ℓ1 norm and the reconstruction error ‖s − Dx‖_2^2. The dictionary D is typically unknown, and we have to learn it from a training set of normal data by solving the dictionary learning problem:

D = argmin_{D, X} ‖S_0 − DX‖_F^2 + λ Σ_i ‖x_i‖_1,   (5.7)

where S_0 ∈ R^{d×m} is the training set collecting normal data column-wise and X ∈ R^{n×m} stacks the corresponding sparse representations x_i.
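A minimal sparse coding routine can be sketched with ISTA (iterative soft-thresholding), a standard proximal-gradient method for this ℓ1-regularized problem; an optimized solver would be used in practice, and dictionary learning alternates such a coding step with an update of D.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_code(s, D, lam, n_iter=500):
    """ISTA for min_x ||s - D x||_2^2 + lam * ||x||_1."""
    L = np.linalg.norm(D, 2) ** 2              # spectral norm squared
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = x - D.T @ (D @ x - s) / L          # gradient step on the data term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / (2 * L), 0.0)  # l1 prox
    return x

# Toy check: s is built from 3 atoms of a random unit-norm dictionary.
D = rng.normal(size=(20, 30))
D /= np.linalg.norm(D, axis=0)
x_true = np.zeros(30)
x_true[[3, 11, 27]] = [1.0, -2.0, 1.5]
s = D @ x_true
x_hat = sparse_code(s, D, lam=0.05)
```

The recovered x_hat is sparse and reconstructs s accurately; increasing λ trades reconstruction quality for sparser codes.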
To assess whether s is generated from P_N or not, we define a bivariate indicator vector f ∈ R^2 collecting the reconstruction error and the sparsity of the representation:

f(s) = [ ‖s − Dx‖_2 , ‖x‖_1 ].   (5.8)

In fact, we expect that when s is anomalous it deviates from normal data either in the sparsity of the representation or in the reconstruction error. This means that f(s) would be an outlier w.r.t. the distribution φ_0 of f computed over normal data. Therefore, we detect anomalies as in (5.2) by setting T = −log(φ_0), where φ_0 is estimated from a training set of normal data (different from the set S_0 used in dictionary learning) using Kernel Density Estimation [5].

We evaluate our anomaly-detection algorithm on a dataset containing 45 SEM images. In this case s is a small square patch of 15×15 pixels extracted from the image. We analyze each patch independently to determine whether it is normal or anomalous. Since each pixel of the image is contained in more than one patch, we obtain several decisions for each pixel. To detect defects at the pixel level, we aggregate all the decisions through majority voting. An example of the obtained detections is shown in Fig. 5.4: our algorithm is able to localize all the defects while keeping the false positive rate small. More details on our solution and experiments are reported in [8].
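The decision rule on the indicator vectors can be sketched as follows. The bivariate indicators here are synthetic stand-ins (a hypothetical Gaussian cloud) for the (reconstruction error, sparsity) pairs computed on normal data, and SciPy's gaussian_kde plays the role of the KDE estimate of φ_0.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Stand-ins for f(s) = (reconstruction error, ||x||_1) over normal data.
f_normal = rng.normal(loc=[1.0, 5.0], scale=[0.2, 1.0], size=(2000, 2))

kde = gaussian_kde(f_normal.T)                 # KDE of phi_0; data shape (2, m)
T = lambda F: -np.log(kde(np.asarray(F).T))    # T = -log(phi_0); F shape (m, 2)

gamma = np.quantile(T(f_normal), 0.99)         # threshold for ~1% false positives

f_test = np.array([[1.1, 4.8],                 # normal-looking indicator
                   [2.0, 9.0]])                # large error and dense code
print(T(f_test) > gamma)                       # only the second one is flagged
```

Note that γ is set on a held-out set of normal indicators, mirroring the use of a validation set distinct from S_0.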
In the case of ECG monitoring, the data to be analyzed are the heartbeats, which are extracted from the ECG signal using traditional algorithms. Since our goal is to perform ECG monitoring directly on the wearable device, which has limited computational capabilities, we adopt a different sparse coding procedure, based on the ℓ0 "norm" and performed by means of greedy algorithms [11]. In particular, we proposed a novel variant of the OMP algorithm [9], specifically designed for dictionaries D ∈ R^{d×n} with n < d, which is the setting we adopt in ECG monitoring. Dictionary learning, which is required every time the device is positioned since the shape of normal heartbeats depends both on the user and on the device position, is performed on a host device, as the computational resources of wearable devices are not sufficient.
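For reference, a textbook OMP (greedy ℓ0 sparse coding) can be sketched as follows; this is the standard algorithm, not the thesis's device-oriented variant [9], and the undercomplete dictionary (n < d) mirrors the ECG setting. With orthonormal columns, three greedy steps recover the 3-sparse representation exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

def omp(s, D, k):
    """Orthogonal Matching Pursuit: greedily pick k atoms of D (unit-norm
    columns) and re-fit their coefficients by least squares at every step."""
    residual, support = s.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(D.T @ residual)))  # most correlated atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], s, rcond=None)
        residual = s - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

# Undercomplete dictionary (n < d, as in the ECG setting), orthonormal columns.
D = np.linalg.qr(rng.normal(size=(32, 24)))[0]
x_true = np.zeros(24)
x_true[[2, 7, 19]] = [1.5, -2.0, 0.7]
s = D @ x_true
x_hat = omp(s, D, k=3)
```

Greedy coding like this is attractive on wearable hardware because its cost is dominated by a few matrix-vector products and small least-squares solves.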

Multiscale Anomaly Detection
In quality inspection through SEM images, we have to face the domain-adaptation problem, since the magnification level of the microscope may change during monitoring. Therefore, we improve our anomaly-detection algorithm to make it scale-invariant. The three key ingredients we use are: (i) a multiscale dictionary able to describe patches extracted from images at different resolutions, (ii) a multiscale sparse representation that captures the structure of the patch at different scales, and (iii) a trivariate indicator vector, which is more powerful than the one in (5.8).
We build the multiscale dictionary as D = [D_1 | ... | D_L]. Each subdictionary D_j is learned by solving problem (5.7) over patches extracted from synthetically rescaled versions of the training images. Therefore, we can learn a multiscale dictionary even if the training images are acquired at a fixed scale. To compute the multiscale representation x we solve the following sparse coding problem, which combines the ℓ1 penalty with a group-sparsity one:

x = argmin_{x} ‖s − Dx‖_2^2 + λ‖x‖_1 + μ Σ_{j=1}^L ‖x_j‖_2,

where each x_j refers to the subdictionary D_j and x collects all the x_j. The group-sparsity term Σ_{j=1}^L ‖x_j‖_2 ensures that each patch is reconstructed using coefficients of x referring only to a few scales. Finally, we define a trivariate indicator vector f that includes the reconstruction error and the sparsity as in (5.8), together with the group-sparsity term Σ_j ‖x_j‖_2. We detect anomalies using the same approach described in Sect. 5.5.1. Our experiments [7] show that employing a multiscale approach in all the phases of the algorithm (dictionary learning, sparse coding and anomaly indicators) achieves good anomaly-detection performance even when training and test images are acquired at different scales.
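To see how the group-sparsity term selects scales, consider its proximal operator (block soft-thresholding), sketched here as a minimal self-contained example: each group of coefficients, i.e., each scale, is shrunk and possibly zeroed out as a whole.

```python
import numpy as np

def group_soft_threshold(x, groups, t):
    """Proximal operator of t * sum_j ||x_j||_2: each group x_j is shrunk
    toward zero and dropped entirely when ||x_j||_2 <= t."""
    out = np.zeros_like(x, dtype=float)
    for g in groups:
        nrm = np.linalg.norm(x[g])
        if nrm > t:
            out[g] = (1.0 - t / nrm) * x[g]
    return out

x = np.array([3.0, 4.0, 0.1, 0.2])   # two scales: coefficients (0,1) and (2,3)
out = group_soft_threshold(x, [[0, 1], [2, 3]], t=1.0)
print(out)   # [2.4 3.2 0.  0. ]: the weak scale is discarded as a whole
```

Inside a proximal solver, this operator is what drives entire scales of the representation to zero, which is exactly the behavior the multiscale model exploits.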

Dictionary Adaptation for ECG Monitoring
To make our anomaly-detection algorithm effective in long-term ECG monitoring, we have to adapt the user-specific dictionary. In fact, to keep the training procedure safe for the user, this dictionary is learned from heartbeats acquired in resting conditions, and it would not be able to describe normal heartbeats acquired during daily activities, when the heart rate increases and normal heartbeats get transformed, see Fig. 5.2b. These transformations are similar for every user: the T and P waves approach the QRS complex and the support of the heartbeat narrows down. Therefore, we adopt user-independent transformations F_{r,r_0} ∈ R^{d_r×d_{r_0}} to adapt the user-specific dictionary D_{u,r_0} ∈ R^{d_{r_0}×n}:

D_{u,r} = F_{r,r_0} D_{u,r_0},

where r_0 and r denote the heart rates in resting conditions and during daily activities, respectively, and u indicates the user. The transformation F_{r,r_0} depends only on the heart rates and is learned from collections S_{u,r} of training sets of normal heartbeats of several users acquired at different heart rates, extracted from publicly available datasets of ECG signals. The learning has to be performed only once, by solving the following optimization problem:

F_{r,r_0} = argmin_{F, {X_u}} Σ_u ( ‖S_{u,r} − F D_{u,r_0} X_u‖_F^2 + λ‖X_u‖_1 ) + μ_1 ‖F‖_1 + μ_2 ‖W ⊙ F‖_F^2,

where the first term ensures that the transformed dictionaries F D_{u,r_0} provide good approximations of the heartbeats of each user u. Moreover, we adopt three regularization terms: the first one is based on the ℓ1 norm and enforces sparsity in the representations X_u of the heartbeats at heart rate r for each user u. The other two terms represent a weighted elastic net penalization over F that improves the stability of the optimization problem, where the weighting matrix W introduces regularities in F_{r,r_0}. The learned F_{r,r_0} (for several values of r and r_0) are then hard-coded in the wearable device to perform online monitoring [18]. We evaluate our solution on ECG signals acquired using the Bio2Bit Dongle [18].
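The adaptation step itself is a single matrix product. The sketch below uses a hand-built linear-interpolation matrix as a crude stand-in for the learned F_{r,r_0} (the true transformation is learned, not fixed resampling); it only illustrates how a dictionary of length-d_{r_0} heartbeat atoms is mapped onto the shorter support d_r. All sizes are hypothetical.

```python
import numpy as np

def resample_matrix(d_out, d_in):
    """Linear-interpolation matrix F in R^{d_out x d_in}: a crude, hand-built
    stand-in for the learned F_{r,r0}, shrinking the heartbeat support."""
    F = np.zeros((d_out, d_in))
    for i in range(d_out):
        pos = i * (d_in - 1) / (d_out - 1)   # target sample in source coordinates
        lo = int(np.floor(pos))
        w = pos - lo
        F[i, lo] = 1.0 - w
        if lo + 1 < d_in:
            F[i, lo + 1] += w
    return F

d_r0, d_r, n = 128, 96, 8                    # hypothetical heartbeat lengths, atoms
rng = np.random.default_rng(0)
D_u_r0 = rng.normal(size=(d_r0, n))          # user dictionary at resting rate r0
F = resample_matrix(d_r, d_r0)
D_u_r = F @ D_u_r0                           # adapted dictionary at rate r
```

Because adaptation reduces to one matrix product per rate pair, precomputed F_{r,r_0} matrices fit naturally on a wearable device with limited resources.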
Our experiments [10] show that the reconstruction error remains small when the heart rate increases, implying that our transformations effectively adapt the dictionaries to operate at higher heart rates. Moreover, our solution correctly identifies anomalous heartbeats even at very high heart rates, when the morphology of normal heartbeats undergoes severe transformations.

Conclusions
We have addressed the general problem of monitoring a datastream to detect whether the data-generating process departs from normal conditions. Several challenges have to be addressed in practical applications, where data are high dimensional and feature complex structures. We tackle these challenges from two different perspectives, by making different modeling assumptions on the data-generating process.
At first, we assume that data can be described by a smooth probability density function and address the change-detection problem. We prove the detectability loss phenomenon, which relates the change-detection performance to the data dimension, and we propose QuantTree, one of the first algorithms enabling non-parametric monitoring of multivariate datastreams. In the second part, we employ dictionaries yielding sparse representations to model data featuring complex structures and propose a novel anomaly-detection algorithm. To make this algorithm effective in practical applications, we propose two domain-adaptation solutions that turn out to be very effective in long-term ECG monitoring and in the quality inspection of an industrial process.
Future work includes the extension of QuantTree to design truly multivariate change-detection algorithms, in particular to control the average run length instead of the false positive rate. Another relevant direction is the design of models that provide good representations for detection tasks. In fact, our dictionaries are learned to provide good reconstructions of the data, not to perform anomaly detection. Very recent works such as [26] show promising results using deep learning, but a general methodology is still not available.