Expected similarity estimation for large-scale batch and streaming anomaly detection
Abstract
We present a novel algorithm for anomaly detection on very large datasets and data streams. The method, named EXPected Similarity Estimation (EXPoSE), is kernel-based and able to efficiently compute the similarity between new data points and the distribution of regular data. The estimator is formulated as an inner product with a reproducing kernel Hilbert space embedding and makes no assumption about the type or shape of the underlying data distribution. We show that offline (batch) learning with EXPoSE can be done in linear time and online (incremental) learning takes constant time per instance and model update. Furthermore, EXPoSE can make predictions in constant time, while it requires only constant memory. In addition, we propose different methodologies for concept drift adaptation on evolving data streams. On several real datasets we demonstrate that our approach can compete with state-of-the-art algorithms for anomaly detection while being an order of magnitude faster than most other approaches.
Keywords
Anomaly detection, Large-scale data, Kernel methods, Hilbert space embedding, Mean map
1 Introduction
What is an anomaly? An anomaly is an element whose properties differ from the majority of other elements under consideration which are called the normal data. “Anomaly detection refers to the problem of finding patterns in data that do not conform to expected behavior. These nonconforming patterns are often referred to as anomalies[...]” (Chandola et al. 2009).
Typical applications of anomaly detection are network intrusion detection, credit card fraud detection, medical diagnosis and failure detection in industrial environments. For example, systems which detect unusual network behavior can be used to complement or replace traditional intrusion detection methods which are based on experts’ knowledge in order to defeat the increasing number of attacks on computer based networks (Kumar 2005). Credit card transactions which differ significantly from the usual shopping behavior of the card owner can indicate that the credit card was stolen or that a compromise of data associated with the account occurred (Aleskerov et al. 1997). The diagnosis of radiographs can be supported by automated systems to detect breast cancers in mammographic image analysis (Spence et al. 2001). Unplanned downtime of production lines caused by failing components is a serious concern in many industrial environments. Here anomaly detection can be used to detect unusual sensor information in order to predict possible faults and enable condition-based maintenance (Zhang et al. 2011). Novelty detection can be used to detect new interesting or unusual galaxies in astronomical data such as the Sloan Digital Sky Survey (Xiong et al. 2011).
Obtaining labeled training data for all types of anomalies is often too expensive. Imagine the labeling has to be done by a human expert or is obtained through costly experiments (Hodge and Austin 2004). In some applications anomalies are also very rare, as in air traffic safety or space missions. Hence, the problem of anomaly detection is typically unsupervised; however, it is implicitly assumed that the dataset contains only very few anomalies. This assumption is reasonable since it is quite often possible to collect large amounts of data for the normal state of a system, such as usual credit card transactions or the network traffic of a system not under attack.
Moreover, we will see that the proposed EXPoSE classifier can be learned incrementally, making it applicable to online and streaming anomaly detection problems. Learning on data streams directly is unavoidable in many applications such as network traffic monitoring, video surveillance and document feeds, as data arrives continuously in fast streams with a volume too large or impractical to store.
Only a few anomaly detection algorithms can be applied to large-scale problems and even fewer are applicable to streaming data. The proposed EXPoSE anomaly detector fills this gap.

We present an efficient anomaly detection algorithm, called EXPected Similarity Estimation (EXPoSE), with \(\mathcal {O}(n)\) training time, \(\mathcal {O}(1)\) prediction time and only \(\mathcal {O}(1)\) memory requirements with respect to the dataset size n.

We show that EXPoSE is especially suitable for parallel and distributed processing, which makes it scalable to very large problems.

We demonstrate how EXPoSE can be applied to online and streaming anomaly detection, while requiring only \(\mathcal {O}(1)\) time for a model update, \(\mathcal {O}(1)\) time per prediction and \(\mathcal {O}(1)\) memory.

We introduce two different approaches which allow EXPoSE to be efficiently used with the most common techniques for concept drift adaptation.

We evaluate EXPoSE on several real datasets, including surveillance, image data and network intrusion detection.
This is an extended and revised version of a preliminary conference report that was presented at the International Joint Conference on Neural Networks 2015 (Schneider et al. 2015). This work reviews the EXPoSE anomaly detection algorithm and provides a derivation that makes fewer assumptions on the input space and kernel function. It provides an online version of EXPoSE that is applicable to large-scale and evolving data streams. The experimental section is extended by comparing more algorithms on additional datasets, together with a statistical analysis of the results.
2 Problem definition
Even though there is a vast amount of literature on anomaly detection, there is no unique definition of what anomalies are and what exactly anomaly detection is. In this section we state the problem of anomaly detection in batch and streaming applications.
Definition 1
(Input Space) The input space for an observation X is a measurable space^{1} \((\mathcal {X},\mathscr {X})\) containing all values that X might take. We denote the realization after measurement of the random variable X with \(X=x\).
Definition 2
(Output/Label Space) In anomaly detection an observation \(X = x\) can belong to the class of normal data \({C}_{{N}}\) or can be an anomaly \({C}_{{A}}\). This is called the label of the observation and is denoted by the random variable Y. The collection of all labels is given by the measurable space \((\mathcal {Y},\mathscr {Y})\) called label space or output space (Fig. 1).
The distribution of the observation \(x\in \mathcal {X}\) is stochastic and depends on the label Y and hence is distributed according to \(\mathbb {P}_{X \vert Y}\).
Definition 3
(Prediction/Decision Space) Based on the outcome \(X=x\) of an observation, the objective of an anomaly detection algorithm is to make a prediction \(\vartheta \in \mathcal {Q}\), where the measurable space \((\mathcal {Q},\mathscr {Q})\) is called the prediction space or sometimes decision space.
The prediction space \(\mathcal {Q}\) is not necessarily equal to the label space \(\mathcal {Y}\). Especially in anomaly detection and classification, many algorithms calculate a probability or a score for a label. Such a score is called an anomaly score if it quantifies the likelihood of x belonging to \({C}_{{A}}\) and a normal score if it determines the degree of certainty to which x belongs to \({C}_{{N}}\).
Definition 4
(Classifier/Predictor) A measurable function \(\eta :(\mathcal {X},\mathscr {X})\rightarrow (\mathcal {Q},\mathscr {Q})\) is called a classifier or predictor.
A classifier calculates a prediction for an observation \(X=x\). In the context of anomaly detection our goal is to find a good predictor which can distinguish normal from anomalous data. However, the distribution \(\mathbb {P}_X \otimes \mathbb {P}_Y\) is typically unknown and hence we have to build a classifier solely based on observations. The estimation of such a functional relationship between the input space \(\mathcal {X}\) and the prediction space \(\mathcal {Q}\) is called learning or training.
Definition 5
(Batch Anomaly Detection)

The objective is to estimate a predictor \(\eta \) based on an unlabeled training set.

The training set contains mostly normal instances from \(\mathbb {P}_{X \vert Y={C}_{{N}}}\) and only a few anomalies as, by definition, anomalies are rare events.

It is assumed that the algorithm has complete access to all n elements of the dataset at once.

We may have access to a small labeled fraction of the training data to configure our algorithm.
Definition 6
(Streaming Anomaly Detection)

The data stream is possibly infinite, which requires the algorithm to learn incrementally since it is not possible to store the whole stream.

Most instances in the data stream belong to the class of normal data and anomalies are rare.

The stream can evolve over time, forcing algorithms to adapt to changes in the data distribution.

Only a small time frame at the beginning of the stream is available to configure the algorithm’s parameters.

We have to instantly make a prediction \(\eta _t(x_t)\) as soon as an observation \(x_t\) is available. This requires that predictions can be made fast.
3 Related work
Many approaches from statistics and machine learning can be used for anomaly detection (Chandola et al. 2009; Gupta et al. 2014), but only a few are applicable to high-dimensional, large-scale problems, where a vast amount of information has to be processed. We review several algorithms with a focus on their computational complexity and memory requirements.
3.1 Distribution based models
One of the oldest statistical methods for anomaly detection is the kernel (or Parzen) density estimator (KDE). With \(\mathcal {O}(n)\) time for predictions the KDE is too slow for large amounts of data and is known to be problematic in the case of increasing data dimensionality (Gretton et al. 2012). Fitting parametric distributions such as the Normal, Gamma, etc. is problematic since, in general, the underlying data distribution is unknown. Therefore, a mixture of Gaussians is often used as a surrogate for the true distribution as, for example, done by SmartSifter (Yamanishi et al. 2004). SmartSifter can handle multivariate data with both continuous and categorical observations. The main disadvantage of this approach is the high number of parameters required for the mixture model, which grows quadratically with the dimension (Tax 2001).
3.2 Distance based models
Distance based models are popular since most of them are easy to implement and interpret. Knorr et al. (2000) and Knorr and Ng (1998) label an observation as a distance based outlier (anomaly) if at least a fraction of points in the dataset have a distance of more than a threshold (based on the fraction) to this point. The authors proposed two simple algorithms which both have \(\mathcal {O}(n^2)\) runtime and a cell-based version which runs linear in n, but exponential in the dimension d. Ramaswamy et al. (2000) argue that the threshold can be difficult to determine and propose an outlier score which is simply the distance from a query point to its kth nearest neighbor. The algorithm is called KNNOutlier and suffers from the problem of efficient nearest neighbor search. If the input space is of low dimension and n is much larger than \(2^d\), then finding one nearest neighbor in a k-d tree with randomly distributed points takes \(\mathcal {O}(\log n)\) time on average. However, this does not hold in high dimensions, where such a tree is not better than an exhaustive search with \(\mathcal {O}(n)\) (Goodman and O’Rourke 2004). Also, the algorithm proposed by Ramaswamy et al. is only used to identify the top outliers in a given dataset. An alternative algorithm was proposed by Angiulli and Pizzuti (2002) using the sum of distances from a point to its k nearest neighbors. Ott et al. (2014) simultaneously perform clustering and anomaly detection in an integer programming optimization task.
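The kth nearest neighbor score described above can be sketched with a brute-force search. The function name `knn_outlier_score` and the toy data are illustrative assumptions and are not taken from Ramaswamy et al. (2000); the exhaustive scan corresponds to the \(\mathcal {O}(n)\) per-query cost mentioned in the text.

```python
import math

def knn_outlier_score(query, data, k):
    """Distance from `query` to its k-th nearest neighbor in `data` (brute force)."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    # Exhaustive scan: one distance per stored point, then pick the k-th smallest.
    distances = sorted(math.dist(query, x) for x in data)
    return distances[k - 1]

# Hypothetical toy data: a tight cluster near the origin.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
inlier_score = knn_outlier_score((0.05, 0.05), data, k=2)
outlier_score = knn_outlier_score((3.0, 3.0), data, k=2)
```

A larger score means the query lies farther from the bulk of the data, so the point at (3, 3) receives a much larger score than the point inside the cluster.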
Popular approaches from data mining for distance based novelty detection on streams are OLINDDA (Spinosa et al. 2007) and its extension MINAS (Faria et al. 2013), which both represent normal data as a union of spheres obtained by clustering. This representation becomes problematic if data within one cluster exhibits high variance, since then the decision boundary becomes too large to detect novelties. Both algorithms are designed to incorporate novel classes into their model of normal data and are hence barely applicable to anomaly detection.
The STream OutlieR Miner (STORM) (Angiulli and Fassetti 2007, 2010) offers an efficient solution to the problem of distance-based outlier detection over windowed data streams using a new data structure called Indexed Stream Buffer. Continuous Outlier Detection (COD; Kontaki et al. 2011) aims to further improve the efficiency of STORM by reducing the number of range queries.
3.3 Density based models
Nearest neighbor data description (Tax 2001) approximates a local density while using only distances to its first neighbor. The algorithm is very simple and often used as a baseline. It is also relatively slow approaching \(\mathcal {O}(n)\) per prediction. More sophisticated is the local density based approach called Local Outlier Factor (LOF; Breunig et al. 2000). It considers a point to be an anomaly if there are only relatively few other points in its neighborhood. LOF was extended to work on data streams (Pokrajac 2007), however both (the batch and incremental approach) are relatively slow with training time between \(\mathcal {O}(n \log n)\) and \(\mathcal {O}(n^2)\) and \(\mathcal {O}(n)\) memory consumption.
The angle based outlier detection for high-dimensional data (ABOD) proposed by Kriegel and Zimek (2008) is able to outperform LOF, but requires \(\mathcal {O}(n^2)\) time per prediction with the exact model and \(\mathcal {O}(n+k^2)\) if the full dataset is replaced by the k nearest neighbors of the query point (FastAbod).
3.4 Classification and tree based models
The One-class support vector machine (OCSVM; Schölkopf et al. 2001; Tax and Duin 2004) is a kernel based method which attempts to find a hyperplane such that most of the observations are separated from the origin with maximum margin. This approach does not scale very well to large datasets where predictions have to be made with high frequency. As Steinwart (2003) showed, the number of support vectors can grow linearly with the dataset size. There exist One-class support vector machines which can be learned incrementally (Gretton and Desobry 2003).
Hoeffding Trees (Domingos and Hulten 2000) are anytime decision trees to mine high-speed data streams. The Hoeffding Trees algorithm is not applicable to the unsupervised anomaly detection problem considered in this work since it requires the availability of class labels. Streaming Half-Space Trees (HSTa; Tan et al. 2011) randomly construct a binary tree structure without any data: the algorithm selects a dimension at random and splits it in half. Each tree then counts the number of instances from the training set at each node, referred to as “mass”. The score for a new instance is then proportional to the mass in the leaf which the new instance reaches after passing down the tree. An ensemble of such trees can be built in constant time and the training is linear in n. However, randomly splitting a very high-dimensional space will not yield a tree sufficiently fine-grained for anomaly detection. The RS-Forest (Wu et al. 2014) is a modification of HSTa in which each dimension is not split in half, but at a random cut-point. Also the assumption that “[...] once each instance is scored, streaming RS-Forest will receive the true label of the instance [...]” (Wu et al. 2014) does not always hold. The Isolation Forest (iForest) is an algorithm which uses a tree structure to isolate instances (Liu et al. 2012). The anomaly score is based on the path length to an instance. iForests achieve a constant training time and space complexity by subsampling the training set to a fixed size. The characteristics of the most relevant anomaly detection algorithms are summarized in Table 1. All complexities are given with respect to the dataset size n in high-dimensional spaces.
Table 1 Comparison of anomaly detection techniques
Algorithm  Training  Prediction  Memory  Batch  Online  Streaming  Problem size

EXPoSE  \(\mathcal {O}(n)\)  \(\mathcal {O}(1)\)  \(\mathcal {O}(1)\)  ✓  ✓  ✓  Large
OCSVM  \(\mathcal {O}(n^2)\)  \(\mathcal {O}(n)\)  \(\mathcal {O}(n)\)  ✓  ✓  ✗  Medium
LOF  \(\mathcal {O}(n^2)\)  \(\mathcal {O}(n)\)  \(\mathcal {O}(n)\)  ✓  ✓  ✗  Medium
KDE  \(\mathcal {O}(1)\)  \(\mathcal {O}(n)\)  \(\mathcal {O}(n)\)  ✓  ✓  ✗  Small
FastAbod  \(\mathcal {O}(n)\)  \(\mathcal {O}(n)\)  \(\mathcal {O}(n)\)  ✓  ✗  ✗  Small
iForest  \(\mathcal {O}(1)\)  \(\mathcal {O}(1)\)  \(\mathcal {O}(1)\)  ✓  ✗  ✗  Large
STORM  \(\mathcal {O}(n)\)  \(\mathcal {O}(1)\)  \(\mathcal {O}(1)\)  ✗  ✗  ✓  Small
COD  \(\mathcal {O}(n)\)  \(\mathcal {O}(1)\)  \(\mathcal {O}(1)\)  ✗  ✗  ✓  Small
HSTa  \(\mathcal {O}(n)\)  \(\mathcal {O}(1)\)  \(\mathcal {O}(1)\)  ✗  ✗  ✓  Medium
4 Expected similarity estimation
As before, let X be a random variable taking values in a measurable space \((\mathcal {X},\mathscr {X})\). We are primarily interested in the distribution of normal data \(\mathbb {P}_{X \vert Y={C}_{{N}}}\) for which we will simply use the shorthand notation \(\mathbb {P}\) in the remainder of this work. Next we introduce some definitions which are necessary in the following.
Throughout this work we assume that the reproducing kernel Hilbert space \((\mathcal {H},\langle \cdot ,\cdot \rangle )\) is separable such that \(\phi \) is measurable. We therefore assume that the input space \(\mathcal {X}\) is a separable topological space and the kernel k on \(\mathcal {X}\) is continuous, which is sufficient for \(\mathcal {H}\) to be separable (Steinwart and Christmann 2008, Lemma 4.33).
As mentioned in the introduction, EXPoSE calculates a score which can be interpreted as the likelihood of an instance \(z\in \mathcal {X}\) belonging to the distribution of normal data \(\mathbb {P}\). It uses a kernel function k to measure the similarity between instances of the input space \(\mathcal {X}\).
Definition 7
Intuitively, the query point z is compared to all other points of the distribution \(\mathbb {P}\). We will show that this equation can be rewritten as an inner product between the feature map \(\phi (z)\) and the kernel mean map \(\mu [\mathbb {P}]\) of \(\mathbb {P}\). This reformulation is of central importance and will enable us to efficiently compute all quantities of interest. Given a reproducing kernel k, the kernel mean map can be used to embed a probability measure into an RKHS where it can be manipulated efficiently. It is defined as follows.
Definition 8
Theorem 1
This reformulation has several desirable properties. At this point we see how the EXPoSE classifier can make predictions in constant time. After the kernel mean map \(\mu [\mathbb {P}]\) of \(\mathbb {P}\) is learned, EXPoSE only needs to calculate a single inner product in \(\mathcal {H}\) to make a prediction. However, there are some crucial aspects to consider: in Hilbert spaces, integrals and continuous linear forms are not in general interchangeable. In the proof of Theorem 1 we will thus use the weak integral and show that it coincides with the strong integral (see Definitions 9 and 10 in the “Appendix”). A sufficient condition for this is provided by the following lemma.
Lemma 1
If \(\phi \) is strong (Bochner) integrable then \(\phi \) is weak (Pettis) integrable and the two integrals coincide. (Aliprantis and Border 2006, Theorem 11.50)
We are now in the position to prove Theorem 1.
Proof (Theorem 1)
4.1 Parallel and distributed processing
Parallel and distributed data processing is the key to scalable machine learning algorithms. The formulation of EXPoSE as \(\eta (z) = \langle \phi (z),\mu [\mathbb {P}_n] \rangle \) is especially appealing for this kind of operation. We can use an SPMD (single program, multiple data) technique to achieve parallelism. One of the first programming paradigms along this line is Google’s MapReduce for processing large data sets on a cluster (Dean and Ghemawat 2008).
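Since an empirical mean map is an average of feature vectors, partial sums computed independently on separate chunks of the data can simply be added. The following is a minimal sketch of this map/reduce pattern, assuming a finite-dimensional explicit feature map; the quadratic feature map `phi`, the helper names and the toy data are illustrative assumptions, and the chunks are processed sequentially here in place of real workers.

```python
from functools import reduce

def phi(x):
    """Explicit feature map of the kernel k(x, y) = (x . y)^2 (illustrative choice)."""
    return [a * b for a in x for b in x]

def partial_sum(chunk):
    """Map step: sum of feature vectors and instance count of one chunk."""
    total = [0.0] * len(phi(chunk[0]))
    for x in chunk:
        total = [t + f for t, f in zip(total, phi(x))]
    return total, len(chunk)

def combine(a, b):
    """Reduce step: add two partial sums and their counts."""
    return [u + v for u, v in zip(a[0], b[0])], a[1] + b[1]

# Hypothetical toy data, split into one chunk per worker.
data = [(1.0, 2.0), (0.5, -1.0), (2.0, 0.0), (-1.0, 1.0), (0.0, 3.0)]
chunks = [data[:2], data[2:4], data[4:]]

total, n = reduce(combine, map(partial_sum, chunks))
mean_map = [t / n for t in total]   # average feature vector over all chunks
```

Each worker only needs to return a vector of fixed dimension and a count, so the communication cost does not grow with the size of its chunk.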
4.1.1 Summary
In this section we derived the EXPoSE anomaly detection algorithm. We showed how EXPoSE can be expressed as an inner product \(\langle \phi (z),\mu [\mathbb {P}_n] \rangle \) between the kernel mean map of \(\mathbb {P}\) and the feature mapping of a query point \(z\in \mathcal {X}\) for which we need to make a prediction. Evaluating this inner product takes constant time while estimating the model \(\mu [\mathbb {P}_n]\) can be done in linear time and with constant memory. We will explain the calculation of \(\phi \) in more detail in Sect. 6 and will now explore how EXPoSE can be learned incrementally and applied to large-scale data streams.
5 Online and streaming EXPoSE
In this section we will show how EXPoSE can be used for online and streaming anomaly detection. To recap, a data stream is an often infinite sequence of observations \((x_1,x_2,x_3,\cdots )\), where \(x_t \in \mathcal {X}\) is the instance arriving at time t. A source of such data can be, for example, continuous sensor readings from an engine or a video stream from surveillance cameras.
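The inner product formulation can be checked numerically for a kernel whose feature map is explicit and finite-dimensional. The sketch below assumes that the score of a query point is the average kernel similarity to the training points and compares that \(\mathcal {O}(n)\) computation with a single inner product against the averaged feature vectors. The quadratic kernel, the function names and the data are illustrative assumptions, not the setup used in the experiments.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kernel(x, y):
    """Homogeneous quadratic kernel, chosen because its feature map is explicit."""
    return dot(x, y) ** 2

def phi(x):
    """Feature map satisfying <phi(x), phi(y)> = kernel(x, y)."""
    return [a * b for a in x for b in x]

def fit_mean_map(data):
    """Average feature vector, computed in one pass over the data."""
    mu = [0.0] * len(phi(data[0]))
    for x in data:
        for j, f in enumerate(phi(x)):
            mu[j] += f / len(data)
    return mu

def score(z, mu):
    """Prediction as a single inner product; its cost does not depend on n."""
    return dot(phi(z), mu)

# Hypothetical toy data.
data = [(1.0, 0.0), (0.9, 0.1), (1.1, -0.1), (1.0, 0.2)]
mu = fit_mean_map(data)
z = (1.0, 0.1)

direct = sum(kernel(z, x) for x in data) / len(data)   # touches every training point
via_mean_map = score(z, mu)                            # touches only the model
```

Both values agree up to floating point error, while only the second one keeps the prediction cost independent of the number of training points.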

Require small constant time per instance.

Use only a fixed amount of memory, independent of the number of past instances.

Build a model using at most one scan over the data.

Make a usable predictor available at any point in time.

Ability to deal with concept drift.

For streams without concept drift, produce a predictor that is equivalent (or nearly identical) to the one that would be obtained by an offline (batch) learning algorithm.
Proposition 1
The EXPoSE model \(\mu [\mathbb {P}_n]\) can be learned incrementally, where each model update can be performed in \(\mathcal {O}(1)\) time and memory.
Proof
We see that online learning of EXPoSE increases neither the computational complexity nor the memory requirements of EXPoSE. We also emphasize that online learning yields exactly the same model as the EXPoSE offline learning procedure.
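One standard way to realize such an update is the running mean \(w_t = w_{t-1} + (\phi (x_t) - w_{t-1})/t\), which reproduces the batch average exactly. The sketch below assumes that the feature vectors \(\phi (x_t)\) are finite-dimensional and already computed; the class name `OnlineMeanMap` and the toy stream are hypothetical.

```python
class OnlineMeanMap:
    """Running average of feature vectors with a fixed-size state."""

    def __init__(self, dim):
        self.w = [0.0] * dim   # current model w_t
        self.t = 0             # number of instances seen so far

    def update(self, features):
        """Fold one feature vector phi(x_t) into the mean; cost independent of t."""
        self.t += 1
        self.w = [w + (f - w) / self.t for w, f in zip(self.w, features)]

    def score(self, features):
        """Inner product between phi(z) and the current model."""
        return sum(a * b for a, b in zip(features, self.w))

# Hypothetical stream of precomputed feature vectors.
stream = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
model = OnlineMeanMap(dim=2)
for features in stream:
    model.update(features)
```

After the three updates, `model.w` equals the batch mean of the three vectors, which illustrates the equivalence between online and offline learning stated above.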
5.1 Learning on evolving data streams
Sometimes it can be expected that the underlying distribution of the stream evolves over time. This is a property known as concept drift (Sadik and Gruenwald 2014). For example in environmental monitoring, the definition of “normal temperature” changes naturally with seasons. We can also expect that human behavior changes over time which requires us to redefine what anomalous actions are. In Fig. 2 we illustrate the difference between incremental learning as in Proposition 1 and a model which adapts itself to changes in the underlying distribution. In the following we will use \(w_t\) to denote the EXPoSE model at time t since the equation \(w_t=\mu [\mathbb {P}_t]\) will not necessarily hold when concept drift adaptation is implemented.
5.1.1 Windowing
Windowing is a straightforward technique which uses a buffer (the window) of \(l\in \mathbb {N}\) previous observations. Whenever a new observation is added to the window, the oldest one is discarded. We can efficiently implement windowing for EXPoSE as follows.
Proposition 2
Concept drift adaptation on data streams using a sliding window mechanism can be implemented for EXPoSE with \(\mathcal {O}(1)\) time and \(\mathcal {O}(l)\) memory consumption, where \(l\in \mathbb {N}\) is the window size.
Proof
The downside of a sliding window mechanism is the requirement to keep the past \(l\in \mathbb {N}\) events in memory. Also the sudden discard of a data point can lead to abrupt changes in predictions of the classifier which is sometimes not desirable. Another question is how to choose the correct window size. A shorter sliding window allows the algorithm to react faster to changes and requires less memory though the available data might not be representative or noise has too much negative impact. On the other hand a wider window may take too long to adapt to concept drift. The window size is therefore often dynamically adjusted (Widmer and Kubat 1996) or multiple competing windows are used (Lazarescu et al. 2004).
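A sliding window can be maintained by adding the newest feature vector to a running sum and subtracting the one that leaves the buffer, so each update touches two vectors regardless of the window size. This is a minimal sketch under the assumption of precomputed, finite-dimensional feature vectors; the class name `WindowedMeanMap` and the stream are hypothetical.

```python
from collections import deque

class WindowedMeanMap:
    """Mean of the feature vectors of the last `window_size` instances."""

    def __init__(self, dim, window_size):
        self.window_size = window_size
        self.buffer = deque()          # the l most recent feature vectors
        self.total = [0.0] * dim       # their running sum

    def update(self, features):
        """Add the newest vector and drop the oldest once the window is full."""
        self.buffer.append(features)
        self.total = [s + f for s, f in zip(self.total, features)]
        if len(self.buffer) > self.window_size:
            oldest = self.buffer.popleft()
            self.total = [s - f for s, f in zip(self.total, oldest)]

    @property
    def w(self):
        """Current model; defined once at least one instance has been seen."""
        return [s / len(self.buffer) for s in self.total]

# Hypothetical stream of precomputed feature vectors.
model = WindowedMeanMap(dim=2, window_size=2)
for features in [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]:
    model.update(features)
```

The buffer is what causes the \(\mathcal {O}(l)\) memory requirement. Over very long streams, repeated addition and subtraction can accumulate rounding error, so recomputing the sum from the buffer from time to time may be advisable.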
5.1.2 Gradual forgetting (decay)
Proposition 3
Concept drift adaptation on data streams using a forgetting mechanism can be implemented for EXPoSE in \(\mathcal {O}(1)\) time and memory.
Proof
This is a direct consequence from Proposition 1. \(\square \)
In general, weighting with a fixed \(\gamma \) or using a static window size is called blind adaptation since the model does not utilize information about changes in the environment (Gama 2010). The alternative is informed adaptation where one could, for example, use an external change detector (Gama et al. 2014) and weight new samples more if a concept drift was detected. We could also apply more sophisticated decay rules making \(\gamma \) a function of t or \(x_t\).
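Weighting with a fixed \(\gamma \) can be illustrated with an exponentially weighted update of the form \(w_t = \gamma \phi (x_t) + (1-\gamma ) w_{t-1}\). This particular rule is an assumption made for the sketch, as are the function name `decay_update` and the toy stream; feature vectors are taken to be precomputed and finite-dimensional.

```python
def decay_update(w, features, gamma):
    """Blend the new feature vector into the model with weight gamma in (0, 1]."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    return [gamma * f + (1.0 - gamma) * wi for wi, f in zip(w, features)]

# Hypothetical stream whose distribution changes after two instances.
w = [0.0, 0.0]
for features in [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]:
    w = decay_update(w, features, gamma=0.5)
```

With \(\gamma = 0.5\) the model moves from [0.75, 0] to [0.375, 0.5] after the last instance: older instances lose influence gradually and no buffer of past observations is required. A larger \(\gamma \) forgets faster.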
Comparison of online learning techniques for EXPoSE
Method  Pros  Cons

Prop. 1: online  ✓ Equivalent to batch version  ✗ No concept drift adaptation
Prop. 2: window  ✓ Concept drift adaptation  ✗ Possibly sudden changes in predictions; ✗ Difficult to choose window size; ✗ Increased memory requirements for window buffer
Prop. 3: decay  ✓ Concept drift adaptation; ✓ Gradual vanishing influence of outdated information; ✓ No increased memory requirements
5.2 Predictions on data streams
6 Approximate feature maps
6.1 Random kitchen sinks
6.2 Nyström’s approximation
An alternative to random kitchen sinks are Nyström methods (Williams and Seeger 2001) which project the data into a subspace \(\mathcal {H}_r \subset \mathcal {H}\) spanned by \(r \le n\) randomly chosen elements \(\phi (x_1), \cdots , \phi (x_r)\).
The Nyström approximation generally needs fewer basis functions r than the RKS approach (typically around 1000). However, the approximation is data dependent and hence becomes erroneous if the underlying distribution changes or when we are not able to get independent samples from the dataset. This is a problem for online learning and streaming applications with concept drift. We therefore suggest avoiding the Nyström feature map in this context.
Random kitchen sinks and the Nyström approximation are the most common feature map approximations. We refer to the corresponding literature for a discussion of other approximate feature maps such as Li et al. (2010), Vedaldi and Zisserman (2012) and Kar and Karnick (2012), which can be used for EXPoSE as well.
6.3 EXPoSE and approximate feature maps
We emphasize that with an efficient approximation of \(\phi \), as shown here, the training time of this algorithm is linear in the number of samples n and an evaluation of \(\eta (z)\) for predictions takes only constant time. Moreover, we need only \(\mathcal {O}(r)\) memory to store the model, which is also independent of n and the input dimension d.
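As an illustration of an approximate feature map, the sketch below uses random Fourier features for a Gaussian kernel, a common instance of random kitchen sinks, and stores the model as the mean of r-dimensional feature vectors. The bandwidth `sigma`, the number of features `r`, the helper names and the toy data are assumptions chosen for the example and are not the settings of the experiments.

```python
import math
import random

def make_rff(dim, r, sigma, seed=0):
    """Random Fourier features approximating exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # Frequencies drawn from N(0, sigma^-2 I), phases uniform on [0, 2 pi).
    W = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)] for _ in range(r)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(r)]
    scale = math.sqrt(2.0 / r)

    def phi(x):
        return [scale * math.cos(sum(wj * xj for wj, xj in zip(w, x)) + bi)
                for w, bi in zip(W, b)]
    return phi

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

phi = make_rff(dim=2, r=2000, sigma=1.0)

# Kernel approximation check on one pair of points.
x, y = (0.3, -0.2), (0.1, 0.4)
approx = dot(phi(x), phi(y))
exact = math.exp(-math.dist(x, y) ** 2 / 2.0)

# Model: the mean of r-dimensional feature vectors, hence O(r) memory.
data = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.2), (0.2, 0.1)]
mu = [sum(col) / len(data) for col in zip(*(phi(p) for p in data))]
inlier_score = dot(phi((0.05, 0.05)), mu)
outlier_score = dot(phi((4.0, 4.0)), mu)
```

The approximation error shrinks as r grows, so r trades accuracy against the cost of each feature map evaluation. The model size depends only on r, not on the number of training points.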
7 Experimental evaluation
In this section we show in several experiments how EXPoSE compares to other state-of-the-art anomaly detection techniques in prediction and runtime performance. We first explain which statistical tests are used to compare the investigated algorithms.
7.1 Statistical comparison of algorithms
When comparing multiple (anomaly detection) algorithms over multiple datasets one cannot simply compare the raw numbers obtained from the area under the receiver operating characteristic (AUC) or precision-recall curves. Webb (2000) warns against averaging these numbers: “It is debatable whether error rates in different domains are commensurable, and hence whether averaging error rates across domains is very meaningful” (Webb 2000).
As Demšar (2006) points out, it is also dangerous to use tests which are designed to compare a pair of algorithms for more than two: “A common example of such questionable procedure would be comparing seven algorithms by conducting all 21 paired t-tests [...]. When so many tests are made, a certain proportion of the null hypotheses is rejected due to random chance, so listing them makes little sense.” (Demšar 2006)
Demšar suggests using the Friedman test with the corresponding post-hoc Nemenyi test for the comparison of more than two classifiers over multiple data sets. We summarize this methodology in the following.
7.1.1 The Friedman test
7.1.2 The Nemenyi test
Demšar (2006) also suggests visually representing the results of the Nemenyi test in a critical difference diagram as in Fig. 4. In this diagram we compare 5 algorithms on 20 datasets against each other. Algorithms not connected by a bar have a significantly different performance.
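The ranking-based comparison can be sketched as follows, using the formulas for the Friedman statistic and the Nemenyi critical difference as given by Demšar (2006). The function names and the example scores are hypothetical; the value \(q_{0.05} = 2.728\) is the tabulated critical value for five classifiers at significance level 0.05.

```python
import math

def average_ranks(scores):
    """scores[i][j]: performance of algorithm j on dataset i (higher is better).
    Returns the mean rank of each algorithm; tied algorithms share the average rank."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2.0 + 1.0   # average of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def friedman_statistic(ranks, n_datasets):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4.0)

def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference: two algorithms differ if their ranks differ by more."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# Hypothetical scores of three algorithms on two datasets.
ranks = average_ranks([[0.9, 0.8, 0.8], [0.7, 0.9, 0.6]])
chi2 = friedman_statistic(ranks, n_datasets=2)

# Setting of the diagram described above: 5 algorithms on 20 datasets.
cd = nemenyi_cd(k=5, n_datasets=20, q_alpha=2.728)
```

For 5 algorithms and 20 datasets the critical difference is 1.364, so two algorithms whose average ranks differ by less than this value would be connected by a bar in the diagram.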
7.2 Batch anomaly detection
The aim of this experiment is to compare EXPoSE against iForest, OCSVM, LOF, KDE and FastAbod in terms of anomaly detection performance and processing time in a learning task without concept drift. In order to be comparable, we follow Liu et al. (2012) and perform an outlier selection task with the objective to identify anomalies in a given dataset.
7.2.1 Datasets
For performance analysis and evaluation we take the following datasets which are often used in literature for comparison of anomaly detection algorithms as for example in Schölkopf et al. (2001), Tax and Duin (2004), and Liu et al. (2012). We use several smaller benchmark datasets with known anomaly classes such as Ionosphere, Arrhythmia, Pima, Satellite, Shuttle (Lichman 2013), Biomed and Wisconsin Breast Cancer (Breastw) (Tax and Duin 2004). These datasets are set up as described in Liu et al. (2012) where all nominal and binary attributes are removed.
Batch dataset properties
Dataset  Size (n)  Dim. (d)  \({{C}_{{A}}}\) (anomaly class)  \({{C}_{{A}}}\) proportion (%)

KddCup  1,036,241  127  “Attack”  0.3 
ForestCover  286,048  10  Class 4 vs. 2  9 
MNIST 1  101,968  784  2,3,4,5,6,7,8,9  1 
MNIST 2  90,196  784  1,3,4,5,6,7,8,9  1 
MNIST 3  92,763  784  1,2,4,5,6,7,8,9  1 
MNIST 4  88,417  784  1,2,3,5,6,7,8,9  1 
MNIST 5  82,062  784  1,2,3,4,6,7,8,9  1 
MNIST 6  89,536  784  1,2,3,4,5,7,8,9  1 
MNIST 7  94,771  784  1,2,3,4,5,6,8,9  1 
MNIST 8  88,568  784  1,2,3,4,5,6,7,9  1 
MNIST 9  90,034  784  1,2,3,4,5,6,7,8  1 
SVHN 1  91,475  2592  2,3,4,5,6,7,8,9  1 
SVHN 2  75,466  2592  1,3,4,5,6,7,8,9  1 
SVHN 3  61,376  2592  1,2,4,5,6,7,8,9  1 
SVHN 4  51,135  2592  1,2,3,5,6,7,8,9  1 
SVHN 5  54,034  2592  1,2,3,4,6,7,8,9  1 
SVHN 6  41,965  2592  1,2,3,4,5,7,8,9  1 
SVHN 7  44,438  2592  1,2,3,4,5,6,8,9  1 
SVHN 8  35,709  2592  1,2,3,4,5,6,7,9  1 
SVHN 9  34,793  2592  1,2,3,4,5,6,7,8  1 
Shuttle  58,000  9  Classes 2,3,4,5,7  6 
Satellite  6435  36  Classes 2,4,5  32 
Pima  768  8  “Pos”  35 
Breastw  683  9  “Malignant”  35 
Arrhythmia  452  274  Classes 3,4,5,7,8,9,14,15  14 
Ionosphere  351  32  “Bad”  36 
Biomed  194  5  “Carrier”  34 
Batch anomaly detection performances [AUC]
Dataset  EXPoSE  iForest  OCSVM  LOF  KDE  FastAbod

KddCup  1.00  0.99  1.00  \(^\mathrm{a}\)  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
ForestCover  0.83  0.87  0.89  0.56  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 1  1.00  0.99  1.00  0.97  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 2  0.79  0.70  0.80  0.85  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 3  0.86  0.70  0.80  0.88  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 4  0.88  0.81  0.93  0.87  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 5  0.89  0.69  0.82  0.89  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 6  0.94  0.86  0.94  0.89  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 7  0.92  0.88  0.92  0.89  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 8  0.78  0.64  0.79  0.84  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 9  0.89  0.82  0.90  0.90  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 1  0.90  0.88  0.91  0.85  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 2  0.88  0.76  0.78  0.78  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 3  0.72  0.58  0.59  0.71  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 4  0.85  0.74  0.75  0.83  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 5  0.83  0.74  0.73  0.74  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 6  0.84  0.79  0.80  0.87  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 7  0.89  0.86  0.86  0.87  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 8  0.83  0.76  0.75  0.88  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 9  0.85  0.79  0.80  0.87  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
Shuttle  0.99  1.00  0.91  0.55  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
Satellite  0.79  0.70  0.62  0.57  0.78  0.74 
Pima  0.68  0.68  0.62  0.59  0.67  0.65 
Breastw  0.99  0.99  0.81  0.45  0.99  0.99 
Arrhythmia  0.79  0.80  0.71  0.68  0.74  0.79 
Ionosphere  0.92  0.85  0.66  0.89  0.81  0.93 
Biomed  0.87  0.83  0.76  0.69  0.88  0.88 
Average rank  1.85  3.48  2.90  3.06  4.93  4.77 
Batch anomaly detection runtimes [s]
expose  iForest  OCSVM  LOF  KDE  FastAbod  

KddCup  44  70  22213  \(^\mathrm{a}\)  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
ForestCover  29  24  25901  47  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 1  12  7  1976  23760  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 2  11  9  2773  18717  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 3  11  8  1991  20109  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 4  11  8  1159  17412  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 5  10  8  1892  15324  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 6  10  8  3091  18208  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 7  11  8  2727  20153  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 8  11  8  1607  18217  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
MNIST 9  10  7  2383  18542  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 1  24  11  9311  18192  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 2  20  10  10371  18144  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 3  16  9  14122  28508  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 4  13  10  4044  19247  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 5  14  11  10348  22359  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 6  11  7  5389  13166  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 7  11  8  6790  14906  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 8  9  7  4210  9741  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
SVHN 9  9  6  3831  9290  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
Shuttle  3  7  38  24  \(^\mathrm{b}\)  \(^\mathrm{b}\) 
Satellite  0  3  3  4  1  55 
Pima  1  2  0  0  0  9 
Breastw  1  3  0  0  0  4 
Arrhythmia  1  1  0  0  0  4 
Ionosphere  1  3  0  0  0  0 
Biomed  0  2  0  0  0  0 
In the experiment we provide a dedicated labeled random subset of 1 % or 2000 instances (whichever is smaller) to configure the algorithms' parameters. We emphasize that this subset is not used to evaluate the predictive performance. The parameter configuration is done by a pattern search (Torczon 1997) using cross-validation. Examples of parameters being optimized are the number of nearest neighbors in LOF and FastAbod, the kernel bandwidth of expose, KDE and OCSVM, or the number of trees for iForest (see Table 8 for a complete list of the parameters which are optimized). We do not optimize over different distance metrics and kernel functions, but use the most common Euclidean distance and squared exponential kernel, respectively. However, we remark that the choice of these functions offers a way to include domain and expert knowledge in the system. Each experiment is repeated five times and the resulting AUC scores are used to perform the Friedman test. Since iForest is a randomized algorithm, we conduct five trials in each repetition to obtain an average result. If not stated otherwise, we use expose in combination with Nyström’s approximation for batch anomaly detection and with random kitchen sinks in the streaming experiments, as discussed in Sect. 6.
7.2.2 Evaluation
With the AUC values we can perform the Friedman and post-hoc Nemenyi tests. The Friedman test confirms a statistically significant difference between the performances of the individual algorithms at a significance level of 0.05. From the critical difference diagram in Fig. 5 we observe that expose performs significantly better than iForest, FastAbod and KDE. While no significant difference in terms of anomaly detection performance between expose, OCSVM and LOF can be confirmed, expose is several orders of magnitude faster on large-scale, high-dimensional datasets.
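To make the ranking step behind the Friedman test concrete, the following is a minimal sketch in plain Python. It only uses the six small datasets of the batch AUC table for which all six algorithms produced a result, so its output is not expected to reproduce the average ranks reported in the table, which are computed over all datasets and repetitions. The function names are illustrative, and the statistic is the basic Friedman chi-square without tie correction.

```python
from statistics import mean

# AUC rows copied from the batch results table (datasets with results for all
# six algorithms). Column order: expose, iForest, OCSVM, LOF, KDE, FastAbod.
AUC = {
    "Satellite":  [0.79, 0.70, 0.62, 0.57, 0.78, 0.74],
    "Pima":       [0.68, 0.68, 0.62, 0.59, 0.67, 0.65],
    "Breastw":    [0.99, 0.99, 0.81, 0.45, 0.99, 0.99],
    "Arrhythmia": [0.79, 0.80, 0.71, 0.68, 0.74, 0.79],
    "Ionosphere": [0.92, 0.85, 0.66, 0.89, 0.81, 0.93],
    "Biomed":     [0.87, 0.83, 0.76, 0.69, 0.88, 0.88],
}


def ranks_desc(scores):
    """Rank scores so the highest gets rank 1; ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos..end
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def friedman_statistic(table):
    """Return the Friedman chi-square (no tie correction) and the average ranks."""
    rows = [ranks_desc(r) for r in table]
    n, k = len(rows), len(rows[0])
    avg = [mean(r[j] for r in rows) for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(a * a for a in avg) - k * (k + 1) ** 2 / 4)
    return chi2, avg


chi2, avg_ranks = friedman_statistic(list(AUC.values()))
print("average ranks:", [round(a, 2) for a in avg_ranks])
print("Friedman chi-square:", round(chi2, 2))
```

A lower average rank means better AUC across datasets; the post-hoc Nemenyi test then compares differences between these average ranks against a critical difference.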
7.3 Streaming anomaly detection
In this set of experiments we compare the streaming variants of expose against HSTa, STORM and COD. All of these algorithms are blind methods: they adapt their model at regular intervals without knowing whether a concept drift occurred. They can be combined with a concept drift detector to make the adaptation informed (Gama 2010).

Holdout: a dedicated subset of the data is held out and the algorithm is evaluated on it at regular time intervals. The holdout set must reflect the respective stream properties and therefore has to evolve with the stream in case of a concept drift.

Prequential: a prediction is made as each instance becomes available.^{3} A performance metric can then be applied based on the prediction and the actual label of the instance. Since predictions are made on the stream directly, no special actions have to be taken in case of concept drift.
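The prequential scheme can be sketched in a few lines of Python. The sketch below is illustrative only: `predict` and `update` are hypothetical callables standing in for any streaming detector, `RunningMeanDetector` is a toy model invented for the example, and balanced accuracy is chosen as the performance metric by assumption.

```python
def prequential_balanced_accuracy(stream, predict, update):
    """Test-then-train: predict each instance first, then let the model see it.

    stream  -- iterable of (x, is_anomaly) pairs in arrival order
    predict -- hypothetical callable x -> bool (True means "anomaly")
    update  -- hypothetical callable x -> None that adapts the model
    """
    tp = tn = fp = fn = 0
    for x, is_anomaly in stream:
        flagged = predict(x)  # the prediction is made before the update
        if is_anomaly and flagged:
            tp += 1
        elif is_anomaly:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
        update(x)  # blind update: no drift detector is consulted
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2


class RunningMeanDetector:
    """Toy stand-in model: flags values far away from the running mean."""

    def __init__(self, threshold):
        self.threshold, self.mean, self.n = threshold, 0.0, 0

    def predict(self, x):
        return self.n > 0 and abs(x - self.mean) > self.threshold

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n


stream = [(1.0, False), (1.1, False), (0.9, False), (9.0, True), (1.0, False)]
model = RunningMeanDetector(threshold=3.0)
score = prequential_balanced_accuracy(stream, model.predict, model.update)
print(score)
```

Because every instance is scored before the model has seen it, no separate test set has to be maintained, which is why no special handling is needed when the concept changes.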
7.3.1 Datasets
Similarly, Ho (2005) proposed the three digit data stream (TDDS), which contains four different concepts. Each concept consists of three digits of the USPS handwritten digits dataset.^{4} After all instances of concept 1 are processed, the stream switches to the second concept and so on until concept 4. We randomly inject 1 % anomalies into each concept and use the prequential method for evaluation to calculate the balanced accuracy.
Streaming dataset properties
#concepts  Concept drift type  Evaluation method  

SVHN  9  Sudden  Holdout 
Satellite  3  Sudden  Holdout 
Shuttle  2  Sudden  Holdout 
TDDS  4  Sudden  Prequential 
SDD  9  Smooth  Prequential 
7.3.2 Evaluation
In the following we denote expose with a sliding window (Sect. 5.1.1) and expose with gradual forgetting (Sect. 5.1.2) by wexpose and \(\gamma \)expose, respectively.
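The two adaptation strategies can be illustrated on a model that is simply a mean of feature vectors. Sect. 5.1 is not reproduced in this part of the text, so the update rules below are assumptions about how a sliding window and gradual forgetting are commonly realized for such a mean; the class names, `window` and `gamma` are hypothetical and chosen only for this sketch.

```python
from collections import deque


class WindowedMean:
    """Mean of the last `window` feature vectors (sliding-window variant)."""

    def __init__(self, dim, window):
        self.window = window
        self.buffer = deque()
        self.mean = [0.0] * dim

    def update(self, phi):
        self.buffer.append(phi)
        if len(self.buffer) > self.window:
            old = self.buffer.popleft()
            # Window full: swap the oldest vector for the newest one.
            self.mean = [m + (p - o) / self.window
                         for m, p, o in zip(self.mean, phi, old)]
        else:
            # Window still filling: ordinary running mean.
            n = len(self.buffer)
            self.mean = [m + (p - m) / n for m, p in zip(self.mean, phi)]


class ForgettingMean:
    """Exponentially weighted mean (gradual-forgetting variant)."""

    def __init__(self, dim, gamma):
        self.gamma = gamma
        self.mean = [0.0] * dim
        self.seen = False

    def update(self, phi):
        if not self.seen:
            # The first observation initializes the model.
            self.mean, self.seen = list(phi), True
        else:
            g = self.gamma
            self.mean = [g * p + (1 - g) * m for m, p in zip(self.mean, phi)]


windowed = WindowedMean(dim=1, window=2)
forgetting = ForgettingMean(dim=1, gamma=0.5)
for value in ([1.0], [3.0], [5.0]):
    windowed.update(value)
    forgetting.update(value)
print(windowed.mean, forgetting.mean)
```

In this sketch the windowed variant has to keep the last `window` vectors in memory so that the oldest one can be removed, whereas the forgetting variant only stores the current mean.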
Streaming anomaly detection performance [accuracy]
wexpose  \(\gamma \)expose  STORM  COD  HSTa  

Shuttle  0.88  0.88  0.75  0.74  0.89 
Satellite  0.89  0.88  0.78  0.79  0.88 
SVHN  0.71  0.73  0.68  0.68  0.66 
TDDS  0.71  0.71  0.67  0.64  0.67 
SDD  0.83  0.85  0.79  0.76  0.77 
Average rank  1.80  1.70  3.80  4.50  3.20 
A detailed illustration of the SVHN experiment is shown in Fig. 7. The predictive performance of all algorithms is relatively similar. It can be observed that, as the stream changes from one digit to another, the accuracy suddenly drops, which indicates that the current model is no longer valid. After a short period of time, the model adapts and the accuracy recovers. expose performs on average better than COD, STORM and HSTa. A possible explanation for this result is the sound foundation of our approach in probability theory. The suboptimal performance of HSTa indicates that the random binary trees constructed by HSTa are not sufficiently fine-grained for this high-dimensional dataset. This interpretation is supported by the experiments with the low-dimensional Shuttle and Satellite data, where HSTa performs better.
Although these results are promising, we recommend combining the techniques presented here with a concept drift detection technique to make informed model updates (Gama 2010).
8 Conclusion
We proposed a new algorithm, EXPoSE, to perform anomaly detection on very large-scale datasets and streams with concept drift. Although anomaly detection is a problem of central importance in many applications, only a few algorithms scale to the vast amounts of data we are often confronted with.
The expose anomaly detection classifier calculates a score (the likelihood of a query point belonging to the class of normal data) using the inner product between a feature map and the kernel embedding of probability measures. The kernel embedding technique provides an efficient way to work with probability measures without the need to make assumptions about the underlying distributions.
Despite its simplicity, expose has linear computational complexity for learning and can make predictions in constant time while requiring only constant memory. When applied incrementally or online, a model update can also be performed in constant time. We demonstrated that expose can be used as an efficient anomaly detection algorithm with the same predictive performance as the best state-of-the-art methods while being significantly faster than techniques with the same discriminative power.
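The score described above can be sketched in plain Python under stated assumptions. The sketch approximates the feature map of a squared exponential kernel with random Fourier features (the random kitchen sinks idea of Rahimi and Recht) and keeps the mean of the feature vectors as an approximate kernel embedding. The class name `ExposeSketch`, its parameters and the toy data are hypothetical; this is an illustration of the idea, not the authors' implementation.

```python
import math
import random


class ExposeSketch:
    """Expected-similarity score with random Fourier features (illustrative only)."""

    def __init__(self, dim, n_features=200, bandwidth=1.0, seed=0):
        rng = random.Random(seed)
        # Random projection for the squared exponential kernel
        # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)).
        self.w = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
                  for _ in range(n_features)]
        self.b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
        self.scale = math.sqrt(2.0 / n_features)
        self.mu = [0.0] * n_features  # approximate kernel mean embedding
        self.n = 0

    def feature_map(self, x):
        return [self.scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def update(self, x):
        """Model update whose cost does not depend on the number of points seen."""
        self.n += 1
        phi = self.feature_map(x)
        self.mu = [m + (p - m) / self.n for m, p in zip(self.mu, phi)]

    def fit(self, data):
        """One pass over the data, hence linear time in the number of points."""
        for x in data:
            self.update(x)
        return self

    def score(self, z):
        """Inner product <phi(z), mu>; larger means more similar to normal data."""
        return sum(p * m for p, m in zip(self.feature_map(z), self.mu))


rng = random.Random(1)
normal = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]
model = ExposeSketch(dim=2).fit(normal)
inlier_score = model.score([0.0, 0.0])
outlier_score = model.score([6.0, 6.0])
print(inlier_score, outlier_score)
```

The model state is a single vector of fixed length `n_features`, so memory use and prediction cost stay constant however many training points are processed.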
Footnotes
 1. A measurable space is a tuple \((\mathcal {X},\mathscr {X})\), where \(\mathcal {X}\) is a nonempty set and \(\mathscr {X}\) is a \(\sigma \)-algebra of its subsets. We refer the reader unfamiliar with this topic to Kallenberg (2006) for an overview.
 2. The notation \(k(x, \cdot )\) indicates that the second function argument is not bound to a variable.
 3. Prequential originates from predictive and sequential (Dawid 1984).
 4. See Ho (2005) for a detailed description of the TDDS dataset.
References
 Aleskerov, E., Freisleben, B., & Rao, B. (1997). Cardwatch: A neural network based database mining system for credit card fraud detection. In Computational intelligence for financial engineering. doi: 10.1109/cifer.1997.618940
 Aliprantis, C. D., & Border, K. (2006). Infinite dimensional analysis: A hitchhiker’s guide. Berlin: Springer Science & Business Media. doi: 10.1007/3540295879.
 Angiulli, F., & Fassetti, F. (2007). Detecting distance-based outliers in streams of data. In Proceedings of the sixteenth ACM conference on conference on information and knowledge management (pp. 811–820). New York: ACM. doi: 10.1145/1321440.1321552.
 Angiulli, F., & Fassetti, F. (2010). Distance-based outlier queries in data streams: The novel task and algorithms. Data Mining and Knowledge Discovery, 20(2), 290–324. doi: 10.1007/s1061800901599.
 Angiulli, F., & Pizzuti, C. (2002). Fast outlier detection in high dimensional spaces. In PKDD, Vol. 2 (pp. 15–26). Berlin: Springer. doi: 10.1007/3540456813_2.
 Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., & Gavaldà, R. (2009). New ensemble methods for evolving data streams. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 139–148). ACM. doi: 10.1145/1557019.1557041.
 Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In ACM sigmod record, Vol. 29 (pp. 93–104). ACM. doi: 10.1145/335191.335388.
 Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3), 1–58.
 Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27.
 Dawid, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society Series A (General) 278–292. doi: 10.2307/2981683.
 Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1), 107–113. doi: 10.1145/1327452.1327492.
 Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1–30.
 Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 71–80). ACM. doi: 10.1145/347090.347107.
 Domingos, P., & Hulten, G. (2001). Catching up with the data: Research issues in mining data streams. In DMKD.
 Faria, E.R., Gama, J., & Carvalho, A.C. (2013). Novelty detection algorithm for data streams multiclass problems. In Proceedings of the 28th annual ACM symposium on applied computing (pp. 795–800). ACM. doi: 10.1145/2480362.2480515.
 Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200), 675–701. doi: 10.1080/01621459.1937.10503522.
 Gama, J. (2010). Knowledge discovery from data streams. Boca Raton: CRC Press. doi: 10.1201/ebk1439826119c1.
 Gama, J., Žliobaite, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44. doi: 10.1145/2523813.
 Goodman, J. E., & O’Rourke, J. (2004). Handbook of discrete and computational geometry (Vol. 2). Boca Raton: CRC Press.
 Gretton, A., & Desobry, F. (2003). Online one-class support vector machines. An application to signal segmentation. In Acoustics, speech, and signal processing, 2003. Proceedings (ICASSP’03). 2003 IEEE international conference on, IEEE, vol. 2, pp. 2–709. doi: 10.1109/icassp.2003.1202465.
 Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773.
 Gupta, M., Gao, J., & Aggarwal, C. C. (2014). Outlier detection for temporal data: A survey. Knowledge and Data Engineering, IEEE Transactions on, 26(9), 2250–2267. doi: 10.1109/tkde.2013.184.
 Ho, S. S. (2005). A martingale framework for concept change detection in time-varying data streams. In Proceedings of the 22nd international conference on machine learning (pp. 321–327). ACM. doi: 10.1145/1102351.1102392.
 Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126. doi: 10.1023/b:aire.0000045502.10941.a9.
 Iman, R. L., & Davenport, J. M. (1980). Approximations of the critical region of the Friedman statistic. Communications in Statistics-Theory and Methods, 9(6), 571–595.
 Kallenberg, O. (2006). Foundations of modern probability. Berlin: Springer Science & Business Media. doi: 10.1007/b98838.
 Kar, P., & Karnick, H. (2012). Random feature maps for dot product kernels. In International conference on artificial intelligence and statistics, pp. 583–591.
 Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the international conference on very large data bases, citeseer (pp. 392–403).
 Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. The VLDB Journal the International Journal on Very Large Data Bases, 8(3–4), 237–253. doi: 10.1007/s007780050006.
 Kontaki, M., Gounaris, A., Papadopoulos, A. N., Tsichlas, K., & Manolopoulos, Y. (2011). Continuous monitoring of distance-based outliers over data streams. In Data engineering (ICDE), 2011 IEEE 27th international conference on, IEEE, pp. 135–146. doi: 10.1109/icde.2011.5767923.
 Kriegel, H. P., & Zimek, A. (2008). Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 444–452). ACM. doi: 10.1145/1401890.1401946.
 Kumar, V. (2005). Parallel and distributed computing for cybersecurity. IEEE Distributed Systems Online, 6(10), 1. doi: 10.1109/mdso.2005.53.
 Lazarescu, M. M., Venkatesh, S., & Bui, H. H. (2004). Using multiple windows to track concept drift. Intelligent Data Analysis, 8(1), 29–59.
 Le, Q., Sarlós, T., & Smola, A. J. (2013). Fastfood: Approximating kernel expansions in loglinear time. In Proceedings of the international conference on machine learning.
 Li, F., Ionescu, C., & Sminchisescu, C. (2010). Random Fourier approximations for skewed multiplicative histogram kernels. In Pattern recognition. Berlin: Springer. doi: 10.1007/9783642159862_27.
 Lichman, M. (2013). UCI machine learning repository. Irvine: University of California, School of Information and Computer Sciences. http://archive.ics.uci.edu/ml. Accessed 1 Jan 2016.
 Liu, F. T., Ting, K. M., & Zhou, Z. H. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1), 3. doi: 10.1145/2133360.2133363.
 Nemenyi, P. (1963). Distribution-free multiple comparisons. Ph.D. thesis, Princeton University.
 Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, p. 4.
 Ott, L., Pang, L., Ramos, F. T., & Chawla, S. (2014). On integrated clustering and outlier detection. In Advances in neural information processing systems (pp. 1359–1367).
 Pokrajac, D. (2007). Incremental local outlier detection for data streams. IEEE symposium on computational intelligence and data mining (pp. 504–515). doi: 10.1109/cidm.2007.368917.
 Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In Advances in neural information processing systems (pp. 1177–1184).
 Rahimi, A., & Recht, B. (2008). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in neural information processing systems (pp. 1313–1320).
 Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In ACM SIGMOD record, Vol. 29, pp. 427–438. ACM. doi: 10.1145/335191.335437.
 Sadik, S., & Gruenwald, L. (2014). Research issues in outlier detection for data streams. ACM SIGKDD Explorations Newsletter, 15(1), 33–40. doi: 10.1145/2594473.2594479.
 Schneider, M. (2016). Probability inequalities for kernel embeddings in sampling without replacement. In Proceedings of the nineteenth international conference on artificial intelligence and statistics.
 Schneider, M., Ertel, W., & Palm, G. (2015). Expected similarity estimation for large scale anomaly detection. In International joint conference on neural networks, IEEE, pp. 1–8. doi: 10.1109/ijcnn.2015.7280331.
 Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13(7), 1443–1471. doi: 10.1162/089976601750264965.
 Sejdinovic, D., Sriperumbudur, B., Gretton, A., & Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics, 41(5), 2263–2291. doi: 10.1214/13aos1140.
 Smola, A. J., Gretton, A., Song, L., & Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic learning theory. Berlin: Springer. doi: 10.1007/9783540754886_5.
 Spence, C., Parra, L., & Sajda, P. (2001). Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model. In Mathematical methods in biomedical image analysis, 2001. MMBIA 2001. IEEE workshop on, IEEE, pp. 3–10. doi: 10.1109/mmbia.2001.991693.
 Spinosa, E. J., de Leon, F., de Carvalho, A. P., & Gama, J. (2007). OLINDDA: A cluster-based approach for detecting novelty and concept drift in data streams. In Proceedings of the 2007 ACM symposium on applied computing (pp. 448–452). ACM.
 Steinwart, I. (2003). Sparseness of support vector machines. The Journal of Machine Learning Research, 4, 1071–1105.
 Steinwart, I., & Christmann, A. (2008). Support vector machines. Berlin: Springer.
 Tan, S. C., Ting, K. M., & Liu, T. F. (2011). Fast anomaly detection for streaming data. In IJCAI proceedings-international joint conference on artificial intelligence, Citeseer, Vol. 22, p. 1511.
 Tax, D. M. J. (2001). One-class classification. Ph.D. thesis, Technische Universiteit Delft.
 Tax, D. M. J., & Duin, R. P. W. (2004). Support vector data description. Machine Learning, 54(1), 45–66.
 Torczon, V. (1997). On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7(1), 1–25.
 Vedaldi, A., & Zisserman, A. (2012). Efficient additive kernels via explicit feature maps. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(3), 480–492. doi: 10.1109/cvpr.2010.5539949.
 Vondrick, C., Khosla, A., Malisiewicz, T., & Torralba, A. (2013). Hoggles: Visualizing object detection features. In Computer vision (ICCV), 2013 IEEE international conference on, IEEE, pp. 1–8. doi: 10.1109/iccv.2013.8.
 Webb, G. I. (2000). Multiboosting: A technique for combining boosting and wagging. Machine Learning, 40(2), 159–196.
 Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101. doi: 10.1007/bf00116900.
 Williams, C., & Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Proceedings of the 14th annual conference on neural information processing systems. MIT Press, EPFL-CONF-161322, pp. 682–688.
 Wu, K., Zhang, K., Fan, W., Edwards, A., & Yu, P. S. (2014). RS-forest: A rapid density estimator for streaming anomaly detection. In Data mining (ICDM), 2014 IEEE international conference on, IEEE, pp. 600–609. doi: 10.1109/icdm.2014.45.
 Xiong, L., Poczos, B., Schneider, J., Connolly, A., & Vander Plas, J. (2011). Hierarchical probabilistic models for group anomaly detection. In AISTATS 2011.
 Yamanishi, K., Takeuchi, J. I., Williams, G., & Milne, P. (2004). Online unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery, 8(3), 275–300. doi: 10.1145/347090.347160.
 Yu, H., Yang, J., & Han, J. (2003). Classifying large data sets using SVMs with hierarchical clusters. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 306–315). ACM. doi: 10.1145/956750.956786.
 Zhang, B., Sconyers, C., Byington, C., Patrick, R., Orchard, M. E., & Vachtsevanos, G. (2011). A probabilistic fault detection approach: Application to bearing fault detection. IEEE Transactions on Industrial Electronics, 58(5), 2011–2018. doi: 10.1109/tie.2010.2058072.