Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors
Abstract
Conventional wisdom in machine learning says that all algorithms are expected to follow the trajectory of a learning curve which is often colloquially referred to as 'the more data the better'. We call this 'the gravity of learning curve', and it is assumed that no learning algorithms are 'gravity-defiant'. Contrary to the conventional wisdom, this paper provides theoretical analysis and empirical evidence that nearest neighbour anomaly detectors are gravity-defiant algorithms.
Keywords
Learning curve · Anomaly detection · Nearest neighbour · Computational geometry · AUC

1 Introduction
Conventional wisdom in machine learning says that all algorithms are expected to follow the trajectory of a learning curve, though the actual rate of performance improvement may differ from one algorithm to another. We call this 'the gravity of learning curve', and it is assumed that no learning algorithms are 'gravity-defiant'.
Recent research (Liu et al. 2008; Zhou et al. 2012; Sugiyama and Borgwardt 2013; Wells et al. 2014; Bandaragoda et al. 2014; Pang et al. 2015) has provided an indication that some algorithms may defy the gravity of learning curve, i.e., they can learn a better-performing model from a small training set than from a large one. However, no concrete evidence of the 'gravity-defiant' behaviour has been provided in the literature, let alone the reason why these algorithms behave this way.
'Gravity-defiant' algorithms have a key advantage: they produce a good-performing model using a training set significantly smaller than that required by 'gravity-compliant' algorithms. They yield savings in time and memory space that the conventional wisdom thought impossible.
This paper focuses on nearest neighbour-based anomaly detectors because they have been shown to be one of the most effective classes of anomaly detectors (Breunig et al. 2000; Sugiyama and Borgwardt 2013; Wells et al. 2014; Bandaragoda et al. 2014; Pang et al. 2015). The contributions of this paper are to:
1. Provide a theoretical analysis of nearest neighbour-based anomaly detection algorithms which reveals that their behaviour defies the gravity of learning curve. As far as we know, this is the first analysis of learning curve behaviour in machine learning research that is based on computational geometry.
2. The theoretical analysis provides an insight into the behaviour of the nearest neighbour anomaly detector. In sharp contrast to the conventional wisdom of 'the more data the better', the analysis reveals that sample size has three impacts which the conventional wisdom has not considered. First, increasing the sample size increases the likelihood of anomaly contamination in the sample; and any inclusion of anomalies in the sample increases the false negative rate and thus lowers the AUC. Second, the optimal sample size depends on the data distribution. As long as the data distribution is not sufficiently represented by the current sample, increasing the sample size will improve the AUC. The optimal size is the number of instances that best represents the geometry of normal instances and anomalies; this gives the optimal separation between normal instances and anomalies, encapsulated as the average nearest neighbour distance to anomalies. Third, increasing the sample size decreases the average nearest neighbour distance to anomalies. Increasing beyond the optimal sample size reduces the separation between normal instances and anomalies below the optimum. This leads to decreased AUC and gives rise to the gravity-defiant behaviour.
3. Present empirical evidence of the gravity-defiant behaviour using three nearest neighbour-based anomaly detectors in the unsupervised learning context.
A. Some nearest neighbour anomaly detectors can achieve high detection accuracy with a significantly smaller sample size than others.
B. Any change in geometrical data characteristics which affects the detection error manifests as a change in nearest neighbour distance, such that the detection error and anomalies' nearest neighbour distances change in opposite directions. Because nearest neighbour distance can be measured easily while other indicators of detection accuracy are difficult to measure, it provides a uniquely useful practical tool for detecting change in domains where such changes are critical, e.g., in data streams. Note that the change in sample size (described in (2) above) does not alter the geometrical data characteristics discussed here.
The rest of the paper is organised as follows. The next section reviews current anomaly detectors which are reported to perform well using small training sets. Section 3 provides the theoretical analysis, and Sect. 4 analyses the influence of different factors, including changes in data characteristics. Section 5 discusses the implications of the theoretical analysis for three nearest neighbour-based anomaly detectors. The empirical methodology and evaluation are given in Sects. 6 and 7, respectively. A discussion of related issues is provided in Sect. 8, followed by the conclusion in the last section.
2 Anomaly detectors that perform well using small training sets
This section summarises the anomaly detectors in the literature which are reported to produce a good performing model using small training sets.
Isolation Forest or iForest (Liu et al. 2008) was one of the first reported anomaly detectors to achieve high performance using a training sample as small as 256 instances per model, in an ensemble of 100 models. On a dataset of over half a million instances, iForest could rank almost all anomalies at the top of the ranked list.
An information retrieval system called ReFeat (Zhou et al. 2012), which employs iForest as the ranking model, exhibits the same behaviour. It requires a subsample of only 8 instances to build each model, in an ensemble of 1000 models, to produce a high-performing retrieval system. ReFeat is shown to outperform three state-of-the-art information retrieval systems.
Applying the isolation approach to nearest neighbour algorithms, LiNearN (Wells et al. 2014) and iNNE (Bandaragoda et al. 2014) have an explicit training process that builds a model to isolate every instance in a small training sample, where the local region which isolates an instance is defined using the distance to the instance's nearest neighbour.^{1} Like iForest, this new nearest neighbour approach is shown to produce high-performing models using small training sets in anomaly detection. iNNE could produce competitive anomaly detection accuracy using an ensemble of 1000 models, each trained using only 2 instances, on a dataset of over half a million instances.
In an independent study, Sugiyama and Borgwardt (2013) have advocated the use of a nearest neighbour anomaly detector (kNN where \(k=1\)) which employs a small sample, and showed that it performed competitively with LOF (Breunig et al. 2000), which employs the entire dataset. They provide a theoretical analysis which yields a lower error bound to explain why a small dataset is sufficient to detect anomalies using a nearest neighbour anomaly detector. It reveals that most instances in a randomly selected subset are likely to be normal instances because the majority of instances in an anomaly detection dataset are normal instances. Finding the nearest neighbour from this subset and using the nearest neighbour distance as the anomaly score lead directly to good anomaly detection accuracy. A recent study shows that ensembles of 1NN can further improve the detection accuracy of 1NN (Pang et al. 2015).
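The sample-based 1NN detector described above admits a very short implementation. Below is a minimal sketch (function and parameter names are our own, not from Sugiyama and Borgwardt 2013): the anomaly score of a test instance is its Euclidean distance to the nearest neighbour in a small random subsample of the data.

```python
import numpy as np

def sample_1nn_scores(X_train, X_test, psi, rng=None):
    """Anomaly score = distance to the nearest neighbour in a random
    subsample of size psi; higher scores indicate likelier anomalies."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X_train), size=psi, replace=False)
    sample = X_train[idx]
    # pairwise Euclidean distances from each test instance to the subsample
    d = np.linalg.norm(X_test[:, None, :] - sample[None, :, :], axis=2)
    return d.min(axis=1)
```

Because most of a small random subsample is normal, a genuine anomaly is far from every subsample member and therefore receives a large score.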
While our analysis and the analysis by Sugiyama and Borgwardt (2013) are based on the same nearest neighbour anomaly detector, the approaches to the theoretical analyses are different and yield different outcomes. Our computational geometry approach reveals that the learning curve behaviour is substantially influenced by the geometry of the areas covering normal instances and anomalies in the data space. In addition, our analysis yields both the lower and upper bounds of the anomaly detector's detection accuracy. In contrast, the analysis based on the probabilistic approach (Sugiyama and Borgwardt 2013) is limited to the lower bound of the probability of perfect anomaly detection only. Moreover, our resultant bounds are represented by simple closed-form expressions, while their result contains a complex probability distribution. Our result enables an interpretation of the learning curve behaviours that are influenced by different components of data characteristics. Most importantly, we show explicitly that the nearest neighbour anomaly detector has the gravity-defiant behaviour, and its detection accuracy is influenced by three factors: the proportion of normal instances (or anomaly contamination rate), the nearest neighbour distances of anomalies in the dataset, and the sample size used by the anomaly detector, where the geometry of normal instances and anomalies is best represented by the sample of optimal size.
3 Theoretical analysis
In this section, we characterise the learning curve of a nearest neighbour anomaly detector and reveal its gravitydefiant behaviour through a theoretical analysis based on computational geometry.
Measuring the detection accuracy of the anomaly detector using area under the receiver operating characteristic curve (AUC), we show that the lower and upper bounds of its performance have simple closed-form expressions, and there are three factors which influence the AUC and explain the gravity-defiant behaviour.
3.1 Preliminary
Let a dataset D be sampled independently and randomly with respect to p(x). All instances in D are located within \({\mathcal X}\) according to Eq. (1). Further let \(D_N\) and \(D_A\) (in D) be sets of instances belonging to \({\mathcal X}_N\) and \({\mathcal X}_A\), respectively, i.e., \(D_N=\{x \in D \mid x \in {\mathcal X}_N\}\) and \(D_A=\{x \in D \mid x \in {\mathcal X}_A\}\). In anomaly detection, we assume that the size of \(D_A\) is substantially smaller than that of \(D_N\), i.e., \(|D_N| \gg |D_A|\).
Let \({\mathcal D}\) be a subsample set consisting of instances independently and randomly sampled from D, and let \({\mathcal D}_N\) and \({\mathcal D}_A\) be sets of normal instances and anomalies, respectively, in \({\mathcal D}\).
Note that the geometrical shapes and sizes of \({\mathcal X}\), \({\mathcal X}_N\) and \({\mathcal X}_A\) are independent of the data size of D, \({\mathcal D}\) and their subsets.
3.2 Definitions of anomaly detector and AUC
Using this decision rule, \(y \in {\mathcal D}_N\) contributes to correctly judging an anomaly x, i.e., a true positive; while \(y \in {\mathcal D}_A\) contributes to erroneously judging an anomaly x as normal, i.e., a false negative. In other words, any anomalies in \({\mathcal D}\) have the detrimental effect of reducing the true positive rate by increasing the number of false negatives. However, this effect is not very significant, because \({\mathcal D} \simeq {\mathcal D}_N\) holds from \(|D_N| \gg |D_A|\).
Let the AUC (area under the receiver operating characteristic curve) of the anomaly detector using \({\mathcal D}\) be AUC \(({\mathcal D})\). From the description in the last paragraph, AUC \(({\mathcal D})\) is lower than but very close to the AUC of the anomaly detector using \({\mathcal D}_N\), i.e., AUC \(({\mathcal D}_N)\). Accordingly, we investigate AUC \(({\mathcal D}_N)\) in place of AUC \(({\mathcal D})\) for ease of analysis by assuming AUC \(({\mathcal D}) \simeq \,\) AUC \(({\mathcal D}_N)\). This assumption is true to the extent of the probability of \({\mathcal D} \simeq {\mathcal D}_N\); and this probability is high when \({\mathcal D}\) is small as shown in Sugiyama and Borgwardt (2013).
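The AUC used throughout this analysis can be computed directly from anomaly scores as the probability that a randomly chosen anomaly is scored above a randomly chosen normal instance, with ties counting half. A minimal helper for this (our own, not part of the paper's formalism) is:

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC via the normalised Mann-Whitney U statistic; labels: 1 = anomaly."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # count anomaly/normal pairs ranked correctly, ties counting half
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranking of anomalies above normal instances yields an AUC of 1, and a random ranking yields about 0.5.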
3.3 Modeling \({\mathcal X}_N\) and \({\mathcal X}\) based on computational geometry
Here, we model \({\mathcal X}_N\) and \({\mathcal X}\) in \({\mathcal M}\) using computational geometry in relation to the anomaly score \(q(x;{\mathcal D}_N)\), and connect this model to the AUC. The idea is to use balls of radius r to cover the geometry occupied by normal instances and anomalies. The AUC can then be computed through integration from 0 to r.
Let the set of all points satisfying \(q(x;{\mathcal D}_N) \le r\) in \({\mathcal M}\) be a union of balls \(B_d(y,r)\) for all \(y \in {\mathcal D}_N\), where \(B_d(y,r)\) is a d-dimensional ball centred at y with radius r.
The AUC, governed by \(f(r;{\mathcal D}_N)\) and \(G(r;{\mathcal D}_N)\) in Eq. (3), can now be modelled as follows. The set of all points satisfying \(q(x;{\mathcal D}_N)=r\) in \({\mathcal M}\) can be modelled as a surface of a union of balls \(B_d(y,r)\) for all \(y \in {\mathcal D}_N\). Thus, \(\{x \in {\mathcal X}_A \mid q(x;{\mathcal D}_N)=r\}\), used to determine \(f(r;{\mathcal D}_N)\) in Eq. (4), is the intersection of the surface and \({\mathcal X}_A\). A similar modeling applies to \(\{x \in {\mathcal X}_N \mid q(x;{\mathcal D}_N)=s\}\), used to determine \(g(s;{\mathcal D}_N)\).
If r is between the two critical radii, the intersection \(\{x \in {\mathcal X}_A \mid q(x;{\mathcal D}_N)=r\}\) is not an empty set and some anomalies in \({\mathcal X}_A\) are judged correctly, and thus \(f(r;{\mathcal D}_N)>0\). Otherwise, the intersection is an empty set and \(f(r;{\mathcal D}_N)=0\). This implies that the AUC is solely governed by \(f(r;{\mathcal D}_N)\) and \(G(r;{\mathcal D}_N)\) with \(r \in [\rho _\ell ({\mathcal D}_N, {\mathcal X}_N),\rho _u({\mathcal D}_N, {\mathcal X})]\). The theoretical results stated in the next subsection show that these two radii play a key role in characterising the AUC of the nearest neighbour anomaly detector.
3.4 Characterisation of the AUC based on computational geometry
Here, we characterise the AUC of the nearest neighbour anomaly detector. The lower and upper bounds of AUC are formulated through the inradius and the covering radius of the balls centred at every instance in the given sample. The lower bound is derived from an extreme case where all normal instances are concentrated in a small area. The upper bound is derived from a more general case where the geometry is not simple and thus requires a large number of balls to cover. The lemmas and theorem are given below.
Let \(\psi \) and h be the cardinalities of \({\mathcal D}\) and \({\mathcal D}_N\), respectively. \(\psi \ge h \ge 0\) holds because of \({\mathcal D} \supseteq {\mathcal D}_N\). Further let \(S_d(x,r)\) be the surface of the ball \(B_d(x,r)\), and let \(B_d\) and \(S_d\) be the volume and the surface area of a d-dimensional unit ball. The following lemmas and theorem provide bounds of \(f(r;{\mathcal D}_N)\), \(G(r;{\mathcal D}_N)\) and AUC with reference to \(\rho _\ell ({\mathcal D}_N, {\mathcal X}_N)\) and \(\rho _u({\mathcal D}_N, {\mathcal X})\). Their proofs are provided in “Appendix 1”.
Lemma 1
Lemma 2
Lemma 3
Lemma 4
By combining these lemmas, we derive the following bounds of AUC for a given \({\mathcal D}_N\).
Lemma 5
We immediately obtain the following result by taking the expectation of the inequalities in Lemma 5.
Corollary 1
Theorem 1
As \((1-\alpha )\) is sufficiently close to 0, both \(O((1-\alpha )\psi ^2)\) and \(O((1-\alpha )\psi )\) can be ignored in the upper and lower bounds, respectively. By denoting \(\rho _{\delta }^b(\psi ) = \left\langle (\rho _u({\mathcal X})-\delta )^b \right\rangle _\psi - \left\langle (\rho _\ell ({\mathcal X}_N)+\delta )^b \right\rangle _\psi \), the upper and lower bounds can be expressed in the forms \(C_U \psi \alpha ^{\psi } \rho _{\delta }^b(\psi )\) and \(C_L \alpha ^{\psi } \rho _{\delta }^b(\psi )\), respectively, where \(b=d\) and \(\delta =0\) for the upper bound, and \(b=d+1\) and \(0< \delta < \delta _0\) for the lower bound; \(C_U\) and \(C_L\) are constants.
In plain language, the three factors can be interpreted as follows: \(1-\alpha ^\psi \) reflects the likelihood of anomaly contamination in the subsample set \({\mathcal D}\); \(\psi \) represents the number of balls used to represent the geometry of normal instances and anomalies; and \(\rho _\delta ^b(\psi )\) signifies the separation between anomalies and normal instances, represented by \(\psi \) balls.
3.5 Gravitydefiant behaviour
As revealed in the last section, the upper bound of AUC has two critical terms, \(\psi \) and \(\alpha ^{\psi }\), which are monotonic functions changing in opposite directions, i.e., as \(\psi \) increases, \(\alpha ^{\psi }\) decreases. Therefore, the AUC bounded by \(\psi \alpha ^{\psi }\) is expected to reach its optimum at some finite, positive \(\psi _{opt}\); and the anomaly detector will perform worse if the sample size used is larger than \(\psi _{opt}\), i.e., the gravity-defiant behaviour.
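This peak is easy to verify numerically. The sketch below locates the maximiser of \(\psi \alpha ^{\psi }\) for an assumed \(\alpha = 0.99\); the analytic maximiser of \(x a^x\) is \(x = -1/\ln a \approx 99.5\) here, so the bound rises and then falls as \(\psi \) grows:

```python
import math

def bound_shape(alpha, psi):
    """The psi * alpha**psi term that governs the AUC upper bound
    (multiplicative constants and the rho term dropped)."""
    return psi * alpha ** psi

alpha = 0.99  # assumed proportion of normal instances, for illustration
values = [bound_shape(alpha, p) for p in range(1, 1001)]
psi_opt = 1 + values.index(max(values))
# the discrete maximiser sits next to the analytic one, -1/ln(alpha)
assert abs(psi_opt + 1 / math.log(alpha)) <= 1
```

Beyond this \(\psi _{opt}\), the bound, and hence the achievable AUC, only deteriorates as the sample grows.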
In addition, \(\rho _u({\mathcal D}_N, {\mathcal X})\) and \(\rho _\ell ({\mathcal D}_N, {\mathcal X}_N)\) decrease when \(\psi \) increases, and \(\rho _u({\mathcal D}_N, {\mathcal X})\) is usually substantially larger than \(\rho _\ell ({\mathcal D}_N, {\mathcal X}_N)\) as depicted in Fig. 2. These behaviours are seen in many other examples shown later. Accordingly, \(\left\langle \rho _u({\mathcal X})^b \right\rangle _\psi \) dominates \(\left\langle \rho _\ell ({\mathcal X}_N)^b \right\rangle _\psi \) in most cases where \(b>1\). Thus, like \(\alpha ^\psi \), the term \(\rho _{\delta }^b(\psi )\) decreases as \(\psi \) increases. This indicates that \(\rho _{\delta }^b(\psi )\) is positive, smooth and anti-monotonic as \(\psi \) changes.
Thus, both \(\alpha ^\psi \) and \(\rho _{\delta }^b(\psi )\) decrease as \(\psi \) increases and wield a similar influence on the nearest neighbour anomaly detector to exhibit the gravity-defiant behaviour.
The lower bound indicates that the AUC always decreases as \(\psi \) increases. The actual behaviour of a 1NN anomaly detector follows this lower bound when \({\mathcal X}_N\) is concentrated in a small area, as in the case of a sharp Gaussian distribution, which yields \(\psi _{opt}=1\). This is evident from the proofs of Lemmas 3–5 in “Appendix 1”, where the lower bound comes from the data points in a small part of \({\mathcal X}_N\).
4 Analysis of factors which influence the AUC of the nearest neighbour anomaly detector
(a) The proportion of normal instances, \(\alpha \): According to the argument in the last section, the nearest neighbour anomaly detector is expected to improve its AUC at the rate \(\alpha ^{\psi }\) as the proportion of normal instances increases. The change in \(\alpha \) does not affect the other two factors, provided it does not affect the geometry of \({\mathcal X}_N\) and \({\mathcal X}\). In addition to the effect on the magnitude of AUC, Fig. 3 also shows that \(\psi _{opt}\) becomes larger as \(\alpha \) increases. This is because \(\alpha ^{\psi }\) increases as \(\alpha \) increases.
(b) The difference between the covering radius of \({\mathcal X}\) and the inradius of \({\mathcal X}_N\), i.e., \(\rho _{\delta }^b(\psi ) = \left\langle \rho _u({\mathcal X})^b \right\rangle _\psi - \left\langle \rho _\ell ({\mathcal X}_N)^b \right\rangle _\psi \): This factor depends on the geometry of normal clusters as well as anomalies, and influences the AUC in the following scenarios:
(b1) \({\mathcal X}_A\) becomes bigger. The change in \({\mathcal X}_A\) with fixed \({\mathcal X}_N\) directly affects \({\mathcal X}\). Examples of this change from Fig. 2 are shown in Fig. 4. The enlarged \({\mathcal X}_A\), thus the enlarged \({\mathcal X}\), leads to larger \(\rho _{\delta }^b(\psi )\) and higher AUC because the expected \(\rho _u({\mathcal D}_N,{\mathcal X})\) gets larger while the expected \(\rho _\ell ({\mathcal D}_N,{\mathcal X}_N)\) is fixed for a given \(\psi \).
(b2) \({\mathcal X}_N\) becomes bigger. The change in \({\mathcal X}_N\) with fixed \({\mathcal X}\) affects both the expected \(\rho _\ell ({\mathcal D}_N,{\mathcal X}_N)\) and the expected \(\rho _u({\mathcal D}_N,{\mathcal X})\). Examples of this change from Fig. 2 are depicted in Fig. 5. The enlarged \({\mathcal X}_N\) leads to smaller \(\rho _{\delta }^b(\psi )\) and thus lower AUC because the difference between the expected \(\rho _\ell ({\mathcal D}_N,{\mathcal X}_N)\) and the expected \(\rho _u({\mathcal D}_N,{\mathcal X})\) gets smaller, a result of the increased \(\rho _\ell ({\mathcal D}_N,{\mathcal X}_N)\) and the decreased \(\rho _u({\mathcal D}_N,{\mathcal X})\) for a given \(\psi \).
(b3) Number of clusters in \({\mathcal X}_N\) increases. If \({\mathcal X}_N\) consists of multiple well-separated clusters as shown in Fig. 6, then \(\left\langle \rho _\ell ({\mathcal X}_N)^b \right\rangle _\psi \) is determined by the minimum \(\rho _\ell ({\mathcal D}_N,{\mathcal X}_N)\) over the single clusters, regardless of the total volume or the number of clusters in \({\mathcal X}_N\). This is despite the fact that the total volume of \({\mathcal X}_N\) has increased from that in Fig. 2. The expected \(\rho _u({\mathcal D}_N,{\mathcal X})\) in Fig. 6 is less than that in Fig. 2 because the clusters are scattered in \({\mathcal X}\). With fixed \({\psi }\), the AUC is expected to decrease in Fig. 6 in comparison with that in Fig. 2 because of the decreased \(\left\langle \rho _u({\mathcal X})^b \right\rangle _\psi \), which is obvious in the change from Fig. 2b to Fig. 6b.
Anomalies' nearest neighbour distances: As indicated in (b1) and (b2), enlarging \({\mathcal X}_N\) has the same effect as shrinking \({\mathcal X}\) in decreasing the AUC. It is instructive to note that either of these two changes effectively reduces the anomalies' nearest neighbour distances because the area occupied by \({\mathcal X}_A\) decreases. This can be seen from \(\rho _{\delta }^b(\psi ) = \left\langle \rho _u({\mathcal X})^b \right\rangle _\psi - \left\langle \rho _\ell ({\mathcal X}_N)^b \right\rangle _\psi \), where \(\rho _{\delta }^b(\psi )\) changes in the same direction as the expected \(\rho _u({\mathcal D}_N,{\mathcal X})\) and in the opposite direction to the expected \(\rho _\ell ({\mathcal D}_N,{\mathcal X}_N)\). The nearest neighbour distance of an anomaly,^{3} which can be measured easily, is a proxy for \(\rho _{\delta }^b(\psi )\).
In a nutshell, any changes in \({\mathcal X}\) and \({\mathcal X}_N\) that matter, i.e., that ultimately vary the AUC, are manifested as changes in the anomalies' nearest neighbour distances (\(\varDelta _A\)). The AUC, \(\left\langle \rho _u({\mathcal X})^b \right\rangle _\psi - \left\langle \rho _\ell ({\mathcal X}_N)^b \right\rangle _\psi \) and \(\varDelta _A\) change in the same direction. Note that \(\rho _{\delta }^b(\psi )\) has the same effect as \(\alpha ^\psi \) in shifting \(\psi _{opt}\). But the influence of \(\rho _{\delta }^b(\psi )\) is more difficult to predict because it depends on the rates of decrease of the covering radius of \(\mathcal X\) and the inradius of \({\mathcal X}_N\), which in turn depend on the geometry of \({\mathcal X}\) and \({\mathcal X}_N\); and it is hard to measure in practice too.
(c) The sample size (\(\psi \)) used by the anomaly detector: The optimal sample size is the number of instances that best represents the geometry of normal instances and anomalies (\({\mathcal X}_N\) and \({\mathcal X}\)). The sample size also affects the two other factors, i.e., as \(\psi \) increases, both \(\alpha ^{\psi }\) and \(\rho _{\delta }^b(\psi )\) decrease. The direction of the change in AUC depends on the interaction between \(\psi \) and \(\alpha ^{\psi }\rho _{\delta }^b(\psi )\), which change in opposite directions. In general, as \(\psi \) increases from a small value, the AUC improves until it reaches the optimum. Further increase beyond \(\psi _{opt}\) degrades the AUC, which gives rise to the gravity-defiant behaviour. Note that the change in \(\psi \) does not alter the data characteristics (i.e., \(\alpha \), \({\mathcal X}_N\) and \({\mathcal X}\)).
Changes in AUC and \(\psi _{opt}\) as one data characteristic (\(\alpha \), \(\mathcal{X}\) or \(\mathcal{X}_N\)) changes. \({\varDelta }_A\) is the nearest neighbour distances of anomalies
Change in one data characteristic | \(\rho _u({\mathcal X})\) | \(\rho _\ell ({\mathcal X}_N)\) | \({\varDelta }_A\) | AUC | \(\psi _{opt}\)
(a) \(\alpha \) increases | = | = | = | \(\Uparrow \) | \(\Uparrow \)
(b1) \({\mathcal X}_A\) becomes bigger | \(\Uparrow \) | = | \(\Uparrow \) | \(\Uparrow \) | *
(b2) \({\mathcal X}_N\) becomes bigger | \(\Downarrow \) | \(\Uparrow \) | \(\Downarrow \) | \(\Downarrow \) | *
(b3) Number of clusters in \({\mathcal X}_N\) increases | \(\Downarrow \) | = | \(\Downarrow \) | \(\Downarrow \) | *
First half of the curve: For a dataset which requires a large \(\psi _{opt}\), the dataset size needs to be very large in order to observe the gravity-defiant behaviour. In the case that the data collected are not large enough, \(\psi _{opt}\) may not be achievable in practice.

Second half of the curve: This is observed on a dataset which requires a small \(\psi _{opt}\), e.g., \(\psi _{opt}=1\).
The above impacts are due to the change in sample size which, by itself, does not alter the anomaly contamination rate or the geometry of normal instances and anomalies in the given dataset [described in (a) and (b) above]. Any change in geometrical data characteristics which affects the AUC manifests as a change in nearest neighbour distance, such that the AUC and anomalies' nearest neighbour distances change in the same direction. Because nearest neighbour distance can be measured easily while other indicators of detection accuracy are difficult to measure, it provides a uniquely useful practical tool for detecting change in domains where such changes are critical, e.g., in data streams.
5 Does the theoretical result apply to other nearest neighbour-based anomaly detectors?
The above theoretical analysis is based on the simplest nearest neighbour (1NN) anomaly detector with a small sample. We believe that this result also applies to other nearest neighbour-based anomaly detectors, although a direct analysis is not straightforward in some cases.
We provide our reasoning as to why the theoretical result can be applied to three nearest neighbour-based anomaly detectors, i.e., an ensemble of nearest neighbours, a recent nearest neighbour-based ensemble method called iNNE (Bandaragoda et al. 2014), and k-nearest neighbour. These are provided in the following three subsections.
5.1 The effect of ensemble
A significant advantage of this technique is that the effect of the ensemble size t is almost solely limited to the variance of the estimation. In other words, we can easily control the variance by choosing an appropriate t with almost no other side effects.
Thus, the theoretical analysis also applies to aNNE, and it is expected to have the gravitydefiant behaviour.
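A minimal sketch of aNNE under the above description (names and signature are ours): the final score is the average, over t independent subsamples of size \(\psi \), of the distance to the nearest neighbour within each subsample.

```python
import numpy as np

def anne_scores(X_train, X_test, psi, t, rng=None):
    """aNNE-style score: mean nearest-neighbour distance over t
    independent subsamples of size psi; larger t reduces variance."""
    rng = np.random.default_rng(rng)
    total = np.zeros(len(X_test))
    for _ in range(t):
        idx = rng.choice(len(X_train), size=psi, replace=False)
        d = np.linalg.norm(X_test[:, None, :] - X_train[idx][None, :, :],
                           axis=2)
        total += d.min(axis=1)  # 1NN distance within this subsample
    return total / t
```

Each ensemble member is exactly the 1NN detector analysed above, so averaging changes the variance of the score but not its expectation.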
5.2 iNNE is a variant of aNNE
Like aNNE, a recent nearest neighbour-based method, iNNE (Bandaragoda et al. 2014), employs small samples and aggregates the nearest neighbour distances from all samples to compute the score for each test instance. iNNE is a variant of aNNE because iNNE can be interpreted as identifying anomalies as instances which have the furthest nearest neighbours, as in aNNE. The reasoning is given in the following paragraphs.
Conceptually, iNNE isolates every instance c in a random sample \(\mathcal {D}\) by building a hypersphere centred at c with radius \(\tau (c)\) in the training process. The hypersphere is defined as follows:
Let hypersphere \( B (c,\tau (c))\), centred at c with radius \(\tau (c)\), be \(\lbrace x : m(x,c) < \tau (c)\rbrace \), where \(\tau (c) = \min \limits _{y \in \mathcal {D} \setminus \{c\}}\ m(y,c)\), \(x \in {\mathcal M}\) and \(c \in \mathcal {D}\).
cnn(x) can be viewed as a variant of nearest neighbour of x because \(cnn(x) = \eta _x\), except in two conditions: (i) \(x \in B (cnn(x), \tau (cnn(x)))\), but \(x \notin B (\eta _x,\tau (\eta _x))\) when \(\tau (cnn(x)) \ge \tau (\eta _x)\); and (ii) cnn(x) could be nil or undefined when x is not covered by any hypersphere \(B (c,\tau (c))\), \(\forall c \in \mathcal {D}\).
The anomaly score for iNNE,^{4} \(q(x; \mathcal {D})\), is simply defined by \(\tau (cnn(x))\).
Anomalies identified by aNNE are the instances in D which have the longest distance to \(\eta _x\) in \(\mathcal D\). Similarly, anomalies identified by iNNE are the instances in D which have the longest distance to \(cnn(\cdot )\), i.e., their (variant of) nearest neighbours in \(\mathcal D\). Viewed from another perspective, anomalies identified by iNNE are those covered by the largest hyperspheres.
Conceptual comparison of aNNE and iNNE. The first column indicates the nearest neighbour of x, a variant of nearest neighbour cnn(x), and the nearest neighbour distance for a given \(\mathcal {D}\)
 | aNNE | iNNE
\(\eta _x\) | \(\mathop {\hbox {arg min}}\limits _{y \in {\mathcal D}}\ m(x,y)\) | —
cnn(x) | — | \(\mathop {\hbox {arg min}}\limits _{c \in \mathcal {D}} \{\tau (c) : x \in B (c, \tau (c)) \}\)
\(q(x; \mathcal {D})\) | \(m(x,\eta _x)\) | \(\tau (cnn(x))\)
For ease of reference, the algorithms for aNNE and iNNE are provided in “Appendix 2”. Note that the key differences between aNNE and iNNE in Algorithms 1 and 2 are in steps 1 and 4 only: the construction of hyperspheres in training and the version of nearest neighbour employed in evaluation.
As a consequence of the similarity between aNNE and iNNE, we expect that iNNE also has the gravitydefiant behaviour.
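The training and scoring steps of iNNE described above can be sketched as follows (a simplification of Algorithms 1 and 2 in “Appendix 2”; the fallback score for a test instance covered by no hypersphere is our assumed convention, since cnn(x) is undefined in that case):

```python
import numpy as np

def inne_train(sample):
    """Training: tau(c) is the distance from each centre c to its nearest
    neighbour within the sample, giving one hypersphere per instance."""
    d = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # exclude self-distance
    return sample, d.min(axis=1)

def inne_score(x, centres, tau):
    """Score tau(cnn(x)): the smallest radius among the hyperspheres
    covering x; the largest radius is returned when no hypersphere
    covers x (our assumed convention for the undefined case)."""
    d = np.linalg.norm(centres - x, axis=1)
    covered = d < tau
    if not covered.any():
        return tau.max()
    return tau[covered].min()
```

Anomalies fall only inside the large hyperspheres built around isolated sample instances, so they receive large scores, mirroring aNNE's large nearest-neighbour distances.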
5.3 Extension to kNN
The extension of our analysis to the anomaly detector using the k-nearest neighbour (kNN) distance is not straightforward. This is because the geometrical shape cast from the kNN distance cannot simply be characterised by the inradius and the covering radius of the data space. In addition, the optimal k is a monotonic function of the data size (see the discussion on bias-variance analysis in Sect. 8.2).
Nevertheless, 1NN and kNN have identical algorithmic procedures, except for a small operational difference in the decision function: using one or k nearest neighbours. Thus, we can expect both 1NN and kNN to have the same gravity-defiant behaviour. We show that this is the case when \(k=\sqrt{n}\) (the rule suggested by Silverman 1986).
5.4 Section summary
aNNE can be expected to behave similarly to 1NN, but with a lower variance; the size of the variance reduction is proportional to the ensemble size. Thus, the result of the theoretical analysis applies directly to aNNE. We shall refer to aNNE rather than 1NN hereafter, both in our discussion and empirical evaluation, because aNNE has a lower variance than 1NN.
The analyses of kNN and iNNE are not straightforward extensions of the analysis of aNNE, and the optimal k for kNN depends on the data size. Given that they are all based on the same basic operation, the nearest neighbour search, we can expect iNNE, aNNE and kNN to have the same behaviour in terms of learning curve.
In a nutshell, all three algorithms, aNNE, kNN and iNNE, can be expected to have the gravity-defiant behaviour. However, the sample size (\(\psi _{opt}\)) at which each arrives at its optimal detection accuracy is of great importance in choosing the algorithm to use in practice. We investigate this issue empirically in Sect. 7.
6 Experimental methodology
Algorithms used in the experiments are aNNE, iNNE and kNN (where the anomaly score is computed from the average distance of the k nearest neighbours (Bay and Schwabacher 2003)).
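A minimal sketch of this kNN scorer (ours, not the authors' code; k follows the \(\lfloor \sqrt{n} \rfloor \) rule used in the experiments below):

```python
import numpy as np

def knn_scores(X_train, X_test):
    """Anomaly score: average distance to the k nearest training
    instances, with k = floor(sqrt(n)) for n training instances."""
    n = len(X_train)
    k = max(1, int(np.sqrt(n)))
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    d.sort(axis=1)  # ascending distances per test instance
    return d[:, :k].mean(axis=1)
```

Unlike aNNE and iNNE, this is a single model over the whole training set, so its score depends on n through k.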
A learning curve is produced for each anomaly detector on each dataset.
aNNE and iNNE have two parameters: sample size \(\psi \) and ensemble size t. To produce a learning curve, the training data is constructed using a sample of size \(t\psi \), where \(t = 100\) and \(\psi \) is 1, 2, 5, 10, 20, 35, 50, 75, 100, 150, 200, 500 and 1000 for each point on the curve. The parameter k in kNN was set as \(k=\lfloor \sqrt{n} \rfloor \) (Silverman 1986, chapter 1),^{5} where n is the number of training instances (\(n=t \psi \)). Note that the minimum \(\psi \) setting is 1 for aNNE and kNN; but because each hypersphere iNNE builds derives its radius from the distance between the hypersphere's centre and its nearest neighbour, iNNE requires a minimum of two instances in each sample.
For an ensemble of t models, the total number of training instances employed is \(t \psi \). To train a single model such as kNN, a training set of \(t \psi \) instances is used in order to ensure that a fair comparison is made between an ensemble and a single model.
The Euclidean distance is used in all three algorithms.
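For reference, the kNN score described above (average distance to the k nearest training instances, with k defaulting to \(\lfloor \sqrt{n} \rfloor \)) can be sketched as below; the function name and the brute-force neighbour search are our own simplifications:

```python
import numpy as np

def knn_scores(train, test, k=None):
    """kNN anomaly score in the style of Bay and Schwabacher (2003):
    the average distance from x to its k nearest training instances.
    k defaults to floor(sqrt(n)) as in the experiments."""
    n = len(train)
    if k is None:
        k = max(1, int(np.floor(np.sqrt(n))))
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    # average of the k smallest distances per test point
    knn_d = np.sort(d, axis=1)[:, :k]
    return knn_d.mean(axis=1)
```

Unlike aNNE and iNNE, this single model consumes the whole training set of \(t\psi \) instances at once.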
Datasets used in the experiments
Data  Size n  d  anomaly class 

CoverType  286,048  10  class 4 (0.9 %) vs. class 2 
Mulcross  262,144  4  1 % anomalies 
Smtp  95,156  3  attack (0.03 %) 
U2R  60,821  34  attack (0.37 %) 
P53Mutant  31,159  5408  active (0.5 %) vs. inactive 
Mammography  11,183  6  class 1 (2.32 %) 
Har  4728  561  sitting, standing & laying (1.2 %) 
ALOI  100,000  64  0.553 % anomalies with 900 normal clusters 
The ALOI dataset has \(C=900\) normal clusters and 100 anomaly clusters, where each anomaly cluster has between 1 and 10 instances. It is used in Sects. 7.3–7.5 because it allows us to easily change \({\mathcal X}_N\) or \({\mathcal X}\) to examine the resultant effect on AUC as predicted by the theoretical analysis. Sections 7.1, 7.4 and 7.5 employ the ALOI \(C=10\) dataset where the ten normal clusters are randomly selected from the 900 clusters.
In every experiment, each dataset is randomly split into two equal-size stratified subsets, one used for training and the other for testing. For example, in each trial, the CoverType dataset is randomly split into two subsets of 143,024 instances each. The subset used to produce training instances is sampled without replacement to obtain the required t samples, each having \(\psi \) instances. As \(t=100\) and the maximum \(\psi \) is 1000, the maximum number of training instances employed is 100,000. For datasets which have fewer than 200,000 instances, the sampling without replacement process is restarted with the same subset when the instances run out.
The result in each dataset is obtained from an average over 20 trials. Each trial employs a training set to train an anomaly detector and its detection performance is measured using a testing set.
As the nearest neighbour distance is an important indicator of the detection error of nearest neighbour-based anomaly detectors, we produce a bar chart showing the average nearest neighbour distances of two groups, normal and anomaly, on the dataset created for each trial, before the data is split into the training and testing subsets. The reported result is averaged over 20 trials, and it is used in Sects. 7.2–7.5. Note that this is q(x; D), unlike \(q(x; {\mathcal D})\) used by aNNE. It allows us to measure the nearest neighbour distance for a given dataset, independent of the anomaly detector used. We use \({\varDelta }\) to denote q(x; D) hereafter.
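The group-wise measurement \({\varDelta }\) described above can be computed directly; the following is a sketch with our own helper name, assuming the labels are available for measurement purposes only (as in the evaluation, not in deployment):

```python
import numpy as np

def group_nn_distance(X, is_anomaly):
    """Average nearest-neighbour distance q(x; D), computed within the
    full dataset D, for the anomaly and the normal group.
    Returns (Delta_A, Delta_N)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # exclude self-distance
    nn = d.min(axis=1)                   # q(x; D) for every x
    return nn[is_anomaly].mean(), nn[~is_anomaly].mean()
```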
7 Empirical evaluation
The empirical evaluation is divided into five subsections. The first subsection investigates whether the three nearest neighbour anomaly detectors are gravity-defiant algorithms.
The other four subsections investigate the influence of three factors identified in the theoretical analysis. The experiments are designed by making specific changes in data characteristics in order to observe the resultant changes in detection error and \(\psi _{opt}\), as predicted by the theoretical analysis. Table 4 shows the changes made to a dataset in each subsection in terms of \(\alpha \), \(\mathcal{X}\), \(\mathcal{X}_N\).
Changes in error (\(=1-\)AUC), \(\psi _{opt}\) and \({\varDelta }_A\) as the data characteristics (\(\alpha \), \(\mathcal{X}\), \(\mathcal{X}_N\)) change, where \({\varDelta }_A\) and \({\varDelta }_N\) are the average nearest neighbour distances of anomalies and normal instances, respectively
Section  Change in data characteristic  \(\alpha \)  \(\rho _u({\mathcal X})\)  \(\rho _\ell ({\mathcal X}_N)\)  \({\varDelta }_A\)  Error  \(\psi _{opt}\) 

7.2  \({\mathcal X}_A\) becomes bigger  =  \(\Uparrow \)  =  \(\Uparrow \)  \(\Downarrow \)  \(\Downarrow \)  
7.3(i)  \({\mathcal X}_N\) becomes bigger  =  \(\Downarrow \)  \(\Uparrow \)  \(\diamondsuit \)  \(\Uparrow \)  \(\Uparrow \) 
7.3(ii)  Number of clusters in \({\mathcal X}_N\) increases  \(\uparrow \)  \(\Downarrow \)  \(\cong \)  \(\Downarrow \)  \(\Uparrow \)  \(\Uparrow \) 
7.4  Increase number of anomalies  \(\Downarrow \)  \(\cong \)  =  \(\Downarrow \)  \(\Uparrow \)  \(\Downarrow \)  
7.5  Increase number of normal instances  \(\Uparrow \)  =  \(\cong \)  \(\cong \)  \(\Downarrow \)  \(\Uparrow \) 
Recall that, based on the theoretical analysis in Sect. 3.5, the error and \(\rho _{\delta }^b(\psi ) \alpha ^{\psi }\) change in opposite directions. The theoretical analysis correctly predicts the error outcomes of all six cases shown in Table 4.
The details of the experiments are given in the following five subsections.
7.1 Gravity-defiant behaviour
\(\psi _{opt}\) for iNNE, aNNE and kNN, where \(\psi _{max}=1000\) and \(t=100\)
Data  iNNE  aNNE  kNN 

CoverType  35  200  200t 
Mulcross  2  10  5t 
Smtp  1000  1000  1000t 
U2R  20  200  100t 
P53Mutant  2  20  75t 
Mammography  200  500  500t 
Har  2  50  5t 
ALOI \(C=10\)  150  200  200t 
Another interesting observation in Fig. 7 is that the learning curves of iNNE almost always have steeper gradients than those of aNNE and kNN.
We will investigate in the next section the reason why the Smtp dataset does not allow the algorithms to exhibit the gravity-defiant behaviour.
7.2 Enlarge \({\mathcal X}\) by increasing anomalies’ distances to normal instances
The Smtp dataset has \(\psi _{opt}=\psi _{max}\) in the previous experiment. Examining the dataset, we found that all the anomalies are very close to the normal clusters. Thus, it is an ideal dataset with which to examine the effect of an enlarged \({\mathcal X}\) obtained by increasing the distance between anomalies and normal clusters. We offset all anomalies by a fixed distance diagonally in the first two dimensions (the third dimension is unchanged). The offsets used are 0.0, 0.03, 0.075 and 0.2. An example offset is shown in “Appendix 3”.
As everything else stays the same, the offset enlarges \({\mathcal X}\) without changing \({\mathcal X}_N\). The theoretical analysis suggests that this will lower the error.
In addition, \(\psi _{opt}\) of iNNE decreases (from 1000 to 150)—the gravity-defiant behaviour now prevails. This is despite the fact that both aNNE and kNN still have \(\psi _{opt}=\psi _{max}\) as the offset increases.^{9} This phenomenon is consistent with the result in the previous section that iNNE has a significantly smaller \(\psi _{opt}\) than those of aNNE and kNN.
This experiment verifies the analysis in Sect. 4 that a dataset which exhibits the first half of the learning curve has a large \(\psi _{opt}\); and that enlarging \({\mathcal X}_A\) increases \({\varDelta }_A\), as predicted in row b1 in Table 1. The offset evidently decreases \(\psi _{opt}\), enabling the gravity-defiant learning curve to be observed.
7.3 Changing \({\mathcal X}_N\) by changing normal clusters only
In this section, we conduct two experiments to examine the effect of changing \({\mathcal X}_N\). In the first experiment, we use a fixed number of normal clusters of increasing complexity, which increases the volume of \({\mathcal X}_N\). In the second experiment, we increase the number of normal clusters, where the increased diffusion of the clusters that make up \({\mathcal X}_N\) has a more important influence than the total volume. In both experiments, the number of anomalies is kept unchanged.
 (i)
ALOI(10): We use three categories of ten normal clusters (\(C=10\)) from the ALOI dataset. The low, medium and high categories indicate the three different complexities of single normal clusters based on \(\psi _{opt}\) of aNNE.^{10} The results are shown in Fig. 10.
 (ii)
ALOI(C): We increase the number of normal clusters, i.e., \(C=10,100,500,900\), where a high C indicates a high diffusion of clusters. The resultant subsets of the ALOI dataset used in the experiments are shown in Table 6. The results are shown in Fig. 11.
Subsets of the ALOI dataset with \(C=10,100,500,900\)
C  Number of instances  \(\alpha \)  

Normal  Anomaly  Total  
10  1104  553  1657  0.67 
100  11,047  553  11,600  0.95 
500  55,241  553  55,794  0.990 
900  99,447  553  100,000  0.994 
The results, shown in Figs. 10a and 11a, reveal that increasing either the complexity of the normal clusters or the number of normal clusters increases the error for all three algorithms. Yet the sources that lead to this apparently identical outcome are different: they are manifested as a decrease in the anomalies' nearest neighbour distances (\({\varDelta }_A\)) in case (ii); and as an increase in the normal instances' nearest neighbour distances (\({\varDelta }_N\)), though \({\varDelta }_A\) also decreases to a lesser degree, in case (i). These results are shown in Figs. 10b and 11b.
Note that \(\alpha \) is unchanged in case (i). But in case (ii), \(\alpha \) of ALOI(C) increases as C increases, which has the effect of decreasing the error if nothing else changes, according to the analysis in Sect. 4. However, the increased diffusion of clusters shrinks \(\rho _u({\mathcal X})\) drastically, which outweighs the effect of increasing \(\alpha \). Though the increased number of clusters increases the total volume of \({\mathcal X}_N\), this may not increase \(\rho _\ell ({\mathcal X}_N)\) [as discussed in item (b3) in Sect. 4]. Thus, the increased diffusion of clusters is the dominant factor in case (ii).
We observe increased \(\psi _{opt}\) for all three algorithms in both cases. It is due to the increased \(\alpha \) in case (ii); but it is due to the change in \(\rho _{\delta }^b(\psi )\) in case (i).
The net effect in both cases is that the error increases for all three algorithms, as predicted in rows b2 and b3 in Table 1.
We investigate the effect of changing \(\alpha \) that is due to the changing number of anomalies only in the next section.
7.4 Changing \(\alpha \) by changing the number of anomalies only
Here we decrease \(\alpha \) by increasing the number of anomalies through random selection using the ALOI \(C=10\) dataset: the number of anomalies is changed from 25 and 270 to 553 (i.e., \(\alpha =0.98, 0.80, 0.67\)).
However, the decrease in the anomalies' nearest neighbour distances (\({\varDelta }_A\)), shown in Fig. 12b, is not predicted by the theoretical analysis because there is minimal impact on \({\mathcal X}\). The decrease in \({\varDelta }_A\) in this case is a direct result of the characteristics of the anomalies in this dataset, i.e., many anomalies are in clusters. When the number of anomalies is small, the anomalies' nearest neighbours are normal instances. As the number of anomalies increases, members of the same clusters become their nearest neighbours, resulting in the reduction of \({\varDelta }_A\) we have observed. It is interesting to note that even in this case, the change in \({\varDelta }_A\) has correctly predicted the movement of the error.
7.5 Changing \(\alpha \) by changing the number of normal instances only
This experiment examines the impact of changing \(\alpha \) by changing the number of normal instances through random selection, which has minimal impact on \({\mathcal X}_N\) or \({\mathcal X}\).
We employ the same ALOI \(C=10\) dataset as used in the previous subsection, but vary the percentage of normal instances from 50 and 70 to 100 %, which is equivalent to \(\alpha = 0.5, 0.58, 0.67\).
8 Section summary
 1.
All three anomaly detectors, aNNE, iNNE and kNN, exhibit the gravity-defiant behaviour in seven out of the eight datasets, having \(\psi _{opt} < \psi _{max}\). Even for the only dataset exhibiting the first half of the learning curves, which appears to be gravity-compliant, we have shown that the gravity-defiant behaviour will prevail if some variants of the dataset are employed, as predicted by the theoretical analysis.
 2.
The error is influenced by components of data characteristics, i.e., \(\alpha \), \({\mathcal X}\) or \({\mathcal X}_N\). The error changes in the opposite direction of \(\rho _{\delta }^b(\psi ) \alpha ^\psi \).
 3.
All changes in \(\mathcal X\) or \({\mathcal X}_N\) in the experiments result in changes in the anomalies' nearest neighbour distances (\({\varDelta }_A\)), a direct consequence of \(\rho _{\delta }^b(\psi ) = \left\langle \rho _u({\mathcal X})^b\right\rangle _\psi - \left\langle \rho _\ell ({\mathcal X}_N)^b\right\rangle _\psi \).
 4.
The average nearest neighbour distance of anomalies (\({\varDelta }_A\)) and error change in opposite directions.
 5.
\(\alpha \) and \(\rho _{\delta }^b(\psi )\), independently, shift \(\psi _{opt}\) in the same direction. When both \(\alpha \) and \(\rho _{\delta }^b(\psi )\) impart their influence, the net effect on \(\psi _{opt}\) is hard to predict. Also, because \(\rho _{\delta }^b(\psi )\) is the most difficult indicator to measure in practice, the direction of change in \(\psi _{opt}\) can be difficult to predict when \(\rho _{\delta }^b(\psi )\) plays a significant part.
 6.
The experiments verify the theoretical analysis that the change in \({\varDelta }_A\) is able to predict the movement of error accurately, even in the case of clustered anomalies which is a factor not considered in the analysis.
 7.
The empirical evaluation suggests that limiting the predictions only for instances covered by hyperspheres centred at cnn(x) with radius \(\tau (cnn(x))\) in iNNE (rather than always making a prediction based on \(m(x,\eta _x)\) regardless of the distance between \(\eta _x\) and x as used in aNNE and kNN) has led to two positive outcomes: the learning curves of iNNE have a markedly steeper gradient and significantly smaller \(\psi _{opt}\) than those of aNNE and kNN.
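Item 7's coverage mechanism can be illustrated with the following sketch. This is only one plausible reading of iNNE's base measure—the published score (Bandaragoda et al. 2014) differs in detail—but it shows how predictions are limited to points covered by hyperspheres whose radii come from within-sample nearest neighbours:

```python
import numpy as np

def inne_base_scores(train, test, psi=8, t=100, rng=None):
    """iNNE-style sketch (one plausible reading, not the published score):
    each sample instance c defines a hypersphere with radius
    tau(c) = distance from c to its nearest neighbour within the sample.
    A test point covered by the hypersphere of its nearest centre scores
    that local radius; uncovered points are treated as maximally anomalous."""
    rng = np.random.default_rng(rng)
    scores = np.zeros(len(test))
    for _ in range(t):
        c = train[rng.choice(len(train), size=psi, replace=False)]
        dc = np.linalg.norm(c[:, None, :] - c[None, :, :], axis=2)
        np.fill_diagonal(dc, np.inf)
        tau = dc.min(axis=1)                      # radius of each hypersphere
        d = np.linalg.norm(test[:, None, :] - c[None, :, :], axis=2)
        nearest = d.argmin(axis=1)                # cnn(x) per test point
        dist = d[np.arange(len(test)), nearest]
        covered = dist <= tau[nearest]
        # covered points score the local radius; uncovered ones the max radius
        scores += np.where(covered, tau[nearest], tau.max())
    return scores / t
```

The refusal to score uncovered points by a local distance is the design choice highlighted in item 7.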
9 Discussion
9.1 Other factors that influence the learning curve
The geometrical shape of \({\mathcal X}\) or \({\mathcal X}_N\) is an important factor which influences the AUC. Our descriptions in Sects. 4 and 7 have focused on the volume, but many of the changes would have affected the shape as well. For example, the changes from Figs. 2, 3, 4 and 5, and the changes made to ALOI in Sect. 7.3, have affected both the volume and the shape. While the former is obvious, the latter can be assumed because the normal instances come from different normal clusters, even though the multidimensional ALOI dataset cannot be visualised.
It is possible that a change in shape alone has an impact on the AUC. Consider, for example, a hypothetical variant of the ALOI(C) dataset which allows us to increase the number of normal clusters without changing the volume of \({\mathcal X}_N\). In this case, the density of each cluster increases and the volume of each cluster decreases as C increases, such that the total volume and \(\alpha \) remain unchanged. Assume that each cluster is well separated from the others; then, for a fixed \(\mathcal D\), this change in shape has the effect of reducing \(\rho _u({\mathcal X})\) significantly, because the instances in \(\mathcal D\) are more spread out in \(\mathcal X\) when there is a large number of normal clusters (i.e., a high C). Though \(\rho _\ell ({\mathcal X}_N)\) may also be reduced in the process, \(\rho _u({\mathcal X})\) is reduced at a higher rate than \(\rho _\ell ({\mathcal X}_N)\), leading to a higher error. In comparison to Sect. 7.3, the source of the change is different, though both changes lead to the same outcome, i.e., a higher error as C increases.
Although our analysis in Sect. 3.4 assumes that \(\alpha \) is close to 1, our experimental results using small \(\alpha \) (e.g., in Sects. 7.3–7.5) suggest that the predictions from the analysis can still be applied successfully even though this assumption is violated.
9.2 Bias-variance analyses: density estimation versus anomaly detection
The current discussion of the bias and variance of kNN-based anomaly detectors (Aggarwal and Sathe 2015) assumes that the result of the bias-variance analysis of the kNN density estimator carries over to kNN-based anomaly detectors. This assumption is not true because the accuracy of an anomaly detector depends not only on the accuracy of the density estimator employed, but also on the distribution and the rate of anomaly contamination. The latter is not taken into consideration in the bias-variance analysis of the density estimator.
kNN  LiNearN  

Squared bias  \(O((k/n)^{\frac{4}{d}})\)  \(O(\psi ^{-2/d})\) 
Variance  \(O(k^{-1})\)  \(O(t^{-1}\psi ^{1-2/d+\epsilon }\varPsi ^{-1})\) 
Bias-variance trade-off parameter  k  \(\psi \) 
The above-mentioned assumption, in addition to the dependency between k and n for kNN, creates complication and confusion in terms of the learning curve for kNN-based anomaly detectors, as exemplified by the investigations conducted by Zimek et al. (2013) and Aggarwal and Sathe (2015). The former provides a case of gravity-defiant behaviour for a kNN-based anomaly detector using a fixed k; the latter points out the former's wrong reasoning, and shows that both gravity-compliant and gravity-defiant behaviours are possible using a fixed k. Aggarwal and Sathe (2015) posit that only gravity-compliant behaviour is observed if k is set to some fixed proportion of the data size. However, this conclusion is not derived from an analysis of detection error, but by ‘stretching’ the bias-variance analysis of density estimation error to detection error.
It is important to point out that the bias-variance analysis of the kNN density estimator assumes that \(1 \ll k \ll n\) (Fukunaga 1990). In other words, the analysis of kNN (Fukunaga 1990; Aggarwal and Sathe 2015) does not apply to 1NN or to ensembles of 1NN.
The bias-variance analysis of LiNearN (Wells et al. 2014) is the only analysis available for an ensemble of 1NN density estimators. The result of the analysis is shown in Table 7. But this result also cannot be used to explain the behaviour of an anomaly detector based on it, for the same reason mentioned earlier.
In summary, to explain the behaviour of anomaly detectors, the analysis must be based on the anomaly detection error. The existing bias-variance analyses of density estimators are not an appropriate tool for this purpose.
9.3 Intuition of why a small data size can yield the best performing 1NN ensembles
There is no magic to gravity-defiant algorithms such as aNNE and iNNE, for which a small data size yields the best performing model. Our result does not imply that less data is always better or that, in the limit, zero data does best. But it does imply that, under some data distributions, it is possible to have a good performing aNNE where each model is trained using one instance only!
We provide an intuitive example as follows. Consider a simple case in which all normal instances are generated from a Gaussian distribution. Assume an oracle which provides the representative exemplar(s) of the given dataset for a 1NN anomaly detector. In this case, the only exemplar required is the instance located at the centre of the Gaussian distribution. Using the decision rule in Eq (2), where the oracle-picked exemplar is the only instance in \({\mathcal D}\), the anomalies are those instances which have the longest distances from the centre, i.e., at the outer fringes of the Gaussian distribution. In this admittedly ideal example, \(\psi =1\) for 1NN (as a single model) is sufficient to produce accurate detection. In fact, \(\psi > 1\) can yield worse detection accuracy because \({\mathcal D}\) may now contain anomalies when they exist in the given dataset. Both the lower and upper bounds in our theoretical analysis also yield \(\psi _{opt}=1\) for the case of a sharp Gaussian distribution.
In practice, we can obtain a result close to this oracle-induced result by random subsampling, as long as the data distribution admits that instances close to the centre have a higher probability of being selected than instances far from the centre, which is the case for a sharp Gaussian distribution. Then, an average over an ensemble of 1NN models derived from multiple samples \({\mathcal D}_i\) of \(\psi =1\) (a single randomly selected instance each) will approximate the result achieved by the oracle-picked exemplar. Pang et al. (2015) report that an ensemble of 1NN (which is the same as aNNE) achieves the best or close to the best result on many datasets using \({\mathcal D}_i\) of \(\psi =1\)!
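This intuition is easy to reproduce numerically. The following sketch (our own construction, not the paper's experiment) builds an aNNE-style ensemble with \(\psi =1\) on Gaussian data: each "model" is a single randomly drawn instance, and the score is the average distance to the t drawn instances:

```python
import numpy as np

rng = np.random.default_rng(0)
normal = rng.normal(size=(1000, 2))            # sharp Gaussian normal class
anomalies = rng.normal(loc=6.0, size=(10, 2))  # fringe instances
data = np.vstack([normal, anomalies])

t = 100
# aNNE with psi = 1: each sample D_i is one randomly selected instance;
# the score of x is its average distance to these t exemplars.
exemplars = data[rng.choice(len(data), size=t, replace=False)]
scores = np.linalg.norm(data[:, None, :] - exemplars[None, :, :],
                        axis=2).mean(axis=1)

# anomalies at the fringe receive higher scores than typical normal instances
print(scores[-10:].min() > np.median(scores[:1000]))   # prints True
```

Because the exemplars are overwhelmingly drawn near the Gaussian centre, the \(\psi =1\) ensemble approximates the oracle-picked exemplar, as argued above.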
In a complex distribution (e.g., multiple peaks and an asymmetrical shape), the oracle will need to produce more than one exemplar to fully represent the structure of the data distribution in order to yield good detection accuracy. For distributions of moderate complexity, this number can still be significantly smaller than the size of the given dataset. Pang et al. (2015) report that 13 out of the 15 real-world datasets used (having data sizes up to 5 million instances) require \(\psi \le 16\) in their experiments. Note that the dataset size is irrelevant to the number of exemplars required in both the intuitive example and the complex distribution scenarios, as long as the dataset contains sufficient exemplars which are likely to be selected to represent the data distribution.
Sugiyama and Borgwardt (2013) have previously advocated the use of 1NN (as a single model) with a small sample size and provided a probabilistic explanation which can be paraphrased as follows: a small sample size ensures that the randomly selected instances are likely to come from normal instances only; increasing the sample size increases the chance of including anomalies in the sample which leads to an increased number of false negatives (of predicting anomalies as normal instances).
The above intuitive example and our analysis based on computational geometry further reveal that the geometry of normal instances and anomalies plays one of the key roles in determining the optimal sample size—which signifies the gravity-defiant behaviour of 1NN-based anomaly detectors.
9.4 Which nearest neighbour anomaly detector to use?
The investigation by Aggarwal and Sathe (2015) highlights the difficulty of using kNN because the accuracy of kNN-based anomaly detectors depends not only on the bias-variance trade-off parameter k, but also on the data size. Furthermore, the bias-variance trade-off is delicate in kNN because a change in k alters bias and variance in opposite directions (see Table 7). Our theoretical analysis points to an additional issue: kNN, which insists on using all the available data (as dictated by the conventional wisdom), has no means to reduce the risk of anomaly contamination in the training dataset.
In contrast, our theoretical analysis reveals that, by using 1NN, the risk of anomaly contamination in the training sample can be controlled by selecting an appropriate sample size (\(\psi \)). The previous analysis of the ensemble of 1NN^{12} density estimators (Wells et al. 2014) shows that the ensemble size (t) can be increased independently to reduce the variance without affecting the bias (see the result shown in Table 7).
In addition, our empirical results show that aNNE and kNN have approximately the same detection accuracy, but kNN requires approximately t times the \(\psi _{opt}\) of aNNE in order to achieve its optimal detection accuracy.^{13} Moreover, searching for \(\psi \) (which is usually significantly less than k and does not depend on the data size) is a much easier task than searching for k, which is a monotonic function of the data size (Fukunaga 1990). All in all, we recommend ensembles of 1NN over kNN.
Between the two ensembles of 1NN, we recommend iNNE over aNNE because it reaches its optimal detection accuracy with a significantly smaller sample size.
Comparisons with other stateoftheart anomaly detectors, which is outside the scope of this paper, can be found in Bandaragoda et al. (2014) and Pang et al. (2015).
9.5 Implications and potential future work
Both our theoretical analysis and our empirical results reveal that any change in \({\mathcal X}\) or \({\mathcal X}_N\) leads to changes in nearest neighbour distances. In an unsupervised learning setting, changes in \({\mathcal X}\) or \({\mathcal X}_N\) are usually unknown and difficult to measure in practice. Yet any change that leads to a change in detection error can be measured in terms of nearest neighbour distance: if anomalies' nearest neighbour distances become shorter (or normal instances' nearest neighbour distances become longer), then we know that the detection error has increased, and vice versa. This is despite the fact that the source(s) of the change and the prediction error cannot be measured directly in an unsupervised learning task, where labels for instances are not available at all times. However, \({\varDelta }_A\), the average nearest neighbour distance of anomalies, can be easily obtained in practice by examining a small proportion of instances which have the longest distances to their nearest neighbours in the given dataset (and so can \({\varDelta }_N\), by examining a portion of instances which have the shortest distances to their nearest neighbours), even though the labels are unknown.
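The label-free estimation of \({\varDelta }_A\) and \({\varDelta }_N\) described above can be sketched as follows; the fraction of instances examined at each extreme (2 % here) is an illustrative choice of ours:

```python
import numpy as np

def estimate_deltas(X, frac=0.02):
    """Label-free proxy for Delta_A and Delta_N: average the largest
    (resp. smallest) nearest-neighbour distances in the dataset, on the
    premise that instances with the longest NN distances are likely
    anomalies and those with the shortest are likely normal."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.sort(d.min(axis=1))          # sorted q(x; D) values
    m = max(1, int(frac * len(X)))
    delta_a = nn[-m:].mean()             # longest NN distances
    delta_n = nn[:m].mean()              # shortest NN distances
    return delta_a, delta_n
```

Tracking these two quantities over time is all that the change-detection scheme sketched below requires.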
This knowledge has a practical impact. In a data stream context, for example, timely model updates are crucial in maintaining the model’s detection accuracy along the stream; and the updates rely on the ability to detect changes and the type of change in the stream (for example, whether they are due to changes in anomalies or normal clusters or both.) We are not aware of any good guidance with respect to change detection under different change scenarios in the unsupervised learning context. The majority of current works in data streams (Bifet et al. 2010; Masud et al. 2011; Duarte and Gama 2014) focus on supervised learning.
Our finding suggests that the net effect of any of these changes can be measured in terms of nearest neighbour distance, if a nearest neighbour anomaly detector such as iNNE or aNNE is used. This significantly reduces the type and the number of measurements required for change detection. The end result is a simple, adaptive and effective anomaly detector for data streams. A thorough investigation into the application of this finding in data streams will be conducted in the future.
The revelation of the gravity-defiant behaviour of nearest neighbour anomaly detectors invites broader investigation. Do other types of anomaly detectors, or more generally, learning algorithms for other data mining tasks, also exhibit the gravity-defiant behaviour? In a complex domain such as natural language processing, millions of additional data points have been shown to continue to improve the performance of trained models (Halevy et al. 2009; Banko and Brill 2001). Is this a domain for which algorithms always comply with the learning curve? Or is there always a limit of domain complexity beyond which the gravity-defiant behaviour will prevail? These are open questions that need to be answered.
10 Concluding remarks
As far as we know, this is the first work which investigates algorithms that defy the gravity of learning curve. It provides concrete evidence that there are gravity-defiant algorithms which produce good performing models with small training sets, and whose models trained with large data sizes perform worse.
Nearest neighbour-based anomaly detectors have been shown to be one of the most effective classes of anomaly detectors. Our analysis focuses on this class of anomaly detectors and provides a deeper understanding of its behaviour, which has a practical impact in the age of big data.
The theoretical analysis based on computational geometry gives us an insight into the behaviour of the nearest neighbour anomaly detector. It shows that the AUC changes according to \(\psi \alpha ^{\psi } \left\langle \rho \right\rangle _\psi \), influenced by three factors: the proportion of normal instances (\(\alpha \)), the radii (\(\rho \)) of \(\mathcal X\) and \({\mathcal X}_N\), and the sample size (\(\psi \)) employed by the nearest neighbour-based anomaly detector. Because \(\psi \) and \(\alpha ^{\psi } \left\langle \rho \right\rangle _\psi \) are monotonic functions changing in opposite directions, an overly large sample size amplifies the negative impact of \(\alpha ^{\psi } \left\langle \rho \right\rangle _\psi \), leading to a higher error at the tail end of the learning curve—the gravity-defiant behaviour.
We also discover that any change in \({\mathcal X}\) or \({\mathcal X}_N\) which varies the detection error manifests as a change in nearest neighbour distance, such that the detection error and the anomalies' nearest neighbour distances change in opposite directions. Because the nearest neighbour distance can be measured easily while other indicators of detection error are difficult to measure, it provides a uniquely useful practical tool for detecting change in domains where such detection is critical, e.g., in data streams.
The knowledge that some algorithms can achieve high performance with a significantly smaller sample size is highly valuable in the age of big data, because these algorithms consume significantly less computing resources (memory space and time) to achieve the same outcome as those requiring a large sample size.
We argue that the existing bias-variance analyses of kNN-based density estimators are not an appropriate tool for explaining the behaviour of kNN-based anomaly detectors; and that the analysis of kNN does not apply to 1NN or ensembles of 1NN, which our analysis targets. In addition, we further uncover that 1NN is not a poor cousin of kNN; rather, an ensemble of 1NN has an operational advantage over kNN or an ensemble of kNN: it has only one parameter, the sample size, rather than two parameters, k and the data size, that influence the bias—this enables a simpler parameter search. In the age of big data, the most important feature of an ensemble of 1NN is that it has a significantly smaller optimal sample size than kNN. Unless a compelling reason can be found, we recommend the use of an ensemble of 1NN instead of kNN or an ensemble of kNN.
Interesting future work includes analysing the behaviour of (1) other types of anomaly detectors, especially those which have shown good performance using small samples, such as iForest; and (2) gravity-defiant algorithms for other tasks such as classification and clustering.
Footnotes
 1.
Apart from being an eager learner, further distinctions in comparison with the conventional k nearest neighbour learner are provided in Sect. 5.2.
 2.
Note that some convention yields a different expression of AUC, i.e., \(AUC({\mathcal D}_N)=\int _\infty ^0 F(r;{\mathcal D}_N)g(r;{\mathcal D}_N)dr\). This is because the convention has the yaxis and the xaxis reversed for the ROC plot. Both expressions give the same AUC. See Hand and Till (2001) for details.
 3.
A more accurate proxy is the distance between anomaly and its nearest normal instance. In the unsupervised learning context, this distance cannot be measured easily. We will see in the experiment section that the nearest neighbour distance of anomaly is a good proxy to \(\rho _{\delta }^b(\psi )\), even in a dataset with clustered anomalies—a factor not considered in the analysis.
 4.
iNNE’s original score (Bandaragoda et al. 2014) is a relative measure. We employ the base measure of the relative measure to point out that the basic algorithm has a lot in common with aNNE.
 5.
 6.
 7.
http://elki.dbs.ifi.lmu.de/wiki/DataSets/MultiView. Accessed: 11 November 2014.
 8.
Note that the actual optimal training set size for kNN is \(t\psi _{opt}\).
 9.
Note that the trend of the learning curves for both aNNE and kNN (shown in “Appendix 4”) is similar to that for iNNE—error decreases as the distance offset increases. Because \(\psi _{opt}\) of iNNE decreases, \(\psi _{opt}\)’s of aNNE and kNN are expected to follow the same trend although they cannot be determined because aNNE and kNN require significantly larger data size in order to reach their optimal performances.
 10.
900 ALOI(1) single normal cluster datasets are formed by combining each of the 900 normal clusters with all anomalies. aNNE is applied to each dataset to find its \(\psi _{opt}\). The datasets of ALOI(1) are then grouped based on aNNE’s \(\psi _{opt}\) (=1,2,5,10,20,35,50,75,100,200,500,1000), i.e., which have different data characteristics/complexities manifested as learning curves with different \(\psi _{opt}\) values. Each dataset of ALOI(10), used in the experiment, has 10 normal clusters which are randomly selected from one of the following three categories: low has \(\psi _{opt}=1,2,5,10\); medium has \(\psi _{opt}=75\); and high has \(\psi _{opt}=200,500,1000\).
 11.
Setting the derivative \(d(\psi \alpha ^{\psi })/d\psi = \alpha ^{\psi } + \psi \log (\alpha ) \alpha ^{\psi }=0\) gives \(\psi _{opt}=-1/\log (\alpha )\). When \(\alpha \) is close to 1, \(\psi _{opt}\) changes drastically; otherwise the change to \(\psi _{opt}\) is small.
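This maximiser \(\psi _{opt}=-1/\log (\alpha )\) and its sensitivity near \(\alpha =1\) can be checked numerically; a small sketch with assumed example values of \(\alpha \):

```python
import math

def f(psi, alpha):
    """The quantity being maximised: f(psi) = psi * alpha**psi."""
    return psi * alpha ** psi

# Setting the derivative alpha**psi * (1 + psi*log(alpha)) to zero
# yields psi_opt = -1/log(alpha), which is positive for 0 < alpha < 1.
alpha = 0.9                       # assumed example value in (0, 1)
psi_opt = -1.0 / math.log(alpha)  # roughly 9.5 for alpha = 0.9

# psi_opt grows sharply as alpha approaches 1:
psi_opt_near_one = -1.0 / math.log(0.99)  # roughly 99.5
```

Evaluating `f` at `psi_opt` and at nearby values confirms it is the maximum, and comparing `psi_opt` for \(\alpha =0.9\) against \(\alpha =0.99\) shows the drastic change near \(\alpha =1\).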
 12.
 13.
A similar result applies to an ensemble of LOF (\(k=1\)) versus LOF. See the result in “Appendix 5”.
Notes
Acknowledgments
We would like to express our gratitude to Dr. Mahito Sugiyama in The Institute of Scientific and Industrial Research, Osaka University for his informative discussion with us. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under Award Numbers: 15IOA009154006 (Kai Ming Ting), and 15IOA008154005 (Takashi Washio). This work is also supported by JSPS KAKENHI Grant Number 2524003, awarded to Takashi Washio.
References
 Aggarwal, C. C., & Sathe, S. (2015). Theoretical foundations and algorithms for outlier ensembles. SIGKDD Explorations, 17(1), 24–47.
 Bandaragoda, T., Ting, K. M., Albrecht, D., Liu, F., & Wells, J. (2014). Efficient anomaly detection by isolation using nearest neighbour ensemble. In Proceedings of the 2014 IEEE international conference on data mining, workshop on incremental classification, concept drift and novelty detection (pp. 698–705).
 Banko, M., & Brill, E. (2001). Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th annual meeting of the Association for Computational Linguistics, ACL ’01 (pp. 26–33).
 Bay, S., & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 29–38).
 Bifet, A., Frank, E., Holmes, G., & Pfahringer, B. (2010). Accurate ensembles for data streams: Combining restricted Hoeffding trees using stacking. In JMLR workshop and conference proceedings. The 2nd Asian conference on machine learning (Vol. 13, pp. 225–240).
 Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data (pp. 93–104).
 Duarte, J., & Gama, J. (2014). Ensembles of adaptive model rules from high-speed data streams. In JMLR workshop and conference proceedings. The 3rd international workshop on big data, streams and heterogeneous source mining: Algorithms, systems, programming models and applications (Vol. 36, pp. 198–213).
 Evans, D., Jones, A. J., & Schmidt, W. M. (2002). Asymptotic moments of near-neighbour distance distributions. Proceedings: Mathematical, Physical and Engineering Sciences, 458(2028), 2839–2849.
 Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). San Diego: Academic Press.
 Halevy, A., Norvig, P., & Pereira, F. (2009). The unreasonable effectiveness of data. IEEE Intelligent Systems, 24(2), 8–12.
 Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 10–18.
 Hand, D. J., & Till, R. J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning, 45(2), 171–186.
 Lichman, M. (2013). UCI machine learning repository. archive.ics.uci.edu/ml.
 Liu, F., Ting, K. M., & Zhou, Z. H. (2008). Isolation forest. In Proceedings of the eighth IEEE international conference on data mining (pp. 413–422).
 Masud, M., Gao, J., Khan, L., Han, J., & Thuraisingham, B. (2011). Classification and novel class detection in concept-drifting data streams under time constraints. IEEE Transactions on Knowledge and Data Engineering, 23(6), 859–874.
 Pandya, D., Upadhyay, S., & Harsha, S. (2013). Fault diagnosis of rolling element bearing with intrinsic mode function of acoustic emission data using APF-KNN. Expert Systems with Applications, 40(10), 4137–4145.
 Pang, G., Ting, K. M., & Albrecht, D. (2015). LeSiNN: Detecting anomalies by identifying least similar nearest neighbours. In 2015 IEEE international conference on data mining workshop (ICDMW) (pp. 623–630).
 Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman and Hall.
 Sugiyama, M., & Borgwardt, K. (2013). Rapid distance-based outlier detection via sampling. Advances in Neural Information Processing Systems, 26, 467–475.
 Wells, J. R., Ting, K. M., & Washio, T. (2014). LiNearN: A new approach to nearest neighbour density estimator. Pattern Recognition, 47(8), 2702–2720.
 Zhou, G. T., Ting, K. M., Liu, F. T., & Yin, Y. (2012). Relevance feature mapping for content-based multimedia information retrieval. Pattern Recognition, 45(4), 1707–1720.
 Zimek, A., Gaudet, M., Campello, R. J., & Sander, J. (2013). Subsampling for efficient and effective unsupervised outlier detection ensembles. In Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining (pp. 428–436).
 Zitzler, E., Laumanns, M., & Bleuler, S. (2004). A tutorial on evolutionary multiobjective optimization. In X. Gandibleux, M. Sevaux, K. Sörensen, & V. T’Kindt (Eds.), Metaheuristics for multiobjective optimisation (pp. 3–37). Berlin, Heidelberg: Springer.