Anomaly detection for high-dimensional space using deep hypersphere fused with probability approach

Data in a high-dimensional space are sparsely distributed, which makes it difficult to obtain sufficient information to distinguish anomalies from normal instances. Moreover, a high-dimensional space may contain many subspaces, and anomalies can exist in any of them, which further complicates anomaly mining. Consequently, anomaly mining in a high-dimensional space is challenging. To address this, we propose a deep hypersphere method fused with a probabilistic approach for anomaly mining. In the proposed method, a deep neural network is used as a feature extractor to capture layered low-dimensional features from data lying in a high-dimensional space. To improve the ability of the deep neural network to capture these features, the probability approach of sample binary-classification is fused into the loss function, thereby forming the probability deep neural network. Then, a hypersphere is used as an anomaly detector: in the low-dimensional features extracted by the deep neural network, it separates anomaly features from normal features. Finally, experimental results on synthetic and real-world data sets show that the proposed method outperforms state-of-the-art methods in the precision of mined anomalies, and that this hybrid of deep neural networks and traditional detection methods is highly capable of mining high-dimensional anomalies. We find that deep neural networks fused with the probabilistic method of sample multi-classification can capture the desired low-dimensional features; moreover, the captured low-dimensional features present more obvious layered characteristics. We also demonstrate that as long as the captured features represent a few anomaly instances, they are sufficient to distinguish anomalies from normal instances.


Introduction
The "curse of dimensionality" brings three major challenges for anomaly mining. The first is the computational cost of detection methods. A high-dimensional space may contain many subspaces, and anomalies may exist in any of them. A brute-force method is computationally prohibitive because it must search an exponential number of subspaces. The second challenge is that data objects become more and more similar in a high-dimensional space, so the relative contrast between them shrinks [1,2]. Most existing anomaly detection methods implicitly or explicitly rely on the distance between data objects. In a low-dimensional space, distance metrics make anomalies easy to mine, as in the distance-based methods in [3,4]. Unfortunately, in a high-dimensional space, a distance metric may not be able to measure the similarity between data [5][6][7]. The third challenge is that the data distribution becomes sparse in a high-dimensional space, which makes it hard to obtain sufficient information to distinguish anomalies from normal instances [8][9][10]. Overall, mining anomalies in a high-dimensional space is a tough task.

Jian Zheng, zhengjian.002@163.com. 1 Chongqing Aerospace Polytechnic, Chongqing 400021, China. 2 Chongqing College of Mobile Communication, Chongqing 401520, China. 3 Chongqing Medical and Pharmaceutical College, Chongqing 401331, China.
Currently, anomaly detection methods fall into the following categories. (I) Distance-based methods, such as K-Nearest Neighbor (KNN) [3,4]. Although such methods neither assume a data distribution nor require training samples, the distance between data in a high-dimensional space is not easy to calculate. (II) Cluster-based methods, e.g., the model in [11]; such methods require an assumption about the data distribution. (III) Reconstruction error-based methods, such as Matrix Factorization (MF) [13], for which a threshold is set in advance [12]. When the reconstruction error exceeds the threshold, the instance is considered an anomaly. If the threshold is not set properly, detection precision suffers seriously. (IV) Classification-based methods, e.g., One Class-Support Vector Machine (OC-SVM) [14]. Here, anomalies are linearly separated from normal instances by the SVM, but the curse of dimensionality limits the ability of the SVM to linearly separate features [15,16]. (V) Deep network architecture-based methods, e.g., Deep Autoencoder (DAE) [17] and Generative Adversarial Networks (GANs) [18]. Deep networks can capture the layered features used to distinguish anomalies and normal instances from the background space [19,20]. For instance, GANs exhibit excellent anomaly detection ability even if the reconstructed anomaly instances are very poor [21]. Nevertheless, GANs are prone to mode collapse during training. In addition, Goh et al. use recurrent neural networks (RNNs) [22] for anomaly detection in cyber-physical systems. (VI) Hybrid methods, i.e., methods consisting of deep networks and traditional detection methods, such as Deep Neural Networks based K-classification (DNN-K) [23], Deep Neural Networks-Support Vector Machine (DNN-SVM) [24], and Deep Neural Network-Support Vector Data Description (DNN-SVDD) [25]. Hybrid methods are much more extensible, but their computational complexity grows as the depth of the hybrid architecture increases [26].
In addition to the methods above, hyperspheres are also commonly used for anomaly detection. For example, in [27,28], the hypersphere is used for binary classification into normal and anomaly classes. Hyperspheres are usually sensitive to a lack of data and have no advantage in learning complex invariants. In a good feature space, however, hyperspheres can separate data very effectively.
Given the complementary advantages of a hypersphere and a deep network, it is attractive to study a hybrid of the two for anomaly detection. In this work, our motivation is to mine a limited number of potential anomalies existing in a high-dimensional space. We also explore the efficiency of searching the subspaces of a high-dimensional space, with the aim of characterizing the computational complexity of anomaly detection algorithms. Hence, we develop a hybrid model consisting of a hypersphere and a deep neural network. First, the deep neural network in the proposed model captures layered low-dimensional features from high-dimensional data. To improve the ability of the deep neural network to learn these layered low-dimensional features, we fuse the probability approach of sample binary-classification into the loss function, thereby forming the probability deep neural network. Then, the hypersphere linearly separates the captured low-dimensional features. Finally, the proposed method is verified on synthetic and real-world data sets.
We summarize the main contributions of this work as follows.
(1) High dimensionality increases the complexity of the data space. Facing the curse of dimensionality, deep neural networks fused with the probabilistic method of sample multi-classification can capture the desired low-dimensional features from high-dimensional data; moreover, the captured low-dimensional features present more significant layered characteristics.
(2) As long as the extracted features can represent a few anomalous instances, they are sufficient to distinguish anomalies from normal instances.
(3) In a high-dimensional space, the hybrid approach composed of a deep neural network and a traditional detection method has stronger mining power than deep detection approaches or traditional detection approaches alone.

Layered features extraction
Usually, the background space is high-dimensional, which is not conducive to anomaly mining, so we first use deep neural networks to capture low-dimensional features from the data in the background space. The purpose is to reduce the dimensionality of the search space for anomaly mining. The loss function is one of the critical hyper-parameters of deep neural networks because it affects their learning capability [29][30][31]. Anomaly detection can be treated as a binary classification of samples into anomaly and normal classes. On this basis, we design the loss function from the perspective of sample classification probability.
Given a sample set $x = \{x_1, x_2, \ldots, x_l\}$ with $l > 0$, let $P(C_i)$ denote the probability that the point $x_l$ should be in class $C_i \in C$, where $C$ is the class list and $i = 1, 2, \ldots, m$. In addition, let us assume that the classifications are mutually independent, i.e., conditionally independent. Conditional independence gives

$$P(C \mid s_j) = \prod_{i=1}^{m} P(C_i \mid s_j), \tag{1}$$

where $s_j$ is class $j$ from the class set. The posterior probability for tagging $x$ is

$$P(s_j \mid C) = \frac{P(s_j)\,P(C \mid s_j)}{P(C)}. \tag{2}$$

Since the denominator in Eq. (2) does not depend on $s_j$, it can be ignored [32]. Therefore, $s_j$ can be calculated as

$$s = \arg\max_{s_j} \; P(s_j) \prod_{i=1}^{m} P(C_i \mid s_j). \tag{3}$$

According to [32], the probability estimate $P(C_i \mid s_j)$ is replaced by a count-based estimate and the prior probability for class $C_j$ by $N_j / N$, which turns Eq. (3) into Eq. (4). There, $C$ is the number of classes, $M_i(k, u_i)$ represents the number of data set elements that have class $s_k$ and are assigned to class $s_u$, $D_k$ determines the number of class $s_u$, and $B$ is a constant. For the value of $B$, Titterington [32] has proposed 1, 0.8, or 0.5 as a reference. For the detailed proof of Eqs. (3) and (4), please see [32]. Let $C_1$ and $C_2$ denote the anomaly class and the normal class, respectively. Eq. (4) then simplifies to Eq. (5), which gives the probability that the point $x_l$ should be classified into class $C_i$.

Sparsity can be encouraged by adding a regularization term that takes a large value when the average activation value $\hat{\rho}_i$ of a neuron $i$ and its desired value $\rho_i$ are not close [33]. One such sparsity regularization term is the Kullback-Leibler (K-L) divergence in Eq. (6). We calculate the average output activation of neuron $i$ in a probabilistic manner, as given in Eq. (7), where $n$ is the total number of training examples, $w_i^{(1)T}$ is the $i$th row of the weight matrix $W^{(1)}$, and $b_i^{(1)}$ is the $i$th entry of the bias vector $b^{(1)}$.
The loss function $L(w, b)$ is given in Eq. (8), where $e$ and $\hat{e}$ are the input and the reconstructed input, respectively. Equation (8) expresses the probability that anomalous points and normal points in a given sample belong to classes $C_1$ and $C_2$, respectively.
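The decision rule described above (conditional independence, then dropping the class-independent denominator) can be sketched in a few lines of Python. This is only an illustration of the generic rule: the function names and the numeric priors and likelihoods are hypothetical, and the count-based estimates of [32] are not reproduced.

```python
from math import prod


def posterior_scores(priors, likelihoods):
    """Unnormalised posterior P(s_j) * prod_i P(C_i | s_j) for every class s_j.

    priors:      dict mapping class label -> P(s_j)
    likelihoods: dict mapping class label -> list of P(C_i | s_j) terms
    """
    return {label: priors[label] * prod(likelihoods[label]) for label in priors}


def most_probable_class(priors, likelihoods):
    """Pick the class with the largest unnormalised posterior."""
    scores = posterior_scores(priors, likelihoods)
    return max(scores, key=scores.get)


def normalised_posterior(priors, likelihoods):
    """Divide by the sum of the scores, which plays the role of P(C)."""
    scores = posterior_scores(priors, likelihoods)
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical numbers: a 3% anomaly prior and two conditionally independent terms.
priors = {"anomaly": 0.03, "normal": 0.97}
likelihoods = {"anomaly": [0.9, 0.8], "normal": [0.1, 0.2]}
label = most_probable_class(priors, likelihoods)
```

Because the denominator is the same for every class, the argmax over the unnormalised scores and over the normalised posterior select the same class.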
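As a rough illustration of how a reconstruction error and a K-L sparsity term are commonly combined, the sketch below follows the standard sparse-autoencoder recipe. It is an assumption about the general shape of such a loss, not the paper's Eq. (8): the weight `beta`, the target activation `rho`, and all function names are hypothetical, and the probability term of Eq. (5) is not included.

```python
import math


def kl_bernoulli(rho, rho_hat):
    """K-L divergence between Bernoulli(rho) and Bernoulli(rho_hat).

    Both arguments must lie strictly between 0 and 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))


def sparsity_penalty(rho, mean_activations):
    """Sum of K-L terms over hidden neurons; zero when every neuron hits rho."""
    return sum(kl_bernoulli(rho, rho_hat) for rho_hat in mean_activations)


def reconstruction_mse(inputs, reconstructions):
    """Mean over samples of the squared error between e and its reconstruction."""
    total = 0.0
    for e, e_hat in zip(inputs, reconstructions):
        total += sum((a - b) ** 2 for a, b in zip(e, e_hat))
    return total / len(inputs)


def sparse_autoencoder_loss(inputs, reconstructions, mean_activations,
                            rho=0.05, beta=1.0):
    """Reconstruction error plus a weighted sparsity penalty (assumed form)."""
    return (reconstruction_mse(inputs, reconstructions)
            + beta * sparsity_penalty(rho, mean_activations))
```

The penalty grows as the average activation of a neuron moves away from its desired value, which is the behaviour the text attributes to the sparsity regularizer.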

Anomaly separation
A hypersphere is defined by its center $a$ and radius $R$. Given a data set $y = \{y_1, y_2, \ldots, y_i, \ldots\}$, $i = 1, 2, \ldots$, the error function that uses a hypersphere to learn a compact space around $y_i$ is defined in Eq. (9) [25], where $\xi_i$ is a slack variable that lets some data points fall outside the hypersphere; $\xi_i$ is an auxiliary variable rather than part of the parameter set. The user-defined parameter $\lambda$ controls the trade-off for the proportion of data outside the hypersphere.
The distance between a sample and the center of the sphere is calculated with Eq. (10) (for the detailed derivation, see [25]), where $\alpha_i$ and $\alpha_j$ are Lagrange multipliers and $j = 1, 2, \ldots$ Using the mapping transformation, Eq. (10) can be converted into Eq. (11), where $\phi(\cdot)$ is a mapping function and $K(\cdot, \cdot)$ is a kernel function satisfying Mercer's theorem. Many kernel functions satisfy Mercer's theorem. In this work, we select the Matern52 kernel in [34]:

$$K(y, y_i) = \theta_0 \left( 1 + \sqrt{C_r\, r^2(y, y_i)} + A_r\, r^2(y, y_i) \right) \exp\!\left( -\sqrt{B_r\, r^2(y, y_i)} \right), \tag{12}$$

where $C_r$, $A_r$, $B_r$ are constant coefficients.
There are two reasons for selecting the Matern52 kernel as our kernel function. (i) The Matern52 kernel can make radius warping concave and non-decreasing [34,35], so that it tends to focus more on areas with small radii. (ii) The Matern52 kernel, which is a continuous positive definite kernel, can flexibly control the searches in the normal data region because it is non-stationary [36]. This helps improve the accuracy with which anomaly and normal features are separated.
The output of the hypersphere is calculated using Eq. (13). The sample is normal if the output is positive; otherwise, the sample is an anomaly.
The proposed model is composed of the deep neural network and the hypersphere, so the final learning function $\nabla(L(w, b), (a, r))$ of our model combines the loss function $L(w, b)$ in Eq. (8) of the probability deep neural network with the hypersphere error function over $(a, r)$ in Eq. (9), as given in Eq. (14).
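For orientation, the classical soft-boundary support vector data description objective, on which hypersphere detectors of this kind are commonly based, can be written as follows. This is the textbook form, shown as an assumption about the general shape of the error function; the exact expression used in [25] and in Eq. (9) may differ, for example in how $\lambda$ is scaled.

```latex
\min_{R,\,a,\,\xi}\; R^{2} + \lambda \sum_{i} \xi_{i}
\quad \text{s.t.} \quad
\lVert y_{i} - a \rVert^{2} \le R^{2} + \xi_{i}, \qquad \xi_{i} \ge 0 .
```

A larger $\lambda$ penalizes points outside the sphere more heavily and therefore yields a larger sphere; a smaller $\lambda$ tolerates more points outside it.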
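A minimal Python sketch of a Matérn 5/2 kernel of this shape is given below. It assumes the standard form with square roots on the first and last terms, takes $r^2$ to be the plain squared Euclidean distance (no per-dimension length scales), and uses the usual Matérn 5/2 coefficients 5, 5/3 and 5 as defaults; the paper does not state which values it uses, so these are illustrative.

```python
import math


def squared_distance(y, y_i):
    """Plain squared Euclidean distance r^2(y, y_i) (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(y, y_i))


def matern52(y, y_i, theta0=1.0, c_r=5.0, a_r=5.0 / 3.0, b_r=5.0):
    """Matern 5/2 kernel value for two equally long feature vectors."""
    r2 = squared_distance(y, y_i)
    polynomial = 1.0 + math.sqrt(c_r * r2) + a_r * r2
    return theta0 * polynomial * math.exp(-math.sqrt(b_r * r2))
```

The kernel equals `theta0` for identical points and decays smoothly as the points move apart, so nearby features contribute most to the hypersphere boundary.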
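The decision rule can be sketched as follows, assuming the usual kernel expansion of the squared distance to the center $a = \sum_i \alpha_i \phi(y_i)$ and an output of the form $R^2$ minus that distance, so that a positive output means the sample lies inside the sphere. The exact expression of Eq. (13) is not reproduced here; the function names and the linear kernel used in the example are hypothetical stand-ins.

```python
def kernel_distance_sq(z, support, alphas, kernel):
    """Squared feature-space distance from z to the centre sum_i alpha_i * phi(y_i)."""
    k_zz = kernel(z, z)
    cross = sum(a * kernel(z, y) for a, y in zip(alphas, support))
    gram = sum(a_i * a_j * kernel(y_i, y_j)
               for a_i, y_i in zip(alphas, support)
               for a_j, y_j in zip(alphas, support))
    return k_zz - 2.0 * cross + gram


def hypersphere_output(z, support, alphas, kernel, radius):
    """Positive inside the sphere, negative outside (assumed sign convention)."""
    return radius ** 2 - kernel_distance_sq(z, support, alphas, kernel)


def is_normal(z, support, alphas, kernel, radius):
    return hypersphere_output(z, support, alphas, kernel, radius) > 0.0


def linear_kernel(u, v):
    """Stand-in kernel for the example; any Mercer kernel can be passed instead."""
    return sum(a * b for a, b in zip(u, v))


# Two support points with equal multipliers put the centre at (1, 0).
support = [[0.0, 0.0], [2.0, 0.0]]
alphas = [0.5, 0.5]
inside = is_normal([1.0, 0.0], support, alphas, linear_kernel, radius=2.0)
outside = is_normal([4.0, 0.0], support, alphas, linear_kernel, radius=2.0)
```

With the linear kernel the computation reduces to the ordinary Euclidean distance to the center, which makes the example easy to check by hand.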

Model
In this section, we explain the rationale of the proposed model and describe its architecture. We also configure several hyper-parameters and present the model training.

Rationality
Since the background space is high-dimensional, anomaly detection is performed in the low-dimensional feature space instead of the background space. For this reason, we use a deep neural network to capture low-dimensional features from high-dimensional data.
To make the captured low-dimensional features more layered, the binary-classification probability of samples in Eq. (5) is fused into the loss function in Eq. (8). Our idea originates from the fact that anomaly detection can be treated as a binary classification of samples, so we adopt the probability approach of sample binary-classification.
The performance of the proposed model relies not only on the captured low-dimensional features but also on the kernel in the hypersphere. In view of this, the kernel in Eq. (12) helps improve the ability of the hypersphere to separate anomaly features. The hypersphere is trained using the error function in Eq. (9) and then outputs the separated features. As such, the proposed model exhibits outstanding ability to identify anomalies in the input samples.

Model architecture
The proposed model, which consists of the probability deep neural network and the hypersphere (namely, DNNH), has three modules: an encoding module, a hypersphere module and a decoding module, as shown in Fig. 1. The encoding module in Fig. 1 has two hidden layers. The $i$th hidden layer is denoted as $H_i(e)$, $i = 1, 2$. Given an input sample $Z = \{z_1, z_2, \ldots, z_n\}$, $Z$ is mapped onto the input layer of the encoding module. Then, $H_i(e)$ captures the low-dimensional features $F = \{f_1, f_2, \ldots, f_m\}$ from $Z$, where $F$ contains anomaly features and normal features, and $m < n$. Equation (8) ensures that $F$ is better extracted by $H_i(e)$. The captured $F$ is then sent to the hypersphere module.
For the hypersphere module in Fig. 1, the kernel in Eq. (12) operates on the captured $F$ to separate anomaly features from normal features. The hypersphere is trained by iteratively learning the error function in Eq. (9). Once training is complete, the hypersphere sends out the separated low-dimensional features $F_s$.

For the decoding module in Fig. 1, similar to the encoding module, there are two hidden layers. The $j$th hidden layer is denoted as $H_j(d)$, $j = 1, 2$. After receiving $F_s$, $H_j(d)$ reconstructs the input $Z$. Finally, the output layer in the decoding module sends out the learned normal and anomaly classes.
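The encoding path described above (an input of dimension $n$ passed through two hidden layers to $m < n$ features) can be sketched as a plain forward pass. The layer sizes, the random initialization and the function names are hypothetical, chosen only to make the example small; training, the hypersphere module and the decoder are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer with a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def encode(z, layers):
    """Pass an input sample through the hidden layers in order."""
    h = z
    for weights, biases in layers:
        h = dense(h, weights, biases)
    return h


rng = random.Random(0)
n, hidden, m = 20, 10, 4                      # hypothetical sizes with m < n
encoder = [init_layer(n, hidden, rng), init_layer(hidden, m, rng)]
z = [rng.random() for _ in range(n)]
features = encode(z, encoder)                 # low-dimensional feature vector F
```

A decoder would mirror this structure with two layers that map the $m$ features back to $n$ outputs.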

Hyper parameter configuration
Among the hyper-parameters of DNNH, we carefully studied a subset: the optimizer, the learning rate and the activation function. Because the other hyper-parameters have no substantial effect on the results, their default values are adopted.
Optimizer. Adam is used as the optimizer of DNNH. The reasons are that (i) Adam inherits the capability of AdaGrad to deal with sparse gradients [37]; (ii) Adam handles sparse gradients better than existing optimizers such as RMSprop, SGD, Momentum and Nesterov; and (iii) Adam can provide different learning rates for different parameters.
Learning rate. With Adam as the optimizer of DNNH, there is no need to initialize the learning rate for DNNH manually.
Activation function. Sigmoid is used as the activation function. Compared to other activation functions, e.g., tanh, ReLU and ELU, the output of Sigmoid lies between 0 and 1, which makes it well suited for judging anomalies and normal points. In addition, the value of $B$ in Eq. (5) must be chosen; $B$ adopts the reference value in [32], i.e., $B = 1$.
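To make the remark about per-parameter learning rates concrete, the following sketch implements one Adam update for a list of scalar parameters. The default values (step size 0.001, decay rates 0.9 and 0.999, epsilon 1e-8) are those suggested in the original Adam paper; whether DNNH uses exactly these defaults is an assumption.

```python
import math


def adam_step(params, grads, m, v, t,
              lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns new (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g          # first-moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g      # second-moment estimate
        m_hat = m_i / (1.0 - beta1 ** t)               # bias correction
        v_hat = v_i / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Two parameters whose gradients differ by a factor of 100.
params, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
params, m, v = adam_step(params, grads=[10.0, 0.1], m=m, v=v, t=1)
```

On the first step both parameters move by roughly the same amount despite the very different gradient magnitudes, because each update is scaled by that parameter's own gradient statistics.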

Training and testing
For model training, we dynamically adjust the number of epochs according to the observed training precision; training finishes once DNNH converges. For model testing, the testing set is used to verify the performance of DNNH.

Experimental settings
This section describes the experimental data sets, including 15 synthetic data sets and 6 real-world high-dimensional data sets, the compared approaches and their parameters, and the experimental assessment metrics.

Data sets
We generated three types of synthetic data sets in the manner of [38], as shown in Fig. 2; each type contains 5 sub data sets. Across the sub data sets of a type, the data dimensionality increases gradually from 1000 to 5000. The first type in Fig. 2a, denoted as T 1, has normal data and anomalies mixed at random. In the second type in Fig. 2b, denoted as T 2, anomalies are located outside the normal data. In the third type in Fig. 2c, denoted as T 3, normal data surround the anomalies. For the 15 synthetic data sets, we consider a limited number of anomalies, i.e., the anomaly ratio is 3%. The detailed description of the 15 synthetic data sets is listed in Table 1 of Appendix A.
Six real-world data sets are adopted, each with more than 1000 dimensions. In addition, we use two benchmark data sets for cross-validation of the data division. Since the eight real-world high-dimensional data sets are usually used for classification or clustering tasks, we converted them to anomaly detection data sets in the manner of [38]. Table 2 of Appendix A gives a detailed description of the eight real-world data sets.
For the nine competitors, the optimal parameters reported in the corresponding literature were used. Unless otherwise stated, all experiments are run with the same experimental settings. Parameters that are not stated adopt their default values.
In this work, the receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used to assess the accuracy of anomaly detection. In addition, the mean square error (mse) and the standard deviation (sd) are applied to assess the detection results. In their calculation, $y_i$ is the actual value, $p_i$ is the predicted value, and $D$ is the input data volume. To obtain fair results, all experiments were run independently 100 times. We then analyze the results on the synthetic data sets for statistical significance using a t test (p value < 0.05 for mse).
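The two error metrics can be sketched as below. The mse follows directly from the stated quantities (actual values, predicted values and the data volume); for sd, the population standard deviation of a list of values, for instance per-run scores, is used here as an assumption, since the paper's exact formula is not reproduced.

```python
import math


def mse(actual, predicted):
    """Mean square error over D = len(actual) samples."""
    d = len(actual)
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / d


def sd(values):
    """Population standard deviation (assumed definition)."""
    d = len(values)
    mean = sum(values) / d
    return math.sqrt(sum((x - mean) ** 2 for x in values) / d)


# Labels: 1 = anomaly, 0 = normal; one of four predictions is wrong.
error = mse([1, 0, 1, 1], [1, 0, 0, 1])
spread = sd([2, 4, 4, 4, 5, 5, 7, 9])
```

With 0/1 labels the mse is simply the fraction of misclassified samples, which is why a lower value indicates fewer mining errors.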

Results
In this section, all experimental results are presented, including mse, sd and detection accuracy. The aim is to show the capabilities of DNNH for anomaly detection and to give some insights into anomaly detection in a high-dimensional space.
All results show that the detection performance of DNNH is significantly better than that of the nine competitors in the considered cases. Moreover, the experimental results do not show a general difference between DNNH and the nine competitors in the ability to detect anomalies in a high-dimensional space.

Cross-validation
Because the way the experimental data sets are divided affects the training accuracy of DNNH, the division must be tested to find the optimal proportion between the training set and the testing set. Using benchmark data sets B1 and B2, we configured the following training/testing proportions: 0.9/0.1, 0.8/0.2, 0.7/0.3, 0.6/0.4 and 0.5/0.5.
The results are shown in Fig. 7 of Appendix B. When the training/testing proportion is 0.8/0.2, DNNH achieves the best performance, i.e., the AUC is 90.80% and 92.25% on benchmark data sets B1 and B2, respectively. As such, all subsequent experiments use 0.8/0.2 to divide the training and testing sets.
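The proportion sweep can be sketched as a small helper that splits the data at each candidate proportion and keeps the one with the best score. The `evaluate` callback is a hypothetical placeholder for training DNNH on the training part and returning its AUC on the testing part; the shuffling seed and function names are likewise illustrative.

```python
import random


def split(samples, train_fraction, seed=0):
    """Shuffle a copy of the samples and cut it at the requested proportion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def best_split(samples, fractions, evaluate):
    """Return the training fraction with the highest score and all scores."""
    scores = {}
    for fraction in fractions:
        train, test = split(samples, fraction)
        scores[fraction] = evaluate(train, test)   # e.g. AUC on the test part
    best = max(scores, key=scores.get)
    return best, scores


fractions = [0.9, 0.8, 0.7, 0.6, 0.5]
train, test = split(range(100), 0.8)
```

Using a fixed seed keeps the comparison between proportions repeatable across runs.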

Experiments on synthetic data sets
The results on the two metrics, mse and sd, show that the errors of the anomalies mined by DNNH are lower than those of all nine competitors, as shown in Table 3 of Appendix C. The statistical results in Table 3 of Appendix C indicate that there are no general differences between DNNH and these competitors in mining accuracy.

Mining accuracy
The mining accuracy results in Fig. 3 show that the capabilities of anomaly detection methods decrease as the dimensionality of the input data increases. For different data distributions, i.e., on the three types of data sets in Fig. 3a-c, DNNH maintains high mining accuracy and outperforms the nine competitors. In particular, when the dimensionality of the input data is 5000, DNNH achieves an accuracy of over 93%, whereas the traditional methods, e.g., those in [3], [14] and [15], stay below 71%, and the deep methods and hybrid methods stay below 85%. Together, these results imply that DNNH is not sensitive to the data distribution and has outstanding advantages for high-dimensional anomaly detection.

Figure 4 visualizes the anomaly clustering results on synthetic data sets T 1(5), T 2(5) and T 3(5), where the regions surrounded by a black line represent the anomaly clusters detected by DNNH and the nine competitors. Figure 4a shows that DNNH outperforms the nine competitors in the quantity and quality of detected anomaly clusters (see the area surrounded by a black line). The traditional methods, such as OC-SVM in [15], MF in [14] and KNN in [3], obtain the poorest detection results. Similarly, in Fig. 4b, c, DNNH obtains advanced anomaly clustering results and performs better than the nine competitors. As such, for high-dimensional data with different distributions, DNNH outperforms these competitors in the detected anomaly clustering.

The results also show that the execution time of the traditional methods, e.g., the methods in [3], [14] and [15], is lower than that of the deep methods and hybrid methods on most data sets, while the execution time of DNNH is lower than that of the hybrid methods, such as the methods in [23], [24] and [25].

Anomaly clustering
During calculating Eq.

Experiments on real-world data sets
The results on real-world data sets show that DNNH outperforms the nine competitors in mining accuracy, as shown in Fig. 6. In particular, on the ultra-high-dimensional data set R1 (data dimension equal to 10,000), DNNH reaches a mining accuracy above 71%, while the traditional methods, e.g., KNN [3], MF [14] and OC-SVM [15], almost fail on R1, with a mining accuracy below 20%. Hybrid approaches and deep approaches are clearly superior to traditional methods in mining performance. This is because deep architectures can capture low-dimensional features from high-dimensional data and thereby reduce the dimensionality of the input data.
According to experimental results on synthetic data sets and real-world data sets, several observations can be obtained from Figs. 3, 4, 5 and 6.
i. High dimensionality of the input data increases the complexity of the data space. By fusing the probability method of sample binary-classification into deep neural networks, the hidden layers can extract low-dimensional layered features that are used to distinguish anomalies from normal instances.
ii. If the extracted features can represent a few anomaly instances, this is enough to distinguish anomalies from normal classes.
iii. In a high-dimensional space, hybrid methods consisting of deep networks and traditional methods show excellent anomaly mining capabilities.
iv. The time complexity of deep detection algorithms depends on the depth of the network architecture and the dimensionality of the input data. Data lying in a high-dimensional space usually call for a deeper network architecture to learn more meaningful features, so the computational complexity of deep detection algorithms increases with the number of network layers. The computational complexity of hybrid algorithms comprises the complexity of both the deep detection algorithm and the traditional detection algorithm.

Discussion
DNNH has an outstanding capability for mining high-dimensional anomalies, which we explain in detail as follows. First, Eq. (8) ensures the layered characteristic of the captured low-dimensional anomaly features and low-dimensional normal features. Equation (5) reduces the probability that anomalous points are classified into class $C_2$ (i.e., the normal class) during anomaly mining. Then, the kernel in Eq. (12) achieves the linear separation of the two types of captured low-dimensional features. Moreover, by learning the error function over $(a, r)$ in Eq. (9), the hypersphere allows the captured low-dimensional features to lie far away from the center $a$ of the sphere. Finally, by learning the final learning function $\nabla(L(w, b), (a, r))$ in Eq. (14), DNNH obtains advanced anomaly mining results in a high-dimensional space.

Conclusion
In this work, a hybrid method is proposed for anomaly mining in a high-dimensional space. In the proposed method, the probability deep neural network first captures low-dimensional features from the background space. Then, the captured low-dimensional features are separated using the hypersphere, so as to distinguish between anomaly and normal classes. Experimental results show that the proposed method outperforms advanced anomaly detection methods in mining ability. We demonstrate that deep neural networks can capture the desired low-dimensional layered features by fusing the probability method of sample multi-classification. Moreover, extracted features that represent a few anomaly instances are sufficient to distinguish anomalies from normal instances. In future work, we will explore anomaly detection methods that target interference from irrelevant attributes in a high-dimensional space, i.e., how to identify anomalies masked by irrelevant attributes.

Declarations
Conflict of interest All authors have no conflicts of interest to declare that are relevant to the content of this article.
Ethics approval and consent to participate All authors declare that this work does not involve humans or animals and does not collect data from human subjects.