
1 Introduction

Intrusion is a set of actions aiming to compromise the security of computer and network components in terms of confidentiality, integrity and availability [1]. Intrusion detection techniques can be classified into two categories: misuse detection (or signature-based detection) and anomaly detection. Misuse detection identifies intrusions based on patterns acquired from known attacks [2]. Anomaly detection discovers intrusions based on significant deviations from normal activities [3].

In the early days, signature-based methods such as Snort [4] were commonly applied; they rely on extensive knowledge of the particular characteristics of each attack, referred to as its signature. Such systems are highly effective against the attacks they are programmed to detect but fail against unknown intrusions. Besides, they are not applicable to anomaly detection over large-scale network data because of the well-known 4Vs [5]:

Volume. The scale and complexity of network data grow faster than Moore's law, which means the amount of traffic to be inspected at every terminal increases rapidly. String-matching-based signature detection is a computationally intensive task.

Variety. Network data are usually derived from various sources and described in unstructured or semi-structured ways. Proper integration is necessary to produce a uniform format.

Value. The value density of the data is low. Anomaly detection usually faces high-dimensional network data, and some features of these data are useless for identifying anomalies.

Velocity. Detection must respond in real time in order to identify attacks or anomalies promptly.

In addition, building new signatures requires manual inspection by human experts, which is not only expensive but also introduces a significant window of vulnerability between the discovery of a new attack and the construction of its signature.

Patcha et al. [6] further categorize anomaly detection methods into three groups: statistics-based, data mining-based and machine learning-based. Statistics-based methods are difficult to adapt to the non-stationary variation of network traffic, which leads to a high false positive rate [7]. To alleviate these shortcomings, a number of ADSs employ data mining techniques [8–12]. Data mining techniques aim to discover understandable patterns or models from given data sets [13]. They can efficiently identify profiles of normal network activities for anomaly detection and build classifiers to detect attacks. Earlier work shows that these techniques can help identify abnormal network activities efficiently.

Supervised anomaly intrusion detection approaches [8–10], commonly built with data mining techniques, rely heavily on training data drawn from normal activities. Since training data only contain historical activities, the profile of normal activities can only include historical patterns of normal behavior. Therefore, new activities due to changes in the network environment or services are considered deviations from the previously built profile, i.e., attacks. In addition, attack-free training data are not easy to obtain in real-world networks, and an ADS trained on data with hidden intrusions usually lacks the ability to detect those intrusions.

To overcome the limitations of supervised anomaly-based systems, ADSs employing unsupervised approaches have become a recent focus [14–17]. Unsupervised anomaly detection does not need attack-free training data. In distance-based methods, clusters are groups of data characterized by a small distance to the cluster center. However, because a data point is always assigned to the nearest center, these approaches are unable to detect nonspherical clusters. In density-based spatial clustering methods, one chooses a density threshold, discards as noise the points in regions with densities lower than this threshold, and assigns disconnected regions of high density to different clusters. However, it can be nontrivial to choose an appropriate threshold.

Another challenge in ADS is feature selection. Many existing algorithms suffer from low effectiveness and low efficiency due to the high dimensionality and large size of the data set. Hence, feature selection is essential for improving the detection rate, since it not only helps reduce the computational cost but also improves precision by removing irrelevant, misleading and redundant features. However, in a number of data mining methods, features are selected based on the mutual information between features and labels. Moreover, network data often contain continuous variables, for which measuring the relation between features is challenging because the result depends heavily on the discretization method.

Such limitations impose a serious bottleneck on the unsupervised network anomaly detection problem. In this paper, we investigate anomaly detection in large-scale, high-dimensional network data without labels and propose a new approach, called UFSDP (Unsupervised Feature Selection based Density Peak clustering), to tackle it. The major contributions of this paper are summarized as follows.

  1. We propose a new systematic framework that employs density peak based clustering for network anomaly detection. This clustering algorithm has the advantage of extracting cluster centers and outlier points automatically. In addition, a sampling adaptation is applied to improve the time and memory efficiency of the original clustering method in the center selection stage.

  2. An unsupervised cluster-based feature selection mechanism is applied before the clustering procedure. We use two different ways to compute the relations between discrete attributes and between continuous attributes. Different from other feature selection mechanisms, we cluster the relevant features into groups according to their mutual redundancy, and redundant features are then removed to keep the number of selected features as small as possible.

  3. Extensive experiments are conducted to evaluate the performance of the proposed method. First, comparisons are made across different classifiers using the original dataset and the dataset reduced by the proposed selection algorithm. The proposed sampled density peak clustering method is also compared with other clustering algorithms to evaluate its clustering performance under several established metrics.

The rest of the paper proceeds as follows. Section 2 reviews related work. Section 3 describes our methodology, including unsupervised feature selection and density peak clustering, and highlights our motivation for using them. Section 4 presents our evaluation results and analysis. Section 5 summarizes our work.

2 Related Work

2.1 Unsupervised Anomaly Detection

Most current network anomaly detection systems use supervised learning methods. However, labeled training data are typically expensive to obtain. Using unsupervised anomaly detection techniques, the system can be trained with unlabeled data and is capable of detecting previously unseen attacks.

Clustering, a ubiquitous unsupervised learning method, aims to group objects into meaningful subclasses. Network data generated by different attack mechanisms or by normal activities have distinct characteristics, so each group can be distinguished from the others.

K-means, a clustering method, is employed in [17] to detect unknown attacks and partition the network data space effectively. However, both the performance and the computational complexity of K-means are sensitive to the predefined number of clusters and the initial cluster centers. Wei et al. [18] employ improved FCM algorithms to obtain an optimal k.

In [19], the authors proposed an anomaly detection method that utilizes the density-based clustering algorithm DBSCAN to model the normal activities of a user on a host.

Egilmez et al. [16] proposed a novel spectral anomaly detection method by developing a graph-based framework over wireless sensor networks. In their method, graphs are chosen to capture useful proximity information of measured data and employed to project the graph signals into normal and anomaly subspaces.

In [20], a SOM-based anomaly intrusion detection system was proposed, which can contract high-dimensional data to a lower dimension while preserving the primary relationship between clustering and topology. However, the results are sensitive to parameters such as the number of neurons.

2.2 Feature Selection

The machine learning community has developed many solutions to the curse of dimensionality in the form of feature selection and feature extraction. Different from feature extraction methods such as principal component analysis (PCA) [21] and linear discriminant analysis (LDA) [22], feature selection methods choose a representative subset of the original features instead of creating new features by combining existing ones, which preserves the interpretability of the attributes.

Feature selection methods can be broadly divided into three categories: filter, embedded and wrapper approaches. Among these, filter methods are the most commonly used.

Filter algorithms have low computational complexity, but the accuracy of the learning algorithms is not guaranteed. In [23], Peng et al. propose the minimal-redundancy-maximal-relevance (mRMR) criterion, which adds a feature to the final subset if it maximizes the difference between its mutual information with the class and the sum of its mutual information with each of the features already selected. Qu et al. [24] suggested a new redundancy measure and a feature subset merit measure based on mutual information concepts to quantify the relevance and redundancy among features. Song et al. [25] proposed a feature filter, FAST, based on the mutual information between features, in which a minimum spanning tree is used to split features into clusters; only one representative feature is selected from every cluster to form the most discriminative feature subset. However, when none of the edge weights is large enough to trigger a split, the method is not applicable.

In addition, there is no effective way to compute the mutual information between continuous features, since continuous variables take an unlimited number of values and the probability of any single value is not defined. Equal-width discretization [26] divides continuous values into a number of bins of equal width, but it can be inaccurate since the width is chosen arbitrarily. Other approaches use Parzen windows [27] to estimate the joint probability density of two variables and integrate over it; however, the actual distribution is unknown and the result highly depends on the choice of kernel function. FSFC [28] applies a new similarity measure, the maximal information compression index, to measure feature similarity, but it predefines the number of features in the final subset.

3 Methodology

3.1 Feature Selection

Feature selection is a commonly used technique that selects relevant features, reducing the data dimensionality and helping build effective prediction models. It can improve the performance of prediction models by alleviating the effect of the curse of dimensionality, enhancing generalization performance, and speeding up the learning process.

Relevance Definition. Suppose F denotes the set of all features, \(F_i\) denotes an element of F, C denotes the target concept, and \(S_i\) denotes \(F \setminus \{F_i\}\). There are three main kinds of features:

Definition 1

(Strong correlation). \(F_i\) is strongly relevant to the target concept C if and only if

$$\begin{aligned} p(C|{S_i},{F_i}) \ne p(C|{S_i}) \end{aligned}$$
(1)

Strongly relevant features affect the classification distribution; without them, the result would be inaccurate.

Definition 2

(Weak correlation). \(F_i\) is weakly relevant to the target concept C if and only if

$$\begin{aligned} p(C|{S_i},{F_i}) = p(C|{S_i}),\ \exists S{'_i} \subset {S_i},p(C|S{'_i},{F_i}) \ne p(C|S{'_i}) \end{aligned}$$
(2)

A weakly relevant feature impacts the classification distribution under certain conditions, but not necessarily.

Definition 3

(Independent correlation). \(F_i\) is an independent feature if and only if

$$\begin{aligned} \forall S{'_i} \subset {S_i},p(C|S{'_i},{F_i}) = p(C|S{'_i}) \end{aligned}$$
(3)

Independent features do not influence the classification distribution, so they are removed first in feature selection.

Mutual Information Calculation. In previous work [23, 25], the symmetric uncertainty is used as the measure of correlation between two features. The symmetric uncertainty is defined as follows:

$$\begin{aligned} SU({F_i},{F_j}) = \frac{{2*Gain({F_i},{F_j})}}{{H({F_i}) + H({F_j})}} \end{aligned}$$
(4)

\(H({F_i})\) is the entropy of the discrete random variable \({F_i}\). If p(f) denotes the prior probability of each value f of \({F_i}\), then \(H({F_i})\) is defined by:

$$\begin{aligned} H({F_i}) = - \sum \limits _{f \in {F_i}}^{} {p(f){{\log }_2}p(f)} \end{aligned}$$
(5)

\(H({F_i}|{F_j})\) is the conditional entropy of \({F_i}\) given knowledge of all values of \({F_j}\). The smaller \(H({F_i}|{F_j})\) is, the greater \(Gain({F_i},{F_j})\) is:

$$\begin{aligned} Gain({F_i},{F_j}) = H({F_i}) - H({F_i}|{F_j}) = H({F_j}) - H({F_j}|{F_i}) \end{aligned}$$
(6)

\(Gain({F_i},{F_j})\) measures the contribution made by a known variable to reducing the uncertainty of an unknown variable; the known variable can be another feature or the target concept.
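For discrete attributes, these quantities can be estimated directly from value frequencies. The following minimal Python sketch (illustrative only; the function names are ours, not the authors') computes Eqs. (4)–(6) for two discrete feature columns:

```python
import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy H(X) of a discrete sequence, in bits (Eq. 5)."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def conditional_entropy(x, y):
    """H(X|Y): expected entropy of x within each group defined by y."""
    x, y = np.asarray(x), np.asarray(y)
    return sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * Gain(X, Y) / (H(X) + H(Y))  (Eqs. 4 and 6)."""
    gain = entropy(x) - conditional_entropy(x, y)   # information gain, Eq. (6)
    denom = entropy(x) + entropy(y)
    return 2.0 * gain / denom if denom > 0 else 0.0
```

For example, the SU between two symbolic KDD columns such as protocol_type and flag would be obtained by calling symmetric_uncertainty on the two columns.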

Definition 4

(Relevancy). In supervised learning, features with a low value of \(SU({F_i},C)\) are removed first as independent ones. However, in the unsupervised setting the distribution of C is inaccessible. To deal with this problem, another measure, called ref, is introduced to replace \(SU({F_i},C)\); its definitions are as follows:

$$\begin{aligned} ref({F_i},C) = \frac{1}{n}\sum \limits _{j = 1}^n {SU({F_i},{F_j})} \end{aligned}$$
(7)
$$\begin{aligned} ref({F_i},{F_j}) = SU({F_i},{F_j}) \end{aligned}$$
(8)
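Equation (7) thus reduces to averaging SU over all features. A small sketch under the same assumptions, reusing symmetric_uncertainty from the previous snippet:

```python
import numpy as np

def relevancy_scores(features):
    """ref(F_i, C) from Eq. (7): mean SU of feature i against every feature F_j.
    `features` is a list of discrete feature columns (arrays of equal length)."""
    n = len(features)
    return [np.mean([symmetric_uncertainty(features[i], features[j])
                     for j in range(n)]) for i in range(n)]
```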

Discrete attributes such as \(protocol\_type\) can be handled directly with the formulas above. Continuous attributes such as \(src\_bytes\) cannot, since their possible values are essentially infinite, which makes \(H({F_i})\) larger and \(SU({F_i},{F_j})\) smaller than for discrete attributes. As a result, it is challenging to compute relations between continuous features. Usually a discretization step is applied to map the infinite value range to a finite set of values, but most unsupervised discretization methods, such as clustering and equal-width binning, estimate the relation only roughly.

In this paper, the relation between two continuous features is calculated using the Maximal Information Coefficient (MIC) [29]. Methods such as mutual information estimators show a strong preference for some types of relations but fail to describe others well, which makes them unsuitable for identifying all potentially interesting relationships in a dataset. MIC, in contrast, can examine all potentially interesting relationships independent of their form, which allows great versatility in the search for meaningful insights.

MIC is based on the idea that if a relationship exists between two variables, a grid can be drawn on their scatterplot that partitions the data so as to encapsulate that relationship. Given a finite two-dimensional dataset D, call one dimension the x-values and the other the y-values. Suppose the x-values are divided into x bins and the y-values into y bins, yielding an \(x \times y\) grid G; the grid with the highest induced mutual information is given by

$$\begin{aligned} I^*\left( {D,x,y} \right) = \mathop {\max }\limits _G I\left( {D{|_G}} \right) \end{aligned}$$
(9)

For each pair (x, y), the MIC algorithm finds the x-by-y grid with the highest induced mutual information. It then normalizes these mutual information scores and compiles them into a characteristic matrix; MIC is the maximum value in this matrix.
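As a concrete illustration, the MINE package [32] used later in Sect. 4 exposes MIC through roughly the following interface (the parameter values shown are the library defaults, not tuned settings from this paper):

```python
import numpy as np
from minepy import MINE  # MINE package [32]

def mic_score(x, y):
    """MIC between two continuous feature columns x and y."""
    mine = MINE(alpha=0.6, c=15)          # default grid-exploration parameters
    mine.compute_score(np.asarray(x, dtype=float),
                       np.asarray(y, dtype=float))
    return mine.mic()
```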

Feature Cluster. After computing MI and MIC we obtain \(ref({F_i},C)\) and \(ref({F_i},{F_j})\) from the previous steps; an intuitive clustering algorithm is then used to filter the features. First, features with low \(ref({F_i},C)\) are removed, since they do not make an obvious contribution to identification; we set a threshold, threshold1, for \(ref({F_i},C)\). In this paper, we run the algorithm multiple times and choose the best setting. After that, redundant features are removed according to the value of \(ref({F_i},{F_j})\): we set a second threshold, threshold2, and if \(ref({F_i},{F_j})\) exceeds threshold2, \({F_i}\) and \({F_j}\) are regarded as redundant and clustered together. The details of the unsupervised feature selection algorithm for continuous features are given in Algorithm 1.

[Algorithm 1: unsupervised feature selection]
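Algorithm 1 itself is given as pseudocode in the original figure; the sketch below is our reading of the filtering logic described above (the threshold handling and the rule for picking one representative per redundant group are assumptions, not the authors' exact procedure):

```python
def unsupervised_feature_selection(ref_c, ref_pair, theta1, theta2):
    """ref_c: dict {i: ref(F_i, C)}; ref_pair: dict {(i, j): ref(F_i, F_j)}
    keyed by sorted index pairs. Returns the indices of the selected subset."""
    # Step 1: drop features whose relevancy is below threshold1.
    relevant = [i for i, r in ref_c.items() if r >= theta1]

    # Step 2: greedily cluster redundant features (pairwise ref above threshold2).
    clusters, assigned = [], set()
    for i in sorted(relevant, key=lambda k: ref_c[k], reverse=True):
        if i in assigned:
            continue
        group = [i] + [j for j in relevant
                       if j not in assigned and j != i
                       and ref_pair.get((min(i, j), max(i, j)), 0.0) >= theta2]
        assigned.update(group)
        clusters.append(group)

    # Step 3: keep one representative (the most relevant feature) per cluster.
    return [max(g, key=lambda k: ref_c[k]) for g in clusters]
```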

3.2 Density Peak Based Clustering

In distance-based methods, clusters are groups of data characterized by a small distance to the cluster center. However, because a data point is always assigned to the nearest center, these approaches are unable to detect nonspherical clusters. In density-based spatial clustering methods, one chooses a density threshold, discards as noise the points in regions with densities lower than this threshold, and assigns disconnected regions of high density to different clusters. However, it can be nontrivial to choose an appropriate threshold.

Most clustering algorithms [14–17] require predefined parameters, such as the number of clusters, and the detection accuracy is sensitive to those parameters. In [30], Rodriguez and Laio developed a clustering method named Fast Search and Find of Density Peaks (DP). Given a set of data samples, the algorithm computes two quantities for each sample:

  1. local density \({\rho _i}\):

    \({\rho _i}\) measures the local density of a target point i by counting the points that lie within a fixed radius \(d_c\) of i. There are two ways to compute the local density. With the cut-off kernel,

    $$\begin{aligned} {\rho _i} = \sum \limits _{j \in {I_S}\backslash \{ i\} }^{} {\chi ({d_{ij}} - {d_c})} \end{aligned}$$
    (10)
    $$\begin{aligned} \chi (x) = \left\{ \begin{array}{l} 1,x < 0;\\ 0,x \ge 0, \end{array} \right. \end{aligned}$$
    (11)

    In Gaussian kernel,

    $$\begin{aligned} {\rho _i} = \sum \limits _{j \in {I_S}\backslash \{ i\} }^{} {{e^{ - \mathop {(\frac{{{d_{ij}}}}{{{d_c}}})}\nolimits ^2 }}} \end{aligned}$$
    (12)
  2. minimum distance to a higher-density point \({\delta _i}\):

    \({\delta _i}\) is measured by computing the minimum distance between point i and any other point with higher density. Points with high values of both local density and distance are selected as cluster centers (a minimal sketch of both quantities is given after this list).
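A minimal NumPy sketch of both quantities, using the Gaussian kernel of Eq. (12) and assuming a precomputed pairwise distance matrix (an illustration, not the authors' implementation):

```python
import numpy as np

def density_peak_quantities(dist, d_c):
    """dist: N x N pairwise distance matrix; d_c: cutoff distance.
    Returns local density rho (Eq. 12) and distance-to-denser-point delta."""
    n = dist.shape[0]
    rho = np.exp(-(dist / d_c) ** 2).sum(axis=1) - 1.0   # exclude self (d_ii = 0)
    order = np.argsort(-rho)                              # indices by decreasing density
    delta = np.zeros(n)
    delta[order[0]] = dist[order[0]].max()                # densest point: max distance
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()            # nearest denser neighbor
    return rho, delta
```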

Cluster Center Selection. In the original density peak clustering, the density and distance of all data samples are computed first. During this procedure, the method maintains an N×N floating-point distance matrix, where N is the number of samples. When N exceeds about 32,000, the whole matrix no longer fits in memory at one pass, which prevents density peak clustering from being applied to larger datasets. We notice that if we randomly downsample the network data, the overall distribution becomes sparser but the positions of the cluster centers change only slightly, because data points in originally high-density regions remain denser than other points after unbiased downsampling. Given this, we use a portion of the network data instead of the whole dataset to obtain approximate centers.

Clustering Process. After the cluster centers have been found, every remaining point is assigned to the nearest center. The label assignment is performed in a single step.

[Algorithm 2: sampled density peak clustering]
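Putting the sampling adaptation and the single-step assignment together, a hypothetical sketch of this procedure could look as follows (the sampling rate, the number of centers, and the ρ·δ ranking rule are illustrative choices, not values fixed by the paper):

```python
import numpy as np
from scipy.spatial.distance import cdist

def sampled_dp_cluster(X, d_c, sample_rate=0.1, n_centers=5, seed=0):
    """Pick approximate cluster centers on a random subsample of X, then
    assign every point of X to its nearest center in a single pass."""
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=int(sample_rate * len(X)), replace=False)
    sample = X[idx]

    dist = cdist(sample, sample)                        # fits in memory for the subsample
    rho, delta = density_peak_quantities(dist, d_c)     # see the previous sketch
    centers = sample[np.argsort(-(rho * delta))[:n_centers]]

    labels = cdist(X, centers).argmin(axis=1)           # single-step label assignment
    return centers, labels
```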

4 Experiments and Analysis

4.1 Dataset and Preprocess

The KDDCup99 dataset [31] is used as a benchmark; it contains five million connection records processed from four gigabytes of compressed binary TCP dump data captured over seven weeks of network traffic. Due to the huge volume of the original dataset, we use the publicly available 10 % subset, containing about 494,021 records. Attacks are broadly categorized into four groups: Probes (information gathering attacks), DoS (denial of service), U2R (user to root) and R2L (remote to local). Each labeled record consists of 41 attributes (features), as summarized in Table 1, and one target value indicating the attack category.

Table 1. Summary of the 41 attributes in the KDDCup99 data sets
Table 2. Specification of KDDCup99_10_percent

Attributes in the KDD dataset come in continuous, discrete and symbolic forms with significantly varying resolutions and ranges. In the feature selection step, entropy and mutual information between discrete and symbolic attributes are computed without preprocessing, while in the clustering stage symbolic and discrete data are normalized and scaled. First, symbolic features such as \(protocol\_type\), services, flags and \(attack\_names\) are mapped to integer values ranging from 0 to \(N-1\), where N is the number of symbols. Second, min-max normalization is applied: each feature is linearly scaled to the range [0.0, 1.0] for fairness between attributes. As Table 2 shows, the 10 % subset of KDDCup99 is imbalanced, with 'neptune', 'normal' and 'smurf' far more frequent than the other classes; we therefore downsample these three classes to keep them in relative balance with the other attack types.
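A minimal preprocessing sketch using pandas and scikit-learn (the column names follow the standard KDDCup99 attribute list; this mirrors, rather than reproduces, the steps described above):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def preprocess_kdd(df, symbolic_cols=("protocol_type", "service", "flag")):
    """Map symbolic columns to integers 0..N-1, then min-max scale every
    feature to [0.0, 1.0]."""
    df = df.copy()
    for col in symbolic_cols:
        df[col] = pd.Categorical(df[col]).codes          # integers 0..N-1
    scaled = MinMaxScaler(feature_range=(0.0, 1.0)).fit_transform(df.values)
    return pd.DataFrame(scaled, columns=df.columns)
```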

4.2 Performance Evaluation

To evaluate the effectiveness and performance of the proposed method, simulation experiments were carried out. All experiments were executed on a computer with an Intel i5 CPU at 3.20 GHz and 4 GB of main memory. The proposed algorithm is implemented in Python 2.7.9 under WinPython-64bit. Several utilities, including the MINE package [32] and the Python open-source libraries scikit-learn, NumPy, SciPy and Matplotlib [33], are used in the experiments.

In the feature selection stage, we present the experimental results in terms of classification accuracy and the time gain of the reduced data over the original. The parameters of Algorithm 1 are set as follows: D = \(KDDCup99\_10\_percent\), \(\theta 1\) = 0.2, \(\theta 2\) = 0.5. After running Algorithm 1, we obtained the selected discrete feature subset \(\{2,3,4,12\}\) and continuous feature subset \(\{1,8,10,23,24,25,26,27,28,29,32,33\}\), 16 features in total, a \(60.97\,\%\) reduction compared to the original number of features. Our experiment is set up as follows:

  1. A comparison is carried out between our unsupervised method and other feature selection approaches, including supervised ones such as RFE and ExtraTreesClassifier.

  2. Five classification algorithms are employed to classify the data before and after feature selection: the tree-based DecisionTreeClassifier, the ensemble methods ExtraTreesClassifier, RandomForestClassifier and AdaBoostClassifier, and the margin-based Support Vector Machine (a sketch of this comparison loop is given after this list).

  3. We sampled the three dominant categories to obtain a balanced dataset of about 20,000 samples in total. Since the results can differ between runs, we ran the comparison experiments 100 times on the same machine and report the averaged values.
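A condensed sketch of this comparison (using the current scikit-learn API; the classifier parameters, the 5-fold split, and the helper name are our illustrative choices rather than the exact experimental protocol):

```python
import time
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              AdaBoostClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

classifiers = {
    "DecisionTree": DecisionTreeClassifier(),
    "ExtraTrees": ExtraTreesClassifier(),
    "RandomForest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "SVM": SVC(),
}

def compare(X_full, X_reduced, y):
    """Accuracy and training time on the original vs. feature-reduced data."""
    for name, clf in classifiers.items():
        for tag, X in (("original", X_full), ("reduced", X_reduced)):
            start = time.time()
            acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
            print("%s on %s data: acc=%.4f, time=%.2fs"
                  % (name, tag, acc, time.time() - start))
```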

Fig. 1. Classification accuracy over different feature selection methods

Figure 1 records the classification accuracy achieved by the five classifiers on datasets reduced by four feature selection methods. From it, we observe the following:

  1. The original data without feature selection achieve the highest accuracy with most classifiers, since they retain all the information of the whole dataset.

  2. Most feature selection methods achieve accuracy close to that of the original data. In most cases, the ensemble learning models Random Forest and AdaBoost achieve better detection accuracy than other models such as Decision Tree and Support Vector Machine.

  3. Compared with the supervised feature selection methods, MIC-based unsupervised feature selection achieves relatively high detection accuracy, within \(0.4\,\%\) of ExtraTreesClassifier-based selection and within \(0.6\,\%\) of the original data. Moreover, UFS-MIC performs \(3.3\,\%\) better than the supervised method RFE. These results show that, even in the absence of labels, the detection accuracy of the proposed method is comparable with supervised approaches, which makes it suitable for network anomaly detection.

Meanwhile, we record the running time of every classifier both with and without feature selection. The detailed statistics in Table 3 show that the proposed method efficiently reduces the classification time on the reduced data. The average runtime reduction across classifiers is a considerable \(14.44\,\%\); for the Decision Tree classifier, the \(30.63\,\%\) reduction is especially impressive.

Table 3. Runtime comparison between two datasets

5 Conclusion

In this paper, we propose a two-stage framework for network anomaly detection. High-dimensional data are common in network anomaly detection problems, and methods for solving them may suffer from the curse of dimensionality.

In the first stage, we propose a feature selection method to get rid of irrelevant and redundant features. By employing the MIC approach, we address the difficulty of calculating mutual information for continuous attributes. The experimental results show that this method achieves accuracy comparable to supervised methods and can effectively reduce the runtime of downstream classifiers with little sacrifice in accuracy.

In the second stage, we introduce density peak based clustering. We make a tradeoff by using a fraction of the data samples instead of the whole dataset to determine approximate cluster centers. Experimental results show that this method is efficient and generally achieves higher accuracy than other existing unsupervised methods.