Elliptical modeling and pattern analysis for perturbation models and classification

Regular Paper

Abstract

The characteristics of a feature vector in the transform domain of a perturbation model differ significantly from those of its corresponding feature vector in the input domain. These differences—caused by the perturbation techniques used for the transformation of feature patterns—degrade the performance of machine learning techniques in the transform domain. In this paper, we propose a semi-parametric perturbation model that transforms the input feature patterns to a set of elliptical patterns, and we study the performance degradation issues associated with the random forest classification technique using both the input and transform domain features. Compared with linear transformations such as principal component analysis (PCA), the proposed method requires fewer statistical assumptions and is highly suitable for applications such as data privacy and security, owing to the difficulty of inverting the elliptical patterns from the transform domain to the input domain. In addition, we adopt a flexible block-wise dimensionality reduction step in the proposed method to accommodate the possible high-dimensional data in modern applications. We evaluated the empirical performance of the proposed method on a network intrusion data set and a biological data set, and compared the results with PCA in terms of classification performance and data privacy protection (measured by the blind source separation attack and the signal interference ratio). Both results confirmed the superior performance of the proposed elliptical transformation.

Keywords

Data privacy · Classification · Dimension reduction · Network intrusion · Perturbation model

1 Introduction

Feature vectors carry useful numerical patterns that characterize the original domain (or a sub-original domain—the input domain) formed by the feature vectors themselves. Machine learning algorithms generally utilize these patterns to generate classifiers that can help make decisions from data, using supervised or unsupervised learning techniques [30]. However, certain data science applications, such as data privacy and data security [34], require the alteration of these feature patterns so that it is difficult to recover the original patterns from the altered ones [22]. Perturbation models have been studied and developed for this purpose [11, 24]. A perturbation model generally transforms the feature vectors from an original domain to a new set of feature vectors within a transform domain where data privacy can be protected. On the other hand, the performance of machine learning algorithms can be degraded in the transform domain due to the alterations of the patterns. Hence, significant research has been devoted to developing efficient perturbation models that minimize the degradation of machine learning performance while providing robust protection of data privacy.

Perturbation models may be categorized into two top-level groups: parametric models and nonparametric models. The parametric models may be further divided into two subgroups: vector space (or original domain) models and feature space (or transform domain) models. The vector space models include those proposed by [24], in which the authors showed that their models perform well in the original domain. Alternatively, Oliveira and Zaïane [26] proposed a feature space model constructed using a matrix rotation, and Lasko and Vinterbo [19] also developed a feature space model, but using spectral analysis; they showed that their techniques performed well in the transform domain. These types of models make parametric statistical assumptions which in practice can easily be violated for different types of data. As a consequence, the current techniques may not perform as desired. A thorough review was presented in a recent paper by [27], in which the authors summarized the possible types of violations of parametric assumptions, including uncertainty in the marginal distributional properties of independent variables and possible nonlinear relationships that linear models cannot fully capture (e.g., the inverted-U shape [1]). They proposed a nonparametric model based on density ratios to address these problems and reported that nonparametric models in general can perform better than the parametric models.

In this paper, we consider a semi-parametric perturbation model in the sense that no parametric assumptions are imposed on the marginal distribution of features, while the machine learning model that combines the transformed variables is indeed parametric. The main idea is to construct a transform domain (or feature space) from the original domain using parametrized elliptical patterns, with the goals of making the restoration of the original patterns very difficult while maintaining a similar performance for the machine learning algorithms in both the original and the transform domains. Our proposed approach, elliptical pattern analysis (EPA), sets its privacy-strength criterion based on the blind source separation attack [35], because it uses the mutual interaction between variables to construct the transform domain.

Our key contribution includes the use of the mutual interaction between two variables (or features); however, this type of aggregation may jeopardize the performance of classification algorithms through the loss of some of the data characteristics (or patterns). To solve this problem, we propose an additional data aggregation step through random projection in the feature space before applying any machine learning algorithms. The main idea is to search over possible ways to combine pairs (or blocks) of variables to achieve efficient dimension reduction while maintaining useful predictive information for later-stage machine learning algorithms. In particular, we consider classification algorithms and use random forest classification on the reduced feature space. By aggregating feature variables, the proposed method significantly enhances the protection of data privacy and reduces computational cost.

2 A perturbation model

We define the proposed EPA approach as a model that transforms a suboriginal domain (input domain) through a systematic perturbation process such that the feature vector is altered in the transform domain to achieve a set of specific recommended goals—the goals that lead to the protection of data privacy and the generation of classifiers. In this section, the perturbation model is defined using a mathematical transformation (T) and recommended quantitative measures for quantifying the strength of data privacy (\(\rho \)) and misclassification error (\(\eta \) or \(\zeta \)).

2.1 Mathematical definition

Suppose \(\mathbf x \) is a feature vector with dimension p in the input domain X, and \(\mathbf y \) is its perturbed feature vector with dimension q (where \(q < p\)) in the transform domain Y, then we define the mathematical relationship between \(\mathbf x \) and \(\mathbf y \) as described in the following equation:
$$\begin{aligned} \mathbf y = T(\mathbf x ), \end{aligned}$$
(1)
where the mathematical transformation T defines the proposed perturbation model, and its intention is to satisfy the condition \(\rho (\mathbf x ,\mathbf y )> \epsilon _0\) for some quantitative measure \(\rho \). In other words, this condition describes the difficulty of recovering the feature vector \(\mathbf x \) from the feature vector \(\mathbf y \) given the transformation T and the quantitative measure \(\rho \). One real-world data science application that satisfies this type of modeling is data privacy, where the owner of the data wants to share the data with an intended user while its privacy is protected, provided the transformation T and the measure \(\rho \) are chosen appropriately.

2.2 Problem definition

The condition imposed on the proposed perturbation model can adversely affect other applications that require the use of a feature vector in the transform domain to achieve similar or better classification results obtained with the feature vector of the input domain, along with data privacy. Suppose \(\eta \) is a performance measure (e.g., misclassification error) of an application M, then the performance degradation of the perturbation model T can be defined as follows:
$$\begin{aligned} \eta (M(\mathbf y )) > \eta (M(\mathbf x )), \end{aligned}$$
(2)
where \(\mathbf y =T(\mathbf x )\), and we define the degradation measure as follows: \(\zeta _\mathrm{T}(\mathbf x ,\mathbf y )=\eta (M(\mathbf x )) - \eta (M(\mathbf y ))\). While it is expected that \(\zeta _\mathrm{T}(\mathbf x ,\mathbf y ) \le 0\) for a perturbation model, it is also possible that \(\zeta _\mathrm{T}(\mathbf x ,\mathbf y ) > 0\); that is, better performance with \(\mathbf y \) under the perturbation model. The application M that we consider in this paper is a classification technique—in particular, the random forest technique—with the misclassification error (MCE) as the performance measure \(\eta \).

3 The proposed methodology

This study requires—as per the definitions and problems stated in the previous section—a perturbation model T with its condition measure \(\rho \), and an application M with its performance measure \(\eta \). They are presented in this section with a detailed discussion.
Fig. 1

Three ellipses generated by Eq. (4) using three sets of parameter values: (0.22, 0.78, 0.03)—highlighted using red color; (0.32, 0.68, 0.04)—highlighted using blue color; (0.1, 0.9, 0.05)—highlighted using green color for the parameters (a, b, \(\alpha \)), respectively. It shows some signal interference between the elliptical patterns distorted by the noise parameter \(\alpha \)

3.1 Elliptical perturbation model

Our feature vector \(\mathbf x \) in the input domain may be represented by p variables (or features), \(x_1, x_2, \dots , x_p \ge 0\). We also assume p is an even integer without loss of generality. We use the proposed perturbation on consecutive pairs of variables (tuples): \((x_1,x_2)\), \((x_3,x_4)\), \(\dots \), \((x_{p-1},x_p)\) to generate the feature vector \(\mathbf y \) which is represented by new variables \(y_1,y_2,\ldots ,y_q\); \(q=p/2\), respectively. Taking the tuple \((x_1,x_2)\) as an example, we consider
$$\begin{aligned} y_1 = \sqrt{a x_1^2 + b x_2^2} + \alpha \varepsilon , \end{aligned}$$
(3)
where a and b are unknown parameters, \(\varepsilon \sim N(0, 1)\) and \(\alpha \) determines the strength of noise degradation. To further simplify the process, we can assume \(a+b=1\) and \(a,b \ge 0\). The model reduces to the standard linear model when \(a \rightarrow 0\) or \(a \rightarrow 1\). The nonlinear transformation \(\sqrt{a x_1^2 + bx_2^2}\) defines the elliptical perturbation model and describes the nonlinear mutual interaction between the feature variables \(x_1\) and \(x_2\).
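To make the transformation concrete, the following is a minimal Python/NumPy sketch of Eq. (3); the function name and the defaults are illustrative choices of ours, not the authors' implementation.

```python
import numpy as np

def elliptical_perturb(x1, x2, a, alpha, rng=None):
    """Eq. (3): y1 = sqrt(a*x1^2 + b*x2^2) + alpha*eps, with b = 1 - a."""
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(np.shape(x1))
    return np.sqrt(a * x1**2 + (1.0 - a) * x2**2) + alpha * eps

# Example: perturb one pair of (nonnegative) feature columns.
x1 = np.array([0.3, 0.7, 1.2])
x2 = np.array([0.9, 0.4, 0.8])
y1 = elliptical_perturb(x1, x2, a=0.22, alpha=0.05)
```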
Fig. 2

Three ellipses generated by Eq. (4) using three sets of parameter values: (0.22, 0.78, 0.05)—highlighted using red color; (0.32, 0.68, 0.10)—highlighted using blue color; (0.1, 0.9, 0.15)—highlighted using green color for the parameters (a, b, \(\alpha \)), respectively. It illustrates a significant signal interference between the ellipses distorted by a very high noise

On the one hand, we can choose the value for a such that the classification results using \(\mathbf y \) and \(\mathbf x \) are close to each other (i.e., \(\zeta _\mathrm{T} \approx 0\)). On the other hand, we can choose a to minimize the absolute correlations between \(y_1\) and \((x_1,x_2)\). Meanwhile, the noise strength \(\alpha \) will be tuned to achieve the intended goal (e.g., data privacy determined by \(\rho \)) of the perturbation model. In the process of building the model, we use this correlation minimization to tune the model parameter a.

3.2 Elliptical patterns visualization

The visual interpretation of the studied model in Eq. (3) is presented in Fig. 1. We have illustrated the elliptical characteristics of the model by fixing the variable \(y_1\) to a single value and varying the values of the parameters a, b, and \(\alpha \). For simplicity, we have selected \(y_1=1\) and the sets of values (0.22, 0.78, 0.03), (0.32, 0.68, 0.04), and (0.1, 0.9, 0.05) for the parameters a, b, \(\alpha \), respectively. The model in Eq. (3), with these values, provides the three elliptical patterns with interference characteristics illustrated in Fig. 1. In order to generate these elliptical patterns, we rearrange Eq. (3) as follows:
$$\begin{aligned} x_2 = \sqrt{\frac{{(y_1-\alpha \varepsilon )}^2 - a x_1^2}{b}}. \end{aligned}$$
(4)
It clearly shows the difficulty of finding a pair \((x_1, x_2)\) for a given value of \(y_1\) under a scaled noise degradation due to elliptical interference. To illustrate the strength of the model visually, we increased the values of \(\alpha \) from 0.03, 0.04, and 0.05 to 0.05, 0.1, and 0.15, respectively, and generated the values of \(x_2\). The results are presented in Fig. 2. It clearly displays a stronger interference (or cross talk) between the elliptical models with respect to the values of \(\alpha \). The measure of this interference will help determine the parameters of the model for the protection of data privacy. We treat this interference conceptually as signal interference and then apply blind signal separation approaches [35] to determine the strength of data privacy.
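The ellipse families in Figs. 1 and 2 can be reproduced from Eq. (4) with a short sketch; the sampling grid over \(x_1\) and the discarding of points where noise makes the radicand negative are our own assumptions.

```python
import numpy as np

def ellipse_points(a, alpha, y1=1.0, n=400, rng=None):
    """Solve Eq. (4) for x2 along a noisy ellipse of 'radius' y1."""
    rng = rng or np.random.default_rng(1)
    b = 1.0 - a
    # Noise-free solutions require a * x1^2 <= y1^2.
    x1 = np.linspace(0.0, y1 / np.sqrt(a), n)
    radicand = (y1 - alpha * rng.standard_normal(n)) ** 2 - a * x1**2
    keep = radicand >= 0          # noise can push the radicand negative
    return x1[keep], np.sqrt(radicand[keep] / b)

# Three interfering ellipses as in Fig. 1 (parameters from the text).
curves = [ellipse_points(a, alpha)
          for a, alpha in [(0.22, 0.03), (0.32, 0.04), (0.10, 0.05)]]
```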

3.3 Blind source separation

Blind source separation (BSS) is one of the classical techniques that is capable of separating original signals from their modulated mixtures without any prior information about the original signals [35]. Recent studies show that BSS is even capable of handling multidimensional data, such as images and video (or image sequences) [29]. Therefore, we have adopted this technique as an attack approach [23] for the proposed perturbation model and derive robust parameters for the model. The standard measure used with the BSS technique (or the attack) is called the signal interference ratio (SIR), which is defined mathematically by the following simple fraction:
$$\begin{aligned} \rho = \frac{ps_\mathrm{m}}{pc_\mathrm{t}}, \end{aligned}$$
(5)
where \(ps_\mathrm{m}\) and \(pc_\mathrm{t}\) stand for the power of the modulated signal and the power of the cross talk between the co-channels, respectively. The ratio \(\rho \) is measured in decibels (dB). When the denominator—the power of the cross talk—increases, the ratio \(\rho \) decreases, and it becomes hard to recover the source signals from the modulated signals. This fraction is defined based on the information available at https://cran.r-project.org/web/packages/JADE/index.html. In other words, the lower the SIR, the higher the strength of the modulation. The BSS literature states that if the SIR value is greater than 20 dB then the source signals (\(x_1\) and \(x_2\)) are recoverable from \(y_1\), and if the SIR value is less than or equal to 20 dB then the source signals are not recoverable [2, 6]. We use this threshold for the validation of the proposed perturbation model.
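The paper computes SIR with the JADE package in R; as a simplified, illustrative stand-in (our construction, not JADE's), the sketch below measures an SIR in dB between a source signal and an attacker's estimate of it, treating the least-squares projection onto the source as the recovered signal power and the residual as the cross-talk power.

```python
import numpy as np

def sir_db(source, estimate):
    """Simplified signal-to-interference ratio in dB (illustrative)."""
    s = source - source.mean()
    e = estimate - estimate.mean()
    coef = np.dot(e, s) / np.dot(s, s)   # least-squares alignment to source
    signal = coef * s                    # portion of estimate explained by s
    interference = e - signal            # residual treated as cross talk
    return 10.0 * np.log10(np.sum(signal**2) / np.sum(interference**2))
```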

3.4 Random forest classification

Among the many classification techniques in a machine learning system, we have selected the random forest technique [4] for our research because of its ability to address multi-class classification problems better than many other machine learning techniques, including the support vector machine [15, 31] and the decision tree [25]. Random forest classifiers divide the data domain efficiently using a bootstrapping technique—used to generate random decision trees—and the Gini index—used to split the tree nodes. Hence, it is highly suitable for the classification objectives of a large and imbalanced data set with many features.

3.5 Misclassification and OOB errors

Several measures have been used to quantify the performance of classification techniques in machine learning; among them, the out-of-bag (OOB) error and the misclassification error are the most commonly used for random forest classifiers [3]. The OOB error is defined as the ratio between the total number of misclassified items in a set and the total number of items in the set. Similarly, the misclassification error of a class is defined as the ratio between the number of misclassified items in the class and the total number of items in the class. We have used both of these quantitative measures to evaluate the performance of the random forest classification algorithm in the input domain as well as in the transform domain with the proposed perturbation model, and we compare the simulation results.
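As an illustration (not the authors' code), both errors can be read off a scikit-learn random forest, whose out-of-bag machinery matches the description above; `X` and `y` are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rf_errors(X, y, n_trees=500, seed=0):
    """Return (OOB error, per-class misclassification error)."""
    rf = RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                                random_state=seed).fit(X, y)
    oob_error = 1.0 - rf.oob_score_
    # Per-class MCE from the out-of-bag votes.
    pred = rf.classes_[rf.oob_decision_function_.argmax(axis=1)]
    mce = {c: float(np.mean(pred[y == c] != c)) for c in rf.classes_}
    return oob_error, mce
```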

3.6 Theory

To better understand the theoretical properties of the proposed random projection idea, we study the asymptotic property of the proposed method and show the following lemma.

Lemma 1

Suppose that the number of observations n and the number of predictors p both go to \(\infty \). Assume that there are only a fixed number of predictors related to the output, denoted by \(p_0\), and that \(p = m B\), where m is the fixed block size and B is the number of blocks that goes to \(\infty \). Then with probability tending to one, our method will manage to select the true set of predictors. In other words, the proposed method achieves variable selection consistency.

Lemma 1 states that the proposed random projection idea enjoys the nice variable selection consistency property in an asymptotic sense. This result is desirable as it provides a theoretical justification for the proposed method, in the sense that the randomly generated block structure manages to capture the unknown true set of predictors with probability tending to one. The assumptions we make here are standard and commonly used in both the statistics and machine learning literature, e.g., [10]. The proof follows by standard calculation.

Proof

Note that the proposed method essentially employs a quadratic transformation model within each block. If a block contains only one true predictor, then that predictor will naturally be recovered from the model. In other words, it suffices to show that, with probability tending to one, each block from the proposed random projection iteration contains at most one true predictor.

For each single random projection iteration, given p predictors in total, there are p! possible permutations. Out of these permutations, the number with at most one true predictor in each block is given by \(B (B-1) \cdots (B-p_0+1) m^{p_0} (p-p_0)!\). By taking the ratio, we obtain the probability as
$$\begin{aligned}&\frac{B (B-1) \cdots (B-p_0+1) m^{p_0} (p-p_0)!}{p!} \\&\quad = \frac{m^{p_0}\prod _{k=1}^{p_0} (B-k +1) }{\prod _{k=1}^{p_0} (p-k +1) } = \prod _{k=1}^{p_0} \frac{m (B-k+1)}{p-k +1} \rightarrow 1 \end{aligned}$$
as \(n, p \rightarrow \infty \) since \(p_0\) is assumed to be fixed. This concludes the proof. \(\square \)
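The probability computed in the proof is easy to check numerically; the sketch below compares a Monte Carlo estimate of the event "every block holds at most one true predictor" with the closed-form product, under the stated setting \(p = mB\).

```python
import numpy as np

def prob_mc(p, m, p0, trials=20000, seed=2):
    """Monte Carlo estimate: all p0 true predictors land in distinct blocks."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        pos = rng.permutation(p)[:p0]        # positions of the true predictors
        hits += len(set(pos // m)) == p0     # // m maps a position to its block
    return hits / trials

def prob_formula(p, m, p0):
    """Closed form from the proof: prod_k m(B - k + 1) / (p - k + 1)."""
    B = p // m
    out = 1.0
    for k in range(1, p0 + 1):
        out *= m * (B - k + 1) / (p - k + 1)
    return out

print(prob_mc(1000, 2, 3), prob_formula(1000, 2, 3))  # both close to 0.997
```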

4 Experimental results

We studied the performance degradation of random forest classifiers using the proposed elliptical perturbation model and the highly imbalanced NSL-KDD data set (http://www.unb.ca/cic/research/datasets/nsl.html), which we downloaded and used in previous research [32]. This data set has 25,192 observations with 41 network traffic features and 22 network traffic classes. We labeled the entire feature vector as (\(f_1, f_2, \dots , f_{41}\)) and later reduced it to a lower-dimensional feature vector based on the features' importance to random forest classification. This data set forms the original domain, and we represent it as “dataset-O”. In this data set, the normal traffic class and the Neptune attack class have large numbers of observations compared to the other attack classes; hence, it provides a highly imbalanced data set that is useful for our analysis.

The network traffic details of this data set, presented in Table 1, clearly show the imbalanced nature of the data set between normal and attack traffic classes, and among the attack traffic classes. The first 11 traffic classes (labeled 0–10) presented in this table have more than 30 observations, and the next 11 traffic classes (labeled 11–21) have far fewer than 30 observations. One of the goals is to study the effect of the proposed perturbation model on the performance of random forest classifiers using the first 11 traffic classes only; however, we will use the other 11 traffic classes to understand the imbalanced nature of the data and its significance to random forest classification.

4.1 Feature selection using random forest

There are 41 features—as we denoted by (\(f_1, f_2, \dots , f_{41}\)) earlier—in the dataset-O, and this feature vector determines the dimensionality 41 of the original domain; however, not necessarily all of these features contribute to the classification performance of random forest. To prepare the data set for our experiments and select the important features for classification, we first removed the categorical variables (or features) along with the features that overshadow the other features due to outliers. We then applied random forest classification to determine the importance of features by ordering them based on their misclassification errors.

Following the approach suggested by [36], we removed the least important feature from the feature vector one by one, performing random forest classification repeatedly until a change in misclassification error was observed. This process resulted in a lower-dimensional data set with 16 features, (\(f_{33}\), \(f_4\), \(f_{32}\), \(f_6\), \(f_{36}\), \(f_{20}\), \(f_{28}\), \(f_{19}\), \(f_{31}\), \(f_{27}\), \(f_9\), \(f_{29}\), \(f_8\), \(f_{23}\), \(f_{37}\), \(f_{30}\)), in decreasing order of importance. Hence, we have reduced the data set to one (\(p=16\)) with the most important feature vector contributing to random forest classification. For simplicity, we represent these features by (\(x_1, x_2, \dots , x_{16}\)), respectively. Therefore, the dimension of the input domain of the proposed perturbation model is \(p=16\), with 25,192 observations, 16 network traffic features, and 22 network traffic classes. For convenience, let us represent this dimension-reduced data set for the input domain as “dataset-I.”
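A sketch of this backward-elimination loop is shown below; using the OOB error as the stopping signal and the tolerance `tol` are our simplifying assumptions, not the exact recipe of [36].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def backward_select(X, y, tol=0.005, seed=0):
    """Drop the least important feature while the OOB error stays flat."""
    keep = list(range(X.shape[1]))
    rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                                random_state=seed).fit(X[:, keep], y)
    base = 1.0 - rf.oob_score_
    while len(keep) > 1:
        worst = keep[int(np.argmin(rf.feature_importances_))]
        trial = [f for f in keep if f != worst]
        rf = RandomForestClassifier(n_estimators=300, oob_score=True,
                                    random_state=seed).fit(X[:, trial], y)
        if (1.0 - rf.oob_score_) - base > tol:   # error changed: stop
            break
        keep = trial
    return keep                                   # indices of retained features
```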

4.2 Transform domain pattern analysis

The next step of this experiment is to build the perturbation model, using the dataset-I as the input domain and construct the transform domain so that the random forest classifiers can be evaluated. Due to the pairing of features, multiple elliptical perturbation models were generated by selecting suitable parameters for the model, and they are discussed in the subsections below.

4.2.1 Multiple model generation

The proposed theoretical model for a single pair of features was presented in Eq. (3); it is applied to every consecutive pair of features—(\(x_1, x_2\)), \((x_3, x_4)\), ..., (\(x_{15}, x_{16}\))—associated with the input domain; however, one can apply different techniques to select and combine the features. The pairing of these 16 features of the input domain gives 8 models \(M_i\) with new features \(y_i\) for the transform domain as follows:
$$\begin{aligned} y_i = \sqrt{a_i x_{2i-1}^2 + (1-a_i) x_{2i}^2} + \alpha \varepsilon , \end{aligned}$$
(6)
where \(i=1, \dots , 8\); hence, we have 8 different models with elliptical patterns that form the transform domain with dimension 8. The parameters \(a_i\), \(i = 1, \dots , 8\), together with \(\alpha \), contribute to the elliptical patterns and their distortion, and in turn to the robustness of the proposed perturbation model against privacy attacks. They also contribute to the performance degradation of random forest classifiers in the transform domain. Therefore, a trade-off mechanism is required to achieve strong privacy protection and a low misclassification error. The SIR measure is a flexible quantifier that allows a wide range of values to quantify the strength of privacy protection against the BSS attack. The next subsection describes the empirical approach in which we utilized this measure to find a set of values for the parameters \(a_i\), \(i = 1, \dots , 8\), by fixing \(\alpha = 0.001\).

4.2.2 Parameter selection for the models

We used a Monte Carlo approach with the JADE implementation of the SIR computation to assess the BSS attack empirically. In this implementation, multiple copies of modulated source signals are generated using random weights, and then a SIR value is calculated to determine whether the source signals are recoverable from the multiple modulated signals (if the SIR is greater than 20 dB, the source signals are recoverable; otherwise they are not). In our implementation, the feature pair (\(x_{2i-1},x_{2i}\)), \(i = 1, \dots , 8\), is considered as the source signals, and \(y_i\) is considered as their modulated signal. To create multiple copies of the modulated signal \(y_i\) from (\(x_{2i-1},x_{2i}\)), we generated several values for \(a_i\) randomly from a uniform distribution and then used the Monte Carlo approach to achieve the desired results.

The Monte Carlo approach, combined with the JADE implementation of SIR and the BSS attack, provided us with the three values 0.042, 0.021, and 0.096, which we selected for \(a_1\), \(a_2\), and \(a_3\). To cut down the computational cost of the Monte Carlo approach, we reused them for the remaining parameters as follows: \(a_1=0.042\), \(a_2=0.021\), \(a_3=0.096\), \(a_4=0.042\), \(a_5=0.021\), \(a_6=0.096\), \(a_7=0.042\), and \(a_8=0.021\) for the 8 models, respectively. We obtained the following SIR values for these parameters: 14.289, 10.983, 7.873, 11.483, 11.758, 12.608, 14.675, and 16.235, respectively—values less than 20 dB indicate that the source signal separation is difficult; hence, the BSS attack is not feasible. We can also see that each model has a different privacy strength; for example, model \(M_3\) is much stronger than model \(M_8\) against the BSS attack. Therefore, in this step, we generated a data set for the transform domain with 25,192 observations, 8 newly defined traffic features (\(y_i\), \(i=1, \dots , 8\)), and 22 network traffic classes. Let us represent this transform domain data set as “dataset-T”.
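A hedged sketch of this search for a single \(a_i\) is given below, reusing `elliptical_perturb` (Sect. 3.1 sketch) and `sir_db` (Sect. 3.3 sketch); the acceptance rule—keep the draw with the lowest SIR below 20 dB—is our reading of the procedure, not the exact JADE-based pipeline.

```python
import numpy as np

def pick_a(x1, x2, alpha=0.001, n_draws=200, sir_max=20.0, seed=3):
    """Monte Carlo search for a privacy-preserving mixing weight a."""
    rng = np.random.default_rng(seed)
    best_a, best_sir = None, np.inf
    for a in rng.uniform(0.0, 1.0, n_draws):
        y = elliptical_perturb(x1, x2, a, alpha, rng)   # Eq. (3) / Eq. (6)
        sir = max(sir_db(x1, y), sir_db(x2, y))         # worst-case recovery
        if sir < min(best_sir, sir_max):                # below 20 dB and best so far
            best_a, best_sir = a, sir
    return best_a, best_sir
```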

4.3 Performance degradation evaluation

We divided the performance degradation evaluation task into two experiments: (i) an “experiment with full-imbalanced data sets” and (ii) an “experiment with reduced-imbalanced data sets”. In the first experiment, we used the data sets dataset-I and dataset-T to compare the performance of random forest in both the input domain and the transform domain. These two data sets have all 22 network traffic types with their full-imbalanced traffic nature. As listed in Table 1, there are 11 traffic types with far fewer than 30 observations (totaling 40 observations)—the removal of these traffic types may influence the classification results. Hence, for the second experiment, we created two new data sets, dataset-IR and dataset-TR, from dataset-I and dataset-T, respectively, by removing the 40 observations related to these 11 traffic types. Hence, dataset-IR has 25,152 observations with dimension 16, and dataset-TR has 25,152 observations with dimension 8; each retains the remaining 11 traffic classes.
Table 1
Statistical information of different traffic types in the NSL-KDD data set—number of observations \(\ge \) 30

Label | Traffic     | #Obs.
0     | Normal      | 13,449
1     | Neptune     | 8282
2     | back        | 196
3     | warezclient | 181
4     | ipsweep     | 710
5     | portsweep   | 587
6     | teardrop    | 188
7     | nmap        | 301
8     | satan       | 691
9     | smurf       | 529
10    | pod         | 38

4.3.1 Experiment with full-imbalanced data sets

We used both dataset-I and dataset-T to compare the performance of random forest classifiers in the input domain and the transform domain, respectively. We conducted this experiment to evaluate the classification performance of random forest with the original (unprotected) and transformed (protected) features. The idea is to analyze the performance of random forest when the training is performed on these two full-imbalanced data sets. Therefore, we used both the OOB error and the misclassification error to compare the performances.
Table 2
Input domain: random forest classification results of NSL-KDD data with original features and full-imbalanced data

Label       | OOB error | Misclassification error (idcc, idmc)
Normal      | 0.0098    | 0.005 (13,379, 70)
Neptune     | 0.0098    | 0.003 (8256, 26)
back        | 0.0098    | 0.025 (191, 5)
warezclient | 0.0098    | 0.127 (158, 23)
ipsweep     | 0.0098    | 0.026 (691, 19)
portsweep   | 0.0098    | 0.017 (577, 10)
teardrop    | 0.0098    | 0.010 (186, 2)
nmap        | 0.0098    | 0.086 (275, 26)
satan       | 0.0098    | 0.041 (662, 29)
smurf       | 0.0098    | 0.015 (521, 8)
pod         | 0.0098    | 0.184 (31, 7)

OOB error The OOB errors and misclassification errors are presented in the second and third columns of Tables 2 and 3, respectively. The tables also provide, for each class, the tuple of correctly classified and misclassified numbers of observations in the input domain—denoted by (idcc, idmc)—and in the transform domain—denoted by (tdcc, tdmc). In the tables, the OOB error is calculated as a single measure of the classification performance on the set, and thus we have a single value of 0.0098 for the input variables (unprotected features) and 0.0169 for the transformed variables (protected features). If we round these values to two decimal places, we get OOB errors of 0.01 and 0.02, a 1% difference in performance—input domain versus transform domain. We can see that the perturbation model increases the OOB error only slightly while protecting data privacy.

Misclassification error Similarly, by comparing the misclassification errors presented in Tables 2 and 3, we observed that the perturbation model has higher misclassification errors, as expected, showing the characteristics of a perturbation model. The misclassification errors increased for all traffic types except ipsweep, teardrop, and pod. However, the error differences are small; hence, the perturbation model helps achieve both the protection of data privacy and the classification performance of random forest.
Table 3
Transform domain: random forest classification results of NSL-KDD data with EPA transformed features and full-imbalanced data

Label       | OOB error | Misclassification error (tdcc, tdmc)
Normal      | 0.0169    | 0.009 (13,322, 127)
Neptune     | 0.0169    | 0.009 (8205, 77)
back        | 0.0169    | 0.041 (188, 8)
warezclient | 0.0169    | 0.232 (139, 42)
ipsweep     | 0.0169    | 0.021 (695, 15)
portsweep   | 0.0169    | 0.063 (550, 37)
teardrop    | 0.0169    | 0.005 (187, 1)
nmap        | 0.0169    | 0.116 (266, 35)
satan       | 0.0169    | 0.063 (647, 44)
smurf       | 0.0169    | 0.045 (505, 24)
pod         | 0.0169    | 0.053 (36, 2)

Table 4
Input domain: random forest classification results of NSL-KDD data with original features and reduced-imbalanced data

Label       | OOB error | Misclassification error (idcc, idmc)
Normal      | 0.0088    | 0.005 (13,381, 68)
Neptune     | 0.0088    | 0.003 (8253, 29)
back        | 0.0088    | 0.025 (191, 5)
warezclient | 0.0088    | 0.127 (158, 23)
ipsweep     | 0.0088    | 0.025 (692, 18)
portsweep   | 0.0088    | 0.013 (579, 8)
teardrop    | 0.0088    | 0.010 (186, 2)
nmap        | 0.0088    | 0.093 (273, 28)
satan       | 0.0088    | 0.044 (660, 31)
smurf       | 0.0088    | 0.015 (521, 8)
pod         | 0.0088    | 0.210 (30, 8)

Table 5
Transform domain: random forest classification results of NSL-KDD data with EPA transformed features and reduced-imbalanced data

Label       | OOB error | Misclassification error (tdcc, tdmc)
Normal      | 0.0156    | 0.009 (13,322, 127)
Neptune     | 0.0156    | 0.009 (8207, 75)
back        | 0.0156    | 0.040 (188, 8)
warezclient | 0.0156    | 0.220 (141, 40)
ipsweep     | 0.0156    | 0.022 (694, 16)
portsweep   | 0.0156    | 0.061 (551, 36)
teardrop    | 0.0156    | 0.005 (187, 1)
nmap        | 0.0156    | 0.102 (270, 31)
satan       | 0.0156    | 0.059 (650, 41)
smurf       | 0.0156    | 0.039 (508, 21)
pod         | 0.0156    | 0.053 (36, 2)

4.3.2 Experiment with reduced-imbalanced data sets

We used dataset-IR and dataset-TR to compare the performance of random forest classifiers in the input and transform domains for the purpose of this experiment. That is, only the 11 traffic types with at least 30 observations were classified, to study whether there was any significant effect due to the elimination of the other traffic types, which have significantly fewer observations. The results are presented in Tables 4 and 5, and we can observe similar patterns between the input domain and transform domain results. Comparing the results in Tables 2 and 4, we can see that the OOB error decreased slightly due to the reduced-imbalanced nature of the traffic types, as expected. Similarly, comparing the results in Tables 3 and 5, we can see a reduction in the OOB error and an overall reduction in the misclassification errors.

4.4 Overall performance degradation

Although the results presented in the previous section provide information to compare the performance degradation of the random forest classifiers between the input domain and the transform domain, it is important to understand the overall performance degradation to conclude whether the proposed perturbation is meaningful. Therefore, to estimate the percentage performance degradation of a traffic type t, we defined a simple measure:
$$\begin{aligned} pd_\mathrm{t} = \frac{tdmc_\mathrm{t} - idmc_\mathrm{t}}{tot_\mathrm{t}} \times 100, \end{aligned}$$
(7)
where \(tot_\mathrm{t}\) is the total number of observations of traffic type t. For example, the transform domain misclassification count (\(tdmc_\mathrm{t}\)) of the traffic type “normal” is 127 (from Table 3), and the input domain misclassification count (\(idmc_\mathrm{t}\)) is 70 (from Table 2). The total number of observations of the “normal” traffic class is 13,449 (Table 1). Therefore, the percentage degradation of random forest by the proposed perturbation model for the “normal” class is 0.4238233. Similarly, we calculated the percentage degradations for the other 10 traffic types with the full-imbalanced data sets and listed all of them in Table 6 (column 2). We also calculated the same for the reduced-imbalanced data sets and provided the results in column 3 of Table 6. Note that a positive value indicates a degradation from the input domain to the transform domain, whereas a negative value indicates an improvement. The average degradations over all the class types are 1.05% for the full-imbalanced data sets and 0.45% for the reduced-imbalanced data sets; the difference shows that including the additional imbalanced data affects the performance negatively.
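A one-line check of Eq. (7) for the “normal” class, using the counts quoted above:

```python
tdmc, idmc, tot = 127, 70, 13_449          # Tables 3, 2 and 1, respectively
pd_normal = (tdmc - idmc) / tot * 100      # percentage degradation, Eq. (7)
print(round(pd_normal, 7))                 # 0.4238233, as in Table 6
```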
Table 6
Performance degradation of random forest classifiers over input domain to EPA-transformed domain using full-/reduced-imbalanced data

Label (t)   | Full-Imb. (\(pd_\mathrm{t}\)) | Reduced-Imb. (\(pd_\mathrm{t}\))
Normal      | 0.4238233   | 0.4386943
Neptune     | 0.6157933   | 0.5554214
back        | 1.5306122   | 1.5306122
warezclient | 10.4972376  | 9.3922652
ipsweep     | −0.5633803  | −0.2816901
portsweep   | 4.5996593   | 4.7700170
teardrop    | −0.5319149  | −0.5319149
nmap        | 2.9900332   | 0.9966777
satan       | 2.1707670   | 1.4471780
smurf       | 3.0245747   | 2.4574669
pod         | −13.1578947 | −15.7894737
Avg. Err.   | 1.054483    | 0.4532049

Table 7
Performance degradation of random forest classifiers over input domain to PCA-transformed domain using full-imbalanced data only

Label (t)   | Full-Imb. 5 PC (\(pd_\mathrm{t}\)) | Full-Imb. 6 PC (\(pd_\mathrm{t}\))
Normal      | 0.3345974  | 0.1487099
Neptune     | 0.4829751  | 0.4346776
back        | 13.7755102 | 10.7142857
warezclient | 7.1823204  | 3.3149171
ipsweep     | 0.4225352  | 0.7042254
portsweep   | 1.8739353  | 1.7035775
teardrop    | −0.5319149 | −0.5319149
nmap        | 0.9966777  | 0.3322259
satan       | 0.8683068  | 0.5788712
smurf       | 5.6710775  | 6.4272212
pod         | −7.8947368 | −13.1578947
Avg. Err.   | 2.107389   | 0.9699002

5 Comparisons with competing methods

We have selected PCA and differentially private logistic regression (DPLR) [8] as the competing methods to evaluate the performance of the proposed EPA approach. PCA is a classical linear transformation that maps the original features to principal components (PCs) and hence achieves effective dimension reduction [9]. It has been extensively used in modern applications, including atmospheric science [17], neuroscience [20], and neuroimaging [18], and it became popular in the last two decades because of developments in computer technology that enable the application of PCA to high-dimensional large data sets. However, it generally suffers from two major drawbacks, as reported in [5]: strong statistical assumptions, and the difficulty of selecting the number of PCs for dimensionality reduction while achieving data utility. DPLR, in turn, is a privacy-preserving data analytics technique that is useful for modern data-intensive classification and prediction applications.
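For reference, here is a minimal sketch of DPLR in the output-perturbation style of [8]; the noise calibration (a random vector with uniform direction and Gamma-distributed norm) reflects our reading of that paper and assumes feature rows scaled to norm at most 1, so it should be taken as illustrative rather than as the exact mechanism evaluated below.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dplr_fit(X, y, eps=1.0, lam=0.01, seed=4):
    """Output-perturbation DP logistic regression (sketch, after [8])."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # L2-regularized fit; C = 1/(n*lam) matches a (lam/2)*||w||^2 penalty.
    clf = LogisticRegression(C=1.0 / (n * lam), fit_intercept=False).fit(X, y)
    w = clf.coef_.ravel()
    # Noise b with density proportional to exp(-(n*lam*eps/2) * ||b||).
    beta = n * lam * eps / 2.0
    norm = rng.gamma(shape=d, scale=1.0 / beta)
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    return w + norm * direction        # private weight vector
```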
Table 8
Performance evaluation of DPLR

(Class 1, Class 2)  | Misclassification error (MCE) | SIR (dB)
(Normal, Neptune)   | (0.0081, 0.0397) | 22.59660
(Normal, Portsweep) | (0.0017, 0.0988) | 28.04944
(Neptune, Back)     | (0.0024, 0.0000) | 26.24273
(Neptune, Smurf)    | (0.0050, 0.0245) | 9.55109
(Back, Portsweep)   | (0.0714, 0.0034) | 37.75119
(Back, Smurf)       | (0.1582, 0.0132) | 3.98875
(Teardrop, Satan)   | (0.3936, 0.0449) | 18.111127
(Smurf, Pod)        | (0.0056, 0.0526) | 1.59793

Table 9
Performance evaluation of EPA

(Class 1, Class 2)  | Misclassification error (MCE) | SIR (dB)
(Normal, Neptune)   | (0.0009, 0.0027) | 7.33663
(Normal, Portsweep) | (0.0008, 0.0102) | 8.87933
(Neptune, Back)     | (0.0000, 0.0000) | 8.09299
(Neptune, Smurf)    | (0.0000, 0.0000) | 7.80478
(Back, Portsweep)   | (0.0000, 0.0017) | 8.17580
(Back, Smurf)       | (0.0000, 0.0000) | 4.61716
(Teardrop, Satan)   | (0.0000, 0.0014) | 6.73531
(Smurf, Pod)        | (0.0000, 0.0000) | 7.99792

5.1 Comparative analysis: PCA vs. EPA

The results of the PCA transformation—applied to the full-imbalanced NSL-KDD data—are presented in Table 7, and they can be compared with the results of the proposed EPA approach (applied to the same data) in the second column of Table 6. We adopted two criteria to extract the number of PCs: the eigenvalue-greater-than-1 criterion (i.e., the Kaiser–Guttman criterion) as used in [14], and the 80% cumulative variance rule as stated in [5]. The numbers of PCs selected by these criteria are 5 and 6, respectively. The random forest results (\(pd_\mathrm{t}\)) using the first 5 PCs and 6 PCs are presented in the second and third columns of Table 7.
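Both selection rules can be reproduced with a short scikit-learn sketch; standardizing the features first (so the Kaiser–Guttman rule applies to correlation-matrix eigenvalues) is our assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components(X):
    """Number of PCs by the Kaiser-Guttman and 80% cumulative-variance rules."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize before PCA
    pca = PCA().fit(Xs)
    kaiser = int(np.sum(pca.explained_variance_ > 1.0))
    cum = np.cumsum(pca.explained_variance_ratio_)
    var80 = int(np.searchsorted(cum, 0.80) + 1)   # first PC count reaching 80%
    return kaiser, var80
```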

The results in the second columns of Tables 6 and 7 show that the average performance degradation caused by PCA with 5 PCs is higher (almost double) than the degradation caused by the proposed EPA approach. In contrast, the results in the third column suggest that a smaller degradation is possible if 6 PCs are used. These results show that PCA with a larger number of PCs can achieve better classification accuracy; even so, they suggest the proposed approach is competitive.

From another perspective, the denial-of-service (DoS) attack is generally considered a major threat to network users and servers in network security. Therefore, the classification of normal traffic and DoS attacks is very important. DoS attacks include Neptune, Back, Teardrop, Smurf, and Pod [16], and they are included in the NSL-KDD data set as well. Therefore, we calculated the performance degradation (\(pd_\mathrm{t}\)) for these attacks separately and obtained − 1.35, 1.97, and 0.67 for EPA, PCA with 5 PCs, and PCA with 6 PCs, respectively. A negative value, as stated earlier, indicates an improvement in performance; thus, the proposed EPA is superior to PCA when the classification of DoS attacks is considered.
Table 10
OOB errors of three cases using IRIS plant data set

Class      | OOB: RF | OOB: RF-PCA | OOB: RF-EPA
Setosa     | 0.00    | 0.00        | 0.00
Versicolor | 0.08    | 0.12        | 0.12
Virginica  | 0.06    | 0.10        | 0.06

In terms of invertibility, according to [13], it is possible to invert PCA with an estimate of the covariance matrix; hence, PCA is relatively weaker than the proposed EPA approach when applications such as data privacy and security are considered. However, in terms of dimension reduction, PCA can be superior to the proposed method because it can reduce the dimension by more than 50%, whereas the proposed EPA approach provides a fixed 50% dimension reduction.
Table 11
Performance evaluation of DPLR and EPA using IRIS data as binary classifiers

(Class 1, Class 2)      | MCE: DPLR    | SIR: DPLR | SIR: EPA
(Setosa, Versicolor)    | (0.00, 0.00) | 13.52917  | 9.49978
(Setosa, Virginica)     | (0.00, 0.00) | 10.48724  | 9.69263
(Versicolor, Virginica) | (0.04, 0.06) | 18.74225  | 7.40357

5.2 Comparative analysis: DPLR vs. EPA

The results of the DPLR transformation—applied to subsets of the NSL-KDD data—are presented in Table 8. DPLR is a binary classification approach [8]; therefore, we divided the NSL-KDD data set into several subsets with two classes each and then applied the DPLR approach. To facilitate a fair comparison, the proposed EPA approach was also applied to these subsets of the NSL-KDD data, and the new results are presented in Table 9. In Table 8, the tuples of traffic types considered for the binary classifier DPLR are presented in the first column. We tested all the pairs of traffic types, and the results are similar to those of the candidate pairs presented in this table. The second and third columns of Table 8 provide the misclassification errors and the SIR values of the pairs of traffic types designated by the first column. The low misclassification errors indicate that DPLR performs very well in terms of classifying traffic types in a two-class setting. At the same time, SIR values above 20 dB, or close to 20 dB, indicate that the source signals (variables) can be recovered from the results of the DPLR classifiers—hence the data privacy is not strong.

Comparing the MCE values in Tables 8 and 9, we can clearly see that the misclassification errors of the proposed EPA approach are much lower than those of DPLR. At the same time, the SIR values of the EPA approach, all below 12 dB, indicate that the source signals cannot be recovered; thus, it provides very strong data privacy. Overall, we can conclude that the proposed approach is much better than the DPLR and PCA approaches. In the next subsection, we present experimental results that support the same findings using the IRIS data set.

5.3 Evaluation using IRIS plant data set

We also used the IRIS plant data set to evaluate and compare the EPA and PCA transformations, as well as the transformation in DPLR. This is a simple yet effective data set, which has been used extensively in machine learning for the last several decades [7]. We obtained these data from the UCI Machine Learning Repository [21]. Random forest was applied to the original iris data first, and the resulting OOB errors are presented in the second column of Table 10. The data were then transformed into PCs using PCA; random forest classification was applied using all the PCs, and the OOB results are presented in the third column of Table 10. We also transformed the data set using the proposed EPA transformation and then applied random forest classification; the OOB results of the proposed approach are presented in the fourth column of the table. Note that the first column of the table shows the three classes of the iris plant. Comparing the results in Table 10, we can say that the proposed transformation provides classification results closer to those of random forest applied to the original data than the principal components do.

The DPLR approach was then applied to the iris data set, and the results for all possible pairs of class types are presented in Table 11. Once again, we see higher SIR values for DPLR than for the EPA approach. The SIR values of both methods are below 20 dB; hence, both DPLR and EPA provide data privacy. However, the SIR values of EPA, all below 12 dB, indicate that EPA provides much stronger data privacy than DPLR.

6 Conclusion

This study allowed us to understand the variations caused by perturbation models between their input domain and transform domain characteristics (or numerical patterns). This knowledge helped us construct a semi-parametric perturbation model using an elliptical transformation along with an additive Gaussian noise degradation. The performance degradation analysis using random forest classifiers, together with the blind source separation attack and the quantitative measures—signal interference ratio, OOB error, and misclassification error—showed that the elliptical perturbation model performed very well in the classification of network intrusion and biological data, while protecting the data privacy patterns of the feature vectors.

Compared with classical linear transformations such as PCA, the proposed method requires fewer statistical assumptions on the data and is highly suitable for applications such as data privacy and security as a result of the difficulty of inverting the elliptical patterns from the transform domain to the input domain. In addition, we adopted a flexible block-wise dimension reduction step in the proposed method to accommodate the possible high-dimensional data (\(p \gg n\)) in modern applications, in which PCA is not directly applicable. The empirical performance results also confirmed the superior performance of the proposed EPA approach over the widely used PCA.

We also carried out a sensitivity analysis to evaluate the robustness of our model by changing the parameter values and the block size (from 2 to 4 and 8). The results (e.g., misclassification rate) were quite stable under those changes. In future work, it will be of interest to develop theoretical results for the optimal block size. Another future research direction is to consider developing perturbation models for online streaming data, where data may come in a sequential order and the associated distribution may vary over time. The proposed perturbation model can then be modified by including an additional layer of latent structure to allow the model parameters and block size to change over time. It will then be of interest to evaluate and compare the proposed methodology with other popular approaches in this area, including active learning [28], data stream mining [12], and transfer learning [33].


Acknowledgements

This research of the first author was partially supported by the Department of Statistics, University of California at Irvine, and by the University of North Carolina at Greensboro. This material was based upon work partially supported by the National Science Foundation under Grant DMS-1638521 to the Statistical and Applied Mathematical Sciences Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. Shen’s research is partially supported by Simons Foundation Award 512620. The authors thank the Editor, the Associate Editor, and the referees for their valuable comments.

References

1. Aghion, P., Bloom, N., Blundell, R., Griffith, R., Howitt, P.: Competition and innovation: an inverted-U relationship. Q. J. Econ. 120(2), 701–728 (2005)
2. Boscolo, R., Pan, H., Roychowdhury, V.P.: Independent component analysis based on nonparametric density estimation. IEEE Trans. Neural Netw. 15(1), 55–65 (2004)
3. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Bruce, P., Bruce, A.: Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly Media, Inc., Sebastopol (2017)
6. Caiafa, C.F., Proto, A.N.: A non-Gaussianity measure for blind source separation. In: Proceedings of SPARS05 (2005)
7. Chaudhary, A., Kolhe, S., Kamal, R.: A hybrid ensemble for classification in multiclass datasets: an application to oilseed disease dataset. Comput. Electron. Agric. 124, 65–72 (2016)
8. Chaudhuri, K., Monteleoni, C., Sarwate, A.D.: Differentially private empirical risk minimization. J. Mach. Learn. Res. 12(Mar), 1069–1109 (2011)
9. Du, K.L., Swamy, M.: Principal component analysis. In: Neural Networks and Statistical Learning, pp. 355–405. Springer, London (2014)
10. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
11. Fienberg, S.E., Steele, R.J.: Disclosure limitation using perturbation and related methods for categorical data. J. Off. Stat. 14(4), 485–502 (1998)
12. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: Mining data streams: a review. SIGMOD Rec. 34(2), 18–26 (2005). https://doi.org/10.1145/1083784.1083789
13. Geiger, B.C.: Information loss in deterministic systems. Ph.D. Thesis, Graz University of Technology, Graz, Austria (2014)
14. Hung, C.C., Liu, H.C., Lin, C.C., Lee, B.O.: Development and validation of the simulation-based learning evaluation scale. Nurse Educ. Today 40, 72–77 (2016)
15. Jeyakumar, V., Li, G., Suthaharan, S.: Support vector machine classifiers with uncertain knowledge sets via robust optimization. Optimization 63(7), 1099–1116 (2014)
16. Jin, S., Yeung, D.S., Wang, X.: Network intrusion detection in covariance feature space. Pattern Recogn. 40(8), 2185–2197 (2007)
17. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. R. Soc. A 374(2065), 20150202 (2016)
18. Jones, D.G., Beston, B.R., Murphy, K.M.: Novel application of principal component analysis to understanding visual cortical development. BMC Neurosci. 8(S2), P188 (2007)
19. Lasko, T.A., Vinterbo, S.A.: Spectral anonymization of data. IEEE Trans. Knowl. Data Eng. 22(3), 437–446 (2010)
20. Lee, S., Habeck, C., Razlighi, Q., Salthouse, T., Stern, Y.: Selective association between cortical thickness and reference abilities in normal aging. NeuroImage 142, 293–300 (2016)
21. Lichman, M.: UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 1 Nov 2017
22. Little, R.J.: Statistical analysis of masked data. J. Off. Stat. 9(2), 407–426 (1993)
23. Liu, K., Giannella, C., Kargupta, H.: A survey of attack techniques on privacy-preserving data perturbation methods. In: Aggarwal, C.C., Yu, P.S. (eds.) Privacy-Preserving Data Mining, pp. 359–381. Springer, US (2008)
24. Muralidhar, K., Sarathy, R.: A theoretical basis for perturbation methods. Stat. Comput. 13(4), 329–335 (2003)
25. Murthy, S.K.: Automatic construction of decision trees from data: a multi-disciplinary survey. Data Min. Knowl. Discov. 2(4), 345–389 (1998)
26. Oliveira, S.R., Zaïane, O.R.: Achieving privacy preservation when sharing data for clustering. In: Jonker, W., Petković, M. (eds.) Workshop on Secure Data Management, pp. 67–82. Springer, Berlin Heidelberg (2004)
27. Qian, Y., Xie, H.: Drive more effective data-based innovations: enhancing the utility of secure databases. Manag. Sci. 61(3), 520–541 (2015)
28. Rubens, N., Elahi, M., Sugiyama, M., Kaplan, D.: Active learning in recommender systems. In: Ricci, F., Rokach, L., Shapira, B. (eds.) Recommender Systems Handbook, pp. 809–846. Springer, Boston (2016)
29. Sørensen, M., De Lathauwer, L.: Blind signal separation via tensor decomposition with Vandermonde factor: canonical polyadic decomposition. IEEE Trans. Signal Process. 61(22), 5507–5519 (2013)
30. Suthaharan, S.: Machine Learning Models and Algorithms for Big Data Classification: Thinking with Examples for Effective Learning, vol. 36. Springer, New York (2015)
31. Suthaharan, S.: Support vector machine. In: Machine Learning Models and Algorithms for Big Data Classification, pp. 207–235. Springer, US (2016)
32. Suthaharan, S., Panchagnula, T.: Relevance feature selection with data cleaning for intrusion detection system. In: Proceedings of IEEE SoutheastCon 2012, pp. 1–6. IEEE (2012)
33. Thrun, S., Pratt, L.: Learning to Learn. Springer, New York (2012)
34. Whitworth, J., Suthaharan, S.: Security problems and challenges in a machine learning-based hybrid big data processing network systems. ACM SIGMETRICS Perform. Eval. Rev. 41(4), 82–85 (2014)
35. Zarzoso, V., Nandi, A.: Blind source separation. In: Nandi, A. (ed.) Blind Estimation Using Higher-Order Statistics, pp. 167–252. Springer, US (1999)
36. Zumel, N., Mount, J., Porzak, J.: Practical Data Science with R, 1st edn. Manning, Shelter Island (2014)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

1. Department of Computer Science, University of North Carolina at Greensboro, Greensboro, USA
2. Department of Statistics, University of California, Irvine, USA
