
1 Introduction

Big Data applications have grown significantly in recent years [23]. The volume of information, the velocity of data transfer, and the variety of data are the main characteristics of the Big Data paradigm [8, 10]. Concerning the volume of information, a data set belongs to the Big Data scale when it is difficult to process with traditional analytical systems [18].

In order to efficiently exploit the large amount of information from Big Data applications, Deep Learning techniques have become an attractive alternative, because these algorithms generally obtain better results than traditional machine learning methods [12, 20]. The Multi-Layer Perceptron (MLP), the most common neural network topology, has also been translated to the Deep Learning context [14].

The Deep Learning MLP (DL-MLP) incorporates two or more hidden layers in its architecture [11], which increases the computational cost of processing large, high-dimensional data sets. However, this disadvantage can be overcome by using modern, efficient frameworks such as Apache Spark [24] or TensorFlow [1]. Thus, the high performance, robustness to overfitting, and high processing capability of these deep neural networks can be exploited.

Nevertheless, deep learning algorithms are strongly affected by the class imbalance problem [6]. The class imbalance problem refers to situations where the number of samples in one or more classes of the data set is much smaller than in another class (or classes), producing an important deterioration of the classifier performance [5]. In the literature, many investigations dealing with this problem have been documented, Random Over-Sampling (ROS), Random Under-Sampling (RUS), and the Synthetic Minority Over-sampling Technique (SMOTE) being the most popular methods [15]. Although the results are not conclusive for the specific application at the Big Data scale, they have motivated the development of other over-sampling methods [13, 17].

The KDD CUP 1999 intrusion detection data set (KDD99) was introduced at the Third International Knowledge Discovery and Data Mining Tools Competition [4]. It consists of more than 4 million instances (with 41 attributes each) and is divided into twenty-three types of attacks clustered in four categories; therefore, it is formally considered Big Data [22]. Some attacks in KDD99 have fewer than ten instances; i.e., the data set is highly imbalanced and some classes are poorly represented, which poses a Big Data challenge with the class imbalance problem [15, 18].

Previous works have focused on the study of the KDD99 data set to test different machine learning techniques. Nevertheless, most of them have used only a subset of it [22]. For example, in [23] KDD99 was divided into four two-class data sets, and the class imbalance problem was addressed with parallel models of evolutionary under-sampling methods based on the MapReduce paradigm. Seo et al. [22] used a KDD99 subset of five classes: four of them were the attack categories and the fifth class was the normal connections; then, a wrapper method was proposed to find the best SMOTE ratio by identifying the best level of sampling for the minority classes.

In this paper, the whole KDD99 data set was analyzed, using all twenty-three attacks as classes, with the aim of studying the performance of classical over-sampling approaches, such as ROS and SMOTE, in the Big Data class imbalance context, with the Deep Learning MLP as the base classifier.

2 Theoretical Framework

2.1 Deep Learning Multilayer Perceptron

The MLP constitutes the most conventional neural network architecture. It is commonly based on three layers: input, output, and one hidden layer [14]. The MLP can be translated into a deep neural network by incorporating two or more hidden layers within its architecture, becoming a Deep Learning MLP. This reduces the number of nodes per layer and uses fewer parameters, but it leads to a more complex optimization problem [11]. However, due to the availability of more efficient frameworks, such as Apache Spark or TensorFlow, this disadvantage is less restrictive than before.

Traditionally, the MLP has been trained with the back-propagation algorithm (which is based on stochastic gradient descent), with its weights randomly initialized. However, in recent versions of DL-MLPs, the hidden layers are pre-trained by an unsupervised algorithm and the weights are then optimized by the back-propagation algorithm [14].

The MLP uses sigmoid activation functions, such as the hyperbolic tangent or the logistic function. In contrast, the DL-MLP commonly includes the Rectified Linear Unit (ReLU), \(f(z) = \max(0, z)\), because it typically learns much faster in networks with many layers, allowing a DL-MLP to be trained without unsupervised pre-training.
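To make the contrast concrete, the following minimal Python sketch (using NumPy; the function names are ours, for illustration only) evaluates the logistic, hyperbolic tangent, and ReLU activations on the same inputs:

```python
import numpy as np

def logistic(z):
    """Logistic sigmoid: saturates for large |z|, which slows gradient-based learning."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: identity for z > 0, zero otherwise."""
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(logistic(z))  # values squashed into (0, 1)
print(np.tanh(z))   # hyperbolic tangent, squashed into (-1, 1)
print(relu(z))      # [0.  0.  0.  0.5 2. ]
```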

There are three variants of gradient descent that differ in how much data is used to compute the gradient of the objective function [21]: (a) Batch Gradient Descent calculates the gradient of the cost function with respect to the parameters for the entire training data set, (b) Stochastic Gradient Descent performs a parameter update for each training example, and (c) Mini-batch Gradient Descent takes the best of the previous two and performs the update for each mini-batch of a given number of training examples.
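The following sketch illustrates the mini-batch variant on a simple least-squares objective; the function and its hyperparameter values are illustrative assumptions, not the configuration used in this paper. Note that batch_size = 1 recovers the stochastic variant and batch_size = n recovers the batch variant:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=10):
    """Mini-batch gradient descent on a least-squares objective (illustrative only)."""
    rng = np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))           # reshuffle the data each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of the mean squared error on this mini-batch only.
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad
    return w
```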

The most common gradient descent optimization algorithms are: (a) Adagrad, which adapts the learning rate of the parameters, making bigger updates for infrequent parameters and smaller ones for frequent parameters; (b) Adadelta, an extension of Adagrad that seeks to reduce its aggressive, monotonically decreasing learning rate by restricting the accumulation of past gradients to a window of fixed size instead of accumulating all previous gradients; and (c) Adam, which computes adaptive learning rates for each parameter and stores an exponentially decaying average of past gradients. Other important algorithms are AdaMax, Nadam, and RMSprop [21].
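In TensorFlow 2.x, for instance, these optimizers are exposed as Keras classes; a minimal sketch is shown below (the learning rates are the library defaults, not tuned values):

```python
import tensorflow as tf

# Instantiating the optimizers discussed above via the Keras API (TensorFlow 2.x).
adagrad  = tf.keras.optimizers.Adagrad(learning_rate=0.001)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=0.001)
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)   # used as the training method in Sect. 3
rmsprop  = tf.keras.optimizers.RMSprop(learning_rate=0.001)
```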

2.2 Classical Sampling Methods Used to Deal with the Class Imbalance Problem

The class imbalance problem has been a hot topic in machine learning and data mining, and more recently in deep learning and Big Data [7, 15]. Over-sampling methods (mainly ROS and SMOTE) are the most common techniques used to face the class imbalance problem, mainly due to their independence from the underlying classifier [17]. ROS replicates samples of the minority class, biasing the discrimination process to compensate for the class imbalance, while SMOTE generates artificial samples for the minority class by interpolating existing instances that lie close together [9].
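As an illustration, both methods can be applied with a few lines of Python. The sketch below uses the imbalanced-learn library (a scikit-learn-compatible package, assumed here as the implementation) on a synthetic imbalanced data set standing in for KDD99:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE
from sklearn.datasets import make_classification

# Toy three-class imbalanced data set (sizes and seed are illustrative).
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=5,
                           weights=[0.9, 0.07, 0.03], random_state=0)
print(Counter(y))  # e.g. {0: 900, 1: 70, 2: 30}

# ROS: replicate minority samples.  SMOTE: interpolate synthetic neighbors.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)
X_sm,  y_sm  = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_ros), Counter(y_sm))  # all classes balanced to the majority size
```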

Table 1. A brief summary of main characteristics of the KDD99 data set.

Under-sampling methods have also shown effectiveness in dealing with the class imbalance problem [13]: the RUS technique is one of the most successful under-sampling methods; it eliminates random samples from the original data set (usually from the majority class) to decrease the class imbalance. However, this method loses effectiveness when it removes significant samples [17]. To compensate for this disadvantage, other important under-sampling methods include a heuristic mechanism [13].
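Continuing the sketch above, RUS is likewise available in imbalanced-learn (again an assumption about the implementation):

```python
from imblearn.under_sampling import RandomUnderSampler

# RUS: randomly discard majority-class samples until the classes are balanced.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)
```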

Lately, Dynamic Sampling Methods have become an interesting alternative for sampling class-imbalanced data sets because they automatically set the class imbalance sampling rate [2] and select the best samples to train the classifier [16]. The key idea of these methods is that they use the neural network output to identify those samples that are either close to or inside the decision regions of other classes; i.e., in the decision frontier or class overlap region.

3 Experimental Set-Up

The KDD99 data set, which is available from the University of California at Irvine (UCI) machine learning repository [4], was used in the experimental stage. It contains about 4 million instances with 41 attributes each.

In order to deal with the Big Data multi-class imbalance problem, all twenty-three attacks of the KDD99 data set were defined as classes for this investigation. The hold-out method was used to randomly split the KDD99 data set into training (70%) and test (30%) subsets. Table 1 shows a brief summary of the main characteristics of the KDD99 data set.
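A hold-out split of this kind can be obtained with scikit-learn; in the sketch below, X and y are placeholders for the KDD99 attributes and the twenty-three class labels, and the random seed is an illustrative assumption:

```python
from sklearn.model_selection import train_test_split

# 70% training / 30% test random hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)
```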

The main goal of this paper is to show the performance of classical over-sampling approaches (ROS and SMOTE) in dealing with the Big Data class imbalance problem. SMOTE and ROS were selected because they have proven successful in dealing with the multi-class imbalance problem, and SMOTE is even considered the "de facto" standard in the framework of learning from imbalanced data [9]. Thus, the scikit-learn library was used to perform the SMOTE and ROS algorithms. Scikit-learn is a free machine learning library for the Python programming language [19].

Two hidden layers were used in the DL-MLP, with ReLU activation functions in their nodes and the softmax function in the output layer. Each hidden layer contained 30 nodes. The number of hidden layers and nodes was obtained by a trial-and-error strategy. The DL-MLP was implemented in the TensorFlow framework [1], and the Adam algorithm [21] was used as the training method.
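A minimal TensorFlow/Keras sketch of this architecture is shown below; the layer sizes, activations, and optimizer follow the description above, while the loss function, label encoding, and training hyperparameters are illustrative assumptions:

```python
import tensorflow as tf

# DL-MLP: 41 inputs (KDD99 attributes), two hidden layers of 30 ReLU units,
# and a 23-way softmax output (one unit per attack class).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(30, activation="relu", input_shape=(41,)),
    tf.keras.layers.Dense(30, activation="relu"),
    tf.keras.layers.Dense(23, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # labels as integers 0..22
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10, batch_size=128)  # assumed settings
```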

Table 2. Back-propagation classification performance. The results represent the values averaged over ten folds and ten different weight initializations of the neural network. The bold numbers represent the best average MAUC values.

The metrics most widely used in investigations facing multi-class imbalance problems have been the Multi-class Area Under the receiver operating characteristic Curve (MAUC) [2] and the Geometric Mean of Sensitivity and Precision (g-mean) [25]. However, these are global metrics, and evidence of the individual performance of ROS and SMOTE on the minority classes is more interesting for this paper; thus, the accuracy by-class was used instead.
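The accuracy by-class (the fraction of each class's samples that are correctly classified) can be computed from the confusion matrix; a minimal sketch using scikit-learn is:

```python
from sklearn.metrics import confusion_matrix

def accuracy_by_class(y_true, y_pred):
    """Per-class accuracy: the diagonal of the confusion matrix (correct
    predictions per class) divided by each row's total (samples per class)."""
    cm = confusion_matrix(y_true, y_pred)
    return cm.diagonal() / cm.sum(axis=1)
```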

Finally, in order to compute the general classification performance, the Ranks method was used. It assigns rank 1 to the best algorithm, 2 to the second best, 3 to the third best, and so on; if ties exist, the average rank is assigned. The lower the rank, the better the algorithm's performance.
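A sketch of this ranking scheme, using scipy.stats.rankdata with average tie-breaking on made-up accuracy values, is:

```python
import numpy as np
from scipy.stats import rankdata

# Rank three methods on each class by accuracy (rank 1 = best, ties averaged),
# then average the ranks over classes. The accuracy values are invented.
acc = np.array([[0.99, 0.99, 0.98],   # class 1: Standard, ROS, SMOTE
                [0.60, 0.80, 0.85],   # class 2
                [0.50, 0.50, 0.70]])  # class 3
ranks = rankdata(-acc, axis=1, method="average")  # negate so higher accuracy ranks first
print(ranks.mean(axis=0))  # average rank per method; lower is better
```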

4 Results and Discussion

Table 2 shows the accuracy by-class obtained by SMOTE and ROS for each individual class. It is organized in three parts: the first column indicates the evaluated class, the second column shows the number of correctly classified samples and the total number of samples belonging to that class, and the third column gives the average accuracy by-class. This structure is repeated for each sampling method: Standard (unsampled), ROS, and SMOTE.

It is noticeable in Table 2 that some minority classes, like back, teardrop, and pod, seem unaffected by the class imbalance problem. Another example is the class imap, which is very poorly represented, yet the DL-MLP correctly classifies three of its four samples. This confirms the findings of other works, which state that the class imbalance problem mainly aggravates the major disadvantage of back-propagation-based algorithms, i.e., the slow convergence rate of the neural network, and that it is often, but not always, the cause of the classifier's poor performance [3].

It is also observed that, for a few minority classes, the classifier's accuracy by-class is not improved by applying the ROS or SMOTE methods. For example, the accuracy of the class multihop is not increased by ROS. The accuracies of the classes spy, loadmodule, and rootkit are not improved by either ROS or SMOTE. Moreover, the classifier performance on the minority class ipsweep was reduced when ROS or SMOTE was applied. This could be caused by an increase in noise or overlap in these minority classes when they are sampled.

Within the machine learning community, it is known that the class imbalance problem is severely aggravated by other factors, such as class overlapping, small disjuncts, the lack of density and information, noisy data, the significance of borderline samples and their relationship with noisy samples, and the data set shift problem [17].

All of these classes have a common feature: they are severely imbalanced, and the origin of this imbalance comes from different sources. Thus, an important question is how to deal with this problem. Perhaps the solution is not only over-sampling the minority classes, but also heuristically under-sampling the majority classes close to severely imbalanced minority classes, in a similar way to [3]; then, an effective over-sampling method could be applied. However, another problem then arises: how to identify the decision frontier of those minority classes. The use of the neural network output could be an interesting alternative [2].

Table 2 also shows that, overall, the sampling methods improve the classifier performance in comparison with the unsampled data set. The average ranks of both SMOTE (1.87) and ROS (1.95) represent better results than the standard rank (2.18).

In the Big Data context, the results in Table 2 confirm the conclusions of other investigations, which state that the class imbalance problem adversely affects classifier performance, although in some situations it is not the main cause of the classifier's loss of effectiveness. In other words, the class imbalance problem in Big Data follows a behavior similar to that studied so far by the machine learning community.

5 Conclusion

In this paper, the performance of two successful methods to deal with the multi-class imbalance problem, ROS and SMOTE, was analyzed. Results show that, in the Big Data multi-class imbalance context, ROS and SMOTE are not always enough to improve the classifier performance on the minority classes. However, these over-sampling methods increase the DL-MLP accuracy in most cases. A cleaning stage is considered necessary before applying either SMOTE or ROS, and the neural network output could be a good alternative for this stage. Thus, further research is required to investigate the potential of recent dynamic sampling methods [2, 16], which use the neural network output to identify and delete samples from majority classes that are close to or inside the minority classes' decision regions. Subsequently, the use of SMOTE or ROS would improve the classification performance on these minority classes.