Introduction

A generally established set of data-related properties is used to characterize and define big data, including volume, variety, velocity, variability, value, and complexity [1]. The volume of data is likely the best-known characteristic of big data, especially when datasets exceed 1 million instances. Variety is associated with big data because such analysis scenarios may require collectively handling structured, unstructured, and semi-structured data. Velocity reflects the high speed with which data is generated, collected, and refreshed in a big data scenario, along with the challenge of immediately storing fast-incoming data. The variability aspect of big data reflects issues embedded in this data, such as anomalies and inconsistencies in data dimensions due to disparate data sources and types. The value characteristic of big data is often seen as the most important attribute, because the results of big data analytics are expected to add or increase business value derived from the data. As briefly stated for the different V’s, the complexity aspect of big data is relatively obvious, and may include issues such as data transformation, relationships among data from different sources, and the veracity and possible volatility of the collected data. Traditional methods for mining and analyzing data that were not designed with big data in mind are generally unable to handle the various complexities associated with big data.

The minimum number of instances in a dataset that defines big data has not been established in the literature. As an example, in [2] that number was set as 100,000 instances. However, in our current work, we set that number as 1,000,000 instances [3, 4]. Globally, the increasing dependence on big data applications necessitates the development of efficient techniques for performing knowledge extraction on this type of data. While it is true that class imbalance affects both big and non-big data, the adverse effects are usually more pronounced in the former. Extreme degrees of class imbalance may exist within big data [5] due to massive over-representation of the negative (larger) class within datasets.

Any dataset containing majority and minority classes, e.g., normal traffic and malicious traffic flowing through a computer network, can be viewed as class-imbalanced. There are various degrees of class imbalance, ranging from slightly imbalanced to rarity. Rarity in a dataset involves comparatively inconsequential numbers of positive instances [6], e.g., the occurrence of 40 fraudulent transactions within an insurance claims dataset of 1,000,000 normal transactions. Binary classification is frequently utilized to focus on class imbalance because many non-binary (i.e., multi-class) classification problems can be addressed by transforming the given data into multiple binary classification tasks. The minority (positive) class, which makes up a smaller portion of the dataset, is usually the class of interest in real-world problems [2]. This is in contrast to the majority (negative) class, which makes up the larger portion of the dataset.

Compared to traditional statistical techniques, Machine Learning (ML) algorithms produce better classification results [7,8,9]; however, if the dataset suffers from severe class imbalance or data rarity, the algorithm (classifier) cannot effectively discriminate between the minority and majority classes. This inability to properly discriminate between the two classes is analogous to searching for the proverbial needle in a haystack and could cause the classifier to label almost all instances as the majority (negative) class, thus resulting in an accuracy performance metric value that is deceptively high. In cases where the occurrence of false negatives is costlier than false positives, a classifier’s prediction bias in favor of the majority class may incur adverse consequences [10]. For example, among a large, random sample of elderly male patients with enlarged prostates, very few are expected to have prostate cancer (minority class), while most (majority class) are expected to not have prostate cancer. A false negative in this case indicates that a patient with prostate cancer has been misclassified as not having this type of cancer, which is a potentially life-threatening error.

One approach for addressing class imbalance involves the generation of new datasets with class distributions that differ from that of the original dataset. To accomplish this, data scientists use data sampling, of which the main categories are undersampling and oversampling. Undersampling removes instances from the majority class, and if done randomly, the technique is known as Random Undersampling (RUS) [11]. Oversampling adds instances to the minority class, and if done randomly, the technique is known as Random Oversampling (ROS) [11]. Synthetic Minority Over-sampling TEchnique (SMOTE) [12] is an oversampling method that creates new artificial instances between minority instances lying relatively close to each other. RUS has been recommended over ROS and SMOTE because this undersampling technique imposes a lower computational burden and results in a faster training time, which is beneficial to data analytics [13]. Another approach uses Feature Selection (FS) to address class imbalance. The technique selects the most influential features that provide discriminative information between classes when the dataset suffers from class imbalance [14, 15]. This leads to a reduction of the adverse effects of class imbalance on classification performance [16,17,18].

In our paper, we present two case studies, each utilizing a combined approach of RUS and FS to investigate the effect of class imbalance on big data analytics. We used: (1) RUS to generate six different class distributions ranging from balanced to moderately imbalanced, and (2) Feature Importance (FI) as a method of FS. Classification performance was reported for the Random Forest (RF), Gradient-Boosted Trees (GBT), and Logistic Regression (LR) learners, as implemented within the Apache Spark framework. This framework, popular for its speed and scalability, performs in-memory data processing [2]. The first case study employs a training dataset and a test dataset from the Evolutionary Computation for Big Data and Big Learning (ECBDL’14) [19, 20] bioinformatics competition. The training and test datasets both have 631 features, and contain about 32 million instances and 2.9 million instances, respectively. For the first case study, GBT obtained the best results, with either a features-set of 60 or the full set, and a negative-to-positive class ratio of either 45:55 or 40:60. The second case study, unlike the first, includes training data from one source (POST dataset) [21] and test data from a separate source (Slowloris dataset) [22]. POST and Slowloris are two types of Denial of Service (DOS) attacks. The POST dataset contains 13 features and about 1.7 million instances, while the Slowloris dataset contains 11 features and about 0.2 million instances. For the second case study, LR obtained the best results, with a features-set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. We conclude that combining FS with RUS improves the classification performance of learners from different application domains.

To the best of our knowledge, our work is unique in investigating the combined approach of using RUS and FS to mitigate the effect that class-imbalanced big data has on data analytics. In addition, we demonstrate the effectiveness and versatility of our approach with case studies involving imbalanced big data from different application domains.

The remainder of this paper is organized as follows: “Related work” section provides an overview of literature related to the combined use of RUS and FS to address imbalanced big data; “Case study datasets” section presents the details of the ECBDL’14, POST, and Slowloris datasets; “Methodologies and empirical settings” section describes the different aspects of the methodology used to develop and implement our approach, including the big data processing framework, classification algorithms, FS, and performance metrics; “Approach for case studies experiments” section provides additional information on the case studies, including steps taken to remedy potential issues associated with the data; “Results and discussion” section presents and discusses our empirical results; and “Conclusion” section concludes our paper with a summary of the work presented and suggestions for related future work.

Related work

Our search for related literature extended as far back as 2010, with a focus on published works that use FS and RUS to address imbalanced big data. To the best of our knowledge, we were unable to find any studies that satisfied all our search parameters. There are several recent studies on the use of joint approaches of RUS and FS [23,24,25], but our focus on imbalanced big data with the combination of RUS and FS is unique, as explained in “Methodologies and empirical settings” section.

One related study used real-world data for Android malware app detection [26]. The default dataset contained about 425,000 benign and 8500 malicious apps (instances), a majority-to-minority ratio of 50:1. In our view, the study uses large data and not big data. The authors conducted six experiments, two of which are relevant to our research focus in this paper. Classification performance in their study was assessed with the k-nearest neighbor (k-NN) classifier.

The first of the two relevant experiments investigated the effect of feature extraction on classification performance. For feature extraction, the authors used two independent approaches: Drebin [27] and DroidSIFT [28]. The Drebin method reduced an emulated set of 1,000,000 features to 2246. With Drebin, the authors noted there was no change in True Positive Rate (\(\text {TP}_{\text {rate}}\)) (98.2%) and a decrease in False Positive Rate (\(\text {FP}_{\text {rate}}\)) from 1.5 to 0.1%. The DroidSIFT method reduced an emulated set of 1183 graph-based features to 192. With DroidSIFT, \(\text {TP}_{\text {rate}}\) and \(\text {FP}_{\text {rate}}\) increased from 90.6 and 18.8% to 95.6 and 22.1%, respectively. The second experiment examined the relationship between class imbalance and classifier performance. Starting with a 1:1 class ratio and 8500 instances for both the benign and malicious apps, the authors used undersampling to produce other majority-to-minority ratios, including 5:1, 10:1, 20:1, 50:1 and 100:1. The transition from a ratio of 1:1 to 100:1 resulted in a negligible increase in \(\text {TP}_{\text {rate}}\) and \(\text {FP}_{\text {rate}}\) from 96.1 and 5.7% to 96.2 and 5.8%, respectively. The authors concluded that: (1) FS is beneficial for the k-NN classifier, and (2) k-NN classification performance is affected by high-class imbalance.

A clear, noticeable shortcoming of [26] is the use of only one type of classifier. A broader selection of learners should be incorporated in future work. Another consideration for future work should be the incorporation of meaningful statistical analyses to evaluate the significance of comparative conclusions based on the models, such as ANalysis Of VAriance (ANOVA) and Tukey’s Honestly Significant Difference (HSD). In addition, replication of the relevant experiments may be difficult for several reasons: (1) No information is provided on the selected value(s) of k in the classifier; (2) The DroidSIFT tool is not publicly available; and (3) Clarification or further explanation is needed for several details, such as the basis for using a default dataset with 425,000 benign and 8500 malicious apps (a majority-to-minority ratio of 50:1). The fact that the study is based on large data is not a shortcoming per se, but a similar study on big data would be more valuable to contemporary, machine-learning research. Also, it is worth noting that the paper addresses a set of specific research questions, none of which directly involve proposed techniques for class-imbalance mitigation.

Another related study is based on a proposal to implement SMOTE within a distributed environment via Apache Spark [29]. With regards to small datasets, the researchers in [29] recognized that there is a standard implementation of SMOTE which supports Python, but distributed computing environments for large datasets cannot benefit from this implementation. To address this deficiency, the ECBDL’14 test dataset was split into a new training dataset and test dataset (80:20 split, respectively), with both sets having the same class imbalance ratio as the original test set. Performance metrics such as Area Under the Receiver Operating Characteristic Curve (AUC), Geometric Mean (GM), and Recall were then obtained for RF and Distributed RF classifiers.

We calculated True Positive Rate (\(\text {TP}_{\text {rate}}\)) \(\times\) True Negative Rate (\(\text {TN}_{\text {rate}}\)) scores for the no-sampling, SMOTE, and Distributed SMOTE approaches used in [29]. The proposed implementation, Distributed SMOTE, had the highest value of 0.169. Use of the ECBDL’14 test dataset as the only source of big data and a lack of comparison of Distributed SMOTE to other techniques, including RUS, are two limitations of this related work.

The dataset used in a third related study [30] contained about 17,000 positive instances and 9,514,000 negative instances, a majority-to-minority ratio of 560:1. This dataset is arguably the largest collection of de-identified patient records of real-world dermatology visits in the US [31], where positive instances represent melanoma cases and negative ones represent non-melanoma cases. The authors applied RUS to the dataset to produce a 65:35 majority-to-minority class ratio. The dataset was then randomly split 70:30 into training and test sets, respectively, and cross-validation, along with sensitivity and specificity across various thresholds, was used to select an optimal decision threshold. Results show that for the AUC metric, the RF classifier, with a score of 0.790, outperforms the Decision Tree (DT) and LR classifiers in predicting whether a dermatology patient has a high risk of developing melanoma.

One limitation of the work of [30] is the use of only one randomly undersampled ratio, the class distribution of 65:35. Our study uses RUS to produce six class distribution ratios. In addition, our study considers the joint application of RUS and FS, whereas this related work only uses RUS.

Finally, researchers in a fourth related study [32] proposed a cost-sensitive feature selection algorithm that incorporates a cost-sensitive fitness function and genetic algorithm to address class imbalance. Experiments were performed on network security data within the Weka framework. The class-imbalanced data, which was derived from KDDCUP’99 data [33], comprised 494,021 connections (instances) and 5 main categories of attacks (Normal, DOS, U2R, R2L, and PROBE). The selected classifiers for the work of [32] were k-NN and DT. Three metrics (AUC, F-measure, recall) were used to measure the performance of the proposed model, referred to as the CSFSG algorithm, against two other Feature Selection models. Results show that the CSFSG algorithm outperformed the other Feature Selection models, particularly with regards to the minority class.

We do not consider the work of [32] to be a genuine study of imbalanced big data because of the comparatively small number of instances involved (494,021 connections). Furthermore, results would be more meaningful if the CSFSG algorithm was compared against an established Feature Selection algorithm, such as the FI function of the RF classifier.

Case study datasets

The training and test datasets used in the first case study came from a different application domain than those used in the second case study. In both cases, the big data was preprocessed to construct the model and a comparatively smaller dataset was used to test the built model. The first case study utilized the ECBDL’14 dataset and a separate test dataset, both provided by the ECBDL’14 competition. In the second case study, the POST dataset was used for training and the Slowloris dataset for testing. The ECBDL’14 training and test datasets are considered high dimensional (631 features for both), while the POST and Slowloris datasets are not (13 and 11 features, respectively). Further details on the respective datasets are provided in the following subsections.

ECBDL’14

The ECBDL’14 dataset is a valuable resource for Protein Contact Map (CM) predictive analytics. This dataset was initially used to train a predictor for another competition, the 9th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP9) [20]. Proteins are chains of amino acid (AA) residues with the ability to fold into distinct shapes and structural features. Two residues of a chain are said to be in contact if their distance is less than a certain threshold [34, 35]. Experimenting with proteins is rather time-consuming [36]. Hence, Protein Structure Prediction (PSP) aims to predict the 3-dimensional structure of a protein based on its primary sequence.

To generate the ECBDL’14 dataset, a diverse set of 2682 proteins with known structure were collected selectively [37] from the Protein Data Bank (PDB) public repository. The obtained set was apportioned into training and test sets, with a 90–10% split. Instances were generated for pairs of AAs at least 6 positions apart in the chain. This process resulted in the generation of a large training set and test set, with the former having 31,992,921 instances (31,305,192 negatives and 687,729 positives) and the latter having 2,897,917 instances. For the training dataset, about 2% of the instances are in the minority class. The data generation process collected three main types of information (attributes), for a total of 631 features. For additional details, we refer the reader to [37], which explains all features, in addition to their construction details.

POST and Slowloris

DOS attacks are executed via several techniques designed to deny network availability to legitimate users [38]. The Hypertext Transfer Protocol (HTTP), which contains various exploitable vulnerabilities, is often targeted for DOS attacks [39, 40]. During a Slow HTTP POST attack, legitimate HTTP headers are sent to a target server [21]. The message body of the exploit must be the correct size for communication between the attacker and the server to continue. Communication between the two hosts becomes a drawn-out process as the attacker sends messages that are comparatively very small, tying up server resources. This effect is exacerbated if several POST transmissions are performed in parallel. Slowloris, a similar exploit, keeps numerous HTTP connections open for as long as possible [22]. During this attack, only partial requests are sent to a web server, and since these requests are never completed, the available pool of connections for legitimate users shrinks to zero.

Data collection for POST and Slowloris was done within a real-world network environment. An ad hoc Apache web server, set up within the campus network of Indian River State College (IRSC), served as the target. The Switchblade 4 tool from the Open Web Application Security Project (OWASP) and a Slowloris.py attack script [41] were used to generate attack packets for POST and Slowloris, respectively. Attacks were launched from a single host computer in hourly intervals. Attack configuration settings, such as connection intervals and number of parallel connections, were varied, but the same PHP form element on the web server was targeted during each attack. The resulting POST dataset contains 1,697,377 instances (1,694,986 negatives and 2391 positives) and 13 features. About 0.1% of instances are in the minority class, and the POST dataset is therefore considered severely imbalanced. Severely imbalanced data, also referred to as highly imbalanced data or high-class imbalance, is typically bounded between majority-to-minority class ratios of 100:1 and 10,000:1 [2]. The resulting Slowloris dataset contains 201,430 instances (197,175 negatives and 4255 positives) and 11 features. About 2% of instances are in the minority class.

Methodologies and empirical settings

Big data processing framework

Many tools and frameworks are commonly associated with big data processing. Data engineers package well-designed algorithms that lessen the need for customization by consumers of ML solutions in big data analytics. Such frameworks and tools play a significant role in providing speed, scalability, and reliability to ML with big data. Our work uses a state-of-the-art ML library, MLlib, provided by Apache Spark [42, 43], hereinafter referred to as Spark. Spark is one of the largest open source projects for big data processing [44]. Compared to traditional ML frameworks, it greatly increases data processing speed while remaining highly scalable.

In addition to Spark, we use the Apache Hadoop [45,46,47] environment, which consists of many big data tools and technologies, two of which are used in our work: the Hadoop Distributed File System (HDFS) [48] and Yet Another Resource Negotiator (YARN) [49]. HDFS is highly fault-tolerant and can be deployed on low-cost hardware. YARN splits up resource management and job scheduling by deploying a global ResourceManager and a per-application ApplicationMaster across the cluster’s master/slave nodes. The fundamental characteristic of Hadoop is the partitioning of data across many (potentially thousands of) hosts, and the parallel execution of application computations on that data.

Our study used the Hadoop and Spark frameworks and did not focus on the number of partitions within the distributed system. The speed with which Spark processes data in-memory gives it a significant advantage over Hadoop’s MapReduce [50] technology. We kept our data partitions invariant and memory use fixed during the experiments for performance stability. Thus, the number of distributed data partitions and the number of cluster slave nodes were chosen based on the available resources of our High Performance Computing (HPC) cluster.
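To illustrate how such a configuration can be held fixed, the following minimal PySpark sketch creates a Spark session with explicitly pinned executor memory and partition settings and reads a dataset from HDFS. The resource values, application name, and HDFS path are placeholders for illustration, not the settings used in our experiments.

```python
from pyspark.sql import SparkSession

# Hypothetical resource settings; actual values depend on the HPC cluster.
spark = (SparkSession.builder
         .appName("imbalanced-big-data")
         .config("spark.executor.memory", "8g")          # fixed memory per executor
         .config("spark.executor.instances", "16")       # fixed number of executors
         .config("spark.sql.shuffle.partitions", "200")  # fixed shuffle partitions
         .getOrCreate())

# Placeholder HDFS path; the real datasets are described in "Case study datasets".
df = spark.read.csv("hdfs:///data/train.csv", header=True, inferSchema=True)

# Keep the number of data partitions invariant across experiments.
df = df.repartition(200).cache()
print(df.rdd.getNumPartitions())
```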

Classifiers

To provide good coverage of more than one family of ML algorithms, our work utilized three learners: LR, RF, and GBT from the Spark MLlib implementation. Unless otherwise stated, we maintained the settings of the default parameters of these learners.

For both RF and GBT, which share several similar parameter settings, the number of trees generated in the training process was set to 100 [6, 51]. The Cache Node Ids parameter was set to True, and the maximum memory was set to 1024 megabytes (MB), to speed up the tree-building process. A very important factor to consider is the feature subset strategy, which defines the number of features to be selected randomly as split candidates at each tree node. The feature subset strategy was set to the square root of the number of features, based on Breiman’s suggestion for RF [52]. The subsampling rate, i.e., the fraction of the training data used for learning each decision tree, was set to 1.0 [6, 51]. Maximum Bins was set to the maximum number of categorical features for the datasets involved in this study, which was 2 for both case studies. We used the Gini index as the impurity measure for the Information Gain (IG) calculation.
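As an illustration only, the tree-learner settings described above can be expressed with Spark MLlib’s Python API roughly as follows. The feature and label column names are assumptions, and the number of trees for GBT is mapped to the maxIter (boosting iterations) parameter.

```python
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier

# Assumed column names for the assembled feature vector and binary label.
rf = RandomForestClassifier(
    featuresCol="features", labelCol="label",
    numTrees=100,                  # 100 trees per ensemble
    cacheNodeIds=True,             # speed up tree building
    maxMemoryInMB=1024,
    featureSubsetStrategy="sqrt",  # sqrt of the number of features per split
    subsamplingRate=1.0,
    maxBins=2,
    impurity="gini")               # Gini index for the information gain calculation

gbt = GBTClassifier(
    featuresCol="features", labelCol="label",
    maxIter=100,                   # number of boosting iterations (trees)
    cacheNodeIds=True,
    maxMemoryInMB=1024,
    featureSubsetStrategy="sqrt",
    subsamplingRate=1.0,
    maxBins=2)
```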

The LR max iteration parameter was set to 100. The Elastic Net Parameter, which lies in the range of 0 to 1, corresponds to \(\alpha\) and was set to 0, indicating an L2 penalty [6, 51]. For more information about the supported penalties, please refer to [42]. Note that Spark’s LR has a built-in standardization parameter, which was set to True, determining whether to standardize the training features before fitting the model.
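A corresponding sketch for the LR configuration, again with assumed column names and all other parameters left at their defaults:

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(
    featuresCol="features", labelCol="label",
    maxIter=100,
    elasticNetParam=0.0,   # alpha = 0 corresponds to an L2 penalty
    standardization=True)  # standardize the training features before fitting
```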

Performance metrics and model evaluation

Many studies with ML algorithms use a variety of performance criteria, with no fixed standard for evaluating the outcome of the trained classifiers. Some metrics, such as accuracy, may apply a naive 0.50 threshold to decide predictions between the two classes, which is often impractical, since in most real-world problems the target classes are imbalanced into minority and majority classes.

Our work uses the confusion matrix, shown in Table 1 [10], for the binary classification problem, where the class of interest is usually the minority (positive) class and the opposite class is the majority (negative) class. The applicable performance metrics are defined as follows:

  • True Positives (TP) are positive instances correctly identified as positive.

  • True Negatives (TN) are negative instances correctly identified as negative.

  • False Positives (FP), also known as Type I errors, are negative instances incorrectly identified as positive.

  • False Negatives (FN), also known as Type II errors, are positive instances incorrectly identified as negative.

  • True Positive Rate (\(\text {TP}_{\text {rate}}\)), also known as Recall or Sensitivity, is equal to TP/(TP + FN).

  • \(\text {TN}_{\text {rate}}\), also known as Specificity, is equal to TN/(TN + FP).

  • The final performance metric score considered in our work is given by \({\text {TP}_{\text {rate}}} \times {\text {TN}_{\text {rate}}}\) [19].

Table 1 Confusion matrix

Determining whether Type I or Type II errors are more serious in ML depends on the nature of the problem. If the cost of a Type II error is much higher than the cost of a Type I error, then aiming toward a favorable Type II error rate is the general goal, but without sacrificing the Type I error rate.
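For concreteness, the following sketch computes \(\text {TP}_{\text {rate}}\), \(\text {TN}_{\text {rate}}\), and their product from a Spark predictions DataFrame. The "label" and "prediction" column names are assumptions, with the positive (minority) class encoded as 1.0; this is an illustrative helper, not our exact evaluation code.

```python
def tpr_tnr_product(predictions, label_col="label", pred_col="prediction"):
    """Compute TPrate x TNrate from a DataFrame of labels and predictions."""
    counts = predictions.groupBy(label_col, pred_col).count().collect()
    cm = {(row[label_col], row[pred_col]): row["count"] for row in counts}

    tp = cm.get((1.0, 1.0), 0)   # positives correctly identified
    fn = cm.get((1.0, 0.0), 0)   # Type II errors
    tn = cm.get((0.0, 0.0), 0)   # negatives correctly identified
    fp = cm.get((0.0, 1.0), 0)   # Type I errors

    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0   # recall / sensitivity
    tn_rate = tn / (tn + fp) if (tn + fp) else 0.0   # specificity
    return tp_rate * tn_rate
```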

Random undersampling

Four RUS ratios, using the negative:positive class ratio order, were initially obtained: 90:10, 75:25, 65:35, and 50:50. These proportions were carefully selected to provide a good range between perfectly balanced and moderately imbalanced. In order to increase the product of \(\text {TP}_{\text {rate}}\) and \(\text {TN}_{\text {rate}}\) as much as possible, while keeping these two metrics close to each other to relatively balance the Type I and Type II errors, we used RUS to create other ratios where the positive class has more instances than the negative class [19]. Thus, we added two ratios, 45:55 and 40:60.
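A minimal sketch of how such negative:positive distributions can be produced with RUS in PySpark, assuming a binary "label" column with the positive class encoded as 1.0; the helper below is illustrative rather than our exact implementation.

```python
def random_undersample(df, neg_ratio, pos_ratio, label_col="label", seed=42):
    """Undersample the negative class to reach a neg:pos ratio such as 65:35."""
    positives = df.filter(df[label_col] == 1.0)
    negatives = df.filter(df[label_col] == 0.0)
    n_pos = positives.count()
    n_neg = negatives.count()

    # Number of negatives needed to achieve the requested class ratio.
    target_neg = int(n_pos * neg_ratio / pos_ratio)
    fraction = min(1.0, target_neg / n_neg)

    # sample() is approximate, so exact counts vary slightly between runs.
    sampled_neg = negatives.sample(withReplacement=False, fraction=fraction, seed=seed)
    return positives.union(sampled_neg)

# Example: create the 65:35 negative:positive distribution.
# train_65_35 = random_undersample(train_df, neg_ratio=65, pos_ratio=35)
```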

It is worth noting that the randomization process in RUS causes information loss from the given data corpus, which may create difficulties for ML classification performance. Because a sample of instances is usually a fraction of a larger body of data, the classification outcome could differ every time RUS is performed. Therefore, this randomness may result in splits that may be considered fair, favorable, or unlucky for the classifier. Unlucky splits, for instance, may retain noisy instances that degrade classification performance. Favorable splits, on the other hand, may retain very good or clean instances that cause the classifier to perform well, but potentially overfit the model.

Additionally, some ML algorithms, such as the RF and GBT learners used in our study, have inherent randomness built within their implementation. Moreover, due to the random shuffling of instances that we performed prior to each training process, other algorithms such as LR may output different results when the features and/or order of instances is changed.

One way to reduce some of the potential negative effects of randomness is by using repetitive methods [53]. To account for randomness during our sampling and model building, we performed several repetitions per built model.

Feature Selection

A key aspect of our study is determining an optimal number of features to select from class-imbalanced big data, for cases where class distribution ratios and application domains vary.

We selected subsets of features with FI, a function within the RF classifier that estimates the most relevant features. An analysis of FI can indicate which features are the most effective in training the RF model. This method is receiving increased attention in many ML-related fields, such as bioinformatics [54]. In our work, FI was used to engineer the available feature space by dropping features that may be considered irrelevant while building the classification model. Results obtained from the use of the new feature space were then compared with those where no Feature Selection was performed.

At each split in every tree of the RF model, the importance of the relevant feature is computed using IG, which is the difference between the parent node’s impurity and the weighted sum of the two child nodes’ impurities. Node impurity measures how well a tree splits the data [55]. The accumulation of IG over all trees for a given feature is the FI for that specific feature. The resulting FI vector across all trees is normalized so that the per-feature scores take values between 0 and 1 [56].
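A hedged sketch of how FI scores can be obtained from a trained Spark RF model and used to retain the top-k features is shown below; the DataFrame, column names, and cut-off value are assumptions for illustration.

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorSlicer

def select_top_k_features(train_df, k, features_col="features", label_col="label"):
    """Rank features with RF Feature Importance and keep the top-k indices."""
    rf = RandomForestClassifier(featuresCol=features_col, labelCol=label_col,
                                numTrees=100)
    rf_model = rf.fit(train_df)

    # featureImportances is a normalized vector (scores sum to 1 across features).
    importances = rf_model.featureImportances.toArray()
    top_k = sorted(range(len(importances)),
                   key=lambda i: importances[i], reverse=True)[:k]

    # Retain only the selected indices of the assembled feature vector.
    slicer = VectorSlicer(inputCol=features_col,
                          outputCol="selected_" + features_col, indices=top_k)
    return slicer.transform(train_df), top_k

# Example (hypothetical): keep the top 60 features, as in the ECBDL'14 case study.
# train_60, kept_indices = select_top_k_features(train_df, k=60)
```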

Another factor needing attention is categorical features, which do not fit into an ML regression model equation in their raw form. Additionally, if the categorical features were simply indexed, LR would assume that the indices imply a logical order or magnitude. Categorical features whose categories do have such an order are known as ordinal features. A nominal feature, unlike an ordinal one, is a categorical feature whose instances take values that cannot be organized in a logical sequence [7]. In our work, all categorical (nominal) features were transformed into dummy variables using a one-hot encoding method [57], allowing the conversion of nominal features into numerical values. A drawback of this method is that \(C-1\) new features are generated from each original feature, where C is the number of categories belonging to that feature; consequently, the total feature space increases in size.
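This transformation can be sketched in PySpark with StringIndexer followed by OneHotEncoder (Spark 3.x API); with the default dropLast=True, a feature with C categories yields C−1 dummy columns, as described above. The function and column names below are hypothetical.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

def one_hot_encode(df, nominal_cols):
    """Index each nominal column and expand it into C-1 dummy variables."""
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx") for c in nominal_cols]
    # dropLast=True (the default) keeps C-1 dummy columns per feature.
    encoder = OneHotEncoder(inputCols=[c + "_idx" for c in nominal_cols],
                            outputCols=[c + "_vec" for c in nominal_cols],
                            dropLast=True)
    return Pipeline(stages=indexers + [encoder]).fit(df).transform(df)

# Example (hypothetical column names):
# encoded_df = one_hot_encode(raw_df, nominal_cols=["protocol", "flag"])
```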

Fig. 1 (Feature Importance (FI) scores): FI scores (scaled between 0 and 1) generated for both original datasets. Horizontal dashed lines represent the cut-offs for Feature Selection, and the proportions of nominal, continuous, and one-hot encoded features are shown. a The feature with the highest score has a continuous value; features score below 0.04 and 0.004 after the top 60 and 120 features, respectively. b The feature with the highest score has a one-hot encoded value; all features with continuous values (8 in total) are included in the top 20 selected features

Figure 1 visualizes the FI scores (scaled between 0 and 1) generated by our study for both original datasets. The horizontal dashed lines represent the cut-offs for FS. The figure also shows the proportions of nominal, continuous, and one-hot encoded features. Figure 1a shows that the feature with the highest score has a continuous value. The features start to score below 0.04 and 0.004 after the top 60 and 120 features, respectively. In Fig. 1b, the feature with the highest score has a one-hot encoded value. Also, it is noticeable that all the features with continuous values (total of 8) are included in the top 20 selected features. For the POST case, we recognized the significance of the features of the POST dataset (explanation provided in “Approach for case studies experiments” section) and used twice as many FS cut-offs as in the ECBDL’14 case.

Approach for case studies experiments

Through our combined approach of RUS and FS, we investigated two case studies from different big data domains, in each case using a big data training set for constructing the model and a comparatively smaller test set for assessing the built model. Generally speaking, the two case studies used the same methodology. Additional details are provided in the following subsections.

Case study 1: ECBDL’14 dataset

Our method for the first case study is sequentially outlined in six steps: (1) Select subsets of features using the FI function of the RF learner; (2) Implement one-hot encoding; (3) Create six different distribution ratios with RUS; (4) Distribute the datasets; (5) Train with the GBT, RF, and LR learners; and (6) Perform model prediction against a separate test set. We performed FS prior to one-hot encoding of the features because the categorical features in this case study have, on average, four values that are well spread throughout the relevant attributes. Following this strategy prevents the loss of any valuable information from the categories during the FS stage. We performed FS based on the scores generated by FI, obtaining feature sets of 60 and 120 features. The full set (referred to as ALL) with no Feature Selection was also included. After one-hot encoding, the available feature space increased to 985.

Statistics for the sampled datasets that were generated from the ECBDL’14 dataset are shown in Table 2. The negative percentages in the table were calculated based on the negative (unsampled) total size of 31,305,192 instances.

Table 2 Generated sampled sizes with RUS

Case study 2: POST and Slowloris datasets

Our method for the second case study is also sequentially outlined in six steps: (1) Implement one-hot encoding; (2) Select subsets of features using the FI function of the RF learner; (3) Create six different distribution ratios with RUS; (4) Distribute the datasets; (5) Train with the GBT, RF, and LR learners; and (6) Perform model prediction against a separate test set (POST is used to build the model, Slowloris is used to test the model). Note that one-hot encoding was performed prior to FS in this second case study, and we provide an explanation for doing this in the following paragraph.

To build an ML model that performs well, it is usually advisable to train the model and test it against data that comes from the same target distribution. However, for the POST case study, we faced the challenge of the training and test data being sourced from different distributions. Therefore, special attention was given to the original feature spaces of the training and test sets. For instance, the training set had two features (session flags and attribute) that were not recorded in the test set. To ensure compatibility, those two features were not included in the training process. Moreover, the categorical features in the training set take protocol and flag values that were not observed in the test data, and vice versa. This can be problematic for the one-hot encoding method, as the dummy variable procedure extends values into new and different features. This issue can be solved in several ways, such as by ignoring the previously unobserved values when transforming the training and/or test sets, thus removing the instances that have these values. However, one reason for not following this solution is that the unobserved values represent a large number of instances in both the training and test sets. Therefore, to handle such values, we added them as dummy variables to both datasets and filled the entire dummy columns with zeros for each newly generated unobserved feature. After one-hot encoding, the available feature space increased to 72. Since FS occurs after one-hot encoding in this case study, the problem of unobserved features was effectively addressed. We performed FS based on the scores generated by FI, obtaining feature sets of 5, 10, 20, and 36 features. As in the first case study, the full set of features (referred to as ALL) with no Feature Selection was also included.
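The column-alignment step can be sketched as follows, operating on datasets that have already been expanded into explicit dummy columns; the helper and the DataFrames it receives are hypothetical, simplified stand-ins for our actual procedure.

```python
from pyspark.sql import functions as F

def align_dummy_columns(train_df, test_df):
    """Add zero-filled dummy columns so both DataFrames share the same schema."""
    train_only = set(train_df.columns) - set(test_df.columns)
    test_only = set(test_df.columns) - set(train_df.columns)

    for col in sorted(train_only):   # values unobserved in the test set
        test_df = test_df.withColumn(col, F.lit(0))
    for col in sorted(test_only):    # values unobserved in the training set
        train_df = train_df.withColumn(col, F.lit(0))

    # Enforce the same column order in both DataFrames.
    ordered = sorted(train_df.columns)
    return train_df.select(ordered), test_df.select(ordered)
```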

Another problem was the processing of missing values. Both training and test datasets had null values within their categorical features. As Spark ML cannot handle this issue automatically, we explored several options, including: (1) Imputing missing values, (2) Removing the instances that have them, and (3) Taking no action, which means addressing the issue through one-hot encoding. With the third option, a categorical feature containing missing values would have zeroes assigned to all newly generated dummy variables for every instance with missing values. We chose to leave the missing values intact to maintain consistency with the methodology of the first case study. Our future studies will explore the challenge of big data analysis with missing values in datasets.

Statistics for the sampled datasets that were generated from the POST dataset are shown in Table 2. The negative percentages in the table were calculated based on the negative (unsampled) total size of 1,694,986 instances.

Table 3 Empirical results
Fig. 2 (Distributions and Feature Selection scores): Average scores for each dataset, sampling ratio distribution, and learner, along with the number of selected features used to train the three models for each case study. a–c depict the ECBDL’14 case study and d–f depict the POST case study

Table 4 Two-way ANOVA test results
Table 5 Tukey HSD test—ECBDL’14 dataset
Table 6 Tukey HSD test—POST dataset
Fig. 3 (Performance scores per class ratio): Based on the tabulated results from the ANOVA and Tukey’s HSD tests, this figure shows the top three class ratio distributions for both original datasets. Box plots visualize the median (middle quartile or 50th percentile, shown as a thick line), two hinges (25th and 75th percentiles), two whiskers (error bars), and outlying points. The top three class distributions for the ECBDL’14 dataset with regards to performance were consistent for all three learners: the 40:60, 45:55, and 50:50 ratios, as shown in a–c. There was less consistency for the top three distributions of the POST dataset, as shown in d–f

Results and discussion

The results in Table 3 show the averages of 3 runs per built model in the ECBDL’14 case study and the averages of 10 runs per built model in the POST case study. Only 3 runs were necessary for the ECBDL’14 study because its training dataset is comparatively large and therefore has less statistical variance in the scores obtained from the sampled class distributions. All classifier performance results shown were obtained from the product of \(\text {TP}_{\text {rate}}\) and \(\text {TN}_{\text {rate}}\). We chose these metrics because they are collectively robust enough for the classification of imbalanced big data, where the product of \(\text {TP}_{\text {rate}}\) and \(\text {TN}_{\text {rate}}\) should be as high as possible, while keeping the two rates close to each other to balance (as much as possible) the Type I and Type II errors [19]. It is worth noting that using the original (unsampled) training data to build the models yielded \({\text {TP}_{\text {rate}}} \times \text {TN}_{\text {rate}}\) scores of 0, as the models failed to correctly classify any instances from the positive class. Thus, we do not present the “unsampled” results in our work. In Table 3, the highest value within each column (features-set) of each sub-table is italicized, and the highest value within each row (class distribution ratio) of each sub-table is underlined. In case study 1, the lowest score of 0 was reported for the 90:10 ratio with RF (all 3 features-set sizes), and the highest score of 0.4947 was reported for the 40:60 ratio with GBT (features-set of ALL). In case study 2, the lowest score of 0 was reported for the 90:10 ratio with RF (features-set of ALL), and the highest score of 0.9061 was reported for the 90:10 ratio with RF (features-set of 10).

In order to visualize the results shown in Table 3 for the number of selected features used to train the three models for each case study, we included Fig. 2. Figure 2a–c depicts the ECBDL’14 case study and Fig. 2d–f depicts the POST case study. For the ECBDL’14 case, the 3 learners have a similar characteristic curve, with the graphs of GBT and LR closely resembling each other. For the POST case, each learner has a unique set of graphs. The similarity of the curves for the ECBDL’14 case may be because the respective training and test datasets were sourced not only from the same application domain, but also the same available feature space.

Averaging the scores from repeated model building runs improves the statistical reliability of the results assigned to the models. In addition, to demonstrate the statistical significance of the observed experimental results, a hypothesis test is performed using ANOVA [58], with post hoc analysis via Tukey’s HSD test [59]. ANOVA is a statistical test that determines whether the means across the levels of one or several independent variables (or factors) differ significantly. Six rows of two-way ANOVA analysis are shown in Table 4, i.e., three rows for the ECBDL’14 case study and three for the POST case study. The two factors in the table are the number of selected features (referred to as F in the tables) generated by FI, and the sampling class distribution ratios (referred to as R in the tables) generated by RUS. We investigated the interaction of both factors to learn how they affect the respective learner (GBT, RF, LR). If the p-value in the ANOVA table is less than or equal to a certain significance level (0.05), the associated factor is significant. A significance level of \(\alpha = 0.05\) (i.e., a 95% confidence level) is the most commonly used value for ANOVA and other statistical tests.

A low F-value indicates that the variability of the group means is small relative to the variability within each group. A high F-value is required to reject the null hypothesis, meaning the variability of the group means is large relative to the variability within each group. Our two-way ANOVA analysis determined that there was a statistically significant difference between groups, with only one exception. When classification of the POST dataset was performed with LR, as shown in Table 4, the interaction of the Feature and Ratio factors had no significant effect on the \(\text {TP}_{\text {rate}} \times \text {TN}_{\text {rate}}\) score. However, this interaction proved significant for the GBT and RF learners with the POST dataset, and also for all three learners with the ECBDL’14 dataset.

An ANOVA test may show that the results are significant overall; however, it does not indicate exactly where those differences lie. Therefore, Tukey’s HSD test compares all possible pairs and identifies the factor-level means that are significantly different from each other. Differences are grouped by assigning letters (referred to as ‘g’ in the table column headings), with pairs that share the same letter being similar (i.e., showing no statistically significant difference). Tables 5 and 6 show the Tukey’s HSD tests for the ECBDL’14 and POST case studies, respectively. The levels of each factor were averaged and ordered from the highest to the lowest average. The tables also report maximum and minimum values, standard deviation, and quartiles, i.e., the first quartile (25th percentile), second quartile (50th percentile or median), and third quartile (75th percentile).
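For readers who wish to reproduce this style of analysis, the sketch below runs a two-way ANOVA with interaction and a Tukey HSD comparison using the Python statsmodels package on a hypothetical results table with columns "score", "features", and "ratio"; it illustrates the tests rather than our exact tooling.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical results table: one row per run (score, features-set, class ratio).
results = pd.read_csv("gbt_scores.csv")  # columns: score, features, ratio

# Two-way ANOVA with interaction between the Feature (F) and Ratio (R) factors.
model = ols("score ~ C(features) * C(ratio)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))

# Tukey's HSD on the class distribution ratio factor (alpha = 0.05).
tukey = pairwise_tukeyhsd(endog=results["score"], groups=results["ratio"], alpha=0.05)
print(tukey.summary())
```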

The Tukey’s HSD test revealed that selecting ALL features versus the top 120 features from the ECBDL’14 dataset to build the GBT models produced no statistically significant difference (i.e., they yielded similar performance). The test also showed that there are statistically significant differences among the ALL, 120, and 60 features-set cases for the RF and LR models. Moreover, the percentage differences between the class distributions of 40:60 and 45:55, and 45:55 and 50:50, were noticeably less than the percentage differences between the 90:10 and 75:25, 75:25 and 65:35, and 65:35 and 40:60 distributions. It should be noted that the 75:25 and 65:35 ratios are considered slightly imbalanced, whereas the 90:10 ratio is considered moderately imbalanced. The 90:10 ratio failed to classify any positive class instances with RF. For GBT and LR, this ratio yielded near-zero scores, correctly classifying barely any positive class instances.

Table 6a–c presents the Tukey’s HSD tests with regards to the POST case. These results show similar, but not exact, behavior with those from the ECBDL’14 case. Interestingly, five distribution ratios for LR share the same group letter.

Our ANOVA and Tukey’s HSD test tables show quantitative information and focus on the occurrence of statistical significance, i.e., on identifying factors that are sufficiently important to be worthy of attention. To provide additional insight into the results, we included Fig. 3. This figure uses box plots to visualize the median (middle quartile or the 50th percentile, shown as a thick line), two hinges (25th and 75th percentiles), two whiskers (also known as error bars), and outlying points. The length of the error bars helps reveal the variation of the scores, e.g., a short error bar indicates less variation. Outliers (represented by dots in the figure) are scores that are numerically distant from the rest of the data, i.e., scores located outside the whiskers.

The top three class distributions for the ECBDL’14 dataset with regards to performance were consistent for all three learners. These distributions are the 40:60, 45:55, and 50:50 ratios, as shown in Fig. 3a–c. From a performance perspective, there was less consistency for the top three distributions of the POST dataset, as shown in Fig. 3d–f. For instance, Fig. 3d maintained the same top three ranking for GBT when compared with all learners in the ECBDL’14 dataset. However, for Fig. 3e, which corresponds to RF in the POST dataset, the 90:10 and 75:25 ratios were unexpectedly among the top three distributions, and for Fig. 3f, which corresponds to LR in the POST dataset, the 65:35 ratio ranked third. It seems counterintuitive that moderately and slightly imbalanced ratios of 90:10 and 75:25, respectively, can have higher \({\text {TP}_{\text {rate}}} \times \text {TN}_{\text {rate}}\) values than the relevant 45:55 and 50:50 ratios.

For the ECBDL’14 dataset, the GBT selection of 40:60 and 45:55 ratios (group ‘a’) along with ALL and 120 features (group ‘a’) resulted in \({\text {TP}_{\text {rate}}} \times \text {TN}_{\text {rate}}\) values of 0.4908 (highest table value), 0.4887, 0.3449, and 0.3440, respectively, as shown in Table 5. Therefore, the optimal choice for this dataset involves a GBT learner with a combination of one of the ratios from group ‘a’ and one of the features-set sizes from group ‘a’.

With regards to the POST dataset, the LR selection of 40:60, 45:55, 50:50, 65:35, and 75:25 ratios (group ‘a’) along with 5 features (group ‘a’) resulted in \(\text {TP}_{\text {rate}} \times \text {TN}_{\text {rate}}\) values of 0.6218, 0.6420, 0.5740, 0.6122, 0.4818, and 0.8569 (highest table value), respectively, as shown in Table 6. Therefore, the optimal choice for this dataset involves an LR learner with a combination of one of the ratios from group ‘a’ and a features-set of 5.

Conclusion

This work uniquely investigates the combined approach of using RUS and Feature Selection to mitigate the effect that class imbalance (with varying degrees) has on big data analytics. Through our contribution, we demonstrate the effectiveness and versatility of our method with two case studies involving imbalanced big data from different application domains. Our results show that Feature Selection and class distribution ratios are important factors.

The best performer in the first case study was the Gradient-Boosted Trees classifier, with either a features-set of 60 or the full set (ALL) of 631 features, and a negative-to-positive ratio of either 45:55 or 40:60. The best performer in the second case study was the Logistic Regression classifier, with a features-set of 5 and any of the following negative-to-positive ratios: 40:60, 45:55, 50:50, 65:35, and 75:25. It is noteworthy that Logistic Regression, with the features-set of only 5, was a top performer despite the fact that Feature Selection was performed by the Feature Importance function within Random Forest, a different learner. Since the 45:55 and 40:60 class distributions appear in the top selections for both case studies, we conclude that using RUS to generate class distribution ratios in which the positive class has more instances than the negative class is more likely to be beneficial for big data analytics with imbalanced datasets. This conclusion is particularly applicable for class-imbalanced cases where the test dataset is small relative to the training dataset.

Future work with big data using our combined approach of RUS and Feature Selection will involve additional performance metrics, such as Area Under the Precision-Recall Curve (AUPRC) and AUC, data analysis with missing values in datasets, and the investigation of data from other application domains.