1 Introduction

The growing quantity and complexity of cybersecurity threats is making it essential for organizations of all sizes to improve the protection of their digital assets [1]. Network intrusion detection systems monitor the traffic of an enterprise computer network and identify potentially harmful behavior that could threaten the integrity, confidentiality, or availability of computer resources [2]. These systems can use artificial intelligence (AI), and more specifically machine learning (ML) models, to automatically perform network traffic analysis and anomaly detection [3, 4].

However, training ML models requires high-quality data that correctly represents the network activity of an organization. Although publicly available datasets contain valuable network traffic flows, they often also include redundant features with noisy and missing data, which make the models less robust and slower [5, 6]. Therefore, to improve the detection performance and computational efficiency of the models, it is necessary to carefully analyze multiple datasets and select only the most relevant features for a cyber-attack detection task [7, 8].

Furthermore, the trained ML models must also perform well against adversarial examples, which are malicious traffic flows that contain specialized perturbations to be misclassified as benign [9]. Adversarial attacks can use these perturbations to deceive ML models, so it is also important to select the features that provide the best defense against such attacks [10]. By benchmarking the robustness of different ML models with different feature sets of different datasets, it is possible to identify the most suitable approaches for the computer networks of distinct organizations [11, 12].

This work presents the selection of the most relevant features of multiple network intrusion detection datasets, which can then be used to simultaneously improve the robustness and computational efficiency of ML models in the cybersecurity domain. Five feature selection methods (information gain, chi-squared test, recursive feature elimination, mean absolute deviation, and dispersion ratio) were applied to the original CICIDS2017 dataset, a corrected version of it designated as NewCICIDS, the original HIKARI21 dataset, and an improved version of it designated as NewHIKARI.

For each dataset, two different feature sets were used to train ML models, one with only time-related characteristics and another with more specifically selected relevant features. Four types of ML models, random forest (RF), extreme gradient boosting (XGB), light gradient boosting machine (LGBM), and explainable boosting machine (EBM), were trained with regular training and adversarial training processes. Finally, the adversarial robustness benchmark carried out in [13] was extended to analyze the reliability of the different feature sets and their impact on the susceptibility of the models to adversarial examples of malicious network traffic flows.

The present paper is organized into multiple sections, meant to enable researchers to replicate this feature selection process for other datasets and perform trustworthy comparisons with the results of future studies. Section 2 provides a survey of previous work using feature selection methods for cybersecurity. Section 3 describes the datasets, the selection methods, the ML models, and the benchmark methodology. Section 4 presents the obtained feature sets and the most relevant features. Section 5 presents an analysis of the obtained results in the benchmark. Lastly, Section 6 addresses the main conclusions and future research topics.

2 Related work

To perform a reliable feature selection process and a trustworthy benchmark of ML models, it is important to understand the results and conclusions of previous work addressing network intrusion detection. Due to the complex tabular data structure of network traffic flows, where several features may be correlated, the presence of a given value at a given feature may restrict the values that other features can have [14]. Consequently, the search for optimal performance from ML models has been an ongoing quest, with challenges related to large quantities of noisy data, missing data, and lack of data diversity [15]. Over the last few years, several studies have used statistical methods to select the most relevant features and discard the redundant ones. The most relevant studies that use network traffic flows are described below.

A relatively recent study [16] demonstrated that a decision tree classification model can reach an accuracy of 80.60% and an F1-score of approximately 80% on the NSL-KDD dataset using only a reduced set of features carefully selected with the information gain method and the chi-squared statistical test. In another study [17], an ensemble approach employing techniques such as information gain, gain ratio, chi-square, and symmetric uncertainty was used to obtain a simplified set of 9 predictive features, with which an RF achieved a very high accuracy of 98.90% on an Internet-of-Things (IoT) dataset for cyber-attack detection.

Another approach featured the use of statistical tools such as the Spearman rank correlation coefficient and the chi-squared statistic, together with a decision tree classifier [18]. This strategy led to a substantially reduced number of features, producing an F1-score of 99.87% for binary classification and 99.49% for multi-class classification. Delving further into feature selection, a two-phase methodology was proposed by [19], combining information gain with recursive feature elimination. This process resulted in the identification of 16 features from an IoT dataset, achieving a detection accuracy of 99.80% with a deep neural network.

With the aim of improving the learning performance of RF classifiers, researchers have explored new approaches [20], by integrating information gain to calculate the value of each feature and the relief algorithm to calculate each feature weight, which resulted in improved ML model performance on the NSL-KDD dataset. The search for efficient intrusion detection systems has persisted, as demonstrated by [21]. Using recursive feature elimination together with an RF on the CICIDS2017 dataset, the study identified 4 crucial features out of the original 80, demonstrating the potential for simple and efficient intrusion detection systems. This approach resulted in an accuracy rate of 91% when applied to a deep neural network.

Using the recursive feature elimination method on the NSL-KDD dataset, the study [22] selected the 13 best features, which were then used in ML models such as RF, decision tree, K-nearest neighbors, and Naïve Bayes. This approach achieved an average accuracy between 98 and 99%. Efforts were also made to find a balance between feature reduction and prediction accuracy [23]. Using recursive feature elimination and cross-validation, the researchers identified an optimal set of 15 features. This selection led to the development of an anomaly-based intrusion detection system, based on an RF, with an accuracy of 95.30%.

The search for computational efficiency remained a primary concern, and the CICIDS2017 dataset was used in a study [24] that applied the information gain method. The results of this approach were later applied to a wide range of ML models, resulting in a 99.86% accuracy achieved by an RF classifier using only the 22 most relevant features. In contrast, the J48 classifier achieved a slightly higher accuracy of 99.87%, although with the disadvantages of requiring a larger set of 52 features and having a longer execution time.

Overall, the recent studies have demonstrated the effectiveness of statistical feature selection methods to improve the performance of ML models for network intrusion detection. However, as new cyber-attacks and adversarial attacks are encountered, it is essential to analyze the features of more up-to-date datasets and how they impact the robustness of ML models [25]. To the best of our knowledge, no previous work has analyzed how the time-related characteristics and more specifically selected features of the considered datasets affect the robustness of RF, XGB, LGBM, and EBM against adversarially perturbed network traffic flows.

3 Methods

This section describes the datasets, the selection methods, the ML models, and the benchmark methodology. The work was carried out on a machine with 16 GB of RAM, a 6-core CPU, and a 4 GB GPU, which are reasonably common computation resources. The implementation relied on the Python programming language and the following libraries: numpy and pandas for general data manipulation, xgboost for the implementation of XGB, lightgbm for LGBM, interpret for EBM, and scikit-learn for the implementations of RF and of the feature selection methods.

3.1 Datasets and selection methods

Due to their value for binary network traffic classification, four standard network traffic flow datasets were considered for the feature selection process: CICIDS2017, NewCICIDS, HIKARI21, and NewHIKARI. Their main characteristics are described below.

The CICIDS2017 [26] dataset is a widely used dataset that contains common cyber-attacks performed in an enterprise computer network. It includes multiple captures of benign activity and several types of probing, brute-force, and DoS attacks, which were recorded in 2017 in a heterogeneous testbed environment with 12 interacting machines. The network traffic flows were converted to a tabular data format using the CICFlowMeter [27] tool, provided by the Canadian Institute for Cybersecurity. The combined Tuesday and Wednesday traffic captures resulted in a total of 872105 data samples of the benign class and 266507 of the malicious class.

A corrected version of this dataset has been created to provide more realistic network traffic flows, which is designated as NewCICIDS [28, 29] in this work. Even though CICIDS2017 continues to be used as a standard benchmark dataset to compare the performance of novel ML models with baseline models from previous studies, some discrepancies have been noticed in a portion of the attack vectors it contains. The corrected version addressed this issue by correcting most of the samples, although it has a reduced size, with 638432 benign and 106538 malicious samples.

The more recent HIKARI21 [30] dataset is being adopted in various studies because it includes cyber-attacks that have emerged in more recent years. It contains probing and brute-force attacks, as well as benign background traffic of the normal operation of an enterprise computer network that uses the HTTPS communication protocol to encrypt network traffic. The data was recorded in 2021 to tackle the lack of datasets containing application-layer attacks on encrypted traffic, using features similar to those of CICIDS2017. The resulting network flows correspond to 517582 benign samples and 37696 malicious samples.

An improved version of HIKARI21 has been released by its authors, which is designated as NewHIKARI [31] in this work. It contains a slightly lower number of benign samples, 214904, and almost a third of the malicious samples, 13349. Despite the reduced number of network traffic flows recorded in this dataset, the data samples represent more cyber-attack variations to include more recent cybersecurity threats. NewHIKARI has a higher class imbalance than the previous three datasets, representing more realistic conditions for enterprise-scale network intrusion detection.

The four datasets required a data preprocessing stage. In addition to creating stratified training and holdout sets with 70% and 30% of each dataset, it was necessary to select relevant and unbiased features that correctly represented the network activity. To obtain the feature importance rankings of the more than 80 features of each of these datasets and identify the most impactful ones, five methods were considered: information gain, chi-squared test, recursive feature elimination, mean absolute deviation, and dispersion ratio. These selection methods are detailed below.
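The stratified split described above can be sketched as follows; the feature and label arrays are illustrative stand-ins for the preprocessed flow data, and only the 70%/30% partition is reproduced:

```python
# Sketch of the preprocessing split: a stratified 70%/30%
# training/holdout partition that preserves the class ratio.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 5))               # stand-in for the flow features
y = np.array([0] * 900 + [1] * 100)     # imbalanced benign/malicious labels

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Stratification keeps the malicious-class proportion identical
# in both partitions (10% here).
print(y_train.mean(), y_hold.mean())
```

Without `stratify=y`, a random split of such imbalanced data could place too few malicious samples in either partition.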

Information gain

The concept of information gain is widely used in information theory to quantify the improvement in predictive ability [32]. It evaluates the reduction in uncertainty when a feature is included, according to the difference in entropy before and after considering that feature [33]. The mutual information method was utilized with the number of neighbors set to 3, to prevent the introduction of bias. The information gain of a feature \(X\), \(IG\left(X\right)\), is mathematically defined as

$$IG\left(X\right)=H\left(Y\right)-H\left(Y|X\right)$$

where \(H\left(Y\right)\) is the entropy of the target and \(H\left(Y|X\right)\) is the conditional entropy of the target given \(X\).
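A minimal sketch of this ranking step, using scikit-learn's mutual information estimator with the number of neighbors set to 3 as described above (the arrays are illustrative stand-ins):

```python
# Information gain (mutual information) ranking of tabular features.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((500, 4))
y = (X[:, 0] > 0.5).astype(int)     # only feature 0 determines the label

# n_neighbors=3 matches the configuration reported in the paper.
ig = mutual_info_classif(X, y, n_neighbors=3, random_state=42)
ranking = np.argsort(ig)[::-1]      # most informative feature first
print(ranking[0])                   # feature 0 should rank highest
```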

Chi-squared test

The chi-squared statistical test is commonly used to assess the degree of dependence between a term and a class. The resulting statistic is then compared to the chi-squared distribution with one degree of freedom for analysis [34]. Considering a term \(t\) and a class \(c\), it can be represented as

$${X}_{\left(t,c\right)}^{2}= \frac{S\times {(PN-MQ)}^{2}}{(P+M)\times (Q+N)\times (P+Q)\times (M+N)}$$

where \(S\) is the total number of data samples, \(P\) is the count of samples within class \(c\) that contain \(t\), \(Q\) is the count of samples that contain the term \(t\) but are not in class \(c\), \(M\) is the count of samples that belong to class \(c\) but do not contain the term \(t\), and \(N\) is the count of samples from other classes without the term \(t\).
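In practice, this scoring can be sketched with scikit-learn's `chi2` function, which expects non-negative feature values; the count-valued features below are illustrative stand-ins:

```python
# Chi-squared scoring of non-negative features against a class label.
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(400, 3)).astype(float)
y = (X[:, 2] > 5).astype(int)       # class depends on feature 2 only

scores, p_values = chi2(X, y)
best = int(np.argmax(scores))
print(best)                         # feature 2 scores highest
```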

Recursive feature elimination

The recursive approach to feature elimination [35] was originally a gene selection technique that required a classifier for the selection process. According to the weight assigned to each feature by the classifier, this method selects features by recursively considering smaller and smaller feature sets until one last feature remains. To obtain the ranking of each feature, the method was configured to only remove one feature per iteration of the recursive process.
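The configuration described above, removing one feature per iteration to obtain a full ranking, can be sketched as follows; the synthetic dataset and the decision tree estimator are illustrative choices:

```python
# Recursive feature elimination with step=1: one feature is removed
# per iteration until a single feature remains, yielding a complete
# ranking (rank 1 = most relevant).
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=42)
rfe = RFE(DecisionTreeClassifier(random_state=42),
          n_features_to_select=1, step=1)
rfe.fit(X, y)
print(rfe.ranking_)   # a permutation of 1..6, one rank per feature
```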

Mean absolute deviation

The concept of mean absolute deviation is used as a scaling parameter within the Laplace distribution, providing a direct measure of the dispersion inherent in a feature. This method is commonly employed as an alternative for the standard deviation and can be reliably used to identify the most relevant features of a dataset [36]. It is mathematically defined as

$$MAD= \frac{1}{n} {\sum }_{i=1}^{n}|{X}_{i}-\overline{X }|$$

where \(n\) is the number of samples, \({X}_{i}\) is the value of sample \(i\), and \(\overline{X }\) represents the mean value of all the samples.
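The formula above translates directly into a column-wise computation; the feature matrix below, with deliberately different spreads per column, is an illustrative stand-in:

```python
# Mean absolute deviation per feature, used to rank features by dispersion.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, [0.5, 2.0, 1.0], size=(1000, 3))  # different spreads

mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)     # MAD per column
ranking = np.argsort(mad)[::-1]     # most dispersed feature first
print(ranking[0])                   # feature 1 (largest spread) ranks first
```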

Dispersion ratio

The dispersion ratio of a feature is defined as the square root of the ratio of two components. The numerator represents the dispersion of the relative importance of a feature between the different classes, and the denominator represents the dispersion in the importance of that feature across the entire dataset [37]. This method obtains the relevance of a feature by calculating the ratio of the arithmetic mean to the geometric mean of its values.
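The arithmetic-to-geometric mean ratio can be sketched as below; note that the geometric mean requires strictly positive values, so the small offset added here is an implementation assumption, not part of the method's definition:

```python
# Dispersion ratio sketch: arithmetic mean over geometric mean per
# feature. A larger ratio indicates a more dispersed (more relevant)
# feature; the ratio is always >= 1 by the AM-GM inequality.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((500, 3)) + 1e-9     # keep values strictly positive
X[:, 1] *= rng.choice([0.01, 1.0], size=500)  # feature 1 is more dispersed

am = X.mean(axis=0)
gm = np.exp(np.log(X).mean(axis=0))  # geometric mean via the log-mean
dispersion_ratio = am / gm
print(int(np.argmax(dispersion_ratio)))  # feature 1
```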

3.2 Models and benchmark methodology

The robustness analysis methodology introduced in [38] was followed to ensure an unbiased benchmark. It includes both a regular training process and an adversarial training process, which is a well-established adversarial defense strategy. In the former, the original training set of a certain dataset is used to train, fine-tune, and validate an ML model. In the latter, data augmentation is performed by creating simple perturbations in the original training set, resulting in an adversarial training set that contains both original data samples and slightly perturbed data samples.

Afterwards, the considered methodology establishes a performance evaluation in both normal conditions and during a direct attack to the models. In the former, the models perform predictions of the data samples in the regular holdout set of a certain dataset, and several standard evaluation metrics are computed. In the latter, a full adversarial evasion attack is performed against each model, with specialized perturbations to deceive that specific model. Since different models are susceptible to different perturbations, the attacks result in model-specific adversarial holdout sets. In the case of network intrusion detection, these attacks are targeted, attempting to cause misclassifications from the malicious class to the target benign class.

The adversarial examples were generated using the adaptative perturbation pattern method (A2PM) [39]. It relies on pattern sequences that learn the characteristics of each class and create constrained data perturbations, according to the provided information about the feature set, which corresponds to a gray-box setting. The patterns record the value intervals of different feature subsets, which are then used to ensure that the perturbations take the correlations of the features into account, generating realistic adversarial examples. Therefore, when applied to network intrusion detection, the patterns iteratively optimize the perturbations that are performed on each feature of a network traffic flow according to the constraints of a computer network.

For adversarial training, a simple function provided by A2PM was used to create a perturbation in each malicious sample of a training set, performing data augmentation. Hence, a model was able to learn not only from a sample, but also from a simple variation of it. Starting from a regular training set with 70% of a dataset, another set of the same size can be obtained, with a perturbation in each sample.
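The augmentation idea above can be sketched generically: each malicious training sample is paired with one slightly perturbed copy, doubling the exposure of the model to malicious variations. The small Gaussian noise below is only a stand-in for A2PM's constrained, pattern-based perturbations, which are not reproduced here:

```python
# Illustrative adversarial-training augmentation: one perturbed copy
# per malicious sample is appended to the training set.
import numpy as np

def augment_malicious(X_train, y_train, scale=0.01, seed=42):
    """Return the training set plus a perturbed copy of each malicious sample."""
    rng = np.random.default_rng(seed)
    mal = X_train[y_train == 1]
    perturbed = mal + rng.normal(0.0, scale, size=mal.shape)  # noise stand-in
    X_aug = np.vstack([X_train, perturbed])
    y_aug = np.concatenate([y_train,
                            np.ones(len(perturbed), dtype=y_train.dtype)])
    return X_aug, y_aug

X = np.random.default_rng(0).random((10, 4))
y = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 0])
X_aug, y_aug = augment_malicious(X, y)
print(X_aug.shape)   # (14, 4): 10 originals + 4 perturbed malicious samples
```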

To perform adversarial evasion attacks specific to each model, the full adversarial attack created as many data perturbations as necessary in a holdout set until every malicious sample was misclassified as benign or a total of 15 attack iterations were performed. No more iterations were allowed because a high number of requests to a specific server would increase the risk of the anomalous behavior being noticed by the security practitioners overseeing the networking infrastructure of an enterprise network [40]. Starting from a regular holdout set with 30% of a dataset, several other sets of the same size can be obtained, with specialized data perturbations for a specific ML model.

Due to their well-established performance in network intrusion detection, four types of ML models were considered: RF, XGB, LGBM, and EBM. The optimal configuration for each model and each dataset was obtained via a grid search of well-established hyperparameter combinations, and the best ones were determined through a fivefold cross-validation. The F1-score, which consolidates precision and recall and is suitable for imbalanced data, was selected as the validation metric. After the fine-tuning process, each model was retrained with a complete training set to be ready for the benchmark with the regular and adversarial holdout sets. The models are described below.
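The fine-tuning procedure can be sketched with scikit-learn's grid search; the hyperparameter grid below is illustrative, not the grid used in the study:

```python
# Grid search with fivefold cross-validation on the F1-score,
# mirroring the fine-tuning procedure described above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=42)
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    scoring="f1", cv=5)
grid.fit(X, y)
print(grid.best_params_)   # best combination by cross-validated F1-score
```

After the search, `grid.best_estimator_` is refit on the full training data, matching the retraining step mentioned above.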

Random forest

RF [41] is a supervised ensemble created through bagging and using the Gini impurity criterion to calculate the best node splits. Each individual tree performs a prediction according to a feature subset, and the most common vote is chosen. RF is based on the concept that the collective decisions of many trees will be better than the decisions of just one. Table 1 summarizes the fine-tuned configuration.

Table 1 Summary of RF configuration

Extreme gradient boosting

XGB [42] performs gradient boosting using a supervised ensemble with a level-wise growth strategy. The nodes within each tree are split level by level, using the Histogram method to compute fast histogram-based approximations and seeking to minimize the cross-entropy loss function during its training. Table 2 summarizes the fine-tuned configuration.

Table 2 Summary of XGB configuration

Light gradient boosting machine

LGBM [43] also uses a supervised ensemble to perform gradient boosting. The nodes are split using a leaf-wise strategy for a best-first approach, performing the split with the highest loss reduction. LGBM uses gradient-based one-side sampling (GOSS) to build the decision trees, which is computationally lighter than previous methods and therefore provides a faster training process. Table 3 summarizes the fine-tuned configuration.

Table 3 Summary of LGBM configuration

Explainable boosting machine

EBM [44] is a generalized additive model that performs cyclic gradient boosting with a tree ensemble. Unlike the other three black-box models, EBM is a glass-box model that remains explainable and interpretable during the inference phase [45]. Each feature contributes to a prediction in an additive manner that enables their individual contribution to be measured and explained. Table 4 summarizes the fine-tuned configuration.

Table 4 Summary of EBM configuration

4 Feature selection

This section presents the obtained feature sets and highlights the most relevant features. Since the relevance values provided by the different feature selection methods were in different orders of magnitude, they were normalized to the range of zero to one. Therefore, the output of each method was re-scaled so it could be used as a percentage. A minimum relevance threshold of 1% was utilized in a consensus process with all the methods, to only select features that contribute to the detection and disregard those that most methods consider to be negligible. In addition to 24 time-related features common to all four datasets, this consensus process resulted in 26 features specifically for CICIDS2017 and NewCICIDS and 22 features for HIKARI21 and NewHIKARI.
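This normalization and thresholding step can be sketched as follows; the score matrix is illustrative, and the majority-vote consensus rule is an assumption about how "most methods" is operationalized:

```python
# Consensus feature selection sketch: per-method scores are re-scaled
# to percentages, and a feature is kept when a majority of methods
# rate it at or above the 1% relevance threshold.
import numpy as np

scores = np.array([            # rows: methods, columns: features
    [5.0, 0.2, 3.0, 0.01],
    [9.0, 0.1, 6.0, 0.05],
    [2.0, 0.3, 1.5, 0.02],
])
pct = scores / scores.sum(axis=1, keepdims=True)   # normalize per method
votes = (pct >= 0.01).sum(axis=0)                  # methods above 1%
selected = np.where(votes > scores.shape[0] / 2)[0]
print(selected)   # indices of the features kept by a majority of methods
```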

The first feature set, containing 24 features that are equivalent in all four datasets, considered 7 main time-related characteristics of network traffic flows. In the considered datasets, the forward part of a flow corresponds to a client machine that opens a connection with the server, sending network packets. Likewise, the backward part corresponds to the packets sent by the server back to the client within that connection. The full connection will be classified as either a benign flow that is part of the normal operation of the network or a malicious flow in which the client sent ill-intentioned packets. Regarding the IAT keyword, it corresponds to the inter-arrival time, the elapsed time between the arrival of two subsequent network packets within a flow. Table 5 provides an overview of the selected time-related characteristics.

Table 5 Selected time-related characteristics

Regarding the specific feature set for CICIDS2017, from all the considered features, a total of 26 exceeded the 1% threshold. Features related to the idle time appeared in the top 14, collectively representing a significant 24% of the total relevance and providing a distinct view of the dataset. Furthermore, the characteristics related to the IAT showed a notable relevance, each surpassing the 1% threshold and contributing nearly 26%. The maximum and mean idle time also revealed their major roles as the two most relevant features, together representing 16% of the total relevance. Table 6 provides the ranking for CICIDS2017, with Fwd corresponding to forward, Bwd to backward, Max to maximum, Min to minimum, and Std to standard deviation.

Table 6 Feature ranking for CICIDS2017

Regarding NewCICIDS, among the 26 features that went beyond the 1% threshold, there were several interesting results. Apart from the minimum IAT, all the features related to IAT exhibited a significance of more than 1%, adding up to a combined score of over 37%. All the features related to idle time received significant consideration from most methods, ranking in the top 10 and accounting for 20% of the total relevance. The features associated with the duration and the IAT between origin and destination, and vice versa, were found to be highly important, occupying the top 3 positions with a combined relevance of over 16%. Table 7 provides the ranking for NewCICIDS.

Table 7 Feature ranking for NewCICIDS

Regarding the specific feature set for HIKARI21, less than a third of all the considered features achieved a calculated relevance of at least 1%. From the final 22 selected characteristics, several notable features stood out. Most of the characteristics related to IAT exceeded the designated threshold, together contributing 23% of the total relevance. The features related to idle time showed a higher relevance, all of them ranking in the top 12 and some within the top 5; together, these characteristics accounted for more than 26% of the total relevance. The backward bulk rate achieved the first place, demonstrating an importance one and a half times that of the feature in second place and three times that of the feature in third place. The combined relevance score of the top 3 features reached nearly 31%. Table 8 provides the ranking for HIKARI21.

Table 8 Feature ranking for HIKARI21

Regarding NewHIKARI, 22 features obtained relevance values of more than 1%, making up less than a third of those initially evaluated. Of this final selection, some characteristics turned out to be notable, such as all the idle time features, which ranked in the top 6, apart from the standard deviation of the active time, which fell below the threshold. Together, these features accounted for 27% of the total relevance. Two thirds of the IAT-related features met the 1% threshold, collectively contributing 25%. The highest-ranked characteristic, the total idle time, was nearly twice as important as the second most significant feature of the dataset, while the features ranking from 2 to 6 showed similar values, between 5 and 6%. Table 9 provides the ranking for NewHIKARI.

Table 9 Feature ranking for NewHIKARI

5 Results and discussion

This section presents and discusses the results obtained by evaluating the performance of the ML models created with the two different sets, as well as with regular and adversarial training. The evaluation considers the regular holdout set of the CICIDS2017, NewCICIDS, HIKARI21, and NewHIKARI datasets and the model-specific adversarial holdout sets created by the adversarial attacks.

The benchmark considered standard evaluation metrics for binary network traffic classification. The ACC, PRC, RCL, F1S, and FPR columns of the following tables correspond to accuracy, precision, recall, F1-score, and false positive rate. The optimal result would be 100% for all metrics except the false positive rate, which should be as close to 0% as possible. Additionally, the results achieved by the adversarially trained models on the adversarial holdout sets are highlighted in bold in all tables.
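The metrics listed above can be computed as sketched below; the false positive rate is derived from the confusion matrix, since scikit-learn does not expose it as a standalone scorer (the label vectors are illustrative):

```python
# Standard binary classification metrics for the benchmark tables:
# accuracy, precision, recall, F1-score, and false positive rate.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]   # 0 = benign, 1 = malicious
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)
prc = precision_score(y_true, y_pred)
rcl = recall_score(y_true, y_pred)
f1s = f1_score(y_true, y_pred)
fpr = fp / (fp + tn)                 # false alarms among benign flows
print(acc, prc, rcl, f1s, fpr)
```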

5.1 CICIDS2017

The models trained with the CICIDS2017 dataset obtained very good results across all the evaluation metrics. The time-related feature set enabled all four models to detect the anomalous behavior of most malicious flows, distinguishing cyber-attacks from benign activity and reaching F1-scores over 89%. Nonetheless, when adversarial attacks were performed against these models, their precision and recall exhibited significant declines that resulted in F1-scores lower than 0.1% after the attack iterations were complete. This failure to detect adversarial examples suggests that ensembles of decision trees are inherently vulnerable to modifications of the time-related characteristics of network traffic flows.

On the other hand, the models created through adversarial training had substantially lower declines, preserving their precision above 97%. Even though the recall of EBM was only approximately 60% when attacked, RF, XGB, and LGBM all retained a higher recall above 73%. Hence, by training with a simple perturbation per malicious sample, the robustness of the models was improved, and most malicious flows could not evade detection. Regarding benign flows, it is important to note that the false positive rates were decreased to below 0.40%, which indicates that deploying these models in a real computer network could lead to fewer false alarms. Table 10 provides the results of the models trained with the time-related feature set.

Table 10 Obtained results for CICIDS2017 with time-related features

The feature set obtained through the combination of multiple feature selection methods led to substantial improvements in all four models. Their accuracy was approximately 3% higher, and their improved precision and recall led to F1-scores 8% higher. This difference was even higher when the models were attacked, which led to 12% higher F1-scores in RF, XGB, and LGBM and even 23% higher in EBM. Since the benchmark was performed in the same conditions and with fine-tuned models, these results demonstrate the impact that different feature sets can have on the robustness of ML models for a cyber-attack detection task.

It is pertinent to highlight that the false positive rate of the regularly trained models was lowered to less than half. Nonetheless, it could not be further reduced with adversarial training, remaining higher than with the previous time-related features. This suggests that, despite the benefits of the more specific features, they may lead to a greater number of false alarms, which would cause an organization to spend resources and time on unnecessary mitigation measures. Table 11 provides the results of the models trained with the more specific feature set.

Table 11 Obtained results for CICIDS2017 with feature selection

5.2 NewCICIDS

The corrected version of CICIDS2017 exhibited better results than those of the original dataset, using the time-related features. Training with the corrected network traffic flows of NewCICIDS led all four models to achieve F1-scores higher than 99% on the regular holdout set, and their false positive rates did not exceed 0.20%. Although the performance of the models decreased significantly on the model-specific adversarial holdout sets, the decline was slightly smaller than that observed for the original dataset. It is important to note the better robustness of the regularly trained EBM, which retained a precision of over 31% throughout the adversarial evasion attack just by training with the corrected flows of NewCICIDS.

As before, performing adversarial training led to a great improvement in the robustness of the models. This defense strategy enabled the detection of most adversarial cyber-attack examples, reducing the number of misclassifications that would be harmful for an enterprise. Even though the recall of the adversarially trained RF and XGB was slightly decreased, they preserved their precision of 99.88% and 99.90%, without this metric being decreased by the attack. Since only very few of the perturbed malicious flows were misclassified, the functionality of those cyber-attacks would be prevented in a real enterprise communication network. Furthermore, RF and XGB achieved the best false positive rates, 0.06% and 0.05%, respectively, which indicates that the corrected dataset also led to models with better generalization. Table 12 provides the results of the time-related feature set, highlighting the good precision retained by the regularly trained EBM.

Table 12 Obtained results for NewCICIDS with time-related features

In NewCICIDS, the more specific features also led to some improvements in the results of XGB, LGBM, and EBM, across all the evaluation metrics considered in this benchmark. Both the regularly trained and the adversarially trained models exhibited a seemingly better robustness against the perturbations. The exception was RF, which obtained a final accuracy of 96.91% and a lower F1-score of 95.48% when attacked, even after the improvements of adversarial training. Despite still being a very high score, it was lower than the value obtained with the time-related features.

Nonetheless, the almost optimal results obtained with both feature sets of this dataset raise some doubt about the realism of the data samples it contains. The performed corrections significantly improve the results of most ML models, but such near-perfect scores may simply reflect a lack of data diversity in NewCICIDS. Therefore, this dataset should be used with caution, because an ML model must be trained with malicious flows that truly represent the cyber-attacks targeting the real computer network of a modern organization. Table 13 provides the results of the models trained with the more specific feature set.

Table 13 Obtained results for NewCICIDS with feature selection

5.3 HIKARI21

The HIKARI21 dataset, with more up-to-date network traffic flows, enabled the ML models to reach an accuracy similar to that obtained on CICIDS2017 with the time-related features, approximately 93%. However, this metric does not express the impact of the class imbalance between benign and malicious traffic flows. Even though the models correctly classified most benign flows, a very low recall was obtained because the malicious flows could not be detected. This led to F1-scores under 33% on the regular holdout sets, and scores under 1% when RF, XGB, and LGBM were attacked. Despite EBM being able to retain an F1-score of 3.18%, these results demonstrate that using only the time-related features was not an adequate approach for this dataset.
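The gap between accuracy and recall under class imbalance can be made concrete with hypothetical counts: a detector that flags almost nothing as malicious still scores roughly 93% accuracy when 93% of the flows are benign, while its recall and F1-score collapse:

```python
# Why ~93% accuracy can hide a failing detector under class imbalance.
# Hypothetical counts (not from the benchmark): 93 benign flows for every
# 7 malicious ones, and a model that flags almost nothing as malicious.
tn, fp = 9290, 10    # benign flows: almost all classified correctly
fn, tp = 680, 20     # malicious flows: almost none detected

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 0.931 -- looks acceptable
recall = tp / (tp + fn)                      # ~0.029 -- attacks are missed
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)   # well under 0.33
```

This is why the discussion relies on recall, precision, and F1-score rather than accuracy alone for HIKARI21.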

Even when adversarial training was performed, none of the four ML models was able to detect the adversarial examples, which resulted in a very large number of misclassifications and very low F1-scores. These scores are substantially lower than those obtained in the previous datasets, suggesting that the greater complexity of the more recent cyber-attacks makes it more difficult to distinguish them from benign flows that are part of the normal operation of an enterprise computer network. Table 14 provides the results of the models trained with the time-related feature set.

Table 14 Obtained results for HIKARI21 with time-related features

The feature set obtained through the combination of multiple feature selection methods did not provide substantial improvements. The results were similar to those of the time-related features in both evaluations, with regular malicious flows and with adversarial examples of those flows. This demonstrates that even when the best possible features were selected for this dataset, neither RF, XGB, LGBM, nor EBM could obtain better results than when only the time-related characteristics of the flows were considered.

Furthermore, adversarial training is not guaranteed to help ML models achieve an adversarially robust generalization. When the models cannot achieve good results even on the regular holdout set, adversarial training may not be the best approach, so other adversarial defense strategies should be explored. Table 15 provides the results of the models trained with the more specific feature set.

Table 15 Obtained results for HIKARI21 with feature selection

5.4 NewHIKARI

The models trained with the improved version of the HIKARI21 dataset, which contains more cyber-attack variations, obtained significantly better results. Their F1-scores were between 83% and 84%, which can be valuable for network traffic analysis. The adversarial evasion attack caused the recall and precision of all four models to decrease, but XGB was able to retain a precision of over 58% and EBM of over 98%. Since their false positive rates were near 0.01%, these results indicate that more than half of the adversarial examples were detected and very few benign flows were mistakenly predicted as malicious, which is important for an enterprise-scale computer network.

Despite also having equivalent false positive rates, the adversarially trained models could not retain high robustness. When attacked, the F1-scores of RF, EBM, XGB, and LGBM decreased to 82%, 63%, 62%, and 27%, respectively. Even though RF remained close to its original score of 83%, the more complex tree ensembles were much more vulnerable to the perturbations created by an adversarial attack. Since these models generally achieve better results, it is pertinent to always evaluate their performance for different feature sets, assessing whether they exhibit a good generalization to regular traffic and a good robustness to adversarially perturbed traffic. Table 16 provides the results of the time-related feature set, highlighting the precision of the regularly trained XGB and EBM.

Table 16 Obtained results for NewHIKARI with time-related features

In NewHIKARI, training the models with the more specific features also improved all the evaluation metrics. The adversarially trained XGB and EBM retained F1-scores of 84.76% and 83.99% when attacked, which are significantly higher than with the previous time-related features. LGBM stands out for reaching the highest F1-score when attacked, a value of 84.96%, which contrasts with the lowest score of 27% obtained before. Therefore, by using an improved dataset with more data diversity, selecting a more specific feature set, and combining it with adversarial training, the ML models were able to achieve a more robust generalization.

Since the false positive rates of all four ML models also remained very low, the employed feature selection process provided a good balance between true positives and false positives. This balance prevents the disruptions caused by common cyber-attacks and their adversarial examples, while also minimizing the number of false alarms and unnecessary mitigation measures that would be costly for an organization. Table 17 provides the results of the models trained with the more specific feature set.

Table 17 Obtained results for NewHIKARI with feature selection

6 Conclusions

This work presented a feature selection process that combined multiple methods, information gain, chi-squared test, recursive feature elimination, mean absolute deviation, and dispersion ratio, and applied them to multiple network intrusion detection datasets, the original CICIDS2017 dataset, a corrected version of it designated as NewCICIDS, the original HIKARI21 dataset, and an improved version of it designated as NewHIKARI. Several types of ML models, RF, XGB, LGBM, and EBM, were trained with two different feature sets, one with only time-related characteristics and another with more specifically selected relevant features. An adversarial robustness benchmark was performed, analyzing the reliability of the different feature sets and their impact on the susceptibility of the models to adversarial examples.

The employed feature selection process effectively identified the features that contained pertinent information to distinguish between benign and malicious network traffic flows, while discarding the features that did not have enough relevance. For each dataset, the time-related feature set was compared with the more specifically selected features, to guide AI engineers and security researchers toward the most adequate characteristics of network traffic for training their ML models. This reduced number of features can improve computational efficiency, as opposed to using all the available features, and can provide better robustness, as the models become less overfit to very specific characteristics that would not usually be found in a real computer network.
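One plausible way to combine the five feature selection methods into a single feature set (the exact combination rule used in this work is not restated here, so this is an assumption) is majority voting over each method's top-k features. The feature names below are illustrative flow characteristics, not the actual selected set:

```python
# Hypothetical combination rule for several feature selection methods:
# keep a feature only if a majority of the methods rank it in their top-k.
# Feature names and rankings are illustrative assumptions.
from collections import Counter

def combine_selections(per_method_topk: list, min_votes: int) -> list:
    """Keep features selected by at least `min_votes` of the methods."""
    votes = Counter(f for topk in per_method_topk for f in set(topk))
    return sorted(f for f, v in votes.items() if v >= min_votes)

rankings = [
    ["flow_iat_mean", "idle_mean", "active_mean"],  # information gain
    ["flow_iat_mean", "idle_mean", "pkt_len_var"],  # chi-squared test
    ["flow_iat_mean", "active_mean", "dst_port"],   # recursive feature elimination
    ["idle_mean", "flow_iat_mean", "pkt_len_var"],  # mean absolute deviation
    ["flow_iat_mean", "idle_mean", "active_mean"],  # dispersion ratio
]
selected = combine_selections(rankings, min_votes=3)
```

Requiring agreement across methods with different selection criteria (information-theoretic, statistical, model-based, and dispersion-based) is what filters out features that only appear relevant under a single criterion.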

The adversarial robustness benchmark demonstrated that the first feature set with only time-related characteristics like IAT, idle time, and active time of a flow can provide very good results. Even though the highest overall F1-scores were achieved in combination with the other more specific features, that second feature set sometimes obtained higher false positive rates, which would cause an organization to spend resources and time on unnecessary mitigation measures. Therefore, before deployment, it is pertinent to always evaluate the performance of an ML model for different feature sets, assessing if it exhibits a good generalization to regular traffic and a good robustness to adversarially perturbed traffic and ensuring a good balance in the trade-off between true positives and false positives.

The best results across several evaluation metrics were achieved in the corrected version of the CICIDS2017 dataset. However, when facing the more recent network traffic flows of the HIKARI21 dataset, the ML models were less robust and exhibited numerous misclassifications. These results suggest that the greater complexity of the more recent cyber-attacks makes it more difficult to distinguish those malicious flows and their corresponding adversarial examples from benign flows that are part of the normal operation of an enterprise computer network. In such cases, adversarial training may not be the best approach, so other defense strategies should be explored.

Overall, by using an improved dataset with more data diversity, selecting the best time-related characteristics and a more specific feature set, and combining it with adversarial training, the ML models were able to achieve a better adversarially robust generalization in the cybersecurity domain. The robustness of the benchmarked models was significantly improved without their generalization to regular traffic flows being affected and without requiring too many computational resources, which enables a reliable detection of suspicious activity and adversarially perturbed traffic flows in enterprise computer networks without costly increases of false alarms.

In the future, it could be valuable to explore the intrinsic explainability capabilities of EBM and experiment with post hoc explainability methods, to enable a better understanding of the relevance of each feature for each class and of the reasoning behind each misclassification. To further contribute to adversarial ML research, it is important to benchmark the adversarial robustness of these tree ensembles for multi-class classification and compare them with other types of ML models, including deep learning models. It is also pertinent to explore novel approaches to feature selection that address the relevance of each feature for the robustness of the models, enabling the standardization of the best overall features for cyber-attack classification.
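The intrinsic explainability of EBM mentioned above comes from its additive structure: the model's score is a sum of learned per-feature shape functions, so each feature's contribution to an individual prediction can be read off directly. The sketch below uses hypothetical hand-written shape functions and features as stand-ins for the learned ones; it illustrates the mechanism, not the interpret library itself:

```python
# Minimal sketch of why an EBM is intrinsically explainable: its score is a
# sum of per-feature functions, so each feature's additive contribution to a
# single prediction is directly available. The shape functions and features
# below are hypothetical stand-ins for the learned ones.
import math

shape_functions = {
    "flow_iat_mean": lambda x: 0.8 if x > 100.0 else -0.4,
    "idle_mean":     lambda x: 0.5 if x > 50.0 else -0.2,
}
intercept = -0.3

def explain_prediction(flow: dict) -> dict:
    """Per-feature additive contributions and the malicious probability."""
    contribs = {name: fn(flow[name]) for name, fn in shape_functions.items()}
    score = intercept + sum(contribs.values())
    prob = 1.0 / (1.0 + math.exp(-score))  # logistic link on the total score
    return {"contributions": contribs, "malicious_probability": prob}

result = explain_prediction({"flow_iat_mean": 150.0, "idle_mean": 10.0})
```

For a misclassified flow, inspecting `contributions` would show which features pushed the score toward the wrong class, which is the kind of per-feature reasoning the proposed future work would investigate.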