1 Introduction

Network intrusion remains one of the unresolved challenges facing computer network systems. It has attracted scholars’ attention because of the growing threat it poses to the sustainability of businesses [1]. An intrusion is any kind of action that can compromise the confidentiality, security, or availability of a network. It is also a security infringement that includes attacks from outside or within the company [2]. Intrusion Detection Systems (IDS) are used to identify duplication, alteration, and destruction of information systems. An unauthorized user or malicious software can cause these disruptions through the computer network. The two main classes of IDS are misuse detection and anomaly detection. Anomaly detection picks out patterns that deviate from the usual pattern. Misuse detection, in contrast, recognizes intrusions that have well-known patterns by matching them against the signatures already stored in the database [3].

However, the number of new attacks and the complexity of networks are still increasing. It has therefore become very important to develop a reliable security mechanism that can monitor, analyze, and protect the operations of network systems [4]. Different solutions are needed to design and implement more efficient intrusion detection systems. As a result, statistical and machine learning techniques have been employed to predict network intrusions [5].

Even though various machine learning methods have been proposed, IDS still suffers from a high False Positive Rate (FPR), a high False Negative Rate (FNR), and high computation cost. These drawbacks prevent IDS from fulfilling its purpose, since it is expected to be both fast and accurate. The main reason behind these problems may be the large number of features, which complicates feature extraction and selection. In a high-dimensional feature set, some features may be redundant and others irrelevant [6]. Such features may slow down computation and degrade the accuracy and general performance of Network Intrusion Detection Systems (NIDS) [7]. For these reasons, feature selection methods have been employed as a pre-processing phase to acquire the subset of relevant features needed to establish an efficient NIDS [8].

Primarily, feature selection aims to choose the most essential and optimal set of features to improve the overall performance of NIDS [9]. It also helps reduce the computational cost of the classifier in identifying hidden traffic and simplifies the comprehension of data processing. In most cases, the classification accuracy obtained with a reduced feature subset is equal to or better than that obtained with the complete feature set. However, in certain cases, feature selection may eliminate features that are highly predictive of small areas of the instance space. When this happens, feature selection may degrade the performance of machine learning [10].

When a single feature selection method is applied, useful candidate features may be eliminated. These features may be good predictors for classification when another method is applied. Each feature selection method has its own evaluation criteria for selecting features. Thus, a feature selected by one algorithm may be irrelevant or inadequately relevant for another method and may not be selected. Consequently, different algorithms select dissimilar features from the same dataset, which can result in low prediction accuracy. Therefore, appropriate measures are needed to tackle the above problems and support the current technological development. We believe that there should be a method that gives the useful candidate features a chance to be retained.

This study’s motivation is to propose a novel NIDS framework that can retain useful candidate features by introducing diversity during feature selection. The diversity of features may increase the regularity of the selection process while improving prediction performance. Hence, in this work, we propose a Heterogeneous Ensemble Feature Selection (HEFS) method. The HEFS method may help the NIDS framework in selecting relevant features while improving the attack detection performance. The proposed method may take advantage of the strengths of the individual selectors and overcome their weaknesses.

The logic of the ensemble approach is to develop a robust method by integrating multiple models and acquiring better performance instead of using only a single model. The control of the variance and the diversity of the techniques make this method successful in solving a specific problem. The approaches of ensemble feature selection methods are classified into homogeneous and heterogeneous. The homogeneous approach uses the same base feature selection method and different sample data. However, the same dataset and different base feature selection methods are used in the heterogeneous approach [11]. To be used as base selectors, feature selection algorithms can be classified into three categories based on their structure. These are wrappers, filters, or embedded algorithms. Wrappers measure the usefulness of features based on the classifier’s performance. They are very computationally intensive. Filters measure the relevance of the features based on univariate statistics instead of cross-validation. They require fewer computational resources than wrappers. Finally, embedded methods perform feature selection as part of the model construction process and can be viewed as intermediate positions between wrappers and filters. They are computationally more costly than filters but less than wrappers [12].

In this research, we use an ensemble of filters rather than the typical single method. The proposed HEFS method uses filter methods as base selectors rather than wrapper and embedded algorithms. Filters may help the HEFS method avoid the computational cost from the wrapper and embedded feature selection algorithms.

The proposed HEFS method fuses the output feature subsets of five filter feature selection methods. These methods include Probabilistic Significance Attribute Evaluator (SignifAtrEval), Symmetrical Uncertainty Attribute Evaluator (SymUnAtrEval), Gain Ratio Attribute Evaluator (GainRatioAtrEval), Classifier Attribute Evaluator (ClassfrAtrEval), and ReliefF Attribute Evaluator (ReliefAtrEval). The HEFS method uses a union combination method to obtain an ensemble features subset in NIDS. The proposed method further makes the merit-based evaluation for the obtained ensemble features subset to select the final optimal features.

The choice between ensemble feature selection and other feature fusion methods depends on the problem domain and the established assumptions of each technique. However, feature-level fusion has some disadvantages, including the need to design a new matcher and to acquire a larger number of training samples [13]. For example, Principal Component Analysis (PCA), a popular feature fusion method, lacks interpretability: each principal component is a linear combination of the original features, which is hard to interpret, especially when a large number of features are involved [14].

In the domain of NIDS, even though several feature selection approaches exist, the possibilities of eliminating useful candidate features during a single feature selection and its impact on degrading prediction performance have not been sufficiently studied. Our work is novel in that we apply the union combination method to combine different filter methods through a heterogeneous approach. This heterogeneous ensemble helps our approach in retaining useful candidate features. To the best of our knowledge, no such approaches apply merit-based evaluation for the obtained ensemble feature subset to avoid internal redundancy and acquire the essential features. There are also no existing approaches in NIDS that measure the method’s robustness by testing the diversity and stability of the selected optimal features.

The contributions of this study in the domain of network intrusion detection system are:

  1. The study proposes a heterogeneous ensemble feature selection method rather than an ensemble classifier to retain useful candidate features. The proposed approach combines relatively weaker feature selectors individually using the union combination method to make them stronger as a group.

  2. To avoid internal redundancy of the obtained ensemble features subset, HEFS method makes further merit-based evaluation by using feature-feature and feature-class correlation to select the final optimal feature subset.

  3. We conduct extensive experiments to assess the effectiveness of our proposed approach, including its robustness (diversity and stability test) and statistical significance test. The experimental results show that HEFS significantly improves prediction performance on a multi-class intrusion dataset compared to several other techniques.

The remaining parts of the paper are organized as follows. Section 2 summarizes related works about feature selection in the domain of network intrusion detection systems. In Sect. 3, the details of the proposed HEFS method are presented. In Sect. 4, the experiments and their results are presented and analyzed. In Sect. 5, we conclude our work and outline future work.

2 Related Works

A considerable number of studies have been published on feature selection methods for NIDS. Some of them are reviewed here to highlight their methods and contributions.

Thaseen et al. [15] designed an intrusion detection method by applying chi-square feature selection along with a Multi-class Support Vector Machine (MSVM). The method was proposed to minimize processing time while maximizing the classification accuracy for individual network attacks. To optimize the parameter of the radial basis function kernel, a variance tuning technique was adopted. Their model achieved a high detection rate and a low false alarm rate in comparison with other approaches such as Principal Component Analysis (PCA), Genetic Algorithm (GA), MSVM, and single-SVM.

Kumar et al. [16] proposed gain ratio feature selection technique with an updated version of Naive Bayes classifier to improve the accuracy of intrusion detection. In this study, the proposed method has been compared with the best first correlation feature selection and ranker information gain techniques using the existing classifiers, i.e., Naive Bayes, J48, and REPTree. The result of their experiment reported that the proposed approach achieved better classification performance.

Pham et al. [17] proposed an IDS framework by applying gain ratio feature selection technique with ensemble bagging model that used J48 as the base classifier. In this study, gain ratio feature selection helped J48 algorithm overcome the bias towards features that have a large number of values. Their approach has been compared with the ensemble of K-Nearest Neighbors (KNN) and correlation feature selection with Naive Bayes. Their experimental results showed that the proposed framework improved classification accuracy and decreased false alarm rate.

Shahbaz et al. [18] studied a feature selection approach based on the dependency measures of Correlation-based Feature Selection (CFS) and Symmetrical Uncertainty (SU) for intrusion detection systems. In this study, CFS was used to keep relevant features while avoiding redundancy. Then, SU was applied to remove the features that are uncorrelated with other features. The proposed method was tested using random forest, J48, PART, and C4.5 classifiers. The reported experimental results showed that the proposed approach achieved better classification accuracy than the CFS, information gain, gain ratio, and Chi-square techniques.

Le et al. [19] proposed a hybrid Sequence Forward Selection (SFS) algorithm and Decision Tree (DT) model to improve accuracy as well as reduce the false alarm rate (FAR). In their work, various Recurrent Neural Networks (RNN), namely traditional RNN, Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU), were built. Their experimental results, based on NSL-KDD and ISCX, showed that the proposed models achieved significantly improved prediction performance. Additionally, measurements with memory profilers indicated that their approach could reduce the computation time.

Taher et al. [20] applied a wrapper feature selection method with an Artificial Neural Network (ANN) and a Support Vector Machine (SVM) to classify the network traffic. Their evaluation based on NSL-KDD reported that the ANN-based machine learning model outperformed the SVM model. Furthermore, their proposed model was more efficient than the other existing models in terms of detection rate.

Yeshalem et al. [21] proposed a Bootstrap-based Homogeneous Ensemble Feature Selection (BHmEFS) method to select a subset of relevant and non-redundant features that improved classification accuracy. In their approach, three sample data have been generated from the original dataset during the bootstrapping process. They applied the Chi-square method to select the essential features subsets from each of the three acquired sample data. The intersection method was used as a combination method in a homogeneous approach that fused the three output subsets to obtain the ensemble features subset. In their experiments, they evaluated the performance of the proposed method by comparing it with the Chi-square method in each of the three bootstrap samples and the original dataset. The experimental results in a multi-class NSL-KDD dataset reported that the BHmEFS achieved better classification accuracy when compared with the Chi-square and other methods.

Bostani et al. [22] presented a hybrid feature selection method using a Binary Gravitational Search Algorithm (BGSA) and Mutual Information (MI) to improve the standard BGSA’s effectiveness as a feature selection algorithm. To perform a global search, BGSA was used as a wrapper-based feature selection method. Furthermore, BGSA was integrated with the MI approach to select the relevant features by computing the feature–feature and the feature–class mutual information. Their approach helped in choosing the essential features with the least redundancy to the target class. To control the search direction of the standard BGSA, they defined the two-objective function as a fitness function to maximize the detection rate and minimize the false-positive rate. Their experimental results on the NSL-KDD dataset showed that the proposed method achieved higher accuracy and detection rate than standard wrapper-based and filter-based feature selection methods.

Zhang et al. [23] proposed a two-level network intrusion detection model based on the combination of the ReliefF algorithm and borderline Synthetic Minority Oversampling Technique (SMOTE). In their proposed model, the ReliefF algorithm was first used to select features. And then, the borderline SMOTE was used to oversample the misclassified minority class samples. The three base classifiers, KNN, C4.5, and NB, were combined in pairs and tested by tenfold cross-validation. Their experimental results, using the NSL-KDD data set, showed that their approach could perform well on imbalanced network intrusion detection data set. It also significantly improved the detection accuracy of minority samples.

However, the above studies did not consider the impact of applying single feature selection in degrading prediction performance due to the possibility of eliminating useful candidate features. To the best of our knowledge, the topic of retaining useful candidate features has not been sufficiently addressed. Therefore, the HEFS approach is proposed to improve the prediction performance by holding onto good candidate features in NIDS.

3 Methodology

In this section, we explain the details of our proposed method. The type of ensemble approach we use, the particular base feature selection methods included, and the threshold method we use are explained. Furthermore, the combination method and the merit-based evaluation technique that we apply are explained. The scheme of our proposed method is as shown in Fig. 1 and its conceptual framework is shown in Fig. 2. The high-level description is also shown in Pseudo-code 1.

Fig. 1 The scheme of the proposed framework

Fig. 2 The framework of the proposed method

3.1 The Proposed Ensemble-Based Feature Selection Method

In this research, we propose a Heterogeneous Ensemble Feature Selection (HEFS) approach. A heterogeneous approach is used, because our base selectors are different and use the same dataset. When designing the ensemble feature selection method, there are issues to consider and techniques to be used. These include the number and types of individual feature selection methods, a threshold method, and a combination method. The threshold method is applied if the base feature selection methods are rankers [24].

In the proposed HEFS method, the output feature subsets of five different feature selection methods are combined using a union method to obtain an ensemble subset. The HEFS method uses merit-based evaluation to avoid internal redundancy in the obtained ensemble features subset and to acquire the final optimal features. The merit of a feature subset is obtained by computing the correlations among its features and between the features and the class.

3.2 Individual Feature Selection Methods

Since we use a heterogeneous ensemble feature selection approach, five different individual filter methods are combined to obtain an ensemble features subset. The filter methods included in our framework are the probabilistic significance attribute evaluator, Gain Ratio (GR), ReliefF, Symmetrical Uncertainty (SU), and the classifier attribute evaluator. Each filter method scores the features and ranks them according to their scores.

The methods employed in the ensemble should guarantee diversity while increasing the regularity of the feature selection process. Their variety may have the advantage of boosting performance. Using more than one feature selection method inevitably has a computational cost. When applied individually, filter methods have faster computation time but lower prediction performance than wrappers and embedded algorithms. Therefore, filter methods are preferred for inclusion in an ensemble.

The five feature selection methods (i.e., base selectors) are rankers, which means that they do not select a subset of features, but sort all the features. These rankers are chosen, because: (1) their selection metrics are different and so ensure good diversity in the final ensemble, (2) they are widely used by feature selection researchers, (3) individually, they are not capable of detecting redundant features, and (4) they need to have a correctly balanced dataset (i.e., they do not consider the imbalance factor of the dataset) [30].

3.3 Threshold Method

When filter methods select features through ranking, they first sort all the features. To acquire relevant features’ subset through the specific selection methods, setting a threshold is needed. For this research, we use \([\textrm{log}_{2}k +1]\) to decide the threshold [31], where k is the number of features in the dataset. Thus, each ranker method selects the top \([\textrm{log}_{2}k +1]\) features individually.
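The threshold rule can be illustrated with a minimal sketch. It assumes that the square brackets denote the integer part (floor); the function names `ranker_threshold` and `select_top` are hypothetical and are not part of the original framework.

```python
import math


def ranker_threshold(k: int) -> int:
    """Number of top-ranked features to keep, [log2(k) + 1].

    Assumption: the brackets are read as the floor (integer part).
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    return math.floor(math.log2(k) + 1)


def select_top(ranking: list, k: int) -> list:
    """Keep the first T entries of a ranking that is sorted best-first."""
    return ranking[:ranker_threshold(k)]


# With the 41 features of NSL-KDD, each ranker keeps its top 6 features
# under the floor reading of the brackets.
print(ranker_threshold(41))
```

Under a ceiling reading of the brackets the value for k = 41 would be 7 instead, so the rounding convention should be checked against [31] before reuse.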

3.4 Union Combination Method

As a combination method, we use the union of the feature subsets obtained from each base selector to acquire a single ensemble feature subset. We apply this combination method to retain the useful candidate features that can improve prediction performance. The union method effectively includes the features selected by at least one of the base selectors into the acquired ensemble features subset.

For feature subsets \(S_{1}\) and \(S_{2}\), the union of these two sets consists of all the features that are contained in either \(S_{1}\) or \(S_{2}\) or both. More formally, the union of the feature subsets is stated as

$$\begin{aligned} f \in (S_{1} \cup S_{2}) \Leftrightarrow f \in S_{1} \vee f \in S_{2}, \end{aligned}$$
(1)

where f is the feature in the dataset.

Thus, the general union of two feature subsets can be defined as

$$\begin{aligned} \cup \left\{ S_{1} , S_{2} \right\} = S_{1} \cup S_{2}. \end{aligned}$$
(2)

Based on Eq. 2, for N number of feature subsets

$$\begin{aligned}&\cup \left\{ S_{1} , S_{2} , \ldots , S_{N} \right\} \\&\quad = S_{1} \cup S_{2} \cup \cdots \cup S_{N}. \end{aligned}$$
(3)

Therefore, we can combine the individual feature subsets \(S_{1}\), \(S_{2}\), ..., \(S_{N}\) and obtain the ensemble feature subset E as

$$\begin{aligned} E = S_{1} \cup S_{2} \cup \cdots \cup S_{N}. \end{aligned}$$
(4)
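As a small illustration of Eq. 4, the union of N feature subsets can be computed directly with set operations. The function name `union_ensemble` and the example subsets are hypothetical; the feature names are only placeholders for what the base selectors might return.

```python
def union_ensemble(subsets):
    """Return E = S_1 ∪ S_2 ∪ ... ∪ S_N as a set of feature names."""
    return set().union(*subsets)


# Illustrative subsets from three base selectors (made-up selections).
subsets = [
    {"src_bytes", "flag", "service"},
    {"src_bytes", "dst_bytes"},
    {"flag", "count"},
]
E = union_ensemble(subsets)

# A feature enters E if at least one base selector chose it.
print(sorted(E))  # ['count', 'dst_bytes', 'flag', 'service', 'src_bytes']
```

Features chosen by several selectors, such as `src_bytes` here, appear only once in E, which is why the union can still contain redundant information that the subsequent evaluation has to remove.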

3.5 Merit-Based Evaluation of Ensemble Feature Subset

The main target of the union combination method is only to choose unique features from different subsets. However, this combination method does not consider the internal redundancy or irrelevancy of features in terms of prediction information.

The merit-based evaluation avoids redundancy and retains the relevant attributes to obtain a highly relevant subset of uncorrelated features. As a result, the dataset’s dimensionality can be drastically reduced, and the performance of learning algorithms can be improved. The merit-based evaluation uses a correlation-based heuristic to evaluate the worth of a subset of attributes by considering each feature’s predictive ability and the degree of redundancy between them. That is, the merit function considers the usefulness of individual features for predicting the class label (i.e., feature-class correlation) and the level of inter-correlation among them (i.e., feature-feature correlation). It rewards subsets that contain attributes highly correlated with the dependent variable or class label and penalizes subsets whose attributes are highly correlated with each other. A higher merit score represents a better subset [32, 33].

During the merit-based evaluation of the ensemble feature subset, Evolutionary Search Algorithm (ESA) is used to select the relevant attributes and rank them according to their merits. ESA’s advantage over others is that it allows the best solution to emerge from the best of prior solutions. It improves the selection of features over time. ESA creates new and more fitted individuals by combining the generation of the different solutions and extracting the best genes (features) from each one [34].

The evaluation is performed by computing feature–feature and feature–class correlation. If the feature and the class have a higher correlation, then the evaluation selects this feature from the obtained ensemble subset. It further rapidly discards irrelevant, redundant, and noisy features. The merit or goodness of an ensemble feature subset E consisting of n number of features is given as [35]

$$\begin{aligned} \textrm{Merit}_{En} = \frac{n \overline{r_\mathrm{{fc}}}}{\sqrt{n + n (n - 1 ) \overline{r_\mathrm{{ff}}}}}, \end{aligned}$$
(5)

where \(\overline{r_\mathrm{{fc}}}\) is the mean value of all feature–class correlations (\(f \in E\), c is a target class), and \(\overline{r_\mathrm{{ff}}}\) is the mean value of all feature–feature correlations. The \(\textrm{Merit}_{En}\) is the heuristic function that controls irrelevant features because of their relatively weak prediction. Equation 5 is Pearson’s correlation, since all variables have been standardized. The numerator indicates how predictive a group of attributes is; the denominator indicates how much redundancy there is. Since the irrelevant features will be poor class predictors, they are handled by the heuristic. The highly correlated features are discriminated against, because they are redundant attributes. Therefore, the merit function will have larger values for subsets that have features with strong feature–class correlation and weak feature–feature correlation. However, even if a set of features has a strong feature–class correlation, the merit value will be degraded if there is a strong feature–feature correlation.
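The behavior of Eq. 5 can be checked numerically with a short sketch. The function name `merit` and the input values are illustrative assumptions; only the formula itself is taken from Eq. 5.

```python
import math


def merit(n: int, mean_r_fc: float, mean_r_ff: float) -> float:
    """Eq. 5: n * mean(r_fc) / sqrt(n + n * (n - 1) * mean(r_ff))."""
    return (n * mean_r_fc) / math.sqrt(n + n * (n - 1) * mean_r_ff)


# Same feature-class correlation, increasing feature-feature correlation
# (hypothetical values for a subset of four features).
for mean_r_ff in (0.0, 0.25, 1.0):
    print(mean_r_ff, round(merit(4, 0.5, mean_r_ff), 4))
```

With the feature-class correlation held at 0.5, the merit drops from 1.0 for uncorrelated features to 0.5 for fully inter-correlated features, which is the penalty for redundancy described above.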

The criterion for selecting, from an ensemble subset E, the subset of features that has low redundancy and is strongly predictive of the class can be defined as follows:

$$\begin{aligned} F&= \max (\textrm{Merit}_ \mathrm{{En}}) \nonumber \\&\quad \left[ \frac{r_{f_ {1} c} + r_{f_ {2} c} + \cdots + r_{f_{n} c}}{\sqrt{n + 2 \left( r_{f_{1} f_{2}} + \cdots + r_{f_{i} f_{j}} + \cdots + r_{f_{n} f_{n-1}} \right) }} \right] , \end{aligned}$$
(6)

where F is the relevant features subset selected from an ensemble subset E, \(r_{{f_{i}c}}\) is the correlation value of the \(i{\textrm{th}}\) feature with class c, and \(r_{f_{i}f_{j}}\) is the correlation value of the ith and jth features. Finally, the relevant, non-redundant, and optimal features are obtained and used for classification.

The merit-based evaluation method treats continuous and discrete features differently. For continuous class data, the obvious measure for estimating the correlation between features is as shown in Eq. 5, which is a standard linear (Pearson’s) correlation. When the two features involved are both continuous, it is straightforward, and their correlation is given as [35]

$$\begin{aligned} r_{XY}=\frac{\sum x y}{n \sigma _{x} \sigma _{y}}, \end{aligned}$$
(7)

where \(r_{XY}\) is the correlation value, n is the number of paired observations of X and Y, \(\sigma _{x}\) and \(\sigma _{y}\) are the standard deviations, X and Y are two continuous features expressed in terms of deviations from their means, and x and y are the values of the features (variables) X and Y, respectively.
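Equation 7 can be written out as a minimal sketch for two continuous features. It assumes that n counts the paired observations, that the standard deviations are population standard deviations (dividing by n), and that a constant feature is given a correlation of 0; the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Eq. 7: r_XY = sum(x * y) / (n * sigma_x * sigma_y), x and y as deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]  # deviations of X from its mean
    dy = [y - mean_y for y in ys]  # deviations of Y from its mean
    sigma_x = math.sqrt(sum(d * d for d in dx) / n)
    sigma_y = math.sqrt(sum(d * d for d in dy) / n)
    if sigma_x == 0 or sigma_y == 0:
        return 0.0  # assumption: a constant feature carries no correlation
    return sum(a * b for a, b in zip(dx, dy)) / (n * sigma_x * sigma_y)


# A perfectly linear pair and its reversed counterpart.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```

The two calls return values at the extremes of the range, approximately 1 and -1, up to floating-point rounding.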

A weighted Pearson’s correlation is applied when one feature is continuous, and the other is discrete, as shown in Eq. 8. For discrete feature X and a continuous feature Y, if X has q values, then q binary features are correlated with Y. The binary features take the value 1 for each of the \(i = 1,\ldots ,q\) when the \(i{\textrm{th}}\) value of X occurs and 0 for all other values. Each of the calculated correlations \(i = 1,\ldots ,q\) is weighted by the prior probability that X takes value i

$$\begin{aligned} r_{XY}=\sum _{i=1}^{q} p\left( X=x_{i}\right) r_{X_{b i} Y}, \end{aligned}$$
(8)

where \(r_{XY}\) is correlation values and \(X_{bi}\) is a binary attribute that takes value 1 when X has value \(x_{i}\) and 0 otherwise.

When both features are discrete, binary features are created for both, and all weighted correlations are calculated for all combinations if Y has l values, as shown below

$$\begin{aligned} r_{X Y}=\sum _{i=1}^{q} \sum _{j=1}^{l} p\left( X=x_{i}, Y=y_{j}\right) r_{X_{b i} Y_{b j}}. \end{aligned}$$
(9)

During this approach to calculating correlations, merit-based evaluation replaces any missing values with the mean for continuous attributes and the most common value for discrete features.

Pseudo-code 1 High-level description of the proposed HEFS method

3.6 Classification

We evaluate the HEFS method with random forest, J48, random tree, and REP tree classifiers. We explain the Random Forest (RF) classification technique here, since it achieves better prediction performance, as shown in Sect. 4. After selecting the final relevant features using HEFS, the ensemble features subset is used as input for the RF classifier. Then, the RF produces the classification result of the five target classes within the pre-processed dataset. The five target classes include the normal traffic data and the four attacks (i.e., DoS, probe, U2R, and R2L).

A combination of decision trees forms the RF classifier. When constructing a decision tree, RF improves the classification performance of a single tree classifier by combining the bootstrap aggregating method and randomization in selecting data nodes. A decision tree with M leaves divides the feature space into M regions, \(1\le {m}\le M\). The prediction function f(x), for each tree, is defined as [36]

$$\begin{aligned} f(x)=\sum _{m=1}^{M} c_{m} \Pi \left( x, R_{m}\right) , \end{aligned}$$
(10)

where M is the number of regions in the feature space, \(R_{m}\) is the region corresponding to leaf m, \(c_{m}\) is the constant associated with region m, and \(\Pi\) is the indicator function

$$\begin{aligned} \Pi \left( x, R_{m}\right) = {\left\{ \begin{array}{ll} 1, &{}\quad {\text {if}} \,x \in R_{m} \\ 0, &{}\quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
(11)

The final classification is determined by the majority vote of all trees.
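Equations 10 and 11 and the majority vote can be sketched with toy trees. The sketch assumes a single numeric feature, regions given as half-open intervals, and constants \(c_{m}\) used as integer class codes; the thresholds, class names, and function names are hypothetical and do not come from a trained model.

```python
from collections import Counter


def indicator(x, region):
    """Eq. 11: 1 if x lies in region R_m (here an interval [low, high)), else 0."""
    low, high = region
    return 1 if low <= x < high else 0


def tree_predict(x, regions, constants):
    """Eq. 10: f(x) = sum over m of c_m * indicator(x, R_m)."""
    return sum(c * indicator(x, region) for region, c in zip(regions, constants))


def forest_predict(x, trees, class_names):
    """Majority vote over the class predicted by each tree."""
    votes = [class_names[tree_predict(x, regions, constants)]
             for regions, constants in trees]
    return Counter(votes).most_common(1)[0][0]


INF = float("inf")
class_names = ["normal", "DoS"]  # class codes 0 and 1
trees = [  # three toy trees, each with M = 2 leaves
    ([(0, 100), (100, INF)], [0, 1]),
    ([(0, 50), (50, INF)], [0, 1]),
    ([(0, 500), (500, INF)], [0, 1]),
]
print(forest_predict(200, trees, class_names))  # DoS (two of three votes)
print(forest_predict(75, trees, class_names))   # normal (two of three votes)
```

Because the regions of one tree do not overlap, exactly one indicator is non-zero for any input, so the sum in Eq. 10 simply returns the constant of the leaf that contains x.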

The framework of the proposed method which is shown in Fig. 2 is described as follows:

  1. Start
  2. Load the NSL-KDD dataset as an input
  3. Apply the different base feature selection methods (base selectors) separately to the input dataset
  4. Generate the ranking of features for each base selector according to their scores
  5. Find the threshold T for each ranking to decide the number of features to be selected
  6. Select the top T features from the ranking of each base selector
  7. Obtain the subsets \(S_{1}\), \(S_{2}\),..., \(S_{N}\) that contain the selected features \(f_{1}\), \(f_{2}\),..., \(f_{T}\) from each ranking
  8. Combine the feature subsets obtained in step (7) using the union combination method
  9. Obtain the ensemble features subset E by including the features selected by at least one base selector
  10. Compute the merit or goodness of the obtained ensemble features subset E with n number of features to evaluate the relevance and avoid the internal redundancy
  11. If the merit of the feature is maximum in E, include the feature in the final optimal features subset F. Otherwise, discard the feature
  12. Obtain the final optimal features subset F
  13. Build the machine learning classification model using the obtained features subset F
  14. Obtain the prediction performance of the classification model
  15. End
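Steps 3 to 12 above can be sketched end to end on toy data. This is only an illustration under stated assumptions: the scores and correlations are made-up numbers, the brackets of the threshold are read as a floor, unlisted feature pairs are treated as uncorrelated, and a greedy forward search on the merit stands in for the evolutionary search used in the paper. All function and variable names are hypothetical.

```python
import math
from itertools import combinations


def threshold(k):
    return math.floor(math.log2(k) + 1)  # assumption: brackets mean floor


def top_features(scores, t):
    """Steps 4-7: rank features by descending score and keep the top t."""
    return set(sorted(scores, key=scores.get, reverse=True)[:t])


def subset_merit(subset, r_fc, r_ff):
    """Eq. 5 for a candidate subset; unlisted pairs count as uncorrelated."""
    n = len(subset)
    mean_fc = sum(r_fc[f] for f in subset) / n
    pairs = list(combinations(sorted(subset), 2))
    mean_ff = sum(r_ff.get(p, 0.0) for p in pairs) / len(pairs) if pairs else 0.0
    return n * mean_fc / math.sqrt(n + n * (n - 1) * mean_ff)


def greedy_merit_search(ensemble, r_fc, r_ff):
    """Steps 10-12 with forward selection (a stand-in for evolutionary search)."""
    selected, best, remaining = [], 0.0, set(ensemble)
    while remaining:
        cand = max(sorted(remaining),
                   key=lambda f: subset_merit(selected + [f], r_fc, r_ff))
        score = subset_merit(selected + [cand], r_fc, r_ff)
        if score <= best:  # stop when no feature improves the merit
            break
        selected.append(cand)
        best = score
        remaining.remove(cand)
    return selected, best


def hefs(selector_scores, r_fc, r_ff):
    k = len(next(iter(selector_scores.values())))  # features in the dataset
    t = threshold(k)
    subsets = [top_features(s, t) for s in selector_scores.values()]
    ensemble = set().union(*subsets)  # steps 8-9: union combination
    return greedy_merit_search(ensemble, r_fc, r_ff)


# Toy scores of two base selectors over eight features (made-up values).
selector_scores = {
    "ranker_a": {"f1": .9, "f2": .8, "f3": .7, "f4": .6,
                 "f5": .1, "f6": .1, "f7": .1, "f8": .1},
    "ranker_b": {"f1": .9, "f2": .2, "f3": .8, "f4": .1,
                 "f5": .7, "f6": .6, "f7": .1, "f8": .1},
}
r_fc = {"f1": .8, "f2": .7, "f3": .6, "f4": .2, "f5": .5, "f6": .1}
r_ff = {("f1", "f2"): 0.9}  # f1 and f2 are strongly redundant
selected, best = hefs(selector_scores, r_fc, r_ff)
print(selected, round(best, 4))  # ['f1', 'f3', 'f5'] 1.097
```

In this toy run, f2 survives the union because one ranker scored it highly, but it is left out of the final subset because of its strong correlation with f1, which mirrors the role of the merit-based evaluation after the union step.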

Pseudo-code 1 shows our proposed approach’s high-level description (text-based algorithmic detail). In this Pseudo-code, we use the formula of feature–feature (inter-correlation of features) and feature–class correlation coefficient only when both features are continuous to avoid complexity in lines 31 and 32. To calculate the correlation coefficient for different types of features (i.e., discrete and continuous) and both discrete features, Eqs. 8 and 9 can be used, respectively.

3.7 Deployment of HEFS-Based NIDS

Even though Machine Learning (ML)-based network intrusion detection systems can take the network traffic (input) to predict a network behavior, these systems still cannot be directly implemented in real-time network environments. The ML-based models do not have packet sniffers that capture the network traffic in real time. A network sniffer (packet sniffer) is a traffic monitoring and analysis tool that can identify (sniff out) the data flowing over the network in real time. For achieving real-time detection, the developed ML-based network intrusion detection systems have to work with packet sniffers, such as Spark, Bro, and Snort [37].

When our proposed HEFS-based NIDS is deployed, the packet sniffer captures the network traffic, which is then pre-processed by applying HEFS. The processed data are used as input to the RF classifier for the prediction decision. Finally, the classifier predicts one of the five target classes, i.e., normal traffic or one of the four attacks (DoS, probe, R2L, and U2R). Based on the resulting report, the network administrators decide on the required network intrusion prevention mechanisms. The deployment of the HEFS-based NIDS is shown in Fig. 3.

Fig. 3 The scheme of HEFS-based NIDS deployment

4 Experiments, Analysis of Results, Discussion, and Implications

4.1 Benchmark Dataset

We conduct our experiment on the Network Security Laboratory-Knowledge Discovery and Data Mining (NSL-KDD) dataset to evaluate our proposed method. Given the lack of public data sets for network-based IDSs, the NSL-KDD dataset is an effective benchmark that helps researchers compare different intrusion detection methods. The data set comprises 41 features and five classes: normal traffic and four classes of attacks (intrusions), namely Denial of Service (DoS), probe, User-to-Root (U2R), and Remote-to-Local (R2L) [38, 39]. The list of features is shown in Table 1 with their corresponding type [40].

Table 1 List of features in the NSL-KDD dataset

4.2 Experimental Framework

To test our proposed method’s prediction performance, we partition the data using tenfold cross-validation. We then build models with four learners, viz., random forest, J48, random tree, and REP tree. We also carry out a statistical significance test to evaluate the significance of the HEFS method’s improved performance.
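As a rough illustration of the tenfold partitioning, the sketch below generates train/test index splits in plain Python. The function name `k_fold_indices` and the fixed seed are assumptions for illustration; the split is not stratified by class, which may differ from what the experiment tools do internally.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(25, k=10):
    # Each record appears in exactly one test fold across the ten rounds.
    assert len(train) + len(test) == 25
```

Each of the four learners would be trained on `train` and scored on `test` in every round, and the ten scores averaged.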

To test the robustness, we measure the diversity and stability of our proposed ensemble feature selection method. After conducting an extensive experiment, we also compare our proposed method’s performance with other methods previously proposed by other researchers.

RapidMiner Studio Professional 9.6 [41], WEKA version 3.8 [42], KEEL (Knowledge Extraction based on Evolutionary Learning) software tool [43], OriginPro 2018 [44], and IBM SPSS Statistic 23 [45] are adopted to create and examine the empirical results.

4.3 Evaluation Metrics

4.3.1 Prediction Performance Test

Performance in our experiment is evaluated using the following metrics: Detection Rate (DR), Accuracy (Acc), Precision (Pr), Recall (Rc), F-Measure (F1), and the Receiver Operating Characteristic (ROC) curve. Since the classification is multi-class, the formulas are given for each class i, the number of classes L, and the micro-average \(\mu\), as follows [46, 47]. The notations used in the formulas are described in Table 2.

$$\begin{aligned}&(\hbox {DR})_{\textrm{i}} = \frac{(\textrm{TP})_{i}}{(\textrm{TP})_{i} + (\textrm{FN})_{i}} \end{aligned}$$
(12)
$$\begin{aligned}&({\textrm{FPR}})_{\textrm{i}} = \frac{(\textrm{FP})_{i}}{(\textrm{TN})_{i} + (\textrm{FP})_{i}} \end{aligned}$$
(13)
$$\begin{aligned}&(\hbox {Pr})_{\textrm{i}} = \frac{(\textrm{TP})_{i}}{(\textrm{TP})_{i} + (\textrm{FP})_{i}} \end{aligned}$$
(14)
$$\begin{aligned}&(\hbox {Rc})_{\textrm{i}} = \frac{(\textrm{TP})_{i}}{(\textrm{TP})_{i} + (\textrm{FN})_{i}} \end{aligned}$$
(15)
$$\begin{aligned}&(\textrm{Acc})_{\textrm{avg}} = \frac{\sum _{i = 1} ^ {L} \frac{(\textrm{TP})_{i} + (\textrm{TN})_{i}}{(\textrm{TP})_{i} + (\textrm{TN})_{i} + (\textrm{FP})_{i} + (\textrm{FN})_{i}}}{L} \end{aligned}$$
(16)
$$\begin{aligned}&(\textrm{Pr})_{\mu }= \frac{\sum _{i = 1} ^ {L} (\textrm{TP})_{i}}{\sum _{i = 1} ^ {L} \left( (\textrm{TP})_{i} + (\textrm{FP})_{i} \right) } \end{aligned}$$
(17)
$$\begin{aligned}&(\textrm{DR})_{\mu }= \frac{\sum _{i = 1} ^ {L} (\textrm{TP})_{i}}{\sum _{i = 1} ^ {L} \left( (\textrm{TP})_{i} + (\textrm{FN})_{i} \right) } \end{aligned}$$
(18)
$$\begin{aligned}&(\textrm{Rc})_{\mu } = \frac{\sum _{i = 1} ^ {L} (\textrm{TP})_{i}}{\sum _{i = 1} ^ {L} \left( (\textrm{TP})_{i} + (\textrm{FN})_{i} \right) } \end{aligned}$$
(19)
$$\begin{aligned}&(F1)_{\mu } = \frac{2 * {(\textrm{Pr})}_{\mu } * {({\rm Rc})}_{\mu }}{{(\textrm{Pr})}_{\mu } + {({\rm Rc})}_{\mu }}. \end{aligned}$$
(20)
Table 2 The description of notations
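The following sketch shows one way Eqs. 12 to 20 could be computed from a multi-class confusion matrix. The function names, the dictionary keys, and the small three-class matrix are hypothetical and serve only as a worked example; undefined ratios are returned as 0.0, which is a convention chosen here.

```python
from typing import Dict, List, Tuple


def per_class_counts(cm: List[List[int]]) -> List[Tuple[int, int, int, int]]:
    """cm[a][p] counts samples of actual class a predicted as class p.

    Returns one (TP, FP, FN, TN) tuple per class (one-vs-rest).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    counts = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[a][i] for a in range(n)) - tp
        tn = total - tp - fn - fp
        counts.append((tp, fp, fn, tn))
    return counts


def ratio(num: float, den: float) -> float:
    return num / den if den else 0.0


def evaluate(cm: List[List[int]]) -> Dict[str, object]:
    counts = per_class_counts(cm)
    L = len(counts)
    sum_tp = sum(tp for tp, _, _, _ in counts)
    sum_fp = sum(fp for _, fp, _, _ in counts)
    sum_fn = sum(fn for _, _, fn, _ in counts)
    pr_mu = ratio(sum_tp, sum_tp + sum_fp)   # Eq. 17
    rc_mu = ratio(sum_tp, sum_tp + sum_fn)   # Eqs. 18 and 19 (DR equals Rc)
    return {
        "dr": [ratio(tp, tp + fn) for tp, fp, fn, tn in counts],    # Eqs. 12, 15
        "fpr": [ratio(fp, tn + fp) for tp, fp, fn, tn in counts],   # Eq. 13
        "pr": [ratio(tp, tp + fp) for tp, fp, fn, tn in counts],    # Eq. 14
        "acc_avg": sum(ratio(tp + tn, tp + tn + fp + fn)
                       for tp, fp, fn, tn in counts) / L,           # Eq. 16
        "pr_micro": pr_mu,
        "rc_micro": rc_mu,
        "f1_micro": ratio(2 * pr_mu * rc_mu, pr_mu + rc_mu),        # Eq. 20
    }


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
cm = [[5, 1, 0],
      [2, 3, 0],
      [0, 1, 8]]
result = evaluate(cm)
print(round(result["acc_avg"], 3))   # 0.867
print(round(result["f1_micro"], 3))  # 0.8
```

When every sample receives exactly one label, each misclassification counts once as a false positive and once as a false negative, so the micro-averaged precision and recall coincide, as they do in this example.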

4.3.2 Statistical Significance Test

To test whether the performance improvement of the HEFS method is statistically significant, we use the widely used Wilcoxon signed-rank test [48, 49]. The HEFS method is tested against each of the eight other feature selection techniques (the five base selectors and the three additional methods included for comparison) as well as against the original data.

The Wilcoxon signed-rank test is a nonparametric statistical test that ranks the differences in performance between two feature selection methods on the dataset. The ranking ignores the signs of the differences, and the rank sums of the positive and the negative differences are then compared. Let \(d_{i}\) be the difference between the performance scores of the two feature selection methods on the \(i{\rm th}\) classification model. The differences are ranked according to their absolute values, and average ranks are assigned in case of ties. Let \(R^{+}\) be the sum of ranks on which the second algorithm outperformed the first, and \(R^{-}\) the sum of ranks for the opposite; then

$$\begin{aligned} R^{+}=\sum _{d_{i}>0} \textrm{rank}\left( d_{i}\right) +\frac{1}{2} \sum _{d_{i}=0} \textrm{rank}\left( d_{i}\right) \end{aligned}$$
(21)

and

$$\begin{aligned} R^{-}=\sum _{d_{i}<0} \textrm{rank}\left( d_{i}\right) +\frac{1}{2} \sum _{d_{i}=0} \textrm{rank}\left( d_{i}\right) . \end{aligned}$$
(22)

Ranks of \(d_{i} = 0\) are split evenly among the sums; if there is an odd number of them, one is ignored. The Wilcoxon test statistic \(\tau\) is the smaller of the two sums, \(\tau = \textrm{min}(R^{+},R^{-})\), and it is evaluated at a significance level of \(\alpha = 0.05\).
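A small worked example may make Eqs. 21 and 22 concrete. The sketch below computes \(R^{+}\), \(R^{-}\), and \(\tau\) for two hypothetical score lists; the function names and the scores are assumptions, and the scores are given as integer percentages so that ties compare exactly. It applies the half-rank split of the equations as written and does not drop a zero difference when their count is odd, nor does it look up a P value.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values in ascending order (1-based); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of ranks pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def wilcoxon_statistic(
    first: Sequence[float], second: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (R+, R-, tau), where d_i = second_i - first_i."""
    d = [b - a for a, b in zip(first, second)]
    ranks = average_ranks([abs(x) for x in d])
    zero_half = 0.5 * sum(r for x, r in zip(d, ranks) if x == 0)
    r_plus = sum(r for x, r in zip(d, ranks) if x > 0) + zero_half
    r_minus = sum(r for x, r in zip(d, ranks) if x < 0) + zero_half
    return r_plus, r_minus, min(r_plus, r_minus)


# Hypothetical scores (in percent) of two methods on four classification models.
first = [90, 85, 80, 70]
second = [95, 85, 90, 65]
print(wilcoxon_statistic(first, second))  # (7.0, 3.0, 3.0)
```

Here the differences are 5, 0, 10, and −5, the absolute values receive ranks 2.5, 1, 4, and 2.5, and the two sums add up to \(n(n+1)/2 = 10\).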

4.3.3 Diversity and Stability Test

In addition to prediction accuracy, diversity and stability are other relevant factors in ensemble feature selection methods. The most widely used metrics to evaluate the diversity and the stability of the obtained optimal feature subset are Spearman’s rank correlation coefficient and Pearson’s correlation coefficient [50,51,52]. Diversity measures the variation of the results obtained from the individual feature selection methods. Ensembles contribute nothing to improving performance when all the individual methods produce the same result. Stability, on the other hand, indicates that the ensemble method is not sensitive to changes in the training set. As a result, the selected features should not be affected when the training set is changed [24, 53].

For training examples described by a vector of features \(f = (f_{1}, f_{2}, \ldots , f_{k})\), a feature selection algorithm produces either a weighting-scoring \(w = (w_{1},w_{2}, \ldots ,w_{k})\) or a ranking \(R = (R_{1}, R_{2}, \ldots , R_{k})\), where k is the number of features. To measure the robustness, we need a measure of association for each of the above representations.

To measure the association between two weightings w and \(w^{\prime }\) produced by a given feature selection algorithm, we use Pearson’s correlation coefficient, defined as follows:

$$\begin{aligned} r\left( w, w^{\prime }\right) =\frac{\sum _{i=1}^{k}\left( w_{i}-\mu _{w}\right) \left( w_{i}^{\prime }-\mu _{w^{\prime }}\right) }{\sqrt{\sum _{i=1}^{k} \left( w_{i}-\mu _{w}\right) ^{2} \sum _{i=1}^{k}\left( w_{i}^{\prime } -\mu _{w^{\prime }}\right) ^{2}}}, \end{aligned}$$
(23)

where \(r(w, w^{\prime })\) is the association score on weightings w and \(w^{\prime }\), \(\mu _{w}\) is the mean of the feature weights w, and \(\mu _{w^{\prime }}\) is the mean of the feature weights \(w^{\prime }\). The values of Pearson’s correlation coefficient r (i.e., the association score) lie in the range [−1, 1]. An r of 0 indicates that there is no association between the two weightings of features. An r greater than 0 indicates a positive association, and an r less than 0 indicates a negative association.
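Eq. 23 translates directly into a few lines of code. The sketch below is a plain-Python illustration; the function name `pearson` and the toy weightings are assumptions, and it raises an error when either weighting is constant because the denominator is then zero.

```python
import math
from typing import Sequence


def pearson(w: Sequence[float], w_prime: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two feature weightings (Eq. 23)."""
    k = len(w)
    mu_w = sum(w) / k
    mu_wp = sum(w_prime) / k
    numerator = sum((a - mu_w) * (b - mu_wp) for a, b in zip(w, w_prime))
    denominator = math.sqrt(
        sum((a - mu_w) ** 2 for a in w) * sum((b - mu_wp) ** 2 for b in w_prime)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant weighting")
    return numerator / denominator


# Hypothetical weightings of three features from two selectors.
print(round(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 6))  # 1.0
print(round(pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]), 6))  # -1.0
```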

To measure the association between two rankings R and \(R^{\prime }\), we use Spearman’s correlation coefficient, and it can be expressed as follows:

$$\begin{aligned} r_{s}\left( R, R^{\prime }\right) =1-6 \sum _{i=1}^{k} \frac{\left( R_{i}-R_{i}^{\prime }\right) ^{2}}{k\left( k^{2}-1\right) }, \end{aligned}$$
(24)

where \(r_{s}(R, R^{\prime })\) is the association score on rankings R and \(R^{\prime }\). \(R_{i}\) and \(R_{i}^{\prime }\) are the ranks of feature i in rankings R and \(R^{\prime }\), respectively. The values of Spearman’s correlation coefficient \(r_{s}\) (i.e., the association score) lie in the range [−1, 1]. An \(r_{s}\) of 1 indicates a perfect association of feature rankings, an \(r_{s}\) of zero indicates no association, and an \(r_{s}\) of −1 indicates a perfect negative association. The closer \(r_{s}\) is to zero, the weaker the association between the rankings of features.
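As a quick check of Eq. 24, the following sketch evaluates it for identical and for fully reversed rankings of five features. The function name `spearman` and the rankings are illustrative assumptions; this form of the formula assumes that neither ranking contains tied ranks.

```python
from typing import Sequence


def spearman(R: Sequence[int], R_prime: Sequence[int]) -> float:
    """Spearman's rank correlation between two rankings without ties (Eq. 24)."""
    k = len(R)
    squared_diff = sum((a - b) ** 2 for a, b in zip(R, R_prime))
    return 1 - 6 * squared_diff / (k * (k ** 2 - 1))


ranking = [1, 2, 3, 4, 5]
print(spearman(ranking, ranking))          # 1.0
print(spearman(ranking, [5, 4, 3, 2, 1]))  # -1.0
```

For the reversed ranking the squared rank differences sum to 40, so \(r_{s} = 1 - 6 \cdot 40 / (5 \cdot 24) = -1\).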

Therefore, the values of r and \(r_{s}\) help in interpreting the robustness of ensemble feature selection methods. Accordingly, the higher the absolute value of r and \(r_{s}\), the higher the selected features’ stability. And the lower their absolute values, the higher the diversity of base selectors.

4.4 Result Analysis

4.4.1 Analysis of Selected Features

Table 3 reports the list of features selected using the five base selectors or rankers: ClassfrAtrEval, GainRatioAtrEval, ReliefAtrEval, SignifAtrEval, and SymUnAtrEval. The table also includes the list of features generated when the five base selectors are combined using the union combination method. Furthermore, Table 3 includes the list of the most relevant and optimum number of features selected using merit-based evaluation (i.e., the final result of the HEFS method).

Table 3 The list of selected features with the corresponding feature selection methods

As shown in Table 3, each of the five base selectors sorts the features by ranking score and then chooses the top six relevant features. The number six follows from the threshold explained in Sect. 3 (i.e., \([\textrm{log}_{2}k +1]\), with \(k = 41\) features). After the selected features from each base selector are combined using the union combination method, the union contains 20 candidate features. When the merit-based evaluation is applied to these 20 features to avoid internal redundancy, it yields an optimum number of features (i.e., the ten most relevant features), which are then used for classification.
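The threshold and the union step can be sketched as follows. The bracket in \([\textrm{log}_{2}k +1]\) is read here as rounding down, which is an assumption that reproduces the six features reported for \(k = 41\); the function names, selector names, and feature names in the toy example are hypothetical.

```python
import math
from typing import Dict, List


def top_n_threshold(k: int) -> int:
    """Number of top-ranked features kept per selector: floor(log2(k) + 1)."""
    return math.floor(math.log2(k) + 1)


def union_of_top_features(rankings: Dict[str, List[str]], n: int) -> List[str]:
    """Union of the top-n features of each selector, in first-seen order.

    rankings maps a selector name to its features sorted from most to
    least relevant.
    """
    union: List[str] = []
    for ranked in rankings.values():
        for feature in ranked[:n]:
            if feature not in union:
                union.append(feature)
    return union


print(top_n_threshold(41))  # 6

# Toy example with two hypothetical selectors and n = 2.
toy_rankings = {
    "selector_a": ["f1", "f2", "f3"],
    "selector_b": ["f2", "f4", "f1"],
}
print(union_of_top_features(toy_rankings, 2))  # ['f1', 'f2', 'f4']
```

Because the selectors partly agree, the union is smaller than the sum of the individual lists, which is consistent with five selectors of six features each producing 20 rather than 30 candidates.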

4.4.2 Analysis of Precision, Recall, and F-Measure

In Table 4, we report the detailed performance results of the proposed HEFS method compared with other feature selection methods in terms of Precision (Pr), Recall, F-measure (F1), and ROC curve. The feature selection methods include the five base feature selection methods that make up the ensemble (i.e., SignifAtrEval, ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval). Three further feature selection methods (i.e., Gini Index (GiniIndex), Standard deviation (SDeviation), and clustering value (CVAtrEval)) are included for comparison to benchmark the proposed HEFS algorithm. The table gives each method’s evaluation results for predicting the five classes: DoS, U2R, R2L, probe, and normal. Random forest, J48, random tree, and REP tree classifiers are used for evaluating the performances.

Table 4 The detailed performance evaluation of HEFS method in terms of Pr, Recall, F1, and ROC for each class with the four classifiers

As shown in Table 4, the HEFS method achieves better average performance than the five base feature selection methods (i.e., SignifAtrEval, ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval) with the four indicated classifiers. The experiment results also show that the HEFS method and the GiniIndex feature selection method achieve equal prediction performance with the random forest and random tree classifiers in terms of Pr, Recall, F1, and ROC. They also achieve equal performance in terms of ROC with J48 and REP tree. Except for GiniIndex (which is not one of the base feature selection methods), the proposed HEFS method outperforms all the other feature selection methods with all classifiers and metrics.

Table 4 also shows that the proposed HEFS method detects all five classes, including the extreme minority classes U2R and R2L, with better prediction performance. Among the base feature selection methods, SignifAtrEval predicts all classes. However, the other base feature selection methods (ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval) have limitations in detecting the minority classes (U2R and R2L). The prediction performance of ReliefAtrEval, GainRatioAtrEval, and SymUnAtrEval in terms of Pr and F1 is 0.000 for the U2R class with most of the classifiers. For the class R2L, the prediction performance of ClassfrAtrEval in terms of Pr and F1 is also 0.000. Unlike the other base feature selection methods, the HEFS method achieves better prediction performance for the U2R and R2L classes than most of the methods included in Table 4. Therefore, the experiment results in Table 4 show that the proposed HEFS method outperforms most of the other methods with random forest, J48, random tree, and REP tree classifiers in the indicated metrics.

Figure 4 shows the results of the performance evaluation for the proposed HEFS method and the five base feature selection methods in terms of precision. Random forest, J48, random tree, and REP tree are used to evaluate their performance, and the results are taken from Table 4. The figure also includes each method’s performance in predicting the five classes: DoS, U2R, R2L, probe, and normal. As shown in Fig. 4, the proposed HEFS method achieves better results than the base feature selection methods using the indicated classifiers and metrics. The HEFS method helps the classifiers detect all attack types (classes) and achieve better precision, while the base selectors detect the attack types only partially. Figure 4 also shows the statistical significance of the HEFS method’s performance against the five base feature selection methods in terms of precision. The performance improvement of the proposed HEFS method against ReliefAtrEval (with the four classifiers), SymUnAtrEval (except with the J48 classifier), and GainRatioAtrEval (with the J48 and REP tree classifiers) is statistically significant (i.e., \(P<0.05\)). However, its performance improvement against SignifAtrEval and ClassfrAtrEval is not statistically significant (i.e., \(P>0.05\)). Therefore, the figure shows that the proposed HEFS method achieves a statistically significant improvement over most of the base feature selection methods in terms of precision.

Fig. 4 The performance of the HEFS and base feature selection methods for detecting DoS, U2R, R2L, probe, and normal classes in terms of precision with their P values

4.4.3 Analysis of Accuracy and Detection Rate

Table 5 reports the performance evaluation results for various feature selection methods in terms of DR and Acc with the four classifiers (i.e., random forest, J48, random tree, and REP tree). The feature selection methods in Table 5 include the proposed HEFS method and the five base feature selection methods that make up the ensemble (SignifAtrEval, ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval). Three further feature selection methods (GiniIndex, SDeviation, and CVAtrEval) are also included for comparison to benchmark the proposed HEFS algorithm. As shown in Table 5, the HEFS method outperforms all the base feature selection methods in terms of DR and Acc with the four classifiers. In this table, the proposed HEFS method achieves performance equal to that of the GiniIndex feature selection method (which is not one of the base feature selection methods) in terms of DR with the random forest classifier. Apart from this tie, the HEFS method outperforms all the other methods in terms of DR and Acc with random forest, J48, random tree, and REP tree.

Table 5 Performance evaluation of HEFS method in terms of DR and Acc with the four classifiers

Table 5 also shows that the HEFS method selects ten features, whereas the other methods select six. The proposed method thereby helps retain good predictors while selecting the optimal number of features. Furthermore, Table 5 shows that the HEFS method performs better with random forest than with the other classifiers.

Figure 5 shows the proposed HEFS method’s performance compared with the five base feature selection methods (i.e., SignifAtrEval, ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval) in terms of accuracy with the four classifiers: random forest, J48, random tree, and REP tree. The experiment results shown in Fig. 5 are taken from Table 5. As shown in this figure, the proposed HEFS method outperforms each of the base feature selection methods using the indicated classifiers. The experiment results in Fig. 5 also show that the combined (ensemble) performance is better than the individual base feature selection methods’ performances in terms of accuracy with all classifiers.

Fig. 5 Performance comparison of the HEFS method with the base selectors in terms of accuracy (Acc)

4.4.4 Analysis of ROC Curve

Figure 6a–f shows the performance of the HEFS method and the five base feature selection methods (i.e., SignifAtrEval, ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval) in terms of ROC curves. The ROC curves are constructed by plotting the True-Positive Rate (TPR) against the False-Positive Rate (FPR) of the random forest classifier, which achieves the best performance, as shown in Tables 4 and 5. Figure 6a shows the ROC curves of all classes (DoS, U2R, R2L, probe, and normal) using the HEFS method. The classes’ ROC curves lie on or very close to the upper left corner of the graph, which shows that the results are very close to perfect classification. As shown in Fig. 6b–f, the other methods’ ROC curves are relatively far from the graph’s upper left corner. These results (i.e., b–f) show that the base feature selection methods, used individually, achieve lower classification accuracy. Therefore, the proposed HEFS method helps the classifier achieve more efficient and accurate prediction than the individual base selectors.
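To illustrate how such a curve is built for one class against the rest, the sketch below sweeps a decision threshold over classifier scores and records the (FPR, TPR) points, then integrates them with the trapezoid rule. The function names, the scores, and the labels are invented for illustration and are not taken from the experiment.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], is_positive: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pairs = sorted(zip(scores, is_positive), key=lambda p: -p[0])
    n_pos = sum(1 for flag in is_positive if flag)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both positive and negative samples are required")
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all samples sharing this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoid-rule area under a curve given as ordered (x, y) points."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical scores for one class (True marks samples of that class).
scores = [0.9, 0.8, 0.7, 0.6]
perfect = roc_points(scores, [True, True, False, False])
mixed = roc_points(scores, [True, False, True, False])
print(area_under_curve(perfect))  # 1.0
print(area_under_curve(mixed))    # 0.75
```

A curve that reaches the upper left corner, as in the first case, has an area of 1.0; the further the curve falls from that corner, the smaller the area.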

Fig. 6 The ROC curves of random forest classifier with the HEFS method and the five base feature selection methods, including classes: DoS, U2R, R2L, probe, and normal

4.4.5 Analysis of Statistical Significance Test

Table 6 shows the results of the Wilcoxon signed-rank test, which compares the performance of the proposed HEFS method with the five base feature selection methods (i.e., SignifAtrEval, ReliefAtrEval, SymUnAtrEval, GainRatioAtrEval, and ClassfrAtrEval). The HEFS method’s performance is also compared with the three feature selection methods (i.e., GiniIndex, SDeviation, and CVAtrEval) that are not part of the base selectors. The test is conducted to confirm that our proposed method’s performance difference is statistically significant in terms of detection rate. It includes the random forest, J48, random tree, and REP tree classifiers with the corresponding feature selection methods. In this table, the P values are significant (i.e., \(P < 0.05\) at a significance level of \(\alpha = 0.05\)) for every method except the GiniIndex feature selection method. The test statistic min(\(R^{+},R^{-}\)) for GiniIndex is larger than for the others, and its P value is greater than the threshold \(\alpha = 0.05\). For all other methods, min(\(R^{+},R^{-}\)) is smaller and their P values are less than the threshold \(\alpha = 0.05\). This statistical significance test shows that our proposed HEFS method outperforms most of the other techniques, as well as the original data. Therefore, the performance improvement achieved by our proposed method is statistically significant against all compared methods except GiniIndex.

Table 6 The result of Wilcoxon statistical significance test with a significance threshold of \(\alpha =0.05\), sum of signed ranks \(R^{+}\) and \(R^{-}\), and P-values

4.4.6 Analysis of Diversity Test

Table 7 demonstrates the diversity test results of the base feature selection methods involved in the ensemble. The diversity is measured using the base selectors’ mutual information-based weight in terms of Pearson’s and Spearman’s correlation coefficient. As shown in Table 7, the values of Pearson’s (r) and Spearman’s (\(r_{s}\)) correlation coefficient, which are in the range [−1, 1], represent the association among base selectors. The value 1 for r and \(r_{s}\) indicates that the ratings are equal. There is less association between the base selectors when the absolute values of r and \(r_{s}\) are far from 1. Less association between methods indicates that there is higher diversity between them. As explained in Sect. 4.3.3, the lower the values of correlation coefficients, the higher the diversity of base selectors, and the higher the correlation values, the lower their diversity.

Table 7 shows that the Pearson’s correlation coefficients for the base selectors are far from 1, with an average of 0.371. Spearman’s correlation coefficients are higher than Pearson’s, with an average of 0.536. Most of the P values of the correlation coefficients are near zero and therefore below the significance threshold of \(\alpha\) = 0.05. As a result, most of the correlation coefficients are statistically significant. Even though Spearman’s average correlation coefficient is higher, the diversity is not much worse. The proposed HEFS method achieves better diversity in terms of Pearson’s correlation coefficient, since its score is less than the average. Therefore, the proposed HEFS method includes base feature selection methods that fulfill the required diversity in their behavior.

Table 7 The result of the diversity test using Pearson’s (r) and Spearman’s (\(r_{s}\)) correlation coefficient with their corresponding P values

4.4.7 Analysis of Stability Test

The experimental results for the measurement of stability using Pearson’s (r) and Spearman’s (\(r_{s}\)) correlation coefficients are shown in Table 8. This table reports the inter-correlation (feature–feature) and feature–class correlation of the selected features. The values of Pearson’s (r) and Spearman’s (\(r_{s}\)) correlation coefficients, which are in the range [−1, 1], represent the association among selected features. The value 1 for r and \(r_{s}\) indicates that the ratings are equal. There is less association between the features when the absolute values of r and \(r_{s}\) are far from 1. Less association between features indicates lower stability. As explained in Sect. 4.3.3, the lower the values of the correlation coefficients, the lower the stability of features, and the higher the correlation values, the higher their stability.

Table 8 The result of the stability test using Pearson’s (r) and Spearman’s (\(r_{s}\)) correlation coefficient

As shown in Table 8, most of Spearman’s correlation coefficients for the selected features are not far from 1, with an average of 0.535. Most of Pearson’s correlation coefficients are lower than Spearman’s, with an average of 0.387. Based on these results, the association among features is stronger in terms of Spearman’s correlation coefficient than Pearson’s. The value of the average Spearman’s correlation coefficient (i.e., 0.535) shows that the selected feature subset’s stability is in the good range. Therefore, the proposed HEFS method achieves good stability in terms of Spearman’s correlation coefficient.

4.4.8 Performance Comparison

Table 9 lists some of the NIDS frameworks suggested by different authors in previous studies and compares the proposed HEFS-based NIDS framework with these state-of-the-art frameworks in terms of accuracy and detection rate. The frameworks are chosen because their problem domain is machine learning-based NIDS. Furthermore, all the compared frameworks have been evaluated on the NSL-KDD dataset, as in our experiment.

Table 9 Performance comparison of the proposed framework with other NIDS frameworks in terms of accuracy and detection rate

As shown in Table 9, our proposed framework outperforms all other frameworks with the indicated performance evaluation metrics. Accuracy and detection rate are chosen as the evaluation metrics because they are used by most researchers. Therefore, the comparison results show that the proposed HEFS method helps the NIDS achieve improved performance compared with the other methods.

4.5 Discussion

4.5.1 HEFS Method in Significantly Improving the Prediction Performance of NIDS

The experimental results shown in Tables 4, 5, and 9 as well as in Figs. 4, 5, and 6 affirm that the proposed HEFS method helps NIDS achieve a statistically significant improvement in prediction performance. This performance improvement is due to the combined effort of heterogeneous feature selection methods, as discussed in [57]. On the other hand, applying a single feature selection method in machine learning can eliminate useful candidate features and degrade prediction performance, as studied in [21].

4.5.2 HEFS Method in Avoiding Redundancy and Retaining Useful Candidate Features

Table 8, based on the selected relevant features in Table 3, shows that most of the values of feature–feature correlations are less than the feature–class correlations. These results confirm that the merit-based evaluation helps NIDS improve performance by avoiding redundancy from the obtained ensemble feature subset as explained in [58].

Therefore, the experiment results explained in Sect. 4.4 indicate that combining different feature selection methods (i.e., methods with the required diversity) contributes to retaining a useful, non-redundant, and optimum number of features in the obtained ensemble subset. Retaining these useful candidate features helps the NIDS improve prediction performance.

4.5.3 HEFS Method in Detecting Minority Classes

The experimental results in Table 4 confirm that the ensemble feature selection method helps detect minority classes (i.e., U2R and R2L). Single feature selection methods, however, have limitations in detecting these classes, which lowers classification performance. These classes are low-frequency attacks (i.e., they rarely occur), and their number in the dataset is significantly smaller than that of the other classes. As explained in [10], low-frequency attacks are challenging to detect and decrease classification performance. Therefore, the proposed HEFS method helps increase their detection rate and improves the general performance of NIDS.

4.5.4 HEFS Method in Improving Robustness

The results reported in Tables 7 and 8 show that the base selectors are diversified in their behavior, and the obtained ensemble feature subset is stable. In heterogeneous ensemble feature selection, merely combining different methods is not sufficient. Instead, the base selectors’ diversity and the stability of the obtained ensemble features subset are crucial to making the algorithm more robust, as explained in [51, 59]. Furthermore, better stability can lead to better classification accuracy. Therefore, the results confirm that the heterogeneous ensemble method helps make the feature selection process more robust in NIDS.

4.6 Implications

The integration of different feature selection methods helps data scientists with data cleaning and pre-processing in NIDS. It contributes to retaining useful candidate features during their data analysis tasks. Consequently, the obtained HEFS method certainly helps them select the most appropriate and optimum number of features.

The HEFS method likely helps machine learning engineers develop and implement robust NIDS to detect threats and predict attacks. Its robustness also helps them effectively detect low-frequency attacks, which are minority classes.

For machine learning-based NIDS model developers, applying the HEFS method certainly helps in building efficient NIDS. This NIDS contributes to network administrators effectively identifying, monitoring, and analyzing network traffic to protect a system from possible threats.

In general, NIDS based on the HEFS method supports organizations in enhancing security and network performance. It also helps companies proactively identify any unwarranted networking behavior before the intrusion escalates into a full-force security attack, data leak, or service outage. Furthermore, the HEFS method likely improves organizations’ business models and services by transforming large data sets into knowledge and actionable intelligence.

5 Conclusion

This paper has demonstrated the positive influence of combining features, which are the output of different feature selection methods, to obtain a more predictable single ensemble features subset. The obtained ensemble features subset helps improve prediction performance in NIDS. The study has also indicated that applying merit-based evaluation of ensemble features subset helps avoid redundancy and select relevant features during intrusion detection.

In this work, we have compared the performance of the HEFS method with different methods, including the base feature selection methods. The HEFS method achieves better classification performance using random forest, J48, random tree, and REP tree. The results of the stability and diversity tests for our proposed HEFS method confirm that it is robust. The statistical significance of the detection rate improvement achieved by the HEFS method is also tested, and the test shows that the improvement is statistically significant. Our proposed NIDS framework is also compared with other researchers’ proposed state-of-the-art approaches, and it achieves better performance in terms of accuracy and detection rate.

The experiment results provide compelling evidence and suggest that the integration of heterogeneous feature selection methods helps retain useful candidate features during feature selection in NIDS. Consequently, the obtained HEFS method contributes to selecting the most relevant and optimum features that improve prediction performance. The robustness of ensemble feature selection methods also helps NIDS effectively detect low-frequency attacks which are minority classes. Therefore, the HEFS method is reliable for NIDSs, and it is a valuable method in achieving better detection performance of intrusions.

Further research is needed to improve the prediction performance of NIDS for detecting low-frequency attacks (i.e., minority classes). The minority classes are challenging to detect, since the classification process favors the majority classes. Therefore, the effect of imbalanced multi-class datasets on classifiers’ effectiveness needs to be considered in NIDS. Furthermore, deep learning approaches should be explored to build more effective and robust IDSs in computer network systems.