In this section we discuss our results on feature ranking, the performance of the selected subsets, benchmarking, and processing time.
Feature weighting and ranking
Based on their inherent metrics and indices (Sect. 4.3), filters weight the contribution of every feature to the anomaly identification task. We normalize filter weights to the \([0,1]\) range according to the maximum and minimum values and rearrange features in ascending order. Nominal categories are then assigned visually according to pronounced changes in the weight slopes obtained by each filter; e.g., the WMR and SAM cases are shown in Fig. 3.
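For reference, a minimal sketch of this normalization and ranking step, assuming raw filter weights are available as a NumPy array (array contents below are illustrative, not actual filter outputs):

```python
import numpy as np

# Hypothetical raw relevance weights, one per feature (higher = more relevant).
raw_weights = np.array([0.02, 3.10, 0.45, 1.80, 0.00])

# Min-max normalization to [0, 1], applied to every filter's output.
w_min, w_max = raw_weights.min(), raw_weights.max()
norm_weights = (raw_weights - w_min) / (w_max - w_min)

# Rearrange features in ascending order of normalized weight; pronounced
# jumps in this curve suggest category boundaries (cf. Fig. 3).
order = np.argsort(norm_weights)        # feature indices, least relevant first
sorted_weights = norm_weights[order]
print(order, sorted_weights)
```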
The feature rankings achieved with the WMR, MRMR and SAM filters and LASSO, with and without stability selection, are shown in Fig. 4. WMR and SAM are quick and straightforward, requiring less than 1 s to produce their rankings, whereas MRMR takes 18 s and LASSO 29 s to carry out their selection. Elapsed times for stability selection are considerably longer: \(1'17''\) for WMR, \(27''\) for SAM, \(9'45''\) for MRMR and \(14'56''\) for LASSO.
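As a reference point, a minimal sketch of LASSO-based stability selection, assuming a numeric feature matrix X and binary labels y (the subsampling fraction, number of rounds and regularization strength are illustrative; the paper's exact resampling setup may differ):

```python
import numpy as np
from sklearn.linear_model import Lasso

def stability_selection(X, y, alpha=0.01, n_rounds=100, frac=0.5, seed=0):
    """Empirical selection frequency of each feature over random subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_rounds):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        model = Lasso(alpha=alpha, max_iter=10000).fit(X[idx], y[idx])
        counts += (model.coef_ != 0)       # features kept in this round
    return counts / n_rounds               # selection probability per feature
```

Features whose selection frequency exceeds a chosen threshold (e.g., 0.6) are retained; the filter cases resample analogously around the WMR, SAM and MRMR scores.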
An initial statistical analysis of the data revealed obviously irrelevant features. For instance, feature f20 (number of outbound commands) equals \(0\) for the whole data set. Accordingly, all filters agreed in ranking f20 as the most irrelevant variable (rank 41 in Fig. 4). f20 is also usually discarded in related works, except for Stein et al. (2005), which uses the KDD Cup’99 data set (where f20 is also \(0\) for all observations).
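Such trivially irrelevant features can be flagged before any ranking with a constant-value check; a minimal sketch, assuming X is the numeric feature matrix:

```python
import numpy as np

def constant_features(X):
    """Indices of features taking a single value over the whole data set,
    such as f20 (number of outbound commands), which is 0 for every record."""
    return np.where(X.min(axis=0) == X.max(axis=0))[0]
```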
Beyond this aspect, according to Fig. 4 the estimations of the diverse feature selection methods showed strong similarities, yet also some disagreements. All methods agreed in finding traffic features mostly relevant and content features mostly irrelevant, whereas they disagreed mainly when assessing basic features: SAM and LASSO emphasized the contribution of basic features, while WMR and MRMR ignored them completely. Table 2 displays the classification performances obtained with the subsets selected by WMR, MRMR, SAM and LASSO after stability selection. The classification results show that the LASSO and SAM selections slightly outperformed WMR and MRMR, although the notable performance of the kNN classifier with only 11 features in the MRMR case is worth remarking.
Refinement with brute force search
We used a subset based on a conservative estimation of the filter and LASSO rankings for strong contribution (i.e., f23, f25, f26, f29, f32, f33, f34, f38 and f39) and the 41-feature set as baselines for FS and BE stepwise regression, respectively.
The final feature categorization after wrapper refinement is shown in Fig. 5. The three-color categories are to be read as follows: the imperative feature subset (dark gray) led to an acceptable classification performance; from there on, the addition of new features (light gray) yielded small improvements until a performance peak was reached; afterwards, the addition of further features (white) worsened the performance. Note that we merged the relevant and highly relevant features into the final strong contribution category.
Stepwise regression suggested that the filters were able to correctly discover imperative features, except for some incongruities or disagreements caused by the highly correlated nature of some features. In other words, there are features and feature combinations that provide very similar information and therefore exclude each other. This phenomenon also explains the disagreement between BE and FS regarding f8: BE labels it as irrelevant, whereas FS considers it relevant. Specific cross-validation tests confirmed the inclusion of f8 in the strong contribution group ahead of other possible feature combinations. The same reason caused the filters to rank negligible and low contribution features arbitrarily (f20 excepted). In this respect, it is important to remark that, beyond the contribution of a small set of features (the conservative subset), the inclusion of new features has only relative value, i.e., it is always possible to replace such new features with others that contain analogous information and lead to quite similar performances.
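A minimal sketch of one FS step with a kNN wrapper illustrates why correlated features exclude each other: candidates carrying near-identical information score almost the same, so which of them enters the subset is effectively arbitrary (classifier settings are illustrative; the actual protocol also covers BE and the other classifiers):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection_step(X, y, candidates, selected, cv=5):
    """One FS step: add the candidate feature that most improves the
    cross-validated wrapper score on top of the already selected subset."""
    clf = KNeighborsClassifier(n_neighbors=5)
    best_feat, best_score = None, -1.0
    for f in candidates:
        cols = selected + [f]
        score = cross_val_score(clf, X[:, cols], y, cv=cv).mean()
        if score > best_score:
            best_feat, best_score = f, score
    return best_feat, best_score
```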
As for the required time, brute force refinement is a highly time-demanding process compared to filters (for instance, BE stepwise regression with normal validation using the kNN classifier requires 5 h \(45'\) to remove a single feature from the 16-feature subset; this time is multiplied by 5 in the fivefold cross-validation case). In the conducted tests, fivefold cross-validation checks around turning points were run after visual inspection of the normal validation results of the three classifiers together. Assuming that such decisions are made in an automated manner, and taking into account that classifier tests are serialized, the estimated time for the complete brute force refinement amounts to about 10 days of processing.
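These figures are consistent with a back-of-envelope estimate (assuming, as a simplification, a constant cost per classifier evaluation): removing one feature from a \(p\)-feature subset with BE requires \(p\) classifier runs, so

\[ t_{\mathrm{eval}} \approx \frac{5\,\mathrm{h}\,45'}{16} \approx 21.6', \qquad \sum_{p=17}^{41} p = 725 \ \text{evaluations for a full BE pass from 41 to 16 features}, \]

i.e., \(725 \times 21.6' \approx 261\,\mathrm{h} \approx 11\) days for the kNN wrapper with normal validation alone, of the same order as the reported 10-day estimate (in practice kNN runs faster on smaller subsets and only selected turning points receive fivefold checks).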
Comparison of the final subsets
Table 3 shows the validation of the final subset selections. Some interesting conclusions can be drawn by comparing the performance outcomes:
- As a general rule, the high scores obtained by the selected classifiers regardless of the feature subset confirmed their suitability for feature selection validation (the Bayes classifier remains under discussion).
- Starting from the 41-feature set, the removal of negligible features improved the classification down to 30 features. The 41- and 30-feature performances can be considered equivalent in terms of detection power.
- The performance degradation from 30 to 16 features is very small; hence the contribution of the rejected features to anomaly detection is also low.
Setting aside the less reliable Bayes classifier, the feature subsets of Tables 2 and 3 can be compared based on the provided indices. This comparison is displayed in Table 4. AUC and accuracy are global evaluators, whereas good recall is usually considered more important than precision for anomaly detection (i.e., in security terms, false positives are less harmful than, and thus preferable to, false negatives).
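For reference, with \(TP\), \(FP\) and \(FN\) denoting true positives, false positives and false negatives (attacks being the positive class),

\[ \mathrm{precision} = \frac{TP}{TP+FP}, \qquad \mathrm{recall} = \frac{TP}{TP+FN}, \]

so high recall bounds the number of undetected attacks (\(FN\)), whereas lower precision merely inflates the number of alarms to review (\(FP\)).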
Table 4 Subset competency based on DTC and kNN performance indices (1st indicates the best performance)
In short, the obtained outcomes suggest that, with less data to analyze, classifiers work in a simpler, less noisy space, sharpen their criteria and reduce randomness in classification; on the other hand, a little information is lost that could otherwise be useful to detect specific attacks.
Evaluation with classifiers
The results show that the original data set with 41 features contains redundant and irrelevant features. As a consequence, researchers who use the 41 features to detect anomalies should obtain similar or even better results when using the 30-feature subset, and minor (or even no) degradation when using the 16-feature subset. We performed some tests to validate this statement, i.e., assessing classifiers working with the selected 41, 30 or 16 features and comparing results (Table 5). For this we used the originally proposed train/test NSL-KDD division.
Table 5 Evaluation of significant subsets with the original train/test division
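In code, this comparison reduces to a small loop over subsets and classifiers; a sketch assuming the official train/test splits are already loaded and label-encoded as NumPy arrays (only the 41-feature set and the conservative subset of Sect. 7.2 are spelled out; the 30- and 16-feature index lists from Fig. 5 plug in identically):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# 0-based column indices: the conservative subset is f23, f25, f26, f29,
# f32, f33, f34, f38 and f39.
SUBSETS = {
    "41-feature": list(range(41)),
    "conservative-9": [22, 24, 25, 28, 31, 32, 33, 37, 38],
}

def evaluate(X_train, y_train, X_test, y_test):
    """Train/test each classifier on each subset with the official split.
    Labels are assumed binary: 1 = attack (positive class), 0 = normal."""
    for name, cols in SUBSETS.items():
        for clf in (KNeighborsClassifier(n_neighbors=5),
                    DecisionTreeClassifier()):
            y_pred = clf.fit(X_train[:, cols], y_train).predict(X_test[:, cols])
            print(name, type(clf).__name__,
                  round(accuracy_score(y_test, y_pred), 4),
                  round(precision_score(y_test, y_pred), 4),
                  round(recall_score(y_test, y_pred), 4))
```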
The test outcomes confirmed the feature selection classification introduced in Sects. 7.1 and 7.3 and Table 3, as classifiers working with the 30-feature subset improve their performance compared to the 41-feature case. The accuracy indices are not as high as in some previous studies. The reasons for this have already been introduced: (a) the more demanding NSL-KDD data set compared to DARPA’98 and KDD Cup’99; (b) the original train/test partitioning, designed to challenge data mining studies; and (c) our simple, straightforward classification models (since optimizing classifiers is not the focus of this work).
Noteworthy is the evolution of the Bayes classifier, which improved its performance by exchanging precision and recall rates. With 41 features, it was prone to label almost any new vector as an attack, generating many false positives. With 16 features, false positives were reduced, at the cost of missing some anomalies. This does not imply that the missed threats were correctly identified with 41 features; the performance indices suggest that the Naïve Bayes classifier was dividing the input space superficially, unable to model the data at a deeper level.
SVM are well suited for binary problems that involve high-dimensional input spaces (i.e., many features) and myriads of samples (Joachims 1998). SVM proved especially sensitive to changes in the parameterization, demanding a new adjustment for every different subset selection. In Table 5, the results correspond to the set-up that obtained the best performance with the minimum number of features, after running parameter optimization with an evolutionary computation approach.
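The evolutionary optimizer is not detailed here; as an illustration of the idea, a toy \((\mu+\lambda)\)-style evolution over the RBF-SVM parameters \(C\) and \(\gamma\) in log-space (population size, mutation scale, bounds and fitness settings are all illustrative assumptions, not the paper's configuration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def evolve_svm_params(X, y, generations=15, pop=10, seed=0):
    """Toy evolution over [log10(C), log10(gamma)] for an RBF-kernel SVM."""
    rng = np.random.default_rng(seed)
    parents = rng.uniform([-2, -5], [4, 1], size=(pop, 2))  # random init

    def fitness(ind):
        clf = SVC(C=10 ** ind[0], gamma=10 ** ind[1])
        return cross_val_score(clf, X, y, cv=3).mean()

    for _ in range(generations):
        children = parents + rng.normal(0, 0.3, size=parents.shape)  # mutation
        both = np.vstack([parents, children])
        scores = np.array([fitness(ind) for ind in both])
        parents = both[np.argsort(scores)[-pop:]]   # keep the pop fittest
    best = parents[-1]                              # highest-scoring individual
    return 10 ** best[0], 10 ** best[1]
```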
For the ANN and SVM tests, nominal features were transformed into numerical features, significantly increasing the number of dimensions and the sparsity of the feature space. The inclusion or removal of nominal features has a decisive effect on the overall shape of the input space; irrespective of the relevance of the information they contain, it can be either beneficial or detrimental when establishing partition boundaries in the SVM case (depending on which features are already selected). Examining features one by one, density estimations revealed that both attack and normal clusters generally show strong deviation, resulting in an environment with non-determinant density fluctuations in which the regions occupied by the normal and attack classes frequently overlap. Moreover, normal traffic covers very diverse types of operations and could even be considered background noise wrapping the attack profiles. The attack group, in turn, is clearly not homogeneous, being arranged in small sub-clusters that can also show high dispersion in several features. Given how SVM establish classification boundaries, such characteristics make for a very demanding scenario. In any case, with a specific parameterization for every subset, performance tended to improve as features were removed.
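The nominal-to-numerical transformation referenced above is a standard one-hot expansion; a small illustration with toy values, using pandas for brevity:

```python
import pandas as pd

# NSL-KDD nominal features: protocol_type (f2), service (f3) and flag (f4).
# Toy excerpt; the real 'service' feature alone has dozens of categories,
# which is why the transformed database grows so much in size.
df = pd.DataFrame({"protocol_type": ["tcp", "udp", "tcp"],
                   "service":       ["http", "domain_u", "ftp"],
                   "flag":          ["SF", "SF", "REJ"]})

binary = pd.get_dummies(df)     # one 0/1 column per observed category
print(binary.columns.tolist())  # e.g. protocol_type_tcp, service_http, ...
print(binary.shape)             # 3 nominal columns -> 7 binary columns here
```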
Processing time
We report the time required by the filter feature selection validation (Table 2) as well as by the classification benchmarks (Table 5). As expected, since the involved algorithms handle fewer features, computational costs were strongly reduced without impairing classification performance, which kept acceptable rates, showed minor degradation or even improved.
The long processing times required by the ANN are related to the growth of the database (it more than doubles in storage size) when nominal data are transformed into numerical data to enable LASSO, ANN and SVM processing (explained in Sect. 4.3). On the other hand, input dimensionality does not have a high impact on the overall complexity of the SVM; hence, its processing times depended more directly on the sample size and on the convergence criteria of the training algorithms.
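A rough complexity argument supports this observation: in kernel SVM training, each kernel evaluation is linear in the input dimension \(d\), while the number of evaluations typically grows between quadratically and cubically with the sample size \(n\),

\[ T_{\mathrm{SVM}} = O(n^{2}d) \ \text{to}\ O(n^{3}d), \]

so \(d\) enters only as a linear factor that is dwarfed by the dependence on \(n\) and on the solver's stopping criteria.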