An Effective Intrusion Detection Model Based on Pls-Logistic Regression with Feature Augmentation

Gu, Jie

doi:10.1007/978-981-33-4922-3_10

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1299)

Included in the following conference series:

China Cyber Security Annual Conference

Abstract

Computer networks play an important role in modern society, supporting commerce, communication, consumption and entertainment, and network security has therefore become increasingly important. Intrusion detection systems, which can detect not only known attacks but also unknown ones, have received considerable attention. Among the various methods applied to intrusion detection, logistic regression is the most widely used, as it achieves good performance while remaining interpretable. However, intrusion detection systems usually face large-scale, high-dimensional data, so reducing the dimension and improving the data quality are important for improving detection performance. In this paper, we therefore propose an effective intrusion detection model based on pls-logistic regression with feature augmentation. Specifically, the feature augmentation technique is applied to the original features to obtain high-quality training data; pls-logistic regression is then applied to the transformed data to perform dimension reduction and build the detection model. The NSL-KDD dataset is used to evaluate the proposed method, and the empirical results show that it achieves good performance in terms of accuracy, detection rate and false alarm rate.

1 Introduction

With the rapid development of the internet, networks are becoming more and more important in our daily life. Organizations rely heavily on networks for on-line transactions, and individuals depend on networks for work, study and entertainment. In short, networks are an indispensable part of modern society. However, this dependence on networks carries potential risk, because a considerable amount of information about organizational operations and individual activities is accumulated and stored. Huge losses can result when networks are invaded or attacked.

Intrusion detection systems are the most widely used tools for protecting information from being compromised. Intrusion detection has long been treated as a classification problem [1, 2], and various statistic-based and machine-learning-based methods have been applied to improve the performance of intrusion detection systems [3, 4]. However, machine-learning-based methods for intrusion detection have drawn criticism [5]. Although many of them, such as the support vector machine (SVM) and the artificial neural network (ANN), can achieve better detection performance, the details of the detection process remain opaque. This black-box behavior is unfavorable for practical applications. Moreover, machine-learning-based detection methods are commonly time-consuming; for example, the training complexity of SVM becomes intolerable on large-scale, high-dimensional datasets. Statistic-based detection methods can largely make up for these shortcomings in terms of model interpretability and training speed. Compared with machine-learning-based approaches, statistic-based intrusion detection methods therefore have two advantages: good interpretability and fast training.

Among these statistic-based detection methods, logistic regression is the most widely used classification approach and can achieve good detection performance [6, 7, 8]. It is worth noting that logistic regression can model the correlations among features and take their joint effects into account to produce a decision boundary that separates the classes effectively. Logistic regression can therefore be considered an effective detection method. However, logistic regression alone may not be sufficient to achieve further improvement in detection performance: a review of related work in intrusion detection indicates that data quality is a critical determinant [9].

Therefore, in our study, we propose an effective intrusion detection framework based on pls-logistic regression with feature augmentation. Specifically, the feature augmentation technique is used to improve the data quality, and pls-logistic regression is chosen to reduce the dimension and build the intrusion detection model on the transformed data. The remainder of this paper is organized as follows. In Sect. 2, we give a brief overview of feature augmentation and pls-logistic regression. Section 3 describes the details of the proposed intrusion detection model. Section 4 presents the experimental settings, results and discussion. Finally, Sect. 5 concludes the paper.

2 Methodology

To better illustrate the proposed detection model, we first briefly review the main principles of feature augmentation [10] in Sect. 2.1 and of the pls-logistic regression classification model [11] in Sect. 2.2.

2.1 Feature Augmentation

Following Fan et al. [10], suppose we have a pair of random variables $ \left( {{\mathbf{X}},Y} \right) $ with $ n $ observations, where $ {\mathbf{X}} \in {\mathbb{R}}^{p} $ denotes the original features and $ Y \in \left\{ {0,1} \right\} $ denotes the corresponding binary response. The logarithm marginal density ratio transformation is used as the feature augmentation technique to transform the original features. Specifically, for each $ X_{j} ,j = 1,2, \ldots ,p $ in $ {\mathbf{X}} $, denote by $ f_{j} $ and $ g_{j} $ the class conditional densities for class 1 and class 0, respectively, that is, $ (X_{j} |Y = 1) \sim f_{j} $ and $ (X_{j} |Y = 0) \sim g_{j} $. Let $ {}^{1}X_{j} = \{ X_{ij} |Y_{i} = 1,i = 1,2, \ldots ,n\} $ and $ {}^{0}X_{j} = \{ X_{ij} |Y_{i} = 0,i = 1,2, \ldots ,n\} $ be the observations of the $ j $-th feature in class 1 and class 0. Then $ f_{j} $ and $ g_{j} $ are estimated by kernel density estimation on $ {}^{1}X_{j} $ and $ {}^{0}X_{j} $, and the estimates are denoted by $ \hat{f}_{j} $ and $ \hat{g}_{j} $, respectively. The feature augmentation for $ X_{j} $ using the logarithm marginal density ratio transformation is then given by:

$$ X_{j}^{'} = \log \hat{f}_{j} (X_{j} ) - \log \hat{g}_{j} (X_{j} ) , \tag{1} $$

where $ X_{j}^{'} $ denotes the transformed feature for the $ j $-th feature $ X_{j} $.
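The transformation in Eq. (1) can be sketched for a single feature as follows. This is a minimal illustration and not the author's implementation: the helper names `gaussian_kde` and `fit_log_density_ratio`, the Gaussian kernel, the normal-reference rule-of-thumb bandwidth and the small constant `eps` are all assumptions, since the paper does not specify them. The densities are estimated on training data only, and the returned function can then be applied to new samples.

```python
import math
from statistics import stdev


def gaussian_kde(sample, bandwidth=None):
    """Return a function that estimates the density of `sample` (Gaussian kernel)."""
    n = len(sample)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not taken from the paper.
        bandwidth = 1.06 * stdev(sample) * n ** (-1 / 5)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density


def fit_log_density_ratio(values, labels, eps=1e-12):
    """Fit Eq. (1) for one feature: returns x -> log f_hat(x) - log g_hat(x)."""
    f_hat = gaussian_kde([v for v, y in zip(values, labels) if y == 1])  # class 1
    g_hat = gaussian_kde([v for v, y in zip(values, labels) if y == 0])  # class 0
    # eps guards against log(0) far away from the training data (an assumption).
    return lambda x: math.log(f_hat(x) + eps) - math.log(g_hat(x) + eps)


# Toy feature: class 1 clusters near 0.8, class 0 near 0.2 (made-up values).
values = [0.75, 0.80, 0.85, 0.90, 0.15, 0.20, 0.25, 0.30]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
transform = fit_log_density_ratio(values, labels)

# Positive where class 1 is more likely, negative where class 0 is more likely.
print(transform(0.8) > 0, transform(0.2) < 0)
```

In this sketch the sign of the transformed value already indicates which class density dominates at a given point, which is one way to see why the transformed features can be easier for a linear classifier to use.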

2.2 Pls-Logistic Regression Classification Model

Suppose we have a pair of random variables $ ({\mathbf{X}},Y) $, where $ {\mathbf{X}} \in {\mathbb{R}}^{p} $ denotes the original features and $ Y \in \left\{ {0,1} \right\} $ denotes the corresponding binary response. The procedure of pls-logistic regression is as follows:

Step 1. Perform a univariate logistic regression of $ Y $ on each feature to obtain $ p $ coefficients, denoted by $ \omega^{1} = \left( {\omega_{1} ,\omega_{2} , \cdots ,\omega_{p} } \right) $. Denote the normalized $ \omega^{1} $ by $ \bar{\omega }^{1} $.
Step 2. Extract the first pls component $ t_{1} $ by $ t_{1} = {\mathbf{X}} \cdot \bar{\omega }^{1} $.
Step 3. Perform OLS regression of $ {\mathbf{X}} $ against $ t_{1} $. Denote the residual of $ {\mathbf{X}} $ by $ {\mathbf{X}}^{ * } $.
Step 4. For each feature of $ {\mathbf{X}}^{ * } $, perform a logistic regression of $ Y $ on $ t_{1} $ and that feature to obtain the $ p $ coefficients of the features in $ {\mathbf{X}}^{ * } $, denoted by $ \omega^{2} $, and then normalize $ \omega^{2} $ to $ \bar{\omega }^{2} $.
Step 5. Extract the second pls component $ t_{2} $ by $ t_{2} = {\mathbf{X}}^{ * } \cdot \bar{\omega }^{2} $.
Step 6. Repeat Steps 3 to 5 until the stopping criteria are satisfied.
Step 7. Denote by $ t_{1} ,t_{2} , \cdots ,t_{h} $ the final extracted pls components. Perform logistic regression of $ Y $ on these pls components to build the classification model.
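The steps above can be sketched with NumPy as follows. This is an illustrative reading of the procedure rather than the author's code: the function names `fit_logistic` and `pls_logistic_components` are hypothetical, logistic regression is fitted by Newton's method with a small ridge term for numerical stability (an assumption), Step 4 is read as a regression of $ Y $ on the components extracted so far plus one residual feature, and a fixed number of components stands in for the unspecified stopping criteria.

```python
import numpy as np


def fit_logistic(Z, y, n_iter=25, ridge=1e-3):
    """Logistic regression by Newton's method; returns [intercept, coefficients...]."""
    A = np.column_stack([np.ones(len(Z)), Z])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(A @ beta, -30, 30)))
        w = p * (1.0 - p)
        # The small ridge term (an assumption) keeps the Newton system well conditioned.
        H = A.T @ (A * w[:, None]) + ridge * np.eye(A.shape[1])
        beta += np.linalg.solve(H, A.T @ (y - p) - ridge * beta)
    return beta


def pls_logistic_components(X, y, n_components=2):
    """Extract pls components t_1, ..., t_h (Steps 1 to 6) as columns of T."""
    X_res = X - X.mean(axis=0)           # centered features; later the residual X*
    T = np.empty((len(X), 0))
    for _ in range(n_components):
        # Steps 1 and 4: coefficient of each feature in a logistic regression of y
        # on the components found so far plus that single feature.
        w = np.array([fit_logistic(np.column_stack([T, X_res[:, j]]), y)[-1]
                      for j in range(X_res.shape[1])])
        w /= np.linalg.norm(w)           # normalize the weight vector
        t = X_res @ w                    # Steps 2 and 5: next component
        T = np.column_stack([T, t])
        # Step 3: OLS of every column on t, keeping the residual.
        X_res = X_res - np.outer(t, t @ X_res) / (t @ t)
    return T


# Synthetic data (made up): two of five features carry class information.
rng = np.random.default_rng(0)
n = 200
y = (rng.random(n) < 0.5).astype(float)
X = rng.normal(size=(n, 5))
X[:, 0] += 1.5 * y
X[:, 1] -= 1.0 * y

T = pls_logistic_components(X, y, n_components=2)
beta = fit_logistic(T, y)                # Step 7: final model on the components
p = 1.0 / (1.0 + np.exp(-(beta[0] + T @ beta[1:])))
accuracy = np.mean((p > 0.5) == (y == 1))
print(T.shape, accuracy)
```

Because each new component is built from residuals that are orthogonal to the previous components, the columns of `T` are mutually orthogonal, so the final logistic regression works with a small set of uncorrelated inputs.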

3 Proposed Intrusion Detection Model: Fa-Plslogistic

In this section, we present the main procedure of our proposed intrusion detection model based on pls-logistic regression with feature augmentation. By embedding the data quality improvement technique into pls-logistic regression, we obtain an effective intrusion detection model with good performance and lower complexity. First, we perform feature transformations on the original features to obtain high-quality training data that can significantly improve detection performance. Then, pls-logistic regression is performed on the transformed data to conduct dimension reduction and build the intrusion detection model. For clarity, the detailed procedure is summarized as follows:

Step 1. Data transformation: Perform feature transformations on the original data to obtain high-quality training data.
Step 2. Detection model building: Use the data obtained in Step 1 to train the pls-logistic-based classifier and build the intrusion detection model.
Step 3. Intrusion detection: A new testing sample is first transformed by the logarithm marginal density ratio transformation described in Sect. 2.1; the transformed sample is then fed into the built intrusion detection model, which classifies it as either an intrusion or a normal connection.

4 Experimental Setting

4.1 Dataset Description

In our study, the NSL-KDD dataset is used to evaluate the performance of the proposed intrusion detection model. The NSL-KDD dataset is a modified version of the KDD 99 dataset, which is considered the benchmark dataset in the intrusion detection domain. However, the KDD 99 dataset suffers from some drawbacks [12, 13]. For example, it contains redundant and duplicate records, which bias the classifier towards these more frequent records. The NSL-KDD dataset was proposed in [14]; it removes all the redundant samples and reconstitutes the dataset, making it more reasonable in both data size and data structure. The NSL-KDD dataset contains TCP connections described by 41 features and one label feature.

4.2 Experimental Results and Discussion

To prevent features with large ranges from dominating, we normalize the data to the range $ [0,1] $ before conducting the experiments. To evaluate our proposed detection model, 10-fold cross validation is adopted, and performance is measured by the following metrics, defined from the confusion matrix presented in Table 1.
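The preprocessing and validation setup described above can be sketched as follows. The helper names `min_max_scale` and `k_fold_indices`, the shuffling seed and the toy values are assumptions made for illustration; the paper does not describe how the folds were drawn.

```python
import random


def min_max_scale(column):
    """Rescale one feature column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant feature: map everything to 0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]   # k disjoint folds of near-equal size
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


scaled = min_max_scale([0, 5, 10, 20])
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

splits = list(k_fold_indices(25, k=10))
print(len(splits))  # 10
```

Each sample appears in exactly one test fold, so every record is used for testing once and for training nine times.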

Table 1. Confusion matrix

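Accuracy = $ \frac{TP + TN}{TP + TN + FP + FN} $, Detection rate (DR) = $ \frac{TP}{TP + FN} $, False alarm rate (FAR) = $ \frac{FP}{TN + FP} $, where $ TP $, $ TN $, $ FP $ and $ FN $ denote the numbers of true positives, true negatives, false positives and false negatives, with intrusions taken as the positive class.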
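As a small worked example of these three measures, the sketch below computes them from confusion-matrix counts. The function name `detection_metrics` and the counts are made up for illustration and are not results from the paper.

```python
def detection_metrics(tp, tn, fp, fn):
    """Accuracy, detection rate and false alarm rate from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "detection_rate": tp / (tp + fn),      # share of intrusions that are caught
        "false_alarm_rate": fp / (tn + fp),    # share of normal traffic flagged
    }


# Hypothetical counts: 100 intrusions and 100 normal connections.
m = detection_metrics(tp=90, tn=95, fp=5, fn=10)
print(m)  # accuracy 0.925, detection_rate 0.9, false_alarm_rate 0.05
```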

To verify the effectiveness of our proposed intrusion detection model, we first compare the detection performance of Fa-plslogistic with that of the naïve-plslogistic detection model (pls-logistic regression on original data without feature transformation). The 10-fold cross validation results of these two detection models on NSL-KDD dataset with regard to accuracy, DR, FAR and training time are summarized in Table 2.

Table 2. Performances of proposed methods

As shown in Table 2, our proposed intrusion detection model has clear advantages over the naïve-plslogistic detection model, indicating that the data quality improvement technique can greatly boost detection performance. More specifically, the accuracy and detection rate of our proposed model both exceed 96%, while naïve-plslogistic only achieves 91.29% and 88.59%, respectively. In terms of false alarm rate, our proposed method is below 2.3%, while naïve-plslogistic is over 6%. Moreover, the performance of our proposed method is also more robust than that of naïve-plslogistic.

To further demonstrate the advantages of our proposed method, the training time required by Fa-plslogistic and naïve-plslogistic is also compared in Table 2. As shown, our proposed method trains faster than naïve-plslogistic. Specifically, naïve-plslogistic requires about 1.39 times as much training time as Fa-plslogistic. This suggests that the proposed method yields a more concise model than naïve-plslogistic, which reduces the training time.

Therefore, according to the comparison results, it can be concluded that our proposed intrusion detection model is more effective than naïve-plslogistic and can achieve better detection performances.

Standard errors are reported in parentheses, as percentages.

In addition, we examine which features are influential for intrusion detection. Here, for simplicity, a feature whose coefficient is greater than 1 after standardization is considered important. The influential features identified during the 10-fold cross-validation are shown in Table 3.

Table 3. Influential features for intrusion detection

According to the results in Table 3, the important features for intrusion detection are listed in descending order by frequency: land, su_attempted, num_failed_logins, src_bytes, urgent, hot, num_root, num_compromised, root_shell, is_guest_login and dst_bytes. These features are helpful in practice to efficiently detect network intrusion and attacks.
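The selection rule and the frequency ranking described above can be sketched as follows. The function name `influential_features` and the coefficient values are hypothetical, and comparing the absolute value of the standardized coefficient with the threshold is an assumption, since the text only says "greater than 1".

```python
from collections import Counter


def influential_features(fold_coefficients, threshold=1.0):
    """Count, over the folds, how often each feature's coefficient exceeds the threshold."""
    counts = Counter()
    for coefs in fold_coefficients:
        # Using the absolute value is an assumption about the selection rule.
        counts.update(name for name, c in coefs.items() if abs(c) > threshold)
    return counts.most_common()          # sorted by frequency, descending


# Made-up standardized coefficients for three folds.
folds = [
    {"land": 2.1, "src_bytes": 1.3, "duration": 0.2},
    {"land": 1.8, "src_bytes": 0.9, "duration": -0.1},
    {"land": 2.4, "src_bytes": 1.1, "duration": 0.3},
]
ranking = influential_features(folds)
print(ranking)  # [('land', 3), ('src_bytes', 2)]
```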

Furthermore, to better assess the effectiveness of our proposed method in intrusion detection, we compare the performance of our proposed model with that of other existing intrusion detection methods evaluated on the NSL-KDD dataset. The comparison results are summarized in Table 4.

Table 4. Performance comparisons of proposed method and other detection methods

From the comparison results shown in Table 4, our proposed method outperforms the other intrusion detection methods with regard to detection accuracy. However, it should be noted that Table 4 provides only a snapshot of the performance comparison between our proposed method and other detection methods. Thus, it cannot be claimed that our proposed method always performs better than any other method. Nevertheless, the results above allow us to conclude that our proposed method has advantages in intrusion detection and can provide inspiration for future research.

5 Conclusion

Intrusion detection systems are critical to network security. In this paper, we proposed an effective intrusion detection model based on pls-logistic regression with feature augmentation. Although the pls-logistic classifier can achieve good performance, its detection capacity depends heavily on the quality of the training data. Therefore, to increase the detection capacity, we apply the logarithm marginal density ratio transformation to the original data to obtain high-quality training data for pls-logistic regression before building the intrusion detection model. Empirical results on the NSL-KDD dataset show that our proposed intrusion detection model is effective and achieves good and robust detection performance.

References

1. Kumar, G., Thakur, K., Ayyagari, M.R.: MLEsIDSs: machine learning-based ensembles for intrusion detection systems—a review. J. Supercomput. 76(11), 8938–8971 (2020). https://doi.org/10.1007/s11227-020-03196-z
2. Bamakan, S.M.H., Wang, H., Yingjie, T., Shi, Y.: An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization. Neurocomputing 199, 90–102 (2016)
3. Moustafa, N., Hu, J., Slay, J.: A holistic review of network anomaly detection systems: a comprehensive survey. J. Netw. Comput. Appl. 128, 33–55 (2019)
4. Tsai, C.F., Hsu, Y.F., Lin, C.Y., Lin, W.Y.: Intrusion detection by machine learning: a review. Expert Syst. Appl. 36(10), 11994–12000 (2009)
5. Sommer, R., Paxson, V.: Outside the closed world: on using machine learning for network intrusion detection. In: 2010 IEEE Symposium on Security and Privacy, pp. 305–316 (2010)
6. Wang, Y.: A multinomial logistic regression modeling approach for anomaly intrusion detection. Comput. Secur. 24(8), 662–674 (2005)
7. Mok, M.S., Sohn, S.Y., Ju, Y.H.: Random effects logistic regression model for anomaly detection. Expert Syst. Appl. 37(10), 7162–7166 (2005)
8. Ji, S.Y., Choi, S., Jeong, D.H.: Designing an internet traffic predictive model by applying a signal processing method. J. Netw. Syst. Manag. 23(4), 998–1015 (2015)
9. Aburomman, A.A., Reaz, M.B.I.: A survey of intrusion detection systems based on ensemble and hybrid classifiers. Comput. Secur. 65, 135–152 (2017)
10. Fan, J., Feng, Y., Jiang, J., Tong, X.: Feature augmentation via nonparametrics and selection (FANS) in high-dimensional classification. J. Am. Stat. Assoc. 111(513), 275–287 (2016)
11. Bastien, P., Vinzi, V.E., Tenenhaus, M.: Pls generalised linear regression. Comput. Stat. Data Anal. 48(1), 17–46 (2005)
12. Mahoney, M.V., Chan, P.K.: An analysis of the 1999 DARPA/lincoln laboratory evaluation data for network anomaly detection. In: Vigna, G., Kruegel, C., Jonsson, E. (eds.) RAID 2003. LNCS, vol. 2820, pp. 220–237. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45248-5_13
13. Bamakan, S.M.H., Wang, H., Yingjie, T., Shi, Y.: An effective intrusion detection framework based on MCLP/SVM optimized by time-varying chaos particle swarm optimization. Neurocomputing 199, 90–102 (2016)
14. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. In: Proceedings of the 2009 IEEE Symposium on Computational Intelligence in Security and Defense Applications, pp. 1–6. IEEE (2009)
15. Yu, Z., Tsai, J.J., Weigert, T.: An adaptive automatically tuning intrusion detection system. ACM Trans. Auton. Adapt. Syst. 3(3), 10 (2008)
16. Ippoliti, D., Zhou, X.: A-GHSOM: an adaptive growing hierarchical self-organizing map for network anomaly detection. J. Parallel Distrib. Comput. 72(12), 1576–1590 (2012)
17. Panda, M., Abraham, A., Patra, M.R.: Discriminative multinomial naive bayes for network intrusion detection. In: Proceedings of 2010 Sixth International Conference on Information Assurance and Security, pp. 5–10. IEEE (2010)

Acknowledgments

This research was financially supported by National Natural Science Foundation of China (Grant No. 72001222, 61832001, 61702016).

Author information

Authors and Affiliations

Postdoctoral Research Station, Agricultural Bank of China, Beijing, 100005, China
Jie Gu
School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China
Jie Gu

Authors

Jie Gu

Corresponding author

Correspondence to Jie Gu.

Editor information

Editors and Affiliations

CNCERT/CC, Beijing, China
Wei Lu
Beijing University of Posts and Telecommunications, Beijing, China
Qiaoyan Wen
University of Chinese Academy of Sciences, Beijing, China
Yuqing Zhang
Beihang University, Beijing, China
Bo Lang
Peking University, Beijing, China
Weiping Wen
CNCERT/CC, Beijing, China
Hanbing Yan
CNCERT/CC, Beijing, China
Chao Li
CNCERT/CC, Beijing, China
Li Ding
CNCERT/CC, Beijing, China
Ruiguang Li
CNCERT/CC, Beijing, China
Yu Zhou

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Cite this paper

Gu, J. (2020). An Effective Intrusion Detection Model Based on Pls-Logistic Regression with Feature Augmentation. In: Lu, W., et al. Cyber Security. CNCERT 2020. Communications in Computer and Information Science, vol 1299. Springer, Singapore. https://doi.org/10.1007/978-981-33-4922-3_10

DOI: https://doi.org/10.1007/978-981-33-4922-3_10
Published: 19 January 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-4921-6
Online ISBN: 978-981-33-4922-3
eBook Packages: Computer Science, Computer Science (R0)
