Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems

Santos, Kelson Carvalho; Miani, Rodrigo Sanches; de Oliveira Silva, Flávio

doi:10.1007/s10922-024-09813-z

Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems

Published: 22 March 2024

Volume 32, article number 36, (2024)
Cite this article

Journal of Network and Systems Management Aims and scope Submit manuscript

Kelson Carvalho Santos^1,2,
Rodrigo Sanches Miani²^na1 &
Flávio de Oliveira Silva^2,3^na1

244 Accesses
Explore all metrics

Abstract

The development of Intrusion Detection Systems using Machine Learning techniques (ML-based IDS) has emerged as an important research topic in the cybersecurity field. However, there is a noticeable absence of systematic studies to comprehend the usability of such systems in real-world applications. This paper analyzes the impact of data preprocessing techniques on the performance of ML-based IDS using two public datasets, UNSW-NB15 and CIC-IDS2017. Specifically, we evaluated the effects of data cleaning, encoding, and normalization techniques on the performance of binary and multiclass intrusion detection models. This work investigates the impact of data preprocessing techniques on the performance of ML-based IDS and how the performance of different ML-based IDS is affected by data preprocessing techniques. To this end, we implemented a machine learning pipeline to apply the data preprocessing techniques in different scenarios to answer such questions. The findings analyzed using the Friedman statistical test and Nemenyi post-hoc test revealed significant differences in groups of data preprocessing techniques and ML-based IDS, according to the evaluation metrics. However, these differences were not observed in multiclass scenarios for data preprocessing techniques. Additionally, ML-based IDS exhibited varying performances in binary and multiclass classifications. Therefore, our investigation presents insights into the efficacy of different data preprocessing techniques for building robust and accurate intrusion detection models.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Analysis of UNSW-NB15 Datasets Using Machine Learning Algorithms

Diverse Analysis of Data Mining and Machine Learning Algorithms to Secure Computer Network

Article 02 December 2021

Building an Intrusion Detection System Using Supervised Machine Learning Classifiers with Feature Selection

Data Availability

All materials used in this manuscript are public, and no permission is required. The results and data in this manuscript have not been published elsewhere.

Code Availability

All materials used in this manuscript are public, and no permission is required. Additional materials for this article are available at the following link: https://github.com/kelsonc/evaluation-preprocessing-methods

Notes

Scikit-learn. <https://scikit-learn.org/> Accessed on 16 May 2023.
Pandas. <https://pandas.pydata.org/> Accessed on 16 May 2023.
ACSC, Australian Cyber Security Centre. <https://www.cyber.gov.au/> Accessed on 16 May 2023.
Bro. <https://zeek.org/> Accessed on 16 May 2023.
Argus. <https://openargus.org/> Accessed on 16 May 2023.
TCPDump. <https://www.tcpdump.org/> Accessed on 16 May 2023.
CIC, Canadian Institute for Cybersecurity. <https://www.unb.ca/cic/> Accessed on 16 May 2023.
CICFlowMeter. <https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter> Accessed on 16 May 2023.
Overfitting is a prevalent issue in machine learning, wherein the model effectively learns from the training data but also captures the inherent noise. As a result, the overfitted model tends to perform poorly on new data. This is because the model has memorized the training set instead of discerning general patterns applicable to novel instances [44].
Underfitting occurs when the model needs more complexity to comprehend the nuances and intricacies of the dataset. As a result, the model fails to adequately adapt to the training data, leading to heightened rates of false positives and false negatives. This limitation compromises the model’s ability to detect and classify network intrusions accurately [44].
Google LLC, Kaggle. <https://www.kaggle.com/> Accessed on 16 May 2023.

References

International Telecommunication Union: Global Cybersecurity Index 2020: Measuring Commitment to Cybersecurity, 1st edn. ITUPublications, Geneva (2021)
Google Scholar
Sarker, I.H., Kayes, A., Badsha, S., Alqahtani, H., Watters, P., Ng, A.: Cybersecurity data science: an overview from machine learning perspective. J. Big Data 7(1), 1–29 (2020). https://doi.org/10.1186/s40537-020-00318-5
Article Google Scholar
Szczypiorski, K.: Cybersecurity and data science. Electronics 11(15), 1–4 (2022). https://doi.org/10.3390/electronics11152309
Article Google Scholar
Hajj, S., El Sibai, R., Bou Abdo, J., Demerjian, J., Makhoul, A., Guyeux, C.: Anomaly-based intrusion detection systems: the requirements, methods, measurements, and datasets. Trans. Emerging Telecommun. Technol. 32(4), 1–36 (2021). https://doi.org/10.1002/ett.4240
Article Google Scholar
Putra, W., Huang, J.J.: A survey of intrusion detection system. Int. J. Informatics Comput. 1(1), 1–19 (2019). https://doi.org/10.35842/ijicom.v1i1.7
Article Google Scholar
Hubballi, N., Suryanarayanan, V.: False alarm minimization techniques in signature-based intrusion detection systems: a survey. Comput. Commun. 49, 1–17 (2014). https://doi.org/10.1016/j.comcom.2014.04.012
Article Google Scholar
Skopik, F., Wurzenberger, M., Landauer, M.: The seven golden principles of effective anomaly-based intrusion detection. IEEE Secur. Privacy 19(5), 36–45 (2021). https://doi.org/10.1109/MSEC.2021.3090444
Article Google Scholar
Kunal, Dua, M.: Machine Learning Approach to IDS: A Comprehensive Review. Paper presented at the 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 12-14 June 2019 (2019). https://doi.org/10.1109/ICECA.2019.8822120
Thakkar, A., Lohiya, R.: A survey on intrusion detection system: feature selection, model, performance measures, application perspective, challenges, and future research directions. Artif. Intell. Rev. 55, 453–563 (2022). https://doi.org/10.1007/s10462-021-10037-9
Article Google Scholar
Ahmad, T., Aziz, M.N.: Data preprocessing and feature selection for machine learning intrusion detection systems. ICIC Express Lett 13(2), 93–101 (2019). https://doi.org/10.24507/icicel.13.02.93
Article Google Scholar
Obaid, H.S., Dheyab, S.A., Sabry, S.S.: The Impact of Data Pre-Processing Techniques and Dimensionality Reduction on the Accuracy of Machine Learning. Paper presented at the 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON), Jaipur, India, 13–15 March 2019 (2019). https://doi.org/10.1109/IEMECONX.2019.8877011
Li, C.: Preprocessing methods and pipelines of data mining: An overview. arXiv preprint, 1–7 (2019) https://doi.org/10.48550/arXiv.1906.08510 arXiv:1906.08510 [[s.n.]]
Paulauskas, N., Auskalnis, J.: Analysis of data pre-processing influence on intrusion detection using NSL-KDD dataset (2017). https://doi.org/10.1109/eStream.2017.7950325
Davis, J.J., Clark, A.J.: Data preprocessing for anomaly based network intrusion detection: a review. Comput. Secur. 30(6), 353–375 (2011). https://doi.org/10.1016/j.cose.2011.05.008
Article Google Scholar
Magán-Carrión, R., Urda, D., Diaz-Cano, I., Dorronsoro, B.: Towards a reliable comparison and evaluation of network intrusion detection systems based on machine learning approaches. Appl. Sci. 10(5), 1–21 (2020). https://doi.org/10.3390/app10051775
Article Google Scholar
Magán-Carrión, R., Urda, D., Diaz-Cano, I., Dorronsoro, B.: Improving the reliability of network intrusion detection systems through dataset integration. IEEE Trans. Emerging Topics Comput. 10(4), 1717–1732 (2022). https://doi.org/10.1109/TETC.2022.3178283
Article Google Scholar
Singh, D., Singh, B.: Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 1–23 (2020). https://doi.org/10.1016/j.asoc.2019.105524
Article Google Scholar
Molina-Coronado, B., Mori, U., Mendiburu, A., Miguel-Alonso, J.: Survey of network intrusion detection methods from the perspective of the knowledge discovery in databases process. IEEE Trans. Network Serv. Manag. 17(4), 2451–2479 (2020). https://doi.org/10.1109/TNSM.2020.3016246
Article Google Scholar
Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D., Saeed, J.: A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appli. Sci. Technol. Trends 1(2), 56–70 (2020). https://doi.org/10.38094/jastt1224
Article Google Scholar
Chou, D., Jiang, M.: A survey on data-driven network intrusion detection. ACM Comput. Surv. (CSUR) 54(9), 1–36 (2021). https://doi.org/10.1145/3472753
Article Google Scholar
Al-Utaibi, K.A., El-Alfy, E.M.: Intrusion detection taxonomy and data preprocessing mechanisms. J. Intell. Fuzzy Syst. 34(3), 1369–1383 (2018). https://doi.org/10.3233/JIFS-16943
Article Google Scholar
Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33, 1–39 (2010). https://doi.org/10.1007/s10462-009-9124-7
Article Google Scholar
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
MathSciNet Google Scholar
Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. Royal Soc. A: Math. Phys. Eng. Sci. 374(2065), 1–16 (2016). https://doi.org/10.1098/rsta.2015.0202
Article MathSciNet Google Scholar
Izenman, A.J.: Linear discriminant analysis. In: Modern Multivariate Statistical Techniques, pp. 237–280. Springer, New York (2013). https://doi.org/10.1007/978-0-387-78189-1_8
Chapter Google Scholar
Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. Paper presented at the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 08-10 July 2009 (2009). https://doi.org/10.1109/CISDA.2009.5356528
Moustafa, N., Slay, J.: UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). Paper presented at the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10-12 November 2015 (2015). https://doi.org/10.1109/MilCIS.2015.7348942
Song, J., Takakura, H., Okabe, Y., Eto, M., Inoue, D., Nakao, K.: Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation. Paper presented at the first Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), Salzburg Austria, 10–10 April 2011 (2011). https://doi.org/10.1145/1978672.1978676
Kennedy, J., Eberhart, R.: Particle swarm optimization. Paper presented at the International Conference on Neural Networks (ICNN’95), Perth, WA, Australia, 27 November – 01 December 1995 (1995). https://doi.org/10.1109/ICNN.1995.488968
Hall, M.A.: Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand (1999). Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy
García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sci. 180(10), 2044–2064 (2010). https://doi.org/10.1016/j.ins.2009.12.010
Article Google Scholar
Güney, H.: Preprocessing impact analysis for machine learning-based network intrusion detection. Sakarya Univ. J. Comput. Information Sci. 6(1), 67–79 (2023). https://doi.org/10.35377/saucis...1223054
Article Google Scholar
Liu, Q., Chen, C., Zhang, Y., Hu, Z.: Feature selection for support vector machines with rbf kernel. Artif. Intell. Rev. 36(2), 99–115 (2011). https://doi.org/10.1007/s10462-011-9205-2
Article Google Scholar
Ketepalli, G., Bulla, P.: Data Preparation and Pre-processing of Intrusion Detection Datasets using Machine Learning. Paper presented at the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26-28 April 2023 (2023). https://doi.org/10.1109/ICICT57646.2023.10134025
Symeonidis, S., Effrosynidis, D., Arampatzis, A.: A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Exp. Syst. Appl. 110, 298–310 (2018). https://doi.org/10.1016/j.eswa.2018.06.022
Article Google Scholar
Chowdhary, K.: Natural language processing. In: Fundamentals of Artificial Intelligence, pp. 603–649. Springer, New York (2020). https://doi.org/10.1007/978-81-322-3972-7_19
Chapter Google Scholar
Frye, M., Mohren, J., Schmitt, R.H.: Benchmarking of data preprocessing methods for machine learning-applications in production. Procedia CIRP 104, 50–55 (2021). https://doi.org/10.1016/j.procir.2021.11.009
Article Google Scholar
Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the cicids2017 dataset. IEEE Access 9, 22351–22370 (2021). https://doi.org/10.1109/ACESSO.2021.3056614
Article Google Scholar
Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A.: A survey of network-based intrusion detection data sets. Comput. Secur. 86, 147–167 (2019). https://doi.org/10.1016/j.cose.2019.06.005
Article Google Scholar
Claise, B., Trammell, B., Aitken, P.: Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information (RFC7011). Retrieved from https://www.rfc-editor.org/info/rfc7011. Accessed on 16 May 2023 (2013). https://doi.org/10.17487/rfc7011
Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. Paper presented at the 4th International Conference on Information Systems Security and Privacy (ICISSp), Funchal, Madeira, Portugal, 22–24 January 2018 (2018). https://doi.org/10.5220/0006639801080116
Hindy, H., Brosset, D., Bayne, E., Seeam, A.K., Tachtatzis, C., Atkinson, R., Bellekens, X.: A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access 8, 104650–104675 (2020). https://doi.org/10.1109/ACCESS.2020.3000179
Article Google Scholar
Gharib, A., Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: An Evaluation Framework for Intrusion Detection Dataset. Paper presented at the International Conference on Information Science and Security (ICISS 2016), Pattaya, Thailand, 19-22 December 2016 (2016). https://doi.org/10.1109/ICISSEC.2016.7885840
Koehrsen, W.: Overfitting vs. underfitting: a complete example. Towards Data Sci. 405, 1–12 (2018)
Google Scholar
Kramer, O.: Dimensionality Reduction with Unsupervised Nearest Neighbors, 1st edn. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38652-7
Book Google Scholar
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324
Article Google Scholar
Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. Paper presented at the 22nd acm sigkdd international conference on knowledge discovery and data mining, San Francisco, California, USA, 13–17 August 2013 (2016). https://doi.org/10.1145/2939672.2939785
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.: Lightgbm: A highly efficient gradient boosting decision tree. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017 (2017). [s.n.]
Resende, P.A.A., Drummond, A.C.: A survey of random forest based methods for intrusion detection systems. ACM Comput. Surv. (CSUR) 51(3), 1–36 (2018). https://doi.org/10.1145/3178582
Article Google Scholar
Martínez Torres, J., Iglesias Comesaña, C., García-Nieto, P.J.: Review: machine learning techniques applied to cybersecurity. Int. J. Mach. Learn. Cybern. 10, 2823–2836 (2019). https://doi.org/10.1007/s13042-018-00906-1
Article Google Scholar
Nemenyi, P.B.: Distribution-free multiple comparisons. PhD thesis, University of Princeton, Department of Mathematics, Princeton, New Jersey, US (1963). Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

Download references

Funding

The authors thank the State of Minas Gerais Research Support Foundation-FAPEMIG (Grant APQ-02196-18) for financial support. The authors also acknowledge the financial support of the Brazilian National Council for Scientific and Technological Development (CNPq), Grant 421944/2021-8.

Author information

Rodrigo Sanches Miani and Flávio de Oliveira Silva have contributed equally to this work.

Authors and Affiliations

Computing of Department, Federal Institute of Piaui (IFPI), Av Pedro Freitas, 1020, Sao Pedro, Teresina, 64018-000, PI, Brazil
Kelson Carvalho Santos
Faculty of Computing (FACOM), Federal University of Uberlandia (UFU), Av João Naves de Ávila, 2121 - Campus Santa Mônica, Uberlândia, 38400-902, MG, Brazil
Kelson Carvalho Santos, Rodrigo Sanches Miani & Flávio de Oliveira Silva
Department of Informatics (DI), School of Engineering, University of Minho, R. da Universidade, Braga, 4710-057, Portugal
Flávio de Oliveira Silva

Authors

Kelson Carvalho Santos
View author publications
You can also search for this author in PubMed Google Scholar
Rodrigo Sanches Miani
View author publications
You can also search for this author in PubMed Google Scholar
Flávio de Oliveira Silva
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

The authors contributed equally to this work.

Corresponding author

Correspondence to Kelson Carvalho Santos.

Ethics declarations

Conflict of interest

The authors have no competing interests or other interests that might be perceived to influence the results or discussion reported in this paper.

Ethical Approval

This manuscript adheres to the principles and policies of authorship ethics.

Consent to Participate and Publication

All authors read and approved the final manuscript for publication via the subscription publishing route.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Cite this article

Santos, K.C., Miani, R.S. & de Oliveira Silva, F. Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems. J Netw Syst Manage 32, 36 (2024). https://doi.org/10.1007/s10922-024-09813-z

Download citation

Received: 17 May 2023
Revised: 09 December 2023
Accepted: 20 February 2024
Published: 22 March 2024
DOI: https://doi.org/10.1007/s10922-024-09813-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems

Abstract

Access this article

Similar content being viewed by others

Analysis of UNSW-NB15 Datasets Using Machine Learning Algorithms

Diverse Analysis of Data Mining and Machine Learning Algorithms to Secure Computer Network

Building an Intrusion Detection System Using Supervised Machine Learning Classifiers with Feature Selection

Data Availability

Code Availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethical Approval

Consent to Participate and Publication

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems

Abstract

Access this article

Similar content being viewed by others

Analysis of UNSW-NB15 Datasets Using Machine Learning Algorithms

Diverse Analysis of Data Mining and Machine Learning Algorithms to Secure Computer Network

Building an Intrusion Detection System Using Supervised Machine Learning Classifiers with Feature Selection

Data Availability

Code Availability

Notes

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Conflict of interest

Ethical Approval

Consent to Participate and Publication

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation