Skip to main content
Log in

Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems

  • Published:
Journal of Network and Systems Management Aims and scope Submit manuscript

Abstract

The development of Intrusion Detection Systems using Machine Learning techniques (ML-based IDS) has emerged as an important research topic in the cybersecurity field. However, there is a noticeable absence of systematic studies to comprehend the usability of such systems in real-world applications. This paper analyzes the impact of data preprocessing techniques on the performance of ML-based IDS using two public datasets, UNSW-NB15 and CIC-IDS2017. Specifically, we evaluated the effects of data cleaning, encoding, and normalization techniques on the performance of binary and multiclass intrusion detection models. This work investigates the impact of data preprocessing techniques on the performance of ML-based IDS and how the performance of different ML-based IDS is affected by data preprocessing techniques. To this end, we implemented a machine learning pipeline to apply the data preprocessing techniques in different scenarios to answer such questions. The findings analyzed using the Friedman statistical test and Nemenyi post-hoc test revealed significant differences in groups of data preprocessing techniques and ML-based IDS, according to the evaluation metrics. However, these differences were not observed in multiclass scenarios for data preprocessing techniques. Additionally, ML-based IDS exhibited varying performances in binary and multiclass classifications. Therefore, our investigation presents insights into the efficacy of different data preprocessing techniques for building robust and accurate intrusion detection models.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18
Fig. 19
Fig. 20
Fig. 21
Fig. 22
Fig. 23
Fig. 24

Similar content being viewed by others

Data Availability

All materials used in this manuscript are public, and no permission is required. The results and data in this manuscript have not been published elsewhere.

Code Availability

All materials used in this manuscript are public, and no permission is required. Additional materials for this article are available at the following link: https://github.com/kelsonc/evaluation-preprocessing-methods

Notes

  1. Scikit-learn. <https://scikit-learn.org/> Accessed on 16 May 2023.

  2. Pandas. <https://pandas.pydata.org/> Accessed on 16 May 2023.

  3. ACSC, Australian Cyber Security Centre. <https://www.cyber.gov.au/> Accessed on 16 May 2023.

  4. Bro. <https://zeek.org/> Accessed on 16 May 2023.

  5. Argus. <https://openargus.org/> Accessed on 16 May 2023.

  6. TCPDump. <https://www.tcpdump.org/> Accessed on 16 May 2023.

  7. CIC, Canadian Institute for Cybersecurity. <https://www.unb.ca/cic/> Accessed on 16 May 2023.

  8. CICFlowMeter. <https://github.com/CanadianInstituteForCybersecurity/CICFlowMeter> Accessed on 16 May 2023.

  9. Overfitting is a prevalent issue in machine learning, wherein the model effectively learns from the training data but also captures the inherent noise. As a result, the overfitted model tends to perform poorly on new data. This is because the model has memorized the training set instead of discerning general patterns applicable to novel instances [44].

  10. Underfitting occurs when the model needs more complexity to comprehend the nuances and intricacies of the dataset. As a result, the model fails to adequately adapt to the training data, leading to heightened rates of false positives and false negatives. This limitation compromises the model’s ability to detect and classify network intrusions accurately [44].

  11. Google LLC, Kaggle. <https://www.kaggle.com/> Accessed on 16 May 2023.

References

  1. International Telecommunication Union: Global Cybersecurity Index 2020: Measuring Commitment to Cybersecurity, 1st edn. ITUPublications, Geneva (2021)

    Google Scholar 

  2. Sarker, I.H., Kayes, A., Badsha, S., Alqahtani, H., Watters, P., Ng, A.: Cybersecurity data science: an overview from machine learning perspective. J. Big Data 7(1), 1–29 (2020). https://doi.org/10.1186/s40537-020-00318-5

    Article  Google Scholar 

  3. Szczypiorski, K.: Cybersecurity and data science. Electronics 11(15), 1–4 (2022). https://doi.org/10.3390/electronics11152309

    Article  Google Scholar 

  4. Hajj, S., El Sibai, R., Bou Abdo, J., Demerjian, J., Makhoul, A., Guyeux, C.: Anomaly-based intrusion detection systems: the requirements, methods, measurements, and datasets. Trans. Emerging Telecommun. Technol. 32(4), 1–36 (2021). https://doi.org/10.1002/ett.4240

    Article  Google Scholar 

  5. Putra, W., Huang, J.J.: A survey of intrusion detection system. Int. J. Informatics Comput. 1(1), 1–19 (2019). https://doi.org/10.35842/ijicom.v1i1.7

    Article  Google Scholar 

  6. Hubballi, N., Suryanarayanan, V.: False alarm minimization techniques in signature-based intrusion detection systems: a survey. Comput. Commun. 49, 1–17 (2014). https://doi.org/10.1016/j.comcom.2014.04.012

    Article  Google Scholar 

  7. Skopik, F., Wurzenberger, M., Landauer, M.: The seven golden principles of effective anomaly-based intrusion detection. IEEE Secur. Privacy 19(5), 36–45 (2021). https://doi.org/10.1109/MSEC.2021.3090444

    Article  Google Scholar 

  8. Kunal, Dua, M.: Machine Learning Approach to IDS: A Comprehensive Review. Paper presented at the 3rd International conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 12-14 June 2019 (2019). https://doi.org/10.1109/ICECA.2019.8822120

  9. Thakkar, A., Lohiya, R.: A survey on intrusion detection system: feature selection, model, performance measures, application perspective, challenges, and future research directions. Artif. Intell. Rev. 55, 453–563 (2022). https://doi.org/10.1007/s10462-021-10037-9

    Article  Google Scholar 

  10. Ahmad, T., Aziz, M.N.: Data preprocessing and feature selection for machine learning intrusion detection systems. ICIC Express Lett 13(2), 93–101 (2019). https://doi.org/10.24507/icicel.13.02.93

    Article  Google Scholar 

  11. Obaid, H.S., Dheyab, S.A., Sabry, S.S.: The Impact of Data Pre-Processing Techniques and Dimensionality Reduction on the Accuracy of Machine Learning. Paper presented at the 9th Annual Information Technology, Electromechanical Engineering and Microelectronics Conference (IEMECON), Jaipur, India, 13–15 March 2019 (2019). https://doi.org/10.1109/IEMECONX.2019.8877011

  12. Li, C.: Preprocessing methods and pipelines of data mining: An overview. arXiv preprint, 1–7 (2019) https://doi.org/10.48550/arXiv.1906.08510arXiv:1906.08510 [[s.n.]]

  13. Paulauskas, N., Auskalnis, J.: Analysis of data pre-processing influence on intrusion detection using NSL-KDD dataset (2017). https://doi.org/10.1109/eStream.2017.7950325

  14. Davis, J.J., Clark, A.J.: Data preprocessing for anomaly based network intrusion detection: a review. Comput. Secur. 30(6), 353–375 (2011). https://doi.org/10.1016/j.cose.2011.05.008

    Article  Google Scholar 

  15. Magán-Carrión, R., Urda, D., Diaz-Cano, I., Dorronsoro, B.: Towards a reliable comparison and evaluation of network intrusion detection systems based on machine learning approaches. Appl. Sci. 10(5), 1–21 (2020). https://doi.org/10.3390/app10051775

    Article  Google Scholar 

  16. Magán-Carrión, R., Urda, D., Diaz-Cano, I., Dorronsoro, B.: Improving the reliability of network intrusion detection systems through dataset integration. IEEE Trans. Emerging Topics Comput. 10(4), 1717–1732 (2022). https://doi.org/10.1109/TETC.2022.3178283

    Article  Google Scholar 

  17. Singh, D., Singh, B.: Investigating the impact of data normalization on classification performance. Appl. Soft Comput. 97, 1–23 (2020). https://doi.org/10.1016/j.asoc.2019.105524

    Article  Google Scholar 

  18. Molina-Coronado, B., Mori, U., Mendiburu, A., Miguel-Alonso, J.: Survey of network intrusion detection methods from the perspective of the knowledge discovery in databases process. IEEE Trans. Network Serv. Manag. 17(4), 2451–2479 (2020). https://doi.org/10.1109/TNSM.2020.3016246

    Article  Google Scholar 

  19. Zebari, R., Abdulazeez, A., Zeebaree, D., Zebari, D., Saeed, J.: A comprehensive review of dimensionality reduction techniques for feature selection and feature extraction. J. Appli. Sci. Technol. Trends 1(2), 56–70 (2020). https://doi.org/10.38094/jastt1224

    Article  Google Scholar 

  20. Chou, D., Jiang, M.: A survey on data-driven network intrusion detection. ACM Comput. Surv. (CSUR) 54(9), 1–36 (2021). https://doi.org/10.1145/3472753

    Article  Google Scholar 

  21. Al-Utaibi, K.A., El-Alfy, E.M.: Intrusion detection taxonomy and data preprocessing mechanisms. J. Intell. Fuzzy Syst. 34(3), 1369–1383 (2018). https://doi.org/10.3233/JIFS-16943

    Article  Google Scholar 

  22. Rokach, L.: Ensemble-based classifiers. Artif. Intell. Rev. 33, 1–39 (2010). https://doi.org/10.1007/s10462-009-9124-7

    Article  Google Scholar 

  23. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

    MathSciNet  Google Scholar 

  24. Jolliffe, I.T., Cadima, J.: Principal component analysis: a review and recent developments. Philos. Trans. Royal Soc. A: Math. Phys. Eng. Sci. 374(2065), 1–16 (2016). https://doi.org/10.1098/rsta.2015.0202

    Article  MathSciNet  Google Scholar 

  25. Izenman, A.J.: Linear discriminant analysis. In: Modern Multivariate Statistical Techniques, pp. 237–280. Springer, New York (2013). https://doi.org/10.1007/978-0-387-78189-1_8

    Chapter  Google Scholar 

  26. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99 data set. Paper presented at the 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, ON, Canada, 08-10 July 2009 (2009). https://doi.org/10.1109/CISDA.2009.5356528

  27. Moustafa, N., Slay, J.: UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). Paper presented at the 2015 Military Communications and Information Systems Conference (MilCIS), Canberra, ACT, Australia, 10-12 November 2015 (2015). https://doi.org/10.1109/MilCIS.2015.7348942

  28. Song, J., Takakura, H., Okabe, Y., Eto, M., Inoue, D., Nakao, K.: Statistical analysis of honeypot data and building of Kyoto 2006+ dataset for NIDS evaluation. Paper presented at the first Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), Salzburg Austria, 10–10 April 2011 (2011). https://doi.org/10.1145/1978672.1978676

  29. Kennedy, J., Eberhart, R.: Particle swarm optimization. Paper presented at the International Conference on Neural Networks (ICNN’95), Perth, WA, Australia, 27 November – 01 December 1995 (1995). https://doi.org/10.1109/ICNN.1995.488968

  30. Hall, M.A.: Correlation-based feature selection for machine learning. PhD thesis, University of Waikato, Department of Computer Science, Hamilton, New Zealand (1999). Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

  31. García, S., Fernández, A., Luengo, J., Herrera, F.: Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sci. 180(10), 2044–2064 (2010). https://doi.org/10.1016/j.ins.2009.12.010

    Article  Google Scholar 

  32. Güney, H.: Preprocessing impact analysis for machine learning-based network intrusion detection. Sakarya Univ. J. Comput. Information Sci. 6(1), 67–79 (2023). https://doi.org/10.35377/saucis...1223054

    Article  Google Scholar 

  33. Liu, Q., Chen, C., Zhang, Y., Hu, Z.: Feature selection for support vector machines with rbf kernel. Artif. Intell. Rev. 36(2), 99–115 (2011). https://doi.org/10.1007/s10462-011-9205-2

    Article  Google Scholar 

  34. Ketepalli, G., Bulla, P.: Data Preparation and Pre-processing of Intrusion Detection Datasets using Machine Learning. Paper presented at the 2023 International Conference on Inventive Computation Technologies (ICICT), Lalitpur, Nepal, 26-28 April 2023 (2023). https://doi.org/10.1109/ICICT57646.2023.10134025

  35. Symeonidis, S., Effrosynidis, D., Arampatzis, A.: A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Exp. Syst. Appl. 110, 298–310 (2018). https://doi.org/10.1016/j.eswa.2018.06.022

    Article  Google Scholar 

  36. Chowdhary, K.: Natural language processing. In: Fundamentals of Artificial Intelligence, pp. 603–649. Springer, New York (2020). https://doi.org/10.1007/978-81-322-3972-7_19

    Chapter  Google Scholar 

  37. Frye, M., Mohren, J., Schmitt, R.H.: Benchmarking of data preprocessing methods for machine learning-applications in production. Procedia CIRP 104, 50–55 (2021). https://doi.org/10.1016/j.procir.2021.11.009

    Article  Google Scholar 

  38. Maseer, Z.K., Yusof, R., Bahaman, N., Mostafa, S.A., Foozy, C.F.M.: Benchmarking of machine learning for anomaly based intrusion detection systems in the cicids2017 dataset. IEEE Access 9, 22351–22370 (2021). https://doi.org/10.1109/ACESSO.2021.3056614

    Article  Google Scholar 

  39. Ring, M., Wunderlich, S., Scheuring, D., Landes, D., Hotho, A.: A survey of network-based intrusion detection data sets. Comput. Secur. 86, 147–167 (2019). https://doi.org/10.1016/j.cose.2019.06.005

    Article  Google Scholar 

  40. Claise, B., Trammell, B., Aitken, P.: Specification of the IP Flow Information Export (IPFIX) Protocol for the Exchange of Flow Information (RFC7011). Retrieved from https://www.rfc-editor.org/info/rfc7011. Accessed on 16 May 2023 (2013). https://doi.org/10.17487/rfc7011

  41. Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: Toward generating a new intrusion detection dataset and intrusion traffic characterization. Paper presented at the 4th International Conference on Information Systems Security and Privacy (ICISSp), Funchal, Madeira, Portugal, 22–24 January 2018 (2018). https://doi.org/10.5220/0006639801080116

  42. Hindy, H., Brosset, D., Bayne, E., Seeam, A.K., Tachtatzis, C., Atkinson, R., Bellekens, X.: A taxonomy of network threats and the effect of current datasets on intrusion detection systems. IEEE Access 8, 104650–104675 (2020). https://doi.org/10.1109/ACCESS.2020.3000179

    Article  Google Scholar 

  43. Gharib, A., Sharafaldin, I., Lashkari, A.H., Ghorbani, A.A.: An Evaluation Framework for Intrusion Detection Dataset. Paper presented at the International Conference on Information Science and Security (ICISS 2016), Pattaya, Thailand, 19-22 December 2016 (2016). https://doi.org/10.1109/ICISSEC.2016.7885840

  44. Koehrsen, W.: Overfitting vs. underfitting: a complete example. Towards Data Sci. 405, 1–12 (2018)

    Google Scholar 

  45. Kramer, O.: Dimensionality Reduction with Unsupervised Nearest Neighbors, 1st edn. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-38652-7

    Book  Google Scholar 

  46. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001). https://doi.org/10.1023/A:1010933404324

    Article  Google Scholar 

  47. Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. Paper presented at the 22nd acm sigkdd international conference on knowledge discovery and data mining, San Francisco, California, USA, 13–17 August 2013 (2016). https://doi.org/10.1145/2939672.2939785

  48. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.: Lightgbm: A highly efficient gradient boosting decision tree. Paper presented at the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017 (2017). [s.n.]

  49. Resende, P.A.A., Drummond, A.C.: A survey of random forest based methods for intrusion detection systems. ACM Comput. Surv. (CSUR) 51(3), 1–36 (2018). https://doi.org/10.1145/3178582

    Article  Google Scholar 

  50. Martínez Torres, J., Iglesias Comesaña, C., García-Nieto, P.J.: Review: machine learning techniques applied to cybersecurity. Int. J. Mach. Learn. Cybern. 10, 2823–2836 (2019). https://doi.org/10.1007/s13042-018-00906-1

    Article  Google Scholar 

  51. Nemenyi, P.B.: Distribution-free multiple comparisons. PhD thesis, University of Princeton, Department of Mathematics, Princeton, New Jersey, US (1963). Thesis submitted in partial fulfilment of the requirements for the degree of Doctor of Philosophy

Download references

Funding

The authors thank the State of Minas Gerais Research Support Foundation-FAPEMIG (Grant APQ-02196-18) for financial support. The authors also acknowledge the financial support of the Brazilian National Council for Scientific and Technological Development (CNPq), Grant 421944/2021-8.

Author information

Authors and Affiliations

Authors

Contributions

The authors contributed equally to this work.

Corresponding author

Correspondence to Kelson Carvalho Santos.

Ethics declarations

Conflict of interest

The authors have no competing interests or other interests that might be perceived to influence the results or discussion reported in this paper.

Ethical Approval

This manuscript adheres to the principles and policies of authorship ethics.

Consent to Participate and Publication

All authors read and approved the final manuscript for publication via the subscription publishing route.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Santos, K.C., Miani, R.S. & de Oliveira Silva, F. Evaluating the Impact of Data Preprocessing Techniques on the Performance of Intrusion Detection Systems. J Netw Syst Manage 32, 36 (2024). https://doi.org/10.1007/s10922-024-09813-z

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10922-024-09813-z

Keywords

Navigation