A comprehensive comparison study of ML models for multistage APT detection: focus on data preprocessing and resampling

The Journal of Supercomputing

Abstract

Advanced persistent threats (APTs) present a significant cybersecurity challenge and call for improved detection methods. This study combines advanced data preparation with strategies for handling class imbalance, tailored to the SCVIC-APT-2021 dataset. We employ a mix of resampling, cost-sensitive learning, and ensemble methods, alongside machine learning and deep learning models such as XGBoost, LightGBM, and ANNs, to enhance APT detection. Our strategy draws on the MITRE ATT&CK framework and targets each stage of an APT attack, which significantly increases detection accuracy. Notably, we achieved a Macro F1-score of 95.20% with XGBoost and 96.67% with LightGBM, along with significant improvements in the area under the precision–recall curve for both. Our exploration of the SCVIC-APT-2021 dataset marks a step forward in APT detection research, with implications for future cybersecurity developments.
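To illustrate why a Macro F1-score is a natural headline metric for imbalanced, multistage data, the following minimal sketch computes it by hand and also derives inverse-frequency class weights, one common starting point for cost-sensitive learning. It is not the authors' code: the labels `benign` and `attack_stage`, the toy predictions, and the helper names `macro_f1` and `balanced_class_weights` are assumptions made purely for illustration.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    return {label: len(y) / (len(counts) * n) for label, n in counts.items()}


# Hypothetical toy data: 8 benign flows and 2 flows from one attack stage.
y_true = ["benign"] * 8 + ["attack_stage"] * 2
# The classifier misses one of the two attack flows.
y_pred = ["benign"] * 9 + ["attack_stage"] * 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}")                    # 0.90
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")    # 0.8039
print(balanced_class_weights(y_true))
```

In this toy case accuracy stays at 0.90 even though half of the minority class is missed, while Macro F1 drops to about 0.80 because the minority class contributes as much to the average as the majority class. The weights (0.625 for the majority class, 2.5 for the minority class) show how a cost-sensitive learner could be told to penalize minority-class errors more heavily.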

Availability of data and materials

Data generated or analyzed during this study are included in this published article.

Code availability

The custom code developed for the experiments in this study is available upon request from the corresponding author.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Contributions

The primary contributor to the research was Dinh-Dong Dau, who created the test code and edited the research paper’s main content. Hanseok Kim assisted in reviewing language and grammar errors in the research paper. Professor Soojin Lee supervised the research and provided oversight for the entire content of the research paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Soojin Lee.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable, as this study did not involve human participants or animals.

Consent to participate

Not applicable, as this study did not involve human participants.

Consent for publication

All authors have consented to the publication of this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dau, DD., Lee, S. & Kim, H. A comprehensive comparison study of ML models for multistage APT detection: focus on data preprocessing and resampling. J Supercomput (2024). https://doi.org/10.1007/s11227-024-06010-2

  • DOI: https://doi.org/10.1007/s11227-024-06010-2
