A comprehensive comparison study of ML models for multistage APT detection: focus on data preprocessing and resampling

Dau, Dinh-Dong; Lee, Soojin; Kim, Hanseok

doi:10.1007/s11227-024-06010-2

Published: 16 March 2024

The Journal of Supercomputing

Abstract

Advanced persistent threats (APTs) present a significant cybersecurity challenge, necessitating innovative detection methods. This study integrates advanced data preparation with strategies for handling class imbalance, tailored to the SCVIC-APT-2021 dataset. We combine resampling, cost-sensitive learning, and ensemble methods with machine learning and deep learning models such as XGBoost, LightGBM, and ANNs to enhance APT detection. Our strategy draws on the MITRE ATT&CK framework and targets each stage of an APT attack, which significantly increases detection accuracy. Notably, we achieved a Macro F1-score of 95.20% with XGBoost and 96.67% with LightGBM, along with significant improvements in the area under the precision–recall curve for both. Our exploration of the SCVIC-APT-2021 dataset marks a progressive step in APT detection research, with important implications for future cybersecurity developments.
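
As a reading aid only, the following minimal sketch illustrates two ideas named in the abstract: inverse-frequency class weights, one common form of cost-sensitive learning, and the Macro F1-score. It is not the authors' code, which is available on request from the corresponding author. The stage labels, the toy class counts, and the helper names `balanced_class_weights` and `macro_f1` are assumptions made for illustration, and the weighting formula is a widely used heuristic rather than the scheme confirmed by the paper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so misclassifying them costs more.
    This heuristic is an assumption, not necessarily the paper's scheme.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced toy labels (not taken from SCVIC-APT-2021).
y_true = ["normal"] * 90 + ["reconnaissance"] * 6 + ["lateral_movement"] * 4
# A degenerate classifier that always predicts the majority class.
y_pred = ["normal"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weights = balanced_class_weights(y_true)
score = macro_f1(y_true, y_pred)

print("accuracy:", accuracy)
print("macro F1:", score)
print("class weights:", weights)
```

On this toy data the majority-class predictor reaches high accuracy while its Macro F1 stays low, because every attack stage contributes an F1 of zero to the average. This is why a per-stage, class-averaged metric is a natural choice when attack traffic is rare.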

Availability of data and materials

Data generated or analyzed during this study are included in this published article.

Code availability

The custom code developed for the experiments in this study is available upon request from the corresponding author.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information

Authors and Affiliations

Korea National Defense University, Nonsan-si, Chungcheongnam-Do, South Korea
Dinh-Dong Dau, Soojin Lee & Hanseok Kim

Authors

Dinh-Dong Dau
Soojin Lee
Hanseok Kim

Contributions

The primary contributor to the research was Dinh-Dong Dau, who created the test code and edited the research paper’s main content. Hanseok Kim assisted in reviewing language and grammar errors in the research paper. Professor Soojin Lee supervised the research and provided oversight for the entire content of the research paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Soojin Lee.

Ethics declarations

Conflict of interests

The authors declare that they have no conflict of interest.

Ethics approval

Not applicable, as this study did not involve human participants or animals.

Consent to participate

Not applicable, as this study did not involve human participants.

Consent for publication

All authors have consented to the publication of this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (ZIP 253 KB)

Supplementary file2 (IPYNB 653 KB)

Supplementary file3 (IPYNB 562 KB)

Supplementary file4 (IPYNB 640 KB)

Supplementary file5 (IPYNB 640 KB)

Supplementary file6 (IPYNB 841 KB)

Supplementary file7 (IPYNB 1478 KB)

Supplementary file8 (IPYNB 853 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Dau, DD., Lee, S. & Kim, H. A comprehensive comparison study of ML models for multistage APT detection: focus on data preprocessing and resampling. J Supercomput (2024). https://doi.org/10.1007/s11227-024-06010-2

Accepted: 19 February 2024
Published: 16 March 2024
DOI: https://doi.org/10.1007/s11227-024-06010-2
