Short- versus long-term performance of detection models for obfuscated MSOffice-embedded malware

Viţel, Silviu; Lupaşcu, Marilena; Gavriluţ, Dragoş Teodor; Luchian, Henri

doi:10.1007/s10207-023-00736-5

Regular Contribution
Published: 14 August 2023

International Journal of Information Security
Volume 23, pages 271–297 (2024)

Silviu Viţel^1,2,
Marilena Lupaşcu^1,2,
Dragoş Teodor Gavriluţ^1,2 &
Henri Luchian^1

Abstract

This paper analyzes the efficiency of various machine learning models (artificial neural networks, random forest, decision tree, AdaBoost and XGBoost) against the evolution of VBA-based (Visual Basic for Applications) malware over a long period of time (1995–2021). The file set used in our research is comprehensive: approximately 1.9 million files, of which 944,595 are malicious and the rest are benign. This allowed us to gain insights into the resilience of various machine learning models against the diversity and the evolution of file features that reflect obfuscation techniques in VBA-based malware. In studying the detection of VBA-based malware, we focus on characteristics of both the classifiers, namely proactivity (short-term detection efficiency against future malware) and endurance (long-term detection robustness), and of the detection-relevant file features, namely feature perishability (the dynamics of feature relevance). As a prerequisite of the study, we also describe in some detail various obfuscation techniques used by the malware under investigation during the last decade.
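
To make the distinction between proactivity and endurance concrete, the following is a minimal sketch of a time-windowed evaluation. It is not the authors' protocol: the sample layout (`first_seen`, `label`, `features`), the window boundaries, the single toy feature and the stand-in `predict` function are all assumptions made for illustration. It measures the detection rate on malicious files that first appeared shortly after a hypothetical training cutoff (a proxy for proactivity) and on files that appeared much later (a proxy for endurance).

```python
from datetime import date

def detection_rate(samples, predict, start, end):
    """Fraction of malicious samples first seen in [start, end) that predict() flags."""
    malicious = [
        s for s in samples
        if s["label"] == 1 and start <= s["first_seen"] < end
    ]
    if not malicious:
        return None  # no malicious samples in this window
    detected = sum(1 for s in malicious if predict(s["features"]) == 1)
    return detected / len(malicious)

# Toy data (hypothetical): one illustrative "obfuscation score" per file.
samples = [
    {"first_seen": date(2019, 3, 1), "label": 1, "features": [0.9]},
    {"first_seen": date(2019, 6, 1), "label": 1, "features": [0.8]},
    {"first_seen": date(2021, 2, 1), "label": 1, "features": [0.7]},
    {"first_seen": date(2021, 5, 1), "label": 1, "features": [0.2]},
]

def predict(features):
    # Stand-in for a classifier trained only on files seen before 2019.
    return 1 if features[0] >= 0.5 else 0

# Short-term window right after the assumed cutoff versus a later window.
short_term = detection_rate(samples, predict, date(2019, 1, 1), date(2020, 1, 1))
long_term = detection_rate(samples, predict, date(2021, 1, 1), date(2022, 1, 1))
print(short_term, long_term)
```

In this toy setting the stand-in model detects every malicious file from the first window but misses one from the later window, which is the kind of gap between short-term and long-term behavior that the two notions are meant to capture.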

Data availability

The data that support the findings of this study were obtained from Bitdefender and are the property of Bitdefender. Restrictions apply to the availability of these data, which were used under a specific license for the current study, and so they are not publicly available.

Notes

https://expandedramblings.com/index.php/microsoft-office-statistics-facts/.
This information is not available anymore on the Microsoft website; it can still be found at https://web.archive.org/web/20170412001248/news.microsoft.com/bythenumbers/planet-office.
https://home.sophos.com/en-us/security-news/2019/macro-viruses.aspx.
https://www.tripwire.com/state-of-security/featured/macro-malware/.
To what degree is a human reader confused.
\(\text{Precision} = \text{TP} / (\text{TP} + \text{FP})\); \(\text{Recall} = \text{TP} / (\text{TP} + \text{FN})\), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
Term Frequency.
Term Frequency-Inverse Document Frequency.
Bag of Words.
Latent Semantic Indexing.
Sparse composite document vectors.
Office document files use the Open XML file format, which is a ZIP archive containing data structured in separate XML files.
https://docs.microsoft.com/en-us/office/vba/language/reference/user-interface-help/shell-function.
https://github.com/decalage2/oletools/wiki/olevba.
https://www.antlr.org/.
https://scikit-learn.org.
https://github.com/VitelSilviuConstantin/VBA-Dataset.
https://www.ncsc.gov.uk/guidance/macro-security-for-microsoft-office.
https://github.com/malicialab/avclass.
Despite the fact that Microsoft introduced security measures aimed at preventing the execution of malicious macros, attackers often managed to convince unsuspecting users to open infected documents, by disguising their origin or describing the enabling of macros as a necessary step to access a document’s data.
https://nakedsecurity.sophos.com/2014/09/17/vba-injectors/.
https://threatpost.com/microsoft%2Dextends%2Dmalicious%2Dmacro-protection%2Dto%2Doffice%2D2013/121618/.
https://isssource.com/macro-malware-on-way-back/.
https://www.securityweek.com/locky-variant-osiris-distributed-excel-documents.
https://news.cision.com/f-secure/r/covid%2D19%2Dspam%2D%2Dphishing%2Demails%2D%2Dplagued%2Dusers%2Din%2Dfirst%2Dhalf%2Dof%2D2020,c3195746.
In the whole database D, ignoring the time stamps.
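
The precision and recall definitions given in the notes can be computed directly from lists of true and predicted labels. The sketch below is illustrative only: the function name `precision_recall`, the convention that label 1 means malicious, and the toy label lists are assumptions, not taken from the paper.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Toy example (hypothetical labels): 1 = malicious, 0 = benign.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here TP = 3, FP = 1 and FN = 1, so both precision and recall evaluate to 0.75.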

References

Viţel, S., Lupaşcu, M., Gavriluţ, D.T., Luchian, H.: Detection of msoffice-embedded malware: Feature mining and short- vs. long-term performance. In: Su, C., Gritzalis, D., Piuri, V. (eds.) Information Security Practice and Experience, pp. 287–305. Springer, Cham (2022)
Viţel, S.C., Lupaşcu, M., Gavriluţ, D.T., Luchian, H.: Evolution of macro vba obfuscation techniques. In: 2022 15th International Conference on Security of Information and Networks (SIN), pp. 1–8 (2022). https://doi.org/10.1109/SIN56466.2022.9970550
You, I., Yim, K.: Malware obfuscation techniques: a brief survey. In: 2010 International conference on broadband, wireless computing, communication and applications, pp. 297–300. IEEE (2010)
Collberg, C., Thomborson, C., Low, D.: A taxonomy of obfuscating transformations. Tech. Rep. 148, Department of Computer Sciences, The University of Auckland (1997). http://www.cs.auckland.ac.nz/~/Research/Publications/CollbergThomborsonLow97a/index.html
Ertaul, L., Venkatesh, S.: Jhide—a tool kit for code obfuscation. In: IASTED Conference on Software Engineering and Applications, pp. 133–138 (2004)
Ertaul, L., Venkatesh, S.: Novel obfuscation algorithms for software security. In: Proceedings of the 2005 International Conference on Software Engineering Research and Practice, SERP, Citeseer, vol. 5 (2005)
Xu, W., Zhang, F., Zhu, S.: The power of obfuscation techniques in malicious javascript code: a measurement study. In: 2012 7th International Conference on Malicious and Unwanted Software, pp. 9–16 (2012). https://doi.org/10.1109/MALWARE.2012.6461002
Kolisar: Whitespace: A different approach to javascript obfuscation (2008). https://defcon.org/images/defcon-16/dc16-presentations/defcon-16-kolisar.pdf
Chellapilla, K., Maykov, A.: A taxonomy of javascript redirection spam. In: AIRWeb ’07 (2007)
AL-Taharwa, I.A., Lee, H.M., Jeng, A.B., Wu, K.P., Ho, C.S., Chen, S.M.: Jsod: Javascript obfuscation detector. Secur. Commun. Netw. 8(6), 1092–1107 (2015)
Xu, W., Zhang, F., Zhu, S.: Jstill: mostly static detection of obfuscated malicious javascript code. In: Proceedings of the third ACM conference on Data and application security and privacy, pp. 117–128 (2013)
Choi, Y., Kim, T., Choi, S., Lee, C.: Automatic detection for javascript obfuscation attacks in web pages through string pattern analysis. In: Ślezak, D., Lee, Y., Kim, T., Fang, W. (eds.) Future Generation Information Technology, pp. 160–172. Springer, Berlin (2009)
Liu, C., Xia, B., Yu, M., Liu, Y.: Psdem: a feasible de-obfuscation method for malicious powershell detection. In: 2018 IEEE Symposium on Computers and Communications (ISCC), pp. 825–831. IEEE (2018)
Ugarte, D., Maiorca, D., Cara, F., Giacinto, G.: Powerdrive: accurate de-obfuscation and analysis of powershell malware. In: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pp. 240–259. Springer (2019)
Hendler, D., Kels, S., Rubin, A.: Detecting malicious powershell commands using deep neural networks. In: Proceedings of the 2018 on Asia conference on computer and communications security, pp. 187–197 (2018)
Aboud, E., O’Brien, D.: Detection of malicious VBA macros using machine learning methods (2018)
Kim, S., Hong, S., Oh, J., Lee, H.: Obfuscated VBA macro detection using machine learning. In: DSN, IEEE Computer Society, pp. 490–501 (2018)
De los Santos, S., Torres, J.: Macro malware detection using machine learning techniques—a new approach. In: ICISSP, pp. 295–302 (2017)
Bearden, R., Lo, D.C.T.: Automated microsoft office macro malware detection using machine learning. In: 2017 IEEE International Conference on Big Data (2017)
Huneault-Leblanc, S., Talhi, C.: P-code based classification to detect malicious vba macro. In: 2020 International Symposium on Networks, Computers and Communications (ISNCC), pp. 1–6. IEEE (2020)
Mimura, M., Miura, H.: Detecting unseen malicious VBA macros with NLP techniques. J. Inf. Process. 27, 555–563 (2019)
Mimura, M.: An improved method of detecting macro malware on an imbalanced dataset. IEEE Access 8, 204709–204717 (2020)
Mimura, M.: Using sparse composite document vectors to classify VBA macros, pp. 714–720 (2019). https://doi.org/10.1007/978-3-030-36938-5_46
Mimura, M.: Using fake text vectors to improve the sensitivity of minority class for macro malware detection. J. Inf. Secur. Appl. 54, 102600 (2020)
Ravi, V., Gururaj, S., Vedamurthy, H., Nirmala, M.: Analysing corpus of office documents for macro-based attacks using machine learning. Glob. Trans. Proc. 3, 20–24 (2022)
Nissim, N., Cohen, A., Elovici, Y.: Aldocx: detection of unknown malicious microsoft office documents using designated active learning methods based on new structural feature extraction methodology. IEEE Trans. Inf. Forensics Secur. 12, 631–646 (2016)
Cohen, A., Nissim, N., Rokach, L., Elovici, Y.: Sfem: structural feature extraction methodology for the detection of malicious office documents using machine learning methods. Expert Syst. Appl. 63, 324–343 (2016)
Casino, F., Totosis, N., Apostolopoulos, T., Lykousas, N., Patsakis, C.: Analysis and correlation of visual evidence in campaigns of malicious office documents. Association for Computing Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3513025
Rudd, E.M., Harang, R.E., Saxe, J.: MEADE: towards a malicious email attachment detection engine (2018). CoRR abs/1804.08162, arXiv:1804.08162
Yang, S., Chen, W., Li, S., Xu, Q.: Approach using transforming structural data into image for detection of malicious ms-doc files based on deep learning models. In: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 28–32 (2019)
Lu, X., Wang, F., Shu, Z.: Malicious word document detection based on multi-view features learning, pp. 1–6 (2019). https://doi.org/10.1109/ICCCN.2019.8846940
Li, W.J., Stolfo, S., Stavrou, A., Androulaki, E., Keromytis, A.: A study of malcode-bearing documents (2007)
Koutsokostas, V., Lykousas, N., Apostolopoulos, T., Orazi, G., Ghosal, A., Casino, F., Conti, M., Patsakis, C.: Invoice# 31415 attached: Automated analysis of malicious microsoft office documents. Comput. Secur. 114, 102582 (2022)
Tzermias, Z., Sykiotakis, G., Polychronakis, M., Markatos, E.: Combining static and dynamic analysis for the detection of malicious documents (2011)
Yu, M., Jiang, J., Li, G., Li, J., Lou, C., Liu, C., Huang, W., Wang, Y.: A unified malicious documents detection model based on two layers of abstraction (2019)
Iwamoto, K., Wasaki, K.: A method for shellcode extraction from malicious document files using entropy and emulation. Int. J. Eng. Technol. 8, 101–106 (2015)
Schreck, T., Berger, S., Göbel, J.: Bissam: automatic vulnerability identification of office documents (2012)
Smutz, C., Stavrou, A.: Preventing exploits in microsoft office documents through content randomization (2015)
Otsubo, Y.: O-checker: Detection of malicious documents through deviation from file format specifications (2016)
Moubarak, J., Feghali, T.: Comparing machine learning techniques for malware detection. In: ICISSP (2020)
Azeez, N.A., Odufuwa, O.E., Misra, S., Oluranti, J., Damaševičius, R.: Windows pe malware detection using ensemble learning. Informatics 8(1), 10 (2021)
Szandała, T.: Review and comparison of commonly used activation functions for deep neural networks. In: Bio-inspired Neurocomputing, pp. 203–224 (2021)
Gabor, S.: Vba is not dead! Virus Bulletin (2014). https://www.virusbulletin.com/virusbulletin/2014/07/vba-not-dead

Acknowledgements

This article represents an extension of our previous works [1, 2]. Consequently, some graphics, tables and algorithms are included in this extended version. The following elements were first published in Lecture Notes in Computer Science, volume 13620, Information Security Practice and Experience, pp 287–305, 2022 by Springer Nature: Figs. 42, 43, 44, 45, 46, 49, 50, Tables 5, 8 and Algorithms 1, 2. Figures 40 and 41 were first published in 15th International Conference on Security of Information and Networks.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Authors and Affiliations

Faculty of Computer Science, “Al.I. Cuza” University, Iaşi, Romania
Silviu Viţel, Marilena Lupaşcu, Dragoş Teodor Gavriluţ & Henri Luchian
Bitdefender Labs, Iaşi, Romania
Silviu Viţel, Marilena Lupaşcu & Dragoş Teodor Gavriluţ

Authors

Silviu Viţel
Marilena Lupaşcu
Dragoş Teodor Gavriluţ
Henri Luchian

Contributions

All authors contributed to the study conception and design. All authors wrote, read and approved the final manuscript.

Corresponding author

Correspondence to Marilena Lupaşcu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflicts of interest or competing interests regarding the publication of this study.

Human participants and animals

This article does not contain any studies involving human participants or animals, performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Viţel, S., Lupaşcu, M., Gavriluţ, D.T. et al. Short- versus long-term performance of detection models for obfuscated MSOffice-embedded malware. Int. J. Inf. Secur. 23, 271–297 (2024). https://doi.org/10.1007/s10207-023-00736-5

Accepted: 18 July 2023
Published: 14 August 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s10207-023-00736-5
