Abstract
Thousands of new malware samples are developed every day. Signature-based methods, which are employed by common malware detectors, are susceptible to code obfuscation and novel malware. In this paper, we present an alternative method for malware detection that makes use of assembly opcode sequences obtained at runtime. First, we apply natural language processing and deep learning techniques to the sequential opcode data to extract deeper behavioral features. Because it relies on these features, the method can be robust to code obfuscation and effective against novel malware. Finally, these features are fed to various machine learning algorithms for classification. Experiments on a more class-balanced dataset of 26,869 samples demonstrate that an MCC (Matthews correlation coefficient) score as high as 0.95 is achievable with this approach. The MCC scores for the experiments conducted on imbalanced and artificially balanced datasets are 0.81 and 0.83, respectively.
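For readers unfamiliar with the metric reported above, the following is a minimal sketch of how MCC is computed from a binary confusion matrix, assuming a benign-versus-malicious setting. The function name and the counts are hypothetical illustrations and are not taken from the paper's experiments. MCC is often preferred over accuracy on imbalanced data because it uses all four confusion-matrix cells.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Return MCC from binary confusion-matrix counts.

    By convention, return 0.0 when the denominator is zero
    (for example, when the classifier predicts only one class).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts, chosen only to illustrate the formula.
# A balanced test set with 10 errors in each class:
balanced = matthews_corrcoef(tp=90, tn=90, fp=10, fn=10)

# A skewed test set where every sample is predicted benign:
# accuracy is 990/1000, yet MCC shows no predictive skill.
skewed = matthews_corrcoef(tp=0, tn=990, fp=0, fn=10)

print(f"balanced MCC = {balanced:.2f}")
print(f"skewed MCC   = {skewed:.2f}")
```

The second case shows why a high accuracy on an imbalanced dataset can coexist with an MCC of zero, which is one reason to report MCC separately for balanced and imbalanced datasets.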
Availability of data and material
All data used in this study can be made available upon request by contacting any of the authors.
Code availability
The relevant code base is stored in a private repository. It can be made available upon request by contacting any of the authors.
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Contributions
All authors contributed equally.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Parildi, E.S., Hatzinakos, D. & Lawryshyn, Y. Deep learning-aided runtime opcode-based Windows malware detection. Neural Comput & Applic 33, 11963–11983 (2021). https://doi.org/10.1007/s00521-021-05861-7
DOI: https://doi.org/10.1007/s00521-021-05861-7