Deep learning-aided runtime opcode-based Windows malware detection

  • Original Article
  • Neural Computing and Applications

Abstract

Thousands of new malware samples are developed every day. Signature-based methods, which common malware detectors employ, are susceptible to code obfuscation and to novel malware. In this paper, we present an alternative malware detection method that uses assembly opcode sequences captured at runtime. First, we apply natural language processing and deep learning techniques to the sequential opcode data to extract deeper behavioral features. Owing to these features, the method can be impervious to code obfuscation and effective against novel malware. Finally, the features are fed to various machine learning algorithms for classification. Experiments on a more class-balanced dataset of 26,869 samples demonstrate that a Matthews correlation coefficient (MCC) as high as 0.95 is achievable with this approach. On the imbalanced and artificially balanced datasets, the MCC scores are 0.81 and 0.83, respectively.
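
Since the results above are reported as MCC scores, a minimal sketch of how the metric is computed from binary confusion-matrix counts may be helpful. The function name `mcc` and the counts used in the note below are illustrative assumptions; they are not taken from the paper's code or data. The sketch uses the standard definition, MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and returns 0 when the denominator vanishes, which is a common convention rather than something the paper specifies.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    tp, tn, fp, fn are the counts of true positives, true negatives,
    false positives, and false negatives. The name and signature are
    hypothetical; they are not from the paper.
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC (an empty row or column
        # in the confusion matrix) is reported as 0.
        return 0.0
    return numerator / denominator


# Hypothetical counts: a detector that labels every sample as benign
# on a set of 95 benign and 5 malicious samples.
always_benign = mcc(tp=0, tn=95, fp=0, fn=5)
```

In this hypothetical case the accuracy would be 95%, yet the MCC is 0, which is one reason MCC is often preferred to accuracy when the classes are imbalanced.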

Availability of data and material

All data used in this study can be made available upon request by contacting any of the authors.

Code availability

The relevant code base is stored in a private repository. It can be made available upon request by contacting any of the authors.

Funding

No funding was received to assist with the preparation of this manuscript.

Author information

Contributions

All authors contributed equally.

Corresponding author

Correspondence to Enes Sinan Parildi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Parildi, E.S., Hatzinakos, D. & Lawryshyn, Y. Deep learning-aided runtime opcode-based Windows malware detection. Neural Comput & Applic 33, 11963–11983 (2021). https://doi.org/10.1007/s00521-021-05861-7

  • DOI: https://doi.org/10.1007/s00521-021-05861-7
