Machine Learning Aided Static Malware Analysis: A Survey and Tutorial

Part of the Advances in Information Security book series (ADIS, volume 70)

Abstract

Malware analysis and detection techniques have evolved over the last decade in response to the development of malware techniques that evade network-based and host-based security protections. The rapid growth in the variety and number of malware species has made it very difficult for forensic investigators to provide a timely response. Machine Learning (ML) aided malware analysis has therefore become a necessity for automating different aspects of static and dynamic malware investigation. We believe that machine learning aided static analysis can be used as a methodological approach in technical Cyber Threat Intelligence (CTI), in place of resource-consuming dynamic malware analysis, which has been thoroughly studied before. In this paper, we address this research gap by conducting an in-depth survey of machine learning methods for classifying static characteristics of 32-bit malicious Portable Executable (PE32) Windows files, and we develop a taxonomy for better understanding of these techniques. Afterwards, we offer a tutorial on how different machine learning techniques can be used to extract and analyze a variety of static characteristics of PE binaries, and we evaluate the accuracy and practical generalization of these techniques. Finally, we present the results of an experimental study of all the methods on a common dataset to demonstrate their accuracy and complexity. This paper may serve as a stepping stone for future researchers in the cross-disciplinary field of machine learning aided malware forensics.
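To make "static characteristics of PE binaries" more concrete, the following is a minimal Python sketch, not the feature set or tooling used in the chapter. It reads a few header fields at the offsets defined by the PE/COFF layout and computes byte-level Shannon entropy, all without executing the file. The function names, the selection of features, and the synthetic header used for the demonstration are illustrative assumptions.

```python
import math
import struct
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def pe_header_features(data: bytes) -> dict:
    """Read a few COFF and optional-header fields from a PE image held in memory."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing MZ (DOS) signature")
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    # The 20-byte COFF file header follows the 4-byte signature.
    machine, n_sections, timestamp, _symtab, _nsyms, opt_size, characteristics = (
        struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    )
    # The optional header, if present, starts with a 2-byte magic value:
    # 0x10B marks PE32, 0x20B marks PE32+ (64-bit).
    magic = None
    if opt_size >= 2:
        (magic,) = struct.unpack_from("<H", data, pe_offset + 24)
    return {
        "machine": machine,
        "number_of_sections": n_sections,
        "timestamp": timestamp,
        "characteristics": characteristics,
        "is_pe32": magic == 0x10B,
        "entropy": byte_entropy(data),
    }


def synthetic_pe32_header() -> bytes:
    """Hypothetical, header-only buffer for demonstration; not a runnable executable."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)  # PE signature directly after the DOS header
    coff = struct.pack("<HHIIIHH", 0x014C, 3, 0, 0, 0, 224, 0x0102)  # i386, 3 sections
    optional = struct.pack("<H", 0x010B) + bytes(222)  # PE32 magic, rest zeroed
    return bytes(dos) + b"PE\x00\x00" + coff + optional


if __name__ == "__main__":
    print(pe_header_features(synthetic_pe32_header()))
```

A dictionary like this, computed per file, could form one row of a feature table that is then passed to a classifier. Which fields are actually informative is an empirical question that the sketch does not address.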

Keywords

  • Machine learning
  • Malware
  • Static analysis
  • Artificial intelligence

Chapter
  • DOI: 10.1007/978-3-319-73951-9_2
  • Chapter length: 39 pages


Author information

Corresponding author

Correspondence to Ali Dehghantanha.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Shalaginov, A., Banin, S., Dehghantanha, A., Franke, K. (2018). Machine Learning Aided Static Malware Analysis: A Survey and Tutorial. In: Dehghantanha, A., Conti, M., Dargahi, T. (eds) Cyber Threat Intelligence. Advances in Information Security, vol 70. Springer, Cham. https://doi.org/10.1007/978-3-319-73951-9_2

  • DOI: https://doi.org/10.1007/978-3-319-73951-9_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73950-2

  • Online ISBN: 978-3-319-73951-9

  • eBook Packages: Computer Science, Computer Science (R0)