Malware and Machine Learning

Part of the book series: Studies in Computational Intelligence (SCI, volume 563)

Abstract

Malware analysts use machine learning to aid in the fight against the unstemmed tide of new malware encountered on a daily, even hourly, basis. The two fields are a natural match: malware contains inherent patterns and similarities because malware authors reuse code and code patterns, and machine learning operates by discovering such patterns and similarities. In this chapter, we provide a high-level, guiding view of machine learning and how it is being applied in malware analysis. We do not attempt a tutorial or comprehensive introduction to either malware or machine learning; rather, we present the major issues and intuitions of both fields and identify the malware analysis problems that machine learning is best equipped to solve.

Notes

  1. Binaries are not one big blob, but separated into sections of logically related code and data. At a minimum, there will be two sections: one designated for data and the other for code.

  2. Objects don’t need to be compared with themselves and similarity functions are (typically) symmetric, so the actual number of comparisons required is \((n^2-n)/2\).
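
As a rough illustration of the count in Note 2, the following Python sketch compares every unordered pair of objects exactly once and checks that the number of comparisons equals \((n^2-n)/2\). The `jaccard` function, the `pairwise_similarities` helper, and the sample opcode sets are assumptions made for this example only; they are not taken from the chapter.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (an assumed, symmetric similarity function)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarities(objects, similarity):
    """Score each unordered pair once, skipping self-comparisons."""
    return {
        (i, j): similarity(objects[i], objects[j])
        for i, j in combinations(range(len(objects)), 2)
    }


# Hypothetical samples, each reduced to a set of opcode mnemonics.
samples = [
    {"push", "mov", "call"},
    {"push", "mov", "ret"},
    {"xor", "jmp"},
    {"mov", "call"},
]

scores = pairwise_similarities(samples, jaccard)
n = len(samples)

# With n = 4 objects, (n^2 - n) / 2 = 6 comparisons instead of n^2 = 16.
expected_comparisons = (n * n - n) // 2
print(len(scores), expected_comparisons)
```

Because the similarity function is symmetric, the score stored under `(i, j)` also serves as the score for `(j, i)`, which is why only the pairs with `i < j` are computed.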

Author information

Corresponding author

Correspondence to Charles LeDoux.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

LeDoux, C., Lakhotia, A. (2015). Malware and Machine Learning. In: Yager, R., Reformat, M., Alajlan, N. (eds) Intelligent Methods for Cyber Warfare. Studies in Computational Intelligence, vol 563. Springer, Cham. https://doi.org/10.1007/978-3-319-08624-8_1

  • DOI: https://doi.org/10.1007/978-3-319-08624-8_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08623-1

  • Online ISBN: 978-3-319-08624-8

  • eBook Packages: Engineering, Engineering (R0)
