An investigation of byte n-gram features for malware classification

  • Edward RaffEmail author
  • Richard Zak
  • Russell Cox
  • Jared Sylvester
  • Paul Yacci
  • Rebecca Ward
  • Anna Tracy
  • Mark McLean
  • Charles Nicholas
Original Paper


Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams previously have been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work using n-gram features, in this work we use orders of magnitude more data, and we perform feature selection during model building using Elastic-Net regularized Logistic Regression. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora were created that leads to an over-estimation of classification accuracy. Second, we discovered that most of the information contained in n-grams stem from string features that could be obtained in simpler ways. Finally, we demonstrate that n-gram features promote overfitting, even with linear models and extreme regularization.


Malware classification Byte n-grams Multi-byte identifier Elastic-Net 


  1. 1.
    Abou-Assaleh, T., Cercone, N., Keselj, V., Sweidan, R.: N-gram-based detection of new malicious code. In: Proceedings of 28th annual int’l computer software & applications conference, vol. 2, pp. 41–42. IEEE (2004)Google Scholar
  2. 2.
    Aggarwal, C.C., Hinneburg, A., Keim, D.A.: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. In: van den Bussche, J., Vianu, V. (eds.) Proceedings of 8th international conference on database theory, pp. 420–434. Springer-Verlag (2001)Google Scholar
  3. 3.
    Banko, M., Brill, E.: Scaling to Very Very Large Corpora for Natural Language Disambiguation. In: Proceedings of the 39th annual meeting on association for computational linguistics, pp. 26–33 (2001)Google Scholar
  4. 4.
    Bellman, R.: Dynamic Programming. Princeton University Press, Princeton (1957)zbMATHGoogle Scholar
  5. 5.
    Corelan Team. Exploit writing tutorial, part 11: heap spraying demystified (2011). (visited on 05/25/2016)
  6. 6.
    Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012). (issn: 0001-0782)CrossRefGoogle Scholar
  7. 7.
    Elovici, Y., Shabtai, A., Moskovitch, R., Tahan, G., Glezer, C.: Applying Machine Learning Techniques for Detection of Malicious Code in Network Traffic. In: Proceedings of the 30th annual German conference on advances in artificial intelligence. In: KI ’07, pp. 44–50. Springer-Verlag, Berlin, Heidelberg. isbn: 978- 3-540-74564-8 (2007)Google Scholar
  8. 8.
    Freund, Y., Schapire, R.: Experiments with a new boosting algorithm. In: Saitta, L. (ed.) Proceedings of the thirteenth international conference on machine learning (ICML 1996), pp. 148–156. Morgan Kaufmann (1996)Google Scholar
  9. 9.
    Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat Softw. 33(1), 1–22 (2010)CrossRefGoogle Scholar
  10. 10.
    Gong, P., Ye, J.: A modified orthant-wise limited memory quasi-Newton method with convergence analysis. In: Proceedings of 32nd international conference on machine learning, vol. 37, pp. 276–284 (2015)Google Scholar
  11. 11.
    Griffin, K., Schneider, S., Hu, X., Chiueh, T.-C.: Automatic generation of string signatures for malware detection. In: Lippmann, R., Clark, A. (eds.) Recent Advances in Intrusion Detection, RAID ’09 Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection, pp. 101–120 (2009)Google Scholar
  12. 12.
    Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. Intell. Syst. IEEE 24(2), 8–12 (2009)CrossRefGoogle Scholar
  13. 13.
    Henchiri, O., Japkowicz, N.: A Feature Selection and Evaluation Scheme for Computer Virus Detection. In: Proceedings of the 6th international conference on data mining. IEEE Computer Society, pp. 891–895. isbn: 0-7695-2701-9 (2006)Google Scholar
  14. 14.
    Ibrahim, A.H., Abdelhalim, M.B., Hussein, H., Fahmy, A.: Analysis of x86 instruction set usage for Windows 7 applications. In: 2nd international conference on computer technology & development, pp. 511–516 (2010)Google Scholar
  15. 15.
    Jain, S., Meena, Y.K.: Byte level n-gram analysis for malware detection. In: Venugopal, K.R., Patnaik, L.M. (eds.) Computer Networks and Intelligent Computing, pp. 51–59. Springer, Berlin Heidelberg (2011)Google Scholar
  16. 16.
    Kephart, J.O., Sorkin, G.B., Arnold, W.C., Chess, D.M., Tesauro, G.J., White, S.R.: Biologically Inspired Defenses Against Computer Viruses. In: Proceedings of the 14th international joint conference on artificial intelligence, vol. 1, pp. 985–996. Morgan Kaufmann (1995). (isbn: 1-55860-363-8)Google Scholar
  17. 17.
    Kolter, J.Z., Maloof, M.A.: Learning to detect and classify malicious executables in the wild. J. Mach. Learn. Res. 7, 2721–2744 (2006)MathSciNetzbMATHGoogle Scholar
  18. 18.
    Kolter, J.Z., Maloof, M.A.: Learning to detect malicious executablesin the wild. In: Proceedings of the 2004 ACM SIGKDD international conference on knowledge discovery and data mining, pp. 470–478. ACM Press (2004)Google Scholar
  19. 19.
    Lo, R.W., Levitt, K.N., Olsson, R.A.: Refereed paper: MCF: a malicious code filter. Comput. Secur. 14(6), 541–566 (1995). issn: 0167-4048CrossRefGoogle Scholar
  20. 20.
    Lyda, R., Hamrock, J.: Using entropy analysis to find encrypted and packed malware. IEEE Secur. Priv. Mag. 5(2), 40–45 (2007)CrossRefGoogle Scholar
  21. 21.
    Masud, M.M., Khan, L., Thuraisingham, B.: A scalable multi-level feature extraction technique to detect malicious executables. Inf. Syst. Front. 10(1), 33–45 (2008)CrossRefGoogle Scholar
  22. 22.
    Masud, M.M., Al-Khateeb, T.M., Hamlen, K.W., Gao, J., Khan, L., Han, J., Thuraisingham, B.: Cloudbased malware detection for evolving data streams. ACM Trans. Manag. Inf. Syst. 2(3), 1–27 (2011)CrossRefGoogle Scholar
  23. 23.
    Menahem, E., Shabtai, A., Rokach, L., Elovici, Y.: Improving malware detection by applying multi-inducer ensemble. Comput. Stat. Fata Anal. 53(4), 1483–1494 (2009). issn: 0167-9473MathSciNetCrossRefzbMATHGoogle Scholar
  24. 24.
    Microsoft Portable Executable and Common Object File Format Specification Version 8.3. Tech. rep. Microsoft, p. 98 (2013)Google Scholar
  25. 25.
    Moskovitch, R., Stopel, D., Feher, C., Nissim, N., Japkowicz, N., Elovici, Y.: Unknown malcode detection and the imbalance problem. J. Comput. Virol. 5(4), 295–308 (2009)CrossRefGoogle Scholar
  26. 26.
    Ng, A.Y.: Feature selection, \(L_{1}\) vs. \(L_{2}\) regularization, and rotational invariance. In: Proceedings of 21st international conference on machine learning, pp. 78–86 (2004)Google Scholar
  27. 27.
    Perdisci, R., Lanzi, A., Lee, W.: McBoost: boosting scalability in malware collection and analysis using statistical classification of executables. In: Annual computer security applications conference (ACSAC), pp. 301–310. IEEE (2008)Google Scholar
  28. 28.
    Quinlan, J.R.: C4.5: programs for machine learning. Vol. 1(3) of Morgan Kaufmann series in Machine Learning. Morgan Kaufmann (1993). isbn: 1558602380Google Scholar
  29. 29.
    Quist, D.: Open malware. (visited on 05/25/2016)
  30. 30.
    Reddy, D.K.S., Pujari, A.K.: N-gram analysis for computer virus detection. J. Comput. Virol. 2(3), 231–239 (2006)CrossRefGoogle Scholar
  31. 31.
    Roberts, J.-M.: Virus share. (visited on 05/25/2016)
  32. 32.
    Santos, I., Penya, Y.K., Devesa, J., Bringas, P.G.: N-grams-based file signatures for malware detection. In: Proceedings of 11th international conference on enterprise information systems, pp. 317–320 (2009)Google Scholar
  33. 33.
    Schultz, M., Eskin, E., Zadok, F., Stolfo, S.: Data mining methods for detection of new malicious executables. In: Proceedings of IEEE symposium on security and privacy, pp. 38–49 (2001)Google Scholar
  34. 34.
    Shabtai, A., Moskovitch, R., Elovici, Y., Glezer, C.: Detection of malicious code by applying machine learning classifiers on static features: a state-of-the-art survey. Inf. Secur. Tech. Rep. 14(1), 16–29 (2009). issn: 1363-4127CrossRefGoogle Scholar
  35. 35.
    Shafiq, M.Z., Tabish, S.M., Mirza, F., Farooq, M.: PE-Miner: mining structural information to detect malicious executables in realtime. In: Lippmann, R., Clark, A. (eds.) Recent Advances in Intrusion Detection, Springer, Berlin Heidelberg, pp. 121–141 (2009)Google Scholar
  36. 36.
    Stolfo, S.J., Wang, K., Li, W.-J.: Towards stealthy malware detection. In: Christodorescu, M., Jha, S., Maughan, D., Song, D., Wang, C. (eds.) Malware Detection, pp. 231–249. Springer, Berlin Heidelberg (2007). isbn: 978-0-387-44599-1Google Scholar
  37. 37.
    Tahan, G., Rokach, L., Shahar, Y.: Mal-ID: automatic malware detection using common segment analysis and meta-features. J. Mach. Learn. Res. 13, 949–979 (2012). issn: 1532-4435MathSciNetzbMATHGoogle Scholar
  38. 38.
    Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58(1), 267–288 (1994)MathSciNetzbMATHGoogle Scholar
  39. 39.
    Verleysen, M., François, D.: The Curse of Dimensionality in Data Mining and Time Series Prediction. In: Cabestany, J., Prieto, A., Sandoval, F. (eds.) Proceedings of 8th international conference on artificial neural networks: computational intelligence and bioinspired systems, pp. 758–770 (2005)Google Scholar
  40. 40.
    Yuan, G.-X., Chang, K.-W., Hsieh, C.-J., Lin, C.-J.: A comparison of optimization methods and software for large-scale \(L_{1}\)-regularized linear classification. J. Mach. Learn. Res. 11, 3183–3234 (2010)MathSciNetzbMATHGoogle Scholar
  41. 41.
    Yuan, G.-X., Ho, C.-H., Lin, C.-J.: An improved GLMNET for \(L_{1}\)-regularized logistic regression. J. Mach. Learn. Res. 13, 1999–2030 (2012)MathSciNetzbMATHGoogle Scholar
  42. 42.
    Zhang, B., Yin, J., Hao, J., Zhang, D., Wang, S.: Malicious codes detection based on ensemble learning. In: Proceedings of the 4th international conference on autonomic and trusted computing, pp. 468–477. Springer-Verlag (2007). isbn: 3-540-73546-1Google Scholar
  43. 43.
    Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67(2), 301–320 (2005)MathSciNetCrossRefzbMATHGoogle Scholar

Copyright information

© Springer-Verlag France (outside the USA) 2016

Authors and Affiliations

  1. 1.Computer Science and Electrical EngineeringUniversity of Maryland, Baltimore CountyBaltimoreUSA
  2. 2.Laboratory for Physical SciencesCollege ParkUSA

Personalised recommendations