An investigation of byte n-gram features for malware classification

  • Edward Raff
  • Richard Zak
  • Russell Cox
  • Jared Sylvester
  • Paul Yacci
  • Rebecca Ward
  • Anna Tracy
  • Mark McLean
  • Charles Nicholas
Original Paper

DOI: 10.1007/s11416-016-0283-1

Cite this article as:
Raff, E., Zak, R., Cox, R. et al. J Comput Virol Hack Tech (2016). doi:10.1007/s11416-016-0283-1


Malware classification using machine learning algorithms is a difficult task, in part due to the absence of strong natural features in raw executable binary files. Byte n-grams have previously been used as features, but little work has been done to explain their performance or to understand what concepts are actually being learned. In contrast to other work using n-gram features, in this work we use orders of magnitude more data, and we perform feature selection during model building using Elastic-Net regularized Logistic Regression. We compute a regularization path and analyze novel multi-byte identifiers. Through this process, we discover significant, previously unreported issues with byte n-gram features that cause their benefits and practicality to be overestimated. Three primary issues emerged from our work. First, we discovered a flaw in how previous corpora were created that leads to an overestimation of classification accuracy. Second, we discovered that most of the information contained in n-grams stems from string features that could be obtained in simpler ways. Finally, we demonstrate that n-gram features promote overfitting, even with linear models and extreme regularization.
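To make the terms in the abstract concrete, the following is a minimal sketch of byte n-gram extraction and of an Elastic-Net penalty. It is not the authors' implementation: the function names, the window size n = 4, and the glmnet-style parametrization of the penalty (mixing weight `alpha`, strength `lam`) are all assumptions chosen for illustration.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4):
    """Yield every contiguous n-byte window of `data` (n = 4 is an arbitrary choice)."""
    for i in range(len(data) - n + 1):
        yield data[i:i + n]


def document_frequency(files, n: int = 4) -> Counter:
    """Count, for each n-gram, the number of files in which it occurs at least once."""
    df = Counter()
    for data in files:
        # A set ensures each file contributes at most once per n-gram.
        df.update(set(byte_ngrams(data, n)))
    return df


def elastic_net_penalty(weights, alpha: float = 0.5, lam: float = 1.0) -> float:
    """Assumed parametrization: lam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return lam * (alpha * l1 + (1.0 - alpha) * 0.5 * l2_squared)


if __name__ == "__main__":
    # Hypothetical byte strings standing in for executable contents.
    corpus = [b"MZ\x90\x00\x03\x00", b"MZ\x90\x00\x04\x00"]
    for gram, count in document_frequency(corpus).most_common(3):
        print(gram.hex(), count)
```

The L1 term in such a penalty drives many weights to exactly zero, which is what allows a regularized linear model to perform feature selection while it is being fit; the L2 term keeps the solution stable when many n-grams are correlated.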


Malware classification, Byte n-grams, Multi-byte identifier, Elastic-Net

Copyright information

© Springer-Verlag France (outside the USA) 2016

Authors and Affiliations

  1. Computer Science and Electrical Engineering, University of Maryland, Baltimore County, Baltimore, USA
  2. Laboratory for Physical Sciences, College Park, USA
