Comparing files using structural entropy

  • Ivan Sorokin
Original paper


One of the main trends in the modern anti-virus industry is the development of algorithms that estimate the similarity of files. Because malware writers use increasingly complex techniques such as obfuscation and polymorphism to protect their code, anti-virus vendors face three problems: file scanning becomes harder, anti-virus databases grow considerably, and file storage becomes overgrown. Static analysis of files is a promising way to address these problems, because it determines the file characteristics needed for comparison without executing malware samples in a protected environment. The solution presented in this article is based on the assumption that different samples of the same malicious program have a similar order of code and data areas. Each such area can be characterized not only by its length but also by its homogeneity; in other words, a file can be characterized by the complexity of its data order. Our approach uses wavelet analysis to split a file into segments of different entropy levels, and then uses the edit distance between the resulting segment sequences to determine the similarity of files. The proposed solution has several advantages that help detect malicious programs efficiently on personal computers. First, the comparison does not take into account the functionality of the analysed files and relies solely on the similarity of code and data area positions, which makes the algorithm effective against many ways of protecting executable code. On the other hand, such a comparison may produce false alarms, so our solution is most useful as a preliminary test that triggers additional checks. Second, the method is relatively easy to implement and requires neither code disassembly nor emulation. Third, the method keeps the record of a malicious file compact, which is significant when compiling anti-virus databases.
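The pipeline outlined in the abstract (entropy profile, segmentation, edit distance over segment sequences) can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give the wavelet, the scales, or the thresholds, so the single-scale Haar-style difference, the 256-byte window, the function names, and all tolerance values below are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def window_entropy(data, window=256):
    """Shannon entropy (bits per byte) of consecutive non-overlapping windows."""
    series = []
    for start in range(0, len(data), window):
        chunk = data[start:start + window]
        n = len(chunk)
        counts = Counter(chunk)
        series.append(sum(c / n * math.log2(n / c) for c in counts.values()))
    return series


def segment(series, scale=2, threshold=1.0):
    """Cut the entropy series where a Haar-style detail coefficient peaks.

    The coefficient at position i is the mean of the next `scale` values minus
    the mean of the previous `scale` values (an assumed stand-in for the
    paper's wavelet analysis). Returns a list of (length, mean_entropy) pairs.
    """
    if not series:
        return []
    n = len(series)
    mag = {}
    for i in range(scale, n - scale + 1):
        after = sum(series[i:i + scale]) / scale
        before = sum(series[i - scale:i]) / scale
        mag[i] = abs(after - before)
    # Keep only local maxima of the coefficient magnitude above the threshold.
    cuts = [i for i, m in mag.items()
            if m >= threshold and m > mag.get(i - 1, 0.0) and m >= mag.get(i + 1, 0.0)]
    bounds = [0] + cuts + [n]
    return [(b - a, sum(series[a:b]) / (b - a)) for a, b in zip(bounds, bounds[1:])]


def similar(a, b, entropy_tol=1.0, length_factor=2):
    """Two segments match if their entropy and length are close (assumed rule)."""
    close_entropy = abs(a[1] - b[1]) <= entropy_tol
    close_length = max(a[0], b[0]) <= length_factor * min(a[0], b[0])
    return close_entropy and close_length


def edit_distance(xs, ys, same=similar):
    """Wagner-Fischer edit distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        cur = [i]
        for j, y in enumerate(ys, 1):
            cost = 0 if same(x, y) else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def structural_distance(file_a, file_b):
    """Edit distance between the entropy-segment sequences of two byte strings."""
    return edit_distance(segment(window_entropy(file_a)),
                         segment(window_entropy(file_b)))
```

With these assumptions, two byte strings that share the same layout of low-entropy and high-entropy areas but differ in their actual bytes get a distance of 0, while a file with a different layout gets a positive distance. A real system would also need to normalise the distance and tune the thresholds on labelled samples.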


Discrete Wavelet Transform Wavelet Analysis Mother Wavelet Edit Distance Number Entropy 



Copyright information

© Springer-Verlag France 2011

Authors and Affiliations

  1. Doctor Web’s Virus Lab, Ltd., Saint-Petersburg, Russia
