Using String Information for Malware Family Identification

  • Prasha Shrestha
  • Suraj Maharjan
  • Gabriela Ramírez de la Rosa
  • Alan Sprague
  • Thamar Solorio
  • Gary Warner
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8864)

Abstract

Classifying malware into the correct families is an important task for anti-virus vendors. Currently, only some vendors recognize a particular malware sample. Even when they do, they either classify it into different families or use a generic family name, which does not provide much information. Our method for malware family identification is based on the observation that closely related malware samples share a heavy overlap of strings. We first created two kinds of prototypes from the printable strings in the malware: one using term frequency–inverse document frequency (tf-idf) and the other using the prominent strings extracted from the vocabulary. We then used these prototypes for classification. We achieved an accuracy of 91.02% by considering the entire vocabulary and an accuracy of 80.52% by considering 20 prominent strings for each malware family. This accuracy is high enough for our system to be used to classify even malware that confuses the anti-virus vendors.
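The abstract gives no implementation details, so the following is only a minimal sketch of how a tf-idf prototype classifier with cosine similarity could look. Everything specific here is an assumption rather than the authors' method: the function names (`printable_strings`, `build_prototypes`, `classify`), the minimum string length of 4, the smoothed idf formula, and the choice of the mean tf-idf vector as the family prototype are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def printable_strings(data, min_len=4):
    """Extract runs of printable ASCII from raw bytes (min_len is an assumed cutoff)."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


def idf_weights(docs):
    """Compute idf per string; the '+ 1' smoothing is an assumption, not from the paper."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}


def tfidf_vector(doc, idf):
    """Sparse tf-idf vector; strings unseen during training are ignored."""
    tf = Counter(doc)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def build_prototypes(samples, idf):
    """Average the tf-idf vectors of each family (assumed prototype definition)."""
    sums = defaultdict(Counter)
    counts = Counter()
    for family, doc in samples:
        for term, weight in tfidf_vector(doc, idf).items():
            sums[family][term] += weight
        counts[family] += 1
    return {
        family: {t: w / counts[family] for t, w in vec.items()}
        for family, vec in sums.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(doc, prototypes, idf):
    """Return the family whose prototype is most similar to the sample."""
    vec = tfidf_vector(doc, idf)
    return max(prototypes, key=lambda fam: cosine(vec, prototypes[fam]))
```

In use, each training sample would be a `(family, strings)` pair, `idf_weights` would be computed over all training string lists, and an unknown binary would be classified by passing the output of `printable_strings` to `classify`. The prominent-strings variant described in the abstract is not sketched here because the abstract does not say how prominent strings are selected.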

Keywords

Malware, Prototype based classification, Prominent strings, Tf-idf, Cosine similarity

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Prasha Shrestha (1)
  • Suraj Maharjan (1)
  • Gabriela Ramírez de la Rosa (2)
  • Alan Sprague (1)
  • Thamar Solorio (1)
  • Gary Warner (1)

  1. University of Alabama at Birmingham, Birmingham, USA
  2. Universidad Autónoma Metropolitana, Unidad Cuajimalpa, Mexico, Mexico
