Extremely scalable storage and clustering of malware metadata

  • Original Paper
Journal of Computer Virology and Hacking Techniques

Abstract

The sheer volume of new malware samples presents big data challenges for antivirus vendors. Not only does the metadata for tens (or even hundreds) of millions of samples need to be stored, but all of this data also needs to be clustered, that is, mined to find groups of related samples. Existing techniques cannot easily scale to the volume of samples already arriving today, let alone the volumes we expect to receive in the future. This paper proposes the use of a data structure called an aggregation overlay graph to simplify these problems. By exploiting the similarities shared between most malware variants, we can reduce the total volume of metadata by more than an order of magnitude without any loss of information. Furthermore, because a wide variety of features from each sample is included, this process of reduction also creates groups of similar samples, yielding a clustering technique that is capable of handling extremely high volumes. The versatility of this approach is demonstrated by applying it not only to large corpora of Windows PE metadata, but also to Android APK files.
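The idea of factoring shared metadata into virtual nodes can be illustrated with a small sketch. This is not the paper's algorithm, which is not given in this preview; it is a naive greedy heuristic written only to show why the reduction is lossless and why the virtual nodes double as clusters. The feature strings, the function names (`build_overlay`, `expand`, `edge_count`) and the savings formula are all assumptions made for the illustration.

```python
from itertools import combinations


def build_overlay(samples):
    """Greedy sketch: factor feature sets shared by several samples into virtual nodes.

    samples maps a sample id to its set of feature strings (hypothetical data).
    Returns (direct, links, vnodes): features still attached to each sample,
    the virtual nodes each sample points to, and each virtual node's features.
    """
    direct = {s: set(f) for s, f in samples.items()}
    links = {s: [] for s in samples}
    vnodes = {}
    while True:
        best, best_saving = None, 0
        for a, b in combinations(direct, 2):
            shared = frozenset(direct[a] & direct[b])
            if len(shared) < 2:
                continue
            members = [s for s, f in direct.items() if shared <= f]
            # Edges removed from the members minus edges the virtual node adds.
            saving = len(shared) * len(members) - (len(shared) + len(members))
            if saving > best_saving:
                best, best_saving = (shared, members), saving
        if best is None:
            break  # no remaining factorisation reduces the edge count
        shared, members = best
        vid = f"vnode_{len(vnodes)}"
        vnodes[vid] = shared
        for s in members:
            direct[s] -= shared
            links[s].append(vid)
    return direct, links, vnodes


def expand(sample, direct, links, vnodes):
    """Recover a sample's full feature set, showing that nothing was lost."""
    feats = set(direct[sample])
    for vid in links[sample]:
        feats |= vnodes[vid]
    return feats


def edge_count(direct, links, vnodes):
    """Total number of edges stored in the overlay graph."""
    return (sum(len(f) for f in direct.values())
            + sum(len(v) for v in links.values())
            + sum(len(f) for f in vnodes.values()))


# Hypothetical metadata: three near-identical variants and one unrelated sample.
samples = {
    "a": {"imphash:1", "section:.text", "packer:upx", "size:10k"},
    "b": {"imphash:1", "section:.text", "packer:upx", "size:12k"},
    "c": {"imphash:1", "section:.text", "packer:upx", "size:14k"},
    "d": {"imphash:9", "section:.rsrc"},
}
direct, links, vnodes = build_overlay(samples)
original_edges = sum(len(f) for f in samples.values())
overlay_edges = edge_count(direct, links, vnodes)
clusters = {v: sorted(s for s in links if v in links[s]) for v in vnodes}
```

In this toy input the three variants end up attached to a single virtual node, so the virtual node both stores their common features once and identifies them as a group. The pairwise search is quadratic in the number of samples and would not scale to the volumes the abstract discusses; it is used here only to keep the sketch short.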

Author information

Correspondence to Matthew Asquith.

Appendix: A family of variants

See Table 3.

Table 3 A subset of the samples clustered around the same virtual node (vnode_16028 in Fig. 6)

Cite this article

Asquith, M. Extremely scalable storage and clustering of malware metadata. J Comput Virol Hack Tech 12, 49–58 (2016). https://doi.org/10.1007/s11416-015-0241-3

  • DOI: https://doi.org/10.1007/s11416-015-0241-3
