Abstract
The sheer volume of new malware samples presents big data challenges for antivirus vendors. Not only does the metadata for tens (or even hundreds) of millions of samples need to be stored, but this data also needs to be clustered, that is, mined to find groups of related samples. Existing techniques cannot easily scale to the volume of samples already arriving today, let alone the volume expected in the future. This paper proposes the use of a data structure called an aggregation overlay graph to simplify these problems. By exploiting the similarities shared between most malware variants, we can reduce the total volume of metadata by more than an order of magnitude without any loss of information. Furthermore, by including a wide variety of features from each sample, this process of reduction also creates groups of similar samples, yielding a clustering technique that is capable of handling extremely high volumes. The versatility of this approach is demonstrated by applying it not only to large corpora of Windows PE metadata, but also to Android APK files.
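The abstract does not say how the aggregation overlay graph is built, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes each sample is described by a set of feature strings (the feature names and the helper names `build_overlay`, `expand` and `edge_count` are hypothetical). Features that always occur in exactly the same samples are bundled behind a single aggregate node, so each sample stores one link to the bundle instead of one link per feature.

```python
from collections import defaultdict

def build_overlay(samples):
    """samples: dict of sample id -> set of feature strings (assumed input shape)."""
    # Inverted index: feature -> set of samples that carry it.
    owners = defaultdict(set)
    for sample, features in samples.items():
        for feature in features:
            owners[feature].add(sample)

    # Features with an identical owner set always co-occur, so they can be bundled.
    bundles = defaultdict(set)
    for feature, sample_set in owners.items():
        bundles[frozenset(sample_set)].add(feature)

    aggregates = {}                      # aggregate id -> features it stands for
    links = {s: set() for s in samples}  # sample -> direct features or aggregate ids
    for sample_set, features in sorted(bundles.items(), key=lambda kv: sorted(kv[1])):
        if len(sample_set) > 1 and len(features) > 1:
            # Shared bundle: one aggregate node replaces many sample-feature edges.
            agg_id = ("agg", len(aggregates))  # tuple ids cannot clash with string features
            aggregates[agg_id] = frozenset(features)
            for sample in sample_set:
                links[sample].add(agg_id)
        else:
            # Nothing to share: keep the direct edges.
            for sample in sample_set:
                links[sample].update(features)
    return aggregates, links

def expand(sample, aggregates, links):
    """Recover the original feature set of a sample from the overlay."""
    features = set()
    for node in links[sample]:
        features |= aggregates.get(node, {node})
    return features

def edge_count(aggregates, links):
    """Edges stored: sample links plus aggregate-to-feature links."""
    return sum(len(v) for v in links.values()) + sum(len(f) for f in aggregates.values())

# Hypothetical metadata for three variants of one family.
samples = {
    "a": {"imp:kernel32", "sec:.text", "packer:upx", "size:4096"},
    "b": {"imp:kernel32", "sec:.text", "packer:upx", "size:8192"},
    "c": {"imp:kernel32", "sec:.text", "packer:upx", "size:4096"},
}
aggregates, links = build_overlay(samples)
```

In this toy input the three shared features collapse into one aggregate node, and `expand` returns each sample's original feature set, so nothing is lost. The samples that link to the same aggregate also form a natural group, which is the sense in which the reduction doubles as clustering. Savings grow with the number of samples and shared features per bundle; a real system would need a smarter way to choose bundles than exact co-occurrence.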
Appendix: A family of variants
See Table 3.
Cite this article
Asquith, M. Extremely scalable storage and clustering of malware metadata. J Comput Virol Hack Tech 12, 49–58 (2016). https://doi.org/10.1007/s11416-015-0241-3
DOI: https://doi.org/10.1007/s11416-015-0241-3