Skip to main content

AVclass: A Tool for Massive Malware Labeling

  • Conference paper
  • First Online:
Book cover Research in Attacks, Intrusions, and Defenses (RAID 2016)

Part of the book series: Lecture Notes in Computer Science ((LNSC,volume 9854))

Abstract

Labeling a malicious executable as a variant of a known family is important for security applications such as triage, lineage, and for building reference datasets in turn used for evaluating malware clustering and training malware classification approaches. Oftentimes, such labeling is based on labels output by antivirus engines. While AV labels are well-known to be inconsistent, there is often no other information available for labeling, thus security analysts keep relying on them. However, current approaches for extracting family information from AV labels are manual and inaccurate. In this work, we describe AVclass, an automatic labeling tool that given the AV labels for a, potentially massive, number of samples outputs the most likely family names for each sample. AVclass implements novel automatic techniques to address 3 key challenges: normalization, removal of generic tokens, and alias detection. We have evaluated AVclass on 10 datasets comprising 8.9 M samples, larger than any dataset used by malware clustering and classification works. AVclass leverages labels from any AV engine, e.g., all 99 AV engines seen in VirusTotal, the largest engine set in the literature. AVclass’s clustering achieves F1 measures up to 93.9 on labeled datasets and clusters are labeled with fine-grained family names commonly used by the AV vendors. We release AVclass to the community.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://github.com/malicialab/avclass.

  2. 2.

    We check the sample’s MD5, SHA1, and SHA256 hashes.

References

  1. Arp, D., Spreitzenbarth, M., Huebner, M., Gascon, H., Rieck, K.: Drebin: efficient and explainable detection of android malware in your pocket. In: Network and Distributed System Security (2014)

    Google Scholar 

  2. Bailey, M., Oberheide, J., Andersen, J., Mao, Z.M., Jahanian, F., Nazario, J.: Automated classification and analysis of internet malware. In: International Symposium on Recent Advances in Intrusion Detection (2007)

    Google Scholar 

  3. Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: Network and Distributed System Security (2009)

    Google Scholar 

  4. Beck, D., Connolly, J.: The common malware enumeration initiative. In: Virus Bulletin Conference (2006)

    Google Scholar 

  5. Bureau, P.-M., Harley, D.: A dose by any other name. In: Virus Bulletin Conference (2008)

    Google Scholar 

  6. Canto, J., Dacier, M., Kirda, E., Leita, C.: Large scale malware collection: lessons learned. In: IEEE SRDS Workshop on Sharing Field Data and Experiment Measurements on Resilience of Distributed Computing Systems (2008)

    Google Scholar 

  7. CARO Virus Naming Convention. http://www.caro.org/articles/naming.html

  8. Dahl, G.E., Stokes, J.W., Deng, L., Yu, D.: Large-scale malware classification using random projections and neural networks. In: IEEE International Conference on Acoustics, Speech and Signal Processing (2013)

    Google Scholar 

  9. Gashi, I., Sobesto, B., Mason, S., Stankovic, V., Cukier, M.: A study of the relationship between antivirus regressions and label changes. In: International Symposium on Software Reliability Engineering (2013)

    Google Scholar 

  10. Harley, D.: The game of the name: malware naming, shape shifters and sympathetic magic. In: International Conference on Cybercrime Forensics Education & Training (2009)

    Google Scholar 

  11. Huang, W., Stokes, J.W.: MtNet: a multi-task neural network for dynamic malware classification. In: Detection of Intrusions and Malware, and Vulnerability Assessment (2016)

    Google Scholar 

  12. Hurier, M., Allix, K., Bissyandé, T., Klein, J., Traon, Y.L.: On the lack of consensus in anti-virus decisions: metrics and insights on building ground truths of android malware. In: Detection of Intrusions and Malware, and Vulnerability Assessment (2016)

    Google Scholar 

  13. Jang, J., Brumley, D., Venkataraman, S.: BitShred: feature hashing malware for scalable triage and semantic analysis. In: ACM Conference on Computer and Communications Security (2011)

    Google Scholar 

  14. Kantchelian, A., Tschantz, M.C., Afroz, S., Miller, B., Shankar, V., Bachwani, R., Joseph, A.D., Tygar, J.: Better malware ground truth: techniques for weighting anti-virus vendor labels. In: ACM Workshop on Artificial Intelligence and Security (2015)

    Google Scholar 

  15. Kotzias, P., Matic, S., Rivera, R., Caballero, J.: Certified PUP: abuse in authenticode code signing. In: ACM Conference on Computer and Communication Security (2015)

    Google Scholar 

  16. Li, P., Liu, L., Gao, D., Reiter, M.K.: On challenges in evaluating malware clustering. In: Jha, S., Sommer, R., Kreibich, C. (eds.) RAID 2010. LNCS, vol. 6307, pp. 238–255. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

  17. Lindorfer, M., Neugschwandtner, M., Weichselbaum, L., Fratantonio, Y., van der Veen, V., Platzer, C.: ANDRUBIS-1,000,000 apps later: a view on current android malware behaviors. In: International Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (2014)

    Google Scholar 

  18. Maggi, F., Bellini, A., Salvaneschi, G., Zanero, S.: Finding non-trivial malware naming inconsistencies. In: International Conference on Information Systems Security (2011)

    Google Scholar 

  19. Miller, B., Kantchelian, A., Tschantz, M.C., Afroz, S., Bachwani, R., Faizullabhoy, R., Huang, L., Shankar, V., Wu, T., Yiu, G., Joseph, A.D., Tygar, J.D.: Reviewer integration and performance measurement for malware detection. In: Caballero, J., Zurutuza, U., Rodríguez, R.J. (eds.) DIMVA 2016. LNCS, vol. 9721, pp. 122–141. Springer, Heidelberg (2016). doi:10.1007/978-3-319-40667-1_7

    Chapter  Google Scholar 

  20. Mohaisen, A., Alrawi, O.: AV-Meter: an evaluation of antivirus scans and labels. In: Detection of Intrusions and Malware, and Vulnerability Assessment (2014)

    Google Scholar 

  21. Nappa, A., Rafique, M.Z., Caballero, J.: The MALICIA dataset: identification and analysis of drive-by download operations. Int. J. Inf. Secur. 14(1), 15–33 (2015)

    Article  Google Scholar 

  22. Oberheide, J., Cooke, E., Jahanian, F.: CloudAV: N-version antivirus in the network cloud. In: USENIX Security Symposium (2008)

    Google Scholar 

  23. Perdisci, R., Lanzi, A., Lee, W.: McBoost: boosting scalability in malware collection and analysis using statistical classification of executables. In: Annual Computer Security Applications Conference (2008)

    Google Scholar 

  24. Perdisci, R., Lee, W., Feamster, N.: Behavioral clustering of HTTP-based malware and signature generation using malicious network traces. In: USENIX Symposium on Networked Systems Design and Implementation (2010)

    Google Scholar 

  25. Perdisci, R., ManChon, U.: VAMO: towards a fully automated malware clustering validity analysis. In: Annual Computer Security Applications Conference (2012)

    Google Scholar 

  26. Rafique, M.Z., Caballero, J.: FIRMA: malware clustering and network signature generation with mixed network behaviors. In: Stolfo, S.J., Stavrou, A., Wright, C.V. (eds.) RAID 2013. LNCS, vol. 8145, pp. 144–163. Springer, Heidelberg (2013)

    Chapter  Google Scholar 

  27. Rajab, M.A., Ballard, L., Lutz, N., Mavrommatis, P., Provos, N., CAMP: content-agnostic malware protection. In: Network and Distributed System Security (2013)

    Google Scholar 

  28. Rieck, K., Holz, T., Willems, C., Düssel, P., Laskov, P.: Learning and classification of malware behavior. In: Detection of Intrusions and Malware, and Vulnerability Assessment (2008)

    Google Scholar 

  29. Rieck, K., Trinius, P., Willems, C., Holz, T.: Automatic analysis of malware behavior using machine learning. J. Comput. Secur. 19(4), 639–668 (2011)

    Article  Google Scholar 

  30. Virusshare. http://virusshare.com/

  31. Virustotal. https://virustotal.com/

  32. Yang, C., Xu, Z., Gu, G., Yegneswaran, V., Porras, P.: DroidMiner: automated mining and characterization of fine-grained malicious behaviors in android applications. In: European Symposium on Research in Computer Security (2014)

    Google Scholar 

  33. Zhou, Y., Jiang, X.: Dissecting android malware: characterization and evolution. In: IEEE Symposium on Security and Privacy (2012)

    Google Scholar 

Download references

Acknowledgments

We specially thank Manos Antonakakis and Martina Lindorfer for providing us with the University and Andrubis datasets, respectively. We also thank the authors of the Drebin, MalGenome, Malheur, Malicia, and the Malicious Content Detection Platform datasets for making them publicly available. We are grateful to Srdjan Matic for his assistance with the plots, Davide Balzarotti and Chaz Lever for useful discussions, VirusTotal for their support, and Pavel Laskov for his help to improve this manuscript.

This research was partially supported by the Regional Government of Madrid through the N-GREENS Software-CM project S2013/ICE-2731 and by the Spanish Government through the Dedetis Grant TIN2015-7013-R. All opinions, findings and conclusions, or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the sponsors.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Richard Rivera .

Editor information

Editors and Affiliations

A Additional Results

A Additional Results

Table 10. Accuracy numbers reported by prior clustering works.
Table 11. AV engines found in our datasets and their lifetime in days.

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Sebastián, M., Rivera, R., Kotzias, P., Caballero, J. (2016). AVclass: A Tool for Massive Malware Labeling. In: Monrose, F., Dacier, M., Blanc, G., Garcia-Alfaro, J. (eds) Research in Attacks, Intrusions, and Defenses. RAID 2016. Lecture Notes in Computer Science(), vol 9854. Springer, Cham. https://doi.org/10.1007/978-3-319-45719-2_11

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-45719-2_11

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45718-5

  • Online ISBN: 978-3-319-45719-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics