Data Mining and Knowledge Discovery, Volume 33, Issue 1, pp 168–203

Learning edge weights in file co-occurrence graphs for malware detection

  • Weixuan Mao
  • Zhongmin Cai
  • Bo Zeng
  • Xiaohong Guan

Abstract

Cloud-based security services generate a new type of security data that records the occurrence of executable files on end hosts. Based on this data, semi-supervised learning on the file co-occurrence graph offers a novel perspective for malware detection. The edge weight, which quantifies the correlation between the labels (benign or malicious) of co-occurring files, plays a significant role in such techniques. Whereas previous work defined edge weights in the file co-occurrence graph heuristically, this paper develops a novel framework that learns the edge weights by minimizing the error under the harmonic property implied by the graph. Our method is proven to achieve the unique globally optimal edge weights in the graph of training instances. Furthermore, using the learned edge weights between co-occurring files, we develop a graph-based semi-supervised learning method for malware detection. Experimental results on a real-world dataset, consisting of 12,469 benign and 11,327 malicious executable files from 11,713,031 end hosts, demonstrate the efficacy of our method. Our malware detection approach with the learned edge weights significantly outperforms existing approaches that use common heuristic edge weights.
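
As a rough illustration of the harmonic property referred to in the abstract, the sketch below propagates labels on a tiny weighted graph so that each unlabeled file's score becomes the weighted average of its neighbors' scores. It is not the edge-weight learning framework of the paper: the edge weights are fixed by hand, the graph is a simplified file-to-file graph, and the function name `harmonic_scores`, the node names, and the 0/1 label encoding are all assumptions made for this example.

```python
def harmonic_scores(weights, labels, n_iter=200):
    """Iteratively enforce the harmonic property on unlabeled nodes.

    weights: dict mapping (u, v) node pairs to a non-negative edge weight
             (treated as undirected; hypothetical hand-picked values).
    labels:  dict mapping labeled nodes to 0.0 (benign) or 1.0 (malicious).
    """
    # Build a symmetric adjacency map from the edge list.
    adj = {}
    for (u, v), w in weights.items():
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w

    # Labeled nodes keep their label; unlabeled nodes start at a neutral 0.5.
    scores = {node: labels.get(node, 0.5) for node in adj}

    for _ in range(n_iter):
        for node, nbrs in adj.items():
            if node in labels:
                continue  # labeled nodes are held fixed
            total = sum(nbrs.values())
            if total > 0:
                # Harmonic property: score = weighted average of neighbor scores.
                scores[node] = sum(w * scores[n] for n, w in nbrs.items()) / total
    return scores


# Toy graph (assumed for illustration): file "x" co-occurs strongly with a
# known-malicious file "a" and weakly with a known-benign file "b".
edges = {("a", "x"): 3.0, ("x", "b"): 1.0}
known = {"a": 1.0, "b": 0.0}
scores = harmonic_scores(edges, known)
print(scores["x"])
```

In this toy case the score of `x` is pulled toward the malicious label because its edge to `a` carries three times the weight of its edge to `b`, which is why the choice of edge weights matters for the final prediction.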

Keywords

Malware detection · Edge-weight learning · Graph based semi-supervised learning · Bipartite graph

Notes

Acknowledgements

The research presented in this paper is supported in part by the National Natural Science Foundation (61772415, U1736205, 61221063, 61403301), National Key R&D Project (2016YFB0901900), 111 International Collaboration Program of China, National Key Research and Development Program of China (2016YFB0801501), 863 High Tech Development Plan (2012AA011003), Research Fund for Doctoral Program of Higher Education of China (20090201120032), International Research Collaboration Project of Shaanxi Province (2013KW11), Fundamental Research Funds for Central Universities (2012jdhz08).

Copyright information

© The Author(s) 2018

Authors and Affiliations

  • Weixuan Mao (1, 2)
  • Zhongmin Cai (1, corresponding author)
  • Bo Zeng (3)
  • Xiaohong Guan (1)

  1. MOE KLINNS Lab, Xi’an Jiaotong University, Xi’an, China
  2. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China
  3. Department of Industrial Engineering, University of Pittsburgh, Pittsburgh, USA
