
Topology-Aware Hashing for Effective Control Flow Graph Similarity Analysis

  • Conference paper
Security and Privacy in Communication Networks (SecureComm 2019)

Abstract

Control Flow Graph (CFG) similarity analysis is an essential technique for a variety of security analysis tasks, including malware detection and malware clustering. Even though various algorithms have been developed, existing CFG similarity analysis methods still suffer from limited efficiency, accuracy, and usability. In this paper, we propose a novel fuzzy hashing scheme called topology-aware hashing (TAH) for effective and efficient CFG similarity analysis. Given the CFGs constructed from program binaries, we extract blended n-gram graphical features of the CFGs, encode the graphical features into numeric vectors (called graph signatures), and then measure the graph similarity by comparing the graph signatures. We further employ a fuzzy hashing technique to convert the numeric graph signatures into smaller fixed-size fuzzy hash signatures for efficient similarity calculation. Our comprehensive evaluation demonstrates that TAH is more effective and efficient than existing CFG comparison techniques. To demonstrate the applicability of TAH to real-world security analysis tasks, we develop a binary similarity analysis tool based on TAH, and show that it outperforms existing similarity analysis tools in malware clustering.
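The pipeline described in the abstract (n-gram graph features, a numeric graph signature, signature comparison, then a smaller fixed-size signature) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that each basic block has already been abstracted to one of 16 node types (consistent with the 16 one-gram features mentioned in the notes), that an n-gram is the type sequence along a directed walk of n nodes, and that signatures are compared with cosine similarity. The CRC32 bucket folding in `compress` is only a stand-in for the paper's fuzzy hashing step, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt
import zlib


def ngram_signature(types, edges, max_n=3):
    """Count node-type sequences along every directed walk of 1..max_n nodes.

    types: dict mapping node id -> abstract node type (assumed 0..15)
    edges: dict mapping node id -> list of successor node ids
    """
    sig = Counter()
    walks = [(v,) for v in types]  # all walks of one node
    for _ in range(max_n):
        for walk in walks:
            sig[tuple(types[v] for v in walk)] += 1
        # Extend every walk by one outgoing edge to get the next length.
        walks = [w + (nxt,) for w in walks for nxt in edges.get(w[-1], ())]
    return sig


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compress(sig, size=64):
    """Fold a sparse signature into a fixed-size vector (illustrative only)."""
    out = [0] * size
    for feature, count in sig.items():
        out[zlib.crc32(bytes(feature)) % size] += count
    return out


# Two toy CFGs: a diamond, and the same diamond with one extra exit block.
types1 = {0: 1, 1: 2, 2: 3, 3: 1}
edges1 = {0: [1, 2], 1: [3], 2: [3]}
types2 = {0: 1, 1: 2, 2: 3, 3: 1, 4: 5}
edges2 = {0: [1, 2], 1: [3], 2: [3], 3: [4]}

s1 = ngram_signature(types1, edges1)
s2 = ngram_signature(types2, edges2)
print("similarity:", cosine(s1, s2))
print("compressed length:", len(compress(s1)))
```

Folding into a fixed number of buckets keeps every signature the same size regardless of how many distinct n-grams a graph contains, at the cost of occasional collisions between unrelated features.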


Notes

  1. There are 16 1-gram features, \(16^2\) 2-gram features, \(16^3\) 3-gram features, \(16^4\) 4-gram features, and \(16^5\) 5-gram features.

  2. The parameter selection process is not included in the paper due to space limitations.

  3. A graph node can be deleted only if it is isolated.

  4. E.g., adding 1 bit to record whether the node contains string constants during basic block type abstraction, as described in Sect. 3.1.

  5. For simplicity, we also refer to our binary similarity analysis tool as TAH.
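As a quick check on the counts in note 1: if all five feature families were combined into a single vector (an assumption here, since the note only lists the per-family counts), the total number of distinct features would follow from a geometric series:

\[
\sum_{n=1}^{5} 16^{n} = 16 + 256 + 4096 + 65536 + 1048576 = \frac{16\,(16^{5} - 1)}{15} = 1118480 .
\]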


Acknowledgment

This research was partially supported by the U.S. National Science Foundation under grant No. 1622402 and 1717862. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information

Corresponding author

Correspondence to Yuping Li.


Copyright information

© 2019 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper


Cite this paper

Li, Y., Jang, J., Ou, X. (2019). Topology-Aware Hashing for Effective Control Flow Graph Similarity Analysis. In: Chen, S., Choo, KK., Fu, X., Lou, W., Mohaisen, A. (eds) Security and Privacy in Communication Networks. SecureComm 2019. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 304. Springer, Cham. https://doi.org/10.1007/978-3-030-37228-6_14


  • DOI: https://doi.org/10.1007/978-3-030-37228-6_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-37227-9

  • Online ISBN: 978-3-030-37228-6

  • eBook Packages: Computer Science, Computer Science (R0)
