Topology-Aware Hashing for Effective Control Flow Graph Similarity Analysis

Li, Yuping; Jang, Jiyong; Ou, Xinming

doi:10.1007/978-3-030-37228-6_14

Part of the book series: Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering (LNICST, volume 304)

Included in the following conference series:

International Conference on Security and Privacy in Communication Systems

Abstract

Control Flow Graph (CFG) similarity analysis is an essential technique for a variety of security analysis tasks, including malware detection and malware clustering. Although various algorithms have been developed, existing CFG similarity analysis methods still suffer from limited efficiency, accuracy, and usability. In this paper, we propose a novel fuzzy hashing scheme called topology-aware hashing (TAH) for effective and efficient CFG similarity analysis. Given the CFGs constructed from program binaries, we extract blended n-gram graphical features of the CFGs, encode the graphical features into numeric vectors (called graph signatures), and then measure graph similarity by comparing the graph signatures. We further employ a fuzzy hashing technique to convert the numeric graph signatures into smaller, fixed-size fuzzy hash signatures for efficient similarity calculation. Our comprehensive evaluation demonstrates that TAH is more effective and efficient than existing CFG comparison techniques. To demonstrate the applicability of TAH to real-world security analysis tasks, we develop a binary similarity analysis tool based on TAH and show that it outperforms existing similarity analysis tools in malware clustering.
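The pipeline outlined in the abstract (n-gram features over the CFG, a numeric graph signature, then a compact fuzzy hash) can be illustrated with a rough sketch. Everything below is an assumption made for illustration rather than the authors' implementation: node types are plain integers (Note 1 suggests 16 of them), an n-gram is taken to be the type sequence along a directed walk of n nodes, graph signatures are compared with cosine similarity, and the fuzzy hash is a SimHash-style bit signature used as a stand-in for whatever scheme TAH actually employs. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from math import sqrt


def ngram_features(node_types, edges, max_n=5):
    """Count node-type sequences along directed walks of 1..max_n nodes."""
    successors = {node: [] for node in node_types}
    for src, dst in edges:
        successors[src].append(dst)

    counts = Counter()
    # Each walk is (last_node, type_sequence); start with single-node walks.
    walks = [(node, (node_types[node],)) for node in node_types]
    for n in range(1, max_n + 1):
        for _, seq in walks:
            counts[seq] += 1
        if n < max_n:
            # Extend every walk by one successor to get the (n+1)-grams.
            walks = [(nxt, seq + (node_types[nxt],))
                     for last, seq in walks
                     for nxt in successors[last]]
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse feature-count mappings."""
    dot = sum(weight * b.get(key, 0) for key, weight in a.items())
    norm = (sqrt(sum(w * w for w in a.values()))
            * sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def fuzzy_signature(features, bits=64):
    """SimHash-style stand-in: fold weighted features into a bits-wide integer.

    Assumes bits is a multiple of 8 and at most 256 (the SHA-256 digest size).
    """
    acc = [0] * bits
    for seq, weight in features.items():
        digest = hashlib.sha256(repr(seq).encode()).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if acc[i] > 0)


def signature_similarity(a, b, bits=64):
    """Fraction of matching bits between two fuzzy signatures."""
    return 1.0 - bin(a ^ b).count("1") / bits


# Two toy CFGs that differ by a single edge (node id -> assumed type id).
g1 = ngram_features({0: 3, 1: 7, 2: 7}, [(0, 1), (0, 2)])
g2 = ngram_features({0: 3, 1: 7, 2: 7}, [(0, 1), (0, 2), (1, 2)])
print(cosine(g1, g2))
print(signature_similarity(fuzzy_signature(g1), fuzzy_signature(g2)))
```

The sparse `Counter` avoids materializing the full feature space, and the fixed-width signature makes each comparison a single XOR and bit count regardless of graph size, which is the kind of efficiency gain the abstract attributes to the fuzzy hashing step.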

Notes

1. There are 16 features for 1-grams, \(16^2\) for 2-grams, \(16^3\) for 3-grams, \(16^4\) for 4-grams, and \(16^5\) for 5-grams.
2. The parameter selection process is omitted from the paper due to space limitations.
3. A graph node can be deleted only if it is isolated.
4. E.g., adding 1 bit to record whether the node contains string constants during basic block type abstraction, as described in Sect. 3.1.
5. For simplicity, we also refer to our binary similarity analysis tool as TAH.
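
As a quick check on the counts in Note 1, and assuming the blended feature set simply combines every n-gram for \(n = 1, \dots, 5\) with one dimension per feature (an assumption; the page does not state the signature length), the total number of dimensions would be the geometric sum:

```latex
\[
\sum_{n=1}^{5} 16^{n}
  = 16 + 256 + 4096 + 65536 + 1048576
  = \frac{16^{6} - 16}{15}
  = 1118480 .
\]
```

A vector of roughly 1.1 million dimensions is expensive to store and compare directly, which is consistent with the abstract's motivation for converting graph signatures into smaller fixed-size fuzzy hash signatures.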

Acknowledgment

This research was partially supported by the U.S. National Science Foundation under grants No. 1622402 and No. 1717862. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information

Authors and Affiliations

Pinterest, San Francisco, USA
Yuping Li
IBM Research, Yorktown Heights, USA
Jiyong Jang
University of South Florida, Tampa, USA
Xinming Ou

Corresponding author

Correspondence to Yuping Li.

Editor information

Editors and Affiliations

George Mason University, Fairfax, VA, USA
Songqing Chen
The University of Texas at San Antonio, San Antonio, TX, USA
Kim-Kwang Raymond Choo
University of Massachusetts Lowell, Lowell, MA, USA
Xinwen Fu
Virginia Tech, Blacksburg, VA, USA
Wenjing Lou
University of Central Florida, Orlando, FL, USA
Aziz Mohaisen

About this paper

Cite this paper

Li, Y., Jang, J., Ou, X. (2019). Topology-Aware Hashing for Effective Control Flow Graph Similarity Analysis. In: Chen, S., Choo, KK., Fu, X., Lou, W., Mohaisen, A. (eds) Security and Privacy in Communication Networks. SecureComm 2019. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 304. Springer, Cham. https://doi.org/10.1007/978-3-030-37228-6_14

DOI: https://doi.org/10.1007/978-3-030-37228-6_14
Published: 13 December 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-37227-9
Online ISBN: 978-3-030-37228-6
eBook Packages: Computer Science, Computer Science (R0)
