Similarity of Binaries Across Optimization Levels and Obfuscation

Jiang, Jianguo; Li, Gengwang; Yu, Min; Li, Gang; Liu, Chao; Lv, Zhiqiang; Lv, Bin; Huang, Weiqing

doi:10.1007/978-3-030-58951-6_15

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12308)

Included in the following conference series:

European Symposium on Research in Computer Security

Abstract

Binary code similarity evaluation has been widely applied in security. Unfortunately, compiler optimization and obfuscation techniques pose challenges that existing approaches have not addressed well. In this paper, we propose a prototype, ImOpt, that re-optimizes code to boost similarity evaluation. The key contribution is an immediate SSA (static single-assignment) transforming algorithm that provides a very fast pointer analysis, enabling more thorough re-optimization. The algorithm transforms variables and even pointers into SSA form on the fly, so that def-use and reachability information is maintained promptly. Using the immediate SSA transforming algorithm, ImOpt canonicalizes code and eliminates junk code to alleviate the perturbation caused by optimization and obfuscation.

We show that ImOpt improves the accuracy of a state-of-the-art similarity evaluation approach by 22.7%. Our experimental results demonstrate that the bottleneck part of our SSA transforming algorithm runs 15.7x faster than one of the best comparable methods. Furthermore, we show that ImOpt is robust to many obfuscation techniques that are based on data dependency.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61572469).

Author information

Authors and Affiliations

Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Jianguo Jiang, Gengwang Li, Min Yu, Chao Liu, Zhiqiang Lv, Bin Lv & Weiqing Huang
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
Jianguo Jiang, Gengwang Li & Min Yu
School of Information Technology, Deakin University, Melbourne, VIC, Australia
Gang Li

Corresponding author

Correspondence to Min Yu.

Editor information

Editors and Affiliations

University of Surrey, Guildford, UK
Liqun Chen
Purdue University, West Lafayette, IN, USA
Ninghui Li
Delft University of Technology, Delft, The Netherlands
Kaitai Liang
University of Surrey, Guildford, UK
Steve Schneider

Appendices

A Appendix 1: Sparse Analysis for Backward Block

The ImSsaSparse procedure is quite similar to the ImSsa procedure, except that it sparsely recomputes a visited but invalid block (line 4) to avoid unnecessary re-computation.

The SparseRecompute procedure is similar to the ProcessStatement procedure, except that the former processes only invalid statements. Another difference is that SparseRecompute needs to propagate invalidation through the def-use chain.
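The two differences described above can be illustrated with a minimal sketch. This is not the paper's SparseRecompute listing: the `Statement` class, the `evaluate` transfer function (a toy sum of operands), and the `uses` map standing in for the def-use chain are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Statement:
    target: str                  # SSA name defined by this statement (hypothetical model)
    operands: Tuple[str, ...]    # SSA names or integer literals written as strings
    value: Optional[int] = None  # last computed value, None if unknown
    valid: bool = False          # False means the statement must be re-processed


def evaluate(stmt: Statement, env: Dict[str, Optional[int]]) -> Optional[int]:
    """Toy transfer function (an assumption): sum of operands, None if any is unknown."""
    total = 0
    for op in stmt.operands:
        val = int(op) if op.lstrip("-").isdigit() else env.get(op)
        if val is None:
            return None
        total += val
    return total


def sparse_recompute(block: List[Statement],
                     uses: Dict[str, List[Statement]],
                     env: Dict[str, Optional[int]]) -> List[str]:
    """Re-process only invalid statements and return the targets that were recomputed."""
    recomputed = []
    for stmt in block:
        if stmt.valid:
            continue  # sparse: valid statements are skipped entirely
        new_value = evaluate(stmt, env)
        stmt.valid = True
        recomputed.append(stmt.target)
        if new_value != stmt.value:
            stmt.value = new_value
            env[stmt.target] = new_value
            # Propagate invalidation along the def-use chain.
            for user in uses.get(stmt.target, []):
                user.valid = False
    return recomputed


# Hypothetical block: a1 is stale because x0 changed; b1 uses a1; c1 is unaffected.
a = Statement("a1", ("x0", "1"), value=3, valid=False)
b = Statement("b1", ("a1", "2"), value=5, valid=True)
c = Statement("c1", ("y0", "4"), value=9, valid=True)
env = {"x0": 10, "y0": 5, "a1": 3, "b1": 5, "c1": 9}
uses = {"a1": [b]}
done = sparse_recompute([a, b, c], uses, env)
```

In this example only `a1` and its user `b1` are recomputed, while `c1` is never touched. Users that live in other blocks would simply stay marked invalid until their own block is revisited.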

B Appendix 2: Supplementary Proofs

B.1 Appended Proofs for Section 2.3

Proof

(Proof of Lemma 2). Assume that there is a backward definition \(d \in b'\) reaching the forward block \(b_i\), and that all the blocks except \(b_i\) on any path \(p : b' \rightarrow b_i\) are in SSA form. Then there must be a back edge \(e: b_j \rightarrow b_k\) on the path \(p: b' \rightarrow b_i\). Note that the block \(b_k\) cannot be the block \(b_i\), since a forward block has no back edge pointing to itself. So the block \(b_k\) must be in SSA form. However, in SSA-form code, a back edge introduces \(\phi\)-functions into the block \(b_k\) at the target end of the edge, one for every definition that passes through \(b_k\). These \(\phi\)-functions prevent definition \(d\) from reaching the successor block \(b_i\), which contradicts the assumption. Hence, in SSA-form code, only forward definitions reach a forward block.

Proof

(Proof of Lemma 3). According to Condition 1, the first \(i-1\) blocks are already in SSA form. Then, by Lemma 2, only forward definitions can reach block \(b_i\). Recall that if a forward definition \(d\) can reach a block \(b_i\), then there is no back edge on the reaching path \(p : d\ \overrightarrow{rc}\ b_i\), which means that the block \(b'\) that defines \(d\) precedes block \(b_i\) in reverse post-order.
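The ordering argument can be made concrete with a small sketch that computes a reverse post-order and classifies blocks. It rests on two assumptions that are not taken from the paper's own definitions: a back edge is an edge whose target does not come later than its source in reverse post-order, and a forward block is a block with no incoming back edge. The function names and the example CFG are hypothetical.

```python
from typing import Dict, List, Set, Tuple

CFG = Dict[str, List[str]]  # block name -> successor block names


def reverse_post_order(cfg: CFG, entry: str) -> List[str]:
    """Return the blocks reachable from entry in reverse post-order."""
    seen: Set[str] = set()
    post: List[str] = []

    def dfs(block: str) -> None:
        seen.add(block)
        for succ in cfg.get(block, []):
            if succ not in seen:
                dfs(succ)
        post.append(block)  # a block is emitted after all its unvisited successors

    dfs(entry)
    return post[::-1]


def classify_blocks(cfg: CFG, entry: str) -> Tuple[List[str], List[Tuple[str, str]], List[str]]:
    """Return (reverse post-order, back edges, forward blocks) under the stated assumptions."""
    order = reverse_post_order(cfg, entry)
    index = {b: i for i, b in enumerate(order)}
    back_edges = [(u, v) for u in order for v in cfg.get(u, []) if index[v] <= index[u]]
    backward = {v for _, v in back_edges}
    forward = [b for b in order if b not in backward]
    return order, back_edges, forward


# Hypothetical CFG with one loop: entry -> head -> body -> head, and head -> exit.
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
order, back_edges, forward = classify_blocks(cfg, "entry")
```

Here `head` is the only block targeted by a back edge. Every path into a forward block such as `exit` that avoids the back edge passes only through blocks that precede it in `order`, which is the situation Lemma 3 describes.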

Proof

(Proof of Lemma 4). According to the definition of the BRG, it is trivial that \(b' \in G_{BR}(b_i)\). If there is a path \(p: b' \rightarrow b_i\), then \(b_i\) is either an IDF of block \(b'\) or dominated by an IDF. Moreover, there must be a back edge on path \(p\), since \(d\ \overleftarrow{rc}\ b_i\). If the back edge is in the middle of path \(p\), then the \(\phi\)-function introduced by it will intercept definition \(d\), which contradicts the fact that definition \(d\) can reach block \(b_i\). Hence, the back edge can only be the last edge on path \(p\). This means that the backward block \(b_i\) cannot be dominated by any block on path \(p\). Therefore, the backward block \(b_i\) is an IDF of block \(b'\).
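To make the conclusion tangible, the sketch below computes dominance frontiers and their iterated closure on a small loop, assuming that IDF denotes the iterated dominance frontier in the usual SSA-construction sense. It uses the textbook iterative dominator computation rather than anything specific to ImSsa, assumes every block is reachable from the entry, and all function names and the example CFG are hypothetical.

```python
from typing import Dict, List, Set, Tuple

CFG = Dict[str, List[str]]  # block name -> successor block names


def dominators(cfg: CFG, entry: str) -> Tuple[Dict[str, Set[str]], Dict[str, List[str]]]:
    """Iterate dom[b] = {b} | intersection of dom[p] over predecessors p to a fixed point."""
    blocks = list(cfg)
    preds: Dict[str, List[str]] = {b: [] for b in blocks}
    for u in blocks:
        for v in cfg[u]:
            preds[v].append(u)
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            incoming = [dom[p] for p in preds[b]]
            new = {b} | (set.intersection(*incoming) if incoming else set())
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom, preds


def dominance_frontier(cfg: CFG, entry: str) -> Dict[str, Set[str]]:
    """DF(x) = blocks y with a predecessor dominated by x, where x does not strictly dominate y."""
    dom, preds = dominators(cfg, entry)
    df: Dict[str, Set[str]] = {b: set() for b in cfg}
    for y in cfg:
        for p in preds[y]:
            for x in dom[p]:
                if x not in dom[y] or x == y:
                    df[x].add(y)
    return df


def iterated_dominance_frontier(df: Dict[str, Set[str]], start: str) -> Set[str]:
    """Close DF under repeated application, starting from a single defining block."""
    result: Set[str] = set()
    work = [start]
    while work:
        b = work.pop()
        for f in df[b]:
            if f not in result:
                result.add(f)
                work.append(f)
    return result


# Hypothetical loop: a definition in "body" reaches "head" only through the back edge.
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
df = dominance_frontier(cfg, "entry")
idf_of_body = iterated_dominance_frontier(df, "body")
```

In this example the backward block `head` is the only member of the iterated dominance frontier of `body`, matching the shape of the statement proved above for a defining block \(b'\) and a backward block \(b_i\).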

B.2 Appended Proofs for Section 3.6

To prove Lemma 1, we first show that, given Condition 1, the property is satisfied for the i-th block, regardless of whether the block is a forward or a backward block.

Lemma 5

Given Condition 1, if the i-th block \(b_i\) is a forward block, then the ImSsa procedure also satisfies the invariants in Property 1 for block \(b_i\).

Proof

(Proof Sketch of Lemma 5). By Lemma 3, all the reachable definitions are visited before the current block is visited. Global invariant: the variables with a dominating definition are linked directly by \(E_{DOM(i)}\), whose validity is guaranteed by Condition 1; the other variables, which have multiple reachable definitions, have \(\phi\)-functions inserted at the front of the block when the blocks containing these definitions are processed. Hence, the global invariant holds. Local invariant: \(E\) and \(C\) are well-defined at the beginning of the local analysis since the global invariant holds; moreover, their invariants are preserved while the ProcessForwardBlock procedure runs. Hence, the local invariant holds.

Lemma 6

If the CFG contains only forward blocks, the ImSsa procedure satisfies the invariants in Property 1 for all the blocks in that graph.

Proof

This lemma follows directly from Lemma 5.

Lemma 7

Given Condition 1, if the i-th block \(b_i\) is a backward block, the ImSsa procedure also satisfies the invariants in Property 1. Furthermore, for each block in the BRG, the invariants in Property 1 are also satisfied.

Proof

(Proof Sketch of Lemma 7). Assume first that all blocks except \(b_i\) in the BRG are forward blocks. According to Lemma 4 and Lemma 6, the ImSsa procedure can propagate all the backward definitions back into block \(b_i\) along the IDF chain by processing \(G_{BR}(b_i)\). Additionally, the subsequent SparseRecompute procedure will reveal any new definitions caused by the re-computation and propagate them back as well. Hence, the global invariant holds. Recall that the invariants of \(E\) and \(C\) are also maintained during the sparse re-computations; therefore, the local invariant also holds. Furthermore, since the BRG \(G_{BR}(b_i)\) contains only forward blocks apart from \(b_i\), according to Lemma 6, these invariants are also satisfied for each block in the BRG. The case where the BRG contains other backward blocks can be derived from the previous case.

It follows that the ImSsa procedure satisfies Property 1:

Proof

(Proof Sketch of Lemma 1). It is reasonable to assume a forward entry block, since we can always insert an empty forward block at the beginning of the CFG. The first block therefore satisfies the global invariant, because no definition reaches it. The local invariant is also satisfied, according to Lemma 5. That is, the ImSsa procedure satisfies the invariants for the first block. The conclusion then follows from Lemma 5 and Lemma 7.

About this paper

Cite this paper

Jiang, J. et al. (2020). Similarity of Binaries Across Optimization Levels and Obfuscation. In: Chen, L., Li, N., Liang, K., Schneider, S. (eds) Computer Security – ESORICS 2020. ESORICS 2020. Lecture Notes in Computer Science, vol 12308. Springer, Cham. https://doi.org/10.1007/978-3-030-58951-6_15

DOI: https://doi.org/10.1007/978-3-030-58951-6_15
Published: 12 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58950-9
Online ISBN: 978-3-030-58951-6
eBook Packages: Computer Science, Computer Science (R0)
