Similarity of Binaries Across Optimization Levels and Obfuscation

  • Conference paper
Computer Security – ESORICS 2020 (ESORICS 2020)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 12308)

Abstract

Binary code similarity evaluation has been widely applied in security. Unfortunately, compiler optimization and obfuscation techniques pose challenges that existing approaches have not addressed well. In this paper, we propose a prototype, ImOpt, that re-optimizes code to boost similarity evaluation. The key contribution is an immediate SSA (static single-assignment) transforming algorithm that provides a very fast pointer analysis, enabling more thorough re-optimization. The algorithm transforms variables, and even pointers, into SSA form on the fly, so that def-use and reachability information is maintained promptly. Using the immediate SSA transforming algorithm, ImOpt canonicalizes code and eliminates junk code to alleviate the perturbation caused by optimization and obfuscation.

We show that ImOpt improves the accuracy of a state-of-the-art similarity evaluation approach by 22.7%. Our experimental results demonstrate that the bottleneck part of our SSA transforming algorithm runs 15.7x faster than one of the best comparable methods. Furthermore, we show that ImOpt is robust to many obfuscation techniques that are based on data dependency.

Notes

  1. https://www.trailofbits.com/research-and-development/mcsema/

  2. https://github.com/obfuscator-llvm/obfuscator/tree/llvm-4.0

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61572469).

Author information

Corresponding author

Correspondence to Min Yu.

Appendices

A Appendix 1: Sparse Analysis for Backward Block

The ImSsaSparse procedure is quite similar to the ImSsa procedure, except it sparsely recomputes the visited but invalid block (line 4) to avoid unnecessary re-computation:

[Figure j: algorithm listing of the ImSsaSparse procedure]

The SparseRecompute procedure is similar to the ProcessStatement procedure, except that it processes only invalid statements. In addition, SparseRecompute must propagate invalidation through the def-use chain.
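The idea of processing only invalid statements and pushing invalidation along def-use edges can be illustrated with a small worklist sketch. This is not the paper's SparseRecompute: the names `propagate_invalidation`, `sparse_recompute`, and the string statement labels are hypothetical, and the sketch invalidates users eagerly, whereas an actual implementation might invalidate a user only when the recomputed value changes.

```python
from collections import deque


def propagate_invalidation(def_use, seeds):
    """Return every statement reachable from `seeds` along def-use edges.

    `def_use` maps a defining statement to the statements that use its value
    (a hypothetical representation chosen for this sketch).
    """
    invalid = set(seeds)
    worklist = deque(seeds)
    while worklist:
        stmt = worklist.popleft()
        for user in def_use.get(stmt, ()):
            if user not in invalid:
                invalid.add(user)
                worklist.append(user)
    return invalid


def sparse_recompute(block, invalid, process_statement):
    """Re-run `process_statement` only on the invalid statements of `block`."""
    recomputed = []
    for stmt in block:
        if stmt in invalid:
            process_statement(stmt)
            recomputed.append(stmt)
    return recomputed


# Hypothetical def-use chain: s1 feeds s2 and s3, s3 feeds s5, s4 feeds s6.
def_use = {"s1": ["s2", "s3"], "s3": ["s5"], "s4": ["s6"]}
invalid = propagate_invalidation(def_use, ["s1"])
visited = sparse_recompute(
    ["s1", "s2", "s3", "s4", "s5", "s6"], invalid, lambda stmt: None
)
```

In this example, invalidating s1 marks s2, s3, and s5 as invalid, while s4 and s6 are skipped because no def-use edge connects them to s1.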

B Appendix 2: Supplementary Proofs

B.1 Appended Proofs for Section 2.3

Proof

(Proof of Lemma 2). Assume that there is a backward definition \(d \in b'\) reaching the forward block \(b_i\), and that all the blocks except \(b_i\) on any path \(p : b' \rightarrow b_i\) are in SSA form. Then there must be a back edge \(e: b_j \rightarrow b_k\) on the path \(p: b' \rightarrow b_i\). Note that the block \(b_k\) cannot be the block \(b_i\), since a forward block has no back edge pointing to itself. So the block \(b_k\) must be in SSA form. However, in SSA-form code, a back edge introduces \(\phi \)-functions into its target block \(b_k\) for every definition that passes through \(b_k\). These \(\phi \)-functions prevent definition \(d\) from reaching the successor block \(b_i\), which contradicts the assumption. Hence, in SSA-form code, only forward definitions reach a forward block.

Proof

(Proof of Lemma 3). According to Condition 1, the first \(i-1\) blocks are already in SSA form. Then, by Lemma 2, only forward definitions can reach block \(b_i\). Recall that if a forward definition \(d\) can reach a block \(b_i\), then there is no back edge on the reaching path \(p : d\ \overrightarrow{rc}\ b_i\), which means that the block \(b'\) that defines \(d\) precedes block \(b_i\) in reverse post-order.
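The ordering argument above relies on reverse post-order and on the notion of a back edge. The following sketch shows one way to compute both on a small control-flow graph. It rests on assumptions that are not taken from the paper: the CFG is a dictionary from block name to successor list, back edges are approximated by retreating edges of a depth-first traversal (the two coincide on reducible CFGs), and a backward block is taken to be the target of at least one back edge. All function and block names are hypothetical.

```python
def reverse_post_order(cfg, entry):
    """Return the blocks reachable from `entry` in reverse post-order."""
    seen, post = set(), []

    def dfs(block):
        seen.add(block)
        for succ in cfg.get(block, ()):
            if succ not in seen:
                dfs(succ)
        post.append(block)  # a block is emitted after all its DFS descendants

    dfs(entry)
    return post[::-1]


def back_edges(cfg, order):
    """Edges (u, v) whose target does not come after the source in `order`."""
    index = {block: i for i, block in enumerate(order)}
    return [
        (u, v)
        for u in order
        for v in cfg.get(u, ())
        if index[v] <= index[u]
    ]


# Hypothetical CFG with a single loop: a -> b -> a.
cfg = {"entry": ["a"], "a": ["b"], "b": ["a", "exit"], "exit": []}
order = reverse_post_order(cfg, "entry")
edges = back_edges(cfg, order)
backward_blocks = {target for _, target in edges}
forward_blocks = [block for block in order if block not in backward_blocks]
```

Here the only back edge is b -> a, so a is the single backward block, and every definition that reaches a forward block without crossing that edge comes from a block earlier in `order`.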

Proof

(Proof of Lemma 4). According to the definition of the BRG, it is trivial that \(b' \in G_{BR}(b_i)\). If there is a path \(p: b' \rightarrow b_i\), then \(b_i\) is either an IDF of block \(b'\) or dominated by an IDF. In addition, there must be a back edge on path \(p\), since \(d\ \overleftarrow{rc}\ b_i\). If the back edge is in the middle of path \(p\), then the \(\phi \)-function it introduces intercepts definition \(d\). This contradicts the fact that definition \(d\) can reach block \(b_i\). Hence, the back edge can only be the last edge on path \(p\). This means that the backward block \(b_i\) cannot be dominated by any block on path \(p\). Therefore, the backward block \(b_i\) is an IDF of block \(b'\).
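To make the IDF relation in this proof concrete, the sketch below computes dominator sets, dominance frontiers, and the iterated dominance frontier (IDF) for a small loop, using the standard textbook definitions rather than the paper's algorithm. It assumes that every block appears as a key of the CFG dictionary and is reachable from the entry; the function and block names are hypothetical.

```python
def dominators(cfg, entry):
    """Iteratively solve dom[b] = {b} | intersection of dom[p] over preds p."""
    nodes = list(cfg)
    preds = {n: [] for n in nodes}
    for u in nodes:
        for v in cfg[u]:
            preds[v].append(u)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            pred_doms = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*pred_doms) if pred_doms else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom, preds


def dominance_frontier(cfg, entry):
    """DF(x) = blocks y where x dominates a pred of y but not strictly y."""
    dom, preds = dominators(cfg, entry)
    df = {n: set() for n in cfg}
    for y in cfg:
        for p in preds[y]:
            for x in dom[p]:
                if x == y or x not in dom[y]:
                    df[x].add(y)
    return df


def iterated_dominance_frontier(df, blocks):
    """Close the dominance frontier of `blocks` under repeated application."""
    result, worklist = set(), list(blocks)
    while worklist:
        block = worklist.pop()
        for y in df[block]:
            if y not in result:
                result.add(y)
                worklist.append(y)
    return result


# Hypothetical loop: a definition in "body" reaches "head" via the back edge.
cfg = {"entry": ["head"], "head": ["body", "exit"], "body": ["head"], "exit": []}
df = dominance_frontier(cfg, "entry")
idf_of_body = iterated_dominance_frontier(df, {"body"})
```

In this loop, the back edge body -> head is the last edge on the path from the defining block to the backward block, and head lies in the IDF of body, which matches the shape of the conclusion above.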

B.2 Appended Proofs for Section 3.6

To prove Lemma 1, we first show that, given Condition 1, the property is satisfied for the i-th block, regardless of whether that block is a forward or a backward block.

Lemma 5

Given Condition 1, if the i-th block \(b_i\) is a forward block, then the ImSsa procedure also satisfies the invariants in Property 1 for block \(b_i\).

Proof

(Proof Sketch of Lemma 5). By Lemma 3, all the reachable definitions are visited before the current block is visited. Global invariant: the variables with a dominating definition are linked directly by \(E_{DOM(i)}\), whose validity is guaranteed by Condition 1; the other variables, which have multiple reachable definitions, have their \(\phi \)-functions inserted at the front of the block when the blocks that contain these definitions are processed. Hence, the global invariant holds. Local invariant: E and C are well-defined at the beginning of the local analysis, since the global invariant holds; moreover, their invariants are preserved while the ProcessForwardBlock procedure runs. In conclusion, the local invariant holds.

Lemma 6

If the CFG contains only forward blocks, the ImSsa procedure satisfies the invariants in Property 1 for all the blocks in that graph.

Proof

This lemma follows directly from Lemma 5, applied to each block in turn.

Lemma 7

Given Condition 1, if the i-th block \(b_i\) is a backward block, the ImSsa procedure also satisfies the invariants in Property 1. Furthermore, for each block in the BRG, the invariants in Property 1 are also satisfied.

Proof

(Proof Sketch of Lemma 7). Assume that all blocks except \(b_i\) in the BRG are forward blocks. According to Lemma 4 and Lemma 6, the ImSsa procedure can propagate all the backward definitions back into block \(b_i\) along the IDF chain by processing \(G_{BR}(b_i)\). Additionally, the subsequent SparseRecompute procedure reveals any new definitions caused by the re-computation and propagates them back as well. Hence, the global invariant holds. Recall that the invariants of E and C are also maintained during the sparse re-computations; therefore, the local invariant also holds. Furthermore, since the BRG \(G_{BR}(b_i)\) contains only forward blocks, according to Lemma 6, these invariants are also satisfied for each block in the BRG. The case where the BRG contains backward blocks can be derived from the previous case.

It follows that the ImSsa procedure satisfies Property 1:

Proof

(Proof Sketch of Lemma 1). It is reasonable to assume a forward entry block, since we can always insert an empty forward block at the beginning of the CFG. The first block therefore satisfies the global invariant, because no definition reaches it. The local invariant is also satisfied, according to Lemma 5. That is, the ImSsa procedure satisfies the invariants for the first block. The conclusion then follows from Lemma 5 and Lemma 7.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Jiang, J. et al. (2020). Similarity of Binaries Across Optimization Levels and Obfuscation. In: Chen, L., Li, N., Liang, K., Schneider, S. (eds) Computer Security – ESORICS 2020. ESORICS 2020. Lecture Notes in Computer Science, vol 12308. Springer, Cham. https://doi.org/10.1007/978-3-030-58951-6_15

  • DOI: https://doi.org/10.1007/978-3-030-58951-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58950-9

  • Online ISBN: 978-3-030-58951-6

  • eBook Packages: Computer Science (R0)
