1 Introduction

Binary code similarity detection (BCSD) compares two or more pieces of binary code (whole programs, functions or basic blocks) to determine their similarities or differences. It is essential in situations where the source code of a program is not available, including malware, legacy programs and commercial off-the-shelf (COTS) programs [1]. BCSD has become increasingly popular in research and has broad applications in software engineering and security, such as code clone detection [2,3,4,5], vulnerability discovery [6, 7], malware detection [8,9,10,11,12], patch analysis [13,14,15] and porting information across program versions [16].

The similarity of binary code depends not only on source code updates but also on the compilation process that produces the binary. As illustrated in Fig. 1, the usual compilation process takes the source code as input and, through the selected compiler, optimization option and target platform, generates object files, which are then linked into an executable binary program.

Fig. 1

Illustration of compilation process

2 Related Work

The core of the BCSD problem is to design an approach that detects whether two pieces of binary code are similar. A method that solves this problem needs to achieve the following design goals. First, the method must be binary-oriented: in many cases we cannot access the source code of a binary function, so effective similarity detection and code search techniques must operate directly on binary code. Second, since the query function and the functions in the target corpus may come from different hardware architectures and software platforms, an effective BCSD technique must capture the inherent characteristics of binary functions while tolerating syntactic variation. In addition, an excellent BCSD method should be efficient and adaptable: it should compute function similarity efficiently for tasks such as library function identification and vulnerability search, so that it scales to large target function libraries, and when domain experts provide similar or dissimilar examples, it should quickly adapt to them for specific domain applications.

2.1 Semantic-Aware Similarity

Semantic-aware similarity captures whether the compared pieces of code have similar effects when executed. FOSSIL [17] uses instruction classification to compute semantic similarity, but it cannot determine whether two pieces of binary code are equivalent. Another semantics-based approach relies on symbolic formulas, i.e., assignment statements in which the left side is an output variable and the right side is a logical expression over input variables and literals. BINHUNT [18], iBINHUNT [19] and COP use symbolic formulas. These approaches must attempt all pair-wise comparisons and check whether there exists a permutation of variables such that all matched variables hold the same value. Recently, most semantic-level strategies have incorporated ideas from natural language processing (NLP). For example, Zuo [20], inspired by machine translation, treated each instruction in the binary code as a word and each basic block as a sentence, and used an LSTM to encode the semantic vector of each sentence. Massarelli [21] used the word2vec model to train token embeddings and then used an attention mechanism to obtain function embeddings.

2.2 Structural-Aware Similarity

Structural-aware similarity usually computes similarity from graph representations of binary code. It differs from semantic similarity because a graph can capture multiple syntactic representations of the same code. Structural similarity can be calculated on different graphs. Traditional BCSD methods compute similarity scores between graphs with graph matching algorithms, such as SIGMA [22], DiscovRE [23] and BinGo [24]. However, these methods suffer from low time efficiency and are difficult to migrate to new settings. Later, graph embedding methods were proposed: a binary function is first represented as a graph, such as a control flow graph (CFG), whose nodes carry features of the corresponding basic blocks; a learned model then maps the function to a vector that can be compared directly, so that the distance between two vectors reflects the similarity of the two binary functions. The first work to apply embedding ideas to BCSD is Genius, a vulnerability detection engine for IoT devices that supports multiple architectures. Its pipeline consists of four parts: feature extraction, codebook generation, feature encoding and online search. Recently, Xu [25] proposed a graph embedding method called Gemini, which achieves better performance. Gemini uses a neural network, which reduces training and retraining time. The efficiency of graph neural networks makes such approaches more suitable for practical applications and improves both the quality and the efficiency of similarity detection.

2.3 Deep Embedding

Deep embedding is an efficient method for mapping data to a low-dimensional vector representation, with the goal of preserving the properties of the original data in the embedding space. The idea of embedding is widely used in many scenarios. Deep embedding can extract high-level semantics from a sample and generate a vector that represents it. For example, Chen [26] described pose variation via similarity embedding learning as spatial constraints for person re-identification. Gao [27] designed an effective similarity neural network that focuses on a similarity learning task in image retrieval. Embedding-based binary detection methods use the same principle: they learn the high-level semantics of a graph through a neural network, represent a binary function as a vector, and finally compare two vectors directly to obtain the similarity score of the two binary functions. Ou [28] proposed preserving asymmetric transitivity by approximating high-order proximity to improve graph embedding efficiency. Heimann [29] proposed a framework named REGAL that automatically learns node representations to match nodes. The literature [30] proposed Asm2Vec, a function embedding solution based on the PV-DM model from natural language processing; Asm2Vec first computes the CFG of a function and then executes a series of random walks on top of it.

3 Approach

This section presents DeepDual-SD in detail. First, we introduce the code similarity task description in Sect. 3.1 and provide an approach overview in Sect. 3.2. Then, the dual-attribute embedding modules of our method are presented (semantic attribute embedding in Sect. 3.3 and structural attribute embedding in Sect. 3.4). Finally, a gated attention mechanism that fuses the two attributes is described in Sect. 3.5.

3.1 Code Similarity Task Description

In automatic BCSD, given two binary functions, the machine needs to read and understand the binary code and then compare the similarity between the two functions. We call the binary function of interest the query function, and the corpus of binary functions the target corpus.

In this paper, all binaries are compiled from source code, not generated by manual assembly. A binary B consists of a set of functions \(f_1, f_2, \ldots , f_u\). Given two binaries \(B^{q}=\left\{ f_{1}^{q},f_{2}^{q}, \ldots , f_{m}^{q}\right\}\) and \(B^{t}=\left\{ f_{1}^{t},f_{2}^{t}, \ldots , f_{n}^{t}\right\}\), we assume they have k pairs of matching functions: \([f_{1}^{q},f_{1}^{t}]\), \([f_{2}^{q},f_{2}^{t}]\), \(\ldots\), \([f_{k}^{q},f_{k}^{t}]\), where \(k\le \min (m, n)\), and the rest of the functions do not match. For any query function \(f_{i}^{q}\) in \(B^{q}\), the BCSD tool should sort the functions in the binary \(B^{t}\) by their similarity to \(f_{i}^{q}\), as sketched below.
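To make the ranking step concrete, the following minimal Python sketch sorts the functions of \(B^{t}\) against a query function, assuming some learned scoring function (such as the one produced by DeepDual-SD) is available; the names are illustrative only.

```python
def rank_target_functions(f_q, target_functions, similarity):
    """Sort the functions of the target binary B^t by their similarity
    to the query function f_q (highest score first)."""
    return sorted(target_functions, key=lambda f_t: similarity(f_q, f_t), reverse=True)
```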

3.2 Overview

There are neural network-based methods that handle the BCSD problem well. However, these methods ignore the joint importance of the semantics and the structure of binary code. Furthermore, not all parts of a program are equally relevant, and single-attribute encoding cannot recognize the different relevance of each part of the binary [31, 32]. These problems affect BCSD accuracy. To improve BCSD performance, we design an architecture that addresses the dual attributes mentioned above by integrating semantic attribute embedding, structural attribute embedding and an attribute fusion mechanism. The proposed architecture is named deep dual attribute-aware embedding for binary code similarity detection (DeepDual-SD). In DeepDual-SD, the semantic embedding module extracts n-gram features from the function for sentence modeling, inspired by NLP techniques, and the structural embedding module measures the similarity between two input attributed control flow graphs via a graph embedding network. The fusion module combines both attribute representations, paying more attention to the features related to the specific application, which helps the model understand the binary.

The binary function \(f_{i}\) is described as \(f_{i}= Fusion(p_{i},g_{i})\), where \(p_{i}\) is the semantic attribute representation and \(g_{i}\) is the structural attribute representation. The architecture of DeepDual-SD is shown in Fig. 2.

Fig. 2

Overview of the DeepDual-SD

The key points of our approach are described in detail as follows.

3.3 Semantic Attribute Embedding

Challenges. When using semantic features to capture the semantic information of binary functions, several challenges arise. First, instruction explosion. In assembly language, operand addressing includes direct addressing, immediate addressing, register-relative addressing and other modes. Programs compiled by different compilers generate countless immediate values and memory addresses; the large number of instructions containing such essentially random numbers and addresses leads to an instruction explosion problem, with too many distinct instructions and a low repetition rate. Second, the out-of-vocabulary (OOV) problem: when a trained model has to convert an instruction that never appeared during training into a vector, embedding generation for that instruction fails. Third, the problem of making the machine understand and learn the semantic meaning of the code and express it as a suitable embedding vector.

Preprocess. In semantic attribute embedding, we use assembly instructions as tokens. An instruction token consists of the instruction mnemonic and the operands. In response to the first two challenges, instruction explosion and OOV, we formulate the following rules when preprocessing instruction inputs: (1) When the value of a numeric constant is above a threshold (we use 0x400 in our experiments), replace the numeric constant with \(\left\langle \texttt {MAXVALUE} \right\rangle\). (2) When there is a string reference, replace it with \(\left\langle \texttt {STRING} \right\rangle\). (3) Replace memory address strings with \(\left\langle \texttt {ADDR} \right\rangle\). (4) Replace transfer (jump) addresses with \(\left\langle \texttt {LOCAL} \right\rangle\). (5) For a function call instruction, replace the called function name with \(\left\langle \texttt {FUNCTION} \right\rangle\).

We take the code example in Table 1: the left column shows the original assembly code, and the right column shows the preprocessed result.

Table 1 The example of semantic attribute embedding preprocessing
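The five rules above can be sketched as a small normalization pass over the disassembly. The regular expressions and mnemonic names below are illustrative assumptions (real operand parsing is performed on the IDA Pro output), not the exact implementation:

```python
import re

def normalize_operand(op, value_threshold=0x400):
    """Apply rules (1)-(4) to a single operand string."""
    if re.fullmatch(r"(0x[0-9a-fA-F]+|\d+)", op):        # rule (1): numeric constant
        return "<MAXVALUE>" if int(op, 0) > value_threshold else op
    if op.startswith('"'):                                # rule (2): string reference
        return "<STRING>"
    if re.fullmatch(r"\[.*\]", op):                       # rule (3): memory address expression
        return "<ADDR>"
    if re.fullmatch(r"loc_[0-9A-Fa-f]+", op):             # rule (4): local transfer address
        return "<LOCAL>"
    return op

def normalize_instruction(mnemonic, operands):
    """Apply rule (5) for calls, otherwise normalize each operand."""
    if mnemonic in ("call", "bl", "jal"):                 # assumed call mnemonics per architecture
        return mnemonic + " <FUNCTION>"
    return mnemonic + " " + ",".join(normalize_operand(o) for o in operands)
```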

Semantic Embedding. After obtaining the initialization vectors of the instruction tokens, we construct the improved model ALBERT [33], based on BERT [34], to compute the semantic embedding representations in DeepDual-SD. Compared to a conventional long short-term memory (LSTM) [35]-based feature extraction framework, this model is better suited to the complicated task of binary code semantic analysis. The semantic function representation is \(p =\left[ p_{1}, p_{2}, \ldots , p_{128}\right]\) for binary function semantic embedding.
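The exact ALBERT configuration is not reproduced here; as a rough sketch of a transformer-style semantic encoder that outputs a 128-dimensional function vector (assuming a recent TensorFlow 2.x Keras with `MultiHeadAttention`, unlike the TF 1.14 setup in Sect. 4.1, and with vocabulary size and sequence length as placeholder values), one could write:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 5000   # assumed size of the normalized instruction-token vocabulary
MAX_LEN = 150       # assumed maximum number of instruction tokens per function
EMB_DIM = 128       # matches the 128-dimensional semantic representation p

def build_semantic_encoder():
    """One transformer-style encoder block followed by average pooling."""
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=EMB_DIM // 4)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(EMB_DIM, activation="relu")(x)
    x = layers.LayerNormalization()(x + ff)
    p = layers.GlobalAveragePooling1D()(x)      # function-level semantic vector p
    return tf.keras.Model(tokens, p, name="semantic_encoder")
```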

3.4 Structural Attribute Embedding

The similarity detection problem for binary functions can be converted into the problem of representing the attributed control flow graph of a binary function: each function is expressed as a high-dimensional vector such that the vectors of binary functions compiled from the same source code function lie close to each other.

Preprocess. We extract two kinds of features: block-level features, which describe feature information inside each basic block, and structural features, which describe the relationships between basic blocks in the entire CFG. The specific features are shown in Table 2.

Table 2 Features used by structural attribute embedding

This paper extracts 7 statistical block-level features and 2 inter-block structural features. For example, Fig. 3 shows the attributed CFG, with extracted features, of the function \(SSL\_do\_handshake\) from OpenSSL\(-\)1.0.1f compiled with gcc 5.4 for ARM and disassembled with IDA Pro [36].

Fig. 3

Attributed CFG of function \(SSL\_do\_handshake\) on ARM platforms
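As a rough illustration of the inter-block structural attributes, the sketch below computes two graph-level node statistics commonly used in attributed CFGs (for example by Gemini); they are assumptions for illustration and may not coincide with the exact entries of Table 2:

```python
import networkx as nx

def inter_block_features(cfg: nx.DiGraph):
    """Compute two example structural attributes per basic block:
    the number of offspring (reachable blocks) and betweenness centrality."""
    offspring = {n: len(nx.descendants(cfg, n)) for n in cfg.nodes}
    betweenness = nx.betweenness_centrality(cfg)
    return offspring, betweenness
```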

Structural Embedding. Existing structural-aware similarity methods rely on the CFG of a function and extract features for each basic block. The structural embedding model is implemented with the structure2vec [37] algorithm, as used in Gemini. The structure2vec architecture is illustrated in Fig. 4. The input of structure2vec is a CFG denoted as \(\langle {\mathcal {V}}, {\mathcal {E}}, x\rangle\), containing three elements; each node i in the graph has initial node features \(x_i\). Through the structural embedding network, each node in the graph is represented by a new feature vector \(g_{i}\).

The structural embedding vector is produced by Algorithm 1.

Fig. 4

Architecture of structure2vec for structural attribute embedding

The update procedure of the structural attribute embedding is visualized in Fig. 4. The mapping from \(x_{i}\) to \(g_{i}\) follows the idea of variational inference and is an iterative computation based on the graph topology. After a fixed number of iterations, the network computes a new feature representation for each node i that accounts for both graph characteristics and long-range interactions between node features. Initially, each node embedding is set to 0; a single iteration then updates each node as follows:

$$\begin{aligned} g_{i}^{(t)}=\tanh \left( W_{1} x_{i}+\sigma \left( \sum _{j \in {\mathcal {N}}_{i}} g_{j}^{(t-1)}\right) \right) , \forall i \in {\mathcal {V}}, \end{aligned}$$
(1)

where \({\mathcal {N}}_{i}\) denotes the set of direct neighbors of node i. Assuming there are T iterations in total, each iteration updates the features of all nodes in the graph synchronously, and each new round of computation builds on the results of the previous iteration. After T iterations, the vector \(g_{i}^{(T)}\) of each node contains information from all nodes within T hops of node i. \(x_{i}\) is the basic feature vector of the basic block, and \(W_{1}\) is a weight matrix that maps the d-dimensional block features into the p-dimensional embedding space, where p is the dimension of the final embedding vector. \(\sigma (\cdot )\) is an n-layer fully connected neural network, formulated as follows:

$$\begin{aligned} \sigma (l)=\underbrace{P_{1} \times \textrm{ReLU} \left( P_{2} \times \ldots \textrm{ReLU} \left( P_{n} l\right) \right) }_{n \text { levels}}, \end{aligned}$$
(2)

where \(P_i\) is a \(p \times p\) matrix, n is the embedding depth and ReLU is the rectified linear unit \(ReLU(x) = \max \left\{ 0,x \right\}\). In DeepDual-SD, ReLU is used as the non-linear activation function because it improves the learning dynamics of the network and significantly reduces the number of iterations required for convergence in deep networks. The graph-level embedding vector is computed by aggregating the node vectors as \(g = W_{2}\left( \sum _{i \in {\mathcal {V}}} g_{i}^{(T)}\right)\). Because the embedding size is 64, the structural embedding representation is \(g =\left[ g_{1}, g_{2}, \ldots , g_{64}\right]\).
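For concreteness, the iteration of Eqs. (1)-(2) and the final aggregation can be sketched in NumPy as follows; the matrix shapes are chosen so that the products are well defined and are assumptions rather than the exact implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def structure2vec_embed(X, A, W1, P, W2, T=5):
    """X: (num_nodes, d) basic-block features, A: (num_nodes, num_nodes) CFG adjacency,
    W1: (p, d), P: list [P1, ..., Pn] of (p, p) matrices for sigma(.), W2: (p, p)."""
    num_nodes, p_dim = X.shape[0], W1.shape[0]
    G = np.zeros((num_nodes, p_dim))            # g_i^(0) = 0 for every node
    for _ in range(T):
        neigh = A @ G                           # sum of neighbour embeddings per node
        sig = neigh
        for Pk in reversed(P[1:]):              # Pn, ..., P2, each followed by ReLU
            sig = relu(sig @ Pk.T)
        sig = sig @ P[0].T                      # final multiplication by P1, Eq. (2)
        G = np.tanh(X @ W1.T + sig)             # Eq. (1), applied to all nodes at once
    return W2 @ G.sum(axis=0)                   # aggregate node vectors into g
```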

3.5 Dual Attribute Fusion Mechanism

A gated attention-based network is proposed to incorporate the semantic information and the structural graph representation. It is a variant of attention-based networks, with an additional gate that determines the importance of the information in the function graph with respect to an instruction. Given the semantic and structural representations \(\left\{ p_{t}\right\} _{t=1}^{384}\) and \(\left\{ g_{t}\right\} _{t=1}^{64}\), the fusion module generates the function representation via a gated attention mechanism as follows:

$$\begin{aligned} \textbf{y}=H\left( p, \textbf{W}_{\textbf{H}}\right) \cdot T\left( g, \textbf{W}_{\textbf{T}}\right) +p\cdot \left( 1-T\left( g, \textbf{W}_{\textbf{T}}\right) \right) , \end{aligned}$$
(3)

where \(\textbf{y}\) is the output of the fusion module, H(.) is an attention layer with a non-linear unit and T(.) is a transform gate, which is also an attention layer. H(.) and T(.) additionally project the semantic and structural representations to compatible dimensions for the element-wise multiplication. \(\textbf{W}_\textbf{H}\) and \(\textbf{W}_\textbf{T}\) are weight parameters, which are trained together with the whole network.

Different from the gates in an LSTM or GRU [38], the additional gate is based on the current instruction representation and its attention vector over the graph representation, so it focuses on the relation between the semantic and structural representations.
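A minimal Keras sketch of this gated fusion, under the assumption that H(.) and T(.) are realized as dense layers with tanh and sigmoid activations and that p is linearly projected to the gate dimension (the fused size of 128 is an assumption), could look as follows:

```python
from tensorflow.keras import layers

def gated_fusion(p, g, out_dim=128):
    """Highway-style gated fusion of the semantic vector p and structural vector g."""
    h = layers.Dense(out_dim, activation="tanh")(p)        # H(p, W_H)
    t = layers.Dense(out_dim, activation="sigmoid")(g)     # T(g, W_T), the transform gate
    p_proj = layers.Dense(out_dim, use_bias=False)(p)      # align p with the gate size
    return h * t + p_proj * (1.0 - t)                      # Eq. (3)
```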

Finally, DeepDual-SD learns its parameters using a siamese architecture and compares the similarity of two functions Q and T using the formula:

$$\begin{aligned} \textrm{similarity}\left( f_{Q}, f_{T}\right) =\frac{\sum _{i=1}^{n} f_{Q}[i] \cdot f_{T}[i] }{\sqrt{\sum _{i=1}^{n} f_{Q}[i]^{2}} \cdot \sqrt{\sum _{i=1}^{n}f_{T}[i]^{2}}}, \end{aligned}$$
(4)

where f[i] denotes the \(i\)-th component of the vector f.

The network is trained on a set of K function pairs \((\overrightarrow{f_{Q}}, \overrightarrow{f_{T}})\). The final output of the siamese architecture is the similarity score between the two inputs. The ground truth labels are \(y_{i} \in \{+1,-1\}\), where \(+1\) indicates that the two input functions are similar and \(-1\) means they are dissimilar. The loss function is:

$$\begin{aligned} J=\sum _{ i =1}^{K}\left( \textrm{similarity}(\overrightarrow{f_{Q}}, \overrightarrow{f_{T}})-y_{i}\right) ^{2}. \end{aligned}$$
(5)

In our approach, the Adam optimizer is chosen to minimize the loss function of the network. The model parameters are fine-tuned with Adam, which has been shown to be an effective and efficient gradient-based optimization algorithm.
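Putting Eqs. (4) and (5) together, a minimal siamese training setup in Keras might look like the following sketch (assuming a recent TensorFlow 2.x Keras API rather than the TF 1.14 used in the experiments; `encoder` stands for the full DeepDual-SD embedding pipeline and is assumed, not defined here):

```python
from tensorflow.keras import layers, Model, optimizers

def build_siamese(encoder, input_shape):
    """Siamese wrapper: shared encoder, cosine similarity head, MSE loss against y in {+1, -1}."""
    in_q = layers.Input(shape=input_shape)
    in_t = layers.Input(shape=input_shape)
    f_q, f_t = encoder(in_q), encoder(in_t)
    sim = layers.Dot(axes=1, normalize=True)([f_q, f_t])        # cosine similarity, Eq. (4)
    model = Model([in_q, in_t], sim)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),  # learning rate from Sect. 4.1
                  loss="mse")                                      # squared error, Eq. (5)
    return model
```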

4 Performance Evaluation

In this section, we describe the datasets and evaluation metrics used to evaluate the proposed method. We then study similarity detection performance in cross-compiler, cross-architecture and cross-version settings. Furthermore, the impact of different embedding dimensions on the results is discussed.

4.1 Datasets and Experimental Settings

Datasets. We collect two datasets to investigate the performance of our method on several tasks. The function pairs consist of binaries compiled from source code for which we have ground truth. The compiled object files are disassembled with IDA Pro [36] and then preprocessed for encoding, as described in Sect. 3.

OpenSSL Dataset. This dataset is obtained by compiling OpenSSL [39] (versions 1.0.1f and 1.0.1u). The compiler is set to emit code for the ARM, MIPS and x86 platforms. Compilation is done with gcc\(-\)5.4 using the four optimization levels O0-O3. We obtain a total of 66964 function pairs.

Debian Dataset. This dataset is drawn from the Debian package repository, where we directly collected binaries from deb packages [40]. We collected packages of different versions from Debian 7.11, Debian 8.11 and Debian 9.11, grouped each binary with its closest version as a pair, and obtained 93324 pairs in total. The pairs are divided into two parts by the following rule: pairing two binary functions that originate from the same source code yields a similar pair, while randomly pairing functions that do not derive from the same source code yields a dissimilar pair.

We split each dataset into training, validation and testing sets (8:1:1). Specifically, we generate similar pairs associated with the training label \(\left\langle +1 \right\rangle\) and dissimilar pairs with the training label \(\left\langle -1 \right\rangle\).
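For illustration, a sketch of this pair-generation rule (the grouping of binaries by source-level function identity is assumed to be available from the compilation ground truth):

```python
import random

def make_pairs(groups, num_dissimilar):
    """groups: dict mapping a source-level function id to the list of its compiled variants."""
    pairs = []
    for variants in groups.values():            # similar pairs: same source function, label +1
        for i, a in enumerate(variants):
            for b in variants[i + 1:]:
                pairs.append((a, b, +1))
    keys = list(groups)
    for _ in range(num_dissimilar):             # dissimilar pairs: different source functions, label -1
        k1, k2 = random.sample(keys, 2)
        pairs.append((random.choice(groups[k1]), random.choice(groups[k2]), -1))
    return pairs
```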

Experimental Settings. We implemented our embedding model in Python using Keras 2.3 [41] and TensorFlow 1.14 [42]. The experiments are performed on a computer running Ubuntu 18.04 with a 64-bit 2.7 GHz Intel Core i7 CPU, 48 GB RAM and a GTX 1080 Ti GPU. In the following experiments, the networks are trained with the mean squared loss between the estimated similarity and the ground-truth label, and the network parameters are tuned with the Adam optimization algorithm [43] with learning rate 0.0001. During training, we measure the loss and AUC on the validation set and save the model that achieves the best validation AUC.

4.2 Evaluation Metrics

Identifying matching functions accurately is important for BCSD solutions. We therefore evaluate matching performance with Precision, Recall and AUC, which are common evaluation metrics in machine learning and information retrieval tasks.

A robust model should be evaluated not only by precision, the fraction of returned matches that are correct, but also by recall, the fraction of true target functions that are detected. Precision and Recall are formulated as follows:

$$\begin{aligned} Precision=\frac{TP}{{TP}+{FP}}, \quad Recall =\frac{TP}{{TP}+{FN}}, \end{aligned}$$
(6)

where TP, FP and FN denote the number of true positives, false positives and false negatives, respectively.

Similar to [21, 25], this paper also uses the Receiver Operating Characteristic curve (ROC curve) [44] and the Area Under the Curve (AUC) computed from the model predictions to measure the performance of our approach. The AUC summarizes the ROC curve over all possible classification thresholds; the higher the AUC value, the better the predictive performance of the algorithm.
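These metrics can be computed directly from the predicted similarity scores; the sketch below uses scikit-learn, with the decision threshold of 0.5 being an illustrative assumption rather than a value taken from the paper:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate(scores, labels, threshold=0.5):
    """scores: similarity values from Eq. (4); labels: ground truth in {+1, -1}."""
    preds = [1 if s >= threshold else -1 for s in scores]
    binary_labels = [1 if y == 1 else 0 for y in labels]
    return {
        "precision": precision_score(labels, preds, pos_label=1),
        "recall": recall_score(labels, preds, pos_label=1),
        "auc": roc_auc_score(binary_labels, scores),
    }
```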

4.3 Performance in Cross-Compiler BCSD

These experiments analyze binary code similarity detection across compiler optimization options. We use the same experimental environment configuration to compare DeepDual-SD with the following baseline methods, which are effective and have achieved good results in BCSD. Gemini [25] accumulates real-valued feature vectors through graph embedding and then computes the similarity of the feature vectors. SAFE [21] uses an NLP-based approach leveraging a self-attentive neural network to create function embeddings. The comparison results for cross-compiler BCSD are presented in Table 3.

Table 3 AUC (%) of matching functions for different methods (DeepDual-SD, Gemini and SAFE) in cross-optimization-level settings. O0, O1, O2 and O3 denote the four compiler optimization levels

From Table 3, DeepDual-SD achieves better results than the other methods except for the cross-optimization-level O0 vs. O1 setting on the x86 instruction set. The results of DeepDual-SD on the x86 architecture are 96.4, 95.6 and 97.5% for O1 vs. O2, O1 vs. O3 and O2 vs. O3, respectively. DeepDual-SD gives relative improvements of 3.13, 4.50 and 3.64% over the semantic-based method SAFE on MIPS O1 vs. O2, x86 O0 vs. O3 and ARM O1 vs. O3. Compared with the structure-based method Gemini, DeepDual-SD performs better on all O0 to O3 similarity comparisons, and it outperforms both Gemini and SAFE on the MIPS and x86 architectures. These results indicate that DeepDual-SD is feasible across various architectures. They also show that the dual attribute-based method performs better than the single-attribute approaches.

4.4 Performance in Cross-Architecture BCSD

Since DeepDual-SD also targets binary code similarity detection across instruction sets, we use the same experimental configuration here. More specifically, we match every function in the OpenSSL binaries compiled for one architecture (i.e., x86, MIPS or ARM) against the function with the same name in binaries compiled for another architecture. The result is shown in Fig. 5.

Fig. 5

ROC curves for different approaches on OpenSSL binaries compiled for ARM, MIPS and x86

As can be seen from Fig. 5, compared to Gemini and SAFE, the DeepDual-SD proposed in this paper achieves the best results in all three sets of experiments. In the MIPS vs. ARM comparison, our method achieves the best performance; the result of DeepDual-SD is 8.25% higher than SAFE and 3.23% higher than Gemini on average. The results in this section show that DeepDual-SD is superior to the other baseline methods on cross-architecture tasks with the same settings.

4.5 Performance in Cross-Version BCSD

When a program needs to be updated (e.g., to patch vulnerabilities and errors), a new version is released in binary format, usually without disclosing the details of the changes. It is therefore of great interest to understand the differences between two versions of a program, and binary cross-version similarity analysis is one of the most useful techniques for discovering these differences. In this section, we evaluate the performance of DeepDual-SD on the Debian dataset, as shown in Fig. 6.

We can see that DeepDual-SD performs slightly better than Gemini and SAFE when the version gap is small, and clearly better when the version gap is large. The AUC of DeepDual-SD improves over Gemini by 0.72% and 0.71% from Debian v7 to v8 and from v8 to v9, respectively, and DeepDual-SD outperforms Gemini by 1.97% when comparing binaries from v7 to v9. In addition, Gemini is also better than SAFE across the three versions; for example, Gemini outperforms SAFE by over 0.69% on average. This shows that the structural feature captured by the structural attribute embedding network is a very strong feature for cross-version BCSD.

Fig. 6

ROC curves comparing different approaches on binaries from Debian 7.11, Debian 8.11 and Debian 9.11

4.6 Discussion

For DeepDual-SD, the purpose of dual-attribute embedding is to strengthen the understanding of the connection between the semantics and the structure of binary code when comparing similarity. For binary code similarity analysis, the way in which embedding vectors are generated affects accuracy. Compared to a single-attribute embedding vector, a multi-attribute embedding vector contains more feature dimensions and can automatically adjust the influence of each feature according to the similarity task, giving better adaptability. The experiments show that the dual-attribute embedding vector achieves better results than single-attribute embedding models; hence, generating a comprehensive function representation is more suitable for BCSD. In most scenarios, DeepDual-SD obtains better results than the baseline methods, which shows that DeepDual-SD has greater detection ability and performs better than state-of-the-art DNN-based BCSD methods.

5 Conclusion

In this paper, we proposed a deep dual attribute-aware embedding method for BCSD. To the best of our knowledge, using dual attribute-aware embedding to automatically learn discriminative function features to improve BCSD performance is a pioneering approach. Compared with state-of-the-art baseline methods, the new method is more effective and efficient in terms of detection quality in most cases.

Future work will focus on embedding methods suited to smaller datasets for BCSD tasks. Moreover, in the real world, training-based approaches are already applied in vulnerability-mining products combined with federated learning. Real-world datasets are decentralized across multiple client devices (e.g., IoT devices), which makes such training more practical. We are applying our approach to federated learning to achieve better detection capabilities.