1 Introduction

Binary code similarity detection (BCSD) compares two or more pieces of binary code (whole programs, functions or basic blocks) to determine their similarities or differences. It is essential in situations where the source code of a program is not available, including malware, legacy programs and commercial off-the-shelf (COTS) programs [1]. BCSD has become increasingly popular in research and has broad applications in software engineering and security, such as code clone detection [2,3,4,5], vulnerability discovery [6, 7], malware detection [8,9,10,11,12], patch analysis [13,14,15] and porting information across program versions [16].

The similarity of binary code depends not only on source code updates but also on the compilation process that produces the binary. As illustrated in Fig. 1, the usual compilation process takes the source code as input and, through the selected compiler, optimization option and target platform, generates object files, which are then linked into an executable binary program.

Fig. 1

Illustration of compilation process

2 Related Work

The core of the BCSD problem is to design an approach that detects whether two pieces of binary code are similar. A method that solves this problem needs to achieve the following design goals. First, the method must be binary-oriented: in many cases we cannot access the source code of a binary function, so effective similarity detection and code search techniques must operate directly on binary code. Second, since the query function and the functions in the target corpus may come from different hardware architectures and software platforms, an effective BCSD technique must capture the inherent characteristics of binary functions while tolerating syntactic variation. In addition, an excellent BCSD method should be efficient and adaptable: it should compute function similarity efficiently for tasks such as library function identification and vulnerability search, so that it scales to large target function libraries, and when domain experts provide similar or dissimilar examples, it should quickly adapt to them for specific domain applications.

2.1 Semantic-Aware Similarity

Semantic-aware similarity captures whether the compared pieces of code have similar effects when executed. FOSSIL [17] uses instruction classification to compute semantic similarity, but it cannot determine whether two pieces of binary code are equivalent. Another semantics-based approach relies on symbolic formulas, i.e., assignment statements in which the left side is an output variable and the right side is a logical expression over input variables and literals. BINHUNT [18], iBINHUNT [19] and COP use symbolic formulas. These approaches must attempt all pair-wise comparisons and check whether there exists a permutation of variables such that all matched variables hold the same value. Recently, most semantic-level strategies have incorporated ideas from natural language processing (NLP). For example, Zuo [20], inspired by machine translation, treated each instruction in the binary code as a word and each basic block as a sentence, and used an LSTM to encode the semantic vector of each sentence. Massarelli [21] used the word2vec model to train token embeddings and then used an attention mechanism to obtain function embeddings.

2.2 Structural-Aware Similarity

Structural-aware similarity usually computes similarity from graph representations of binary code. It differs from semantic similarity because a graph can capture multiple syntactic representations of the same code. Structural similarity can be calculated on different graphs. Traditional BCSD methods compute similarity scores between graphs with graph matching algorithms, such as SIGMA [22], DiscovRE [23] and BinGo [24]. However, these methods suffer from low time efficiency and are difficult to migrate to new settings. Later, graph embedding methods were proposed: a binary function is first represented as a graph, such as a control flow graph (CFG), whose nodes carry features of the corresponding basic blocks; a learned model then maps the function to a vector that can be compared directly, so that the distance between two vectors reflects the similarity of the two binary functions. The first work to apply embedding ideas to BCSD is Genius, a vulnerability detection engine for IoT devices that supports multiple architectures. Its pipeline consists of four parts: feature extraction, codebook generation, feature encoding and online search. Recently, Xu [25] proposed a graph embedding method called Gemini, which achieves better performance. Gemini uses a neural network, which reduces training and retraining time. The efficiency of graph neural networks makes such approaches more suitable for practical applications and improves both the quality and the efficiency of similarity detection.

2.3 Deep Embedding

Deep embedding is an efficient method for mapping data to a low-dimensional vector representation, with the goal of preserving the properties of the original data in the embedding space. The idea of embedding is widely used in many scenarios. Deep embedding can extract high-level semantics from a sample and generate a vector that represents it. For example, Chen [26] described pose variation via similarity embedding learning as spatial constraints for person re-identification. Gao [27] designed an effective similarity neural network that focuses on a similarity learning task in image retrieval. Embedding-based binary detection methods use the same principle: they learn the high-level semantics of a graph through a neural network, represent a binary function as a vector, and finally compare two vectors directly to obtain the similarity score of the two binary functions. Ou [28] proposed preserving asymmetric transitivity by approximating high-order proximity to improve graph embedding efficiency. Heimann [29] proposed a framework named REGAL that automatically learns node representations to match nodes. The literature [30] proposed Asm2Vec, a function embedding solution based on the PV-DM model from natural language processing; Asm2Vec first computes the CFG of a function and then executes a series of random walks on top of it.

3 Approach

This section presents DeepDual-SD in detail. First, we introduce the code similarity task description in Sect. 3.1 and provide an approach overview in Sect. 3.2. Then, the dual-attribute embedding modules of our method are presented (semantic attribute embedding in Sect. 3.3 and structural attribute embedding in Sect. 3.4). Finally, a gated attention mechanism that fuses the two attributes is described in Sect. 3.5.

3.1 Code Similarity Task Description

In automatic BCSD, given two binary functions, the machine needs to read and understand the binary code and then compare the similarity between the two functions. We call the binary function of interest the query function, and the corpus of binary functions the target corpus.

In this paper, all binaries are compiled from source code, not generated by manual assembly. A binary B consists of a set of functions \(f_1, f_2, \ldots , f_u\). Given two binaries \(B^{q}=\left\{ f_{1}^{q},f_{2}^{q}, \ldots , f_{m}^{q}\right\}\) and \(B^{t}=\left\{ f_{1}^{t},f_{2}^{t}, \ldots , f_{n}^{t}\right\}\), we assume they have k pairs of matching functions: \([f_{1}^{q},f_{1}^{t}]\), \([f_{2}^{q},f_{2}^{t}]\), \(\ldots\), \([f_{k}^{q},f_{k}^{t}]\), where \(k\le \min (m, n)\), and the rest of the functions do not match. For any query function \(f_{i}^{q}\) in \(B^{q}\), the BCSD tool should sort the functions in the binary \(B^{t}\) by their similarity to \(f_{i}^{q}\), as sketched below.
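To make the ranking step concrete, the following minimal Python sketch sorts the functions of \(B^{t}\) against a query function, assuming some learned scoring function (such as the one produced by DeepDual-SD) is available; the names are illustrative only.

```python
def rank_target_functions(f_q, target_functions, similarity):
    """Sort the functions of the target binary B^t by their similarity
    to the query function f_q (highest score first)."""
    return sorted(target_functions, key=lambda f_t: similarity(f_q, f_t), reverse=True)
```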

3.2 Overview

There are neural network-based methods that handle the BCSD problem well. However, these methods ignore the joint importance of the semantics and the structure of binary code. Furthermore, not all parts of a program are equally relevant, and single-attribute encoding cannot recognize the different relevance of each part of the binary [31, 32]. These problems affect BCSD accuracy. To improve BCSD performance, we design an architecture that addresses the dual attributes mentioned above by integrating semantic attribute embedding, structural attribute embedding and an attribute fusion mechanism. The proposed architecture is named deep dual attribute-aware embedding for binary code similarity detection (DeepDual-SD). In DeepDual-SD, the semantic embedding module extracts n-gram features from the function for sentence modeling, inspired by NLP techniques, and the structural embedding module measures the similarity between two input attributed control flow graphs via a graph embedding network. The fusion module combines both attribute representations, paying more attention to the features related to the specific application, which helps the model understand the binary.

The binary function \(f_{i}\) is described as \(f_{i}= Fusion(p_{i},g_{i})\), where \(p_{i}\) is the semantic attribute representation and \(g_{i}\) is the structural attribute representation. The architecture of DeepDual-SD is shown in Fig. 2.

Fig. 2

Overview of the DeepDual-SD

The key points of our approach are described in detail as follows.

3.3 Semantic Attribute Embedding

Challenges. When using semantic features to capture the semantic information of binary functions, several challenges arise. First, instruction explosion. In assembly language, operand addressing includes direct addressing, immediate addressing, register-relative addressing and other modes. Programs compiled by different compilers generate countless immediate values and memory addresses; the large number of instructions containing such essentially random numbers and addresses leads to an instruction explosion problem, with too many distinct instructions and a low repetition rate. Second, the out-of-vocabulary (OOV) problem: when a trained model has to convert an instruction that never appeared during training into a vector, embedding generation for that instruction fails. Third, the problem of making the machine understand and learn the semantic meaning of the code and express it as a suitable embedding vector.

Preprocess. In semantic attribute embedding, we use assembly instructions as tokens. An instruction token consists of the instruction mnemonic and the operands. In response to the first two challenges, instruction explosion and OOV, we formulate the following rules when preprocessing instruction inputs: (1) When the value of a numeric constant is above a threshold (we use 0x400 in our experiments), replace the numeric constant with \(\left\langle \texttt {MAXVALUE} \right\rangle\). (2) When there is a string reference, replace it with \(\left\langle \texttt {STRING} \right\rangle\). (3) Replace memory address strings with \(\left\langle \texttt {ADDR} \right\rangle\). (4) Replace transfer (jump) addresses with \(\left\langle \texttt {LOCAL} \right\rangle\). (5) For a function call instruction, replace the called function name with \(\left\langle \texttt {FUNCTION} \right\rangle\).

We take the code example in Table 1: the left column shows the original assembly code, and the right column shows the preprocessed result.

Table 1 The example of semantic attribute embedding preprocessing
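The five rules above can be sketched as a small normalization pass over the disassembly. The regular expressions and mnemonic names below are illustrative assumptions (real operand parsing is performed on the IDA Pro output), not the exact implementation:

```python
import re

def normalize_operand(op, value_threshold=0x400):
    """Apply rules (1)-(4) to a single operand string."""
    if re.fullmatch(r"(0x[0-9a-fA-F]+|\d+)", op):        # rule (1): numeric constant
        return "<MAXVALUE>" if int(op, 0) > value_threshold else op
    if op.startswith('"'):                                # rule (2): string reference
        return "<STRING>"
    if re.fullmatch(r"\[.*\]", op):                       # rule (3): memory address expression
        return "<ADDR>"
    if re.fullmatch(r"loc_[0-9A-Fa-f]+", op):             # rule (4): local transfer address
        return "<LOCAL>"
    return op

def normalize_instruction(mnemonic, operands):
    """Apply rule (5) for calls, otherwise normalize each operand."""
    if mnemonic in ("call", "bl", "jal"):                 # assumed call mnemonics per architecture
        return mnemonic + " <FUNCTION>"
    return mnemonic + " " + ",".join(normalize_operand(o) for o in operands)
```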

Semantic Embedding. After obtaining the initialization vectors of the instruction tokens, we construct the improved model ALBERT [33], based on BERT [34], to compute the semantic embedding representations in DeepDual-SD. Compared to a conventional long short-term memory (LSTM) [35]-based feature extraction framework, this model is better suited to the complicated task of binary code semantic analysis. The semantic function representation is \(p =\left[ p_{1}, p_{2}, \ldots , p_{128}\right]\) for binary function semantic embedding.
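The exact ALBERT configuration is not reproduced here; as a rough sketch of a transformer-style semantic encoder that outputs a 128-dimensional function vector (assuming a recent TensorFlow 2.x Keras with `MultiHeadAttention`, unlike the TF 1.14 setup in Sect. 4.1, and with vocabulary size and sequence length as placeholder values), one could write:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 5000   # assumed size of the normalized instruction-token vocabulary
MAX_LEN = 150       # assumed maximum number of instruction tokens per function
EMB_DIM = 128       # matches the 128-dimensional semantic representation p

def build_semantic_encoder():
    """One transformer-style encoder block followed by average pooling."""
    tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(tokens)
    attn = layers.MultiHeadAttention(num_heads=4, key_dim=EMB_DIM // 4)(x, x)
    x = layers.LayerNormalization()(x + attn)
    ff = layers.Dense(EMB_DIM, activation="relu")(x)
    x = layers.LayerNormalization()(x + ff)
    p = layers.GlobalAveragePooling1D()(x)      # function-level semantic vector p
    return tf.keras.Model(tokens, p, name="semantic_encoder")
```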

3.4 Structural Attribute Embedding

The similarity detection problem for binary functions can be converted into the problem of representing the attributed control flow graph of a binary function: each function is expressed as a high-dimensional vector such that the vectors of binary functions compiled from the same source code function lie close to each other.

Preprocess. We extract two kinds of features: block-level features, which describe feature information inside each basic block, and structural features, which describe the relationships between basic blocks in the entire CFG. The specific features are shown in Table 2.

Table 2 Features used by structural attribute embedding

This paper extracts 7 statistical block-level features and 2 inter-block structural features. For example, Fig. 3 shows the attributed CFG, with extracted features, of the function \(SSL\_do\_handshake\) from OpenSSL\(-\)1.0.1f compiled with gcc 5.4 for ARM and disassembled with IDA Pro [36].

Fig. 3

Attributed CFG of function \(SSL\_do\_handshake\) on ARM platforms
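As a rough illustration of the inter-block structural attributes, the sketch below computes two graph-level node statistics commonly used in attributed CFGs (for example by Gemini); they are assumptions for illustration and may not coincide with the exact entries of Table 2:

```python
import networkx as nx

def inter_block_features(cfg: nx.DiGraph):
    """Compute two example structural attributes per basic block:
    the number of offspring (reachable blocks) and betweenness centrality."""
    offspring = {n: len(nx.descendants(cfg, n)) for n in cfg.nodes}
    betweenness = nx.betweenness_centrality(cfg)
    return offspring, betweenness
```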

Structural Embedding. Existing structural-aware similarity methods rely on the CFG of a function and extract features for each basic block. The structural embedding model is implemented with the structure2vec [37] algorithm, as used in Gemini. The structure2vec architecture is illustrated in Fig. 4. The input of structure2vec is a CFG denoted as \(\langle {\mathcal {V}}, {\mathcal {E}}, x\rangle\), containing three elements; each node i in the graph has initial node features \(x_i\). Through the structural embedding network, each node in the graph is represented by a new feature vector \(g_{i}\).

The structural embedding vector is produced by Algorithm 1.

Fig. 4

Architecture of structure2vec for structural attribute embedding

The update procedure of the structural attribute embedding is visualized in Fig. 4. The mapping from \(x_{i}\) to \(g_{i}\) follows the idea of variational inference and is an iterative computation based on the graph topology. After a fixed number of iterations, the network computes a new feature representation for each node i that accounts for both graph characteristics and long-range interactions between node features. Initially, each node embedding is set to 0; a single iteration then updates each node as follows:

$$\begin{aligned} g_{i}^{(t)}=\tanh \left( W_{1} x_{i}+\sigma \left( \sum _{j \in {\mathcal {N}}_{i}} g_{j}^{(t-1)}\right) \right) , \forall i \in {\mathcal {V}}, \end{aligned}$$
(1)

where \({\mathcal {N}}_{i}\) denotes the set of direct neighbors of node i. Assuming there are T iterations in total, each iteration updates the features of all nodes in the graph synchronously, and each new round of computation builds on the results of the previous iteration. After T iterations, the vector \(g_{i}^{(T)}\) of each node contains information from all nodes within T hops of node i. \(x_{i}\) is the basic feature vector of the basic block, and \(W_{1}\) is a weight matrix that maps the d-dimensional block features into the p-dimensional embedding space, where p is the dimension of the final embedding vector. \(\sigma (\cdot )\) is an n-layer fully connected neural network, formulated as follows:

$$\begin{aligned} \sigma (l)=\underbrace{P_{1} \times \textrm{ReLU} \left( P_{2} \times \ldots \textrm{ReLU} \left( P_{n} l\right) \right) }_{n \text { levels}}, \end{aligned}$$
(2)

where \(P_i\) is a \(p \times p\) matrix, n is the embedding depth and ReLU is the rectified linear unit \(ReLU(x) = \max \left\{ 0,x \right\}\). In DeepDual-SD, ReLU is used as the non-linear activation function because it improves the learning dynamics of the network and significantly reduces the number of iterations required for convergence in deep networks. The graph-level embedding vector is computed by aggregating the node vectors as \(g = W_{2}\left( \sum _{i \in {\mathcal {V}}} g_{i}^{(T)}\right)\). Because the embedding size is 64, the structural embedding representation is \(g =\left[ g_{1}, g_{2}, \ldots , g_{64}\right]\).
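For concreteness, the iteration of Eqs. (1)-(2) and the final aggregation can be sketched in NumPy as follows; the matrix shapes are chosen so that the products are well defined and are assumptions rather than the exact implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def structure2vec_embed(X, A, W1, P, W2, T=5):
    """X: (num_nodes, d) basic-block features, A: (num_nodes, num_nodes) CFG adjacency,
    W1: (p, d), P: list [P1, ..., Pn] of (p, p) matrices for sigma(.), W2: (p, p)."""
    num_nodes, p_dim = X.shape[0], W1.shape[0]
    G = np.zeros((num_nodes, p_dim))            # g_i^(0) = 0 for every node
    for _ in range(T):
        neigh = A @ G                           # sum of neighbour embeddings per node
        sig = neigh
        for Pk in reversed(P[1:]):              # Pn, ..., P2, each followed by ReLU
            sig = relu(sig @ Pk.T)
        sig = sig @ P[0].T                      # final multiplication by P1, Eq. (2)
        G = np.tanh(X @ W1.T + sig)             # Eq. (1), applied to all nodes at once
    return W2 @ G.sum(axis=0)                   # aggregate node vectors into g
```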

3.5 Dual Attribute Fusion Mechanism

A gated attention-based network is proposed to incorporate the semantic information and the structural graph representation. It is a variant of attention-based networks, with an additional gate that determines the importance of the information in the function graph with respect to an instruction. Given the semantic and structural representations \(\left\{ p_{t}\right\} _{t=1}^{384}\) and \(\left\{ g_{t}\right\} _{t=1}^{64}\), the fusion module generates the function representation via a gated attention mechanism as follows:

$$\begin{aligned} \textbf{y}=H\left( p, \textbf{W}_{\textbf{H}}\right) \cdot T\left( g, \textbf{W}_{\textbf{T}}\right) +p\cdot \left( 1-T\left( g, \textbf{W}_{\textbf{T}}\right) \right) , \end{aligned}$$
(3)

where \(\textbf{y}\) is the output of the fusion module, H(.) is an attention layer with a non-linear unit and T(.) is a transform gate, which is also an attention layer. H(.) and T(.) additionally project the semantic and structural representations to compatible dimensions for the element-wise multiplication. \(\textbf{W}_\textbf{H}\) and \(\textbf{W}_\textbf{T}\) are weight parameters, which are trained together with the whole network.

Different from the gates in an LSTM or GRU [38], the additional gate is based on the current instruction representation and its attention vector over the graph representation, so it focuses on the relation between the semantic and structural representations.
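A minimal Keras sketch of this gated fusion, under the assumption that H(.) and T(.) are realized as dense layers with tanh and sigmoid activations and that p is linearly projected to the gate dimension (the fused size of 128 is an assumption), could look as follows:

```python
from tensorflow.keras import layers

def gated_fusion(p, g, out_dim=128):
    """Highway-style gated fusion of the semantic vector p and structural vector g."""
    h = layers.Dense(out_dim, activation="tanh")(p)        # H(p, W_H)
    t = layers.Dense(out_dim, activation="sigmoid")(g)     # T(g, W_T), the transform gate
    p_proj = layers.Dense(out_dim, use_bias=False)(p)      # align p with the gate size
    return h * t + p_proj * (1.0 - t)                      # Eq. (3)
```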

Finally, DeepDual-SD learns its parameters using a siamese architecture and compares the similarity of two functions Q and T using the formula:

$$\begin{aligned} \textrm{similarity}\left( f_{Q}, f_{T}\right) =\frac{\sum _{i=1}^{n} f_{Q}[i] \cdot f_{T}[i] }{\sqrt{\sum _{i=1}^{n} f_{Q}[i]^{2}} \cdot \sqrt{\sum _{i=1}^{n}f_{T}[i]^{2}}}, \end{aligned}$$
(4)

where f[i] denotes the \(i\)-th component of the vector f.

The network is trained on a set of K function pairs \((\overrightarrow{f_{Q}}, \overrightarrow{f_{T}})\). The final output of the siamese architecture is the similarity score between the two inputs. The ground truth labels are \(y_{i} \in \{+1,-1\}\), where \(+1\) indicates that the two input functions are similar and \(-1\) means they are dissimilar. The loss function is:

$$\begin{aligned} J=\sum _{ i =1}^{K}\left( \textrm{similarity}(\overrightarrow{f_{Q}}, \overrightarrow{f_{T}})-y_{i}\right) ^{2}. \end{aligned}$$
(5)

In our approach, the Adam optimizer is chosen to minimize the loss function of the network. The model parameters are fine-tuned with Adam, which has been shown to be an effective and efficient gradient-based optimization algorithm.
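Putting Eqs. (4) and (5) together, a minimal siamese training setup in Keras might look like the following sketch (assuming a recent TensorFlow 2.x Keras API rather than the TF 1.14 used in the experiments; `encoder` stands for the full DeepDual-SD embedding pipeline and is assumed, not defined here):

```python
from tensorflow.keras import layers, Model, optimizers

def build_siamese(encoder, input_shape):
    """Siamese wrapper: shared encoder, cosine similarity head, MSE loss against y in {+1, -1}."""
    in_q = layers.Input(shape=input_shape)
    in_t = layers.Input(shape=input_shape)
    f_q, f_t = encoder(in_q), encoder(in_t)
    sim = layers.Dot(axes=1, normalize=True)([f_q, f_t])        # cosine similarity, Eq. (4)
    model = Model([in_q, in_t], sim)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4),  # learning rate from Sect. 4.1
                  loss="mse")                                      # squared error, Eq. (5)
    return model
```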

4 Performance Evaluation

In this section, we describe the datasets and evaluation metrics used to evaluate the proposed method. We then study similarity detection performance in cross-compiler, cross-architecture and cross-version settings. Furthermore, the impact of different embedding dimensions on the results is discussed.

4.1 Datasets and Experimental Settings

Datasets. We collect two datasets to investigate the performance of our method on several tasks. The function pairs consist of binaries compiled from source code for which we have ground truth. The compiled object files are disassembled with IDA Pro [36] and then preprocessed for encoding, as described in Sect. 3.

OpenSSL Dataset. This dataset is obtained by compiling OpenSSL [39] (versions 1.0.1f and 1.0.1u). The compiler is set to emit code for the ARM, MIPS and x86 platforms. Compilation is done with gcc\(-\)5.4 using the four optimization levels O0-O3. We obtain a total of 66964 function pairs.

Debian Dataset. This dataset is drawn from the Debian package repository, where we directly collected binaries from deb packages [40]. We collected packages of different versions from Debian 7.11, Debian 8.11 and Debian 9.11, grouped each binary with its closest version as a pair, and obtained 93324 pairs in total. The pairs are divided into two parts by the following rule: pairing two binary functions that originate from the same source code yields a similar pair, while randomly pairing functions that do not derive from the same source code yields a dissimilar pair.

We split each dataset into training, validation and testing sets (8:1:1). Specifically, we generate similar pairs associated with the training label \(\left\langle +1 \right\rangle\) and dissimilar pairs with the training label \(\left\langle -1 \right\rangle\).
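For illustration, a sketch of this pair-generation rule (the grouping of binaries by source-level function identity is assumed to be available from the compilation ground truth):

```python
import random

def make_pairs(groups, num_dissimilar):
    """groups: dict mapping a source-level function id to the list of its compiled variants."""
    pairs = []
    for variants in groups.values():            # similar pairs: same source function, label +1
        for i, a in enumerate(variants):
            for b in variants[i + 1:]:
                pairs.append((a, b, +1))
    keys = list(groups)
    for _ in range(num_dissimilar):             # dissimilar pairs: different source functions, label -1
        k1, k2 = random.sample(keys, 2)
        pairs.append((random.choice(groups[k1]), random.choice(groups[k2]), -1))
    return pairs
```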

Experimental Settings. We implemented our embedding model in Python using Keras 2.3 [41] and TensorFlow 1.14 [42]. The experiments are performed on a computer running Ubuntu 18.04 with a 64-bit 2.7 GHz Intel Core i7 CPU, 48 GB RAM and a GTX 1080 Ti GPU. In the following experiments, the networks are trained with the mean squared loss between the estimated similarity and the ground-truth label, and the network parameters are tuned with the Adam optimization algorithm [43] with learning rate 0.0001. During training, we measure the loss and AUC on the validation set and save the model that achieves the best validation AUC.

4.2 Evaluation Metrics

Identifying matching functions accurately is important for BCSD solutions. We therefore evaluate matching performance with Precision, Recall and AUC, which are common evaluation metrics in machine learning and information retrieval tasks.

A robust model should be evaluated not only by precision, the fraction of returned matches that are correct, but also by recall, the fraction of true target functions that are detected. Precision and Recall are formulated as follows:

$$\begin{aligned} Precision=\frac{TP}{{TP}+{FP}}, \quad Recall =\frac{TP}{{TP}+{FN}}, \end{aligned}$$
(6)

where TP, FP and FN denote the number of true positives, false positives and false negatives, respectively.

Similar to [21, 25], this paper also uses the Receiver Operating Characteristic curve (ROC curve) [44] and the Area Under the Curve (AUC) computed from the model predictions to measure the performance of our approach. The AUC summarizes the ROC curve over all possible classification thresholds; the higher the AUC value, the better the predictive performance of the algorithm.
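These metrics can be computed directly from the predicted similarity scores; the sketch below uses scikit-learn, with the decision threshold of 0.5 being an illustrative assumption rather than a value taken from the paper:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def evaluate(scores, labels, threshold=0.5):
    """scores: similarity values from Eq. (4); labels: ground truth in {+1, -1}."""
    preds = [1 if s >= threshold else -1 for s in scores]
    binary_labels = [1 if y == 1 else 0 for y in labels]
    return {
        "precision": precision_score(labels, preds, pos_label=1),
        "recall": recall_score(labels, preds, pos_label=1),
        "auc": roc_auc_score(binary_labels, scores),
    }
```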

4.3 Performance in Cross-Compiler BCSD

These experiments analyze binary code similarity detection across compiler optimization options. We use the same experimental environment configuration to compare DeepDual-SD with the following baseline methods, which are effective and have achieved good results in BCSD. Gemini [25] accumulates real-valued feature vectors through graph embedding and then computes the similarity of the feature vectors. SAFE [21] uses an NLP-based approach leveraging a self-attentive neural network to create function embeddings. The comparison results for cross-compiler BCSD are presented in Table 3.

Table 3 AUC (%) of matching functions for different methods (DeepDual-SD, Gemini and SAFE) in cross-optimization-level settings. O0, O1, O2 and O3 denote the four compiler optimization levels

From Table 3, DeepDual-SD achieves better results than the other methods except for the cross-optimization-level O0 vs. O1 setting on the x86 instruction set. The results of DeepDual-SD on the x86 architecture are 96.4, 95.6 and 97.5% for O1 vs. O2, O1 vs. O3 and O2 vs. O3, respectively. DeepDual-SD gives relative improvements of 3.13, 4.50 and 3.64% over the semantic-based method SAFE on MIPS O1 vs. O2, x86 O0 vs. O3 and ARM O1 vs. O3. Compared with the structure-based method Gemini, DeepDual-SD performs better on all O0 to O3 similarity comparisons, and it outperforms both Gemini and SAFE on the MIPS and x86 architectures. These results indicate that DeepDual-SD is feasible across various architectures. They also show that the dual attribute-based method performs better than the single-attribute approaches.

4.4 Performance in Cross-Architecture BCSD

Since DeepDual-SD also targets binary code similarity detection across instruction sets, we use the same experimental configuration here. More specifically, we match every function in the OpenSSL binaries compiled for one architecture (i.e., x86, MIPS or ARM) against the function with the same name in binaries compiled for another architecture. The result is shown in Fig. 5.

Fig. 5

ROC curves for different approaches on OpenSSL binaries compiled for ARM, MIPS and x86

As can be seen from Fig. 5, compared to Gemini and SAFE, the DeepDual-SD proposed in this paper achieves the best results in all three sets of experiments. In the MIPS vs. ARM comparison, our method achieves the best performance; the result of DeepDual-SD is 8.25% higher than SAFE and 3.23% higher than Gemini on average. The results in this section show that DeepDual-SD is superior to the other baseline methods on cross-architecture tasks with the same settings.

4.5 Performance in Cross-Version BCSD

When a program needs to be updated (e.g., to patch vulnerabilities and errors), a new version is released in binary format, usually without disclosing the details of the changes. It is therefore of great interest to understand the differences between two versions of a program, and binary cross-version similarity analysis is one of the most useful techniques for discovering these differences. In this section, we evaluate the performance of DeepDual-SD on the Debian dataset, as shown in Fig. 6.

We can see that DeepDual-SD performs slightly better than Gemini and SAFE when the version gap is small, and clearly better when the version gap is large. The AUC of DeepDual-SD improves over Gemini by 0.72% and 0.71% from Debian v7 to v8 and from v8 to v9, respectively, and DeepDual-SD outperforms Gemini by 1.97% when comparing binaries from v7 to v9. In addition, Gemini is also better than SAFE across the three versions; for example, Gemini outperforms SAFE by over 0.69% on average. This shows that the structural feature captured by the structural attribute embedding network is a very strong feature for cross-version BCSD.

Fig. 6

ROC curves comparing different approaches on binaries from Debian 7.11, Debian 8.11 and Debian 9.11

4.6 Discussion

For DeepDual-SD, the purpose of dual-attribute embedding is to strengthen the understanding of the connection between the semantics and the structure of binary code when comparing similarity. For binary code similarity analysis, the way in which embedding vectors are generated affects accuracy. Compared to a single-attribute embedding vector, a multi-attribute embedding vector contains more feature dimensions and can automatically adjust the influence of each feature according to the similarity task, giving better adaptability. The experiments show that the dual-attribute embedding vector achieves better results than single-attribute embedding models; hence, generating a comprehensive function representation is more suitable for BCSD. In most scenarios, DeepDual-SD obtains better results than the baseline methods, which shows that DeepDual-SD has greater detection ability and performs better than state-of-the-art DNN-based BCSD methods.

5 Conclusion

In this paper, we proposed a deep dual attribute-aware embedding method for BCSD. To the best of our knowledge, using dual attribute-aware embedding to automatically learn discriminative function features to improve BCSD performance is a pioneering approach. Compared with state-of-the-art baseline methods, the new method is more effective and efficient in terms of detection quality in most cases.

Future work will focus on embedding methods suited to smaller datasets for BCSD tasks. Moreover, in the real world, training-based approaches are already applied in vulnerability-mining products combined with federated learning. Real-world datasets are decentralized across multiple client devices (e.g., IoT devices), which makes such training more practical. We are applying our approach to federated learning to achieve better detection capabilities.