Dual-Component Deep Domain Adaptation: A New Approach for Cross Project Software Vulnerability Detection

Nguyen, Van; Le, Trung; de Vel, Olivier; Montague, Paul; Grundy, John; Phung, Dinh

doi:10.1007/978-3-030-47426-3_54

Van Nguyen¹⁴,
Trung Le¹⁴,
Olivier de Vel¹⁵,
Paul Montague¹⁵,
John Grundy¹⁴ &
…
Dinh Phung¹⁴

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12084))

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

5968 Accesses
9 Citations

Abstract

Owing to the ubiquity of computer software, software vulnerability detection (SVD) has become an important problem in the software industry and computer security. One of the most crucial issues in SVD is coping with the scarcity of labeled vulnerabilities in projects that require the laborious manual labeling of code by software security experts. One possible solution is to employ deep domain adaptation (DA) which has recently witnessed enormous success in transferring learning from structural labeled to unlabeled data sources. Generative adversarial network (GAN) is a technique that attempts to bridge the gap between source and target data in the joint space and emerges as a building block to develop deep DA approaches with state-of-the-art performance. However, deep DA approaches using the GAN principle to close the gap are subject to the mode collapsing problem that negatively impacts the predictive performance. Our aim in this paper is to propose Dual Generator-Discriminator Deep Code Domain Adaptation Network (Dual-GD-DDAN) for tackling the problem of transfer learning from labeled to unlabeled software projects in SVD to resolve the mode collapsing problem faced in previous approaches. The experimental results on real-world software projects show that our method outperforms state-of-the-art baselines by a wide margin.

You have full access to this open access chapter, Download conference paper PDF

Identify Vulnerability Types: A Cross-Project Multiclass Vulnerability Classification System Based on Deep Domain Adaptation

Multi-view Pre-trained Model for Code Vulnerability Identification

An Efficient Unsupervised Domain Adaptation Deep Learning Model for Unknown Malware Detection

Keywords

1 Introduction

In the software industry, software vulnerabilities relate to specific flaws or oversights in software programs which allow attackers to expose or alter sensitive information, disrupt or destroy a system, or take control of a program or computer system. The software vulnerability detection problem has become an important issue in the software industry and in the field of computer security. Computer software development employs of a vast variety of technologies and different software development methodologies, and much computer software contains vulnerabilities.

This has necessitated the development of automated advanced techniques and tools that can efficiently and effectively detect software vulnerabilities with a minimal level of human intervention. To respond to this demand, many vulnerability detection systems and methods, ranging from open source to commercial tools, and from manual to automatic methods have been proposed and implemented. Most of the previous works in software vulnerability detection (SVD) [1, 8] have been developed based on handcrafted features which are manually chosen by knowledgeable domain experts who may have outdated experience and underlying biases. In many situations, handcrafted features normally do not generalize well. For example, features that work well in a certain software project may not perform well in other projects. To alleviate the dependency on handcrafted features, the use of automatic features in SVD has been studied recently [11,12,13]. These works have shown the advantages of automatic features over handcrafted features in the context of software vulnerability detection.

However, most of these approaches lead to another crucial issue in SVD research, namely the scarcity of labeled projects. Labelled vulnerable code is needed to train these models, and the process of labeling vulnerable source code is very tedious, time-consuming, error-prone, and challenging even for domain experts. This has led to few labeled projects compared with the vast volume of unlabeled ones. A viable solution is to apply transfer learning or domain adaptation which aims to devise automated methods that make it possible to transfer a learned model from the source domain with labels to the target domains without labels. Studies in domain adaptation can be broadly categorized into two themes: shallow [6] and deep domain adaptations [3, 14, 18]. These recent studies have shown the advantages of deep over shallow domain adaptation (i.e., higher predictive performance and capacity to tackle structural data). Deep domain adaptation encourages the learning of new representations for both source and target data in order to minimize the divergence between them [3, 14, 18]. The general idea is to map source and target data to a joint feature space via a generator, where the discrepancy between the source and target distributions is reduced. Notably, the work of [3, 18] employed generative adversarial networks (GANs) [4] to close the gap between source and target data in the joint space. However, most of aforementioned works mainly focus on transfer learning in the computer vision domain. The work of [16] is the first work which applies deep domain adaptation to SVD with promising predictive performance on real-world source code projects. The underlying idea is to employ the GAN to close the gap between the source and target domains in the joint space and enforce the clustering assumption [2] to utilize the information carried in the unlabeled target samples in a semi-supervised context.

GANs are known to be affected by the mode collapsing problem [5, 7, 10, 17]. In particular, the study in [17] recently studied the mode collapsing problem and further classified this into the missing mode problem i.e., the generated samples miss some modes in the true data, and the boundary distortion problem i.e., the generated samples can only partly recover some modes in the true data. It is certain that deep domain adaptation approaches that use the GAN principle will inherently encounter both the missing mode and boundary distortion problems. Last but not least, deep domain adaptation approaches using the GAN principle also face the data distortion problem. The representations of source and target examples in the joint feature space degenerate to very small regions that cannot preserve the manifold/clustering structure in the original space.

Our aim in this paper is to address not only deep domain adaptation mode collapsing problems but also boundary distortion problems when employing the GAN as a principle in order to close the gap between source and target data in the joint feature space. Our two approaches are: i) apply manifold regularization for enabling the preservation of manifold/clustering structures in the joint feature space, hence avoiding the degeneration of source and target data in this space; and ii) invoke dual discriminators in an elegant way to reduce the negative impacts of the missing mode and boundary distortion problems in deep domain adaptation using the GAN principle as mentioned before. We name our mechanism when applied to SVD as Dual Generator-Discriminator Deep Code Domain Adaptation Network (Dual-GD-DDAN). We empirically demonstrate that our Dual-GD-DDAN can overcome the missing mode and boundary distortion problems which is likely to happen as in Deep Code Domain Adaptation (DDAN) [16] in which the GAN was solely applied to close the gap between the source and target domains in the joint space (see the discussion in Sects. 2.3 and 3.3, and the visualization in Fig. 3). In addition, we incorporate the relevant approaches – minimizing the conditional entropy and manifold regularization with spectral graph – proposed in [16] to enforce the clustering assumption [2] and arrive at a new model named Dual Generator-Discriminator Semi-supervised Deep Code Domain Adaptation Network (Dual-GD-SDDAN). We further demonstrate that our Dual-GD-SDDAN can overcome the mode collapsing problem better than SCDAN in [16], hence obtaining better predictive performance.

We conducted experiments using the data sets collected by [13], that consist of five real-world software projects: FFmpeg, LibTIFF, LibPNG, VLC and Pidgin to compare our proposed Dual-GD-DDAN and Dual-GD-SDDAN with the baselines. The baselines consider to include VULD (i.e., the model proposed in [12] without domain adaptation), MMD, DIRT-T, DDAN and SCDAN as mentioned [16] and D2GAN [15] (a variant of the GAN using dual-discriminator to reduce the mode collapse for which we apply this mechanism in the joint feature space). Our experimental results show that our proposed methods are able to overcome the negative impact of the missing mode and boundary distortion problems inherent in deep domain adaptation approaches when solely using the GAN principle as in DDAN and SCDAN [16]. In addition, our method outperforms the rival baselines in terms of predictive performance by a wide margin.

2 Deep Code Domain Adaptation with GAN

2.1 Problem Statement

A source domain data set $S=\{(\varvec{x}_{1}^{S},y_{1}),\dots ,(\varvec{x}_{N_{S}}^{S},y_{N_{S}})\}$ where $y_{i}\in \left\{ -1,1\right\} $ (i.e., 1: vulnerable code and −1: non-vulnerable code) and $\varvec{x}_{i}^{S}=[\varvec{x}_{i1}^{S},\dots ,\varvec{x}_{iL}^{S}]$ is a sequence of L embedding vectors, and the target domain data set $T=\{\varvec{x}_{1}^{T},\dots ,\varvec{x}_{N_{T}}^{T}\}$ where $\varvec{x}_{i}^{T}=[\varvec{x}_{i1}^{T},\dots ,\varvec{x}_{iL}^{T}]$ is also a sequence of L embedding vectors. We wish to bridge the gap between the source and target domains in the joint feature space. This allows us to transfer a classifier trained on the source domain to predict well on the target domain.

2.2 Deep Code Domain Adaptation with a Bidirectional RNN

To handle sequential data in the context of domain adaptation of software vulnerability detection, the work of [16] proposed an architecture referred to as the Code Domain Adaptation Network (CDAN). This network architecture recruits a Bidirectional RNN to process the sequential input from both source and target domains (i.e., $\varvec{x}_{i}^{S}=[\varvec{x}_{i1}^{S},\dots ,\varvec{x}_{iL}^{S}]$ and $\varvec{x}_{i}^{T}=[\varvec{x}_{i1}^{T},\dots ,\varvec{x}_{iL}^{T}]$). A fully connected layer is then employed to connect the output layer of the Bidirectional RNN with the joint feature layer while bridging the gap between the source and target domains. Furthermore, inspired by the Deep Domain Adaptation approach [3], the authors employ the source classifier $\mathcal {C}$ to classify the source samples, the domain discriminator D to distinguish the source and target samples and propose Deep Code Domain Adaptation (DDAN) whose objective function is as follows:

$$\begin{aligned} \mathcal {J}\left( G,\,D,\,C\right) =\frac{1}{N_{S}}\sum _{i=1}^{N_{S}}\ell (C(G(\varvec{x}_{i}^{S})),y_{i})+\lambda (\frac{1}{N_{S}}\sum _{i=1}^{N_{S}}\log \,D(G(\varvec{x}_{i}^{S}))+\frac{1}{N_{T}}\sum _{i=1}^{N_{T}}\log \,[1-D(G(\varvec{x}_{i}^{T}))]) \end{aligned}$$

2.3 The Shortcomings of DDAN

We observe that DDAN suffers from several shortcomings. First, the data distortion problem (i.e., the source and target data in the joint space might collapse into small regions) may occur since there is no mechanism in DDAN to circumvent this. Second, since DDAN is based on the GAN approach, DDAN might suffer from the mode collapsing problem [5, 17]. In particular, [17] has recently studied the mode collapsing problem of GANs and discovered that they are also subject to i) the missing mode problem (i.e., in the joint space, either the target data misses some modes in the source data or vice versa) and ii) the boundary distortion problem (i.e., in the joint space either the target data partly covers the source data or vice versa), which makes the target distribution significantly diverge from the source distribution. As shown in Fig. 1, both the missing mode and boundary distortion problems simultaneously happen since the target distribution misses source mode 2, while the source distribution can only partly cover the target mode 2 in the target distribution and the target distribution can only partly cover the source mode 1 in the source distribution.

3 Dual Generator-Discriminator Deep Code Domain Adaptation

3.1 Key Idea of Our Approach

We employ two discriminators (namely, $D_{S}$ and $D_{T}$) to classify the source and target examples and vice versa and two separate generators (namely, $G_{S}$ and $G_{T}$) to map the source and target examples to the joint space respectively. In particular, $D_{S}$ produces high values on the source examples in the joint space (i.e., $G_{S}(\varvec{x}^{S})$) and low values on the target examples in the joint space (i.e., $G_{T}(\varvec{x}^{T})$), while $D_{T}$ produces high values on the target examples in the joint space (i.e., $G_{T}(\varvec{x}^{T})$) and low values on the source examples (i.e., $G_{S}(\varvec{x}^{S})$). The generator $G_{S}$ is trained to push $G_{S}\left( \varvec{x}^{S}\right) $ to the high value region of $D_{T}$ and the generator $G_{T}$ is trained to push $G_{T}(\varvec{x}^{T})$ to the high value region of $D_{S}$. Eventually, both $D_{S}(G_{S}(\varvec{x}^{S}))$ and $D_{S}(G_{T}(\varvec{x}^{T}))$ are possibly high and both $D_{T}(G_{S}(\varvec{x}^{S}))$ and $D_{T}(G_{T}(\varvec{x}^{T}))$ are possibly high. This helps to mitigate the issues of missing mode and boundary distortion since as in Fig. 1, if the target mode 1 can only partly cover the source mode 1, then $D_{T}$ cannot receive large values from source mode 1. Another important aspect of our approach is to maintain the cluster/manifold structure of source and target data in the joint space via the manifold regularization to avoid the data distortion problem.

3.2 Dual Generator-Discriminator Deep Code Domain Adaptation Network

To address the two inherent problems in the DDAN mentioned in Sect. 2.3, we employ two different generators $G_{S}$ and $G_{T}$ to map source and target domain examples to the joint space and two discriminators $D_{S}$ and $D_{T}$ to distinguish source examples against target examples and vice versa together with the source classifier $\mathcal {C}$ which is used to classify the source examples with labels as shown in Fig. 2. We name our proposed model as Dual Generator-Discriminator Deep Code Domain Adaptation Network (Dual-GD-DDAN).

Updating the Discriminators. The two discriminators $D_{S}$ and $D_{T}$ are trained to distinguish the source examples against the target examples and vice versa as follows:

$$\begin{aligned}&\min _{D_{S}}\biggl (\frac{\left( 1+\uptheta \right) }{N_{S}}\sum _{i=1}^{N_{S}}[-\log \,D_{S}(G_{S}(\varvec{x}_{i}^{S}))]+\frac{1}{N_{T}}\sum _{i=1}^{N_{T}}[-\log \,[1-D_{S}(G_{T}(\varvec{x}_{i}^{T}))]]\biggr )\end{aligned}$$

(1)

$$\begin{aligned}&\min _{D_{T}}\biggl (\frac{1}{N_{S}}\sum _{i=1}^{N_{S}}[-\log \,[1-D_{T}(G_{S}(\varvec{x}_{i}^{S}))]]+\frac{\left( 1+\uptheta \right) }{N_{T}}\sum _{i=1}^{N_{T}}[-\log \,D_{T}(G_{T}(\varvec{x}_{i}^{T}))]\biggr ) \end{aligned}$$

(2)

where $\uptheta >0$. Note that a high value of $\uptheta $ encourages $D_{S}$ and $D_{T}$ place higher values on $G_{S}\left( \varvec{x}^{S}\right) $ and $G_{T}\left( \varvec{x}^{T}\right) $ respectively.

Updating the Source Classifier. The source classifier is employed to classify the source examples with labels as: $\min _{\mathcal {C}}\,\frac{1}{N_{S}}\sum _{i=1}^{N_{S}}\ell (\mathcal {C}(G_{S}(\varvec{x}_{i}^{S})),y_{i})$, where $\ell $ specifies the cross-entropy loss function for the binary classification (e.g., using cross-entropy).

Updating the Generators. The two generators $G_{S}$ and $G_{T}$ are trained to i) maintain the manifold/cluster structures of source and target data in their original spaces to avoid the data distortion problem and ii) move the target samples toward the source samples in the joint space and resolve the missing mode and boundary distortion problems in the joint space.

To maintain the manifold/cluster structures of source and target data in their original spaces, we propose minimizing the manifold regularization term as: $\min _{G}\,\mathcal {M}(G_{S},G_{T})$ where $\mathcal {M}(G_{S},G_{T})$ is formulated as:

$$\begin{aligned} \mathcal {M}(G_{S},G_{T})&=\sum _{i,j=1}^{N_{S}}\mu _{ij}||G_{S}(\varvec{x}_{i}^{S})-G_{S}(\varvec{x}_{j}^{S})||^{2}+\sum _{i,j=1}^{N_{T}}\mu _{ij}||G_{T}(\varvec{x}_{i}^{T})-G_{T}(\varvec{x}_{j}^{T})||^{2} \end{aligned}$$

in which the weights are defined as $\mu _{ij}=\exp \{-||h(\varvec{x}_{i})-h(\varvec{x}_{j})||{}^{2}/(2\sigma ^{^{2}})\}$ with $h(\varvec{x})=\text {concat}(\overleftarrow{h_{L}}(\varvec{x}),\overrightarrow{h_{L}}(\varvec{x}))$ where $\overrightarrow{h_{L}}(\varvec{x})$ and $\overleftarrow{h_{L}}(\varvec{x})$ are the last hidden states of the bidirectional RNN with input $\varvec{x}$.

To move the target samples toward the source samples and resolve the missing mode and boundary distortion problems in the joint space, we propose minimizing the following objective function: $\min _{D}\,\mathcal {K}(G_{S},G_{T})$ where $\mathcal {K}(G_{S},G_{T})$ is defined as:

$$\begin{aligned} \mathcal {K}(G_{S},G_{T})&=\frac{1}{N_{S}}\sum _{i=1}^{N_{S}}[-\log \,D_{T}(G_{S}(\varvec{x}_{i}^{S}))]+\frac{1}{N_{T}}\sum _{i=1}^{N_{T}}[-\log \,D_{S}(G_{T}(\varvec{x}_{i}^{T}))] \end{aligned}$$

(3)

Moreover, the source generator $G_{S}$ has to work out the representation that is suitable for the source classifier, hence we need to minimize the following objective function:

$$ \min _{G_{S}}\,\frac{1}{N_{S}}\sum _{i=1}^{N_{S}}\ell (\mathcal {C}(G_{S}(\varvec{x}_{i}^{S})),y_{i}) $$

Finally, to update $G_{S}$ and $G_{T}$, we need to minimize the following objective function:

$$\begin{aligned} \frac{1}{N_{S}}\sum _{i=1}^{N_{S}}\ell (\mathcal {C}(G_{S}(\varvec{x}_{i}^{S})),y_{i})+\alpha \mathcal {M}(G_{S},G_{T})+\beta \mathcal {K}(G_{S},G_{T}) \end{aligned}$$

where $\alpha ,\,\beta >0$ are two non-negative parameters.

3.3 The Rationale for Our Dual Generator-Discriminator Deep Code Domain Adaptation Network Approach

Below we explain why our proposed Dual-GD-DDAN is able to resolve the two critical problems that occur with the DDAN approach. First, if $\varvec{x}_{i}^{S}$ and $\varvec{x}_{j}^{S}$ are proximal to each other and are located in the same cluster, then their representations $h(\varvec{x}_{i}^{S})$ and $h(\varvec{x}_{j}^{S})$ are close and hence, the weight $\mu _{ij}$ is large. This implies $G_{S}(\varvec{x}_{i}^{S})$ and $G_{S}(\varvec{x}_{j}^{S})$ are encouraged to be close in the joint space because we are minimizing $\mu _{ij}||G_{S}(\varvec{x}_{i}^{S})-G_{S}(\varvec{x}_{j}^{S})||^{2}$. This increases the chance of the two representations residing in the same cluster in the joint space. Therefore, Dual-GD-DDAN is able to preserve the clustering structure of the source data in the joint space. By using the same argument, we reach the same conclusion for the target domain.

Second, following Eqs. (1, 2), the discriminator $D_{S}$ is trained to encourage large values for the source modes (i.e., $G_{S}(\varvec{x}^{S}$)), while the discriminator $D_{T}$ is trained to produce large values for the target modes (i.e., $G_{T}(\varvec{x}^{T})$). Moreover, as in Eq. (3), $G_{s}$ is trained to move the source domain examples $\varvec{x}^{S}$ to the high-valued region of $D_{T}$ (i.e., the target modes or $G_{T}(\varvec{x}^{T})$) and $G_{T}$ is trained to move the target examples $\varvec{x}^{T}$ to the high-valued region of $D_{S}$ (i.e., the source modes or $G_{S}(\varvec{x}^{S})$). As a consequence, eventually, the source modes (i.e., $G_{S}(\varvec{x}^{S})$) and target modes (i.e., $G_{T}(\varvec{x}^{T})$) overlap, while $D_{S}$ and $D_{T}$ place large values on both source (i.e., $G_{S}(\varvec{x}^{S})$) and target (i.e., $G_{T}(\varvec{x}^{T})$) modes. The mode missing problem is less likely to happen since, as shown in Fig. 1, if the target data misses source mode 2, then $D_{T}$ cannot receive large values from source mode 2. Similarly, the boundary distortion problem is also less likely to happen since as in Fig. 1, if the target mode 1 can only partly cover the source mode 1, then $D_{T}$ cannot receive large values from source mode 1. Therefore, Dual-GD-DDAN allows us to reduce the impact of the missing mode and boundary distortion problems, hence making the target distribution more identical to the source distribution in the joint space.

3.4 Dual Generator-Discriminator Semi-supervised Deep Code Domain Adaptation Network

Our proposed model can be incorporated with minimizing the conditional entropy and using the spectral graph to inspire the smoothness to enforce the clustering assumption [2] proposed in [16] to form Dual Generator-Discriminator Semi-supervised Deep Code Domain Adaptation Network (Dual-GD-SDDAN). Please read our Supplementary Material for more technical details, available at https://app.box.com/s/aijcavbcp.

4 Experiments

In this section, firstly, we compare our proposed Dual-GD-DDAN with VulDeePecker without domain adaptation, MMD, D2GAN, DIRT-T and DDAN using the architecture CDAN proposed in [16]. Secondly, we do Boundary Distortion Analysis to further demonstrate the efficiency of our proposed Dual-GD-DDAN in alleviating the boundary distortion problem caused by using the GAN principle. Finally, we compare our Dual-GD-SDDAN and SCDAN introduced in [16].

4.1 Experimental Setup

Experimental Data Set. We use the real-world data sets collected by [13], which contain the source code of vulnerable and non-vulnerable functions obtained from five real-world software projects, namely FFmpeg (#vul-funcs: 187, #non-vul-funcs: 5,427), LibTIFF (#vul-funcs: 81, #non-vul-funcs: 695), LibPNG (#vul-funcs: 43, #non-vul-funcs: 551), VLC (#vul-funcs: 25, #non-vul-funcs: 5,548) and Pidgin (#vul-funcs: 42, #non-vul-funcs: 8,268) where #vul-funcs and #non-vul-funcs is the number of vulnerable and non-vulnerable functions respectively. The data sets contain both multimedia (FFmpeg, VLC, Pidgin) and image (LibPNG, LibTIFF) application categories. In our experiment, data sets from the multimedia category were used as the source domain whilst data sets from the image category were used as the target domain (see Table 1).

Model Configuration. For training the eight methods – VulDeePecker, MMD, D2GAN, DIRT-T, DDAN, Dual-GD-DDAN, SCDAN and Dual-GD-SDDAN – we use one-layer bidirectional recurrent neural networks with LSTM cells where the size of hidden states is in $\{128,256\}$ for the generators. For the source classifier and discriminators, we use deep feed-forward neural networks with two hidden layers in which the size of each hidden layer is in $\{200,300\}$. We embed the opcode and statement information in the $\{150,150\}$ dimensional embedding spaces respectively (see our Supplementary Material for Data Processing and Embedding, available at https://app.box.com/s/aijcavbcp). We employ the Adam optimizer with an initial learning rate in $\{10^{-3},10^{-4}\}$. The mini-batch size is 64. The trade-off parameters $\upalpha ,\,\upbeta ,\,\upgamma ,\,\uplambda $ are in $\{10^{-1},10^{-2},10^{-3}\}$, $\uptheta $ is in $\{0,1\}$ and $1/(2\upsigma ^{2})$ is in $\{2^{-10},2^{-9}\}$.

We split the data of the source domain into two random partitions containing 80% for training and 20% for validation. We also split the data of the target domain into two random partitions. The first partition contains 80% for training the models of VulDeePecker, MMD, D2GAN, DIRT-T, DDAN, Dual-GD-DDAN, SCDAN and Dual-GD-SDDAN without using any label information while the second partition contains 20% for testing the models. We additionally apply gradient clipping regularization to prevent over-fitting in the training process of each model. We implement eight mentioned methods in Python using Tensorflow which is an open-source software library for Machine Intelligence developed by the Google Brain Team.

4.2 Experimental Results

Code Domain Adaptation for a Fully Non-labeled Target Project.

We investigate the performance of our proposed Dual-GD-DDAN compared with other methods including VulDeePecker (VULD) without domain adaptation [12], DDAN [16], MMD [14], D2GAN [15] and DIRT-T [18] with VAP applied in the joint feature layer using the architecture CDAN introduced in [16]. The VulDeePecker method is only trained on the source data and then tested on the target data, while the MMD, D2GAN, DIRT-T, DDAN and Dual-GD-DDAN methods employ the target data without using any label information for domain adaptation.

Table 1. Performance results in terms of false negative rate (FNR), false positive rate (FPR), Recall, Precision and F1-measure of VulDeePecker (VULD), MMD, D2GAN, DIRT-T, DDAN and Dual-GD-DDAN for predicting vulnerable and non-vulnerable code functions on the testing set of the target domain (Best performance in bold).

Full size table

In Table 1, the experimental results show that our proposed Dual-GD-DDAN achieves a higher performance for detecting vulnerable and non-vulnerable functions for most performance measures, including FNR, FPR, Recall, Precision and F1-measure in almost cases of the source and target domains, especially for F1-measure. Particularly, our Dual-GD-DDAN always obtains the highest F1-measure in all cases. For example, for the case of the source domain (FFmpeg) and target domain (LibPNG), Dual-GD-DDAN achieves an F1-measure of 88.89% compared with an F1-measure of 84.21%, 84.21%, 80%, 77.78% and 75% obtained with DDAN, DIRT-T, D2GAN, MMD and VulDeePecker respectively.

Boundary Distortion Analysis

Quantitative Results. To quantitatively demonstrate the efficiency of our proposed Dual-GD-DDAN in alleviating the boundary distortion problem caused by using the GAN principle, we reuse the experimental setting in Sect. 5.2 [17]. The basic idea is, given two data sets $S_{1}$ and $S_{2}$, to quantify the degree of cover of these two data sets. We train a classifier $\mathcal {C}_{1}$ on $S_{1}$, then test on $S_{2}$ and another classifier $\mathcal {C}_{2}$ on $S_{2}$, then test on $S_{1}$. If these two data sets cover each other well with reduced boundary distortion, we expect that if $\mathcal {C}_{1}$ predicts well on $S_{1}$, then it should predict well on $S_{2}$ and vice versa if $\mathcal {C}_{2}$ predicts well on $S_{2}$, then it should predict well on $S_{1}$. This would seem reasonable since if boundary distortion occurs (i.e., assume that $S_{2}$ partly covers $S_{1}$), then $\mathcal {C}_{2}$ trained on $S_{2}$ would struggle to predict $S_{1}$ well which is much larger and possibly more complex. Therefore, we can utilize the magnitude of the accuracies and the accuracy gap of $\mathcal {C}_{1}$ and $\mathcal {C}_{2}$ when predicting their training and testing sets to assess the severity of the boundary distortion problem.

Table 2. Accuracies obtained by the DDAN and Dual-GD-DDAN methods when predicting vulnerable and non-vulnerable code functions on the source and target domains. Note that tr src, ts tar, tr tar, ts src, and acc gap are the shorthands of train source, test target, train target, test source, and accuracy gap respectively. For the accuracy gap, a smaller value is better.

Full size table

Inspired by this observation, we compare our Dual-GD-DDAN with DDAN using the representations of the source and target samples in the joint feature space corresponding to their best models. In particular, for a given pair of source and target data sets and for comparing each method, we train a neural network classifier on the best representations of the source data set in the joint space, then predict on the source and target data set and do the same but swap the role of the source and target data sets. We then measure the difference of the corresponding accuracies as a means of measuring the severity of the boundary distortion. We choose to conduct such a boundary distortion analysis for two pairs of the source (FFmpeg and Pidgin) and target (LibPNG) domains. As shown in Table 2, all gaps obtained by our Dual-GD-DDAN are always smaller than those obtained by DDAN, while the accuracies obtained by our proposed method are always larger. We can therefore conclude that our Dual-GD-DDAN method produces a better representation for source and target samples in the joint space and is less susceptible to boundary distortion compared with the DDAN method.

Visualization. We further demonstrate the efficiency of our proposed Dual-GD-DDAN in alleviating the boundary distortion problem caused by using the GAN principle. Using a t-SNE [9] projection, with perplexity equal to 30, we visualize the feature distributions of the source and target domains in the joint space. Specifically, we project the source and target data in the joint space (i.e., $G\left( \varvec{x}\right) $) into a 2D space with domain adaptation (DDAN) and with dual-domain adaptation (Dual-GD-DDAN). In Fig. 3, we observe these cases when performing domain adaptation from a software project (FFmpeg) to another (LibPNG). As shown in Fig. 3,with undertaking domain adaptation (DDAN, the left figure) and dual-domain adaptation (Dual-GD-DDAN, the right figure), the source and target data sampled are intermingled especially for Dual-GD-DDAN. However, it can be observed that DDAN when solely applying the GAN is seriously vulnerable to the boundary distortion issue. In particular, in the clusters/data modes 2, 3 and 4 (the left figure), the boundary distortion issue occurs since the blue data only partly cover the corresponding red ones (i.e., the source and target data do not totally mix up). Meanwhile, for our Dual-GD-DDAN, the boundary distortion issue is much less vulnerable, and the mixing-up level of source and target data is significantly higher in each cluster/data mode.

Quantitative Results of Dual Generator-Discriminator Semi-supervised Deep Code Domain Adaptation. In this section, we compare the performance of our Dual-GD-SDDAN with Semi-supervised Deep Code Domain Adaptation (SCDAN) [16] on four pairs of the source and target domains. In Table 3, the experimental results show that our Dual-GD-SDDAN achieves a higher performance than SCDAN for detecting vulnerable and non-vulnerable functions in terms of FPR, Precision and F1-measure in almost cases of the source and target domains, especially for F1-measure. For example, to the case of the source domain (VLC) and target domain (LibPNG), our Dual-GD-SDDAN achieves an F1-measure of 76.19% compared with an F1-measure of 72.73% obtained with SCDAN. These results further demonstrate the ability of our Dual-GD-SDDAN for dealing with the mode collapsing problem better than SCDAN [16], hence obtaining better predictive performance in the context of software domain adaptation.

Table 3. Performance results in terms of false negative rate (FNR), false positive rate (FPR), Recall, Precision and F1-measure of SCDAN and Dual-GD-SDDAN for predicting vulnerable/non-vulnerable code functions on the testing set of the target domain (Best performance in bold).

Full size table

5 Conclusion

Software vulnerability detection (SVD) is an important problem in the software industry and in the field of computer security. One of the most crucial issues in SVD is to cope with the scarcity of labeled vulnerabilities in projects that require the laborious labeling of code by software security experts. In this paper, we propose the Dual Generator-Discriminator Deep Code Domain Adaptation Network (Dual-GD-DDAN) method to deal with the missing mode and boundary distortion problems which arise from the use of the GAN principle when reducing the discrepancy between source and target data in the joint space. We conducted experiments to compare our Dual-GD-DDAN method with the state-of-the-art baselines. The experimental results show that our proposed method outperforms these rival baselines by a wide margin in term of predictive performances.

References

Almorsy, M., Grundy, J., Ibrahim, A.: Supporting automated vulnerability analysis using formalized vulnerability signatures. In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ASE, pp. 100–109. ACM (2012)
Google Scholar
Chapelle, O., Zien, A.: Semi-supervised classification by low density separation. In: AISTATS, pp. 57–64. Citeseer (2005)
Google Scholar
Ganin, Y., Lempitsky, V.: Unsupervised domain adaptation by backpropagation. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 1180–1189. ICML (2015)
Google Scholar
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)
Google Scholar
Goodfellow, I.: Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
Gopalan, R., Ruonan, L., Chellappa, R.: Domain adaptation for object recognition: an unsupervised approach. In: Proceedings of the 2011 International Conference on Computer Vision, pp. 999–1006. ICCV (2011)
Google Scholar
Hoang, Q., Nguyen, T.D., Le, T., Phung, D.: MGAN: training generative adversarial nets with multiple generators. In: International Conference on Learning Representation (2018)
Google Scholar
Kim, S., Woo, S., Lee, H., Oh, H.: VUDDY: a scalable approach for vulnerable code clone discovery. In: IEEE Symposium on Security and Privacy, pp. 595–614. IEEE Computer Society (2017)
Google Scholar
Laurens, V.M., Geoffrey, H.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
MATH Google Scholar
Le, T., Hoang, Q., Vu, H., Nguyen, T.D., Bui, H., Phung, D.: Learning generative adversarial networks from multiple data sources. In: 2019 International Joint Conference on Artificial Intelligence (2019)
Google Scholar
Le, T., et al.: Maximal divergence sequential autoencoder for binary software vulnerability detection. In: International Conference on Learning Representations (2019)
Google Scholar
Li, Z., et al.: VulDeePecker: a deep learning-based system for vulnerability detection. CoRR abs/1801.01681 (2018)
Google Scholar
Lin, G., et al.: Cross-project transfer representation learning for vulnerable function discovery. IEEE Trans. Industr. Inform. 14, 3289–3297 (2018)
Google Scholar
Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep adaptation networks. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, Lille, France, vol. 37, pp. 97–105 (2015)
Google Scholar
Nguyen, T.D., Le, T., Vu, H., Phung, D.: Dual discriminator generative adversarial nets. In: Advances in Neural Information Processing (2017)
Google Scholar
Nguyen, V., et al.: Deep domain adaptation for vulnerable code function identification. In: The International Joint Conference on Neural Networks (IJCNN) (2019)
Google Scholar
Santurkar, S., Schmidt, L., Madry, A.: A classification-based study of covariate shift in GAN distributions. In: Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 4480–4489. PMLR (2018)
Google Scholar
Shu, R., Bui, H., Narui, H., Ermon, S.: A DIRT-t approach to unsupervised domain adaptation. In: International Conference on Learning Representations (2018)
Google Scholar

Download references

Acknowledgement

This research was supported under the Defence Science and Technology Group’s Next Generation Technologies Program.

Author information

Authors and Affiliations

Monash University, Clayton, Australia
Van Nguyen, Trung Le, John Grundy & Dinh Phung
Defence Science and Technology Group, Canberra, Australia
Olivier de Vel & Paul Montague

Authors

Van Nguyen
View author publications
You can also search for this author in PubMed Google Scholar
Trung Le
View author publications
You can also search for this author in PubMed Google Scholar
Olivier de Vel
View author publications
You can also search for this author in PubMed Google Scholar
Paul Montague
View author publications
You can also search for this author in PubMed Google Scholar
John Grundy
View author publications
You can also search for this author in PubMed Google Scholar
Dinh Phung
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding authors

Correspondence to Van Nguyen , Trung Le , Olivier de Vel , Paul Montague , John Grundy or Dinh Phung .

Editor information

Editors and Affiliations

School of Information Systems, Singapore Management University, Singapore, Singapore
Hady W. Lauw
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong
Raymond Chi-Wing Wong
Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, Athens, Greece
Alexandros Ntoulas
School of Information Systems, Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Institute of Data Science, National University of Singapore, Singapore, Singapore
See-Kiong Ng
School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Sinno Jialin Pan

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Nguyen, V., Le, T., de Vel, O., Montague, P., Grundy, J., Phung, D. (2020). Dual-Component Deep Domain Adaptation: A New Approach for Cross Project Software Vulnerability Detection. In: Lauw, H., Wong, RW., Ntoulas, A., Lim, EP., Ng, SK., Pan, S. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2020. Lecture Notes in Computer Science(), vol 12084. Springer, Cham. https://doi.org/10.1007/978-3-030-47426-3_54

Download citation

DOI: https://doi.org/10.1007/978-3-030-47426-3_54
Published: 06 May 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-47425-6
Online ISBN: 978-3-030-47426-3
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Dual-Component Deep Domain Adaptation: A New Approach for Cross Project Software Vulnerability Detection

Abstract

Similar content being viewed by others

Identify Vulnerability Types: A Cross-Project Multiclass Vulnerability Classification System Based on Deep Domain Adaptation

Multi-view Pre-trained Model for Code Vulnerability Identification

An Efficient Unsupervised Domain Adaptation Deep Learning Model for Unknown Malware Detection

Keywords

1 Introduction