1 Introduction

Nowadays, cyberattacks exploiting security vulnerabilities are growing in both sophistication and number [1, 2]. In this context, vulnerability detection in source code is paramount to safeguard software applications against security threats. As evidenced by the Common Vulnerabilities and Exposures (CVE) Details database [3], 29,065 vulnerabilities were reported in 2023, and more than 2,000 were reported in early 2024, showing that the continuous growth observed in previous years is far from ending.

Traditional approaches to vulnerability detection, exemplified by rule-based methods [4] and signature-based techniques [5], typically rely on predefined rules or patterns indicative of known vulnerabilities. While effective in certain contexts, these approaches are often time-consuming to develop and must be adjusted to identify novel or previously unknown vulnerabilities. Such methods target specific known vulnerabilities, such as buffer overflows, SQL injection, or cross-site scripting. In recent years, machine learning [6] and, in particular, deep learning (DL) have emerged as a promising alternative [7], leveraging their inherent capability to autonomously discern intricate patterns and features from vast amounts of source code [8,9,10]. This trend has been reinforced by the advent of generative AI, with large language models (LLMs) showing promising results [11,12,13,14,15].

Fig. 1 Year-wise distribution of articles providing vulnerability datasets and their types. The term “Mix” denotes the use of both synthetic and real datasets

Nevertheless, the efficacy of AI-based vulnerability detection encounters a significant limitation: the availability of high-quality benchmark datasets [6, 16, 17]. As AI models require vast amounts of data during the training stage, the quality of the datasets used directly impacts the efficacy of these models’ parameter tuning. Despite numerous datasets being created and open-sourced in the literature (see Table 1 for more details), their quality often falls short, potentially leading to inaccurate, biased, or incomplete outcomes when they are used to train models for vulnerability detection. A frequent issue, for instance, is the low coverage of vulnerability types, with many datasets addressing only a handful of vulnerabilities (refer to Sect. 4.2).

After an exhaustive analysis of the state of the art, covering 92 vulnerability detection datasets extracted from 27 articles, our contribution highlights existing issues within such datasets and provides recommendations to overcome them. This analysis expands the understanding of vulnerability detection datasets’ intricacies and contributes to mitigating existing challenges. Finally, we apply such analysis as a guideline to create a new benchmark dataset and compare it with the state of the art in terms of features, showcasing its benefits.

The rest of this article is organised as follows. Section 2 provides the reader with the relevant background. Section 3 explains the search methodology followed to collect the articles used in Sect. 4 to provide a classification and description of the main findings related to existing datasets and their limitations. Section 5 describes our dataset creation guidelines and presents an example dataset crafted by us. Section 6 discusses the results of our analysis. Section 7 presents an overview of the related work. Finally, Sect. 8 concludes the article and identifies potential directions for future work.

2 Background

Vulnerability detection has become an increasingly important component of software security, enabling developers and security professionals to identify potential vulnerabilities more quickly and accurately than manual testing. Typically, AI-based vulnerability detection uses a model (e.g., a DL model) and frames the problem as a binary classification task in which input code is classified into one of two classes (e.g., vulnerable or secure) [18, 19]. During training, labelled source code is fed into the model to tune its parameters and create a suitable mapping of inputs (source code) to outputs (vulnerable or secure predictions). In this setup, DL models automatically extract vulnerability patterns, unlike traditional rule-based approaches [4]. To date, various DL models have been developed and proven to perform effectively. Zhou et al. [8] proposed the graph neural network-based model Devign, which embeds source code into graph data structures to better learn patterns. The large-scale pre-trained model CodeBERT [9] processes source code as plain text and supports multiple programming languages, such as Python, Java, and JavaScript. Given the rapid creation of LLMs through fine-tuning procedures [11,12,13,14], we refer the reader to other sources such as Huggingface [20] for more on AI models capable of detecting vulnerable code and bad programming practices.
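To make this setup concrete, the following minimal Python sketch (assuming the Hugging Face transformers and PyTorch packages and the publicly available microsoft/codebert-base checkpoint) tokenises a toy code snippet and scores it with a two-label sequence classification head. The head is randomly initialised, so the model must still be fine-tuned on labelled samples before its predictions are meaningful.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Binary classification head on top of a pre-trained code model:
# label 0 = secure, label 1 = vulnerable.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

code = "char buf[8]; strcpy(buf, user_input);"   # toy input, treated as plain text
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs).logits              # shape: (1, 2)
p_secure, p_vulnerable = torch.softmax(logits, dim=-1)[0].tolist()
print(f"P(secure)={p_secure:.2f}  P(vulnerable)={p_vulnerable:.2f}")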

Table 1 Overview of collected datasets. real: source code collected from real-world projects. synthetic: artificially generated code. mix: real+synthetic. PL: programming language

The effectiveness of AI-based vulnerability detection models highly depends on the quality of the data used for training [45]. By quality, we refer to the completeness of data and its verifiable provenance, consistency, orderliness (i.e., data should be organised and use standard notation and representation), and correctness [46]. Jimenez et al. [47] demonstrated that noisy historical data, i.e., data labelled secure that holds undiscovered vulnerabilities, can cause detection accuracy to decrease by over 20%. Garg et al. [48] confirmed that accounting for such noisy historical data can help improve a model’s performance on vulnerability prediction. Note that if the data is incomplete, inconsistent, or biased, the models may produce unreliable results, leading to a false sense of security and wasted resources. Therefore, ensuring data quality is essential for the success of vulnerability detection.

3 Search methodology

Given the issues observed in the current state of the art in software vulnerability detection datasets, we wanted to study them further. Thus, we planned our review using several features of the approach presented in [49]. First, we queried Scopus and Web of Science (WoS) with the query TITLE-ABS-KEY ( ( ( vulnerability AND detection AND dataset AND software ))) in June 2024, without time restriction. Next, we selected articles according to the criterion “Articles that are focused on software vulnerability detection and provide/create a dataset as part of their contribution”, and applied snowballing to find further relevant literature by searching the references of key articles and reports for additional citations [50]. After thorough screening, we ended up with a total of 27 articles, providing a total of 92 unique datasets. The information provided in these articles was used to classify the issues in existing datasets for AI-based vulnerability detection. The year-wise distribution of publications can be seen in Fig. 1. More details about the collected datasets and their sources can be seen in Table 1, and a specific analysis of each dataset is provided in Sect. 6.

4 Issues in existing datasets for vulnerability detection

This section classifies the main issues found in the literature, which are represented in Fig. 2. Issues such as small sampling size, data imbalance, limited programming language coverage, low coverage of vulnerability types, and bias in vulnerability distribution can markedly degrade model performance and generalisation, posing non-trivial challenges. Other issues, such as errors in raw data, mislabelling of source code, and the inclusion of noisy historical data, also affect the model but are not easily identifiable, as the relevant information may be missing from the source.

Fig. 2 Mindmap of the main vulnerability dataset issues found in the literature

Some of the issues can be alleviated through pre-processing techniques. Such techniques include common operations to prepare data before training models, such as cleaning data to ensure uniqueness, removing comments and empty lines, writing data into a unified format (e.g., JSON and PKL files), verifying the labels of source code, and splitting data for training, validation, and testing.
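As an illustration, the following Python sketch applies the pre-processing operations listed above to a list of (code, label) pairs. The C-style comment patterns, the 80/10/10 split, and the output file names are assumptions for illustration only.

import hashlib
import json
import random
import re


def clean(code: str) -> str:
    """Remove C-style comments and blank lines from a code sample."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)   # block comments
    code = re.sub(r"//.*", "", code)                     # line comments
    return "\n".join(line for line in code.splitlines() if line.strip())


def preprocess(samples, seed=42):
    """Deduplicate, clean, split, and write samples to unified JSON files."""
    seen, records = set(), []
    for code, label in samples:                          # label: 0 secure, 1 vulnerable
        cleaned = clean(code)
        digest = hashlib.sha256(cleaned.encode()).hexdigest()
        if digest in seen:                               # keep each sample only once
            continue
        seen.add(digest)
        records.append({"code": cleaned, "label": label, "sha256": digest})
    random.Random(seed).shuffle(records)
    n = len(records)
    splits = {"train": records[: int(0.8 * n)],
              "val": records[int(0.8 * n): int(0.9 * n)],
              "test": records[int(0.9 * n):]}
    for name, part in splits.items():
        with open(f"{name}.json", "w") as fh:
            json.dump(part, fh, indent=2)
    return splits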

Table 2 Labelling method of datasets provided by each reference

4.1 Small sampling size and data imbalance

Training a model requires a minimum amount of data. However, quantifying this amount is often impossible, as it depends on the model, its parameters, and the task. Usually, the richness of data representing the possible feature combinations for each class, together with the model’s performance, is used as a guideline to assess the quality of the training. Existing datasets often share issues regarding the number of samples, which translates into the above problems. In addition, these datasets exhibit an imbalance of data samples (i.e., a skewed ratio between benign source code samples and vulnerable ones), which has been shown to be a recurring issue in the literature [53,54,55].

Table 3 describes the characteristics of 70 vulnerability detection datasets provided by 24 articles (see footnote 1; please refer to [56] for additional material). As can be observed, the LibPNG and LibTIFF datasets in [18, 23] contain a low number of samples. More concretely, LibPNG only includes 621 source code samples in total. Concerning the sampling bias, except for the Synthetic dataset by [22], the Windows and Linux datasets by [25], and the FFmpeg dataset by [8], all datasets contain a higher percentage of secure code samples than vulnerable ones. The imbalance ratio is particularly extreme for the Asterisk dataset by [23], reaching a value of 324.82.

Table 3 Number of vulnerable and secure code samples of different datasets. The datasets provided by [21, 35, 38] are omitted due to restricted access. The imbalance ratio is included in the last column for the sake of completeness. \(Imbalance~Ratio=\frac{\#Secure}{\#Vulnerable}\)
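The imbalance ratio in the last column of Table 3 is a straightforward computation, as the sketch below shows; the counts used here are illustrative values chosen to reproduce the extreme ratio of 324.82 mentioned above, not the actual Asterisk figures.

def imbalance_ratio(n_secure: int, n_vulnerable: int) -> float:
    """Imbalance ratio as defined in the Table 3 caption: #secure / #vulnerable."""
    return n_secure / n_vulnerable

print(round(imbalance_ratio(16_241, 50), 2))  # 324.82, an extreme imbalance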

4.2 Low coverage of vulnerability types

Vulnerability detection models are expected to detect a broad range of vulnerability types, including common ones such as denial of service (DoS) and cross-site scripting (XSS), as well as less common ones such as HTTP response splitting and cross-site request forgery (CSRF) vulnerabilities [3]. However, existing datasets often have limited coverage of vulnerability types, which can lead to high rates of false negatives or an overall poor detection rate.

To showcase this situation, we summarise in Table 4 the vulnerability types of a total of 18 datasets, aggregating the data reported in [18, 23, 29, 34]. In total, 28 vulnerability types are covered across these 18 datasets. These vulnerabilities belong to 10 vulnerability families [3]: bypass something, XSS, DoS, directory traversal, code execution, gain privileges, overflow, gain information, SQL injection, and file inclusion. Other vulnerability families, such as HTTP response splitting and CSRF, are not covered by any dataset. In addition, note that each vulnerability family includes several vulnerability types; in the list, most vulnerability types belong to the DoS (ID: 4-13) and code execution (ID: 5-8, 15-23) families.

Table 4 List of vulnerability types included in datasets analysed in the referenced articles

4.3 Bias in vulnerability distribution

Vulnerability distribution refers to the numerical proportion of vulnerability types in a given dataset. Bias in the distribution could force a vulnerability prediction model to learn more from the majority types during training, so that the model performs poorly on minority types. While this could be partially alleviated by creating subsets of a dataset with an equal number of samples per class, datasets often come with poorly represented vulnerability types, which could result in very small balanced datasets. Therefore, such a situation could hinder the training procedure due to the combined presence of imbalance and, in some cases, an insufficient number of samples.

Figure 3 shows the vulnerability distribution of six datasets provided by [23] (please refer to [56] for more results). The first observation is that each dataset covers a different set of vulnerability types, echoing the coverage issue discussed in Sect. 4.2. Second, regardless of the dataset, vulnerabilities are distributed unevenly. Remarkably, in FFmpeg, 9 out of 15 vulnerability types account for less than 1% of the source code in the entire dataset.
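Measuring the vulnerability distribution of a dataset amounts to counting the share of each CWE among its samples, as in the following sketch; the CWE identifiers used here are placeholders and do not reproduce the distributions of Fig. 3.

from collections import Counter

# One CWE label per sample; placeholder values for illustration only.
cwe_labels = ["CWE-119", "CWE-119", "CWE-476", "CWE-20", "CWE-119", "CWE-787"]
counts = Counter(cwe_labels)
total = sum(counts.values())
for cwe, n in counts.most_common():
    print(f"{cwe}: {n / total:.1%}")   # share of the dataset covered by each CWE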

Fig. 3 Vulnerability distribution in each dataset provided by [23]. The numbers indicate the percentage of source code. For the sake of clarity, we avoid showing CWEs representing less than 1% of the dataset

4.4 Mixed data sources

Data for vulnerability detection can come from multiple sources, which can introduce inconsistencies and biases. For instance, the source code may exhibit different coding styles from different software developers. In addition, each source has its specific characteristics and vulnerabilities, as shown in Fig. 3e, making it difficult to generalise across different sources. Using data from multiple sources can also result in data that is difficult to integrate and may require extensive pre-processing. An example of a dataset using mixed data sources is the Devign dataset provided by [9], which is a mixture of the FFmpeg and QEMU datasets by [8] (see Table 2).

4.5 Mislabelling of source code

Mislabelling can severely degrade the performance of prediction models. Generally, code files are extracted from open-source projects on GitHub, and labels are manually assigned based on commit messages and descriptions in the NVD (see footnote 5). Both commit messages and NVD entries are manually curated and analysed, which is error-prone even for experienced developers [57]. Another common approach is to use static code analysis tools (Table 1), which is less accurate than manual labelling [58].

Figure 4 shows an example of mislabelling in the LibPNG dataset provided by [23]. The code file (including the green lines in the example) named “cve-2016-10087.c” indicates that the source code is vulnerable, allowing context-dependent attackers to cause a NULL pointer dereference. However, according to the commit message on GitHub (see footnote 2), the green lines are the patch of the corresponding vulnerable code (in red); thus, the label should be secure instead of vulnerable.

Fig. 4 An example of mislabelling

Fig. 5 Overview of the collection procedure of our dataset

4.6 Noisy historical data

Noisy historical data [47, 48] refers to the phenomenon whereby code labelled as secure might later be identified as vulnerable, given that most vulnerabilities are discovered much later than when they are introduced (e.g., zero-day vulnerabilities). For example, the decode_main_header() function in FFmpeg was recently reported to have a null pointer dereference flaw (see footnote 3). However, in data collected before such a report, this function appears as secure.

4.7 Errors in raw data

Errors in raw data can significantly affect the accuracy of vulnerability detection algorithms. We have found errors in existing datasets, including empty source code files, extra lines, and inconsistent file formats. The datasets provided by [23] include 435, 195, and 205 empty source code files in Asterisk, Pidgin, and VLC, respectively. If not identified, these empty files are processed normally to train a detection model. However, since they do not provide any pattern, they can mislead the model and prevent it from learning real patterns of secure and vulnerable source code.

All 18 datasets provided by [18], [23], [29], and [34] contain an extra line at the beginning of source code files. This extra line varies across datasets; examples from [34] include “\(\texttt {\}}\) EightBpsContext;”, “\(\texttt {\}}\)”, “*/”, and “\(\texttt {\}}\) AascContext;”. Note that this extra line cannot be removed during pre-processing without checking the source code files manually. For example, the data pre-processing procedure generally filters comments by locating comment marks, such as “\(\texttt {//}\)” and paired “\(\texttt {/* */}\)” for the C programming language. In addition, the work presented in [18] uses “.txt” for vulnerable code files and “.c” for secure code files; depending on the framework, reading such files can cause errors.
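A lightweight audit script can surface such raw-data errors before training; the sketch below flags empty files, suspicious leading fragments, and unexpected file extensions. The heuristics and the set of expected extensions are assumptions for illustration.

from pathlib import Path


def audit(dataset_dir: str):
    """Collect raw-data issues: empty files, stray first lines, odd extensions."""
    issues = {"empty": [], "stray_first_line": [], "odd_extension": []}
    for path in Path(dataset_dir).rglob("*"):
        if not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        if not text.strip():
            issues["empty"].append(path)
            continue
        first = text.splitlines()[0].strip()
        # A closing brace or comment terminator on the first line hints at a
        # fragment spilled over from another file.
        if first in {"}", "*/"} or first.startswith("} "):
            issues["stray_first_line"].append(path)
        if path.suffix not in {".c", ".cpp", ".h"}:   # expected extensions (assumed)
            issues["odd_extension"].append(path)
    return issues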

4.8 Covered programming languages

Existing datasets usually target specific programming languages, such as C/C++, Java, and Python. Table 1 demonstrates an abundance of data in C and C++ and a staggering absence of Java and Python, despite these being two of the most prominent programming languages. This limitation leads to a gap in evaluating a detection model’s generalisation across programming languages.

Table 5 Summary of our dataset. #with patch indicates that at least one patch exists for a given vulnerable code sample

5 Our dataset creation methodology

For the sake of completeness, we created a dataset providing all the features described in Table 7 and open-sourced it on Zenodo [59]. We collected vulnerable source code from GitHub projects that have CVEs registered in the NVD (see footnote 5) from 2002 to 2023. Figure 5 illustrates the workflow of our data collection process. First, we collect all the CVE records of each year from the NVD Data JSON Feeds [60]. Each CVE record contains detailed information, such as the CVE identifier, impact, and references. Second, we identify the associated GitHub repository (if any), along with the commit hash, based on the URL field in the reference data of each CVE record. Next, based on the commit message, we extract the code files before and after this commit. From the CWE database (cwe.mitre.org), we obtain the vulnerability type and detailed information for each CVE record. Once the code files in GitHub projects are collected, we identify the programming language using file extensions according to predefined rules. These rules assign specific file extensions to all programming languages known to GitHub (484 in total), ensuring accurate identification throughout our analysis process.
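The following sketch illustrates the first steps of this workflow, extracting GitHub commit references and CWE identifiers from an NVD feed file. It assumes the legacy NVD 1.1 JSON feed layout (CVE_Items, CVE_data_meta, reference_data), so field names should be adapted to the feed version actually consumed.

import json
import re

COMMIT_URL = re.compile(r"https://github\.com/([^/]+)/([^/]+)/commit/([0-9a-f]+)")


def extract_records(feed_path: str):
    """Yield (CVE, CWE, repository, commit) tuples from an NVD 1.1 JSON feed."""
    with open(feed_path) as fh:
        feed = json.load(fh)
    for item in feed.get("CVE_Items", []):
        cve_id = item["cve"]["CVE_data_meta"]["ID"]
        cwes = [d["value"]
                for pt in item["cve"]["problemtype"]["problemtype_data"]
                for d in pt["description"]]
        for ref in item["cve"]["references"]["reference_data"]:
            match = COMMIT_URL.match(ref["url"])
            if match:                                  # reference points to a commit
                owner, repo, commit = match.groups()
                yield {"cve": cve_id, "cwe": cwes,
                       "repo": f"{owner}/{repo}", "commit": commit}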

Compared to most existing datasets [8, 18, 23,24,25, 27, 28], which simply label code before a commit as vulnerable and code after it as non-vulnerable, we compare the code difference before and after the commit and apply several filtering rules to avoid mislabelling and duplication, ensuring a high-quality dataset. These rules, illustrated by the sketch after the list, include:

  1. Eliminating duplicated code: removing code that does not have substantive changes, usually caused by modifications to code formatting.

  2. Removing spurious entries: filtering out instances where changes are limited to renaming variables or functions, or to adding or deleting comments, rather than making significant alterations to the code logic.
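The sketch below illustrates how such filtering rules can be approximated: both versions of a function are normalised by stripping comments, masking identifiers, and collapsing whitespace, and the pair is discarded when the normalised forms coincide. The normalisation is a coarse heuristic for illustration (it masks keywords as well), not the exact rules used to build our dataset.

import re


def normalise(code: str) -> str:
    """Strip comments, mask identifiers, and collapse whitespace."""
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.S)             # block comments
    code = re.sub(r"//.*", "", code)                              # line comments
    code = re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", "ID", code)      # mask identifiers
    return re.sub(r"\s+", " ", code).strip()                      # collapse whitespace


def is_substantive_change(before: str, after: str) -> bool:
    """Keep only commits whose normalised code actually differs."""
    return normalise(before) != normalise(after)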

In addition, for each vulnerable code sample, we collect its available patch(es) to support repair guidance. These patches are considered secure code. Note that, according to [48], it is recommended to use the most recent patch to reduce noise. Nevertheless, we provide access to all available patches, giving users the flexibility to choose based on their specific requirements or preferences.

Our dataset covers eight programming languages: JavaScript, C, PHP, Java, C++, Python, Go, and Ruby. Table 5 shows the data size for each programming language, and Table 6 presents the coverage of the 2023 CWE top 25 most dangerous software weaknesses [61]. More details about the features, committer, commit date, and projects can be found in the dataset description files [59]. Note that this dataset will be periodically updated with new versions based on new data collected from GitHub or other sources, formatted and revised accordingly.

Table 6 Coverage of 2023 CWE top 25 most dangerous software weaknesses [61]

6 Discussion

Table 7 Mapping of the desired features across the methodology used in the articles providing vulnerability datasets. The datasets provided by [21, 35], and [38] are omitted due to restricted access

According to the previous analysis, we have extracted a set of desired features that should be present in vulnerability detection datasets. Table 7 summarises them, along with the corresponding mapping of the datasets analysed in this article. First, we consider provenance a critical aspect of verifiability and pre-processing. It ensures that the dataset contains correct versions of the code (easing the analysis and correction of secondary issues), that novel code excerpts can be used to update the dataset, and that multiple sources can be cleaned and tagged accordingly, which translates into a more robust dataset. In terms of adaptability, including multiple programming languages increases the dataset’s use cases and the models’ capability to identify vulnerable code in multiple contexts. While not critical, this is a desirable feature to ensure adoption in DevSecOps pipelines, where multiple users with different needs coexist. Some qualitative aspects that could affect the accuracy of models relate to the size and richness of the dataset. A well-represented number of classes in terms of secure and vulnerable code, and their balance, are also crucial to guarantee sound training procedures. In parallel, the same applies to the set of vulnerabilities, which should include as many types as possible, each with enough samples to be fed into the model. Finally, regarding usability, using correct and standard formatting increases the potential use across platforms and models and facilitates automated updating mechanisms. The latter could be enhanced by guidelines on how to correct or repair such vulnerabilities, together with patches for the corresponding code samples.

In terms of minimising the impact of issues in current datasets, certain problems (Sects. 4.1-4.3) are difficult to address through data pre-processing. However, there are techniques to mitigate their impact on model performance. For the sampling size issue, data augmentation [62] can be applied to increase the amount of data during training. To force the model to learn more from vulnerable code in imbalanced datasets, weighted loss functions, such as focal loss [55] and mean squared error loss [53], can be applied instead of the default cross-entropy loss. Concerning the low vulnerability coverage issue, one can consider merging several datasets that cover different vulnerabilities, or restrict detection to the vulnerability types that are included. Code refactoring [62] and adversarial code attacks [63] can help generate additional similar code samples without changing the semantics, reducing the bias in vulnerability distribution. Finally, despite the real-world prevalence of imbalanced datasets, such imbalance can hinder the robustness of learning systems [64], and thus several balancing strategies can be used [65]. For instance, sub-sampling of the dataset can provide balanced instances, yet only if the dataset has a sufficient number of samples for all classes and types to guarantee quality training.
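For instance, a class-weighted cross-entropy loss can be configured in a few lines, as in the following PyTorch sketch; the class counts used to derive the weights are illustrative and do not correspond to any dataset in Table 3.

import torch
import torch.nn as nn

# Illustrative class counts; minority (vulnerable) class gets the larger weight.
n_secure, n_vulnerable = 9_000, 1_000
weights = torch.tensor([1.0 / n_secure, 1.0 / n_vulnerable])
weights = weights / weights.sum()                     # normalise to sum to 1
criterion = nn.CrossEntropyLoss(weight=weights)       # class 0: secure, 1: vulnerable

logits = torch.randn(4, 2)                            # dummy model outputs
labels = torch.tensor([0, 0, 1, 0])
loss = criterion(logits, labels)                      # vulnerable errors cost more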

For the other issues (Sects. 4.4-4.8), the quality of existing datasets can be improved by designing an advanced pre-processing method to remove errors, check the correctness of labels, and reduce noise in the raw data. Both static analysis tools and manual expert review should be considered when assigning labels to collected data to avoid mislabelling. In addition, one should include as much information as possible in the dataset, such as vulnerability type, source project, and commit ID, instead of only labels, to allow for tracking and re-checking. Note that the errors mentioned in Sect. 4.7 do not exist in all studied datasets and may not cover all cases; a thorough check should be made when designing the operations for comprehensive pre-processing. To reduce noise, one should look into the latest commit messages related to each code file and modify the labels when necessary. However, this can only be done if the provenance of the data is properly managed and verified.
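A simple way to approximate this re-checking step is sketched below: it lists the commits that touched a given file after the collection date and flags those whose messages contain fix-related keywords. The repository path, date format, and keyword list are assumptions for illustration.

import subprocess

FIX_KEYWORDS = ("cve", "vulnerab", "overflow", "use-after-free", "security fix")


def later_fix_commits(repo_dir: str, file_path: str, collected_after: str):
    """Return commits touching file_path after a date whose messages look like fixes."""
    log = subprocess.run(
        ["git", "-C", repo_dir, "log", f"--since={collected_after}",
         "--pretty=%H %s", "--", file_path],
        capture_output=True, text=True, check=True).stdout
    return [line for line in log.splitlines()
            if any(k in line.lower() for k in FIX_KEYWORDS)]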

6.1 Limitations

As seen in Table 7, we crafted a dataset that fulfils all the desired features. Compared to most of the evaluated vulnerability datasets, a distinguishing aspect is the inclusion of potential solutions to help developers, enhancing the DevSecOps pipeline. This capability is currently being exploited by AI-based models such as GitHub Copilot [66] and other LLMs [20]. Nevertheless, it is worth mentioning some limitations of our approach. First, in terms of data collection, the sources used to retrieve data are GitHub repositories with CVEs registered in the NVD, potentially introducing bias and limiting the representation of diverse software vulnerabilities. Next, regarding code granularity, the dataset is constrained to function-level code, given its more frequent use in the existing literature. Finally, certain programming languages, notably Go and Ruby, have smaller data sizes within this dataset than others due to the scarcity of suitable repositories, potentially leading to imbalances in vulnerability distribution. Consequently, a model’s efficacy in detecting vulnerabilities may be compromised, particularly when applied to source code written in these programming languages.

Regarding current regulations and models, Stanford’s Center for Research on Foundation Models (CRFM) recently evaluated the major AI companies on their transparency [67]. The findings revealed a significant gap in the AI industry’s transparency. Our dataset contains the provenance references of the collected code samples, providing the data source transparency required by the European AI Act [68]. This, paired with the use of versioning through hashes, can provide additional protection against the use of adversarial data and similar mechanisms, as the contents of the dataset could be verified via tamper-proof mechanisms such as blockchain. Similarly, we could enforce verified code when providing repairing/patching suggestions to avoid using malicious code. This is a research line that we will pursue in the future.
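As a minimal illustration of such hash-based verification, the sketch below derives a per-record digest and a single dataset-level digest that any later modification of the records would invalidate; anchoring the digest in a blockchain or a signed release is outside the scope of this sketch.

import hashlib
import json


def dataset_digest(records: list[dict]) -> str:
    """Return a single digest over all records; any content change alters it."""
    record_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records)
    return hashlib.sha256("".join(record_hashes).encode()).hexdigest()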

7 Related work

The quality issues in existing datasets have drawn the attention of researchers [69]. Most datasets with vulnerable code from real-world projects are extracted from publicly accessible repositories on GitHub. These datasets include vulnerabilities reported in databases such as CVE (see footnote 4), the National Vulnerability Database (NVD, see footnote 5), and Exploit-DB (see footnote 6). It is important to note that these databases do not contain source code themselves; they provide vulnerability information and links to related projects (often GitHub repositories), from which the source code is retrieved. Synthetic datasets are generally extracted from the Software Assurance Reference Database (SARD, see footnote 7), where synthetic code is provided for various vulnerabilities. However, a recent comparison study [70] revealed that some databases, including the NVD, may introduce errors because they do not perform vulnerability testing on reports. As a result, the corresponding datasets may contain mislabelled data. For example, a recent study by Croft et al. [71] found that 20-71% of the vulnerability labels in four existing datasets are inaccurate, and 17-99% of the data is duplicated.

Hanif et al. [72] pointed out three open issues of datasets for vulnerability detection: the lack of labelled datasets, inconsistencies of datasets used for evaluation, and the impractical use of synthetic datasets, since synthetic data cannot reflect the true structure of real-world vulnerabilities. Nie et al. [73] specifically studied the causes of label errors in existing datasets and investigated their impact on model performance. Croft et al. [71] studied five quality issues in existing datasets [8, 28, 33]: accuracy, uniqueness, consistency, completeness, and currentness. Accuracy refers to label correctness, and the causes of mislabelling are categorised into irrelevant code changes, cleanup changes, and inaccurate vulnerability fix identification. Uniqueness refers to the duplication of data samples. Consistency refers to duplicated data with different labels, which is a type of mislabelling. Completeness refers to the reference information of data samples, and currentness focuses on whether a dataset is up to date. All five issues are covered by our study. Ding et al. [43] revealed three issues in existing datasets [9, 28, 40], including low label accuracy and high duplication rates.

In our article, we leverage a search methodology to provide a thorough state-of-the-art analysis. By collecting all previous efforts in terms of issue/challenge identification, we are able to provide a more fine-grained analysis of the existing datasets, as identified in the literature. This allowed us to identify and describe the current issues in a comprehensive manner, and to generate a new dataset that fulfils the identified desirable features while providing a quantitative comparison with the state of the art.

8 Conclusion

The performance of AI-based vulnerability detection models highly relies on the data used for training. Poor data quality can lead to unreliable results, false positives, and false negatives. In this article, we provide an in-depth discussion of existing vulnerability datasets and define eight quality issues. Furthermore, we provide actionable guidance to assist researchers in addressing these issues when using existing datasets and open-source a real-world dataset with all desired features extracted from our classification.

In future work, we will foster the pre-processing and updating capabilities of the datasets to avoid the issues discussed in our classification. We will also study the use of tamper-proof mechanisms and their impact on the trustworthiness and usability of the vulnerability detection datasets. The latter, paired with the adoption of novel regulations and guidelines such as the European AI Act, should guarantee the robustness and verifiability of datasets and training procedures, enhancing vulnerability detection practices.