ProvSec: Open Cybersecurity System Provenance Analysis Benchmark Dataset with Labels

System provenance forensic analysis has been studied by a large body of research. This area requires fine-grained data, such as system calls with their event fields, to track the dependencies among events. While prior work has proposed security datasets, we found that a dataset offering realistic attacks together with the details needed for high-quality provenance tracking was lacking. We created a new dataset of eleven vulnerability cases for system forensic analysis. It includes the full details of system calls, including syscall parameters, and uses realistic attack scenarios built on real software vulnerabilities and exploits. For each case, we created a pair of benign and adversary scenarios that are manually labeled for supervised machine-learning analysis. In addition, we present an algorithm that improves data quality in system provenance forensic analysis. We describe the dataset events in detail and demonstrate dependency analysis on our dataset cases.


Introduction
Cybersecurity incidents against our nation's government and commerce are soaring. In 2021 alone, critical infrastructures [1], companies [2], schools [3,4], and municipal agencies [5] suffered major ransomware attacks and data breaches. The cybersecurity company Kaseya estimated that ransomware compromised up to 1500 businesses during this time [6]. Industry statistics show that more than a thousand data breach cases have occurred annually since 2016 [7], and federal agencies experience more than 30,000 cyber incidents annually [8].
A dataset supporting this research should satisfy the following requirements:
• High-quality dependencies across events. To conduct provenance analysis, such datasets should have dependency information intact, so that the causality of events can be systematically reasoned about. Operating system calls with their required parameters are one example that qualifies for this purpose.
• Realistic threat behavior. The datasets should be based on realistic scenarios and real vulnerability exploits to reflect the characteristics and complexity of real software exploit attacks.
• Explicit data labeling to assist machine-learning tasks. The dataset should be labeled to be useful for validation purposes; machine-learning tasks with supervision also require accurate labels. We prepared a clearly labeled dataset where each scenario case is provided with two recorded runtime instances: a benign scenario without an attack and an adversary scenario with an attack occurring. This structure simplifies manual examination and data pre-processing for machine-learning-based approaches.
This paper proposes a new dataset for system provenance analysis, called ProvSec, to meet this need and address the shortcomings of past work. We use cyber attacks simulated in a cloud-based virtual environment to provide detailed, high-quality digital forensic artifacts. This paper is organized as follows. Section 2 presents the design of the dataset. Its evaluation is presented in Sect. 3. Section 4 presents the details of the shared dataset. Section 5 discusses data sharing and analysis. Section 6 presents related work. Finally, Sect. 7 concludes this paper.

Design of ProvSec
To meet the aforementioned requirements, we propose ProvSec, a cybersecurity provenance analysis dataset (Fig. 1) comprising the following.

Cloud Incidents
Virtual machines simulating the hosts of cyber attacks provide realistic and safe sandbox environments for cybersecurity experiments while preventing unintended damage, such as mistaken security operations during course modules. Virtualization technology is also useful for integrating the management of virtual environments and data transfer, so that forensic data can be collected, labeled, and managed conveniently.

Provenance Data
In practical incident response research and education, obtaining high-quality data is critical to successfully exposing attack sequences from piles of evidence. This is one important implementation goal. In real incidents, investigators may end up with an incomplete attack scenario for various reasons, such as an organization's cyber infrastructure being unprepared for potential incidents (e.g., lack of monitoring software and loss of logs).
ProvSec records and safely preserves system forensic event history and artifacts, so that we can analyze and recover the details of attack and defense system activities. This architecture offers cyber analysts and investigators realistic environments, data navigation interfaces, and quality forensic data. They can access these historical data through well-defined interfaces and available functions for manual and automated investigation.
Another important design issue of ProvSec is deciding which data to collect. Traditionally, provenance analysis research relies on operating system calls, which we also chose as the data format. A system call is a low-level interface invoked by software to use the services of the operating system kernel. Critical services for resources and privileges (e.g., memory, files, network, and processes) are performed via system calls. Therefore, this interface is important to monitor in order to understand attack activities and determine their causalities (e.g., a network intrusion → login → data copy).
Each event is stored as a JSON object with the 14 fields shown in Table 1. We selected and adopted several fields available in the sysdig event tracer for our dataset format. datetime is the event time relative to the start of the execution of the case. type is the system call name. We provide the names and process IDs of the current process and the parent process. The prog_args field shows the program parameters. File events have a file as an object whose name is given in the fd_name field and whose type is given in the fd_type field. Network events have the IP addresses and ports for the client and server sides, which are described in the fd_cip and fd_cport fields (client side) and in the fd_sip and fd_sport fields (server side), respectively. Depending on the system call type, each system call can generate one or two events (e.g., the start event and the end event of the system call); the order field shows this order (e.g., 0, 1).
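To illustrate, the sketch below builds and summarizes one event in this style. The fd_*, prog_args, datetime, type, and order field names come from the description above; the process-related field names (proc_name, proc_pid, pproc_name, pproc_pid) and all values are illustrative assumptions, not taken from the actual dataset.

```python
import json

# A hypothetical ProvSec-style event using the fields described in Table 1.
# Process-field names and all values are illustrative assumptions.
event_json = '''
{
  "datetime": 1.042317, "type": "execve",
  "proc_name": "sh",  "proc_pid": 87004,
  "pproc_name": "gs", "pproc_pid": 87003,
  "prog_args": "-c id",
  "fd_name": null, "fd_type": null,
  "fd_cip": null, "fd_cport": null,
  "fd_sip": null, "fd_sport": null,
  "order": 0
}
'''

event = json.loads(event_json)

def describe(ev):
    """Render a one-line summary: time, syscall, parent -> child."""
    return (f"[{ev['datetime']:.3f}] {ev['type']} "
            f"{ev['pproc_name']}({ev['pproc_pid']}) -> "
            f"{ev['proc_name']}({ev['proc_pid']})")

print(describe(event))   # [1.042] execve gs(87003) -> sh(87004)
```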

Provenance Analysis with Graph Improvements
These events are analyzed by event dependence analysis [17], known as the backtracking algorithm. We made several improvements to the original backtracking algorithm, shown in Algorithm 1, to address multiple practical issues. Note that this algorithm is general to any provenance data, making it applicable to related work.
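The core idea of backtracking can be sketched as follows: starting from a detection point, walk edges backward in time and keep only events that causally explain it. This is a minimal illustration of the general technique from [17], not the paper's Algorithm 1; the event representation is simplified to (time, source, destination) triples.

```python
# Minimal sketch of backward dependency (backtracking) analysis.
# Events are (time, src, dst) triples; real provenance events carry
# far more detail (see Table 1).

def backtrack(events, detection_node, detection_time):
    events = sorted(events, key=lambda e: e[0])   # order by time
    frontier = {detection_node}                   # objects still to explain
    horizon = {detection_node: detection_time}    # latest relevant time per node
    kept = []
    for t, src, dst in reversed(events):          # scan backward in time
        if dst in frontier and t <= horizon[dst]:
            kept.append((t, src, dst))
            frontier.add(src)
            # src only matters up to the time it influenced dst
            horizon[src] = max(horizon.get(src, 0), t)
    return list(reversed(kept))

events = [
    (1, "nc", "sh"),            # network process spawns a shell
    (2, "sh", "cp"),            # shell runs a copy command
    (3, "cron", "logrotate"),   # unrelated background activity
    (4, "cp", "report.xlsx"),   # detection point: sensitive file touched
]
graph = backtrack(events, "report.xlsx", 4)
print(graph)   # only the nc -> sh -> cp -> report.xlsx chain survives
```

The unrelated cron activity is pruned because it never reaches the detection point, which is exactly the reduction effect reported later for the backtrack graphs.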

Improvement #1: Incomplete Capture of All Processes
In the original backtracking system [17], the data recorder is integrated with the hypervisor. Therefore, it tracks all processes starting from the very first one. However, we use a data recorder (sysdig) on top of a COTS operating system (Ubuntu), which initiates recording after the machine has finished the booting sequence and loaded its daemons. This deployment issue causes the data recorder to miss the creation of certain processes.
While this issue can be partially alleviated by starting the recording software as early as possible in the booting stage, there is always a chance that some process creations are missed from the recording while the processes' later behavior is recorded. For practical usage, we handle this issue by including such programs in the graph using artificial process-creation events when their behavior is observed for the first time. As shown in lines 2-15 of Algorithm 1, when such a process's first behavior is processed, the algorithm creates an artificial fork (process creation) event.
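The idea can be sketched as follows: track which processes have a recorded creation event, and when an event arrives for a process without one, synthesize an artificial fork just before it. This is an illustrative reconstruction, not the paper's exact Algorithm 1; the event field names are assumptions.

```python
# Sketch of Improvement #1: synthesize a fork event for processes whose
# creation was missed by the recorder (e.g., started before recording).
# Field names (type, proc_pid, datetime) are illustrative assumptions.

def patch_missing_forks(events):
    known = set()     # pids with a recorded creation event
    patched = []
    for ev in events:
        pid = ev["proc_pid"]
        if ev["type"] in ("fork", "clone", "execve"):
            known.add(pid)
        elif pid not in known:
            # First observed behavior of an unrecorded process:
            # insert an artificial creation event just before it.
            patched.append({"type": "fork", "proc_pid": pid,
                            "datetime": ev["datetime"], "artificial": True})
            known.add(pid)
        patched.append(ev)
    return patched

trace = [
    {"type": "openat", "proc_pid": 42, "datetime": 0.10},  # pid 42 never forked
    {"type": "read",   "proc_pid": 42, "datetime": 0.11},
]
fixed = patch_missing_forks(trace)
print([e["type"] for e in fixed])   # ['fork', 'openat', 'read']
```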

Improvement #2: Limited Data Fields from a Data Recorder
We found that some recording fields from our data monitoring software, sysdig, are missing, because such data may not be available at the time the data are retrieved and stored inside the OS kernel.
To improve data quality, we added logic to supplement such missing information as much as possible by extracting it from the event's metadata and other recorded history. This part is shown in lines 16-18 of Algorithm 1.
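One concrete way to back-fill such gaps is shown below: a missing process name is recovered from the most recent event that did record it for the same pid. This is a hypothetical illustration of the supplementing step, not the actual lines 16-18 of Algorithm 1, and the field names are assumptions.

```python
# Sketch of Improvement #2: supplement a missing field (here, the process
# name) from previously recorded history for the same pid.
# Field names are illustrative assumptions.

def fill_missing_names(events):
    last_name = {}                  # pid -> last known process name
    for ev in events:
        pid = ev["proc_pid"]
        if ev.get("proc_name"):
            last_name[pid] = ev["proc_name"]
        elif pid in last_name:
            ev["proc_name"] = last_name[pid]   # back-fill from history
    return events

trace = [
    {"proc_pid": 7, "proc_name": "httpd", "type": "accept"},
    {"proc_pid": 7, "proc_name": None,    "type": "read"},   # name lost
]
fill_missing_names(trace)
print(trace[1]["proc_name"])   # httpd
```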

Improvement #3: Anonymization
Some names of processes or resources may be too sensitive to be identified. We applied an anonymization process to replace such names with artificial names; lines 29-35 of Algorithm 1 show this process. In general, the anonymization of events is a complicated process. However, this is not the case for our approach, because we use a fixed list of event fields that can be properly examined and anonymized.
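A fixed field list makes the anonymization step straightforward, as in this sketch: scan only the named fields and replace sensitive values with stable aliases so the same name always maps to the same alias (preserving dependencies). The field list, mapping scheme, and names are illustrative assumptions, not the paper's exact procedure.

```python
import itertools

# Sketch of Improvement #3: anonymize sensitive names over a fixed list
# of event fields. Field names and the alias scheme are assumptions.
SENSITIVE_FIELDS = ("proc_name", "pproc_name", "fd_name")

def make_anonymizer():
    mapping = {}                      # sensitive name -> stable alias
    counter = itertools.count(1)
    def anonymize(events, sensitive):
        for ev in events:
            for field in SENSITIVE_FIELDS:
                value = ev.get(field)
                if value in sensitive:
                    if value not in mapping:
                        mapping[value] = f"anon_{next(counter)}"
                    ev[field] = mapping[value]   # same name -> same alias
        return events
    return anonymize

anonymize = make_anonymizer()
trace = [{"proc_name": "payroll_db", "fd_name": "/srv/data.db"}]
anonymize(trace, sensitive={"payroll_db"})
print(trace[0]["proc_name"])   # anon_1
```

Keeping the mapping stable across events is what preserves the dependency structure of the anonymized graph.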

Attack Cases
We created several cyber attack scenarios whose data are generated by setting up virtual machines and software, triggering attack actions, and manually labeling the resulting behavior.

Dependency Graph Reduction
We identified a detection point for each dataset case and conducted dependency analysis to reduce the graph size. Examples from several dataset cases are presented in the evaluation section; they show a significant reduction in the size and complexity of the graphs.

Evaluation
This section presents the evaluation of the ProvSec dataset. We created a total of eleven real attack scenarios using widely used software and vulnerabilities, building the dataset with docker containers and sysdig on top of Ubuntu 20.04. The details of these cases are illustrated in Figs. 2, 4, and 6, which respectively show the full attack behavior of the C02, C03, and C05 scenarios.
We have three different types of behavior: process, file, and network, which are shown in different colors. In each figure, red nodes and edges represent processes and process creation events, such as execve, fork, and clone system calls. Blue nodes and edges represent files and file activities; examples include open, close, read, and write system calls and their variants. Green nodes and edges represent network addresses and network activities, such as connect and accept system calls. After simplification, the nodes are reduced to 0.5-17.9% of the original graphs, and the edge complexity is lowered to 0.015-9.5%.

Simplified Backtracking Graphs
In this section, we explain three cases of attack graphs and their simplified attack behavior as examples.
Case 02: Apache Path Traversal and File Disclosure: Fig. 3 shows the simplified behavior of the original graph, Fig. 2, which demonstrates a path traversal and file disclosure vulnerability attack targeting the Apache HTTP server. The simplified graph of Fig. 3 shows that the shell (sh) and ls processes were invoked from the httpd process, exposing the paths of the server.
Case 03: Python PIL/Pillow RCE via Ghostscript: Fig. 5 highlights the core attack of the original graph, Fig. 4, by removing irrelevant nodes and edges of the C03 scenario. The intrusion was detected via the touch command, which was triggered by shell processes (sh, with process IDs 87004 and 87005). We can confirm that these processes were created by the Ghostscript (gs) processes with process IDs 87003 and 87004, which in turn came from the python process. This graph indicates that the root cause is a vulnerability exploit of Ghostscript in the Python program.
Case 05: Apache Log4j lookup with JNDI injection: Fig. 6 illustrates the complex behavior of the Apache Log4j incident. This attack is initiated via JNDI injection and a reverse shell, as demonstrated in its backtrack graph, Fig. 7. In this graph, we can observe that a shell process (sh) with process ID 9743 was forked from a java process (PID 9712). Note that this shell process was initially a java process and then turned into a shell via an execve system call. The shell process performs two attack behaviors: copying (cp) and modifying (touch) a sensitive file (FiscalYearEndReport.xlsx). This simplified graph summarizes the attack behavior by showing what accesses occurred on the sensitive file.

Data Characteristics
Table 4 shows the characteristics of the dataset events that we share. For each of the eleven attack cases, we have two recordings: one of a benign workload without any attack and the other of an adversary workload with attack behavior. For each recording, the number of events (#E), the number of distinct process names (#P), the number of distinct IP addresses (#I), and the number of distinct system call types (#T) are presented. The data recorder, sysdig, generates one or two events per system call; therefore, the total number of system calls is less than #E. The total numbers of events in a benign case and its adversary case differ because of the different workloads. Across all eleven cases, our dataset has 341.7K events in the benign cases and 987.7K events in the adversary cases.
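The per-recording statistics above can be computed with a single pass over the events, as in this sketch. The field names follow the event format described earlier; the exact process and IP field names are assumptions.

```python
# Sketch of computing Table 4-style statistics from a list of events:
# #E (events), #P (distinct process names), #I (distinct IP addresses),
# #T (distinct system call types). Field names are assumptions.

def characteristics(events):
    procs, ips, types = set(), set(), set()
    for ev in events:
        types.add(ev["type"])
        if ev.get("proc_name"):
            procs.add(ev["proc_name"])
        for key in ("fd_cip", "fd_sip"):     # client- and server-side IPs
            if ev.get(key):
                ips.add(ev[key])
    return {"#E": len(events), "#P": len(procs),
            "#I": len(ips), "#T": len(types)}

trace = [
    {"type": "connect", "proc_name": "curl", "fd_sip": "10.0.0.5"},
    {"type": "read",    "proc_name": "curl"},
    {"type": "execve",  "proc_name": "sh"},
]
print(characteristics(trace))   # {'#E': 3, '#P': 2, '#I': 1, '#T': 3}
```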

Data Sharing
We share our dataset with the cybersecurity community at the following link: https://uco-cyber.github.io/research/#provsec.

Processing Time for a Real-Time System
We used the Python language to write the data processing code. Loading and analyzing the entire 1.3 million events takes less than a minute with our Python implementation. If a compiled native program written in C or C++ were used, this processing time could be reduced significantly further. As the next step of this project, we are processing these events, collected from multiple machines, for anomaly detection in a live fashion. Therefore, this type of data can be used in a real-time environment.
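For live processing, events can be consumed as a stream rather than loaded wholesale, keeping memory usage flat regardless of trace size. The sketch below assumes newline-delimited JSON for illustration; the dataset's actual on-disk layout is not specified here.

```python
import io
import json

# Sketch of streaming event consumption: read one JSON event per line
# instead of loading the whole recording into memory. NDJSON layout is
# an assumption for illustration.

def stream_events(fileobj):
    for line in fileobj:
        line = line.strip()
        if line:
            yield json.loads(line)

# Simulated recording with two events.
ndjson = io.StringIO('{"type": "execve"}\n{"type": "openat"}\n')
count = sum(1 for _ in stream_events(ndjson))
print(count)   # 2
```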

Related Work
In this section, we compare our work with prior work on security datasets.
Network-oriented datasets: Many existing works focus on network-oriented data, such as five-tuples or full packet recordings (e.g., PCAP) [29-34]. While these datasets have influenced multiple research works, they lack the information necessary to conduct dependency analysis of operating system events for system provenance analysis.
Software vulnerability datasets: Other dataset work [35-38] concerns software vulnerabilities and includes useful features, such as source code information, CWE (Common Weakness Enumeration), CVE (Common Vulnerabilities and Exposures), code metrics, etc. The datasets in this category have full details at the code level. However, they do not provide runtime data on how programs use operating system services and their parameters, which is necessary to conduct system provenance analysis.
Provenance datasets: Multiple cybersecurity datasets have been introduced to capture the details of system behavior that enable provenance analysis. The ISOT-CID dataset [26] includes data of multiple formats, including network traffic, system logs, performance data (e.g., CPU utilization), and system calls. While these data are quite close to what we provide, the system call data are incomplete and unstructured: they lack full details, and the records are in a non-standard format similar to strace output. Therefore, it takes manual effort to parse, curate, and extract useful information from the records. ProvMark [39] is a benchmarking system for provenance expressiveness, which evaluates three types of provenance recorders: OPUS [40], CamFlow [41], and SPADE [42, 43].
DARPA released the Operationally Transparent Cyber (OpTC) data that was used to evaluate the DARPA Transparent Computing (TC) program [27]. These data have been used in multiple papers analyzing APT attacks. While this dataset has a large volume of rich data, it lacks proper explanation, making it difficult for researchers to understand the details of the attacks. In this regard, Anjum et al. analyzed and published the details of the OpTC dataset [28], explaining its characteristics. However, that paper still only describes overall statistics, such as the types of actions and objects.
Compared to these approaches, ProvSec has several advantages that can help researchers conduct research with provenance data, especially for machine-learning tasks. Our dataset has the full details of system calls and parameters, organized in the JSON format, which enables the construction of operating system dependencies and system provenance analysis. We utilized real vulnerabilities and proof-of-concept (PoC) code to simulate attack scenarios inside docker environments, which are recorded in the operating system kernel.
Most usefully, we provide manual labeling of the attacks, which helps identify the root causes of attacks and the full details of attack behavior; this supports experiments that need ground truth validation or supervised machine learning. Each scenario case is organized into two separate runtime instances with corresponding recording files: (1) a benign case recorded without attack behavior and (2) an adversary case recorded with it. This clear labeling structure can significantly facilitate data pre-processing for machine-learning tasks.

Conclusion
In this paper, we introduced a new dataset for security provenance analysis along with a detailed description, analysis, and clearly provided labels, with two separate execution traces per case: a benign scenario and an adversary scenario. This dataset is differentiated from prior work by its detailed data for causal dependencies across events, its use of real vulnerabilities and PoC exploits, and its manual labeling, which is particularly helpful for validation and supervised machine-learning tasks. We performed enhanced causality dependence analysis with our improved algorithm and demonstrated how dependency analysis can simplify the analysis of each attack scenario in our dataset. We made our dataset public so that the research and education communities advancing provenance analysis can benefit from it.

Fig. 2 Dependency graph of C02-path traversal and file disclosure vulnerability in Apache HTTP Server

Table 1 ProvSec event format

• Case 06-Apache Tomcat AJP Arbitrary File Read/Include Vulnerability: Apache Tomcat has a vulnerability, CVE-2020-1938, known as Ghostcat, that allows an attacker to read files. We used this vulnerability to read a sensitive password file, /etc/passwd, as a demonstration of an arbitrary file read.
• Case 09-Path traversal and file disclosure vulnerability in Apache HTTP Server: This attack case concerns CVE-2021-42013, a vulnerability caused by an incomplete fix of CVE-2021-41773. After the fix, the Apache server still allows path traversals and execution of remote commands.
• Case 10-Django QuerySet.order_by SQL Injection Vulnerability: Django has a vulnerability that allows SQL injection (CVE-2021-35042). We used this vulnerability to collect information from the machine via an error message.
• Case 11-Escape from a Docker container: Docker has a vulnerability that lets an attacker escape a container and run commands (CVE-2019-5736). We used this vulnerability to create a backdoor and execute several UNIX commands.

Table 2 shows the details of the 11 incident cases. The graph complexity of each case is presented in Table 3. |N| represents the total number of nodes and |E| represents the total number of edges. This table also shows the complexity of the backtrack graphs, which are simplified by applying dependency analysis from the detection points. Their nodes and edges are shown in the |N_bt| and |E_bt| columns, and their reduction rates compared to the full graphs are shown in the corresponding reduction columns.


Table 3 Details of the incident

Table 4 Data characteristics. #E: total number of events; #P: total number of distinct process names; #I: total number of distinct IP addresses; #T: total number of distinct system call types

Fig. 7 Backtrack graph of C05-Apache Log4j2 lookup feature JNDI injection with a reverse shell

Acknowledgements Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This article describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the article do not necessarily represent the views of the U.S. Department of Energy or the United States Government. This work was supported through contract #70RSAT21KPM000105 with the U.S. Department of Homeland Security Science and Technology Directorate. Junghwan Rhee is the corresponding author of this work.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.