1 Introduction

Cybersecurity incidents affecting our nation’s government and commerce are soaring. In 2021 alone, critical infrastructure [1], companies [2], schools [3, 4], and municipal agencies [5] suffered major ransomware attacks and data breaches. The IT management software company Kaseya estimated that ransomware compromised up to 1500 businesses during this time [6]. Industry statistics show that more than a thousand data breaches have occurred every year since 2016 [7], and federal agencies experience more than 30,000 cyber incidents annually [8].

System forensic analysis, also known as system provenance analysis [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25], is an effective technique for tracking the dependencies across system events in a cyber incident, which makes it possible to assess the scope of damage and understand the attack route of an intrusion. Several security datasets have been proposed for research and educational purposes [26,27,28]. However, they lack the following characteristics needed for provenance analysis research and education.

  • High-quality dependencies across events.—To conduct provenance analysis, such datasets should have dependency information intact, so that the causality of events can be systematically reasoned. Operating system calls with required parameters are an example that qualifies for this purpose.

  • Realistic threat behavior.—The datasets should be based on a realistic scenario and real vulnerability exploits to reflect the characteristics and complexity of real software exploit attacks.

  • Explicit data labeling to assist machine-learning tasks.—The dataset should be labeled to be useful for validation purposes. Supervised machine-learning tasks also require accurate labels. We prepared a clearly labeled dataset in which each scenario case is provided with two recorded runtime instances: a benign scenario without an attack, and an adversary scenario in which an attack occurs. This structure can simplify manual examination and data pre-processing for machine-learning-based approaches.

This paper proposes a new dataset for system provenance analysis, called ProvSec, to meet this need and address the shortcomings of past work. We use cyber attacks simulated in a cloud-based virtual environment to provide detailed, high-quality digital forensic artifacts.

This paper is organized in the following way. Section 2 presents the design of the dataset. Its evaluation is presented in Sect. 3. Section 4 presents the details of the dataset shared. Section 5 discusses the information regarding data sharing and analysis. Section 6 presents related work. Finally, Sect. 7 concludes this paper.

2 Design of ProvSec

Fig. 1 Architecture of ProvSec dataset

To meet the requirements above, we propose ProvSec, a cybersecurity provenance analysis dataset (Fig. 1) comprising the following components.

2.1 Cloud Incidents

Virtual machines simulating the hosts of cyber attacks provide realistic and safe sandbox environments for cybersecurity experiments while preventing unintended damage, such as mistaken security operations during course modules. Virtualization technology is also useful for integrating the management of virtual environments and data transfer, so that forensic data are conveniently collected, labeled, and managed.

2.2 Provenance Data

In practical incident response research and education, obtaining high-quality data is critical to successfully expose attack sequences from piles of evidence. This is one important implementation goal. In real incidents, investigators may end up with an incomplete attack scenario due to various reasons such as an organization’s unprepared cyber infrastructure against potential incidents (e.g., lack of monitoring software and loss of logs).

ProvSec records and safely preserves system forensic event history and artifacts, so that we can analyze and recover the details of attack and defense system activities. This architecture will offer cyber analysts/investigators realistic environments, data navigation interfaces, and quality forensic data. They will access these historical data through well-defined interfaces and available functions for manual and automated investigation.

Another important design decision for ProvSec is which data to collect. Traditionally, provenance analysis research relies on operating system calls, which we also chose as the data format. A system call is a low-level interface invoked by software to use the services of the operating system kernel. Critical services for resources and privileges (e.g., memory, files, network, and processes) are performed via system calls. Therefore, this interface is important to monitor to understand attack activities and determine their causalities (e.g., a network intrusion \(\rightarrow\) login \(\rightarrow\) data copy).

Each event is stored as a JSON object with the 14 fields shown in Table 1. We selected and adopted several fields available in the sysdig event tracer for our dataset format. datetime is the event time relative to the start of the execution of the case. type is the system call name. We provide the names and process IDs of the current process and the parent process. The prog_args field shows the program parameters. File events have a file as an object, whose name is given in the fd_name field and whose type is given in the fd_type field. Network events have the IP addresses and ports of the client and server sides, which are given in the fd_cip and fd_cport fields (client side) and in the fd_sip and fd_sport fields (server side). Each system call can be generated as one or two events (e.g., the start event and the end event of the system call) depending on the system call type. The order field gives the position of the event within its system call (e.g., 0, 1).

Table 1 ProvSec event format
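As a rough illustration of how one such record might be consumed, the following Python sketch parses a single event and builds a one-line summary. The field names datetime, type, order, prog_args, fd_name, fd_type, fd_cip, fd_cport, fd_sip, and fd_sport come from the description above; proc_name, proc_pid, pproc_name, and pproc_pid are hypothetical stand-ins for the process fields, and all sample values (including the unit of datetime) are invented for the example.

```python
import json

# One invented event. The four proc_*/pproc_* keys are assumed names,
# not necessarily the ones used in the released dataset.
RAW_EVENT = json.dumps({
    "datetime": 12.5, "type": "connect", "order": 1,
    "proc_name": "curl", "proc_pid": 4242,
    "pproc_name": "sh", "pproc_pid": 4200,
    "prog_args": "http://10.0.0.5/",
    "fd_name": None, "fd_type": "ipv4",
    "fd_cip": "10.0.0.9", "fd_cport": 51514,
    "fd_sip": "10.0.0.5", "fd_sport": 80,
})


def describe(event):
    """Summarize an event as '<time> <process>[<pid>] <syscall> <object>'."""
    subject = f'{event["proc_name"]}[{event["proc_pid"]}]'
    if event.get("fd_sip"):
        # Network event: client endpoint -> server endpoint.
        obj = (f'{event["fd_cip"]}:{event["fd_cport"]}'
               f'->{event["fd_sip"]}:{event["fd_sport"]}')
    elif event.get("fd_name"):
        # File event: the object is the file name.
        obj = event["fd_name"]
    else:
        obj = "-"
    return f'{event["datetime"]} {subject} {event["type"]} {obj}'


event = json.loads(RAW_EVENT)
summary = describe(event)
print(summary)
```

Deciding the object kind from which fields are populated is a simplification chosen for this sketch; the fd_type field could be used instead.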

2.3 Provenance Analysis with Graph Improvements

Algorithm 1 Enhanced backtracking algorithm

These events are analyzed by event dependence analysis [17], known as the backtracking algorithm. We made several improvements to the original backtracking algorithm, as shown in Algorithm 1, to address several practical issues. Note that this algorithm applies to any provenance data, which makes it applicable to related work.
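The core idea of backtracking can be sketched in a few lines of Python. This is not Algorithm 1 itself, only a minimal illustration of the general technique under simplifying assumptions: each event is reduced to a (time, source, sink) tuple meaning that source affected sink at that time, and the function name, the object labels, and the sample events are all hypothetical.

```python
def backtrack(events, detection_point, detection_time):
    """Return (nodes, edges) that could have influenced the detection point.

    `events` is a list of (time, source, sink) tuples meaning that `source`
    affected `sink` at `time`, e.g. a process writing to a file.
    """
    # For every object already in the graph, remember the latest time at
    # which it influenced the graph; only earlier events into it can matter.
    threshold = {detection_point: detection_time}
    edges = []
    for time, source, sink in sorted(events, reverse=True):  # newest first
        if sink in threshold and time <= threshold[sink]:
            edges.append((time, source, sink))
            # Newest-first order means the first threshold stored for a
            # source is also its latest one.
            threshold.setdefault(source, time)
    return set(threshold), edges


events = [
    (1.0, "10.0.0.9:51514", "httpd"),    # remote client reaches the server
    (2.0, "httpd", "sh"),                # server spawns a shell
    (3.0, "sh", "/tmp/test.txt"),        # shell writes the detected file
    (3.2, "/etc/hosts", "sh"),           # read after the write: irrelevant
    (4.0, "cron", "/var/log/syslog"),    # unrelated activity
    (5.0, "vim", "/tmp/test.txt"),       # happens after the detection time
]
nodes, edges = backtrack(events, "/tmp/test.txt", detection_time=3.5)
print(sorted(nodes))
print(edges)
```

In this toy run, only the chain from the remote client through httpd and sh to the file is kept; the later read, the unrelated cron activity, and the post-detection write are dropped.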

2.3.1 Improvement #1: Incomplete Capture of All Processes

In the original backtracking system [17], the data recorder is integrated with the hypervisor. Therefore, it tracks all processes starting from the very first one. In contrast, we use a data recorder (sysdig) on top of a COTS operating system (Ubuntu), and the recorder starts only after the machine has finished the boot sequence and loaded its daemons. This deployment issue causes the data recorder to miss the creation of certain processes.

While this issue can be partially alleviated by starting the recording software as early as possible in the boot stage, there is always a chance that the start of some processes is missed while their later behavior is recorded. For practical usage, we handle this issue by including such programs in the graph through an artificial process creation when their behavior is observed for the first time. As shown in lines 2–15 of Algorithm 1, when the first behavior of such a process is handled, the algorithm creates an artificial fork (process creation) event.
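A minimal Python sketch of this idea follows. It is an illustration under stated assumptions rather than the logic of Algorithm 1: the keys proc_pid, proc_name, pproc_pid, pproc_name, child_pid, and artificial are hypothetical names, and the set of process-creation calls is limited to a few common ones.

```python
CREATION_CALLS = {"fork", "vfork", "clone"}


def add_artificial_forks(events):
    """Insert a synthetic fork before the first event of any process
    whose creation was never recorded."""
    seen = set()      # PIDs whose creation (real or artificial) is known
    result = []
    for ev in events:
        pid = ev["proc_pid"]
        if pid not in seen:
            # First observed behavior of an unknown process: pretend its
            # recorded parent forked it at this moment.
            result.append({
                "datetime": ev["datetime"], "type": "fork",
                "proc_pid": ev["pproc_pid"], "proc_name": ev["pproc_name"],
                "child_pid": pid, "artificial": True,
            })
            seen.add(pid)
        if ev["type"] in CREATION_CALLS and "child_pid" in ev:
            seen.add(ev["child_pid"])   # a real creation was recorded
        result.append(ev)
    return result


events = [
    {"datetime": 0.1, "type": "read", "proc_pid": 100, "proc_name": "sshd",
     "pproc_pid": 1, "pproc_name": "systemd"},
    {"datetime": 0.2, "type": "clone", "proc_pid": 100, "proc_name": "sshd",
     "pproc_pid": 1, "pproc_name": "systemd", "child_pid": 101},
    {"datetime": 0.3, "type": "execve", "proc_pid": 101, "proc_name": "bash",
     "pproc_pid": 100, "pproc_name": "sshd"},
]
patched = add_artificial_forks(events)
artificial = [ev for ev in patched if ev.get("artificial")]
print(len(patched), len(artificial))
```

Here only sshd (PID 100) receives an artificial fork, because the creation of PID 101 was captured by a recorded clone event.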

2.3.2 Improvement #2: Limited Data Fields from a Data Recorder

We found that some fields recorded by our data monitoring software, sysdig, are missing because such data may not be available at the time the data are retrieved and stored inside the OS kernel.

To improve data quality, we added logic to supplement such missing information as much as possible by extracting it from the event’s metadata and other recorded history. This part is shown in lines 16–18 of Algorithm 1.
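One plausible way to supplement a missing field from recorded history is sketched below in Python. Which fields are actually missing and how Algorithm 1 recovers them is not specified here, so this is only an assumed example: it fills an empty process name with the most recent name seen for the same process ID, using the hypothetical keys proc_pid and proc_name.

```python
def fill_missing_names(events):
    """Fill empty process names from earlier events of the same PID."""
    last_name = {}    # PID -> most recent non-empty process name
    filled = []
    for ev in events:
        ev = dict(ev)                      # do not modify the input
        pid = ev["proc_pid"]
        if ev.get("proc_name"):
            last_name[pid] = ev["proc_name"]
        elif pid in last_name:
            ev["proc_name"] = last_name[pid]
        filled.append(ev)
    return filled


events = [
    {"proc_pid": 7, "proc_name": "nginx", "type": "accept"},
    {"proc_pid": 7, "proc_name": None, "type": "read"},    # name missing
    {"proc_pid": 8, "proc_name": None, "type": "write"},   # no history yet
]
filled = fill_missing_names(events)
print([ev["proc_name"] for ev in filled])
```

An event with no usable history, such as the one for PID 8, is left as it is rather than guessed.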

2.3.3 Improvement #3: Anonymization

Some names of processes or resources might be sensitive. We applied an anonymization process to replace such names with artificial names. Lines 29–35 of Algorithm 1 show this process. Generally, the anonymization of events is a complicated process. However, this is not the case for our approach, because we use a fixed list of event fields that can be properly examined and anonymized.
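Because the set of fields is fixed, a consistent renaming can be sketched compactly in Python. The following is an assumed illustration, not the procedure in lines 29–35: the choice of fields, the alias format anon_N, and the sample values are all hypothetical.

```python
# Fields assumed to carry names that may need anonymization.
SENSITIVE_FIELDS = ("proc_name", "pproc_name", "fd_name")


def make_anonymizer(sensitive_values):
    """Return (anonymize, mapping); the same name always gets the same alias."""
    mapping = {}

    def anonymize(event):
        out = dict(event)                  # leave the original untouched
        for field in SENSITIVE_FIELDS:
            value = out.get(field)
            if value in sensitive_values:
                # Reuse an existing alias or create the next one.
                mapping.setdefault(value, f"anon_{len(mapping) + 1}")
                out[field] = mapping[value]
        return out

    return anonymize, mapping


anonymize, mapping = make_anonymizer({"/home/alice/report.xlsx", "alice-tool"})
e1 = anonymize({"proc_name": "alice-tool", "pproc_name": "sh",
                "fd_name": "/home/alice/report.xlsx", "type": "open"})
e2 = anonymize({"proc_name": "cat", "pproc_name": "sh",
                "fd_name": "/home/alice/report.xlsx", "type": "read"})
print(e1["proc_name"], e1["fd_name"], e2["fd_name"])
```

Keeping one alias per original name preserves the dependencies between events, so the graph structure is unchanged after anonymization.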

2.4 Attack Cases

We created several cyber attack scenarios whose data are generated by setting up virtual machines and software and triggering attack actions, along with manual labeling of behavior.

  • Case 01—Nginx integer overflow vulnerability: This case represents an integer overflow vulnerability that exists in Nginx versions 0.5.6 through 1.13.2. This vulnerability is caused by insufficient bound checking (CVE-2017-7529).

  • Case 02—Path traversal and file disclosure vulnerability in Apache HTTP Server: Apache 2.4.49 has a vulnerability that allows a path traversal attack to map URLs to files outside the expected document root (CVE-2021-41773). We used this vulnerability to execute several UNIX commands.

  • Case 03—Python PIL/pillow remote shell command execution via ghostscript: Ghostscript versions before 9.24 have a vulnerability that allows remote shell command execution. As a demonstration, we remotely create the file /tmp/test.txt on the target server (CVE-2018-16509).

  • Case 04—PHP IMAP remote command execution vulnerability: The PHP IMAP extension is used to send and receive emails. The imap_open call internally uses ssh, and an attacker can inject a parameter to achieve remote command execution. We conducted an attack that executes the command echo '1234567890'>/tmp/test0001 (CVE-2018-19518).

  • Case 05—Apache Log4j2 lookup feature JNDI injection with a reverse shell: Apache Log4j, a Java-based logging utility, has a vulnerability CVE-2021-44228 in its support for JNDI (Java Naming and Directory Interface). We used this vulnerability to initiate a reverse shell.

  • Case 06—Apache Tomcat AJP Arbitrary File Read/Include Vulnerability: Apache Tomcat has a vulnerability, CVE-2020-1938, known as Ghostcat, that allows an attacker to read files. We used this vulnerability to read a sensitive password file, /etc/passwd, as a demonstration of an arbitrary file read.

  • Case 07—Redis Lua Sandbox Escape and Remote Code Execution: Redis, an open-source in-memory data structure store, has a vulnerability, CVE-2022-0543, that allows an escape from the Lua sandbox and the execution of an arbitrary remote command. We used this vulnerability to run UNIX commands and dump the password file.

  • Case 08—Consul service APIs’ misconfiguration leading to Remote Code Execution (RCE) and reverse shell: Consul is open-source software for discovering and configuring services. It has a vulnerability that allows remote code execution. We created a remote shell followed by several attack commands.

  • Case 09—Path traversal and file disclosure vulnerability in Apache HTTP Server: This attack case is regarding CVE-2021-42013 which is a vulnerability caused by an incomplete fix of CVE-2021-41773. After the fix, the Apache server still allows path traversals and execution of remote commands.

  • Case 10—Django QuerySet.order_by SQL Injection Vulnerability: Django has a vulnerability that allows SQL injection (CVE-2021-35042). We used this vulnerability to collect information from the machine through an error message.

  • Case 11—Escape from a Docker container: Docker has a vulnerability that allows an attacker to escape a container and run commands (CVE-2019-5736). We used this vulnerability to create a backdoor and execute several UNIX commands.

2.5 Dependency Graph Reduction

We identified a detection point for each dataset case and conducted dependency analysis from that point to reduce the graph size. Examples for several dataset cases are presented in the evaluation section. They show a significant reduction in the size and complexity of the graphs.

3 Evaluation

This section presents the evaluation of ProvSec datasets. We created a total of 11 attack scenarios using widely used software and vulnerabilities.

Fig. 2 Dependency graph of C02-path traversal and file disclosure vulnerability in Apache HTTP Server

Fig. 3 Simplified backtrack graph of C02-path traversal and file disclosure vulnerability in Apache HTTP server

Fig. 4 Dependency graph of C03-python PIL/pillow remote shell command execution via ghostscript

Fig. 5 Simplified backtrack graph of C03-python PIL/Pillow remote shell command execution via ghostscript

We created the ProvSec dataset using Docker containers and sysdig on top of Ubuntu 20.04. We have prepared a total of eleven real attack scenarios for this dataset. Examples of these cases are illustrated in Figs. 2, 4, and 6, which respectively show the full attack behavior of the C02, C03, and C05 scenarios.

We have three different types of behavior: process, file, and network, which are shown in different colors. In each figure, the red nodes and edges represent processes and process creation events, such as the execve, fork, and clone system calls. Blue nodes and edges represent files and file activities. Examples include the open, close, read, and write system calls and their variants. The green nodes and edges represent network addresses and network activities, such as the connect and accept system calls.
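The color scheme described above can be expressed as a small lookup, sketched here in Python. The table covers only the system calls named in the text; the function name and the gray fallback for other calls are assumptions of this sketch, and it ignores the fact that read and write can also operate on sockets.

```python
CATEGORY_BY_SYSCALL = {
    "execve": "process", "fork": "process", "clone": "process",
    "open": "file", "close": "file", "read": "file", "write": "file",
    "connect": "network", "accept": "network",
}
COLOR_BY_CATEGORY = {"process": "red", "file": "blue", "network": "green"}


def edge_color(syscall, default="gray"):
    """Map a system call name to the color of its edge in the graph."""
    category = CATEGORY_BY_SYSCALL.get(syscall)
    return COLOR_BY_CATEGORY.get(category, default)


print(edge_color("execve"), edge_color("read"), edge_color("accept"))
```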

3.1 Graph Complexity

Table 2 shows the details of the 11 incident cases. The graph complexity of each case is presented in Table 3. |N| represents the total number of nodes and |E| represents the total number of edges. This table also shows the complexity of the backtrack graphs, which are simplified by applying dependency analysis from the detection points. Their node and edge counts are shown in the \(|N_{bt}|\) and \(|E_{bt}|\) columns, and their sizes relative to the full graphs are shown in the \(\frac{|N_{bt}|}{|N|}\) and \(\frac{|E_{bt}|}{|E|}\) columns, respectively. The backtrack graphs retain only 0.5–17.9% of the nodes and 0.015–9.5% of the edges of the original graphs.

Table 2 Details of the incident cases
Table 3 Details of the incident provenance graphs
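The ratio columns are simple quotients, as the short Python sketch below shows. The node and edge counts used here are invented for illustration and are not taken from Table 3; they are chosen only so that the results fall at the lower end of the reported ranges.

```python
def retained_percent(reduced, full):
    """Size of the backtrack graph as a percentage of the full graph."""
    return 100.0 * reduced / full


# Hypothetical counts: |N| = 2400, |N_bt| = 12, |E| = 100000, |E_bt| = 15.
nodes_pct = retained_percent(12, 2400)
edges_pct = retained_percent(15, 100000)
print(f"nodes: {nodes_pct:.1f}%  edges: {edges_pct:.3f}%")
```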

3.2 Simplified Backtracking Graphs

In this section, we explain three cases of attack graphs and their simplified attack behavior as examples.

Case 02: Apache Path Traversal and File Disclosure: Fig. 3 shows the simplified behavior of the original graph, Fig. 2, which demonstrates a path traversal and file disclosure attack targeting the Apache HTTP server. The simplified graph of Fig. 3 shows that the shell (sh) and ls processes were invoked from the httpd process, exposing the paths of the server.

Case 03: Python PIL/Pillow RCE via Ghostscript: Fig. 5 highlights the core attack of the original graph, Fig. 4, by removing irrelevant nodes and edges of the C03 scenario. The intrusion was detected by the touch command, which was triggered by shell processes (sh, with process IDs 87004 and 87005). We can confirm that these processes were created by the Ghostscript (gs) processes with process IDs 87003 and 87004, which in turn came from the Python process (python). This graph identifies the root cause as a vulnerability exploit of Ghostscript in the Python program.

Case 05: Apache log4j lookup with JNDI injection: Fig. 6 illustrates the complex behavior of the Apache Log4j incident. This attack is initiated via JNDI injection and a reverse shell, as demonstrated in its backtrack graph, Fig. 7. In this graph, we can observe that a shell process (sh, PID 9743) was forked from a java process (PID 9712). Note that this shell process was initially a java process and then turned into a shell through an execve system call. This shell process performs two attack behaviors: copying (cp) and modifying (touch) a sensitive file (FiscalYearEndReport.xlsx). This simplified graph summarizes the attack behavior by showing which accesses occurred on the sensitive file.

Fig. 6 Dependency Graph of C05-Apache Log4j2 lookup feature JNDI injection with a reverse shell

Fig. 7 Backtrack Graph of C05-Apache Log4j2 lookup feature JNDI injection with a reverse shell

4 Data Characteristics

Table 4 Data characteristics

Table 4 shows the characteristics of the dataset events that we share. For each of the 11 attack cases, we have two recordings: one for a benign workload without any attack and the other for an adversary workload with attack behavior. For each recording, the number of events (#E), the number of distinct process names (#P), the number of distinct IP addresses (#I), and the number of different system call types (#T) are presented. The data recorder that we use, sysdig, generates one or two events per system call. Therefore, the total number of system calls is less than #E. The total number of events differs between a benign case and an adversary case because of their different workloads. Across all eleven cases, our dataset has 341.7K events in the benign cases and 987.7K events in the adversary cases.
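These four statistics can be derived directly from the event records, as in the following Python sketch. The key proc_name is a hypothetical name for the process-name field, the sample events are invented, and counting one system call per event with order 0 is an assumption about how the start and end events are numbered.

```python
def characteristics(events):
    """Compute #E, #P, #I, #T and an estimated system call count."""
    process_names = {e["proc_name"] for e in events if e.get("proc_name")}
    ip_addresses = set()
    for e in events:
        for field in ("fd_cip", "fd_sip"):
            if e.get(field):
                ip_addresses.add(e[field])
    syscall_types = {e["type"] for e in events}
    # Assumption: every system call has exactly one event with order 0.
    calls = sum(1 for e in events if e.get("order") == 0)
    return {"#E": len(events), "#P": len(process_names),
            "#I": len(ip_addresses), "#T": len(syscall_types),
            "calls": calls}


events = [
    {"type": "open", "order": 0, "proc_name": "httpd"},
    {"type": "open", "order": 1, "proc_name": "httpd"},
    {"type": "accept", "order": 0, "proc_name": "httpd",
     "fd_cip": "10.0.0.9", "fd_sip": "10.0.0.5"},
    {"type": "accept", "order": 1, "proc_name": "httpd",
     "fd_cip": "10.0.0.9", "fd_sip": "10.0.0.5"},
    {"type": "execve", "order": 0, "proc_name": "sh"},
]
stats = characteristics(events)
print(stats)
```

In this toy recording, five events correspond to three system calls, which illustrates why the number of system calls is smaller than #E.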

5 Discussion

5.1 Data Sharing

We share our dataset with the cybersecurity community at the following link: https://uco-cyber.github.io/research/#provsec.

5.2 Processing Time for a Real-Time System

We used the Python language to write the data processing code. Loading and analyzing the entire 1.3 million events takes less than a minute with our Python implementation. A compiled native program written in C or C++ could reduce this processing time further. As the next step of this project, we are processing events collected from multiple machines for live anomaly detection. Given this processing speed, this type of data can be used in a real-time environment.

6 Related Work

In this section, we compare our work with multiple prior works proposed for security datasets.

Network-oriented dataset: Many existing works focus on network-oriented data, such as five-tuples or full packet recordings (e.g., PCAP) [29,30,31,32,33,34]. While these datasets have an influence on multiple research works, they lack the information necessary to conduct dependency analysis of operating system events for system provenance analysis.

Software vulnerability dataset: Other dataset work [35,36,37,38] concerns software vulnerabilities and includes useful features, such as source code information, CWE (Common Weakness Enumeration), CVE (Common Vulnerabilities and Exposures), and code metrics. The datasets of this category have full details at the code level. However, they do not provide runtime data on how programs use operating system services and their parameters, which are necessary to conduct system provenance analysis.

Provenance dataset: Multiple cybersecurity datasets have been introduced that capture the details of system behavior, which enables provenance analysis. The ISOT-CID dataset [26] includes data in multiple formats, including network traffic, system logs, performance data (e.g., CPU utilization), and system calls. While these data are quite close to what we provide, the system call data are incomplete and not structured. They lack full details, and the records are in a non-standard format similar to the strace output. Therefore, it takes manual effort to parse, curate, and extract useful information from the records. ProvMark [39] is a benchmarking system for provenance expressiveness, which evaluates three provenance recorders: OPUS [40], CamFlow [41], and SPADE [42, 43].

DARPA released the Operationally Transparent Cyber (OpTC) data that was used to evaluate the DARPA Transparent Computing (TC) program [27]. These data have been used in multiple papers for analyzing APT attacks. While this dataset has a large volume of rich data, it lacks the explanation that researchers need to understand the details of the attacks. In this regard, Anjum et al. analyzed and published the details of the OpTC dataset [28], explaining its characteristics. However, that paper still describes only overall statistics, such as the types of actions and objects.

Compared to these approaches, ProvSec has several advantages that can help researchers work with provenance data, especially for machine-learning tasks. Our dataset has full details of system calls and parameters that are organized in the JSON format and enable the construction of operating system dependencies and system provenance analysis. We utilized real vulnerabilities and proof-of-concept (PoC) code to simulate attack scenarios inside Docker environments, which are recorded in the operating system kernel.

Most usefully, we provide manual labeling of the attacks, which helps identify the root causes of attacks and the full details of attack behavior, and which supports experiments that need ground truth validation or supervised machine learning. Each scenario is organized into two separate runtime instances and corresponding recording files: (1) a benign case and (2) an adversary case, recorded without and with attack behavior, respectively. This clear labeling structure can significantly facilitate data pre-processing for machine-learning tasks.

Provenance analysis: Provenance analysis has been studied by a large body of work in recent years [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]. Recent survey papers [24, 25] summarize and categorize multiple approaches to provenance tracking and dependency analysis. Multiple attack detection approaches [23, 44,45,46,47] have been proposed to detect APT campaigns effectively. Due to the large volume of data, several ideas for data reduction have been explored, such as execution partitioning, garbage collection, and approximations of behavior patterns [18, 48,49,50,51,52,53]. Regarding the mechanisms for collecting provenance data, several approaches used OS-level data collectors with program instrumentation [18, 19]. A kernel-based framework [21] and a hardware-based technique [22] have also been proposed. As another trend, machine-learning-based solutions [10, 44, 54] are increasingly being used to improve detection. All such approaches can benefit from a provenance dataset with high-quality labeling. Our work can contribute to such approaches as additional evaluation data.

7 Conclusion

In this paper, we introduce a new dataset for security provenance analysis along with a detailed description, analysis, and clearly provided labels with two separate execution traces of a benign scenario and an adversary scenario. This dataset is differentiated from prior work by its detailed data for causal dependencies across events, its use of real vulnerabilities and PoC exploits, and its manual labeling, which is particularly helpful for validation and supervised machine-learning tasks. We performed an enhanced causality dependence analysis with our improved algorithm and demonstrated how the dependency analysis can simplify the analysis of each attack scenario with our dataset cases. We made our dataset public, so that the research and education communities advancing provenance analysis can benefit from this dataset.