1 Introduction

As a result of the significant technological advancements made over the years, people’s lifestyles are shifting from traditional to increasingly digital ones. This shift has been accompanied by an increase in cybercrimes on the Internet. Therefore, adequate measures have to be put in place to secure computer systems. Moreover, computer security or cyber-security systems must be capable of detecting and preventing cyber-attacks in real-time. The intersection of the Machine Learning (ML) and cyber-security fields has recently been growing rapidly as researchers make use of either fully labelled datasets with Supervised Learning (SL), unlabeled datasets with Unsupervised Learning (UL), or a combination of labelled and unlabeled data with Semi-Supervised Learning (SSL) to identify the various types of cyber-attacks. Due to the high cost and scarcity of labelled data in the cyber-security domain, SSL applications for cyber-security tasks have gained traction. Several datasets have been made available to the public to build ML-based defensive mechanisms. In ML, the quality of the output is determined by the quality of the input [1]; in other words, for ML models to generalize effectively, the datasets upon which they are built must be representative of real-world data. Therefore, surveys on the available datasets and performance evaluation metrics used to build and evaluate SSL models are required to give up-to-date information on recent cyber-security datasets and suitable performance metrics used in SSL frameworks, and to provide a starting point for new researchers who wish to investigate this vital subject.

Several works focusing on cyber-security provide discussions of datasets and data repositories that can be used for building ML models. For instance, Ring et al. [2] presented an extensive survey on network-based intrusion detection datasets discussing datasets containing packet-based, flow-based and neither packet- nor flow-based data while Glass-Vanderlan et al. [3] focused on Host-Based Intrusion Detection Systems (HIDS) and touched upon datasets and sources mainly related to HIDS. Other articles described datasets for (i) intrusion, malware and spam detection (e.g. [4,5,6,7,8]); (ii) network anomaly detection (e.g. [9]); or (iii) phishing URL detection (e.g. [10]). However, these works often focus on a particular cyber-security domain and do not examine in detail the characteristics of the available datasets and the performance evaluation metrics that are suitable for the various research challenges.

Because of the expanding interest in this area and the rapid speed of research, these surveys quickly become outdated; there is, therefore, an obvious need for a comprehensive survey to present the most recent datasets and evaluation metrics and their usage in the literature. To fill this gap, we present an exhaustive evaluation of the cyber-security datasets used to build SSL models. In this paper, we conduct a systematic literature review (SLR) of publicly available cyber-security datasets and performance assessment metrics used for building and evaluating SSL models. To this end, we provide a summary of datasets used to construct models for cyber-security-related tasks; the covered areas include not only network- and host-based intrusion detection, but also spam and phishing detection, Sybil and botnet detection, Internet traffic and domain name classification, malware detection and categorization, and power grid attacks detection. Additionally, we examine the performance assessment metrics used to evaluate the SSL models and discuss their usage in the selected papers. Furthermore, we provide a list of datasets, tools, and resources used to collect and analyze the data that have been made publicly available in the literature. Finally, we provide a discussion on the open research challenges and a list of observations with regard to datasets and performance metrics. This is, to the best of our knowledge, the first SLR analyzing a wide array of cyber-security datasets and performance evaluation metrics for SSL tasks, as well as providing easy access to publicly available datasets.

Our key contributions are the following:

  1. We provide a description of the most commonly used SSL techniques.

  2. We provide insights on the major cybercrimes for which SSL solutions have been explored.

  3. We present a systematic literature review of the publicly available cyber-security datasets, repositories and performance evaluation metrics used.

  4. We analyze the open challenges found in the literature and provide a set of recommendations for future research.

The remaining sections are organized as follows. Section 2 presents the definitions, important concepts, and basic assumptions of SSL, as well as a brief introduction to the methods utilized in the literature we reviewed and an overview of the different cybercrimes the included articles’ authors propose to counter. Additionally, we provide examples that highlight successful industrial deployments of ML for countering cyber threats, demonstrating the practical applications of the methods discussed in the literature. In Sect. 3, we present the methodology we used to construct our survey and in Sect. 4, an in-depth analysis of the publicly available datasets and the different evaluation metrics used in the selected papers is presented. Section 5 discusses the open challenges faced by the reviewed methods applying SSL for cyber-security, with respect to the datasets and evaluation metrics, presents a set of observations and the lessons learned, and highlights strategies for bridging the gap between research and practice. Finally, Sect. 6 concludes the work.

2 Background on SSL and cyber-security

Machine Learning (ML), a core subset of Artificial Intelligence (AI), may be defined as the systematic study of computer algorithms and systems that allow computer programs to automatically improve their knowledge or performance through experience [11]. It is a branch of computer science whose goal is to teach computers with sample data, i.e., training data, to make predictions or decisions on unseen data. ML algorithms can be categorized into three main types: SL, UL, and Reinforcement Learning (RL). In SL, the task is to infer a function that maps input data points from an instance space to their corresponding labels in the output space using labelled examples [12, 13]; the task is called classification when the function being learned is discrete, i.e., input data points are mapped to categorical values, and regression when the function being learned is continuous, i.e., input data points are mapped to real values. In UL, no labels are available, so the goal of UL algorithms is to capture important patterns or extract relationships from untagged (unlabeled) data, for example as probability density distributions [14]. In RL, the algorithms aim to maximize the feedback (reward) they are provided with. SSL conceptually stands between SL and UL [15,16,17]. Out-of-core, Incremental, or Online Learning (OL) is a learning paradigm in which the data becomes available sequentially, one sample at a time [18]; an OL model can learn from newly available data in addition to making predictions from it. Information Technology (IT) security, computer security or simply cyber-security is the protection of computer systems and networks from cyber-attacks, i.e., information disclosure, loss, theft, or damage to their hardware, software, or electronic data, as well as from the disruption or misdirection of the services they offer [19].

SSL and ML, in general, have brought significant benefits to the cyber-security domain, including improved detection capabilities, adaptive learning, automation, and threat intelligence [20] (see Sect. 2.3 for industrial examples). However, there are also challenges that need to be addressed, including the lack of quality data, adversarial attacks, model explainability, and bias and discrimination [21, 22]. Addressing these challenges will be critical to ensuring that ML remains a useful tool in the fight against cyber threats.

In the remainder of this section, we introduce the key principles and techniques of SSL, provide a summary of cybercrimes examined in the literature, and present examples that demonstrate the potential of ML in mitigating cyber threats in the real world.

2.1 SSL concepts and methods

We first introduce some notation. Let \(\mathcal {D}_L=(x_i,l(x_i))_{i=1}^k\) denote a labelled dataset where each sample \((x_i,l(x_i))\) consists of a data point \(x_i\) from the instance space \(\mathcal {X}\) and a target variable \(l(x_i)\) in the output space \(\mathcal {Y}\). Let \(\mathcal {D}_U=(x_i)_{i=k+1}^{k+u}\) denote an unlabeled dataset. In SL, when \(l(x_i)\) consists of categorical values we face a classification task, and when it consists of real values we have a regression task. In UL, the model is only provided with unlabeled data, i.e., \(\mathcal {D}_U\). SL can build strong models to predict labels for unlabeled samples, but it requires \(\mathcal {D}_L\) to contain diverse samples manually labelled by domain experts, which may not only be too costly but may also introduce inaccurate labels due to human mistakes; therefore, in practice, \(u \gg k\). On the other hand, even though UL does not require labelled samples to infer patterns, it is prone to overfitting. SSL makes use of both \(\mathcal {D}_L\) and \(\mathcal {D}_U\) to infer a function whose performance surpasses one built with either SL or UL alone, by relying on at least one of the main SSL assumptions: the smoothness, low-density, and manifold assumptions [23], and the cluster assumption [24].

The smoothness assumption is based on the notion that if two data points, \(x_1\) and \(x_2\), lie close in the instance space \(\mathcal {X}\), their corresponding class labels, \(l(x_1)\) and \(l(x_2)\), should also be close (i.e., the same) in the output space \(\mathcal {Y}\). Transitivity, which states that if \(x_1\) lies close to \(x_2\) and \(x_2\) lies close to \(x_3\), then \(x_1\) also lies close to \(x_3\), is an important idea underlying the smoothness assumption: since close points in \(\mathcal {X}\) share the same label, the assumption implies that if \(x_2\) is a noisy version of \(x_1\), the two points should still receive the same predicted label. The low-density assumption implies that data points with the same label are clustered in high-density sections of the instance space; that is, the decision boundary must pass through a low-density region \(\mathcal {R} \subset \mathcal {X}\), so the probability \(p(x_i)\) of any data point lying in \(\mathcal {R}\) is low. This also ensures that the smoothness assumption is satisfied. In the manifold assumption, the instance space \(\mathcal {X}\) consists of one or more Riemannian manifolds \(\mathcal {M}\) on which samples share the same label. According to the cluster assumption, which can be seen as a generalization of the other three assumptions [16], data points that belong to the same cluster are likely to share the same label, and several clusters may constitute the same class [15].

Based on [16, 25, 26], the taxonomy in Fig. 1 provides a general overview of the SSL approaches which will be described in more detail in Sects. 2.1.1 and 2.1.2. An overview of the key concepts in the taxonomy is presented next.

Fig. 1 Taxonomy of SSL techniques (adapted from [16, 25, 26])

SS Classification and Regression methods can either be transductive or inductive [15, 27, 28]. In inductive SSL, the model is first built using information from \(\mathcal {D}_L\) and \(\mathcal {D}_U\) and it can then be used as one built with SL to generate predictions for previously unseen, unlabeled samples; there exists a clear distinction between a training phase and a testing phase. In transductive SSL, on the other hand, the goal is to generate labels for the unlabeled samples fed to the learner, therefore there is no clear distinction between a training and testing phase. Frequently, transductive approaches create a graph across all data points, including labelled and unlabeled, expressing the pairwise similarity of data points with weighted edges and are incapable of handling additional unseen data [17]. We group both SS Classification and Regression because they predict output values for input samples but note that most SS Classification approaches are incompatible with SS Regression, and we, therefore, specify when they may be compatible in Sect. 2.1.1.

In SS Clustering, the learner’s goal is clustering, but a small amount of knowledge is available in the form of constraints: must-link constraints (two samples must be within the same cluster) and cannot-link constraints (two data points cannot be within the same cluster). It differs from traditional clustering in the way the constraints are accommodated: either by biasing the search for relevant clusters or by altering a distance or similarity metric [29]. When it is not possible for an SL method to work, even in a transductive form, because the available knowledge is too far from being representative of a target classification of the items, the cluster assumption may allow the use of the available knowledge to guide the clustering process [30]. Bair [25] provides a survey on SS Clustering methods and groups them into constraint-based, partial-label, SS hierarchical clustering and outcome-variable-associated methods.

A plethora of SSL approaches have been proposed in the literature, each making use of at least one of the SSL assumptions described. The following sections briefly describe the frequently used SSL methods showing how they relate to the SSL assumptions.

2.1.1 SSL for classification and regression

We divide the classification and regression methods between the two main classes: inductive SSL and transductive SSL.

2.1.1.1 Inductive methods

The goal of inductive methods is to build a model from labelled and unlabeled data and then use it, in the same way as a model built with SL (i.e., with labelled data only), to make predictions on unlabeled data. Inductive methods can further be divided into wrapper methods, unsupervised preprocessing, and intrinsically semi-supervised methods. In wrapper methods, one or more supervised learners are first trained on the labelled data only; the learner or set of learners is then applied to the unlabeled data to generate pseudo-labels, which are used for training in the next iterations. Pseudo-labels, \(l(x_i)\), \(k<i\le k+u\), are simply the most confident labels produced by the learner or set of learners for a set of unlabeled samples, \(\mathcal {X}_U \subset \mathcal {D}_U\) [31]. The wrapper methods we will consider are self-training and co-training. According to the way they make use of the unlabeled data, unsupervised preprocessing methods can be divided into feature extraction, unsupervised clustering, and parameter initialization or pre-training.

2.1.1.1.1 Wrapper methods In wrapper methods, a model is first trained on labelled data to generate pseudo-labels for an unlabeled subset, \(\mathcal {X}_U \subset \mathcal {D}_U\); the model is then iteratively re-trained, until all unlabeled data are labelled or some stopping criterion is met, on a new dataset containing both the labelled dataset, \(\mathcal {D}_L\), and the pseudo-labels, \(l(x_i)\), \(k<i\le k+u\), of the subset \(\mathcal {X}_U\) generated in previous iterations. They are the oldest and most well-known SSL methods [27, 31]. Wrapper methods may be used for classification and regression and are divided into three categories: self-training, co-training, and boosting.

  1. Self-training. Self-training [32], also referred to as self-learning, is the most basic wrapper method [31]: a single base SL learner is iteratively trained on a training set consisting of the original labelled data and the high-confidence predictions (pseudo-labels) from the previous iterations. It may be applied to most, if not all, SL algorithms such as Random Forests (RF) [33], Support Vector Machines (SVM) [34], etc. (a minimal sketch follows this list).

  2. Co-training. Co-training methods [35, 36] assume that (i) features can be split into two or more distinct sets or views; (ii) each feature subset is sufficient to train a good classifier; (iii) the views are conditionally independent given the class label. Co-training extends the principle of self-training to multiple SL learners that are each iteratively trained with the pseudo-labels from the other learners; in other words, learners “teach” each other with the added pseudo-labels to improve global performance. For co-training to work well, the sufficiency (ii) and independence (iii) assumptions should be satisfied [35]. Multi-view co-training, the basic form of co-training, constructs two learners on distinct feature sets or views. When no natural feature split is known a priori, single-view co-training may be used to build two or more weak learners with different hyper-parameters on the same feature set. There exist several approaches based on single-view co-training such as tri-training [37], co-forest [38], co-regularization [39], etc. In co-regularization, the two terms of the objective function minimize the error rate and optimize the disagreement between base learners [39].
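To make the wrapper-method loop concrete, the following minimal sketch implements self-training around a scikit-learn classifier. It is an illustrative example rather than a method taken from the reviewed papers: the Random Forest base learner, the 0.95 confidence threshold and the stopping criterion are assumptions chosen for readability.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_train(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    """Iteratively pseudo-label the most confident unlabeled samples and retrain.

    X_l, y_l: labelled feature matrix and labels; X_u: unlabeled feature matrix.
    threshold and max_iter are illustrative stopping criteria.
    """
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)
        proba = clf.predict_proba(X_u)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no prediction is confident enough: stop adding pseudo-labels
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]  # pseudo-labels
        X_l = np.vstack([X_l, X_u[confident]])
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~confident]
    return clf.fit(X_l, y_l)  # final model trained on labelled + pseudo-labelled data
```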

2.1.1.1.2 Unsupervised preprocessing The unsupervised preprocessing methods use \(\mathcal {D}_U\) and \(\mathcal {D}_L\) in two different steps. The first step often consists of extracting features (feature extraction), transforming the feature space (unsupervised clustering), or initializing a model’s parameters (pre-training), while the second step consists of using knowledge from \(\mathcal {D}_L\) to label the unlabeled data points in \(\mathcal {D}_U\). We briefly describe these methods below.

  1. Feature Extraction: Feature extraction is one of the most critical steps in ML. It consists of extracting a set of relevant features on which ML models can operate. Typically, SSL feature extraction methods consist of either finding lower-dimensional feature spaces, from \(\mathcal {X}\), without sacrificing significant amounts of information or finding lower-dimensional vector representations of highly dimensional data objects by considering the relationships between the inputs. Examples of SSL feature extraction methods are the autoencoder (AE) [14] and a few of its variants, such as the denoising autoencoder [40] and contractive autoencoder [41], and methods in NLP (Natural Language Processing) such as Word2Vec [42], GloVe [43], etc.

  2. Unsupervised clustering: Also referred to as cluster-then-label methods, these methods explicitly combine an SL or SSL classification or regression algorithm with a UL or SSL clustering algorithm. The UL or SSL clustering algorithm first clusters all the data points, then those clusters are fed to the SL or SSL classifier or regressor for label inference [44,45,46] (a minimal sketch follows this list).
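As an illustration of the cluster-then-label idea, the sketch below clusters labelled and unlabeled samples together with k-means and assigns to each unlabeled sample the majority label of the labelled points in its cluster. It is a simplified, assumption-laden example (the number of clusters and the fallback to the global majority label are arbitrary choices), not a method from the surveyed papers.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_then_label(X_l, y_l, X_u, n_clusters=10):
    """Cluster all samples, then label each unlabeled point with the majority
    label of the labelled members of its cluster (cluster assumption)."""
    X_all = np.vstack([X_l, X_u])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_all)
    clusters_l = km.labels_[: len(X_l)]   # cluster ids of labelled points
    clusters_u = km.labels_[len(X_l):]    # cluster ids of unlabeled points
    global_majority = Counter(y_l.tolist()).most_common(1)[0][0]
    cluster_label = {}
    for c in range(n_clusters):
        members = y_l[clusters_l == c]
        # Clusters with no labelled member fall back to the global majority label.
        cluster_label[c] = (Counter(members.tolist()).most_common(1)[0][0]
                            if len(members) else global_majority)
    return np.array([cluster_label[c] for c in clusters_u])
```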

2.1.1.1.3 Intrinsically semi-supervised Intrinsically semi-supervised methods are typically extensions of existing SL methods that directly include the information from unlabeled data points in the loss function. Regarding the SSL assumption they rely on, these methods can be further grouped into four categories: (i) maximum-margin methods, where the goal is to maximize the distance between data points and the decision boundary (low-density assumption), (ii) perturbation-based methods, often implemented with neural networks (NN), which rely directly on the smoothness assumption (a noisy, or perturbed, version of a data point should have the same predicted label as the original data point), (iii) manifold-based methods, which either explicitly or implicitly estimate the manifolds on which the data points lie, and (iv) generative models, whose primary goal is to infer a function that can generate samples, similar to the available samples, from random noise.
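The perturbation-based idea can be illustrated with a minimal consistency-regularization training step. The PyTorch sketch below is only a schematic example under assumed settings (the network architecture, Gaussian input noise sigma and weighting factor lam are placeholders); it shows how a supervised loss on labelled data can be combined with an agreement term on unlabeled data, in line with the smoothness assumption, and does not reproduce any specific method from the literature.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_features, n_classes = 20, 2  # assumed dimensions for illustration
model = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def semi_supervised_step(x_l, y_l, x_u, lam=1.0, sigma=0.1):
    """One training step: supervised loss on labelled data plus a consistency
    (perturbation) penalty on unlabeled data.
    x_l, x_u are float tensors; y_l is a tensor of integer class indices."""
    sup_loss = F.cross_entropy(model(x_l), y_l)
    p_clean = F.softmax(model(x_u), dim=1)
    p_noisy = F.softmax(model(x_u + sigma * torch.randn_like(x_u)), dim=1)
    cons_loss = F.mse_loss(p_noisy, p_clean.detach())  # noisy and clean predictions should agree
    loss = sup_loss + lam * cons_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```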

2.1.1.2 Transductive methods

A learner is said to be transductive if it only works on the labelled and unlabeled data available at training time and cannot handle unseen data [17]. The goal of a transductive learner is to infer labels for an unlabeled dataset \(\mathcal {D}_U\), using \(\mathcal {D}_L\). If a new unlabeled data point, \(x_u \notin \mathcal {D}_U\), is given, the learner must be reapplied from scratch to all the data, i.e., \(\mathcal {D}_L\), \(\mathcal {D}_U\), and \(x_u\). Graph-based methods, which are often transductive in nature, define a graph whose nodes are the labelled and unlabeled samples in the dataset and whose (weighted) edges reflect the similarity of the samples. These methods usually assume label smoothness over the graph. Graph methods are non-parametric and discriminative [17]. The defined loss function is optimized to achieve two goals: (i) for already labelled samples from \(\mathcal {D}_L\), the inferred labels should correspond to their true labels, and (ii) the predicted labels of similar samples on the graph should be the same. A transductive learner’s task may be classification or regression.
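As a concrete transductive example, scikit-learn’s graph-based estimators propagate labels over a similarity graph built from both labelled and unlabeled samples. In the sketch below the data are synthetic and the RBF-kernel hyper-parameters are illustrative assumptions; the -1 marker for unlabeled samples follows scikit-learn’s convention for semi-supervised estimators.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # hypothetical feature matrix
y = rng.integers(0, 2, size=200)    # hypothetical binary labels
y[50:] = -1                         # -1 marks the unlabeled samples

graph_model = LabelSpreading(kernel="rbf", gamma=20, alpha=0.2)
graph_model.fit(X, y)                  # builds the graph over all points and propagates labels
inferred = graph_model.transduction_   # inferred labels for every node, including unlabeled ones
```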

2.1.2 SSL for clustering

Semi-supervised clustering methods can be used with partially labelled data as well as other types of outcome measures. When cluster assignments, or partial labels, for a subset of the data are known beforehand, the objective is to classify the unlabeled samples using the known cluster assignments [47]; this is, in a sense, equivalent to an SL problem. When more complex relationships among the samples are known in the form of constraints, the problem becomes a generalization of the previous objective and is called either constrained clustering [48], i.e., an existing clustering method is modified to satisfy the constraints, or distance-based (metric-based) clustering, i.e., an alternative distance metric is used to satisfy the constraints [49, 50].
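To show how such background knowledge is commonly encoded, the short helper below represents must-link and cannot-link constraints as index pairs and checks whether a cluster assignment violates any of them. It is an illustrative assumption of how a constraint check can be written, not an algorithm from the reviewed works; constrained clustering methods use checks of this kind to bias the search for admissible clusterings.

```python
def violates_constraints(assignment, must_link, cannot_link):
    """Return True if the cluster assignment breaks any pairwise constraint.

    assignment: list or dict mapping sample index -> cluster id
    must_link / cannot_link: iterables of (i, j) index pairs
    """
    for i, j in must_link:
        if assignment[i] != assignment[j]:
            return True   # samples that must share a cluster were separated
    for i, j in cannot_link:
        if assignment[i] == assignment[j]:
            return True   # samples that must be separated share a cluster
    return False

# Example: with must_link = [(0, 1)] and cannot_link = [(1, 2)],
# the assignment [0, 0, 0] violates the cannot-link constraint.
```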

Hierarchical and partitional clustering techniques are the two main types of clustering algorithms. Hierarchical clustering methods recursively locate nested clusters in either agglomerative or divisive mode. In agglomerative mode, they start with each data point in its own cluster and successively merge the most similar clusters to form a cluster hierarchy; in divisive or top-down mode, they start with all the data points in one cluster and recursively divide each cluster into smaller clusters [51]. SS Hierarchical clustering methods group samples using a tree-like architecture, known as a hierarchy. They either build separate hierarchies for must-link and cannot-link constrained samples [52,53,54,55] or use other types of constraints [56,57,58,59,60]. Finally, SS Clustering may be used to build clusters related to a given outcome variable [61].

We refer the interested reader to [16, 25, 26, 29, 62] for detailed descriptions of the methods mentioned in this section.

2.2 Cybercrimes

As mentioned in Sect. 1, a cyber-attack is any offensive maneuver that targets computer systems, aiming at information disclosure, theft of or damage to their hardware, software, or electronic data, or at the disruption or misdirection of the services they provide, and cyber-security can be defined as the protection of computer systems against cyber-attacks [19]. Cybercrimes are criminal activities that involve the use of digital technologies such as computers, smartphones, the internet, and other digital devices [63]. From a legal perspective, cybercrimes can be defined as criminal offences that involve the use of a computer or a computer network [64]. The cyber-attacks covered in this article can all be seen as specific types of cybercrime; we, therefore, use the two terms interchangeably. Note that different jurisdictions may have different laws regarding what constitutes a cybercrime or cyber-attack. Therefore, an activity that is considered a cyber-attack in one jurisdiction may not be considered a cybercrime in another, depending on the specific laws in each location, but cybercrimes typically involve the illegal or unauthorized use of digital technologies such as computers [63, 64]. Additionally, some activities that are not considered cyber-attacks in some jurisdictions may still be considered cybercrimes if they violate specific laws related to computer systems and networks [63]. Cybercrimes may also be viewed from technical [65] and procedural [66, 67] perspectives.

The IBM X-Force Incident Response and Intelligence Services (IRIS) estimated the profit made by a group of attackers to be over US$123 million in 2020 [68], and the Cost of a Data Breach report published in 2021 by IBM Security estimates the global average cost per incident at US$4.24 million [69]. Cybercriminals constantly take advantage of catastrophes, disasters, and high-profile events for their own gain. A clear example is the surge in cybercrimes of all sorts witnessed at the beginning of the pandemic.

The following subsections briefly describe the cybercrimes countered in the covered literature.

2.2.1 Network intrusion

Any unlawful action on a digital network is referred to as network intrusion. Network intrusions or breaches can be thought of as a succession of acts carried out one after the other, each dependent on the success of the last. The stages of the intrusion are sequential, beginning with reconnaissance and ending with the compromising of sensitive data [70]. These principles are useful for managing proactive measures and finding bad actors’ behaviour. Network intrusions often include the theft of valuable network resources and virtually always compromise network and/or data security [71, 72]. Living off the land, multi-routing, buffer overwriting, covert CGI scripts, protocol-specific attacks, traffic flooding, Trojan horse malware, and worms are the most frequent intrusion attacks.

Some intruders will attempt to implant code that cracks passwords, logs keystrokes, or imitates a website in order to lead unaware users to their own. Others will infiltrate the network and steal data on a regular basis or alter websites accessible to the public with a range of messages. Intruders may get access to a computer system in a number of ways, including internally, externally, or even physically.

2.2.2 Phishing

IBM X-Force identified phishing as one of the most used attack vectors in 2021 because of its ease of use and low resource requirements [73]. Phishing is a form of cybercrime in which the attackers’ aim is to trick users into revealing sensitive data, including personal information, banking and credit card details, IDs, passwords, and other valuable information, via replicas of legitimate websites of trusted organizations. Phishing attacks can be grouped into deceptive phishing and technical subterfuge [74]. Deceptive phishing is often performed via emails, SMS, calendar invitations, telephony, etc., whereas technical subterfuge tricks individuals into disclosing their sensitive information by downloading malicious code onto the victim’s system. We refer the reader to a recent in-depth study on phishing attacks [74].

2.2.3 Spam

Spam, not to be mistaken for canned meat, may be defined as unsolicited and unwanted messages, typically sent in bulk, that can take several forms such as email, text messages, phone calls, or social media messages. The content of spam messages can vary widely, but they are often commercial in nature and aim to advertise a product or service or promote a fraudulent scheme or solicit donations [75].

2.2.4 Malware

Malware or malicious software is defined as any software that intentionally executes malicious payloads on victim machines (computers, smartphones, computer networks, and so on) to cause disruptions. There exist several varieties of malware, such as computer viruses, worms, Trojan horses, ransomware, spyware, adware, rogue software, wipers, and scareware. In the 2022 Threat Intelligence Index, IBM X-Force reported that ransomware, a type of malware, was again the top attack type in 2021, although decreasing from 23%, in 2020, to 21% [73]. Defensive tactics vary depending on the type of malware, but most may be avoided by installing antivirus software and firewalls, applying regular patches to decrease zero-day threats, safeguarding networks from intrusion, performing regular backups, and isolating infected devices.

2.2.5 Other cyber-attacks

In addition to intrusions, spam, phishing and malware, we also discuss SSL applications for:

  1. Traffic classification—traffic classification may be used to detect patterns suggestive of denial-of-service attacks, prompt automated re-allocation of network resources for priority customers, or identify customer use of network resources that in some manner violates the operator’s terms of service [76];

  2. Sybil detection—a Sybil attack may be defined as an attack against identity in which an individual entity masquerades as numerous identities at the same time [77];

  3. Stock market manipulation detection—market manipulation may be defined as an illegal practice that attempts to boost or reduce stock prices by generating an illusion of active trading [78, 79];

  4. Social bot detection—a social bot may be defined as a social media account that is operated by a computer algorithm to automatically generate content and interact with humans (or other bot users) on social media, in an attempt to mimic and possibly modify their behaviour [80, 81];

  5. Shilling attack detection—a Shilling attack is a particular type of attack in which a malicious user profile is injected into an existing collaborative filtering dataset to influence the recommender system’s outcome. The injected profiles explicitly rate items in a way that either promotes or demotes the target items [82];

  6. Pathogenic social media account detection—Pathogenic Social Media (PSM) accounts refer to accounts that have the capability to spread harmful misinformation on social media to viral proportions. Terrorist supporters, water armies, and fake news writers are among the accounts in this category [83, 84];

  7. Fraud detection—fraud detection in the banking industry, such as credit card fraud detection. Credit card fraud may happen when unauthorized individuals obtain access to a person’s credit card information and use it to make purchases, other transactions, or open new accounts [85]; and

  8. Detection of attacks on other platforms such as the power grid—the smart grid enables energy customers and providers to manage and generate electricity more effectively. The smart grid, like other emerging technologies, raises new security issues [86].

2.3 Examples of industry deployments of ML in cyber-security

This section presents examples of successful industrial deployments of ML for countering cyber threats. The first example is “IBM X-Force Threat Management” [87], an ML platform deployed to counter cyber threats. IBM X-Force Threat Management is a cloud-based security platform that leverages ML to provide advanced threat detection and response capabilities. It analyzes massive amounts of security data, including network traffic, system logs, and user behaviour, to identify and respond to potential threats in real-time using ML algorithms. The ML models are trained on large datasets of historical security events, allowing the system to learn and adapt to new threats over time. Depending on the use case and data available, it is possible that IBM X-Force Threat Management may use a combination of ML techniques, such as SSL and Reinforcement Learning, in addition to other optimization methods for enhancing security policies. However, it should be noted that without specific information from IBM, it cannot be definitively confirmed whether these techniques are actually employed. Nonetheless, the platform has demonstrated success in detecting various types of cyber threats, including banking Trojans such as IcedID, TrickBot and QakBot.

The second example is the Deep Packet Inspection (DPI) system developed by Darktrace, a cyber-security company. The system uses unsupervised ML algorithms to learn the expected behaviour of a network and detect anomalies that may indicate malicious activity. The system can also automatically respond to detected threats by initiating a range of actions, such as quarantining a device or blocking network traffic. Darktrace has deployed its DPI system in various industries, including healthcare, finance, and energy. In one instance, a UK construction company used the system to detect and respond to a ransomware attack. The system identified the attack within minutes of it starting and initiated a range of responses, including blocking the attacker’s IP address and quarantining affected devices. The company was able to contain the attack and avoid paying the ransom demanded by the attackers.

Our third example is Feedzai, an ML platform that provides fraud prevention and anti-money laundering solutions for financial institutions and businesses. Feedzai employs a variety of ML techniques, including Deep Learning and the combination of SL and UL (SSL), to detect and prevent fraudulent activity in real-time. After partnering with a large European bank, Feedzai’s platform reduced false positives and accurately identified fraudulent activity, resulting in lower losses due to fraud.

Overall, IBM X-Force Threat Management, Darktrace, and Feedzai demonstrate how ML can be successfully deployed in the industry to counter cyber threats and provide advanced threat detection and response capabilities.

3 Review methodology

This section provides the details of the methodology we followed. To achieve our goal of reviewing the datasets and evaluation metrics used in applications of SSL techniques to cyber-security, we followed the standard systematic literature review guidelines outlined in [88] for assessing the search’s completeness. The entire process was carried out on Covidence [89], an online tool for systematic review management and production. We first defined our three research questions, shown below. These are motivated by the need to examine the efforts being made to safeguard users and computer systems against attacks using SSL, since attacks are far more harmful than vulnerability scans or related operations. We intend to review the datasets as well as the evaluation metrics used in the literature on identifying cyber-attacks as early as possible so that the necessary actions can be taken to reverse them.

  1. RQ1: With the introduction and use of SSL in cyber-security, what are the assessment metrics used to evaluate the built models?

  2. RQ2: What datasets are the proposed SSL approaches built upon? What are the most used datasets?

  3. RQ3: What are the open challenges with respect to the datasets and performance assessment metrics?

Our inclusion and exclusion criteria were then defined from the above research questions. A paper is included if it directly applies SSL for detecting at least one of the cyber-attacks mentioned in Sect. 2.2 with enough details to address our research questions. On the other hand, a paper is excluded if (i) another paper by the same authors superseded the work, in which case the latest work is considered, (ii) it does not use SSL for any of the cyber-attacks covered by the inclusion criteria, or (iii) the approach is discussed at a high level, with insufficient information to fulfill the research questions. We then queried IEEE Xplore and ACM Digital Library for articles having (“semi-supervised learning” AND “cyber-security”), (“semi-supervised” AND “cyber-security”) and (“semi-supervised” AND “security”) anywhere within the article.

The keywords (“semi-supervised learning” AND “cyber-security”) have been chosen because SSL has been increasingly used in cyber-security to improve the accuracy of detection and classification systems [90]. This combination has been used to find articles that specifically focus on using SSL in cyber-security tasks such as intrusion detection, malware detection, network traffic analysis, etc. Similarly, the combination of (“semi-supervised” AND “cyber-security”) has been used to find articles that discuss semi-supervised learning in a cyber-security context, even if they do not explicitly mention the phrase “semi-supervised learning”. Finally, the combination (“semi-supervised” AND “security”) has been used to broaden the search beyond just cyber-security and potentially include other domains where SSL has been applied to security-related tasks.

Note that we did not limit the search to the title, abstract or keywords because it was essential to making sure to find all the articles discussing and applying SSL methods for cyber-security for screening. The reason we chose these databases is that they are among the top databases suggested by our university library for conducting Computer Science research and they also contain papers published in top-tier venues. To complement the results obtained from IEEE Xplore and ACM Digital Library, we submitted the same search queries to Google Scholar and extracted the top 200 search results sorted by relevance. The combinations mentioned earlier and this search strategy allowed us to find articles that are relevant to using SSL in cyber-security, and gain a better understanding of how it is being/has been used to improve security systems.

As seen in Fig. 2, in total, 1914 studies were imported for screening; 267 duplicates were automatically removed, and the remaining 1647 studies’ titles and abstracts were manually screened for relevance. Based on our inclusion and exclusion criteria, 1319 studies were found irrelevant, because they either did not discuss SSL methods or cyber-attack defences. The remaining 328 studies’ full texts were further assessed as they were either partially or fully related to our inclusion criteria, and finally, 210 relevant studies were included for data extraction. Furthermore, we used state-of-the-art surveys and review articles on SSL [16, 27] and ML for cyber-security [4] to construct this extensive review of cyber-security datasets and performance evaluation metrics for SSL models.

Fig. 2 Review methodology

4 Datasets and performance assessment metrics

In this section, we summarize and analyze the public datasets and performance assessment metrics used in the selected papers.

4.1 Datasets and repositories

AI, especially ML, has proven itself a particularly useful tool in cyber-security as well as other fields of computer science and has featured extensively in the literature on cybercrime and malicious activity detection. The “Cost of a Data Breach” report [69], published by IBM Security, reported a US$3.81 million, or almost 80%, difference in breach costs between companies with fully deployed security AI/ML and automation and companies without them. In this section, we present the public datasets used in the covered literature, grouped by type of attack, and show their usage in the selected papers in Figs. 3, 4, 5, and 6. Note that we acknowledge the difference between Spam and Phishing described in Sects. 2.2.3 and 2.2.2, as they are different attack vectors, but due to the scarcity of these datasets, we have combined them in a single section.

4.1.1 Network intrusion datasets and sources

In terms of network intrusion, we found a total of 18 public datasets and sources in the papers we reviewed. We begin by providing a brief description of each dataset; we, then, provide a summary of their main characteristics as well as some key data usage statistics.

  1.

    KDD’99 and NSL-KDD. The KDD’99 dataset is a statistically preprocessed dataset that has been available since 1999 from DARPA [91]; it is an updated version of DARPA98. It is the most used dataset in the selected papers. The dataset has three components, basic, content and traffic features, making a total of 41 features for normal and simulated attack traffic. The NSL-KDD dataset, proposed by Tavallaee et al. [92], is a version of the KDD’99 dataset in which redundant records are removed to enable classifiers to produce unbiased results. The two datasets contain various attack types such as Neptune-DoS, pod-DoS, Smurf-DoS, and buffer-overflow. Table 1 gives a brief composition of the KDD’99 and NSL-KDD datasets.

  2.

    Moore Set. The Moore Set [93] was prepared in 2005 by researchers at Intel Research. It comprises real-world traces collected with a high-performance network monitor. Each object in the Moore set represents a single flow of TCP packets between client and server and consists of 248 characteristics. The features are derived using packet header information alone, while the classification class has been derived using content-based analysis. Table 2 shows a brief composition of the Moore Set.

  3.

    LBNL2005. The Lawrence Berkeley National Laboratory (LBNL) 2005 traffic traces were collected at LBNL/ICSI under the Enterprise Tracing Project over a period of three months in 2004 and 2005 on two routers [94]. The dataset contains full-header network traffic recorded at a medium-sized enterprise covering 22 subnets and includes trace data for a wide range of traffic, including web, email, backup, and streaming media. Because the traffic traces are completely anonymized, the packets do not carry payloads. As seen in Table 3, the LBNL trace consists of five datasets labelled D0–D4. The “Per Tap” row specifies the number of traces collected on each monitored router port, while the “Snaplen” row gives the maximum number of bytes recorded for each packet.

  4.

    CAIDA Datasets. The Centre for Applied Internet Data Analysis (CAIDA), based at the University of California’s San Diego Supercomputer Center, collects a variety of data from geographically and topologically diverse locations and makes it available to the research community to the extent possible while respecting the privacy of individuals and organizations who donate data or network access. The CAIDA-DDoS Dataset [95], comprises approximately one hour of anonymized traffic from a DDoS attack on August 4, 2007 (20:50:08 UTC to 21:56:16 UTC). This type of denial-of-service attack tries to prevent access to the targeted server by using all of the server’s computational power and all of the bandwidth on the network linking the server to the Internet. The traces only include attack traffic to the victim and responses to the attack from the victim. Non-attack traffic has been eliminated to the greatest extent practicable.

  5.

    Kyoto2006+. The Kyoto2006+ is a publicly available benchmark dataset, consisting of 24 statistical features, that is built on three years of network traffic, from November 2006 to August 2009 [96]. It covers both regular servers and honeypots deployed at Kyoto University in Japan labelled as normal (no attack), attack (known attack) and unknown attack. It includes a variety of attacks performed against the honeypots such as shellcode, exploits, DoS, port scans, backscatter, and malware, shown in Table 4. An updated version of the dataset contains additional data collected from November 2006 to December 2015 [97].

  6.

    UNIBS2009. The UNIBS-2009 trace [98], was compiled by the University of Brescia in 2009. It consists of traffic traces collected by running Tcpdump on the edge router of the university’s campus network on three consecutive working days (2009.9.30, 2009.10.1 and 2009.10.02) connecting the network to the Internet through a 100 Mbps uplink. As shown in Table 5, the dataset supplies the true labels, and the traffic trace includes Web (HTTP and HTTPS), Mail (POP3, IMAP4, SMTP and their Secure Sockets Layer variants), Skype, P2P (BitTorrent, Edonkey), SSH (Secure Shell), FTP (File Transfer Protocol) and MSN.

  7.

    UNB ISCX-2012. The ISCX-2012 dataset has been prepared at the Information Security Centre of Excellence (ISCX) at the University of New Brunswick [99]. It is built on 7 days of network traffic, shown in Table 6, and consists of over two million traffic packets characterized by 20 features taking nominal, integer, or float values. The dataset includes full packet payloads in pcap format.

  8.

    CTU-13. The CTU-13 dataset was compiled by the Czech Technical University [100]. It consists of botnet traffic captured in the university in 2011. The dataset includes thirteen scenarios, shown in Table 7, covering different botnet attacks, that use a variety of protocols and performing different actions, mixed with normal traffic and background traffic. The dataset is available in the forms of unidirectional flow, bidirectional flow, and packet capture.

  9.

    SCADA 2014. The Supervisory Control And Data Acquisition (SCADA) dataset [101] was proposed by the Mississippi State University Key Infrastructure Protection Center in 2014 to evaluate industrial network intrusion detection models. It is one of the standard benchmark databases commonly used in experiments on industrial control network intrusion detection. It includes the Gas system dataset and the Water storage system dataset from the Industrial Control System network layer.

  10.

    UNSW-NB15. The UNSW-NB15 dataset was compiled in 2015 by the School of Engineering and IT, University of New South Wales (UNSW) Canberra at ADFA, using a small emulated network over 31 h of capturing normal and malicious raw network packets. It covers nine attack types: analysis, backdoors, DoS, exploits, generic, fuzzers, reconnaissance, shell code and worms. It consists of over two million records, each characterized by 49 features taking nominal, integer, or float values. The dataset’s data distribution is shown in Table 8.

  11.

    AWID 2015. The Aegean Wi-Fi Intrusion Dataset (AWID), published in 2015 [102], comprises the largest amount of Wi-Fi network data (normal and attack) collected from real network environments. The 16 attack types can be grouped into flooding, impersonation, and injection. As seen in Table 9, the dataset contains over 5 million samples each characterized by 154 features, representing the WLAN frame fields along with physical layer meta-data.

  12.

    ISCXVPN2016. The ISCXVPN2016 dataset [103], published by the UNB in 2016, comprises traffic captured using Wireshark and tcpdump, for a total of 28 GB of data. For the VPN traffic, an external VPN service provider was connected to using OpenVPN in UDP mode. To generate SFTP and FTPS traffic, an external service provider and FileZilla as a client were used. Table 10 shows the data distribution in the ISCXVPN2016 dataset.

  13.

    CIDDS. The Coburg Intrusion Detection Datasets (CIDDS), prepared at Coburg University of Applied Sciences (Hochschule Coburg), consist of several labelled flow-based datasets created in virtual environments using OpenStack. The CIDDS database’s most used dataset, CIDDS-001, released in 2017, covers four weeks of unidirectional traffic flows each characterized by 19 features taking nominal, integer, or float values. As seen in Table 11, the dataset includes attacks such as DoS, port scan and SSH brute force.

  14.

    CICIDS2017. The Canadian Institute for Cybersecurity Intrusion Detection Evaluation Dataset (CICIDS2017) was produced in an emulated network environment at the CIC [104]. It is built on 5 days (July 3 to July 7, 2017) of network traffic, shown in Table 12, and includes a variety of the most common attack types, including FTP patator, SSH patator, DoS slowloris, DoS Slowhttptest, DoS Hulk, DoS GoldenEye, Heartbleed, Brute force, XSS, SQL Injection, Infiltration, Bot, DDoS (Distributed Denial of Service), and Port Scan, each characterized by 80 features extracted using CICFlowMeter [103, 105]. The dataset also includes full packet payloads in pcap format.

  15.

    UGR’16. The UGR’16 dataset, proposed in 2018 by Maciá-Fernández et al. [106], comprises NetFlow network traces collected from a real Tier 3 ISP network made up of several organizations’ and clients’ virtualized and hosted services including WordPress, Joomla, email, FTP, etc. NetFlow sensors were installed in the network’s border routers to capture all incoming and outgoing traffic from the ISP. As seen in Table 13, two sets of data are provided: one for training models (calibration set) and the other for testing the models’ outputs (test set).

  16.

    Kitsune2019. The Kitsune Network Attack Dataset, Kitsune2019, has been prepared at Ben-Gurion University of the Negev, Israel and was released in May 2018 [107]. The dataset is composed of 9 files covering 9 distinct attacks situations on a commercial IP-based video surveillance system and an IoT network: OS (Operating System) Scan, Fuzzing, Video Injection, ARP Man in the Middle, Active Wiretap, SSDP Flood, SYN DoS, Secure Sockets Layer Renegotiation and Mirai Botnet. It contains 27,170,754 samples each characterized by 115 real features. The violation column in Table 14 indicates the attacker’s security violation on the network’s confidentiality (C), integrity (I), and availability (A).

  17.

    NETRESEC is a software company that specializes in network security monitoring and forensics. It also maintains a list of freely accessible public packet capture (.pcap) repositories on the Internet [108]. Most of the listed websites provide Full Packet Capture (FPC) files; however, others only provide truncated frames.

  18.

    MAWI archive. The MAWI archive [109] consists of an ongoing collection of daily Internet traffic traces captured within the WIDE backbone network at several sampling points. Tcpdump is used to retrieve traffic traces, and the IP (Internet Protocol) addresses in the traces are encrypted using a modified version of Tcpdpriv (MAWI Working Group Traffic Archive (http://www.wide.ad.jp)). The samplepoint-F consists of daily traces at the transit link of WIDE to the upstream ISP and has been in operation since 01/07/2006.

  19.

    Kaggle is an online data sharing and publishing platform. It includes security-based datasets such as KDD’99 and NSL-KDD. Registered users can also upload and explore data analysis models.

Table 1 KDD’99 and NSL-KDD composition
Table 2 Moore set composition
Table 3 LBNL2005 composition
Table 4 Kyoto2006+ composition
Table 5 UNIBS2009 composition
Table 6 ISCX-2012 composition
Table 7 CTU-13 composition
Table 8 UNSW-NB15 Composition
Table 9 AWID 2015 composition
Table 10 ISCXVPN2016 composition
Table 11 CIDDS-001 composition
Table 12 CICIDS2017 composition
Table 13 UGR’16 composition
Table 14 Kitsune2019 composition

A breakdown of the usage of the intrusion detection datasets in the selected papers is shown in Fig. 3, and an overview of the network intrusion datasets is provided in Table 15. As seen in Fig. 3, the KDD’99 dataset, despite being old and containing redundant and noisy records, is the most used of the intrusion detection datasets described in this section: 45 out of the 100 selected papers used the KDD’99 either alone or in conjunction with some other intrusion detection dataset. It is followed by the NSL-KDD dataset, which is simply a smaller version without the redundant and noisy records present in KDD’99. Additionally, none of these datasets is balanced, therefore suitable evaluation metrics should be used when evaluating models built on them (see the sketch below). We must highlight that the four most recent datasets used in the reviewed papers were published as far back as 2017 and 2018, and they have not been extensively explored in an SSL context. Finally, we refer the interested reader to a recent comprehensive survey of network-based intrusion datasets [2].
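Because these intrusion datasets are imbalanced, accuracy alone can paint an overly optimistic picture. The sketch below, with hypothetical y_true and y_pred label arrays, shows a few scikit-learn metrics that are more informative for the minority attack class, such as precision, recall (detection rate), F1-score and the Matthews correlation coefficient; the binary labelling convention (1 = attack, 0 = normal) is an assumption made only for illustration.

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

def imbalance_aware_report(y_true, y_pred):
    """Metrics suited to imbalanced intrusion data; y_true/y_pred are hypothetical
    ground-truth and predicted labels with 1 = attack (minority) and 0 = normal."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),                    # can look high even for trivial models
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),  # averages per-class recall
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),                        # detection rate of the attack class
        "f1": f1_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }
```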

Fig. 3 Usage of intrusion detection datasets and sources in selected papers

Table 15 Overview of network intrusion datasets

4.1.2 Spam and phishing datasets and sources

  1.

    Spam Email. The SPAM Email Dataset contains a total of 4601 emails including 1813 spam emails and 2788 legitimate emails each characterized by 58 attributes. It was donated to the UCI Machine Learning Repository by Hewlett Packard in 1999 [110].

  2.

    Ling-Spam. The Ling-Spam dataset, proposed by Androutsopoulos et al. [111] in 2000, contains both spam and legitimate emails retrieved from an email distribution list, the Linguistic list, focusing on linguistic interests around research opportunities, job postings, and software discussion. The dataset contains 2,893 different emails, of which 2,412 are genuine emails collected from the list’s digests and 481 are spam emails retrieved from one of the corpus’ authors.

  3.

    WEBSPAM-UK2006. The WEBSPAM-UK2006 dataset was obtained using a set of .UK pages downloaded by the Laboratory of Web Algorithmics of the University of Milan (Università degli Studi di Milano) and manually assessed by a group of volunteers in 2006. The dataset consists of labels, URLs, hyperlinks and HTML page contents of 77,741,046 Web pages [112].

  4.

    SpamAssassin (spamassassin.apache.org). Apache SpamAssassin is an Open-Source anti-spam platform providing a filter to classify email and block spam. The SpamAssassin Public mail corpus is a selection of 6,047 emails prepared by SpamAssassin in 2006. Of the total count, there are 1,897 spam messages and 4,150 legitimate emails.

  5.

    TREC2007 Public Corpus. The TREC 2007 Public Corpus contains all email messages delivered to a particular server. The server contained several accounts that had fallen into disuse and several ‘honeypot’ accounts published on the web, which were used to sign up for a few services, some legitimate and some not. The TREC dataset contains 75,419 messages, of which 25,220 are legitimate emails and 50,199 are junk messages; the messages are divided into three subcorpora [113].

  6.

    SMS Spam Collection. The SMS Spam Collection Dataset is a publicly available dataset created by Almeida et al. [114,115,116] in 2011. It is a labelled dataset of 5574 SMS messages, 747 spam and 4827 ham, collected from mobile phones.

  7.

    “Gold standard” opinion spam. The “gold standard” opinion spam dataset was proposed by Ott et al. [117] in 2011. The corpus comprises 1,600 review texts, 800 deceptive and 800 genuine, on 20 hotels in the Chicago area. The genuine reviews were obtained from reviewing websites such as TripAdvisor, Expedia and Yelp and the deceptive ones were rendered using Amazon Mechanical Turk (AMT). In the dataset, 400 reviews are written with a negative sentimental polarity and 400 depict a positive sentimental polarity.

  8.

    Spear phishing email dataset (2011) & Benign email dataset (2013). These two datasets have been prepared by Symantec’s enterprise mail scanning service. The spear phishing email dataset contains 1,467 emails from 8 campaigns and the benign email dataset contains 14,043 emails. The emails were sent between 2011 and 2013, and have attachments, anonymous customer information and PII. The extraction process is described in [118, 119].

  9.

    MovieLens Dataset. The GroupLens Research has collected and made available rating datasets from the MovieLens website (https://movielens.org). The datasets were collected over various periods of time, depending on the size of the set. The MovieLens 20 M contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users collected from January 1995 to March 2015 [120].

  10.

    Netflix. The Netflix datasetFootnote 6 consists of listings of all the movies and TV shows available on Netflix, along with details such as - cast, directors, ratings, release year, duration, etc.

  11.

    Twitter and Sina Weibo are two of the most influential social network media platforms in the world. Authors in the selected papers have either used crawlers or APIs to get sample data from these sources.

  12.

    PhishTank, DeltaPhish [121], PhishLabs and the Anti-Phishing Working Group (APWG) are anti-phishing resources that publicly report phishing web pages in an effort to reduce fraud and identity theft caused by phishing and related incidents.

  13.

    YELP and delicious.com publish crowd-sourced reviews about businesses. Similar to Twitter and Sina Weibo, APIs and crawlers may be used to extract data from these sources.

Fig. 4 Usage of spam and phishing datasets and sources in selected papers

Table 16 Overview of spam and phishing datasets

A breakdown of the usage of the described Spam and Phishing datasets in the selected papers is shown in Fig. 4, and an overview of the Spam and Phishing datasets is provided in Table 16. We observe that, in the reviewed works, there is no tendency towards using one or two specific datasets when tackling spam and/or phishing. In fact, the majority of the datasets are used in a single publication, and only four out of nineteen, i.e. WEBSPAM-UK2006, Spam Email, SinaWeibo and “gold standard,” are used in two papers, as shown in Fig. 4. Additionally, except for the “gold standard” dataset, none of these datasets is balanced.

4.1.3 Malware datasets and sources

  1.

    Georgia Tech Packed-Executable Dataset. The Georgia Tech Packed-Executable dataset [122] was published in 2008. It consists of 2598 packed viruses collected from the Malfease Project dataset (http://malfease.oarci.net) and 2231 non-packed benign executables collected from a clean installation of Windows XP Home plus several common user applications. The authors also generated 669 packed benign executables by applying 17 different executable packing tools freely available on the Internet to the executables in the Windows XP start menu. Of the 3267 packed executables in their collection, PEiD (http://peid.has.it), one of the most used signature-based detectors for packed executables, was able to detect only 2262, whereas 1005 remained undetected. Those 1005 undetected samples were therefore kept in the test set, while the training set contains 4493 samples: the 2231 non-packed benign executables and the 2262 packed executables detected by PEiD.

  2.

    The Malimg Dataset [123], proposed in 2011 by the University of California, Santa Barbara, contains 9458 malware images from 25 families.

  3.

    The Malware Genome Project [124], proposed by researchers at the North Carolina State University in 2011, contains 1260 Android Malware samples belonging to 49 different malware families collected from August 2010 to October 2011.

  4.

    Malheur [125, 126], proposed in 2011, is a tool for the automatic analysis of malware behaviour in a sandbox environment.

  5.

    Malicia Dataset. The Malicia dataset [127, 128], published in 2013, comprises 11,688 malware binaries in Windows Portable Executable format, collected from 500 drive-by download servers over a period of 11 months. The objective of the work was to identify hosts spreading malware in the wild and to collect malware samples. To collect the samples, the authors set up a honeypot whose clients consulted a database of malware URLs, resolved the corresponding IP addresses and repeatedly downloaded (“milked”) the binaries served by each website.

  6.

    CTU-Malware. The CTU-Malware dataset [129], also compiled by the Czech Technical University, consists of hundreds of captures (called scenarios) of different malware communication samples. Both malware and normal samples are included in the dataset as shown in Table 17.

  7.

    In 2015, Microsoft launched the Microsoft Malware Classification Challenge, along with the release of a dataset [130] consisting of over 20,000 malware samples belonging to nine families. Each malware file includes an identifier, which is a 20-character hash value that uniquely identifies the file, and a class label, which is an integer that represents one of the nine families to which the malware may belong.

  8.

    USTC-TFC2016. The USTC-TFC2016 dataset [131], published in 2017, consists of ten types of malware traffic from public websites, collected from a real network environment from 2011 to 2015. Alongside this malicious traffic, the benign part contains ten types of normal traffic collected using IXIA BPS, professional network traffic simulation equipment. The dataset is 3.71 GB in size and provided in pcap format; its composition is shown in Table 18.

  9.

    CICAndMal2017. The CICAndMal2017 Android malware dataset, published in 2018 by the CIC [132], consists of four malware categories, namely Adware, Ransomware, Scareware and SMS Malware, with 80 traffic features extracted using CICFlowMeter [103, 105]. The dataset includes 5,065 benign apps from the Google Play market, published in 2015, 2016 and 2017, and 426 malware samples belonging to 42 unique malware families. The dataset is fully labelled and contains network traffic, logs, API/SYS calls, phone statistics and memory dumps of the malware families shown in Table 19.

  10.

    CICMalDroid2020. Also published by the CIC in 2020, the CICMalDroid2020 dataset [133, 134] consists of more than 17,341 Android samples from several sources collected from December 2017 to December 2018. It includes a complete capture of static and dynamic features and contains samples spanning five distinct categories: Adware, Banking malware, SMS malware, Riskware and Benign. Out of the 17,341 samples, 13,077 ran successfully, while the rest failed due to errors such as time-outs, invalid APK files and memory allocation failures. Of the 13,077 samples, 12% could not be opened, mostly due to an “unterminated string” error. For the remaining 11,598 samples, 470 extracted features capture frequencies of system calls, binders and composite behaviours, 139 extracted features capture frequencies of system calls, and 50,621 extracted features capture static information such as intent actions, permissions, sensitive APIs, receivers, etc. A brief composition of the dataset is shown in Table 20.

  11.

    VxHeavens is a website dedicated to providing information about malware. The archive comprises over 17,000 programs belonging to 585 malware families (Trojans, viruses, worms).

Table 17 CTU-Malware composition
Table 18 USTC-TFC2016 composition
Table 19 CICAndMal2017 composition
Table 20 CICMalDroid2020 composition
Fig. 5 Usage of malware datasets and sources in selected papers

Table 21 Overview of malware datasets

We provide an overview of the Malware datasets in Table 21. In Fig. 5, we also show a breakdown of the usage of the described Malware datasets in the selected papers. We observe that, out of the eleven datasets, four have been used in three publications, one in two publications and the remaining six in only one. In addition, none of these datasets is balanced.

4.1.4 Additional datasets and sources

  1.

    IEEE Test Feeders. For nearly two decades, the Distribution System Analysis (DSA) Subcommittee’s Test Feeder Working Group (TFWG) has been constructing publicly available distribution test feeders for use by academics. These test feeders aim to create distribution system models that reflect a wide range of design options and analytic issues. The 13-bus and 123-bus Feeders are part of the Test Feeder systems created in 1992 to evaluate and benchmark algorithms in solving unbalanced three-phase radial systems. The DSA Subcommittee approved them during the 2000 Power and Energy Society (PES) Summer Meeting. Schneider et al. [135] summarize the TFWG efforts and intended uses of Test Feeders.

  2.

    The XSSed project was created in February 2007. It is an archive of cross-site scripting (XSS) vulnerable websites and provides information related to XSS vulnerabilities.

  3.

    The NeCTAR (National eResearch Collaboration Tools and Resources) cloud platform, launched in 2012 by the Australian Research Data Commons, provides Australia’s research community with fast, interactive, self-service access to large-scale computing infrastructure, software and data.

  4.

    The Mobile-Sandbox [136], proposed by the University of Erlangen-Nuremberg, Germany, in 2014, is a static and dynamic analysis system designed to support analysts in detecting malicious behaviours of malware.

  5.

    Credit Card Fraud. The dataset was collected and analyzed during a research collaboration between Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection [137,138,139,140,141,142,143,144,145]. It contains transactions made with credit cards in September 2013 by European cardholders. The transactions occurred over two days, with 492 frauds out of 284,807 transactions; the dataset is thus highly unbalanced, with the positive class (frauds) accounting for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation; unfortunately, the original features and further background information about the data are not provided due to confidentiality issues. The only features not transformed with PCA are ‘Time’ and ‘Amount’: ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset, and ‘Amount’ is the transaction amount.

  6.

    Twitter ISIS Dataset. The Twitter ISIS dataset [84], published in 2018, consists of ISIS-related tweets/retweets in Arabic gathered from February 2016 to May 2016. The dataset includes the tweets and associated information such as user ID, re-tweet ID, hashtags, number of followers, number of followees, content, date and time. About 53 million tweets were collected based on 290 hashtags such as State of the Islamic-Caliphate and Islamic State. Table 22 provides a brief overview of the Twitter ISIS dataset composition.

  7.

    Italian Retweets Timeseries. The Italian Retweets Timeseries dataset [146], published in 2019, contains temporal data on 5,121,132 retweets by 47,947 users from the Italian Twittersphere, published between 18/06/2018 and 01/07/2018.

Table 22 Twitter ISIS composition
Fig. 6 Usage of additional datasets and sources in selected papers

The breakdown of the usage of the additional datasets in the selected papers is shown in Fig. 6.

4.2 Performance assessment metrics

Frequently, a model’s performance is evaluated by constructing a confusion matrix [147], shown in Table 23, and calculating several metrics from the values of the confusion matrix. Table 24 shows the metrics commonly used to evaluate the performance of ML models. TP represents the true positives, the samples predicted as malicious or attacks that were truly malicious, TN the true negatives, the samples predicted as benign that were truly benign, FP the false positives, the samples predicted as attacks that were in fact benign, and FN the false negatives, the samples predicted as benign that were in fact attacks or malicious.

Table 23 Confusion matrix

The accuracy score represents the fraction of correctly predicted samples, benign and malicious, and the error rate considers the misclassified samples. The accuracy metric may be misleading, especially when classes are highly imbalanced. The precision rate is the ratio of correctly predicted malicious samples to all samples predicted as malicious, and the sensitivity is the ratio of correctly predicted malicious samples to all malicious samples. The Negative Predictive Value relates to the precision but considers the benign samples; similarly, the specificity relates to the sensitivity but considers the benign samples. The False Positive (Negative) Rate is the ratio of benign (malicious) samples predicted as malicious (benign) to all the benign (malicious) samples. The \(F_1\)-score is the harmonic mean of the precision and recall scores. This metric aggregates two metrics to provide a more global view of the performance. The Geometric-Mean measures how balanced the prediction performances are on both the majority and minority classes.

Table 24 Evaluation metrics
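To make the relationships summarized in Table 24 concrete, the following minimal Python sketch (our own illustration, not code taken from the selected papers) computes the scalar metrics directly from the confusion-matrix counts, treating attacks as the positive class in line with Table 23:

```python
import math

def scalar_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute common confusion-matrix metrics; attacks are the positive class."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total                    # accuracy
    err = (fp + fn) / total                    # error rate
    ppv = tp / (tp + fp) if tp + fp else 0.0   # precision (positive predictive value)
    dr = tp / (tp + fn) if tp + fn else 0.0    # detection rate / sensitivity / recall
    npv = tn / (tn + fn) if tn + fn else 0.0   # negative predictive value
    spc = tn / (tn + fp) if tn + fp else 0.0   # specificity / true negative rate
    fpr = fp / (fp + tn) if fp + tn else 0.0   # false positive rate (false alarm rate)
    fnr = fn / (fn + tp) if fn + tp else 0.0   # false negative rate
    f1 = 2 * ppv * dr / (ppv + dr) if ppv + dr else 0.0  # harmonic mean of PPV and DR
    gmean = math.sqrt(dr * spc)                # geometric mean of DR and specificity
    return {"ACC": acc, "ERR": err, "PPV": ppv, "DR": dr, "NPV": npv,
            "SPC": spc, "FPR": fpr, "FNR": fnr, "F1": f1, "G-Mean": gmean}

# Toy example on a highly imbalanced test set (492 attacks, 284,315 benign samples)
print(scalar_metrics(tp=400, tn=284_000, fp=315, fn=92))
```

On such an imbalanced sample, the ACC is close to 1 (about 0.999) even though the DR is roughly 0.81 and the \(F_1\)-score roughly 0.66, which is precisely why the ACC alone can be misleading.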

The kappa (\(\kappa \)) statistic, introduced in [148], considers a model’s prequential accuracy, \(p_0\), and the probability of randomly guessing a correct prediction, \(p_c\). If the model is always correct, \(\kappa =1\), and if its predictions are no better than random guessing, \(\kappa =0\); a \(\kappa < 0\) indicates less agreement than would be expected by chance alone. The Matthews Correlation Coefficient (also known as the phi coefficient or mean square contingency coefficient), introduced in [149], may be seen as a discretization of the Pearson Correlation Coefficient [150], or Pearson’s r, for a binary confusion matrix. It measures the correlation between predicted and actual values and returns a value between \(-1\) and \(+1\), where \(-1\) indicates a completely incorrect classifier and \(+1\) a perfect one.
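For reference, these two coefficients are commonly defined as follows (standard formulations given here for convenience, not reproduced from the selected papers):

\[ \kappa = \frac{p_0 - p_c}{1 - p_c}, \qquad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \]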

Researchers also use graphical-based metrics to observe the performance. However, these metrics make the comparison between different models more complex; for this reason, summarizations of graphical-based metrics are used. An example of such metrics is the receiver operating characteristic curve, or ROC curve, which provides a graphical representation of a binary classifier’s diagnostic performance as its discrimination threshold is varied. The Area Under the ROC (AUC ROC or AUROC) represents the probability that a uniformly drawn random positive sample is ranked higher than a uniformly drawn random negative sample. Like the ROC, the Precision-Recall Curve (PRC) employs multiple thresholds on the model’s predictions to compute distinct scores for precision and recall. Because computing the Area Under the PRC (AUPRC) is not as straightforward as the AUROC computation, the interested reader is referred to [151], where a review of the main solutions proposed to compute the AUPRC is presented. Finally, training time and inference time are the time required to build a model and to provide predictions, respectively.
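As a brief, hedged illustration (a sketch assuming scikit-learn, which the selected papers do not necessarily use), both curves and their scalar summaries can be obtained from a model’s scores as follows; the AUPRC is approximated here with the common trapezoidal rule, only one of the options reviewed in [151]:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, auc

# Ground-truth labels (1 = attack) and model scores for a small, imbalanced toy sample
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.25, 0.40, 0.35, 0.80, 0.55])

fpr, tpr, _ = roc_curve(y_true, y_score)                 # points of the ROC curve
auroc = roc_auc_score(y_true, y_score)                   # area under the ROC curve

prec, rec, _ = precision_recall_curve(y_true, y_score)   # points of the PR curve
auprc = auc(rec, prec)                                   # trapezoidal area under the PRC

print(f"AUROC = {auroc:.3f}, AUPRC = {auprc:.3f}")
```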

As seen in Fig. 7, where we present a breakdown of the usage of the evaluation metrics in the selected papers, the ACC is the most used of the 15 metrics considered for evaluation. It is used in 108 of the 210 selected papers; the DR follows, used in 100 papers, then the PPV, used in 69 papers, and the \(F_1\)-score, used in 61 papers (respectively 22.1%, 20.5%, 14.1% and 12.5% of all metric usages reported across the selected papers). As highlighted in Sect. 4.1, except for the “gold standard” dataset, none of the presented datasets is balanced, which suggests that the ACC is not a suitable metric for performance assessment; for instance, on the Credit Card Fraud dataset of Sect. 4.1.4, a trivial model labelling every transaction as benign would reach an accuracy above 99.8% while detecting no fraud at all. The DR, PPV and \(F_1\)-score, however, are more suitable metrics than the accuracy as they account for the class imbalance in datasets. In cyber-security, the DR is useful as there is a high cost associated with missed attacks; similarly, the PPV is an important metric to consider, as a low PPV indicates that benign samples or transactions are being flagged as attacks, which renders the ML model useless. Due to the imbalanced nature of cyber-security datasets, as seen in Sect. 4.1, the \(F_1\)-score is a useful assessment metric as it balances the DR and PPV. The least used metrics are the NPV and the \(\kappa \)-score, which have both been used only once in the selected papers. The NPV depends on the prevalence of attacks in the dataset; in other words, it is sensitive to imbalanced data. As a result, if the prevalence of attacks in the training dataset differs from the prevalence of attacks in the real world, the computed NPV may be inaccurate: as the prevalence of attacks decreases, the NPV increases, because there are more true negatives for every false negative. This is because a false negative would imply that a data point is actually an attack, which is improbable given the scarcity of attacks [152]. Similarly, the \(\kappa \)-score is also sensitive to imbalanced datasets and is therefore not well suited to the cyber-security domain, where attacks are less frequent than benign samples or transactions. Finally, the time complexity (training and inference) is only reported in 2.7% of the selected papers.

Fig. 7 Usage of evaluation metrics in selected papers

5 Open issues and challenges

This section answers our third research question and presents the open challenges found in the literature. We cover open issues and challenges related to the datasets and assessment metrics used, review the lessons learnt and recommend future research directions. Finally, we also discuss the challenge posed by the gap between research and practice in the field of cyber-security, particularly in the application of ML.

5.1 Datasets and repositories

In Sect. 4.1, we have described 45 datasets, repositories and sources. We summarize the key issues found related to the datasets in this subsection.

Outdated datasets:

Over 70 of the 100 reviewed articles focusing on intrusion detection used either the KDD’99 or the NSL-KDD datasets, which are closed, anonymized and outdated (over 20 years old). Similarly, the most recent Spam and Phishing email dataset used in the selected papers is from 2013. It is therefore possible that some of the attacks under consideration are no longer relevant due to changes in attack vectors, and additional factors such as availability and comparability further limit the usefulness of these datasets. Additionally, the use of outdated datasets hinders the ability to generalize the results to current real-world scenarios [153].

Limited number of samples:

Besides being outdated, the Spam and Phishing datasets used in the selected papers, except for TREC and WEBSPAM-UK, contain far fewer samples than the intrusion datasets: they comprise 5000 or fewer samples, with the “gold standard” dataset containing only 1600.

Non-representative data distribution:

Moreover, in addition to containing synthetically generated and manually labelled data, these datasets exhibit a class imbalance that is not representative of real-world scenarios, which can render the proposed approaches ineffective when applied to real data. This is one of the primary reasons why most academic methods are not implemented in practice.

Lack of train/test splits to allow comparison of results:

As shown in Table 15, apart from the KDD’99, NSL-KDD, UNIBS2009, AWID2015, UNSW-NB15 and UGR’16 datasets, the datasets in the selected papers are not originally split into train and test partitions. Even when such splits exist, authors often train and test their proposed approaches on random, narrower partitions of the data rather than on the provided train/test partitions.

Several datasets used are kept private:

Most of the data collected from traffic or spam and/or phishing feeds are frequently kept private, making it impossible for other authors to reproduce results.

Lack of benchmark datasets:

There are no updated, standard and public benchmark datasets for the different cyber-security problems. Due to these facts, accurate comparisons of the approaches are impossible without having to re-implement them and obtain the data from sources such as traffic or phishing feeds.

In computer science, the quality of the output is determined by the quality of the input, as captured by George Fuechsel’s phrase “garbage in, garbage out”. We acknowledge the limitations of the reviewed datasets and repositories and advocate the development of more up-to-date, standardized and open benchmark cyber-security datasets that reflect the current state of cyber threats and attack vectors; these datasets should also be adequately separated into training, validation and testing partitions. Additionally, we recommend that future studies consider using multiple datasets and test the models on a variety of scenarios to improve the generalizability of the results and to allow proper evaluation, comparison and real-world application.
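As a minimal sketch of this recommendation (assuming pandas and scikit-learn; the file name and label column below are hypothetical), a reproducible, stratified split preserves the attack ratio across the training, validation and testing partitions and makes reported results easier to compare:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical flow-based dataset with a binary 'label' column (1 = attack)
df = pd.read_csv("flows.csv")
X, y = df.drop(columns=["label"]), df["label"]

# 60/20/20 train/validation/test split, stratified to preserve the attack ratio
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

print(y_train.mean(), y_val.mean(), y_test.mean())  # attack prevalence per partition
```

Publishing such fixed, stratified partitions alongside a dataset would remove one of the main obstacles to comparing results across studies.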

5.2 Performance assessment metrics

In Sect. 4.2, we presented the 15 metrics used in the selected papers for assessing the performance of the SSL models built on the datasets presented in Sect. 4.1. In this subsection, we present an overview of the significant issues identified in relation to the performance assessment metrics.

Overlooked/unused metrics:

Throughout the selected papers, we have noticed that certain important assessment metrics are not used in most of the papers. For example, in [154], only the AUROC is reported, in [155, 156], only DR and FPR or FAR are reported, and in [157] only DR and ACC are reported. This shows that authors are giving more importance to certain metrics while overlooking others, such as PPV and \(F_1\)-score, which should be used in conjunction as they consider the class imbalance in datasets.

Inadequate assessment frameworks:

The accuracy is a misleading metric in imbalanced settings; however, it has been used alone in [158,159,160,161]. Furthermore, the accuracy can be inadequate for use in the real world, where data is typically unbalanced. In light of this, it is important to conduct assessments using realistic deployment scenarios with unbalanced data and adequate assessment frameworks, and the chosen metrics must accommodate the needs of the target audience.

Training and Inference time:

Only 2.7% of the selected papers report time complexity measurements, even though these are important metrics in the cyber-security domain, where attacks should be detected as soon as possible and static models often need to be rebuilt from scratch to detect unseen attacks. More importance should be given to this assessment metric, as it is imperative to detect and mitigate attacks in a timely manner.

False alarm rate:

An excessive amount of false positives may be detrimental to cyber-security because it increases the likelihood that users will ignore or dismiss alarms, leaving them vulnerable to serious cyber threats that they might otherwise have caught. Only 59 of the 210 selected papers (12.1% of all metric usages) measure the FAR; this shows that an assessment metric that should be given more weight is not being prioritized enough.

The issue of imbalanced data in cyber-security has been the subject of several recent studies. In particular, researchers have explored alternative techniques to address this issue, such as cost-sensitive learning [162], which assigns higher misclassification costs to the minority class (i.e., the class with fewer instances) than to the majority class, encouraging the model to focus on correctly classifying instances of the minority class and thus improving performance on the rare class. Additional techniques include data augmentation, which can be performed through over-/under-sampling, ensemble methods such as bagging and boosting, and the use of scalar and graphical metrics that are adequate for imbalanced settings [163].
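The following sketch (our own illustration, assuming scikit-learn and imbalanced-learn, neither of which is mandated by the cited works) contrasts the two families of remedies on synthetic data: cost-sensitive learning via class weights and data augmentation via SMOTE over-sampling:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data: roughly 1% positives (attacks)
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Cost-sensitive learning: weight errors on the minority class more heavily
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Data augmentation: over-sample the minority class before fitting an unweighted model
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```

Either remedy should, of course, be evaluated with the imbalance-aware metrics discussed in Sect. 4.2 rather than with the accuracy alone.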

5.3 Bridging the gap between ML-based cyber-security research and practice

The field of cyber-security faces a significant challenge due to the gap between research and practice, especially in the applications of ML [153, 164]. While several industries have successfully deployed ML-based solutions in the field of cyber-security (Sect. 2.3), and research has made significant advances in developing new ML algorithms, the ML algorithms developed by academia are often not practical to implement in real-world scenarios due to scalability, data availability, and regulatory compliance issues. Moreover, the lack of communication and collaboration between academic researchers and industry practitioners adds to the disconnect. As a result, several ML-based cyber-security solutions have not been widely adopted in the industry. This gap underscores the need for increased knowledge sharing and cooperation between researchers and practitioners, a better understanding of the industrial requirements and constraints from academia, as well as a good understanding of ML concepts from both academia and practitioners [165, 166].

To address this gap, there is a need for more interdisciplinary collaboration and partnerships between academia and industry. Collaboration can help researchers better understand the practical challenges faced by practitioners, while practitioners can provide researchers with access to real-world data and feedback on the effectiveness of ML algorithms in practice [164]. Another way to bridge the gap is through the development of standardized evaluation frameworks for ML-based cyber-security solutions as discussed in Sect. 5.2. Standardization can help ensure that ML algorithms are evaluated in a consistent and transparent manner, making it easier for practitioners to understand the effectiveness of a particular solution.

Moreover, it is important to develop ML algorithms that are explainable and interpretable. Several AI algorithms used in cyber-security and other fields, in general, are considered “black boxes” [167], meaning it can be difficult to understand how they make decisions. This lack of transparency can be a barrier to adoption, as it can be difficult for practitioners to trust and validate the results produced by these algorithms. The development of more explainable and interpretable ML algorithms can help address this issue [168,169,170].

In summary, bridging the gap between research and practice in ML-based cyber-security requires interdisciplinary collaboration, standardized evaluation frameworks, and the development of explainable and interpretable ML algorithms.

6 Conclusion

In this survey, we have reviewed the datasets, repositories and performance assessment metrics used in the state-of-the-art applications of SSL methods in the field of cyber-security, namely network intrusion detection, spam and phishing detection, malware detection and categorization, and additional cyber-security areas. Good datasets are necessary for building and evaluating strong SSL models. Our main contribution is an extensive analysis of the cyber-security datasets and repositories. This in-depth analysis attempts to assist readers in identifying datasets and sources that are appropriate for their needs. The review of the datasets reveals that the research community has recognized that there is a lack of publicly available cyber-security datasets and has recently attempted to address this gap by publishing several datasets. Because multiple research organizations are working in this field, further intrusion detection datasets and advancements can be expected in the near future.

We investigated the datasets used in the different papers applying SSL methods for cyber-attack prevention, presented as improvements over conventional security systems and over fully SL or UL methods, which are often not adequate in the cyber-security field, where labelled data is scarce and difficult to obtain. We have reviewed the subcategories of SSL methods and provided a taxonomy based on previous studies. To the best of our knowledge, this is the first work that analyzes the datasets used in the literature applying SSL methods for intrusion, spam, phishing and malware detection. We have also summarized the multiple performance evaluation metrics used for assessing the built models. In addition, where applicable, we have provided brief descriptions, compositions and usage trends of the datasets used in the reviewed literature. There are no up-to-date and representative benchmark datasets available for each threat domain; however, the reviewed datasets, despite being outdated, are still heavily used in research. Furthermore, most of the publicly available datasets are either imbalanced or not initially split into train/test/validation partitions, making the comparison of results a tedious task. Moreover, we have outlined the primary open challenges and issues identified in the literature, highlighted strategies for bridging the gap between research and practice, and compiled a comprehensive bibliography in this area. The aforementioned issues and challenges deserve particular attention in future research. Finally, we acknowledge the potential constraints associated with literature reviews, such as limitations on search thoroughness and content selection, which may influence our findings; we have therefore made our best efforts to minimize these limitations.