Introduction

The widespread adoption of digital services in people’s daily lives has increased the demand for cybersecurity. With the proliferation of new software and hardware, detecting known botnets and other types of attacks has become a daunting task for cybersecurity professionals. Botnets, as one type of cyberattack, can have disastrous consequences [1, 2]: they allow attackers to remotely control infected machines and can affect numerous devices in parallel, particularly in IoT networks, where large numbers of devices are interconnected.

Cybersecurity incidents are predominantly addressed reactively, after an attack has occurred, which requires cybersecurity professionals to respond and mitigate the resulting damage. To combat these infections, cybersecurity experts are developing proactive systems that use machine learning and deep learning (ML and DL) technologies. The primary data for such systems, however, consist predominantly of historical attack data. Nearly all cybersecurity systems are therefore built on historical attack patterns, which leaves them susceptible to emerging variants. Moreover, many organizations refrain from sharing their attack data; the resulting scarcity hinders the effective training of ML or DL models and the development of such systems.

The current study proposes a methodology for generating botnet-type data in a tabular format. The methodology employs an 8-layer generative adversarial network (GAN) model [3], and the study evaluates how effectively this model generates synthetic data with high precision while minimizing computational expense. The generated samples are assessed using a wide range of graphical data quality indicators, such as cumulative sums, absolute log mean and STD diagrams, and correlation matrices with heatmaps.

The remainder of this study is organized as follows. Section “Related Works” reviews related research on botnet attack generation techniques. Section “BNGAN: A Proposed Solution for Addressing the Data Issue” explains the design of the GAN model in more depth. Section “Data Generation Result Evaluation” evaluates the quality of the synthetic dataset, and Section “Conclusion and Future Work” discusses the findings.

Related Works

The escalating damage that botnet attacks inflict on computer systems underscores the need to investigate detection methods more deeply, and a large body of work on this topic exists in the literature. In line with the core elements of this study outlined earlier, the works discussed in this section concentrate on two aspects: the generation and the classification of botnet attack datasets.

Yin et al. [4] concentrated on augmenting botnet detection. Their research introduced a GAN designed to generate nearly lifelike botnet attack samples, enhancing the training of machine-learning classifiers. The Bot-GAN consistently supplied “synthetic” data to the discriminator, which classified these samples using a softmax function. This approach resulted in improved accuracy and precision when compared to pretrained models utilizing the original imbalanced dataset. Pursuing a similar route to mitigate the challenges posed by imbalanced datasets, Song et al. [5] introduced the GAN-efficient lifelong learning algorithm (ELLA) solution. Their methodology demonstrated that dataset expansion through a GAN architecture not only boosted the performance of traditional ML solutions for botnet identification but also enhanced the lifelong learning approach of the ELLA algorithm.

Truong-Huu et al. [6] investigated the application of GANs in network anomaly detection. They employed multiple datasets to compare the performance of GANs with that of other network anomaly detection methods. Their experiments revealed significant improvements over existing deep learning techniques, indicating promise for detecting unknown anomalous behavior and zero-day attacks, with a focus on botnet traffic.

Zhong et al. [7] introduced MalFox, a solution designed to demonstrate the limitations of existing black box detectors. MalFox employs a convolutional GAN and adopts a confrontational strategy to create perturbation paths. These paths incorporate up to three methods (Obfusmal, Stealmal, and Hollowmal) to generate adversarial malware examples. Their results showed promising performance, with an accuracy of approximately 99%, while the detection rate of the generated samples was at a lower percentage, around 45%.

The significance of GANs for data augmentation, especially in cybersecurity, was also underscored by the Conditional Tabular GAN (CTGAN) model of Habibi et al. [8]. They experimented with various CTGAN versions and parameters to identify the most effective one. The outcomes demonstrated CTGAN’s ability to preserve the structure of both continuous and discrete data. This offers a way to address dataset imbalance for ML classifiers or detectors and to train them for novel threats, given that GAN-generated data are novel and unseen.

Lingam et al. [9] conducted a study on imbalanced data concerning bot identification. Their objective was to tackle the issue of imbalanced data for ML classifiers by employing a GAN with a gated recurrent unit (GRU). This enabled them to generate synthetic data closely resembling real data, effectively balancing benign user and bot classes. Results indicated that their approach outperformed ML methods trained solely on the original Twitter dataset, achieving an average accuracy of approximately 91% with the GAN-generated dataset.

BNGAN: A Proposed Solution for Addressing the Data Issue

Generative adversarial networks (GANs) [1] utilize an architecture that generates new data based on input data and random noise. GANs consist of two components: the generator and discriminator. The generator uses random noise to create realistic data, while the discriminator classifies input samples as either real or fake. Both components are optimized based on the discriminator’s ability to accurately classify real and fake data.
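For reference, the adversarial optimization described above is commonly written as the minimax objective of the original GAN formulation. The symbols below ($G$, $D$, $p_{\text{data}}$, $p_z$) are notation introduced here for illustration only; the source does not state that BNGAN uses exactly this loss.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big]
```

Here $D(x)$ is the discriminator's estimated probability that a sample $x$ is real, and $G(z)$ is the sample the generator produces from random noise $z$.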

It is therefore important to experiment with different GAN architectures and to tune their hyperparameters in order to find the model best suited to a particular dataset and objective. Hyperparameter optimization and architectural exploration are valuable tools for identifying a suitable GAN structure for a specific task.

This study aims to evaluate the effectiveness of a proposed 8-layer GAN architecture, called BNGAN, in generating synthetic data that accurately represent malicious cyberattacks, specifically botnet attacks [3]. To this end, the 8-layer design [10] is used for both the generator and the discriminator, and its performance is evaluated on the CTU-13 dataset [11] from the Stratosphere IPS. This dataset includes captures of diverse malware samples and normal traffic, with 32 million packets. The training dataset has 216,352 records, of which 140,849 are labeled “0” (malware) and 75,503 are labeled “1” (legitimate). The evaluation dataset has 88,258 records without any labels.

The BNGAN model architecture is designed to generate 1D synthetic data from the input dataset. The model was implemented using TensorFlow 2.0 and the Keras API, whose Sequential API is used to stack the layers of the deep neural network. The generator is a Sequential model with an input layer that accepts appropriately scaled, randomly generated noise of the intended size. This input is processed by six hidden layers with the “ReLU” activation function, followed by an output layer with a “linear” activation function whose dimension matches that of the preprocessed dataset.
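The generator described above can be sketched with the Keras Sequential API roughly as follows. This is a minimal illustration, not the authors' code: the noise size, the hidden-layer widths, and the output width of eight features are assumptions, since the source gives only the layer count and the activation functions.

```python
import tensorflow as tf

NOISE_DIM = 32                               # assumption: noise size is not given in the source
N_FEATURES = 8                               # assumption: width of the preprocessed dataset
HIDDEN_UNITS = [64, 128, 256, 256, 128, 64]  # assumption: six hidden layers, widths hypothetical


def build_generator(noise_dim=NOISE_DIM, n_features=N_FEATURES):
    """Input layer, six ReLU hidden layers, and a linear output layer."""
    model = tf.keras.Sequential(name="generator")
    model.add(tf.keras.Input(shape=(noise_dim,)))
    for units in HIDDEN_UNITS:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
    # Linear output so generated rows can take any scaled real value.
    model.add(tf.keras.layers.Dense(n_features, activation="linear"))
    return model


generator = build_generator()
noise = tf.random.normal((4, NOISE_DIM))
fake_rows = generator(noise, training=False)  # four synthetic rows of N_FEATURES values
```

The sketch counts the input layer as one of the eight layers, which is one possible reading of the description.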

The discriminator is likewise a Sequential model, composed of eight dense layers. The first seven layers use the “ReLU” activation function, while the last layer uses the “sigmoid” function to classify input samples as either authentic (real) or counterfeit (generated). To improve the model’s precision, a 20% dropout rate is applied to the visible (input) layer and the six hidden layers of the discriminator. This dropout rate was chosen through a series of iterative experiments, weighing its effect on preventing overfitting against the model’s ability to capture pertinent data patterns.
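A comparable sketch of the discriminator is given below. The layer widths and the input width of eight features are assumptions, and placing a dropout layer after each of the seven ReLU layers is one interpretation of the description rather than a confirmed detail.

```python
import tensorflow as tf

N_FEATURES = 8                                 # assumption: width of the preprocessed dataset
RELU_UNITS = [256, 256, 128, 128, 64, 32, 16]  # assumption: seven ReLU layers, widths hypothetical
DROPOUT_RATE = 0.2                             # 20% dropout, as stated in the text


def build_discriminator(n_features=N_FEATURES):
    """Seven ReLU dense layers with dropout, then one sigmoid output unit."""
    model = tf.keras.Sequential(name="discriminator")
    model.add(tf.keras.Input(shape=(n_features,)))
    for units in RELU_UNITS:
        model.add(tf.keras.layers.Dense(units, activation="relu"))
        model.add(tf.keras.layers.Dropout(DROPOUT_RATE))
    # Sigmoid output: estimated probability that the input row is real.
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))
    return model


discriminator = build_discriminator()
scores = discriminator(tf.random.normal((4, N_FEATURES)), training=False)
```

Dropout is only active when the model is called with `training=True`, so the scores above are deterministic for a given input.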

After detailing the generator and discriminator models, the proposed BNGAN model is characterized as a sequential model that integrates these components in an adversarial manner. Figure 25.1 illustrates how the BNGAN model uses (preprocessed) botnet data samples to generate synthetic, tabular data.

Fig. 25.1 BNGAN Model Implementation. Schematic diagram of the BNGAN model: original data pass through the BNGAN discriminator to the discriminator predictions; if the data are correct, they go through fine-tune training, and the BNGAN generator then produces generated data, which are passed back through BNGAN.

Data Generation Result Evaluation

In the previous section, the generator and discriminator models were defined and combined to form the complete BNGAN model. Training was then started to generate datasets that mirror the originals. It ran for a total of 1000 epochs, with each epoch training both the generator and discriminator networks on batches of a predefined size. The discriminator received as input a batch of data samples from the original dataset together with the samples produced by the generator. For each batch, the discriminator computed the loss for both the genuine and the generated data. These losses were used to update the discriminator, after which the generator losses and gradients were computed via backpropagation. Through this iterative process, the generator keeps improving the quality of the synthetic data samples by adjusting its weights according to these gradients.
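One way to express a single iteration of this procedure is sketched below with `tf.GradientTape`. The small stand-in networks, the binary cross-entropy loss, the Adam optimizers with their learning rates, and the batch size are all assumptions made for illustration; the source does not specify them.

```python
import tensorflow as tf

NOISE_DIM, N_FEATURES, BATCH_SIZE = 16, 8, 32  # assumptions for illustration


def small_mlp(in_dim, out_dim, out_activation):
    """Tiny stand-in network so the sketch runs quickly (not the 8-layer models)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(in_dim,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(out_dim, activation=out_activation),
    ])


generator = small_mlp(NOISE_DIM, N_FEATURES, "linear")
discriminator = small_mlp(N_FEATURES, 1, "sigmoid")
bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(1e-4)  # assumption: optimizer and rate not given in the source
d_opt = tf.keras.optimizers.Adam(1e-4)


def train_step(real_batch):
    batch = real_batch.shape[0]
    # 1) Discriminator update: loss on a real batch plus loss on a generated batch.
    fake_batch = generator(tf.random.normal((batch, NOISE_DIM)), training=True)
    with tf.GradientTape() as tape:
        real_pred = discriminator(real_batch, training=True)
        fake_pred = discriminator(fake_batch, training=True)
        d_loss = (bce(tf.ones_like(real_pred), real_pred)
                  + bce(tf.zeros_like(fake_pred), fake_pred))
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
    # 2) Generator update: gradients flow back through the updated discriminator.
    noise = tf.random.normal((batch, NOISE_DIM))
    with tf.GradientTape() as tape:
        fake_pred = discriminator(generator(noise, training=True), training=True)
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return float(d_loss), float(g_loss)


real_batch = tf.random.normal((BATCH_SIZE, N_FEATURES))  # stand-in for preprocessed real rows
for _ in range(3):  # a full run would loop over all batches for 1000 epochs
    d_loss, g_loss = train_step(real_batch)
```

Only the generator's variables are updated in the second step, so the discriminator acts purely as a critic while the generator learns.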

Visual representations such as diagrams are an effective way to assess and illustrate the similarity between datasets generated by a GAN model and real data. They offer insight into the fidelity and precision of the generated dataset and help researchers pinpoint where the GAN model needs improvement to produce synthetic data that closely mirror the real data. The choice of diagram types depends on the nature of the analyzed data and on the objectives of the research. The current study uses the following diagrams to evaluate the generated data: correlation matrices with heatmaps, which highlight clusters and distinctions between the real and generated datasets; cumulative sum (cumsum) diagrams, which visualize the accumulation over time of the original and the generated data; and absolute log mean and STD diagrams, which compare the similarity scores of the original and generated datasets. Figures 25.2, 25.3, and 25.4 visualize the comparison results between the original data and the data generated by the GAN model.
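The quantities behind these three diagram types can be computed with NumPy along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the random arrays stand in for the real and generated tables, and the exact construction of the cumsum and log mean/STD plots is not given in the source, so the versions here are one plausible reading.

```python
import numpy as np


def correlation_difference(real, fake):
    """Absolute difference between the Pearson correlation matrices of two tables."""
    return np.abs(np.corrcoef(real, rowvar=False) - np.corrcoef(fake, rowvar=False))


def cumulative_curve(column):
    """Sorted values and the cumulative fraction of rows at or below each value."""
    values = np.sort(column)
    fraction = np.arange(1, values.size + 1) / values.size
    return values, fraction


def log_abs_mean_std(data, eps=1e-5):
    """Log of the absolute column means and of the column standard deviations."""
    log_mean = np.log(np.abs(data.mean(axis=0)) + eps)
    log_std = np.log(data.std(axis=0) + eps)
    return log_mean, log_std


rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))                      # stand-in for the real table
fake = real + rng.normal(scale=0.1, size=real.shape)   # stand-in for the generated table

diff = correlation_difference(real, fake)      # heatmap data for a "difference" panel
values, fraction = cumulative_curve(real[:, 0])  # one curve of a cumsum diagram
real_mean, real_std = log_abs_mean_std(real)   # x-coordinates of the mean/STD scatterplots
fake_mean, fake_std = log_abs_mean_std(fake)   # y-coordinates of the mean/STD scatterplots
```

Plotting `fake_mean` against `real_mean` (and `fake_std` against `real_std`) gives points that lie near the diagonal when the generated data match the real data column by column.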

Fig. 25.2 Correlation Matrices with Heatmap. Three correlation matrices (real, fake, and difference) with color gradient scales. (A, B) The real and fake matrices show high intensity along the diagonal; the color scale ranges from -1.00 to 1.00. (C) The difference matrix shows high intensity at State/State, Dport/State, Sport/Label, and Label/Sport; the color scale ranges from 0.00 to 0.30.

Fig. 25.3 Cumulative Sum Diagrams. Eight line graphs of the cumulative sum versus Dur, Sport, Dport, State, TotPkts, TotBytes, SrcBytes, and Label for real and fake data. 1. A straight rise then a slow rising trend. 2. A steep rising trend. 3. A zigzag rising trend. 4. A straight side then a slow rising trend. 5 to 7. A straight rise then a stable trend. 8. A straight trend.

Fig. 25.4 STD Diagrams. Two scatterplots: fake data mean versus real data mean, and fake data standard deviation versus real data standard deviation. (A) The best-fit line rises linearly from (-2, -2) to (10, 10). (B) The best-fit line rises linearly from (-2, -2) to (12, 12).

Based on the results shown above, the cumulative sum diagrams (Fig. 25.3) reveal how similar the real and generated datasets are for eight variables. Five of these variables (Dur, TotPkts, TotBytes, SrcBytes, and Label) exhibit a consistent, steadily increasing similarity score in both datasets, suggesting a continuous pattern. In contrast, the remaining three variables (Sport, Dport, and State) display a fluctuating pattern with abrupt spikes and drops in the similarity score, indicating deviations in the synthetic dataset. These fluctuations suggest that certain data points diverge significantly from the overall pattern, which lowers the overall similarity scores for these variables. The cumulative sum diagrams also suggest that the GAN model might require more training epochs to produce a synthetic dataset that closely resembles the real one.

In the correlation matrix diagrams (Fig. 25.2), the real dataset shows a strong positive correlation among its variables. In the generated synthetic dataset, the positive correlations are weaker, and no significant negative correlations emerge. Additionally, a strong positive correlation arises between the various features in the “Difference” panel, signifying that as the epochs progress, the generated data realistically replicate patterns and characteristics of the real dataset.

Finally, the absolute mean and standard deviation diagrams (Fig. 25.4) reveal that the synthetic dataset contains higher values than the real dataset for certain features. This suggests that the generated data for some features do not precisely mirror the real data, at least initially. As the number of training epochs increases, however, the synthetic dataset aligns progressively more closely with the real dataset.

Conclusion and Future Work

As digital tools continue to evolve and become more prevalent, the need for effective cybersecurity measures has become increasingly critical. The primary objective of this study is to outline a comprehensive methodology for generating synthetic botnet attack data using a generative adversarial network (BNGAN). The generation process uses an open-source dataset, the CTU-13 dataset provided by Stratosphere IPS, a collection of network traffic captures that has been widely used in cybersecurity research. These tabular data serve as input for the proposed BNGAN architecture [11]. The BNGAN model generates over 200,000 new botnet data samples that closely resemble the original data. The generated samples are then evaluated using a wide range of graphical data quality indicators, including cumulative sums, absolute log mean and STD diagrams, and correlation matrices with heatmaps. Overall, the proposed methodology provides a promising approach to improving botnet attack detection and prevention. Future work will expand the data categories and domains covered, encompassing diverse data formats and a broader range of cyberthreats. A further important avenue is the integration of lifelong learning techniques, both for data generation and for the zero-day detection and classification of such attacks.