Introduction

The rapid expansion of information technology and the widespread adoption of the Internet have undoubtedly improved various aspects of everyday life, but they have also given rise to concerns regarding the management and cyber security of the massive network traffic data generated due to people’s high dependence on the Internet. With this increased reliance, numerous network security issues have emerged, exposing online activities to various risks. Cyber attackers frequently target computer systems on networks, exploiting vulnerabilities in software applications to carry out different types of attacks such as probes, botnets, distributed denial of service (DDoS), trojans, and more. The aftermath of these attacks can result in severe cyberspace security problems, leading to significant financial losses for individuals and organizations at large.

To address these challenges and protect against invasive activities, the Network Intrusion Detection System (NIDS) has emerged as an effective method for detecting cyberattacks and enhancing network security [1]. Intrusion detection systems (IDSs) typically assess current network activity against known patterns or standards to determine whether a network connection is behaving normally or whether there is an intrusion. However, the Internet continuously accumulates a vast amount of high-dimensional and detailed data as it evolves, rendering conventional IDS methods inadequate. Researchers have explored various machine learning techniques for network intrusion detection (NID) and have published studies utilizing algorithms such as Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbor (KNN), and Decision Trees (DT) [2,3,4]. Some studies have examined the performance of SVM, NB, and DT in identifying network threats [5,6,7,8], while others have turned to deep learning techniques, such as Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), and Recurrent Neural Networks (RNN), to address the limitations of shallow classifiers and improve intrusion detection accuracy [9,10,11]. Despite the advancements made by deep learning techniques, challenges persist in achieving high generalization and classification accuracy. For instance, while CNNs are adept at processing large amounts of data and possess excellent feature extraction capabilities, the pooling process may discard information that is critical for intrusion data. Additionally, RNNs and CNNs may struggle to capture long-term dependencies, which affects their generalization and performance. In this context, another relevant framework for developing intrusion detection systems is one-class classification, with anomaly detection algorithms serving as important alternatives. Furthermore, the incorporation of outlier detection and concept drift models plays a vital role in tackling the evolving nature of cyber threats [12,13,14].

This paper presents a novel approach to address the aforementioned challenges by proposing a two-stage feature extraction technique for network intrusion detection. The proposed approach leverages the hybridization of CNN and GRU networks to mitigate the drawbacks associated with individual techniques. The resulting IDS, referred to as CNN-GRU-FF, harnesses the CNN algorithm to extract spatial features and the GRU algorithm to extract temporal features from input data. Subsequently, a multilayer perceptron (MLP) strategy integrates the features extracted via the convolutional layers in the CNN with those from the GRU algorithm. To evaluate the performance of this approach, we conduct experiments on two well-known intrusion detection datasets, namely NSL-KDD and UNSW-NB15. The key contributions of this work are as follows:

  1. Propose a novel two-stage feature extraction NIDS method, which effectively leverages the advantages of CNN and RNN through feature fusion. This comprehensive approach enables robust detection of network intrusions by incorporating spatial and temporal feature learning from deep neural networks.

  2. Address the imbalanced class issues in intrusion detection datasets by utilizing a modified focal loss function, which further enhances the accuracy of the proposed CNN-GRU-FF NIDS approach.

  3. To the best of our knowledge, this work is the first to propose using the fusion of spatial and temporal features coupled with a modified focal loss function for the NIDS problem.

  4. Demonstrate through extensive experimentation that the CNN-GRU-FF NIDS approach outperforms existing NIDS methods in the literature, highlighting its effectiveness and potential for detecting network intrusions.

The subsequent sections of this paper are arranged as follows: Section “Related works” presents the most recent works relevant to our study. Section “Methods and datasets” describes the intrusion detection datasets and presents the working principles of the proposed NIDS model. In Section “Experiments and results”, we describe our experiments and discuss their results. Finally, Section “Conclusion and future directions” concludes our study and highlights future directions.

Related works

NIDS has caught the scientific community’s attention as a vital tool for guaranteeing network security since its inception by Anderson [15] in 1980. As a classification problem, NIDS can be categorized into three methodological concepts: anomaly, misuse, and hybrid detection. Anomaly detection is a technique for detecting behaviors or activities that deviate from what is usual for a given user. Misuse detection is a signature-based strategy equipped with previously recorded attack behaviors or signatures. A misuse detection NIDS raises a signal when it identifies a signature or behavior that matches the recorded ones. A hybrid NIDS combines the anomaly-based and misuse-based approaches. Over the years, security researchers have proposed several NIDSs in the literature, with the earliest ones adopting statistical methods for network intrusion detection. However, as artificial intelligence research advances, machine learning (ML) and deep learning (DL) methodologies are gaining popularity.

ML and DL-based network intrusion detection systems

Machine learning is classified into two types: supervised and unsupervised. Most machine learning models are shallow learning models with straightforward structural designs and great generalization potential. As a result, they are widely utilized in numerous research domains, including NIDSs.

The authors in [16] proposed an intrusion detection system using the K-Nearest Neighbor (KNN) method for wireless sensor networks. This work used the NSL-KDD dataset for evaluation and obtained good performance results. In [17], Ingre et al. proposed and evaluated a decision tree-based NIDS using the NSL-KDD dataset. Their algorithm utilizes correlation technique to select the best feature subset before training with the decision tree (DT) classifier. Using the KDDCup’99 NIDS dataset, Neethu et al. [18] combined a Naïve Bayes (NB) classifier and Principal Component Analysis (PCA) for feature selection and binary class classification. In [19], Patil et al. presented the Adaboost ML approach to investigate the efficiency of using the KDDCup’99 and NSL-KDD datasets to develop a NIDS. The research work in [20] proposes a hybrid ML model that incorporates the k-Means clustering method and two other classification techniques (i.e., NB and KNN) for detecting network intrusions. For feature selection, their model employs an entropy-based approach. The approach was evaluated with the KDDCup’99 data and obtained a more effective performance than other algorithms with k-Means alone. Muniyandi et al. [21] also proposed a technique for detecting network anomalies based on k-Means and C4.5. The authors cascaded the k-means clustering method with the C4.5 decision tree algorithm, which outperformed other existing methods in the literature (KNN, Naïve Bayes, SVM, and ID3). In another approach, Shapoorifard et al. [22] presented a KNN classifier integrated with a K-means clustering algorithm for network anomaly detection. This work improves existing KNN classifiers for intrusion detection and results in a good performance. In [23], the authors presented a NIDS based on a wrapper method that uses hypergraph (HG) combined with genetic algorithm (GA) to generate likely feature subsets. The method employs SVM as a classification technique that uses the NSL-KDD dataset for evaluation. 
Based on the results of their experiments, their proposed approach achieves an accuracy of 96.72% with 35 selected features. Although these ML approaches achieved good performance results, ML-based IDSs often have a few drawbacks (e.g., raising false alarms and the possibility of being tricked by smartly launched attacks), as stated in [24].

On the other hand, deep learning consists of DNNs and feature extraction mechanisms. It has strong classification and generalization capabilities and can learn features directly from vast amounts of data without requiring feature engineering. As a result, DL has received much interest and has become a study focus in network intrusion detection. In [25], Wu et al. presented a CNN-based NIDS which transforms the original data’s vector structure into image representations before extracting traffic features for training. The authors in [26] combined conventional machine learning techniques with a DL framework by applying genetic algorithms (GA) and simulated annealing procedures to DNNs for intrusion detection. From experimental results, their approach shows high detection and low false-positive rates on the NSL-KDD, CICIDS2017, and CIDDS-001 datasets, outperforming other state-of-the-art techniques. In [27], Baig et al. proposed a feedforward artificial neural network-based intrusion detection approach. Findings from experiments with the UNSW-NB15 dataset reveal that their method can successfully detect intrusions with an 86.40% detection accuracy. Yin et al. [28] used RNN to identify intrusions, and the findings show that deep learning-based NIDS works better when processing large amounts of data. Yang et al. [29] integrated an enhanced conditional variational auto-encoder with a DNN for cybersecurity intrusion detection. The authors used their proposed method to discover and investigate possible sparse representations of data attributes and classes. The auto-encoder uses a decoder to create fresh attack samples based on the predefined intrusion categories, boosting detection accuracy. In another study, Mirza et al. [30] projected data samples onto a vector space spanned by latent variables using an autoencoder model.
The work extracts features using the LSTM algorithm and then uses a predetermined threshold to detect whether an incoming network data stream is abnormal. The authors in [31] proposed a strategy for reducing redundant and irrelevant features in IDS data samples by incorporating feature selection and ensemble learning algorithms. According to the experimental results, their approach demonstrated promising performance on numerous datasets. Using a 1D CNN approach, Azizjon et al. [32] developed a DL strategy for a flexible and efficient NIDS. Compared to existing traditional ML models, the model achieved good results on the UNSW-NB15 dataset. The authors in [33] put forth a novel neural network Bat algorithm known as LGBA-NN, which yielded impressive results in detecting multi-class botnets. Their experiments demonstrated an improvement of up to 90% in detection accuracy, outperforming other advanced methods such as particle swarm optimization and the Bat algorithm. In [34], Toldinas et al. transformed network data into four-channel images and employed the ResNet50 model for classification. The result was a highly accurate intrusion detection system, evaluated on datasets such as UNSW-NB15 and BOUN-DDoS. Pu et al. [35] addressed the escalating complexity of network attacks with an unsupervised anomaly detection method for intrusion detection. The integration of subspace clustering (SSC) and one-class support vector machines (OCSVM) led to superior performance on the challenging NSL-KDD dataset. In response to ever-evolving cyber threats, Khan [36] introduced HRCNNIDS, a hybrid intrusion detection system leveraging deep learning. The fusion of convolutional neural networks and recurrent neural networks resulted in a high detection rate for malicious attacks, as validated on the CSE-CIC-IDS2018 dataset. Nguyen et al. [37] brought blockchain data transfer and intrusion detection together to fortify security and efficiency. By employing deep belief networks, they achieved improved detection results on the NSL-KDD 2015 and CIDDS-001 datasets. To tackle the issue of imbalanced intrusion detection data samples, Panigrahi et al. [38] proposed a host-based intrusion detection algorithm with a balance-enhancing mechanism. Their approach, backed by an improved multi-class feature selection technique, delivered high detection accuracy. Injadat et al. [39] explored the impact of sampling techniques on model performance. Their multi-stage optimization approach for intrusion detection systems combined SMOTE algorithm-based dataset sampling, feature selection, and parameter optimization. The result was a substantial boost in detection accuracy, alongside reductions in training sample and feature set sizes. The authors in [40] introduced sample-weighted and class-weighted algorithms to enhance support vector machines for intrusion detection. Their method showed several advantages, including time efficiency, high recognition accuracy, low false alarm rate, and superior classification accuracy in diverse situations. In [41], the authors proposed a few-shot learning model for intrusion detection. Using a Siamese convolutional neural network, they efficiently identified cyber-physical attack types, thanks to an optimized feature representation and a robust cost function comprising transformation, coding, and prediction losses. Experimental results demonstrated the model’s strong detection performance.

Feature fusion-based network intrusion detection systems

In recent years, several research works have shown that feature fusion can provide significantly better data reference sources for model generalization, thereby enhancing the efficiency of detecting network intrusions [42]. The authors in [43] proposed a hierarchical network intrusion detection approach using CNN and LSTM. The model first uses the CNN algorithm to learn spatial data features at the low level before learning temporal features at the high level using the LSTM algorithm without using any feature engineering method. Experimental findings indicate that their proposed approach outperforms other published methods in literature on the ISCX2012 and DARPA1998 datasets. Multi-CNN feature fusion was used by Li et al. [44] to develop a DL-based network intrusion detection system. The authors transformed the 1D traffic data into a grayscale graph and categorized network traffic features into four groups based on their correlation. On the NSL-KDD dataset, research results show that the multi-CNN feature fusion method has a high detection accuracy and minimal complexity. One fundamental feature, on the other hand, cannot accurately describe traffic characteristics. Moreover, the detection impact of one feature subset may be so inadequate that it influences the detection effect of other feature subsets. In [45], Xiao et al. introduced a streamlined version of the residual network, aptly named S-ResNet, comprising a series of interconnected simplified residual blocks. Departing from the conventional residual block, the S-ResNet variant omits a weight layer and two batch normalization (BN) layers while integrating a novel pooling layer and employing the parametric rectified linear unit (PReLU) function in place of the rectified linear unit (ReLU) function. Notably, the proposed approach obtained promising outcomes on the NSL-KDD dataset, exhibiting good proficiency in detecting U2R and R2L attacks. 
The authors in [46] presented CNN-IDS, a network intrusion detection model. To enhance efficiency, they eliminate irrelevant features using dimensionality reduction techniques. The CNN is then utilized for extracting essential information, optimizing intrusion identification through supervised learning. To ensure computational efficiency, they converted the original traffic vector format into a more manageable image representation. The model’s performance is evaluated on the widely utilized KDD-CUP99 dataset, establishing its effectiveness and efficiency in combating network intrusions. Sinha et al. [47] presented a deep learning model that combines a Convolutional Neural Network and a Bi-directional LSTM to capture spatial and temporal features. Tested on the NSL-KDD and UNSW-NB15 datasets, the model shows good detection accuracy and outperforms state-of-the-art systems. In [48], Jiang et al. tackled low detection accuracy and complexity issues with their malicious domain name detection model (CNN-GRU-Attention). The model employs CNN for spatial feature extraction, GRU for temporal feature extraction, and the attention mechanism to boost domain name detection accuracy. In a similar approach, Cao et al. [49] introduced a hybrid sampling algorithm (ADASYN and RENN) to tackle sample imbalance. Their intrusion detection model, evaluated on the UNSW-NB15, NSL-KDD, and CIC-IDS2017 datasets, achieved improved classification accuracy (86.25%, 99.69%, and 99.65%, respectively) compared to CNN-GRU, thereby addressing the issues of low accuracy and class imbalance.

Methods and datasets

This study provides a NIDS approach based on a two-layer feature fusion technique. The proposed approach is organized into four main phases (i.e., data preparation, GRU, CNN, and the attribute fusion phase). In the subsequent subsections, we discuss the datasets and the various components of our proposed NIDS architecture.

Description of datasets

This research evaluates the effectiveness of the proposed method using two well-known network intrusion detection datasets, NSL-KDD and UNSW-NB15. The choice of NSL-KDD and UNSW-NB15 datasets in this study is justified by several factors that make them common and popular choices for evaluating intrusion detection methods. These factors include:

  (i) Realism and Diversity: Both datasets are derived from real-world network traffic data, providing a more realistic representation of network activities and potential intrusion scenarios. This realism is crucial for assessing the performance of intrusion detection methods in practical settings. Additionally, these datasets encompass a wide range of network attacks and normal activities, ensuring diversity in the data samples.

  (ii) Benchmarking: NSL-KDD and UNSW-NB15 datasets have been widely adopted in the research community over the years, serving as benchmark datasets for evaluating the effectiveness of various intrusion detection techniques. Their widespread use facilitates fair and standardized comparisons among different methods, enabling researchers to showcase the strengths and weaknesses of their proposed approaches relative to existing state-of-the-art solutions.

  (iii) Availability and Reproducibility: The availability and accessibility of these datasets contribute to their popularity in the literature. Researchers can easily obtain and reproduce experiments, fostering transparency and facilitating peer review. This availability promotes collaboration and enables the research community to build upon previous work, ultimately driving advancements in intrusion detection research.

  (iv) Well-documented and Annotated: NSL-KDD and UNSW-NB15 datasets come with detailed documentation and annotations that describe the data attributes, the types of attacks present, and the corresponding ground truth labels. Such information enhances the interpretability of the results and allows researchers to gain deeper insights into the strengths and limitations of their methods.

  (v) Comparison with Previous Works: As these datasets have been widely used in prior research, adopting them in new studies allows for meaningful comparisons with existing state-of-the-art methods. It enables researchers to identify trends, improvements, and gaps in the field, guiding the development of more effective and accurate intrusion detection techniques.

NSL-KDD dataset

The NSL-KDD dataset [50] is an enhanced version of the KDD Cup’99. It contains 125,973 traffic records in the training set (NSL-KDDTrain+) and 22,544 records in the test set (NSL-KDDTest+). The dataset is categorized into five distinct classification labels, namely: Normal, Probe, Denial of Service (DoS), User-to-Root (U2R), and Remote-to-Local (R2L), with the latter four being the different network attack types in the dataset. The NSL-KDD dataset contains fairly distributed traffic records with no duplicates, as shown in Table 1.

UNSW-NB15 dataset

In the UNSW Canberra Cyber Range Lab, raw network packets for the UNSW-NB15 dataset were produced using the IXIA PerfectStorm tool to generate a hybrid of genuine modern normal activities and synthetic contemporary attack behaviors. The dataset [51] comprises 100 GB of captured raw traffic and nine attack types: exploits, worms, backdoor, fuzzers, generic, analysis, DoS, shellcode, and reconnaissance [52]. In this study, we use the processed training and testing sets provided by the UNSW lab. The testing set includes 82,332 records of the various attack types, whereas the training set includes 175,341 records. Each sample has 49 features, including 2 class label features. Table 2 shows the complete data distribution for each class.

Table 1 Traffic distribution of the NSL-KDD training and testing datasets
Table 2 Traffic distribution of the UNSW-NB15 training and testing datasets

Preparation of datasets

The dataset preparation step is vital, and it is carried out on all of the datasets used in this study. The NSL-KDD and UNSW-NB15 datasets contain both numeric and non-numeric features. Like any neural network, our proposed method’s back-end calculations are executed on numeric values rather than symbolic elements. As a result, we first convert non-numeric features in each dataset to numeric representations using one-hot encoding. After feature encoding, the data in each dataset is normalized to achieve a standard data format. Finally, we perform a dimensional reduction process to obtain the appropriate dimensions for all datasets before training.

Fig. 1 Proposed CNN-GRU-FF model architecture

Feature encoding

The NSL-KDD dataset has 41 features for training and testing, of which 38 are numeric and 3 are non-numeric. The non-numeric features are protocol_type, service, and flag. The “service” feature contains 70 different attribute values, whereas “flag” and “protocol_type” contain 11 and 3 different attribute values, respectively. We map the three non-numeric features into numeric values using one-hot encoding. One-hot encoding uses a state register of size N to encode N categories, with each category having its own register bit. After one-hot encoding, the three non-numeric features are mapped to 84 numeric features and combined with the initial 38 numeric features to yield a set of 122 numeric features. As shown in Table 1, the class labels in the dataset are also label-encoded into five code labels.

The UNSW-NB15 dataset comprises 47 training and testing features, 44 of which are numeric and 3 of which are symbolic. The three symbolic attributes (i.e., state, proto, and service) contain 16, 135, and 13 different attribute values, respectively. These 3 features are mapped to 164 numeric representations and combined with the initial 44 numeric attributes to obtain a set of 208 features. In addition, class labels in the UNSW-NB15 dataset are label-encoded into ten classification code labels, as shown in Table 2.
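The one-hot mapping described above can be sketched in plain Python. This is a minimal illustration rather than the preprocessing code used in the study: the helper names `build_vocabulary` and `one_hot_encode` and the toy records are hypothetical, and only the feature counts are taken from the text.

```python
def build_vocabulary(values):
    """Return the sorted distinct categories of one symbolic feature."""
    return sorted(set(values))


def one_hot_encode(value, vocabulary):
    """Map one categorical value to a 0/1 vector with a single 1."""
    return [1 if value == category else 0 for category in vocabulary]


# Toy records (duration, protocol_type); the values are illustrative only.
records = [(0, "tcp"), (2, "udp"), (0, "icmp"), (5, "tcp")]

protocol_vocab = build_vocabulary(proto for _, proto in records)

# Keep the numeric column and replace the symbolic one by its one-hot columns.
encoded = [[duration] + one_hot_encode(proto, protocol_vocab)
           for duration, proto in records]

# Feature-count bookkeeping stated in the text.
nsl_kdd_width = 38 + (3 + 70 + 11)       # numeric + one-hot columns
unsw_nb15_width = 44 + (16 + 135 + 13)   # numeric + one-hot columns
```

Each symbolic feature contributes as many columns as it has distinct values, which is why the widths grow to 122 and 208.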

Data normalization

The data in both the UNSW-NB15 and NSL-KDD datasets comprise a variety of distributions, each with its own mean and variance, making neural network learning problematic. Hence, we standardize all data inputs before training by applying Eq. (1) to transform all data to a normalized distribution with mean 0 and variance 1.

$$\begin{aligned} f' = \frac{f - \mu }{\sigma } \end{aligned}$$
(1)

where f denotes the original feature, \(f'\) represents the scaled feature, \(\mu \) represents the mean value of the feature, and \(\sigma \) indicates the feature’s standard deviation. The original and scaled data maintain the same linear relation [53]. Data normalization enhances the generalization of a model, resulting in more accurate performance.
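As a minimal sketch of Eq. (1), the following standard-library snippet standardizes one feature column. The function name `standardize`, the toy values, and the choice to map a constant column to zeros are assumptions made for illustration; the text does not say how zero-variance features are handled.

```python
import statistics


def standardize(column):
    """Scale one feature column to mean 0 and variance 1, following Eq. (1)."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)    # population standard deviation
    if sigma == 0:                       # assumption: constant feature maps to 0
        return [0.0 for _ in column]
    return [(f - mu) / sigma for f in column]


feature_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # toy values
scaled = standardize(feature_column)
```

In practice the mean and standard deviation would be estimated on the training set and reused for the test set, so that both are scaled consistently.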

Data dimensionality reduction

The proposed approach in this article (CNN-GRU-FF) accepts data inputs with a two-dimensional (2D) structure. However, the data values of both the UNSW-NB15 and NSL-KDD datasets are one-dimensional (1D) vectors. Hence, we perform dimension reduction on both datasets and convert the data inputs to 2D structures. To achieve this, we applied a correlation-based feature filtering technique to reduce the data dimension by dropping the feature(s) with the minimum correlation. Correlation-based feature filtering helps in dimensionality reduction by retaining only the most informative features, reducing the risk of overfitting and improving model generalization. However, it is essential to strike a balance in the feature selection, because eliminating too many features can lead to a flat feature space with little variability. To prevent this, we combine the correlation-based feature filtering with a mutual information test to select the most relevant features from the datasets based on their correlation with the target variable (in this case, the intrusion detection label). The primary objective is to ensure that the most informative and diverse set of features is retained. After applying the feature filtering method to the NSL-KDD dataset, we reduced the data dimension from 122 to 121, which is then converted into an 11 \(\times \) 11 matrix representation. Similarly, we reduce the input dimension of the UNSW-NB15 dataset from 208 to 196 and convert it to a 14 \(\times \) 14 matrix representation to meet the input requirement of the proposed CNN-GRU-FF model.
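The filtering and reshaping steps can be sketched as follows. This is a simplified illustration under stated assumptions: features are ranked only by the absolute Pearson correlation with the label (the mutual information test mentioned above is omitted), the data are synthetic stand-ins for NSL-KDD, and the helper names `pearson`, `keep_top_features`, and `to_square` are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def keep_top_features(rows, labels, n_keep):
    """Indices of the n_keep features with the largest |correlation| to the label."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels))
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])       # preserve the original column order


def to_square(row, side):
    """Reshape a flat vector of side*side values into a side x side matrix."""
    assert len(row) == side * side
    return [row[i * side:(i + 1) * side] for i in range(side)]


random.seed(0)
rows = [[random.random() for _ in range(122)] for _ in range(50)]  # synthetic
labels = [random.randint(0, 1) for _ in range(50)]                 # synthetic

kept = keep_top_features(rows, labels, n_keep=121)   # drop the weakest feature
matrix = to_square([rows[0][j] for j in kept], side=11)
```

The same two helpers would cover the UNSW-NB15 case by keeping 196 of 208 features and reshaping with `side=14`.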

Proposed NIDS architecture

This subsection discusses the various components of the proposed CNN-GRU-FF network intrusion detection architecture. CNN-GRU-FF is basically constructed with four building units as depicted in Fig. 1: Batch Normalization, CNN, GRU, and Feature Fusion Units. The functions of each unit are presented in the subsequent subsections below:

Batch normalization unit

When the parameters of the preceding layer change during the training of CNNs and GRUs, the input distribution of the subsequent layer changes as well, making deeper neural networks much harder to train. Batch Normalization (BN) reduces this internal covariate shift during model training by normalizing each layer’s inputs to zero mean and unit variance. Furthermore, batch normalization ensures that each layer’s input follows the same distribution, making it easier and faster to train neural network models. BN proceeds in two steps: the first is input normalization, and the second is re-scaling and offsetting. Suppose we have a batch input from layer h; we first compute the mean of the hidden activations \(\mu \) as:

$$\begin{aligned} \mu = \frac{1}{n}\sum _{i=1}^{n}{h_{i}} \end{aligned}$$
(2)

here, n denotes the number of neurons at layer h. Next, we compute the standard deviation \(\sigma \) of the hidden activations as:

$$\begin{aligned} \sigma = \sqrt{\frac{1}{n}\sum _{i=1}^{n}{\left( h_{i} - \mu \right) ^{2}}} \end{aligned}$$
(3)

The next step is to normalize the hidden activations of the layer using the mean and standard deviation from Eqs. (2) and (3) as follows:

$$\begin{aligned} {{\mathcal {H}}}_{\textrm{norm}} = \frac{{h_{i}} - {\mu }}{{\sigma } + {\varepsilon }} \end{aligned}$$
(4)

The parameter “\(\varepsilon \)” is a small positive constant introduced to prevent division by zero when the standard deviation (\(\sigma \)) is zero or very close to zero. Without “\(\varepsilon \)”, the denominator could become zero, which would make the normalization undefined or numerically unstable. The second and final step of the BN process is re-scaling and offsetting the normalized values. At this stage, two new components are added to the process. These components (\(\gamma \) and \(\beta \)) are vector parameters used to re-scale and shift, respectively, the vector of values from the preceding operations. The re-scale and offset operation is given as:

$$\begin{aligned} h_{i} = \left( \gamma \times {{\mathcal {H}}}_{\textrm{norm}} \right) + \beta \end{aligned}$$
(5)

where \(\gamma \) and \(\beta \) are learnable parameters. During training, the neural network makes sure that the best values of these parameters are chosen to obtain an accurate normalization of each mini-batch. In this article, BN is applied before convolution layers and recurrent layers.
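Eqs. (2) to (5) can be traced step by step with a short sketch. It is an illustration only: \(\gamma \) and \(\beta \) are treated as scalars for brevity (the text describes them as learnable vectors), and the function name `batch_norm` and the default \(\varepsilon \) value are assumptions.

```python
import math


def batch_norm(h, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize activations h (Eqs. 2-4), then re-scale and shift (Eq. 5)."""
    n = len(h)
    mu = sum(h) / n                                        # Eq. (2)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / n)   # Eq. (3)
    h_norm = [(x - mu) / (sigma + eps) for x in h]         # Eq. (4)
    return [gamma * x + beta for x in h_norm]              # Eq. (5)


activations = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(activations)

# With identical activations sigma is 0; eps keeps the division defined.
constant = batch_norm([3.0, 3.0, 3.0])
```

The constant-input case shows the role of \(\varepsilon \): the numerator is zero and the denominator is \(\varepsilon \), so the output is zero instead of an undefined value.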

The CNN unit

CNNs are deep learning algorithms well known for their outstanding performance on audio and image inputs. A typical CNN model has three types of layers: a convolution layer, a pooling layer, and a fully connected layer. The convolution layers are the main building units of the CNN model, where most of the computations occur. The pooling (down-sampling) layer reduces the size of the feature maps, thereby lowering the dimensionality. The final, fully connected layer is responsible for performing classification tasks based on the extracted features. In this study, the convolutional layer generates a feature map by extracting spatial features from the input data. Suppose we have a data input D such that \(D=\{d_1,d_2,\ldots d_{n}\}\), a weight parameter \(\mathcal {W}\), and a bias \(\delta \). The convolution operation in layer k is given as:

$$\begin{aligned} D_{k} = F \left( {\mathcal {W}}_{k} \circledast D_{k-1} + \delta _{k} \right) \end{aligned}$$
(6)

where \(D_{k-1}\) denotes the input to the kth convolutional layer, \(D_k\) denotes the output of the kth layer, \(\circledast \) is the convolution operation, and F is the activation function. An activation function is typically used after the convolution procedure to accentuate variations between features obtained by the convolution. The activation function used in this article is the Rectified Linear Unit (ReLU). Both the ReLU function and its derivative are monotonic. The function returns 0 for any negative input and returns any positive input x unchanged, so its output ranges from 0 to infinity. The ReLU activation function transforms its input using Eq. (7), which facilitates faster model training and better accuracy.

$$\begin{aligned} F(x) = \max (0, x) = {\left\{ \begin{array}{ll} 0, &{} \quad \text {for all } x \le 0 \\ x, &{} \quad \text {for all } x > 0 \end{array}\right. } \end{aligned}$$
(7)
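A small sketch can make Eqs. (6) and (7) concrete. It applies a single filter with "valid" sliding-window cross-correlation (the form of convolution commonly used in deep learning frameworks) followed by ReLU. The 3 \(\times \) 3 input, the 2 \(\times \) 2 filter, and the function names are hypothetical and are not taken from the study's configuration.

```python
def relu(x):
    """Eq. (7): pass positive values through, clamp the rest to 0."""
    return max(0.0, x)


def conv2d_relu(image, kernel, bias=0.0):
    """Valid 2D cross-correlation plus bias, followed by ReLU (Eq. 6)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[relu(bias + sum(kernel[a][b] * image[i + a][j + b]
                             for a in range(kh) for b in range(kw)))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, -1.0]]            # hypothetical 2x2 filter

# Every window gives a response of -4, so ReLU clamps the whole map to 0.
feature_map = conv2d_relu(image, kernel)

# A bias of 5 lifts each response to 1, which ReLU passes through.
shifted_map = conv2d_relu(image, kernel, bias=5.0)
```

The two calls show how the bias term \(\delta \) in Eq. (6) interacts with the activation: the same filter response is either suppressed or kept depending on its sign after the bias is added.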

The GRU unit

The Gated Recurrent Unit (GRU) processes sequences in the same recurrent manner as a standard RNN, but differs in the internal operations and gates of each unit. To overcome the challenges that the standard RNN presents, the GRU combines two gating mechanisms called the update gate and the reset gate. The update gate decides how much historical information must be passed through to the subsequent state. It is a useful unit because it lets the model carry forward prior information and mitigates the vanishing gradient problem. The update gate (\(z_{t}\)), as shown in Fig. 1, is calculated using Eq. (8):

$$\begin{aligned} z_{t} = \sigma \left( {{\mathcal {W}}}_{z} * [h_{t-1}, \quad x_{t}] + b_{z} \right) \end{aligned}$$
(8)

where \(x_t\) represents the input vector at timestep t, \({{\mathcal {W}}}_z\) denotes the weight matrix, \(h_{t-1}\) represents the previous hidden state, and \(b_z\) is the bias. The model’s reset gate determines how much prior information should be disregarded, that is, whether the previous hidden state is significant or not. Similar to the update gate, the reset gate (\(r_t\)) at timestep t, with weight matrix \({{\mathcal {W}}}_r\) and bias \(b_r\), is expressed as:

$$\begin{aligned} r_{t} = \sigma \left( {{\mathcal {W}}}_{r} * [h_{t-1}, \quad x_{t}] + b_{r} \right) \end{aligned}$$
(9)

This article uses the GRU component to extract temporal features from the input data via a recurrent procedure. Like the convolutional layer, the GRU requires activation functions; these are implemented using the tanh and hard sigmoid functions.
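
Equations (8) and (9) share the same form and differ only in their parameters, which the sketch below illustrates. It uses the standard logistic sigmoid rather than the hard sigmoid for simplicity, and the hidden size, input size, weights, and biases are all hypothetical values chosen for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Standard logistic function; squashes any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate(weights: List[List[float]], h_prev: List[float],
         x_t: List[float], bias: List[float]) -> List[float]:
    """Compute sigma(W * [h_{t-1}, x_t] + b), the shared form of Eqs. (8) and (9)."""
    concat = h_prev + x_t  # [h_{t-1}, x_t] as one vector
    return [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + b)
        for row, b in zip(weights, bias)
    ]


# Hypothetical sizes: hidden state of 2 units, input of 1 feature.
h_prev = [1.0, 1.0]
x_t = [2.0]
w_z = [[0.5, -0.5, 1.0], [0.0, 0.0, 0.0]]  # update-gate weights
w_r = [[0.0, 0.0, -1.0], [1.0, 1.0, 0.0]]  # reset-gate weights
z_t = gate(w_z, h_prev, x_t, bias=[0.0, 0.0])
r_t = gate(w_r, h_prev, x_t, bias=[0.0, 0.0])
print(z_t, r_t)
```

Each gate value lies strictly between 0 and 1, so it can be read as the fraction of the previous hidden state to keep (update gate) or to expose when forming the new candidate state (reset gate).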

Feature integration unit

The final phase of the proposed approach is the integration of the features. The feature integration unit fuses the spatial and temporal features extracted by the CNN and GRU units. We use a multi-layer perceptron (MLP) to integrate all extracted features for final classification and prediction. The MLP learns a non-linear mapping between inputs and outputs using an input layer, an output layer, and one or more hidden layers with several neurons stacked together. We employ back-propagation to iteratively adjust the network weights and reduce the network’s cost function.
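
As a rough illustration of the fusion step, the sketch below concatenates a spatial and a temporal feature vector and passes the result through a one-hidden-layer MLP with a softmax output. Concatenation as the fusion rule is an assumption of this sketch, and the feature values, layer sizes, and weights are hypothetical; the forward pass only is shown, without back-propagation.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(inputs: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax (subtracts the maximum logit first)."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_classify(spatial: Vector, temporal: Vector,
                      w_hidden: Matrix, b_hidden: Vector,
                      w_out: Matrix, b_out: Vector) -> Vector:
    """Concatenate both feature vectors (assumed fusion rule), then apply the MLP."""
    fused = spatial + temporal
    hidden = [max(0.0, v) for v in dense(fused, w_hidden, b_hidden)]  # ReLU
    return softmax(dense(hidden, w_out, b_out))


# Hypothetical features: two from the CNN branch, one from the GRU branch.
probs = fuse_and_classify(
    spatial=[0.2, 0.8], temporal=[0.5],
    w_hidden=[[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]], b_hidden=[0.0, 0.0],
    w_out=[[1.0, 0.0], [0.0, 1.0]], b_out=[0.0, 0.0],
)
print(probs)
```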

Experiments and results

This section first presents the environment settings and hyper-parameter tuning used to implement the model. Next, we explain the metrics used to evaluate the model’s efficiency and discuss the experimental results. Finally, we perform a comparative analysis to validate the model’s performance against other existing methods.

Environment settings and hyper-parameter tuning

We implemented the proposed CNN-GRU-FF NIDS model in Python and conducted experiments on the NSL-KDD and UNSW-NB15 datasets. All experiments were carried out on a personal computer (PC) running a 64-bit Windows operating system (OS). Table 3 lists the specific simulation environment parameters used in this study.

Table 3 Implementation environment settings
Table 4 Experiment hyper-parameter settings

Hyper-parameter tuning

In every deep learning model, the training phase is perhaps the most integral step in building the model. To obtain a robust predictive model, it is essential to choose the hyper-parameters (the parameters whose values guide the learning process) so that training is time- and cost-effective. For the CNN component of our proposed NIDS, we set the number of convolution kernels to 64, with 10 and 5 neural units in the output layers for the UNSW-NB15 and NSL-KDD data, respectively (see Table 4). We use “Adam” as the optimization algorithm with a learning rate of \(8 \times 10^{-3}\) and a “batch_size” of 500. In the GRU component, we chose 75 neurons for the hidden layers. The hidden layers of the MLP (i.e., the feature fusion component) have 128 neural units. The parameter configurations for the output layer, optimization algorithm, and loss function of the GRU and MLP are the same as for the CNN component. Except for the output layer, which uses Softmax activation, we use the ReLU function to activate all other layers in the model. As shown in Eq. (10), the softmax layer calculates the probability distribution over the class labels.

$$\begin{aligned} F_{i}(x) = \frac{e^{x_{i}}}{\sum _{j=1}^{J}{e^{x_{j}}}} \quad \quad \text {for }i = 1,\ldots , J \end{aligned}$$
(10)

where \(x_i\) represents the ith softmax layer input and J is the total class count. Because both datasets are class-imbalanced, we train the proposed model with the focal loss introduced by Lin et al. [54] instead of the cross-entropy loss. Given the apparent class imbalance in the training set, it is essential to prevent the loss function from optimizing one class while curtailing others. We therefore select a loss function that allows the model to focus on smaller classes during training. Unlike the cross-entropy loss, the focal loss alleviates class imbalance by reducing the contribution of easy-to-train examples to the overall loss when such examples are numerous, so that the smaller classes also receive attention. In this study, we define the focal loss function as follows:

$$\begin{aligned} L_{\textrm{focal}} = \left( \frac{C_{s}}{T_{s}} - 1 \right) {\left( 1 - B_{n}\right) }^{m}\log {\left( B_{n}\right) } \end{aligned}$$
(11)

where \(C_s\) is the number of training examples belonging to class n and \(T_s\) denotes the total number of training examples in the dataset. \((1 - B_n )^m\) is the modulating term that reduces the influence of easy-to-train samples on the loss, and \(B_n \in [0, 1]\) denotes the predicted probability of class n. As \(B_n\) approaches 1, \((1 - B_n )^m\) tends to 0, thereby reducing the weight of the easy-to-train samples in the loss. Additionally, \((\frac{C_s}{T_s} - 1)\) is the weight-adjusting term used to scale the smaller classes separately. Finally, we apply dropout with a rate of 0.5 to the pooling layers to reduce the risk of over-fitting.
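
The behaviour of Eq. (11) can be checked numerically with the short sketch below. The exponent m is not stated in the text, so m = 2 is an assumed value here, and the class counts and probabilities are hypothetical. The small clipping constant is also an addition of this sketch, used only to avoid taking the logarithm of zero.

```python
import math


def focal_loss(class_count: int, total_count: int, prob: float, m: float = 2.0) -> float:
    """Eq. (11) for a single sample of class n.

    class_count -- C_s, training examples in the sample's class
    total_count -- T_s, training examples in the whole dataset
    prob        -- B_n, predicted probability of the true class
    m           -- focusing exponent (2.0 is an assumed value)
    """
    prob = min(max(prob, 1e-12), 1.0)          # guard against log(0)
    weight = class_count / total_count - 1.0   # negative, so the loss is non-negative
    return weight * (1.0 - prob) ** m * math.log(prob)


# Hypothetical dataset of 1000 records with a minority class of 100 records.
easy_minority = focal_loss(100, 1000, prob=0.95)
hard_minority = focal_loss(100, 1000, prob=0.30)
hard_majority = focal_loss(900, 1000, prob=0.30)
print(easy_minority, hard_minority, hard_majority)
```

A confidently classified sample contributes almost nothing to the loss, while a poorly classified sample from the small class contributes more than an equally poorly classified sample from the large class, which is the effect described above.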

Evaluation metrics

The metrics used to investigate the efficiency of the proposed model in predicting modern attack types include Accuracy (Acc), Detection Rate (DR), False Alarm Rate (FAR), F1-Score, and Precision. The most important is the model’s detection accuracy, which describes the ratio of correctly classified traffic records to the overall traffic records. Each of these metrics is derived using the true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) from the classification results. TP represents the number of abnormal records correctly detected as abnormal, whereas TN denotes the number of normal records correctly identified as normal. Similarly, FP indicates the number of normal records wrongly detected as abnormal, while FN denotes the number of abnormal records wrongly identified as normal. With these indicators as the basic units, each evaluation metric is derived as presented in Eqs. (12)–(16), where recall in Eq. (16) is the detection rate of Eq. (13).

$$\begin{aligned} \text {Acc} = \frac{\text {TN} + \text {TP}}{\text {TN} + \text {TP} + \text {FN} + \text {FP}} \end{aligned}$$
(12)
$$\begin{aligned} \text {DR} = \frac{\text {TP}}{\text {TP} + \text {FN}} \end{aligned}$$
(13)
$$\begin{aligned} \text {FAR} = \frac{\text {FP}}{\text {TN} + \text {FP}} \end{aligned}$$
(14)
$$\begin{aligned} \text {Precision} = \frac{\text {TP}}{\text {FP} + \text {TP}} \end{aligned}$$
(15)
$$\begin{aligned} \text {F1-Score} = 2 \left[ \frac{\text {recall} \times \text {precision}}{\text {recall} + \text {precision}} \right] \end{aligned}$$
(16)
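
Equations (12)–(16) translate directly into code. The sketch below computes all five metrics from binary confusion-matrix counts; the counts in the example are hypothetical and are not taken from the experiments.

```python
from typing import Dict


def ids_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Compute Acc, DR, FAR, Precision, and F1-Score as in Eqs. (12)-(16)."""
    acc = (tn + tp) / (tn + tp + fn + fp)
    dr = tp / (tp + fn)                 # detection rate, i.e. recall
    far = fp / (tn + fp)
    precision = tp / (fp + tp)
    f1 = 2 * (dr * precision) / (dr + precision)
    return {"Acc": acc, "DR": dr, "FAR": far, "Precision": precision, "F1": f1}


# Hypothetical counts: 100 attack records and 100 normal records.
metrics = ids_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```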
Fig. 2 A comparison of learning performance of CNN-GRU-FF on UNSW-NB15 and NSL-KDD

Results and discussion

This section presents the experimental findings of the proposed CNN-GRU-FF approach on the two datasets (i.e., NSL-KDD and UNSW-NB15). To analyze the performance of our proposed approach, we use K-fold cross-validation with K set to 10 and train the model for 100 epochs. We investigate the training and testing losses of the proposed NIDS approach to ensure that the model does not suffer from vanishing gradients, degradation, or overfitting. Figure 2 depicts the training and testing losses of the CNN-GRU-FF model on the two datasets. As illustrated in Fig. 2a, the training loss decreases gradually as the number of epochs increases until it stabilizes at epochs 40 and 55 for the NSL-KDD and UNSW-NB15 datasets, respectively. Similarly, as shown in Fig. 2b, the test loss was relatively high at the start and continued to decrease as training proceeded until it reached a stable point for both datasets, indicating the model’s ability to converge and generalize well.
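
For readers unfamiliar with K-fold cross-validation, the following minimal sketch shows how record indices can be partitioned into K = 10 folds. It deliberately omits shuffling and stratification, which a practical setup for imbalanced data would likely add; whether the experiments used either is not stated. The dataset size in the example is hypothetical.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds.

    The first n_samples % k folds receive one extra sample, so every
    record is used for testing exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + base + (1 if fold < extra else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


# Hypothetical dataset of 25 records split with K = 10.
folds = list(k_fold_indices(25, k=10))
for fold_id, (train_idx, test_idx) in enumerate(folds):
    print(fold_id, len(train_idx), test_idx)
```

In each round the model would be trained on the training indices and evaluated on the held-out fold, with the reported score averaged over the ten rounds.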

As presented in section “NSL-KDD dataset”, the NSL-KDD test set contains a total of 22,544 traffic samples, of which 9711 are normal records and 12,833 are intrusive records. Tables 5 and 6 present the number of attack records correctly detected (True Positives) and the number of False Alarms (False Positives) raised by our proposed model on the two datasets. As shown in Table 5, CNN-GRU-FF correctly detects 12,777 of the 12,833 attack samples and 9695 of the 9711 normal records for the NSL-KDD dataset, producing an accuracy of 99.86%, a detection rate of 99.68%, and an F-Score of 99.68%. The model misclassified 72 records, resulting in a low False Alarm Rate (FAR) of 0.10%, as shown in Table 7. Similarly, the UNSW-NB15 test set contains 82,332 traffic records, of which 45,332 are attack records and 37,000 are normal. From Table 6, the proposed model correctly detects 44,113 attack records and 36,755 normal records, obtaining an accuracy of 99.54%, a detection rate of 98.22%, and an F-Score of 98.28%. A total of 1464 records were misclassified, producing a false alarm rate of 0.17%, as presented in Table 8.

Table 5 The model’s confusion matrix using NSL-KDD test set
Table 6 The model’s confusion matrix using UNSW-NB15 test set
Table 7 The model’s performance using NSL-KDD test set (%)
Table 8 The model’s performance using UNSW-NB15 test set (%)

Although the proposed approach performed well on the NSL-KDD dataset, it struggled to detect some specific attack types (i.e., Analysis, Backdoor, Shellcode, and Worms) in the UNSW-NB15 dataset, as shown in Table 8. The slightly lower results on these attack types are due to the smaller number of training and testing samples available for these classes; the model may not have enough data to effectively learn the unique characteristics of these attacks. Also, attacks such as Backdoor, Shellcode, and Worms can be highly complex and exhibit subtle variations, making them more challenging to detect accurately. Nevertheless, the model still performed well overall on both datasets.

Furthermore, the receiver operating characteristic (ROC) curve plays a crucial role in assessing the performance of our CNN-GRU-FF model. It plots the true-positive rate (TPR) against the false-positive rate (FPR). By analyzing the ROC curve, we can determine how effectively the model distinguishes between different classes. In this study, we employed ROC curves to analyze the multi-class outputs for both the NSL-KDD and UNSW-NB15 datasets. Figures 3 and 4 present the ROC curves for the different attack categories in the NSL-KDD and UNSW-NB15 datasets, respectively. To further assess the model’s efficacy, we computed the area under the curve (AUC) values for the various attacks. As depicted in Figs. 3 and 4, all AUC values exceeded 0.98, signifying strong performance across all attack classes. These AUC values serve as a reliable indicator of the model’s ability to discriminate between diverse attack types.
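
One common way to obtain a per-class AUC in a multi-class setting is the one-vs-rest scheme, sketched below. Whether the study used one-vs-rest is an assumption of this sketch. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive record scores higher than a randomly chosen negative one), and the labels and scores are hypothetical.

```python
from typing import Sequence


def binary_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(y_true: Sequence[str], class_scores: Sequence[float], target: str) -> float:
    """Treat `target` as the positive class and every other class as negative."""
    labels = [1 if y == target else 0 for y in y_true]
    return binary_auc(labels, class_scores)


# Hypothetical labels and predicted probabilities for the class "DoS".
y_true = ["DoS", "Normal", "Probe", "DoS", "Normal"]
dos_scores = [0.90, 0.20, 0.40, 0.35, 0.10]
dos_auc = one_vs_rest_auc(y_true, dos_scores, "DoS")
print(round(dos_auc, 4))
```

The pairwise loop is quadratic in the number of records, which is fine for illustration; a sorted-rank implementation would be preferable for test sets of the size used here.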

Fig. 3 ROC curve obtained by the proposed approach on NSL-KDD

Fig. 4 ROC curve obtained by the proposed approach on UNSW-NB15

Table 9 Best hyperparameter settings for different algorithms

Comparative analysis

In this section, we present a comparison between the proposed model and traditional machine learning methods, as well as other existing feature extraction-based intrusion detection methods. In order to conduct a thorough assessment of the multi-classification task encompassing all classes, the evaluation metrics are computed using both Macro-Averages and Weighted-Averages, as shown in Tables 7 and 8. The Macro-Average represents the arithmetic mean and is determined by Eq. (17):

$$\begin{aligned} {\text {Macro average}} = \frac{1}{n}\sum _{i=1}^{n}{M_{i}} \end{aligned}$$
(17)

where n is the number of classes and \(M_{i}\) is the evaluation indicator under consideration for class i.

The Weighted-Average, on the other hand, uses the number of samples in each class as its weight to capture the detection performance of classes with varying sample sizes. In other words, it takes into consideration the significance of each class in relation to its representation in the dataset. The weighted averages give greater importance to classes with a substantial number of samples, which have a larger impact on the overall model performance, and therefore offer a representative assessment of the model’s overall detection effectiveness across the entire multi-class classification task. For any given evaluation indicator, such as accuracy, detection rate, precision, specificity, F-Score, and FAR, we compute the weighted average as shown in Eq. (18):

$$\begin{aligned} {\text {Weighted average}} = \frac{\sum _{i=1}^{n}{M_{i}}\cdot {w_{i}}}{\sum _{i=1}^{n}{w_{i}}} \end{aligned}$$
(18)

where \(w_{i}\) denotes the weight of a given class i.
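
The difference between Eqs. (17) and (18) is easiest to see on a small example. In the sketch below, the per-class detection rates and sample counts are hypothetical, and the weights are taken to be the per-class sample counts as described above.

```python
from typing import Sequence


def macro_average(values: Sequence[float]) -> float:
    """Eq. (17): unweighted mean of the per-class metric values."""
    return sum(values) / len(values)


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Eq. (18): mean of the per-class metric values weighted by w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(m * w for m, w in zip(values, weights)) / sum(weights)


# Hypothetical per-class detection rates and sample counts for three classes.
detection_rates = [0.99, 0.95, 0.60]
sample_counts = [9000, 900, 100]
macro = macro_average(detection_rates)
weighted = weighted_average(detection_rates, sample_counts)
print(f"macro={macro:.4f} weighted={weighted:.4f}")
```

The macro average is pulled down by the poorly detected small class, whereas the weighted average stays close to the score of the dominant class, which is why reporting both gives a fuller picture.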

Comparison with traditional ML methods

In this comparative analysis of performance results obtained on the NSL-KDD and UNSW-NB15 datasets, six traditional machine learning algorithms were evaluated against the proposed method: Support Vector Machine (SVM), Random Forest (RF), k-Nearest Neighbors (KNN), Naïve Bayes (NB), AdaBoost, and XGBoost. We implemented these methods with the same environment settings as the proposed approach (see Table 3). The specific hyper-parameter settings for each of these ML methods are presented in Table 9. Since the classes are imbalanced, it is important to use performance metrics that account for this asymmetry. Therefore, we also calculated the balanced accuracy for each method, as shown in Tables 10 and 11. Balanced accuracy (Balanced Acc) is the arithmetic mean of sensitivity and specificity, which provides a fair evaluation given the class distribution.
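
The short sketch below contrasts plain accuracy with balanced accuracy for the binary case, using hypothetical counts for a heavily imbalanced test set.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all records that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Arithmetic mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced test set: 10 attack records and 990 normal records.
counts = dict(tp=5, tn=980, fp=10, fn=5)
plain = accuracy(**counts)
balanced = balanced_accuracy(**counts)
print(f"accuracy={plain:.4f} balanced={balanced:.4f}")
```

Plain accuracy looks high because the normal class dominates, while balanced accuracy exposes that only half of the attack records were detected.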

Table 10 Performance of the proposed method versus traditional ML methods using NSL-KDD
Table 11 Performance of the proposed method versus traditional ML methods using UNSW-NB15
Fig. 5 Comparison of recall, precision, and F1-score performance of CNN-GRU-FF versus ML methods using NSL-KDD

On the NSL-KDD dataset, XGBoost and the proposed method outperformed all other methods, with accuracies of 91.04% and 99.86%, respectively. Among the traditional ML methods, XGBoost achieved the highest detection rate (90.38%) and the lowest false alarm rate (3.95%), while the proposed method attained a higher detection rate of 99.68% and a lower false alarm rate of 0.10%. Moreover, our approach also produced the highest balanced accuracy (99.88%) of all the methods. On the UNSW-NB15 dataset, the proposed method continued to perform well, achieving an accuracy of 99.54%, a balanced accuracy of 99.69%, and a detection rate of 98.22%, with a very low false alarm rate of 0.17%. Random Forest also performed consistently on both datasets, with an accuracy of 85.77% on UNSW-NB15 and 80.67% on NSL-KDD, maintaining relatively low false alarm rates in both cases.

Fig. 6 Comparison of recall, precision, and F1-score performance of CNN-GRU-FF versus ML methods using UNSW-NB15

The comparison indicates that the proposed method stands out as the top performer on both datasets, exhibiting good accuracy and detection rates while minimizing false alarms. These results highlight the method’s effectiveness in intrusion detection tasks. Moreover, XGBoost showed remarkable performance on the NSL-KDD dataset but had a relatively higher false alarm rate on UNSW-NB15. On the other hand, Random Forest consistently performed well on both datasets, demonstrating its suitability for intrusion detection across different data complexities.

Furthermore, we conducted a comparison between the proposed method and the ML-based approaches to analyze their recall, precision, and F1-Score. The results of this comparison are illustrated in Figs. 5 and 6. Notably, the graphical representations clearly highlight that our proposed approach outperforms all the ML-based methods on both datasets, showcasing its capability to accurately classify both positive and negative instances.

Comparison with other IDS methods

This subsection presents a comparative analysis of our proposed method with other published intrusion detection methods on the NSL-KDD and UNSW-NB15 datasets. The methods were assessed based on Accuracy (Acc), Detection Rate (DR), and False Alarm Rate (FAR), and their respective publication years were noted to provide context on the recency of the research. On the NSL-KDD dataset, various methods performed strongly. The proposed method outperformed all others with the highest accuracy of 99.86%, a high detection rate of 99.68%, and a low false alarm rate of 0.10%, as shown in Table 12. CNN-BiLSTM, LuNet, and ABC-AdaBoost also demonstrated competitive results, achieving accuracies of 99.22%, 99.05%, and 98.90%, respectively. CNN-GRU and its attention-enhanced variant, published in different years, achieved accuracies of 99.69% and 99.26%, respectively.

Similarly, on the UNSW-NB15 dataset, the proposed method exhibited superior performance, attaining an accuracy of 99.54%, a detection rate of 98.22%, and a remarkably low false alarm rate of 0.17%, as reported in Table 13. CNN-LSTM, OC-Bi-GRU, and S-ResNet also showed noteworthy performances with an accuracy of 98.92%, 98.92%, and 95.94% respectively. However, some methods struggled with higher false alarm rates on this more challenging dataset. Comparing the two datasets, the proposed method consistently outperforms all other methods in terms of accuracy and detection rate. It demonstrates superior intrusion detection capabilities and highlights the effectiveness of the proposed approach in capturing both normal and malicious activities with minimal false alarms. Regarding comparative trends, it is evident that CNN-based architectures, including CNN-BiLSTM, CNN-GRU, and our proposed method, consistently emerge as prominent contenders among the top-performing techniques on both datasets. Although there may be variations in specific simulation environment and parameter settings between these models, the prevalence of CNN-based models attests to their effectiveness in intrusion detection tasks for diverse network datasets.

Table 12 Performance of the proposed method versus recently published methods for NSL-KDD (N/R not reported, ***current)
Table 13 Performance of the proposed method versus recently published methods for UNSW-NB15 (N/R not reported, ***current)

Additionally, we compared the recall, precision, and F1-Score of our proposed method with those of recently published techniques. This comparison is visualized in Figs. 7 and 8. The findings demonstrate the superiority of our proposed approach across both datasets, except for one instance: CNN-GRU achieved a slightly higher F1-Score (99.70%) on the NSL-KDD dataset compared to our proposed approach (99.68%). However, the strength of our approach becomes evident when analyzing the results on the UNSW-NB15 dataset. It significantly outperformed CNN-GRU and all other examined methods, showcasing its robustness in handling diverse datasets.

Fig. 7 Comparison of recall, precision, and F1-score performance of CNN-GRU-FF versus published methods using NSL-KDD

Fig. 8 Comparison of recall, precision, and F1-score performance of CNN-GRU-FF versus published methods using UNSW-NB15

Table 14 Time complexities (i.e. in seconds) of the various methods used in this study

Time complexity analysis

In this section, we provide an analysis of the time complexity and execution time in seconds (s) of the methods employed in our study. Understanding the time complexity of these techniques is crucial, as it sheds light on their computational efficiency and scalability. By examining the execution time, we gain insight into how these methods perform under different input sizes. Time complexity is commonly expressed in Big-O notation, which represents the worst-case time complexity. Table 14 presents the time complexity of each ML algorithm. Here, n denotes the number of training samples, and k represents the number of features. For SVM, \(n_{\textrm{svm}}\) stands for the number of support vectors. For AdaBoost, \(n_{est}\) corresponds to the number of estimators, and for RF, \(n_{t}\) denotes the number of trees. For XGBoost, t denotes the number of trees, d is the height of the trees, and x denotes the number of instances in the dataset.

For the proposed approach, however, n refers to the length of the input, d is the dimension of the input, k denotes the kernel size of the convolution, and N represents the number of nodes in the feature fusion layer. Moreover, the table displays the inference time per instance in microseconds (\({\upmu }\)s) for both datasets. This metric is the time required to make predictions for all instances in the testing dataset, divided by the total number of instances.
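
The inference time per instance described above can be measured along the following lines. This is a sketch only: `dummy_predict` is a hypothetical stand-in for a trained model's batch prediction function, and the test records are synthetic.

```python
import time
from typing import Callable, List, Sequence


def inference_time_per_instance_us(predict: Callable[[Sequence], Sequence],
                                   test_set: Sequence) -> float:
    """Time one prediction pass over the whole test set and divide by its size.

    Returns the average inference time per instance in microseconds.
    """
    start = time.perf_counter()
    predictions = predict(test_set)
    elapsed_s = time.perf_counter() - start
    if len(predictions) != len(test_set):
        raise ValueError("predict must return one prediction per instance")
    return elapsed_s * 1e6 / len(test_set)


def dummy_predict(batch: Sequence[List[float]]) -> List[int]:
    """Hypothetical model: flags a record when its first feature exceeds 0.5."""
    return [1 if record[0] > 0.5 else 0 for record in batch]


# Synthetic test set of 1000 records with two features each.
test_records = [[i / 1000.0, 0.0] for i in range(1000)]
per_instance_us = inference_time_per_instance_us(dummy_predict, test_records)
print(f"{per_instance_us:.3f} microseconds per instance")
```

Because a single pass is sensitive to system load, repeating the measurement several times and reporting the mean or median would give a more stable figure.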

The results presented in Table 14 demonstrate that our approach requires slightly more training and inference time than the other ML methods. This is primarily because our method extracts not only spatial but also temporal features, enhancing its capability to capture intricate patterns and dynamics within the data. Despite the marginally increased training time, our method delivers significantly better performance. This trade-off, in which a small additional time investment yields substantial gains in detection accuracy, highlights the robustness and effectiveness of our model in handling data imbalance in intrusion detection tasks.

Our proposed method was not compared with other published approaches in terms of time, because those works provide limited information about their time complexities and execution times. The absence of such details prevented a comprehensive comparison. Nevertheless, the outcomes of our method stand on their own merit, indicating its efficacy and potential as an intrusion detection system.

Conclusion and future directions

This paper proposes a NIDS based on a two-layer feature extraction and feature fusion mechanism. The proposed model, CNN-GRU-FF, combines CNN and GRU techniques for feature extraction. The former extracts spatial features, while the latter extracts temporal features. The model then adopts an MLP to fuse the features from the two layers (i.e., the CNN and GRU layers) for classification using a softmax layer. We implemented and evaluated the model’s performance using two well-known intrusion detection datasets (i.e., the NSL-KDD and UNSW-NB15 datasets). However, these datasets contain imbalanced classes in their training sets, which causes most DL models to favor dominant classes over minor ones. To handle the issue of imbalanced classes, we used the focal loss function, which allows the model to pay close attention to minor classes during training. Compared with seven other baseline algorithms on both the NSL-KDD and UNSW-NB15 datasets, the proposed CNN-GRU-FF model obtains the best performance in Acc, DR, Precision, F-Score, and FAR, as shown by the experimental results.

In future work, the feature fusion process and feature augmentation must be further refined to increase the model’s operating efficiency. Furthermore, we intend to explore the standards of data construction for various data dimensions for network intrusion detection. Finally, we will investigate further refinements to the classification model, such as feature selection techniques, to obtain a more robust NIDS model.