1 Introduction

With the rapid development and expansion of the Internet and the Internet of Things (IoT), the demand for security against cyber-attacks has risen sharply in recent years. As networks grow, so does the volume of information flowing through them, which gives attackers with malicious intentions a multitude of opportunities to act [1]. An intrusion can be described as an action that violates the CIA (Confidentiality, Integrity, and Availability) triad of a system by maliciously bypassing the security measures put in place. Intrusions can be known or unknown; the latter are referred to as zero-day attacks [2]. A monitoring system that continuously scans events within a system, looks for inconsistencies, and prevents such events from entering the system is generally referred to as an Intrusion Detection System (IDS) [3].

There are two types of intrusion detection: signature-based detection and anomaly-based detection. A Signature-based Network Intrusion Detection System (SNIDS) maintains a database of attack signatures, which are matched against real-time network traffic to detect intrusions. This database is usually updated for better results. SNIDS are therefore also referred to as real-time intrusion detection systems. An anomaly-based intrusion detection process, in contrast, relies on classifiers that separate benign network and system behaviours from unknown or unusual ones by studying the normal behaviour of the system and its network traffic [4]. SNIDS analyse all network traffic and detect attacks such as Denial of Service (DoS), whereas Anomaly-based Intrusion Detection Systems (AIDS) work independently on host devices or workstations, monitoring packet contents and system log files to identify any abnormality or inconsistency in system activities [5]. An AIDS learns the normal network behaviour and detects anomalies by observing the deviations between real-time network activities and that normal behaviour pattern [6].
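
As an illustration of deviation-based detection, the following minimal Python sketch models normal behaviour as the mean and standard deviation of a single traffic feature and flags values that lie far from it. The feature (packets per second), the sample values, and the threshold of three standard deviations are assumptions made for illustration and are not taken from the cited works.

```python
from statistics import mean, stdev


def fit_baseline(normal_values):
    """Summarize normal behaviour as a (mean, standard deviation) pair."""
    return mean(normal_values), stdev(normal_values)


def is_anomalous(value, baseline, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    mu, sigma = baseline
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical packets-per-second counts observed during normal operation.
normal_rates = [98, 102, 101, 97, 100, 103, 99, 100]
baseline = fit_baseline(normal_rates)

print(is_anomalous(101, baseline))  # False: close to the normal pattern
print(is_anomalous(950, baseline))  # True: large deviation from the normal pattern
```

A real AIDS learns a far richer model of normal behaviour, but the decision rule has the same shape: measure the deviation from the learned pattern and compare it with a threshold.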

An attack of the same origin can be identified by its signature, which can then be used as a reference for similar attacks. A signature-based IDS (SIDS) generally displays high accuracy but fails to detect zero-day attacks. Because a SIDS relies on a signature database, it can also be circumvented by polymorphic malware: the malware evolves over time, making it difficult to match against the signatures in the database [7]. Signatures are strings, patterns, or rules that show similarities to known attacks. A SIDS compares these signatures with real-time network traffic to identify intrusions, so a database of pre-existing signatures can mitigate known attacks [8]. For an anomaly-based IDS, in turn, it is difficult to maintain a low false positive rate. A lack of behavioural studies, together with improper algorithms and processing schemes in IDSs, contributes to this challenge [9]. An anomaly-based IDS helps detect unknown attacks, but its results are considerably less reliable than those of a signature-based IDS.
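
The matching step, and the way a polymorphic variant slips past it, can be sketched as follows. The signature repository, the signature names, and the payloads are hypothetical examples chosen for illustration; they do not come from any real rule set.

```python
# Hypothetical signature repository: signature name -> byte pattern of a known attack.
SIGNATURES = {
    "sql_injection": b"' OR '1'='1",
    "path_traversal": b"../../etc/passwd",
}


def match_signatures(payload, signatures=SIGNATURES):
    """Return the names of all signatures whose pattern occurs in the payload."""
    return [name for name, pattern in signatures.items() if pattern in payload]


known = b"GET /index.php?id=1' OR '1'='1 HTTP/1.1"
# Same attack with the spaces replaced by SQL comments (a simple mutation).
mutated = b"GET /index.php?id=1'/**/OR/**/'1'='1 HTTP/1.1"

print(match_signatures(known))    # ['sql_injection']
print(match_signatures(mutated))  # []
```

The mutated request carries the same attack but no longer contains the stored pattern, which is why a signature database alone cannot cover evolving or previously unseen attacks.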

Hybrid intrusion detection methods have been shown to address the shortcomings of the two conventional, individual intrusion detection methods. However, a few areas of the hybrid intrusion detection framework still offer opportunities for improvement. The proposed model incorporates the machine learning and deep learning algorithms that are most suitable for intrusion detection. The C5 classifier has been shown to yield better results than K-Nearest Neighbour (KNN), Random Forest, Naïve Bayes, and Classification And Regression Tree (CART) [10]. Also, Imrana et al. [11] proposed an IDS with a conventional LSTM trained on KDD 99, which yielded better results than J48, Naïve Bayes, Random Forest, Recurrent Neural Network (RNN), Support Vector Machine (SVM), and Standard Template Library (STL). Furthermore, hybrid intrusion detection techniques combine anomaly and misuse detection to detect both known and unknown attacks, generate signatures for zero-day attacks or attacks that are not in the signature repository, and update the IDS repository [12]. Anomaly detection thus compensates for the inability of misuse detection to recognise zero-day attacks. Additionally, detecting known attacks in the first phase reduces the load on the anomaly detection system.

Finally, datasets are key to Machine Learning (ML) in both its training and testing phases. Updating algorithms and models and including current attack features helps ensure the efficacy and reliability of IDSs. Training and testing on datasets helps ML models identify new attacks, and a large dataset helps confirm patterns, sequences, and rules in network behaviour [13]. The KDD 99 and NSL-KDD datasets are used in most works on IDS and Hybrid Intrusion Detection Systems (HIDS). However, these datasets are outdated, as they do not include new attacks and require heavy pre-processing. The proposed model is trained with UNSW-NB15 and ADFA-LD, which include new attack types. The clustering technique supported by UNSW-NB15 can also be used to analyze the similarity of attributes and adjust attribute usage to improve the performance of IDSs [14].

Machine Learning (ML) has proven pivotal in intrusion detection and prevention, as witnessed by its prevalence in cyber-attack detection. ML models are tasked with identifying data patterns or predicting behaviour by processing large-scale datasets collected over an extended period [15]. Most intrusion detection systems in use today rely on ML-based algorithms as their detection strategy [16]. Machine learning can generally be categorized into shallow learning and deep learning. Deep learning models in intrusion detection systems are predominantly neural network models with a high number of hidden layers. Such models are capable of learning highly complex non-linear functions, and their hierarchical layer structure facilitates the learning of useful features from the data. Deep learning has consequently become firmly established in intrusion detection methods [16].

A self-healing component [17] is incorporated into the proposed model. It allows the model to learn signatures from anomalous packets by combining signature-based and anomaly-based IDS models, and it gives the ensemble model a self-learning attribute. UNSW-NB15 and ADFA-LD are the two datasets used to train the models, as both are highly relevant to the current IDS framework.

Aims of this research and contributions

The following is the list of objectives of this research.

  1. The research aims to establish a network intrusion detection system that incorporates some of the best machine learning models in terms of performance.

  2. The proposed model also aims to develop a hierarchical structure of cyclic dataflow that enables the system to become a self-sufficient IDS.

  3. The proposed hybrid intrusion detection system aims to achieve a high detection rate and high accuracy by incorporating a self-learning technique through a signature generator.

  4. Anomalies identified by the anomaly-based IDS are fed into a signature extraction step based on their deviation from normal traffic patterns and their similarity to malicious traffic patterns.

  5. The self-learning framework significantly increases the detection rate of the SIDS in the hybrid architecture by updating the signature repository with unknown anomalous signatures.

  6. The proposed model aims to grow the signature repository over time by collecting attributes and signatures of unknown attacks.

  7. The model encompasses continual learning without the need for human intervention to update the signature repository.

The main contributions of this paper are as follows:

  1. We propose a hybrid intrusion detection system with self-healing attributes, machine-learning models, and a robust architecture.

  2. The model has been trained and tested using relevant datasets pertaining to today’s attacks.

  3. To detect intrusions, we propose a continuous-learning hybrid intrusion detection model in which attack signatures are continuously updated without the need for human intervention.

  4. In real-world network deployments, the self-healing model improves network performance through the continuous extraction of signatures from anomalies detected in the network over time.

  5. Because the architecture of data flow within the proposed model facilitates continuous learning, the proposed system eliminates the need to update training datasets in order to track new attacks in the foreseeable future.

The rest of the paper is organised as follows: Sect. 2 summarises the closely related works, and Sect. 3 presents a background study of the selected systems. Sect. 4 explains the proposed methodology, and Sect. 5 presents the results and benchmarking. Sect. 6 presents the conclusions and suggests the next steps.

2 Related works

Studies on IDS demonstrate variations in the categorization of NIDS based on patterns, rules, statistics, states, and heuristics. Numerous studies have also proposed hybrid NIDS with different techniques, models, and architectures that combine signature-based and anomaly-based detection systems. Moreover, various shallow machine learning models, deep learning models, and hybrid models (hybrid in terms of the algorithm used rather than network and system parameters) have been used in conjunction with hybrid NIDS [18]. One study highlighted the performance of an anomaly pre-processor integrated with the Snort IDS, which compensated for the inability of Snort on its own to detect unknown or zero-day attacks [19].

The application of neural networks extends widely into different sectors, including agriculture. In [20], MaskRCNN, a fast convolutional neural network model, was trained on a custom leaf-based dataset and employed to detect weeds. The experiment incorporated 50 datasets, more than 100 training configurations, and 300 h of training on the MaskRCNN model. The deep learning model yielded 93% mAP accuracy in detection on the training images and 95% accuracy on the testing images. Such a detection model can save farmers time and effort in detecting weeds in a pasture environment. Similarly, the use of deep learning models for anomaly detection is not limited to intrusion detection systems; several other industries depend on it. One such case is energy consumption, where deep learning models are used to detect abnormal energy consumption behaviour and help prevent it. Clusters were generated using various features such as temperature, humidity, and occupancy. The features considered as inputs represent the context of the energy meter. The meters are grouped by context type, and their behaviour is analysed. With clusters formed using the KNN algorithm, any deviation from a cluster is identified as an anomaly, since such instances represent abnormal behaviour. These deviations are then investigated further [21].

One of the biggest challenges in the health sector is the classification of imbalanced miRNA (micro-Ribonucleic Acid) sequences. miRNA helps detect and diagnose cancer. Jain et al. [22] proposed a hybrid neural network model with a deep ANN and a deep decision tree classifier. The proposed hybrid method performed better than the individual neural network and decision tree models. It showed an accuracy of more than 99% and improved the time complexity when compared with other existing models [22]. There have been several innovations to overcome the security issues affecting data integrity and privacy, one of which is the use of blockchain technology. The paper in [23] explains the use of blockchain in various sectors, where it contributes to maintaining the reliability of data. It illustrates the benefits of integrating blockchain technology with IoT through decentralization, which is considered one of the most significant factors, since a single authority cannot authorise a transaction; the bulk of the participants is required to do so [23]. Applying blockchain in an IDS can eliminate opportunities to tamper with the alerts generated by the IDS. This can be achieved by treating every alert generated by the IDS as a transaction. The collective alerts then follow a consensus rule, which helps validate the alerts before they are placed in a block [24]. The use of private blockchain models in IDS allows the network owners to privately invite and vet the participating nodes.

Shallow learning, and classifiers in particular, has had a huge impact on network intrusion detection. The hybrid IDS proposed by Tesfahun and Bhaskari [25] was a layered approach to HIDS consisting of two layers: a misuse detector, or signature-based NID model, and an anomaly-based NID. The SNIDS was based on a random forest classifier, and the AIDS was built using the bagging technique with an ensemble of one-class support vector machine classifiers; the dataset used for the study was NSL-KDD. The proposed model produced an attack detection rate of 92.1% and a false positive rate of 6.4%. Furthermore, Chitrakar and Chuanhe [26] developed a novel ensemble HIDS that combined the C5 classifier and the One-Class Support Vector Machine (OCSVM) classifier and hosted both a signature-based NIDS and an anomaly-based IDS. Compared with the results of the individual SIDS and AIDS, the proposed HIDS gave a higher detection rate and a lower number of false positives. The datasets used to evaluate the proposed HIDS were NSL-KDD and ADFA. The proposed technique yielded the highest accuracy on NSL-KDD when compared with other techniques such as C4.5, Random Forest, KNN, and Naïve Bayes [26]. To overcome the drawbacks of an individual learning algorithm, multiple machine learning algorithms are used to complement the overall intrusion detection process. The accuracy of the developed model was 83.24%.

The prevalence of deep learning in recent years has been attributed to several reasons. Firstly, processing capabilities have drastically improved thanks to powerful Graphics Processing Units (GPUs), along with the ease of acquiring GPU services and their cost [27]. Secondly, the significant drop in hardware cost over the past decade has paved the way for the increasing use of deep learning approaches. The ability of deep learning algorithms to form learnable links between actions and effects, also known as the depth of credit assignment paths, is what differentiates deep learning models from shallow learning models and has also led to their increasing usage [28]. With regard to deep learning models in a recent IDS, Khan et al. [29] proposed a HIDS based on a Convolutional-LSTM network. It was a two-stage IDS in which the first stage employed an anomaly-based intrusion detection model based on Spark ML, whereas the second stage was a misuse detection model based on the Conv-LSTM network. The dataset used was ISCX-UNB. An accuracy of 97.29% was observed in detecting network misuse with the proposed HIDS.

The architecture of the NIDS also plays a vital role in the performance of a hybrid intrusion detection model [30]. The authors of [31] developed a novel hybrid detection method that integrated a misuse detection model and an anomaly detection model hierarchically in a decomposition structure. The SNIDS was built on the C4.5 decision tree algorithm, whereas the AIDS was built on multiple one-class SVM models created for the decomposed subsets produced by the SNIDS. The models were evaluated through experiments on the NSL-KDD dataset. The proposed decomposition-structured model yielded a high detection rate and low false positives in comparison to previous studies. It also displayed very low training and testing times when compared with the conventional serial and parallel hybrid models. The proposed hierarchical model produced an accuracy of 99% and a false positive rate of 2%. More advancements and variations in HIDS have been developed in recent years. Among them is the proposal by Kim et al. [32], which conceptualized a signature generation engine integrated with a HIDS based on deep recurrent neural networks. The proposed HIDS comprised a signature detection system, a deep neural network-based anomaly detection system, and a Signature Generation Engine (SGE), which was envisioned to sustain the detection approach by feeding the generated signatures into the signature repository. The results were an updated and extensive signature repository, simultaneous detection and signature generation for unknown attacks, and a self-healing intrusion detection approach [32].

In some studies, multiple software tools have been integrated with one another to form a hybrid intrusion detection model. Such a model works fundamentally like other HIDS, as a combination of AIDS and SIDS. The HIDS proposed by Rizvi et al. [33] comprised a combination of the Packet Header Anomaly Detector (PHAD) and the Network Traffic Anomaly Detector (NETAD) integrated into the signature-based IDS Snort. PHAD uses a host protocol model and a time-based model, while NETAD uses a host packet model. As a result, the HIDS was able to detect an additional 119 attacks that the traditional signature-based detection of Snort could not [34]. More recently, Degeler et al. [35] proposed a hierarchical hybrid intrusion detection approach with an anomaly detector as the first stage of the IDS and an attack classifier as the second stage. The anomaly detection is performed by a novel lightweight solution based on a Multi-modal Deep Autoencoder (M2-DAE), and the attack classification is carried out by soft-output classifiers. This approach follows an inverted hierarchical architecture in contrast with the predominant studies in IDSs. As a result, the M2-DAE reduced the false positive rate by 40% in comparison with multiple baselines at the same positive rate. Additionally, in comparison with the best-performing misuse detectors, the HIDS increased the F1 score by 5% [36]. Similarly, to improve the efficiency and accuracy of a hybrid intrusion detection system, Sohi et al. [37] proposed integrating hybrid VMM-based honeypots into the HIDS, which transforms the entire IDS into a self-healing Intrusion Detection Prevention System (IDPS). Unique components of the proposal are the IDPS signature and anomaly databases as well as the Intrusion Detection Prevention Operations Centre (IDPOC), which allows users to quarantine potential threats or ban traffic from a particular source [33].

The paper by Creech and Hu [38] introduces a self-healing intrusion detection system based on danger theory. It investigates the danger signals that the IDS perceives as malicious, firstly through manual observation by the system’s operator and secondly through automated observation that analyses system logs for indicators such as events considered intrusive, sudden spikes in CPU usage, packet loss, and undefined usage of ports. In both cases, if the events are confirmed to be dangerous, they are communicated to the entire network so that every device can check its timeline for similar events. The authors concluded that such a self-healing IDS could result in a drastic decline in false positives [35]. In addition, recent studies have proposed IDSs that generate synthetic signatures for detecting zero-day attacks, such as the work of [39], which uses a recurrent neural network (RNN) to develop synthetic signatures and mutants of known attacks. By building a mutation signature database as well as synthetic signatures through deep learning, the IDS proposed in that study defends against known and unknown attacks [37].

Several studies suggest that the KDD99 dataset is insufficient for anomaly detection techniques to capture the wide spectrum of attacks that exist today. Chew et al. [40] conducted a comparative study highlighting the complexity of the ADFA features in contrast to the relatively simple features of KDD99. Moreover, algorithms trained on older datasets not only fail to detect contemporary attacks but are also not very reliable, because the older datasets are not as rich in data as the newer ones [38]. UNSW-NB15 is synthetic data, like CIC_IDS2017, whereas ISP and UQ are real-world data. The binary classification of attacks was highly accurate on the UNSW-NB15 dataset [39]. Also, a study conducted by Vinayakumar et al. [41] demonstrated that CIDDS001 suffers from high false positives across 10 machine learning classification models, whereas both UNSW-NB15 and GureKDDCup obtained low false positives and a high accuracy rate [40]. The comparison of various network intrusion datasets for different types of attacks is presented in Table 1.

Table 1 Comparison of network intrusion datasets

The HIDS has been trained and tested using UNSW-NB15 and ADFA-LD, because datasets such as KDD99 and NSL-KDD do not yield relevant results with regard to accuracy and detection rate: they lack modern attack patterns with low congestion and also lack present-day normal traffic behaviour [42]. The HIDS proposed by some authors combine anomaly-based and signature-based IDS without learning the signatures of the anomalies. The proposed HIDS is a self-learning IDS that extracts signatures from the anomalies found and transfers them to the SIDS, allowing the early detection of previously unknown threats [15]. One proposed IDS underlined the performance superiority of the deep learning approach, which demonstrated dominance in terms of accuracy, precision, recall, and F-score on the datasets KDDCup99, NSL-KDD, UNSW-NB15, WSN-DS, and CICIDS 2017. That study evaluated the performance of shallow and deep learning models with a cross matrix of their performance on all five datasets [41].

Furthermore, LSTM-RNN yielded better results than feed-forward neural networks (FNN), generative neural networks (GNN), recurrent neural networks (RNN) with Hessian-free optimization, and Jordan ANN (Artificial Neural Network) in the experiment conducted by Kim and Kim [43]. Unlike a conventional RNN, LSTM-RNN resolves the vanishing gradient issue. LSTM-RNN also learns long-term dependencies by using a gating mechanism, and it holds previous states in its memory cell [44]. LSTM-RNN was selected for the AIDS in the proposed HIDS because of these features, as described in [44]. The proposed HIDS has been heavily influenced by the IDS model in which the SIDS matches the signatures in Lightnet utilizing a Hybrid Multi-Start (HMS) algorithm and the AIDS utilizes Deep Q-learning. That framework is also self-healing, as the signatures from the anomalies are fed into the signature repository [45]. The C5 classifier was selected on the basis of the performance evaluation by Tang et al. [46], which showed 99% accuracy and detection rate with a shallow learning model as a SIDS (see also Belavagi and Muniyal [47]). This cemented the fact that deep learning is not always required to obtain better results, as similar results can be obtained from simple ML models. While intrusion detection approaches encompass a variety of concepts and models, there is still much unexplored territory that needs study and innovation due to the ever-growing interconnectedness of the Internet. Table 2 compares the related works by detection category, machine learning algorithm, and datasets, along with the paper title, and Table 3 compares the performance of the proposed model with that of the related works.

Table 2 Comparison of related works
Table 3 Performance based comparison of proposed model and related works

3 Background study

This section discusses the models that have been selected for the HIDS and the datasets that are used to train and test the model. It also describes the mechanisms of the selected models and compares them with some other algorithms, explaining the advantages of the selected models over the others. Lastly, it gives the details of the datasets: UNSW-NB15 and ADFA-LD.

3.1 Long short-term memory-recurrent neural networks

A conventional neural network processes inputs independently of each other; in an RNN, however, inputs are considered in context, and the interdependence between inputs is reflected. An RNN is a deep learning algorithm comprising input, output, and hidden layers, whose recurrent connections allow information about earlier inputs to be stored and remembered. Information flows in one direction through a loop that memorizes the previous information and applies it to the current output. This differentiates RNN from feed-forward neural networks. The nodes within the hidden layer are also connected to each other, so the previous output is related to the current output. Moreover, the output of the hidden layer at one step acts as an input to the hidden layer at the next step [5]. In an IDS, an RNN identifies patterns and irregularities in a large dataset, which helps establish rules for evaluating whether real-time network traffic is malicious or normal [46].
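
The recurrence described above can be made concrete with a minimal single-unit sketch in plain Python. The weights `w_x` and `w_h`, the bias `b`, and the input sequences are arbitrary illustrative values, not parameters of the proposed model, and a practical RNN uses weight matrices and vector-valued states instead of scalars.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent step for a single hidden unit: h_t = tanh(w_x*x_t + w_h*h_prev + b)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the recurrence and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x_t in sequence:
        h = rnn_step(x_t, h, w_x, w_h, b)  # previous state feeds the current step
        states.append(h)
    return states


# Both sequences end with the same input (0.0) but have different histories.
after_quiet = run_rnn([0.0, 0.0, 0.0])[-1]
after_burst = run_rnn([1.0, 1.0, 0.0])[-1]
print(after_quiet, after_burst)
```

Although the last input is identical in both runs, the final hidden state differs because it still carries a trace of the earlier inputs, which is what lets the network treat an input in the context of the traffic that preceded it.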

Fig. 1 Recurrent neural network [46]

The recurrent neural network in Fig. 1, used for attack classification, employs sequential layers that process information into feature representations. This has only become practical in recent years owing to affordable hardware and the availability of high processing capabilities for general and research use. The proposed RNN model is a multilayer Long Short-Term Memory (LSTM) network that outperforms most traditional approaches in IDS. RNN is a deep learning model widely used for recognizing generated images and text and interpreting the results. However, a conventional RNN fails to capture long-term dependencies, a limitation that LSTM-RNN resolves. LSTM, shown in Fig. 2, was developed specifically to overcome the problem of long-term dependency. It does so by mitigating the vanishing gradient issue that arises when gradient descent, the optimization algorithm that finds the neural network weights, is used to train an RNN [48].
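
For reference, the gating mechanism of an LSTM cell is commonly written as follows. This is the standard textbook formulation rather than notation taken from the source: $x_t$ is the input at step $t$, $h_{t-1}$ the previous hidden state, $c_t$ the memory cell, $\sigma$ the logistic sigmoid, $\odot$ element-wise multiplication, and the $W$ and $b$ terms are learned weights and biases.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The cell state update is additive: when the forget gate $f_t$ stays close to 1, information and gradients pass through $c_t$ across many steps largely unchanged, which is how the cell retains long-term dependencies.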

Fig. 2 Long short-term memory [48]

3.2 C5 decision tree

The decision tree, in its simplest form, is an if-then-else rule-based machine learning algorithm. It is a very powerful classifier and has achieved very high detection rates in various application sectors. The C5 decision tree can deal with missing attributes by assigning them the value that is most common among the other instances at the same node [49]. In a decision tree, each branch node represents a selection among alternatives, and each leaf node corresponds to a classification or decision [50], so each path through the tree represents a rule. Moreover, C5 supports decision tree boosting, which generates and combines multiple classifiers for improved prediction [49].

C5 follows the algorithm of its predecessor, C4.5, and can build large-scale decision trees whose rules are easy to understand through a visual representation. C5 also handles values that are missing during training: missing values are marked as ‘?’ and are not used in the gain and entropy calculations. The C5 classifier addresses the problem of overfitting the data in the decision tree by:

  1) Stopping the growth of the decision tree before it reaches the point where the training data is perfectly classified.

  2) Applying post-pruning to the tree when it overfits the training data.

In post-pruning, a decision tree is pruned after it has been constructed, for example when it has very deep levels of branching, in which case post-pruning may also be used to speed up the process. C5 classifiers are also used to select a small subset of relevant features from the datasets provided, and they have been shown to perform well even on high-dimensional data. High dimensionality has been one of the challenging factors in the design of a variety of other machine learning algorithms [51].
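
The entropy and gain calculations referred to in this section, with ‘?’ values left out, can be sketched in a few lines of Python. This is a simplified illustration and not the C5 implementation: the records and the attribute name `proto` are hypothetical, and the sketch does not reproduce any further adjustment that C4.5 or C5 applies to attributes with missing values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label="label", missing="?"):
    """Gain of splitting on `attribute`, using only rows where its value is known."""
    known = [r for r in rows if r[attribute] != missing]
    if not known:
        return 0.0
    base = entropy([r[label] for r in known])
    remainder = 0.0
    for value in {r[attribute] for r in known}:
        subset = [r[label] for r in known if r[attribute] == value]
        remainder += len(subset) / len(known) * entropy(subset)
    return base - remainder


# Hypothetical training records; the last one has a missing protocol value.
rows = [
    {"proto": "tcp", "label": "attack"},
    {"proto": "tcp", "label": "attack"},
    {"proto": "udp", "label": "normal"},
    {"proto": "udp", "label": "normal"},
    {"proto": "?", "label": "attack"},
]
print(information_gain(rows, "proto"))  # 1.0
```

In this toy data the protocol separates the four known records perfectly, so the gain equals the full one bit of entropy of the known subset; the record marked ‘?’ does not enter the calculation.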

3.3 Datasets

Datasets are collections of data that have been gathered and organized so that they can be processed and analyzed in a common way. The following datasets have been used for testing and benchmarking the proposed model.

  i) UNSW-NB15

UNSW-NB15 is a dataset from the University of New South Wales (UNSW) in Australia that supports network intrusion detection through behavioural analysis. It consists of a hybrid collection of real modern normal activities and synthetic contemporary attack activities, and it contains nine attack types: Fuzzers, Analysis, Backdoor, DoS, Exploits, Generic, Reconnaissance, Shellcode, and Worms. The dataset has been partitioned into training and testing sets for ML [13]. Table 4 summarizes the categorization of UNSW-NB15’s features by type, with the total number and names of the features.

Table 4 Categorization of UNSW-NB15’s features by type

  ii) ADFA-LD

The ADFA datasets consist of data intended for anomaly-based IDS (AIDS). They cover both the Linux and Windows operating systems. The data collection for Linux consists of system call traces; for the training set, traces of 300 bytes to 6 kB were neglected. Similarly, for the validation or testing set, traces of 300 bytes to 10 kB were neglected. For Windows, there are Dynamic Link Library (DLL) calls comprising 1828 normal traces and 5773 attack traces [52]. ADFA-LD has been used to train the LSTM model. Table 5 shows the categorization of the ADFA datasets by data type for the Windows and Linux platforms.

Table 5 Categorization of ADFA datasets by data types

4 Methodology

In this section, the proposed architecture and processes are defined. The C5 classifier is built in RStudio, whereas the LSTM model is built with Keras in Python. The details of the flow of packets and the decision points are highlighted in Figs. 3, 4 and 5.

Fig. 3 Packet flow

The proposed hybrid IDS in Fig. 6 is built on highly effective individual network intrusion detection models. The signature-based IDS is based on the C5 decision tree algorithm, which classifies the inputs into known attacks and normal packets and is one of the highest-accuracy algorithms in network intrusion detection. Similarly, the LSTM-RNN algorithm is used for anomaly-based intrusion detection, as it is class-leading in terms of performance at distinguishing anomalies from normal activities. The hybrid approach helps in the detection of known as well as unknown attacks. Moreover, the self-healing attribute of the proposed hybrid intrusion detection system stores the signatures of anomalies detected by the AIDS, which helps in the early detection of similar attacks in the future through signature matching. Because of the increasing volume of attacks that use circumvention techniques such as polymorphism, which changes the signature of malicious packets, anomaly-based detection, which is able to detect known as well as zero-day attacks, has been integrated into the IDS. The ML algorithms were chosen after considering previous research in which they yielded some of the highest performance metrics for intrusion detection. As a result, a C5 classifier model was implemented for binary classification as the signature-based intrusion detection model, whereas an LSTM model was implemented as the anomaly-based detection model. The anomalous packets are then evaluated, and their features are extracted into a ‘.csv’ file. When the anomalous packets are verified as malicious, these attributes are categorized as attacks and fed into the decision tree. This helps the proposed HIDS detect similar attacks at an earlier stage and is proposed as a self-healing approach. Similarly, attributes extracted from zero-day attacks that have been detected externally can also be fed into the SIDS stage of the proposed IDS.
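
The extraction of anomalous-packet attributes into a ‘.csv’ file could look roughly like the sketch below. The field names are a hypothetical subset of flow features, the `verified_malicious` flag stands in for whatever verification step is applied, and an in-memory buffer replaces the actual file; none of these details are specified in the text.

```python
import csv
import io

# Hypothetical subset of flow features plus the class label used for retraining.
FIELDS = ["proto", "service", "sbytes", "dbytes", "label"]


def export_verified_anomalies(csv_file, anomalies):
    """Write anomalies verified as malicious as labelled rows for the SIDS training data."""
    writer = csv.DictWriter(csv_file, fieldnames=FIELDS)
    writer.writeheader()
    written = 0
    for record in anomalies:
        if record.get("verified_malicious"):
            row = {name: record[name] for name in FIELDS if name != "label"}
            row["label"] = "attack"  # verified anomalies are categorized as attacks
            writer.writerow(row)
            written += 1
    return written


anomalies = [
    {"proto": "tcp", "service": "http", "sbytes": 9120, "dbytes": 0, "verified_malicious": True},
    {"proto": "udp", "service": "dns", "sbytes": 130, "dbytes": 162, "verified_malicious": False},
]
buffer = io.StringIO()  # stands in for an open '.csv' file
count = export_verified_anomalies(buffer, anomalies)
print(count)
print(buffer.getvalue())
```

Only the verified record is written, labelled as an attack, so the resulting rows can be appended to the data on which the decision tree is retrained.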

4.1 Proposed hybrid intrusion detection system

The proposed hybrid intrusion detection system combines misuse detection and anomaly detection with a high detection rate and a continuous signature generator, which adds newly produced signatures to the signature repository. The proposed HIDS not only resolves the individual issues of SIDS and AIDS but also creates a flow of newly discovered signatures of unknown attacks into the signature repository. This contributes to the early detection of those attacks, which saves time and optimizes resource utilization. The HIDS can be divided into three stages:

  a) Signature-based intrusion detection model utilizing the C5 algorithm (stage 1)

  b) Anomaly-based intrusion detection model utilizing the LSTM recurrent neural networks algorithm (stage 2)

  c) Signature generator (stage 3)

The three-stage HIDS combines the C5-based signature-based intrusion detection model and the recurrent neural network-based anomaly detection model in a hierarchical order. The signature generator then characterizes the detected anomalies and extracts signatures from them, which are added to the signature repository used by the SIDS. The resulting benefit is that recently detected anomalies can be discovered in the earliest stage of the HIDS.
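
The hierarchical flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class `HybridIDS`, the three-field signature, and the threshold of 0.5 are hypothetical, the anomaly score stands in for the output of the LSTM stage, and the manual verification of anomalies before they enter the repository is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class HybridIDS:
    """Toy three-stage pipeline: signature match, anomaly check, signature generation."""

    signature_repository: set = field(default_factory=set)
    anomaly_threshold: float = 0.5  # hypothetical cut-off, not taken from the paper

    def extract_signature(self, packet: dict) -> tuple:
        # Hypothetical signature: a tuple of a few packet attributes.
        return (packet["proto"], packet["service"], packet["dst_port"])

    def process(self, packet: dict, anomaly_score: float) -> str:
        signature = self.extract_signature(packet)
        # Stage 1: signature matching against the repository.
        if signature in self.signature_repository:
            return "alert: known attack"
        # Stage 2: anomaly detection (the score stands in for the LSTM output).
        if anomaly_score >= self.anomaly_threshold:
            # Stage 3: store the new signature so stage 1 catches it next time.
            self.signature_repository.add(signature)
            return "alert: anomaly"
        return "normal"
```

In this sketch, a packet flagged as an anomaly once is reported as a known attack on its next occurrence, which mirrors the self-healing behaviour described for the signature repository.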

  • Stage 1

SIDS is placed at the upper level in the hierarchy of the HIDS because it hosts the signature repository. Also, the early detection of known intrusions through signature matching reduces the load on the anomaly detection method in the later stage. Since SIDS provides low false positives and high detection accuracy, signature detection must be the first stage rather than a later one, because it progressively narrows down the attacks in the proposed IDS. Reversing the proposed hierarchy, with the SIDS placed in the second or third stage, could make the SIDS highly redundant. It is believed that eliminating the known attacks in the first stage can reduce resource utilization in the AIDS stage and possibly save time as well.

Ahmad et al. [7] compared the effectiveness of signature-based anomaly detection using the C5 classifier with that of other machine learning algorithms and established that there was a reduction in false negatives and a significant improvement in the rate of detection. The classifiers were trained and tested using NSL_KDD [7]. Furthermore, the UNSW-NB15 datasets will be run through the C5 decision tree algorithm along with other ML algorithms in the methodology, and the results will be analyzed to compare accuracy and false alarm rate. RStudio and WEKA will be used to train and test the C5 model. In this SIDS, unknown packets are handled through signature matching to determine the nature of the packets, i.e., normal or abnormal. When the signature extracted from a packet matches one in the signature repository, an alert is triggered for the user to review. However, if there is no match, the packet is forwarded to the AIDS.

  • Stage 2

The packets categorized as normal by the SIDS are the input data for the anomaly detection phase. The SIDS is responsible for building a normal behavior profile that represents the pattern and summary statistics of non-malicious network traffic. For the training phase, an offline component is used to help build the profile of normal user behavior by extracting rules from network traffic labeled as non-attack. Similarly, the SIDS also learns the attack classes, in an offline component, from network traffic labeled with known attacks [12]. The proposed system uses the ADFA-LD and ADFA-WD datasets to train the LSTM-RNN.

The training datasets are used to train the classifiers, whereas the testing datasets are used to measure their accuracy. The classification is binary and produces two classes, normal and anomaly. Sarhan et al. [13] implemented an IDS based on LSTM-RNN, which was trained using instances from the KDD Cup 1999 dataset. The results demonstrated a superior detection rate and accuracy compared with GRNN (General Regression Neural Network), PNN (Probabilistic Neural Network), RBNN (Radial Basis Function Neural Network), KNN, SVM, and Bayesian classifiers [32]. Also, Naidu and Avadhani et al. [50] proposed an LSTM-RNN utilizing an Adam optimizer that yielded an accuracy of 99.97% as a binary classifier IDS in anomaly detection. Using the trained classifier, packets are matched against the normal behavior profile, and if any deviation from the pattern is detected, an alert is sent to the user. The packets categorized as malicious are then sent to the signature generator [48].

  • Stage 3

The signature generator is an integral phase of the proposed hybrid detection method because it converts the features extracted from the anomalous packets segregated by the AIDS into signatures that help identify the attack underlying the anomaly. This conversion relies on the learning capability of the signature generator, which is based primarily on the features of the anomalous packets. The generated signature is then fed into the signature repository, which aids more effective, precise, and accurate detection of future attacks by the proposed HIDS [53]. After anomaly detection, the AIDS provides the rate of normality and the rate of abnormality of each connection. The rate of normality refers to the similarity of the packet features to normal traffic behavior, whereas the rate of abnormality refers to the degree of deviation of the features of the processed traffic from normal traffic.

The signature generation proposed by Hwang et al. [54] was a weighted signature generation in which the rate of normality (normality score) and the rate of abnormality (anomaly score) were normalized so that the two scores sum to 1. They defined the overall rate of normality and abnormality of a pattern as the sum of the normality scores and anomaly scores, respectively, of all the established connections that match the pattern [54]. Signatures transferred into the repository are those with a high rate of abnormality and a low rate of normality. A high rate of abnormality results from more anomalous connections being matched, whereas a low rate of normality results from fewer normal connections being matched. A low rate of normality also results in fewer false alarms because the signature deviates strongly from those of normal traffic. The signatures with a low rate of normality and a high rate of abnormality are then integrated into the signature repository in stage 1.
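
The selection rule described above can be illustrated with a short sketch. It assumes that each connection carries an anomaly score in [0, 1] and a normality score equal to one minus that value, so that the two sum to 1. The function names and both thresholds are hypothetical and are not taken from [54].

```python
from collections import defaultdict


def aggregate_pattern_scores(connections):
    """Sum normality and anomaly scores over all connections matching each pattern.

    `connections` is an iterable of (pattern, anomaly_score) pairs. The
    normality score of a connection is taken as 1 - anomaly_score.
    """
    totals = defaultdict(lambda: {"normality": 0.0, "abnormality": 0.0})
    for pattern, anomaly_score in connections:
        totals[pattern]["abnormality"] += anomaly_score
        totals[pattern]["normality"] += 1.0 - anomaly_score
    return dict(totals)


def select_signatures(totals, min_abnormality, max_normality):
    """Keep patterns with a high rate of abnormality and a low rate of normality."""
    return sorted(
        pattern
        for pattern, score in totals.items()
        if score["abnormality"] >= min_abnormality
        and score["normality"] <= max_normality
    )


# Hypothetical connections: pattern "A" is matched mostly by anomalous traffic.
connections = [("A", 0.9), ("A", 0.8), ("B", 0.2), ("B", 0.1), ("C", 0.9)]
totals = aggregate_pattern_scores(connections)
selected = select_signatures(totals, min_abnormality=1.5, max_normality=0.5)
```

With these illustrative thresholds only pattern "A" is selected: pattern "C" deviates strongly from normal traffic but is matched by a single connection, so its aggregated rate of abnormality stays below the cut-off.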

Fig. 4 Flowchart for signature-based IDS

Firstly, the packets are sent through the C5 classifier as shown in Fig. 4. Classification occurs through pattern matching to determine whether the packets demonstrate normal or abnormal behaviour. For the C5 classifier to learn the patterns in the datasets, a labelled dataset is required [55]. The dataset used by the C5 classifier is UNSW-NB15. The established connections are interpreted by the classifier and assigned to a class. In this case, the classes are known attacks and normal traffic. The signatures of known attacks are then stored in the signature repository.

Fig. 5 Flowchart for anomaly-based IDS

The packets passed on by the SIDS as normal traffic are now the input of the AIDS, which helps in finding zero-day attacks. AIDS is based on learning normal behavior, which, when implemented as a NIDS, helps in detecting abnormal behaviors in the network [44]. The classification in the proposed hybrid intrusion detection system is binary, with the detection classes 0 and 1 denoting normal and anomaly, respectively. The dataset used to train the LSTM-RNN is ADFA-LD; as shown in Fig. 5, the data are categorized into normal and anomaly.

Signature generator:

As shown in Fig. 4, generating a signature from the packets requires their attributes and features as well as the signature repository. Wireshark can be used to extract features from detected malicious packets. The captured features are then compiled in a CSV file and used in combination with the training set for the C5 model in stage one. The attributes collected are similar to those of UNSW-NB15. Wireshark is an open-source, commonly used network protocol analyzer that helps detect suspicious packets entering from an unreliable source. It is one of the most popular packet analyzers, is equipped with many features, and runs on all major platforms. For signature generation and attribute extraction, Wireshark can perform various actions such as sniffing, capturing, logging, and post-sniffing analysis. There are many paid applications and devices that help extract attributes and generate signatures. However, we consider Wireshark for signature generation in this paper due to its cost and ease of access [56].
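
The step of compiling the captured features into a CSV file and combining them with the training set could look like the sketch below. It is an illustration only: the function `merge_into_training_csv`, the `verified_malicious` flag, and the four-column subset of attributes are assumptions made for the example, and the feature values would in practice come from the packet capture rather than from hand-written dictionaries.

```python
import csv
import io

# Illustrative subset of attributes; the real training set has many more columns.
FIELDNAMES = ["dur", "sbytes", "dbytes", "label"]


def merge_into_training_csv(training_csv: str, anomalies: list) -> str:
    """Return the training CSV text with verified malicious rows appended as attacks."""
    rows = list(csv.DictReader(io.StringIO(training_csv)))
    for features in anomalies:
        # Only anomalies verified as malicious are added to the training data.
        if features.get("verified_malicious"):
            row = {name: features[name] for name in FIELDNAMES if name != "label"}
            row["label"] = 1  # 1 marks an attack instance
            rows.append(row)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


training = "dur,sbytes,dbytes,label\n0.5,100,200,0\n"
anomalies = [
    {"dur": 1.2, "sbytes": 4000, "dbytes": 0, "verified_malicious": True},
    {"dur": 0.1, "sbytes": 10, "dbytes": 10, "verified_malicious": False},
]
merged = merge_into_training_csv(training, anomalies)
```

The merged text can then be written to disk and used to retrain the stage-one classifier.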

Fig. 6 Proposed HIDS

4.2 Performance evaluation

Performance evaluation determines whether algorithms, software, or systems operate efficiently, and a performance metric measures how well a system performs. The classification models have been evaluated in terms of the following standard performance metrics [15]:

  1) Accuracy indicates how many instances are correctly classified among all instances within a dataset. It is the number of correctly classified instances, i.e., true positives plus true negatives, divided by the total number of instances.

    $$\begin{aligned} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$
    (1)

    Where TP is true positive, TN is true negative, FP is false positive, and FN is False Negative.

  2) Precision is the ratio of true positives to the sum of true positives and false positives. It measures the fraction of instances predicted to be positive that are actually positive.

    $$\begin{aligned} Precision = \frac{TP}{TP+FP} \end{aligned}$$
    (2)
  3) Recall refers to the ability of a model to correctly identify all instances of a particular class or category. It is the ratio of true positives to the sum of true positives and false negatives. Recall is also called the true positive rate or sensitivity.

    $$\begin{aligned} Recall = \frac{TP}{TP+FN} \end{aligned}$$
    (3)
  4) The F1 score is a performance metric often used to assess binary classification models. It is the harmonic mean of precision and recall, computed as twice the product of precision and recall divided by their sum, and it ranges from 0 to 1.

    $$\begin{aligned} F1 = \frac{2*Precision*Recall}{Precision+Recall} \end{aligned}$$
    (4)
  5) The Receiver Operating Characteristic (ROC) curve is a performance evaluation technique used to analyze the trade-off between the true positive rate (TPR) and the false positive rate (FPR) of a model across a range of discrimination thresholds.
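
Equations (1) to (4) can be computed directly from the four confusion-matrix counts. The following sketch shows this with hypothetical counts chosen for the example; the function name `classification_metrics` is not from the original work, and the counts are assumed to be large enough that no denominator is zero.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)           # Eq. (1)
    precision = tp / (tp + fp)                           # Eq. (2)
    recall = tp / (tp + fn)                              # Eq. (3)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (4)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, not results from the experiments in this paper.
metrics = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

For these counts, accuracy is 170/200 = 0.85, precision is 90/110, recall is 90/100 = 0.9, and the F1 score is 180/210.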

5 Results

The proposed method has been tested in various scenarios, and the results of those tests are discussed in this section.

5.1 C5 classifier and dataset reduction

The dataset used in the decision tree model was analysed and pre-processed before being fed into the algorithm in order to optimize attack detection. The dataset comprised 45 variables, 4 of which were nominal and the remaining numerical. Furthermore, of the attributes in the dataset, 7 were categorical and the remaining were quantitative. The UNSW-NB15 training set was reduced by eliminating redundant data. The training dataset had an attack column and an ID column that were not of significance in the experiment and were dropped before the model was run. The dataset contained a column named ‘label’ that represented attack or normal instances in binary: ‘1’ represented attack instances, whereas ‘0’ represented normal instances. The order of the instances was randomized, and the number of normal instances was reduced to obtain a 1:1 proportion of attack to normal instances.

Also, the test dataset does not have the proto, state, and is_ftp_login columns which can be found in the training dataset. As a result, those columns were deleted from the training set. Moreover, the service column has missing instances that the C5 classifier cannot process when trained. Hence, this column has been dropped as well.
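
The pre-processing steps described above (dropping columns, reducing the normal instances to a 1:1 proportion, and randomizing the order) can be sketched as follows. This is an illustration in Python using only the standard library rather than the R workflow used in the experiments. The function `preprocess`, the fixed seed, and the toy rows are assumptions made for the example.

```python
import random


def preprocess(rows, drop_columns, seed=0):
    """Drop unused columns, balance attack and normal rows 1:1, then shuffle."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    cleaned = [
        {key: value for key, value in row.items() if key not in drop_columns}
        for row in rows
    ]
    attacks = [row for row in cleaned if row["label"] == 1]
    normals = [row for row in cleaned if row["label"] == 0]
    # Reduce the normal instances so that both classes have the same size.
    normals = rng.sample(normals, min(len(normals), len(attacks)))
    balanced = attacks + normals
    rng.shuffle(balanced)
    return balanced


# Toy rows: 2 attack instances and 5 normal instances.
rows = [
    {"id": i, "proto": "tcp", "dur": 0.1 * i, "label": 1 if i < 2 else 0}
    for i in range(7)
]
balanced = preprocess(rows, drop_columns={"id", "proto"})
```

In this sketch the result holds two attack rows and two normal rows with the `id` and `proto` columns removed.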

The R script to train the C5 model is as follows:

  • C5_model <- C5.0(x = train_model[, -44], y = as.factor(train_model$attack_cat)) Here, UNSW-NB15_Trainingset has been assigned to train_model, and train_model[, -44] is used to drop the column ‘attack_cat’ from the predictors, since attack_cat is the 44th column in the training set. The output, assigned as y, is the attack_cat column.

  • summary(C5_model)

This script provides the details of the model such as subtrees, size of the tree, attribute usage, and errors. From the summary, 1,693,409 instances were observed, with an error rate of 17%. The attribute usage of the UNSW-NB15 dataset in the C5 classifier is shown in Table 6 with usage (%) and attribute parameters.

Table 6 Attribute usage of UNSW-NB15 dataset in C5 classifier

To run the trained model on the test dataset, the following script was used:

  • P1 <- predict(C5_model, test_data[, -44]) Here, P1 holds the predictions obtained by running the test data through the previously generated C5 model.

  • P1 This prints the predictions.

The performance metrics for the C5 classifier are shown in Table 7, which includes the TP, FP, precision, recall, and F1 score for both classes.

Table 7 TP, FP, precision, Recall and F1 score for both classes with C5 classifier

In comparison, the research in [15] ran a C5 model on the UNSW-NB15 training set, with the number of instances reduced from the original 152,148 to 74,588, and produced a normal accuracy of 90.74% and an attack accuracy of 70.65%. The C5 model yielded better results than the proposed method in terms of accuracy [15]. However, the experiments in [57] yielded marginally superior accuracy through training and testing on combined UNSW-NB15 files, with the ANN model yielding 99.26% average accuracy and the DNN yielding 99.22% accuracy in binary classification. The dataset used in those experiments combined the training and testing sets and consisted of 2,540,044 packets [57]. A recent paper trained and tested two stacking models on UNSW-NB15: the first uses XGBoost and KNN as base classifiers and Random Forest as the meta classifier, and the second uses XGBoost, NN, and KNN as base classifiers and Random Forest as the meta classifier. Kabir et al. [58] achieved an accuracy of 93.62% with stack 1 and 92.76% with stack 2, which was higher than individual models such as XGBoost, NN, KNN, and RF [58]. Table 8 provides the detailed accuracy by class for the C4.5 model trained on the UNSW-NB15 dataset.

Table 8 TP, FP, precision, recall and F1 score for both classes with C4.5 classifier

As seen from the table, C5 outperforms the C4.5 classifier in almost every measure. The C4.5 values for TP rate, FP rate, precision, recall, F-measure, MCC, ROC area, and PRC area are not vastly different from those of C5, but C5 performs better on these metrics.

5.2 LSTM-RNN

The ADFA-LD dataset was evaluated and processed before the LSTM was run. On evaluating the data, 6 types of attack were observed, which are listed in Table 9.

There were 833 normal traces in the training set and 4373 normal traces in the validation set. The attack data were split into two sets, with 70% used for training and 30% for validation. To achieve this, 7 folders of attack data were used for the training set and the remaining 3 for validation. The LSTM model has 2 layers with 200 cells, and the number of epochs was set between 50 and 5000 for training. Also, the learning rate was set to 0.001 to cover more data points.
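
A model with the stated configuration (2 LSTM layers of 200 cells and a learning rate of 0.001) could be defined in Keras roughly as shown below. This is a sketch under several assumptions that are not stated in the text: the embedding layer and its size, the sigmoid output for binary classification, the Adam optimizer, and the binary cross-entropy loss are illustrative choices, and `build_lstm_classifier` is a hypothetical name.

```python
LSTM_UNITS = 200        # cells per LSTM layer, as stated in the text
NUM_LSTM_LAYERS = 2     # number of LSTM layers, as stated in the text
LEARNING_RATE = 0.001   # learning rate, as stated in the text


def build_lstm_classifier(vocab_size: int, embedding_dim: int = 32):
    """Build a binary sequence classifier over system-call traces.

    `vocab_size` is the number of distinct system-call IDs; `embedding_dim`
    is an assumed value, not taken from the paper.
    """
    # Imported lazily so the constants above can be inspected without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim))
    for layer_index in range(NUM_LSTM_LAYERS):
        is_last = layer_index == NUM_LSTM_LAYERS - 1
        # Every LSTM layer except the last passes the full sequence onwards.
        model.add(keras.layers.LSTM(LSTM_UNITS, return_sequences=not is_last))
    model.add(keras.layers.Dense(1, activation="sigmoid"))  # 0 = normal, 1 = anomaly
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=LEARNING_RATE),  # assumed optimizer
        loss="binary_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Calling `build_lstm_classifier(vocab_size)` requires TensorFlow to be installed; training would then proceed with `model.fit` on padded system-call sequences.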

Fig. 7 False alarm rate vs detection rate for LSTM-200

Figure 7 represents the ROC curve of the LSTM-200. The LSTM classifier yielded high accuracy and a low false alarm rate. The area under the ROC curve obtained in the experiment was 0.936 with a 17% false alarm rate (FAR). Table 10 compares the performance of various models with the proposed LSTM. The computer used had an Intel Core i7-11800H @ 2.3 GHz, 16 GB of RAM, and an NVIDIA GeForce RTX 3050 GPU, running 64-bit Windows 10. The following code was used as a reference to build the LSTM classifier:

http://github.com/ririhedou/systemCallAnomalyDetectionLSTM
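
To illustrate how a false alarm rate versus detection rate curve such as Fig. 7 and the area under it are obtained, the following sketch sweeps a decision threshold over a set of anomaly scores. The labels and scores are hypothetical and are not the experimental data; the function names are chosen for the example.

```python
def roc_points(labels, scores):
    """Return (false alarm rate, detection rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more traces are flagged as anomalies.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Integrate the curve with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = attack) and anomaly scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_points(labels, scores)
auc = area_under_curve(points)
```

For this toy data the area under the curve is 8/9, because 8 of the 9 attack and normal pairs are ranked in the correct order.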

Table 9 Trace counts and payloads of attack types in ADFA-LD
Table 10 Comparison table of HIDS

In comparison to the results obtained with a CuDNNLSTM network trained on the ADFA-LD dataset in the paper by Borisaniya et al. [59], the proposed LSTM classifier slightly outperformed the bidirectional LSTM encoder, which achieved a TDR (True Detection Rate) of 90% and a FAR (False Alarm Rate) of 25% [59]. Also, Xie et al. [61] developed a System-call Behavioural Language based on a sensitivity-based LSTM model, which achieved an AUC of 0.99 on test data and 0.93 on the unknown dataset [60]. The proposed LSTM model performs well in comparison to many ML and DL (deep learning) models and is on par with the most recent best-performing NIDS; the score achieved by the proposed model is significantly higher than most. Overall, the C5 classifier yielded an average true positive rate of 97.3% and an average false positive rate of 8%, which is among the class leaders in signature-based intrusion detection systems. Also, the proposed LSTM model yielded a detection rate of 90% while maintaining a low false alarm rate of 17%. Both stages of the HIDS displayed class-leading performance metrics, as shown in Table 10, which compares the results of the proposed IDSs with some of the best-performing IDS models.

6 Conclusion and future works

Real-time packet testing has been identified as a future research project. To test the effectiveness and accuracy of the proposed model, it needs to be subjected to real-time packets that contain various attack types as well as benign network traffic. The main intention of real-time network testing is to verify the anomalous packets, extract the malware attributes, and feed them to the signature repository to train the C5 model in the signature-based intrusion detection stage. After this, when the same attack is executed again, the SIDS must trigger the alert before the packet reaches the anomaly-based intrusion detection stage. Attack signatures that have been found externally can also be added to the repository to train the SIDS. Moreover, the performance of the HIDS model as a whole needs to be assessed in terms of accuracy, detection rate, and false alarm rate. The retention of signatures and the execution of those signatures in the SIDS stage need to be assessed as well, which helps measure the self-healing ability of the proposed model. Developing signature generation techniques requires more focus in future work, as there are several novel methods and devices for extracting attributes from anomalous packets more efficiently. Appropriate feature extraction methods should be evaluated, and a method of signature extraction that complements and is cohesive with the proposed model needs to be selected. The paper does not address obfuscation techniques or methods to mitigate the inability to detect obfuscated packets. As this poses a threat to the proposed model, mitigating such threats is an opportunity for future work. In further research, computational power and time can also be taken into account for the deployment of the proposed model in real life, which will address the efficiency, affordability, and practicality of the model in a network.