A Multi-level Random Forest Model-Based Intrusion Detection Using Fuzzy Inference System for Internet of Things Networks

Awotunde, Joseph Bamidele; Ayo, Femi Emmanuel; Panigrahi, Ranjit; Garg, Amik; Bhoi, Akash Kumar; Barsocchi, Paolo

doi:10.1007/s44196-023-00205-w

A Multi-level Random Forest Model-Based Intrusion Detection Using Fuzzy Inference System for Internet of Things Networks

Research Article
Open access
Published: 12 March 2023

Volume 16, article number 31, (2023)
Cite this article

Download PDF

You have full access to this open access article

International Journal of Computational Intelligence Systems Aims and scope Submit manuscript

A Multi-level Random Forest Model-Based Intrusion Detection Using Fuzzy Inference System for Internet of Things Networks

Download PDF

3306 Accesses
11 Citations
Explore all metrics

Abstract

Intrusion detection (ID) methods are security frameworks designed to safeguard network information systems. The strength of an intrusion detection method is dependent on the robustness of the feature selection method. This study developed a multi-level random forest algorithm for intrusion detection using a fuzzy inference system. The strengths of the filter and wrapper approaches are combined in this work to create a more advanced multi-level feature selection technique, which strengthens network security. The first stage of the multi-level feature selection is the filter method using a correlation-based feature selection to select essential features based on the multi-collinearity in the data. The correlation-based feature selection used a genetic search method to choose the best features from the feature set. The genetic search algorithm assesses the merits of each attribute, which then delivers the characteristics with the highest fitness values for selection. A rule assessment has also been used to determine whether two feature subsets have the same fitness value, which ultimately returns the feature subset with the fewest features. The second stage is a wrapper method based on the sequential forward selection method to further select top features based on the accuracy of the baseline classifier. The selected top features serve as input into the random forest algorithm for detecting intrusions. Finally, fuzzy logic was used to classify intrusions as either normal, low, medium, or high to reduce misclassification. When the developed intrusion method was compared to other existing models using the same dataset, the results revealed a higher accuracy, precision, sensitivity, specificity, and F1-score of 99.46%, 99.46%, 99.46%, 93.86%, and 99.46%, respectively. The classification of attacks using the fuzzy inference system also indicates that the developed method can correctly classify attacks with reduced misclassification. The use of a multi-level feature selection method to leverage the advantages of filter and wrapper feature selection methods and fuzzy logic for intrusion classification makes this study unique.

Modified parallel random forest for intrusion detection systems

Article 02 May 2016

IoT Intrusion Detection Using Modified Random Forest Based on Double Feature Selection Methods

A random-forests-based classifier using class association rules and its application to an intrusion detection system

Article 02 June 2016

Find the latest articles, discoveries, and news in related topics.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The internet has formed an integral part of people's daily lives, and a great deal of data must be protected from cybercrimes [1]. The importance of the internet to daily human activities has caused attackers to perpetuate many cybercrime activities. The attackers are always looking for new methods to steal users' confidential data by exploiting a vulnerability in the computer networks. The main goal of data security is to develop security models to ensure data confidentiality, integrity, and availability on networks [2, 3]. The motivation to prevent security breaches against information systems on networks has prompted researchers to develop security models capable of detecting intrusions. Developing intrusion detection systems (IDSs) aims to distinguish between intrusion and normal attacks.

IDSs are security frameworks designed to protect information systems on networks. IDS can be classified based on their environments [4] and detection mechanisms [5,6,7]. IDSs are further classified based on their environment as host-based IDS (HIDS) and network-based IDS (NIDS) [8]. Host-based IDS can be defined as the IDS designed to detect attacks and vulnerabilities on the host computer. On the other hand, network-based IDS monitors the whole network boundary to identify intrusive traffic on the network before penetrating the host computers. Based on their detection mechanisms, IDSs are further classified as signature-based IDS (SIDS) and anomaly-based IDS (AIDS). Signature-based IDS, also known as knowledge-based intrusion detection system (KIDS) or misuse-based intrusion detection system (MIDS), compare incoming patterns with known attack database to detect deviations from known attack patterns. The false alarm rate for SIDS is extremely low but cannot detect new and unknown attacks. Therefore, researchers are focusing more on anomaly detection. Anomaly-based IDS, sometimes known as behavior-based intrusion detection systems (BIDS), is a better method for detecting unknown attacks by recognizing deviations of incoming patterns with a normal profile behavior. The advantage of AIDS is the ability to detect new attacks from any deviations from normal patterns, but one major drawback is the possibility of a high false alarm rate [9].

Several IDS which have been developed are still susceptible to attacks [10, 11]. Although many machine learning (ML) algorithms have been deployed for IDS to increase detection accuracy, existing IDS methods continue to struggle to achieve good results [9]. Researchers have affirmed the importance of feature selection methods in IDSs to increase detection accuracy. Feature selection is a method that can be used during the preprocessing stage to improve the accuracy of the base classifiers. The filter method is a more popular method for feature selection based on the dataset analysis for choosing the most important features without considering the base classifier performance. On the other hand, the wrapper method considers the base classifier performance for choosing the best feature subsets to increase classifier accuracy. The embedded method is considered relatively close to the wrapper method because it uses essential functions and considers the classifier performance for the choice of a better feature subset. The filter method has a faster processing time than the other feature selection methods with slow processing time.

Data mining algorithms have been frequently applied in implementing IDSs [12]. However, many developed data mining methods are either single or hybrid. This study developed a multi-level random forest algorithm for intrusion detection using a fuzzy inference system. The developed IDS method uses correlation-based feature selection and sequential forward selection (CFS-SFS) for the multi-level selection of features. The first phase of the multi-level features selection used a method called correlation-based feature selection to filter out irrelevant features. The output of the filtered features serves as input to the sequential forward feature selection, a wrapper method for selecting the most relevant features. The selected relevant features finally serve as input to the random forest classifier for intrusions detection. To know the severity level of a detected intrusion and prevent misclassification, fuzzy logic was used to classify an intrusion as either small, medium-small, medium-large, or large.

1.1 Motivations

Three forms of feature selection methods are available they are (i) wrapper, (ii) filtering, and (iii) embedding [13]. Techniques like filtering and embedding are found in different selection methods. While the latter employs both strategies, the first two are distinct. The wrapper technique relies on the accuracy of a predetermined learning algorithm in resolving a specific issue. The chosen features are evaluated based on their performance. There are two steps in the wrapper approach. It first looks for a subset of features, and then, in a subsequent phase, it uses the learning algorithm, which functions as a black box, to evaluate the features that have been chosen. These stages are repeated iteratively until a predetermined stopping criterion is satisfied. The wrapper technique has a problem because the search space for any $n$ features is ${2}^{n}$, it poses a problem for datasets with enormous dimensions. Different approaches have been developed to address the high-dimensional issue, such as best-first search, hill-climbing, and branch-and-bound search. Genetic algorithms can also improve the locally optimal training performance. The feature selection filtering techniques are separate from the learning algorithms and are more effective than wrapper approaches. However, the chosen features may not be the best due to a lack of a specific learning algorithm. Therefore, the filter methods are divided into two steps. A set of ranking criteria is used to rank the features first. Each feature can be ranked separately using a univariate feature ranking algorithm, or it can be multivariate, with a batch ranking of multiple features. The second stage extracts the characteristics using the aforementioned rating criteria [14].

This study addresses the issue of selecting a set of unique attributes that can improve an IDS’s classification accuracy. When a class label is present, the feature selection technique is simplified by calculating each feature’s impact on the class label’s predictions. Even when class labels are present, feature selection can be done unsupervised (supervised learning). This strategy ensures that the feature subset of the original attributes contains the optimal number of features and is more accurate at detection. The most critical stage is choosing the best representative features using ML-based algorithms. By eliminating redundant and irrelevant characteristics, dimensionality reduction is used to minimize the number of features. However, the calculation cost is higher when determining the ideal number of features from data with many features. This work employs a genetic search algorithm (GSA) to find the best features to improve classification performance. This task involved fine-tuning parameters while the GSA for the feature optimization problem was being implemented. The current approach also creates a novel fitness function for the task at hand. The GSA quickly converges and offers the best features with higher prediction accuracy when the settings are fine tuned.

1.2 Contributions

The study key contributions are as follows:

(a)
The design of a multi-level feature selection method to combine the advantages of the filter and wrapper feature selection methods, using the best features chosen, created the hybrid GSA model to train ML classifiers.

(b)
The use of a random forest classifier to improve detection accuracy

(c)
The design of a fuzzy logic model for intrusion classification reduces the likelihood of misclassification.

(d)
The proposed model's effectiveness is compared with cutting-edge intrusion detection systems and traditional feature selection approaches.

2 Related Work

In cybersecurity, ML is critical for detecting malicious and intrusive traffic. In other words, ML algorithms are frequently used in Internet of Things (IoT) risk management to identify IoT traffic. However, due to poor feature selection, ML approaches misclassify a wide range of malicious traffic in a secure IoT network. Therefore, selecting a feature set with enough data to identify smart IoT anomalies accurately and intrusion traffic is critical to solving the problem. This section discusses a few studies on IoT anomalies and intrusion attacks. In addition, several studies have demonstrated the effectiveness of feature selection techniques in the field of network security.

Anomaly and intrusion detection in IoT networks have received a lot of attention in recent years, and experts are working hard to find a solution [11]. Various types of cybersecurity solutions are suggested and used in an IoT network platform to protect computers and IoT applications from attacks and unauthorized access [15,16,17,18,19]. In 2017, for example, IoT distributed denial-of-service (DDoS) attacks increased by up to 172% [20, 21]. Similarly, when compared to 2013, the number of malicious attacks in 2017 has increased several times, with the vast majority of their attacks, such as Botnet attacks and others, being quite dangerous, according to a Kaspersky lab study [22]. Anderson proposed the first intrusion detection system in 1980 to combat the issue of cyberattacks [23]. The authors in [24] then presented a real-time intrusion detection expert systems paradigm, which was able to identify breaches, intrusions, leaks, Trojan horses, and other threats. However, their model employed assumptions to find malicious network attacks. Additionally, their analysis placed a particular emphasis on user activity to detect irregular processes. The man-in-the-middle (MITM) vulnerabilities have suddenly worsened thanks to DDoS [25]; however, these pose a significant danger to the IoT, and other researchers work hard to precisely identify, detect, and implement a plan to safeguard IoT networks against such dangerous intrusions.

Similarly, a novel approach called fog computing-based security (FOCUS) was unveiled in 2018 by authors in [26]. This technique is mainly employed to protect the IoT network against malware-based intrusions. The virtual private network (VPN), in their suggested concept, is employed to protect IoT communication pathways and channels [27, 28]. Additionally, in an IoT network context, their proposed security system can transmit notifications throughout DDoS attacks [29, 30]. Their study validated proof of concept for results evaluation, and they experimented on the proposed model to test the system's effectiveness. However, their experimental findings demonstrated that the suggested approach effectively percolates harmful attempts with a slight reduction in response time and bandwidth utilization.

The feature selection method is crucial and indispensable during data processing. However, feature selection entails choosing useful features from many attributes and eliminating unnecessary ones that do not offer identification-related information. In this regard, in [31], the authors reviewed some effective feature selection techniques based on correlation measurement techniques. They created a new method for the fast-based correlation features (FCBF) algorithm's functions to improve industrial IoT network capabilities. However, they convert the FCBF technique into the fast-based correlation features in pieces (FCBFiP) method for their experiments. The main goal was to partition the feature space into equal-sized segments. They suggested this strategy and enhanced the correlation and ML models running on each node. However, their suggested model performs better regarding model accuracy and throughput. Authors in [32] created a novel technique for identifying attacks coming from IoT devices, suggested an anomaly identification approach that extracts system performance, evaluated it experimentally, and used autoencoders to identify unusual network traffic coming from IoT devices. However, they utilized two well-known IoT-based botnet assaults to evaluate the suggested strategy, and some business devices in the IoT network were compromised by Mirai as well. Their suggested method can detect attacks on IoT devices, according to experimental findings. Similarly, authors in [33] proposed a features selection strategy to improve the functionality of IoT anomaly detection hardware. The data correlation variation between the IoT sensors was monitored in real time to detect identical deployed sensors, and the sensors with the highest correlation variances were selected as the features for anomaly classification. They investigated the window size for data calculation and clustering using curve alignment. Multi-cluster feature selection (MCFS) was then used to select the online feature selection scenario. They demonstrated that the proposed method effectively reduces the false negative (FN) rate of detecting IoT infrastructure anomalies.

In addition to the previously mentioned security technologies, for instance, attack [34, 35], and crucial management [36], the management of evidence [35] can be utilized for IoT security as well. However, in the literature review above, it was clear that finding a reliable and consistent feature set for anomaly and intrusion detection to classify IoT network data is crucial. The attributes selection method's main notion entails four critical steps. The subset generation, which produces a feature set; evaluation of the subset, in which the features are assessed through analysis; decision-making process, where decisions are made to either approve or reject a feature according to specific guidelines and subset validation.

2.1 Feature Selection

Feature selection methods are variable reduction techniques that can convert features from a high-dimensional space to a low-dimensional space while maintaining the classification algorithms’ efficiency [37]. In another word, feature selection is the extraction of the best features required for the development of a classifier with high detection accuracy and low false alarm. The goal of feature selection methods is to remove uncorrelated variables from the set of features while keeping the data useful to the classification model. Handling Big Data for ML is required in most fields today, including cybersecurity. Security data is proliferating, and intelligent and efficient management is required [25]. Data mining (DM) and machine learning (ML) techniques for high-dimensional datasets focus on generating relevant insights by minimizing dataset features. The dimensionality constraint is the primary issue that must be addressed to implement DM and ML techniques [25] A “dimensionality constraint” is data that is dispersed in high-dimensional space. This harms low-dimensional space learning methods [38]. Another issue is overfitting, which reduces the accuracy of the ML model when the data contains a large number of characteristics. A high feature count also generates a higher memory and computational cost [39]. The best solution to the high dimensionality problem is to reduce the dimensions of a given dataset using state-of-the-art feature reduction techniques. Feature selection can reduce dimensions [40, 41]. This technique converts numerous features of big data into a new, low-dimensional feature space. Feature selection refers to selecting the most appropriate feature subset from the provided input feature vector to assist the ML model train effectively.

In the real world, the datasets have noise that adds unnecessary and redundant characteristics. Eliminating the noise from the data helps accelerate the learning process, thus improving the classifier’s classification performance while lowering the false positives (FPs) and FNs [42]. There are two types of feature selection techniques: supervised and unsupervised. Supervised feature selection methods are typically created to solve classification or regression issues [43]. These strategies are used to separate the feature subset from the original features provided to estimate the targets in a regression analysis or to be able to discriminate between the classes of data that are accessible [44].

2.1.1 Genetic Search

A search method based on a genetic algorithm is called a genetic search (GA). Using computers to imitate the process of natural evolution served as the inspiration for GA. The GA was initially proposed as an ML algorithm by authors in [45]. The algorithm is an iterative one that normally starts with an initial population of random individual programs. The evaluation of their fitness measures determines the best individual programs in the population. Every iteration results in the next population of the fittest individuals, thanks to computerized genetic recombination and mixing.

2.1.2 Correlation-Based Feature Selection

Correlation-based feature selection (CFS) is a filter-based feature selection method that selects features based on their correlation with the class. The feature–class relationship is evaluated and the relationship with the highest correlation is chosen for selection. Based on this feature evaluation, the GSA assesses each characteristic’s attributes and creates elements with the best fitness value. If two feature subsets have the same fitness values, the genetic search additionally employs rule evaluation to return the feature subset with the fewest number of subgroups.

2.1.3 Sequential Forward Selection

Sequential forward selection (SFS) is a wrapper method that performs a bottom-to-up search process. The SFS starts from an empty set and sequentially adds features from the full feature set with already selected features that result in the highest classifier accuracy.

2.2 Random Forest Algorithm

Random forest (RF) is an ensemble method of decision trees. The algorithm works by producing a different number of decision trees from different samples and takes their majority vote for the classification decision. The benefit of RF is that increased precision can be attained without the risk of overfitting.

2.3 Fuzzy Logic

Fuzzy logic is described as a multi-valued algebra in which the truth values are all intermediate values between 0 and 1, inclusive [46]. Fuzzy logic has become a popular application for problems relating to uncertainty and classification [47,48,49,50]. The advantage of using fuzzy logic for intrusion detection is that one can capture the overlapping severity grades of intrusions.

3 Methodology

This study developed a multi-level random forest algorithm for intrusion detection using a fuzzy inference system (ML-RFID-FIS). The developed ML-RFID-FIS is divided into four major phases: dataset preprocessing, feature selection, detection, and classification. The dataset preprocessing phase involves feature encoding and normalization to make the data interpretable and easily understood by the ML model. Training and testing sets were created from the dataset. 80% of the data are in the training set, while the remaining 20% are for the testing set. The feature selection phase used a multi-level feature selection approach to blend the advantages of the filter and wrapper methods. The first stage (filter method) of the multi-level feature selection used correlation-based feature selection to select essential features based on the multi-collinearity in the data. The second stage (wrapper method) used a sequential forward selection method to further select top features based on the accuracy of the baseline classifier. This is because the filter methods are not affected by the classifier, hence the wrapper method. The random forest technique is then used to detect intrusions using the chosen top features. Fuzzy logic was used to classify intrusions as either normal, low, medium, or high to reduce misclassification. Python programming language was used for the implementation. Figure 1 describes the architecture of the developed ML-RFID-FIS.

3.1 Dataset Description

The NSL-KDD dataset was adopted for implementation due to its effectiveness in intrusion detection [51]. NSL-KDD is a dataset that has been proposed as a solution to some of the issues in the existing IDS datasets. The dataset is a perfect representation of real networks. It can be used as a standard benchmark dataset for the design of IDS relating to Internet traffics. The 41 attributes in the dataset are classified as either normal or anomalous. The characteristics can be broken down into three categories: time based (19 features), connection based (9 features), and content based (13 features). Four types of attacks may be distinguished from the dataset: probing, denial of service (DoS), user-to-root (U2R), and remote-to-local (R2L).

(a)
Probing attack (Pr): This is the accumulation of system information by testing it to discover vulnerabilities that can be used to compromise it later. Some of the probing attacks are ipsweep, portsweep, nmap, and satan.
(b)
Denial of service (DOS): An attack in which the host system is flooded by unwanted messages by the attacker preventing authorized users from gaining access to resources or services. Some examples of DoS attacks are Neptune, Smurf, tear drop, pod, and mail bomb.
(c)
User-to-root attack (U2R): The attacker begins the attack using a regular user account to gain entry to the system. The attack then exploits the system flaws to gain access to resources that should normally be unavailable to them. Examples of U2R are buffer overflow, loadmodule, and Perl.
(d)
Remote-to-local attack (R2L): This is an intrusion committed by an attacker with authorization to send packets to a machine connected to the network with no identity on that machine. The attacker exploits some system flaws to attain remote access to the machine as a user. Examples are Guess_passwd, imap, and spy.

3.2 Dataset Preprocessing Phase

The training data (TR) for the implementation consists of 9500 randomly chosen records from the NSL-KDD training file, while the testing data (TE) comprises 4500 randomly chosen records from the NSL-KDD test file. Additionally, numeric encoding is used for symbolic properties (such as protocol type, service, flag, and class). The TR is split into two partitions: TR1 and TR2 of 65% and 35%, respectively. Table 1 shows the summary of the dataset.

Table 1 Summary of the dataset

A Multi-level Random Forest Model-Based Intrusion Detection Using Fuzzy Inference System for Internet of Things Networks

Abstract

Similar content being viewed by others

Modified parallel random forest for intrusion detection systems

IoT Intrusion Detection Using Modified Random Forest Based on Double Feature Selection Methods

A random-forests-based classifier using class association rules and its application to an intrusion detection system

Explore related subjects

1 Introduction

1.1 Motivations

1.2 Contributions

2 Related Work

2.1 Feature Selection

2.1.1 Genetic Search

2.1.2 Correlation-Based Feature Selection

2.1.3 Sequential Forward Selection

2.2 Random Forest Algorithm

2.3 Fuzzy Logic

3 Methodology

3.1 Dataset Description

3.2 Dataset Preprocessing Phase

3.3 Feature Selection Phase

3.4 Detection Phase

3.5 Classification Phase

3.5.1 Fuzzy Set

3.5.2 Linguistic Variables

3.5.3 Fuzzification

3.5.4 Fuzzy Rules

3.5.5 Inference Engine

4 Experimental Implementation of the Proposed System

4.1 Implementation

4.2 Dataset Preprocessing

4.3 Feature Selection

4.4 Results and Discussion

4.5 Comparative Analysis with Existing Models

5 Conclusion and Future Work

Data Availability

Abbreviations

References

Funding

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Conflict of Interest

Ethical Approval

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation