1 Introduction

The rapid proliferation of Internet-of-Things (IoT) devices has revolutionized various industries, enabling seamless connectivity, data exchange, and automation. However, the widespread adoption of IoT technology has also brought forth new security challenges, particularly in the realm of network intrusion detection. IoT comprises an extensive number of resource-constrained and heterogeneous devices [1]. As a result, each layer of the three-tier IoT environment represents a potential attack surface and suffers from a variety of security threats.

The threat landscape changes when moving from conventional networks to IoT-based networks. This shift introduces unique challenges and expands the attack surface, making IoT networks more susceptible to various security threats. As the number of machines and smart devices connected to the network grows, the vulnerabilities of IoT security gradually increase. IoT networks typically consist of a vast number of interconnected devices with diverse hardware, operating systems, and communication protocols. This scale and heterogeneity make it challenging to implement consistent security measures across all devices, leading to potential vulnerabilities [42].

The limited resources of IoT devices restrict the ability to install protective security solutions. IoT devices often have limited computational power, memory, and energy resources. This limitation makes it difficult to deploy resource-intensive security solutions, such as robust encryption or intrusion detection systems, on all IoT devices. Attackers can exploit these resource constraints to launch attacks and compromise the devices. Moreover, many IoT devices lack proper mechanisms for software updates and patches, which adds further difficulty to an already challenging environment [2]. This makes it hard to address known vulnerabilities and apply security fixes promptly, leaving devices exposed to known attacks.

IoT standards and protocols are still evolving [3], resulting in a lack of uniform security practices across different IoT devices and ecosystems. Inconsistent security implementations can create vulnerabilities, as attackers can exploit weak links in the network. This is especially true considering that IoT devices are often deployed in physically exposed and uncontrolled environments, such as industrial settings or public infrastructure [4]. This physical exposure increases the risk of physical tampering, unauthorized access, and device compromise.

Securing IoT is the only way to support its continued spread; the alternative is its decay. Detecting and mitigating intrusions in IoT networks is of paramount importance to safeguard sensitive data, ensure privacy, and maintain the integrity of IoT systems. Network intrusion detection plays a crucial role in providing real-time protection for IoT environments. It is used to monitor network traffic and distinguish between normal and abnormal network behaviors. Traditional network intrusion detection systems (NIDS) may not be well suited to address the unique characteristics and challenges presented by IoT networks [5]. The scale, heterogeneity, and resource constraints of IoT devices necessitate innovative approaches to effectively detect and respond to network intrusions.

Incorporating Machine Learning (ML) into the defense architecture has shown promise in this domain. It contributes to achieving higher detection accuracy rates in addition to the capability of detecting zero-day infections [6]. Boosting is an ensemble modeling technique that improves the predictive accuracy of ML algorithms by combining weak or base learning models into a strong predictive model [7]. Its core idea is to iteratively train the base models and then combine their predictions in order to improve the accuracy of the overall ensemble model.

The objective of this paper is to conduct a comparative study on the effectiveness of boosting-based machine learning algorithms for IoT network intrusion detection. It conducts a comprehensive comparative study of multiple boosting-based models, i.e., Adaptive Boosting (ADB), Gradient Descent Boosting (GDB), Extreme Gradient Boosting (XGB), Categorical Boosting (CAB), Hist Gradient Boosting (HGB), and Light Gradient Boosting (LGB). This study aims to evaluate their efficacy within the context of IoT network security and identify the most suitable algorithm for accurate and efficient intrusion detection. The findings of this comparative study will contribute to the existing body of knowledge on IoT network security and intrusion detection. By identifying the most effective boosting-based machine learning algorithm for IoT network intrusion detection, this research can guide the development of robust and efficient intrusion detection systems tailored to the unique characteristics and constraints of IoT environments. Furthermore, the insights gained from this study can inform the design of proactive security measures to mitigate the risks associated with IoT network intrusions.

The main contributions of this paper are fourfold:

  1.

    Examining the literature on using boosting-based ML algorithms for IoT network intrusion detection.

  2.

    Conducting an exploratory data analysis (EDA) of the N-BaIoT data set [8] to analyze and summarize its main characteristics and features.

  3.

    Investigating the potential of boosting-based methods for detecting IoT botnet attacks through an experimental performance evaluation of six boosting-based ML algorithms: ADB, GDB, XGB, CAB, HGB, and LGB.

  4.

    Benchmarking the six models through a computational analysis to gain more insight into how lightweight they are for an IoT environment.

The remaining sections of this paper are structured as follows. Section 2 surveys the related work. Section 3 presents a background for the boosting-based ML algorithms. Section 4 demonstrates the evaluation scheme: it describes the data set used, the data set preprocessing, and the evaluation metrics. Section 5 introduces the experimental results for model performance evaluation. Section 6 concludes this work and outlines possible future research directions.

2 Related Work

A number of studies have been proposed for detecting network intrusions in the IoT environment. This section investigates papers applying boosting-based ML algorithms for detecting intrusions in IoT environments. A quantitative systematic review approach was followed to select relevant studies. An extensive search was conducted on scientific databases including IEEE Xplore, ScienceDirect, Scopus, and ResearchGate. The search was limited to publications written in English and published in scientific journals, conferences, or theses. All combinations of “machine learning”, “boosting”, “intrusion detection”, and “IoT” were used in the title, abstract, and keywords over the period from 2017 to 2023. The focus was only on work published during that period because the trigger for this research field was the botnet malware (Mirai) reported in 2016. The US Computer Emergency Readiness Team (US-CERT) reported a botnet malware that had disrupted the services of a major US Internet provider. It caused a disruption of multiple major websites via a series of massive distributed denial of service (DDoS) attacks, spread quickly, and infected thousands of malicious endpoints. The ML umbrella covers several learning techniques. Boosting algorithms have been around for years, yet only recently have they become mainstream in the ML community. This section surveys and discusses the literature on boosting-based network intrusion detection in the IoT environment.

Kumar et al. [9] used a two-step process, consisting of a detection step and an analysis step, for identifying peer-to-peer (P2P) bots. For the classification step, tenfold cross-validation was used on Random Forest (RF), Decision Tree (DT), and XGB. Their approach achieved a detection rate of 99.88%. They trained the model for P2P botnet detection using traffic from three botnets, namely Waledac, Vinchuca, and Zeus.

Liu et al. [10] studied eleven ML algorithms for detecting intrusions in Contiki-NG-Based IoT Networks. They reported that XGB achieved the best performance with 97% accuracy using the NSL–KDD data set.

Alqahtani et al. [11] proposed using the Fisher score to reduce the number of features for IoT botnet attack detection. Their approach used a genetic-based extreme gradient boosting (GXGB) model. The Fisher score allowed them to select only three out of the 115 features of the N-BaIoT data set [8]. Their approach achieved an accuracy of 99.96%. Dash et al. [12] proposed a multi-class Adaptive Boosting (ADB) model for predicting the anomaly type. They used the IoT security data set from DS2OS [13] for the model evaluation. This data set covers eight types of anomalies, namely data probing, denial of service (DoS), malicious control, malicious operation, scan, spying, and wrong setup. They reported an anomaly detection accuracy of 95%.

Krishna et al. [14] proposed a hybrid approach based on ML and feature selection. The NSL–KDD [8] and N-BaIoT [14] data sets were used, applying recursive feature elimination (RFE) for feature selection. They reported an accuracy of 99.98% and compared it with a GDB classifier, which achieved an accuracy of 99.30%. Hazman et al. [15] proposed an approach for intrusion detection in IoT-based smart environments with ensemble learning, called IDS–SIoEL. Their approach uses ADB combined with different feature selection techniques, namely Boruta, mutual information, and correlation. They evaluated their approach on the IoT-23, BoT–IoT [16], and Edge-IIoT data sets and reported a detection accuracy of 99.90%.

Ashraf et al. [17] proposed a federated intrusion detection system for blockchain-enabled IoT healthcare applications. Their approach is based on using lightweight artificial neural networks (ANN) in a federated learning manner. In addition, it uses blockchain technology to provide a distributed ledger for aggregating the local weights and then broadcasting the updated global weights after averaging. They compared ANN and XGB models on the BoT–IoT data set. The results show that the ANN achieved a better accuracy of 99.99% compared to 98.96% for XGB.

Khan et al. [18] proposed a proactive interpretable prediction model to detect different types of security attacks using the log data generated by heating, ventilation, and air conditioning (HVAC) systems. Several ML algorithms were used, such as DT, RF, GDB, ADB, LGB, XGB, and CAB. They reported that the XGB classifier produced the best result with 99.98% accuracy. Their study was performed using the HVAC systems data set of Elnour et al. [19].

Alissa et al. [20] proposed a DT model, an XGB model, and a logistic regression (LR) model. They used the UNSW-NB15 data set, applying a feature correlation technique that resulted in discarding nine features. They reported that the DT performed best with 94% test accuracy, slightly higher than XGB, while LR achieved the worst accuracy.

Al-Haija et al. [21] proposed an ensemble learning model for botnet attack detection in IoT. Their approach applies voting-based probability to ensemble three ML classifiers, i.e., ADB, a random under-sampling boosting (RUS) model, and a bagged model. The individual accuracies of the selected classifiers were 97.30%, 97.70%, and 96.20%, respectively, while the proposed ensemble model achieved 99.60%.

Garg et al. [22] compared the performance of boosting techniques with non-boosting ensemble-based techniques. They identified two types of attacks, IoT attacks and DDoS attacks, as binary-class and multi-class outputs, respectively. Three data sets were used for the evaluation: BoT–IoT, IoT-23, and CIC–DDoS-2019. Two boosting methods were used, i.e., XGB and LGB. LGB achieved the best performance with an accuracy of 94.79%.

Bhoi et al. [23] proposed an LGB-based model for anomaly detection in the IoT environment. They used Gravitational Search-based Optimization (GSO) for optimizing the LGB hyper-parameters and compared it with Particle Swarm Optimization (PSO). They used a simulated IoT sensors data set, called the IoT data set, which is cited in [24]. They reported an optimal accuracy of 100%.

Awotunde et al. [25] proposed a boosting-based model for intrusion detection in Industrial Internet-of-Things (IIoT) networks. They investigated the detection performance of various ensemble classifiers, such as XGB, Bagging, Extra Trees (ET), RF, and ADB. They utilized the telemetry data of the TON_IoT data sets. The results indicated that XGB showed the highest accuracy in detecting and classifying IIoT attacks. Rani et al. [26] compared several algorithms for intrusion detection in IoT environments, i.e., LR, RF, XGB, and LGB. They utilized the DS2OS data set [27] and reported that XGB and LGB achieved almost equal accuracies of 99.92%.

Table 1 presents a comparative analysis of the related work in tabular form. It lists the surveyed papers adopting boosting-based approaches for detecting IoT network intrusions, ordered chronologically, and describes their characteristics in terms of the objective, the employed boosting algorithm(s), the evaluation data set, the number of classes, the number of features, and the reported accuracy.

Table 1 Comparative analysis for the related work

3 ML Boosting-Based Algorithms

In the context of IoT intrusion detection, ML techniques play a crucial role in enhancing security measures. One such technique is boosting, which leverages the concept of ensemble supervised learning to strengthen detection capabilities. By combining several learners into a strong model, boosting effectively reduces bias and variance in the prediction process [28]. These simple models are called weak learners or base estimators [7]. Boosting aggregates the predictions of its constituent learners in a sequential manner, such that each learner corrects the error of its predecessor and updates the residual error. This section introduces six different boosting-based algorithms. These algorithms build upon the principles of ensemble learning, empowering ML models to achieve higher prediction accuracy in detecting and mitigating intrusions in IoT systems.

The general architecture of boosting techniques consists of the following steps; a minimal code sketch of this loop is given after the list:

  1.

    Initialize the training data set and assign equal weights to each training instance.

  2.

    Train a base learner on the weighted data set.

  3.

    Adjust the weights of misclassified instances to give them higher importance.

  4.

    Repeat steps 2 and 3 for a specified number of iterations (or until a stopping criterion is met).

  5.

    Combine the predictions of all weak learners using a weighted voting or averaging scheme to obtain the final prediction.
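As a concrete illustration, the minimal sketch below implements an AdaBoost-style instance of this loop. It assumes binary labels encoded as -1/+1 and uses a decision stump as the weak learner; the function names are illustrative only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=50):
    """Minimal AdaBoost-style boosting loop over the five steps above (y in {-1, +1})."""
    w = np.full(len(y), 1.0 / len(y))                 # step 1: equal initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):                         # step 4: repeat for T rounds
        stump = DecisionTreeClassifier(max_depth=1)   # weak learner (decision stump)
        stump.fit(X, y, sample_weight=w)              # step 2: train on the weighted data
        miss = stump.predict(X) != y
        err = w[miss].sum()
        if err == 0 or err >= 0.5:                    # stopping criterion
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)                  # step 3: boost misclassified weights
        w = w / w.sum()                               # keep the weights normalized
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def boosted_predict(X, learners, alphas):
    """Step 5: sign of the weighted vote of all weak learners."""
    votes = sum(a * h.predict(X) for h, a in zip(learners, alphas))
    return np.sign(votes)
```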

3.1 Adaptive Boosting

The Adaptive Boosting algorithm (AdaBoost or ADB) of Freund and Schapire was the first practical boosting algorithm [29]. The algorithm begins by fitting a classifier on the original data set and then fits additional copies of the classifier on the same data set. It assigns higher weights to incorrectly classified instances and lower weights to correctly classified instances in order to focus more on difficult cases. This process repeats until the best possible result is achieved or the specified number of estimators has been used. It is implemented in the scikit-learn class sklearn.ensemble.AdaBoostClassifier [30].

The mathematical architecture of the ADB model training can be summarized as follows [31]:

  1.

    Initialize the weights of the training samples:

    $$\begin{aligned} w_i = 1/N, \end{aligned}$$
    (1)

    where N is the number of training samples.

  2.

    For each boosting iteration t = 1 to T:

    • Train a weak learner on the training data with weights \(w_i\).

    • Compute the weak learner’s error rate:

      $$\begin{aligned} \epsilon _t = \Sigma _i w_i * \delta (y_i \ne h_t(x_i)), \end{aligned}$$
      (2)

      where \(h_t(x_i)\) is the weak learner’s prediction for sample \(x_i\) and \(\delta \) is the Kronecker delta.

    • Compute the weak learner’s weight in the ensemble:

      $$\begin{aligned} \alpha _t = 0.5 * \ln ((1 - \epsilon _t) / \epsilon _t). \end{aligned}$$
      (3)
    • Update the sample weights:

      $$\begin{aligned} w_i = w_i * exp(\alpha _t * \delta (y_i \ne h_t(x_i))). \end{aligned}$$
      (4)
    • Normalize the sample weights:

      $$\begin{aligned} w_i = w_i / \Sigma _i w_i. \end{aligned}$$
      (5)
  3.

    Output the final boosted model: \(H(x) = sign(\Sigma _t \alpha _t * h_t(x))\).

ADB is known for its ability to handle complex data sets and achieve high accuracy. It focuses on misclassified samples, giving them higher weights in subsequent iterations, leading to improved performance. It is resistant to over-fitting and can work well with weak classifiers. However, it can be sensitive to noisy data and outliers, which can negatively impact its performance, and it may struggle with data sets that have imbalanced class distributions.
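As a usage illustration, the scikit-learn implementation mentioned above can be applied as in the following minimal sketch; X_train, y_train, X_test, and y_test are assumed to be a train/test split of the preprocessed data set described in Sect. 4.2, and the hyper-parameter values are indicative only.

```python
from sklearn.ensemble import AdaBoostClassifier

# The default base learner is a depth-1 decision tree (a stump), as described above.
adb = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)
adb.fit(X_train, y_train)
y_pred = adb.predict(X_test)
print("ADB accuracy:", adb.score(X_test, y_test))
```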

3.2 Gradient Descent Boosting

Gradient descent boosting (GDB) is an extension of the boosting technique in which the process of additively generating weak models is formalized as a gradient descent algorithm. The final prediction is a weighted sum of all the tree predictions, and all its weak learners are decision trees. The idea is to take a weak hypothesis or weak learning algorithm and make a series of tweaks to it that improve the strength of the hypothesis/learner. This type of hypothesis boosting is based on the idea of Probably Approximately Correct (PAC) learning. Gradient boosting classifiers combine the AdaBoost method with weighted minimization, after which the classifiers and weighted inputs are recalculated. The objective of gradient boosting classifiers is to minimize the loss. It is implemented in the scikit-learn class sklearn.ensemble.GradientBoostingClassifier [30].

The mathematical architecture of the GDB model training can be summarized as follows [32]:

  1.

    Initialize the model’s predictions:

    $$\begin{aligned} F_0(x) = argmin_c \Sigma _i L(y_i, c), \end{aligned}$$
    (6)

    where L is the loss function, and c is the predicted value.

  2.

    For each boosting iteration t = 1 to T:

    • Compute the negative gradient of the loss function:

      $$\begin{aligned} r_{it} = -[\delta L(y_i, F(x_i)) / \delta F(x_i)]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (7)
    • Fit a weak learner to the negative gradient:

      $$\begin{aligned} h_t(x) = argmin_h \Sigma _i L(r_{it}, h(x_i)) \end{aligned}$$
      (8)
    • Update the model’s predictions:

      $$\begin{aligned} F_t(x) = F_{t-1}(x) + \eta * h_t(x) \end{aligned}$$
      (9)

      where \(\eta \) is the learning rate.

  3.

    Output the final boosted model: \(H(x) = F_T(x)\).

GDB builds models sequentially, minimizing the loss function by gradient descent, resulting in improved performance. However, it can be computationally expensive and may require careful tuning of hyper-parameters. It is more prone to over-fitting compared to other algorithms.
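A minimal scikit-learn sketch of the GDB classifier follows, under the same assumed train/test split; the hyper-parameter values are illustrative rather than the tuned settings used in the experiments.

```python
from sklearn.ensemble import GradientBoostingClassifier

gdb = GradientBoostingClassifier(
    n_estimators=100,    # number of boosting iterations T
    learning_rate=0.1,   # the learning rate eta of Eq. (9)
    max_depth=3,         # depth of each weak decision tree
    random_state=42,
)
gdb.fit(X_train, y_train)
y_pred = gdb.predict(X_test)
```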

3.3 Extreme Gradient Boosting

Extreme Gradient Boosting (XGB) is an improved version of the GBM algorithm. It implements machine learning algorithms under the gradient boosting framework. Its working procedure is the same as GBM, except that XGB performs parallel processing at the node level, which makes it generally over ten times faster than GBM [33]. XGB also includes a variety of regularization techniques that reduce over-fitting and improve overall performance. The mathematical architecture of the XGB model training can be summarized as follows [34]:

  1.

    Initialize the model’s predictions:

    $$\begin{aligned} F_0(x) = argmin_c \Sigma _i L(y_i, c), \end{aligned}$$
    (10)

    where L is the loss function.

  2.

    For each boosting iteration t = 1 to T:

    • Compute the negative gradient of the loss function:

      $$\begin{aligned} g_{it} = -[\delta L(y_i, F(x_i)) / \delta F(x_i)]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (11)
    • Compute the second derivative approximation of the loss function:

      $$\begin{aligned} h_{it} = [\delta ^2L(y_i, F(x_i)) / \delta F(x_i)^2]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (12)
    • Fit a weak learner to the negative gradient and second derivative:

      $$\begin{aligned} h_t(x) = argmin_h \Sigma _i [g_{it} * h(x_i) + 0.5 * h_{it} * h(x_i)^2] + \Omega (h), \end{aligned}$$
      (13)

      where \(\Omega \)(h) is the regularization term.

    • Update the model’s predictions:

      $$\begin{aligned} F_t(x) = F_{t-1}(x) + \eta * h_t(x), \end{aligned}$$
      (14)

      where \(\eta \) is the learning rate.

  3.

    Output the final boosted model: \(H(x) = F_T(x)\).

XGB excels in both speed and performance. It supports parallel processing and has a comprehensive set of hyper-parameters for fine-tuning. However, it is sensitive to hyper-parameter settings. Selecting the optimal combination of hyper-parameters can be time-consuming and computationally expensive. Additionally, the interpretability of XGB models can be challenging due to their complexity.
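For illustration, a minimal sketch using the xgboost Python package is given below; the parameter values are indicative only, and the class labels are assumed to be integer-encoded (0, 1, 2 for the ternary case).

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    reg_lambda=1.0,       # L2 regularization, part of the Omega(h) term in Eq. (13)
    tree_method="hist",   # histogram-based split finding
    n_jobs=-1,            # parallel tree construction
    random_state=42,
)
xgb.fit(X_train, y_train)
y_pred = xgb.predict(X_test)
```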

3.4 Light Gradient Boosting

Light Gradient Boosting (LGB) is an ensemble learning method. It is an implementation of Gradient Boosted Decision Trees (GBDT), similar to random forest [35]. It combines multiple decision trees to obtain a better prediction and uses boosting to eliminate the residual error. LGB is able to handle huge amounts of data with ease, but it does not perform well with a small number of data points. The trees in LGB have a leafwise growth rather than a levelwise growth: after the first split, the next split is done only on the leaf node that has a higher delta loss. To speed up the training process, LGB uses a histogram-based method for selecting the best split. Observing the high training time requirement of gradient boosting decision trees, Ke et al. [28] proposed two novel techniques to overcome this challenge: Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB). This new implementation was named LGB, and it improved the training and inference time of GBDT by up to 20 times.

The mathematical architecture of the LGB model training can be summarized as follows [36]:

  1.

    Initialize the model’s predictions:

    $$\begin{aligned} F_0(x) = argmin_c \Sigma _i L(y_i, c), \end{aligned}$$
    (15)

    where L is the loss function

  2.

    For each boosting iteration t = 1 to T:

    • Compute the negative gradient of the loss function:

      $$\begin{aligned} g_{it} = -[\delta L(y_i, F(x_i)) / \delta F(x_i)]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (16)
    • Compute the second derivative approximation of the loss function:

      $$\begin{aligned} h_{it} = [\delta ^2L(y_i, F(x_i)) / \delta F(x_i)^2]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (17)
    • Grow a tree with a leafwise approach, selecting the best split based on the gradients and second derivatives.

    • Update the model’s predictions:

      $$\begin{aligned} F_t(x) = F_{t-1}(x) + \eta * h_t(x), \end{aligned}$$
      (18)

      where \(\eta \) is the learning rate.

  3.

    Output the final boosted model: \(H(x) = F_T(x)\).

LGB utilizes a leafwise tree growth strategy and gradient-based optimization, resulting in faster training times and lower memory usage. However, it may not perform well when dealing with smaller data sets. It is more sensitive to over-fitting and may require careful regularization. The interpretability of LGB models can be challenging due to their complex nature.
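A minimal sketch with the lightgbm Python package is given below; the values are illustrative, and depending on the library version GOSS is enabled through a separate sampling-strategy option rather than by default.

```python
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,     # leaf-wise growth is bounded by the number of leaves, not the depth
    n_jobs=-1,
    random_state=42,
)
lgb.fit(X_train, y_train)
y_pred = lgb.predict(X_test)
```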

3.5 Categorical Boosting

Categorical Boosting (also known as CatBoost or CAB) is an algorithm for gradient boosting on decision trees [37]. It is a special version of GBDT. It solves problems with ordered features while also supporting categorical features. It shuffles the data randomly, and the mean is calculated for every object using only its historical data. It constructs feature combinations in a greedy way. It incorporates ordered boosting, a permutation-driven alternative to conventional gradient boosting. Such permutations decrease the variance of the final model predictions compared to the general boosting algorithm [38].

The mathematical architecture of the CAB model training can be summarized as follows [39]:

  1.

    Initialize the model’s predictions:

    $$\begin{aligned} F_0(x) = argmin_c \Sigma _i L(y_i, c), \end{aligned}$$
    (19)

    where L is the loss function.

  2.

    For each boosting iteration t = 1 to T:

    • Compute the pseudo-residuals:

      $$\begin{aligned} r_{it} = -[\delta L(y_i, F(x_i)) / \delta F(x_i)]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (20)
    • Fit a weak learner to the pseudo-residuals and the categorical features.

    • Update the model’s predictions:

      $$\begin{aligned} F_t(x) = F_{t-1}(x) + \eta * h_t(x). \end{aligned}$$
      (21)
  3.

    Output the final boosted model: \(H(x) = F_T(x)\).

CAB provides good accuracy and robustness to noisy data. It also offers built-in handling of missing values. However, it can be computationally expensive and slower compared to other boosting algorithms, especially with large data sets. It may require more computational resources during training. Tuning its hyper-parameters can be challenging due to the increased complexity.
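The sketch below shows a minimal use of the catboost Python package for the ternary problem; the parameter values are illustrative, and since all N-BaIoT features are numeric no categorical feature indices are passed.

```python
from catboost import CatBoostClassifier

cab = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    loss_function="MultiClass",   # ternary benign / Mirai / Gafgyt classification
    verbose=False,
    random_seed=42,
)
cab.fit(X_train, y_train)   # cat_features=[...] would mark categorical columns if any were present
y_pred = cab.predict(X_test)
```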

3.6 Hist Gradient Boosting

Histogram-based Gradient Boosting Classification Tree (HGB) is much faster than the Gradient Boosting Classifier for big data sets. Its implementation is inspired by LGB. During training, the tree grower learns at each split point, based on the potential gain, whether samples with missing values should go to the left or right child. When predicting, samples with missing values are assigned to the left or right child accordingly. If no missing values were encountered for a given feature during training, then samples with missing values are mapped to whichever child has the most samples. It is implemented in the scikit-learn class sklearn.ensemble.HistGradientBoostingClassifier [30]. The mathematical architecture of the HGB model training can be summarized as follows [40]:

  1.

    Initialize the model’s predictions:

    $$\begin{aligned} F_0(x) = argmin_c \Sigma _i L(y_i, c), \end{aligned}$$
    (22)

    where L is the loss function.

  2.

    For each boosting iteration t = 1 to T:

    • Compute the negative gradient of the loss function:

      $$\begin{aligned} g_{it} = -[\delta L(y_i, F(x_i)) / \delta F(x_i)]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (23)
    • Compute the second derivative approximation of the loss function:

      $$\begin{aligned} h_{it} = [\delta ^2L(y_i, F(x_i)) / \delta F(x_i)^2]_{F(x_i)=F_{t-1}(x_i)}. \end{aligned}$$
      (24)
    • Construct a histogram of the feature values and their corresponding gradients and second derivatives.

    • Find the best split points in the histogram using a greedy algorithm.

    • Compute the leaf values for the histogram bins.

    • Update the model’s predictions:

      $$\begin{aligned} F_t(x) = F_{t-1}(x) + \eta * h_t(x), \end{aligned}$$
      (25)

      where \(\eta \) is the learning rate.

  3.

    Output the final boosted model: \(H(x) = F_T(x)\).
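As with the other algorithms, a minimal scikit-learn usage sketch for HGB is given below; the parameter values are illustrative only.

```python
from sklearn.ensemble import HistGradientBoostingClassifier

hgb = HistGradientBoostingClassifier(
    max_iter=100,        # number of boosting iterations T
    learning_rate=0.1,
    max_bins=255,        # histogram resolution used for split finding
    random_state=42,
)
hgb.fit(X_train, y_train)    # NaN feature values are routed natively at each split
y_pred = hgb.predict(X_test)
```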

Table 2 Data set attributes information
Fig. 1 Feature extraction

4 Evaluation Scheme

This section demonstrates the evaluation environment. It describes the data set used, the applied data preprocessing and the evaluation metrics to be used for performance evaluation.

4.1 Data Set Description

This section describes the data set used in the experimental framework. The N-BaIoT [41] data set is selected for training and evaluation purposes, as it is widely accepted as a benchmark sequential data set. It contains realistic network traffic and a variety of attack traffic. It was introduced by Meidan et al. [8], who gathered the traffic of nine commercially available IoT devices authentically infected by the Mirai and Bashlite malware. The devices were two smart doorbells, one smart thermostat, one smart baby monitor, four security cameras, and one webcam. Traffic was captured when the devices were in normal execution and after infection with malware. The traffic was captured through a network sniffing utility into the raw network traffic packet capture (PCAP) format, which can be achieved using port mirroring. Five features are extracted from the network traffic, as abstracted in Table 2. Three or more statistical measures are computed for each of these five features for data aggregation, resulting in a total of 23 features. These 23 distinct features are computed over five separate time windows (100 milliseconds (ms), 500 ms, 1.5 seconds (s), 10 s, and 1 minute), as demonstrated in Fig. 1. Using time windows makes this data set appropriate for stateful IDS and results in a total of 115 features. Naveed et al. [41] organized this data set in an easier file structure and made it available on Kaggle.

The data set contains instances of network traffic data divided into three categories: normal traffic (benign data), Bashlite-infected traffic, and Mirai-infected traffic. Each data instance consists of 115 features represented by 23 different traffic characteristics over 5 different time frames. Table 2 presents an abstracted demonstration of the data set attribute information. The attacks executed by the botnets include: scan attacks that discover vulnerable devices; flooding attacks that make use of SYN, ACK, UDP, and TCP flooding; and combo attacks that open connections and send junk data.

Our study uses Meidan’s data set in the format organized by Naveed et al. [41]. Figure 2 shows the data exploration for the data set grouped by the three labeled types, i.e., benign, Mirai, and Gafgyt. Figure 3 shows the individual distribution of the 10 malware classes in addition to the benign traffic.

4.2 Data Set Preprocessing

Data preprocessing is the process of preparing the data set for analysis. It is an essential step in ML, as it helps to ensure that the data is appropriate and correct for feeding into the model. As demonstrated during the data set exploration in Sect. 4.1, the data set is imbalanced and spread across many files based on the attack type, as shown in Fig. 3.

Fig. 2 Data set exploration

Fig. 3 Distribution of packets for each class

The data set files are integrated into three main categories, i.e., Benign, Mirai, and Gafgyt. The Benign category contains all normal traffic records, represented in light green in Fig. 3. The Mirai category includes all Mirai-related attacks, i.e., Mirai_Ack, Mirai_Scan, Mirai_Syn, Mirai_Udp, and Mirai_Udpplain, represented in blue in Fig. 3. The data set file of this category is called “All_Mirai”. The third category is Gafgyt, and it includes all Gafgyt-related attacks, i.e., Gafgyt_Combo, Gafgyt_Junk, Gafgyt_Scan, Gafgyt_Tcp, and Gafgyt_Udp, represented in red in Fig. 3. The data set file of this category is called “All_Gafgyt”.

To deal with the data set imbalance, a balanced data set is created for ternary classification. It contains the three categories, i.e., Benign, Mirai, and Gafgyt, with 555,932 rows each. The Benign category contains all 555,932 benign rows, while the Mirai and Gafgyt categories are created by shuffling “All_Mirai” and “All_Gafgyt”, respectively, and selecting only 555,932 rows from each of them. The overall size of the ternary data set is 1,667,796 records. Since the implementation is in Python, every step in the data set preprocessing creates an index column. To avoid the bias of such columns, they are removed from the data set.
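The following sketch illustrates this integration and balancing step with pandas; the directory layout and file-name patterns are assumptions about the Kaggle release rather than the exact paths used in our implementation.

```python
import glob
import pandas as pd

# Load and integrate the per-device files into the three categories (assumed name patterns).
benign = pd.concat(pd.read_csv(f) for f in glob.glob("N-BaIoT/*benign*.csv"))
mirai  = pd.concat(pd.read_csv(f) for f in glob.glob("N-BaIoT/*mirai*.csv"))
gafgyt = pd.concat(pd.read_csv(f) for f in glob.glob("N-BaIoT/*gafgyt*.csv"))

n = len(benign)                                 # 555,932 benign records
mirai  = mirai.sample(n=n, random_state=42)     # shuffle and down-sample to the benign size
gafgyt = gafgyt.sample(n=n, random_state=42)

data = pd.concat(
    [benign.assign(label=0), mirai.assign(label=1), gafgyt.assign(label=2)],
    ignore_index=True,
)
data = data.loc[:, ~data.columns.str.startswith("Unnamed")]   # drop stray index columns
X, y = data.drop(columns="label"), data["label"]
```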

4.3 Evaluation Metrics

Performing a comprehensive performance evaluation requires addressing several metrics; accuracy alone is not sufficient for an imbalanced data set. The confusion matrix is used to visualize the performance of an ML technique. It describes the performance of a classification model on a set of test data and allows easy identification of confusion between classes. The classification is evaluated through four indicators as follows. True positives (TP): packets are predicted as malicious, and their ground truth is malicious. False positives (FP): packets are predicted as malicious, while their ground truth is benign. True negatives (TN): packets are predicted as benign, and their ground truth is benign. False negatives (FN): packets are predicted as benign, while their ground truth is malicious.

Table 3 Evaluation results for ternary classification

A successful detection requires correct attack identification while minimizing the number of false alarms. Four metrics are widely used for evaluating ML models, i.e., accuracy, precision, recall, and F1 score. These four measures are defined as follows:

$$\begin{aligned} \text {Accuracy} \equiv \frac{TP+TN}{TP+TN+FP+FN}, \end{aligned}$$
(26)
$$\begin{aligned} \text {Precision} \equiv \frac{TP}{TP+FP}, \end{aligned}$$
(27)
$$\begin{aligned} \text {Recall (Detection Rate)} \equiv \frac{TP}{TP+FN}, \end{aligned}$$
(28)
$$\begin{aligned} \text {F1 Score} \equiv \frac{2 \times \text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}}. \end{aligned}$$
(29)
Fig. 4 Confusion matrix for HGB and CAB ternary classifiers

These measures range from 0 to 1. The goal is to maximize all of them, as higher values correspond to better classification performance. For a fair comparative evaluation, two additional performance measures are considered. The first is the model training time, defined as the time consumed during the model training phase. The second is the testing time, defined as the time consumed during the testing phase.
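A minimal sketch of computing these measures with scikit-learn is shown below; it uses the fitted HGB estimator as an example (any model from Sect. 3 could be substituted), and macro averaging over the three classes is an assumption, since per-class scores can also be reported.

```python
import time
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

start = time.time()
y_pred = hgb.predict(X_test)             # any fitted boosting model can be used here
testing_time = time.time() - start       # the second temporal measure: testing time

print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="macro"))
print("Recall   :", recall_score(y_test, y_pred, average="macro"))
print("F1 score :", f1_score(y_test, y_pred, average="macro"))
print("Testing time (s):", testing_time)
```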

5 Experimental Results

The experiments are conducted in the Colab notebook interactive environment. To provide an evidence-based evaluation, the project along with the data sets is uploaded and shared on GitHub. The evaluation covers six boosting-based ML algorithms, i.e., ADB, GDB, XGB, CAB, HGB, and LGB, for the objective of ternary classification. The ternary data set demonstrated in Sect. 4.2 is used for model evaluation, and the six algorithms are fitted with the formed data set. The performance evaluation metrics identified in Sect. 4.3 are calculated and documented for ternary classification in Table 3.

Fig. 5 Temporal performance for boosting-based algorithms in ternary classification

Table 4 Evaluation results for fivefold cross-validation

The empirical evaluation results showed significant potential for boosting-based ML algorithms in detecting network intrusions in IoT environments. For ternary classification, both the CAB and HGB algorithms outperform the others with the highest accuracy of 99.9994%. Figure 4 shows the confusion matrices of ternary classification using the HGB and CAB algorithms. The Adaptive Boosting algorithm was originally developed for binary classification; each of its trees is just a decision stump, which is a node and two leaves, so it can be seen as a forest of stumps. This explains its relatively low performance of 95.2566% in ternary classification.

Some boosting algorithms might be computationally intensive and resource-demanding, which could hinder their practicality in resource-constrained IoT environments. Figure 5 illustrates the temporal performance of the six algorithms. The results show that training the GDB model took around five times as long as training the XGB model. This is because GDB does not support multi-threading, unlike the XGB algorithm, which is an implementation of GDB that supports multi-threading.

Enhancing the intrusion detection rate of the model can lead to an improvement in the real-time detection performance of the entire IoT intrusion detection system. Table 3 shows the experimental results of ternary classification. Although HGB and CAB achieved the highest detection accuracy, HGB consumed less time for training; CAB consumed around eleven times the training time of HGB. This reflects the design introduced in LGB, i.e., GOSS and EFB: as HGB is inspired by the LGB design, it achieves a similarly small training time compared with the other algorithms. Besides achieving the highest detection accuracy, CAB consumed the least testing time of 1.37 s, which reflects its strength in IoT intrusion detection and its real-time feasibility.

To ensure the robustness and reliability of our findings, we conducted cross-validation as an essential step in our research methodology. Cross-validation is a widely accepted technique used to assess the generalization performance of a predictive model. In our study, we employed k-fold cross-validation, where the data set was divided into k equally sized folds. During each iteration, one fold was held out as a validation set, while the model was trained on the remaining k-1 folds. This process was repeated k times, with each fold serving as the validation set once. By averaging the performance metrics across all iterations, we obtained a comprehensive evaluation of the model’s effectiveness and its ability to generalize to unseen data. Cross-validation allowed us to mitigate the risk of overfitting, as it provided a more objective assessment of our model’s performance. The utilization of this rigorous technique enhances the credibility of our results and strengthens the validity of our conclusions. Table 4 shows the results of conducting fivefold cross-validation. The results show that HGB outperforms the other algorithms with an average detection accuracy of 0.999992.
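A minimal sketch of the fivefold cross-validation with scikit-learn is shown below, using HGB as an example; X and y are assumed to be the balanced ternary features and labels prepared in Sect. 4.2.

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

hgb = HistGradientBoostingClassifier(random_state=42)
scores = cross_val_score(hgb, X, y, cv=5, scoring="accuracy", n_jobs=-1)
print("Fold accuracies:", scores)
print("Mean accuracy  :", scores.mean())
```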

Fig. 6 Learning curve for HGB

Figure 6 shows the learning curve of the HGB model. The x-axis represents the number of training examples used, and the y-axis represents the performance of the model.

The learning curve is composed of two lines: the training error and the cross-validation error. The training error represents how well the model performs on the training data as the number of training examples increases. The cross-validation error, on the other hand, represents the model’s performance on the validation data during cross-validation. The curves plot the mean errors, while the variability across cross-validation folds is shown by the shaded areas, which represent one standard deviation above and below the mean.

As the number of training examples increases, both the training error and the cross-validation error should improve. The gap between the two lines indicates the model’s generalization ability. Here the gap is not large, which means that the model is not overfitting the training data and is able to generalize well to unseen data.
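A curve such as the one in Fig. 6 can be produced with scikit-learn’s learning_curve utility, as in the minimal sketch below; note that scikit-learn reports scores (accuracy) rather than errors, so the plotted curves rise rather than fall, and the training-set fractions are chosen arbitrarily here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    HistGradientBoostingClassifier(random_state=42), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy", n_jobs=-1,
)

for scores, label in [(train_scores, "training"), (val_scores, "cross-validation")]:
    mean, std = scores.mean(axis=1), scores.std(axis=1)
    plt.plot(sizes, mean, label=label)
    plt.fill_between(sizes, mean - std, mean + std, alpha=0.2)  # one std. dev. band
plt.xlabel("Number of training examples")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```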

6 Conclusion and Future Work

This paper presented an experimental evaluation of adopting ML boosting-based algorithms for detecting network intrusions in IoT. Six boosting-based algorithms were implemented and tested using the well-known standard N-BaIoT data set for benchmarking. The results demonstrated the significant potential of boosting-based ML algorithms. The best performance was achieved using the HGB algorithm in ternary classification. Fivefold cross-validation was conducted to confirm the experimental results, showing that HGB outperforms the other algorithms with a detection accuracy of 0.999992.

Boosting-based algorithms can sometimes lack interpretability, making it difficult to understand how an intrusion detection decision was made. A future research direction is employing explainable artificial intelligence (XAI) with boosting-based algorithms in the context of intrusion detection in IoT to provide transparency and interpretability of intrusion detection and classification.

This study presented an empirical evaluation of employing boosting-based algorithms with the N-BaIoT data set. Further research is required to evaluate the performance of boosting-based algorithms with other IoT data sets.