1 Introduction

Insiders, such as employees, have legitimate access to an enterprise's resources in order to perform their job duties; as a result, detecting insider threats is one of the most difficult challenges facing security administrators [1, 2]. That is why this study employs a variety of supervised machine learning classifiers with specific criteria to find the most accurate classifier for predicting insider threats, mainly LAN attacks from the NSL-KDD data set [3,4,5].

According to [6], 94% of firms experienced an insider data breach in the last 12 months, and 84% encountered security difficulties caused by non-technical errors (Insider Data Breach Survey, 2021). Humans are the leading cause of disastrous insider data breaches; accordingly, malicious insiders are a major concern for department heads, with 28% of them agreeing with this statement.

[7] published a report stating that insider threat incidents increased by 44% over the last two years, with costs climbing by more than a third to USD 15 million (Cost of Insider Threats Global Report, 2022). In addition, the cost of corporate credential theft has risen by 65% since 2020, from USD 2 million to USD 4 million today. Furthermore, the average time to contain an insider threat incident increased from 77 to 85 days, implying that organizations spent more on containment operations. When incidents take more than 90 days to contain, organizations incur an average annualized cost of USD 17 million [5].

Data breach attacks are classified into different categories [8]: passive attacks, active attacks, close-in attacks, insider attacks, and distribution attacks. Insider attacks are among the most significant threats to information systems because of their impact on confidentiality, integrity, and availability (CIA), especially when they occur on a LAN. These attacks can damage business operations, reputations, and finances [9].

The purpose of this study is to find the most accurate classifier for identifying insider attacks that occur on LANs. The significance of the study lies in locating irregular 'attacked' LAN traffic by developing Python code that uses scikit-learn as the backend machine learning framework and plots charts with the open-source Plotly, Seaborn, and Matplotlib libraries. To eliminate bias, a random search algorithm (RSA) is used to tune the hyperparameters, with K-fold and stratified cross-validation methods applied to avoid overfitting.

This study is divided into four sections. Sections 1 and 2 summarize related articles and previous studies. Section 3 discusses the proposed framework. Finally, Sect. 4 analyzes the study's findings.

1.1 Tuning hyperparameters and risk minimization

Hyperparameters, also known as nuisance parameters, are values that must be specified outside of the training procedure. Hyperparameter tuning (regularization) is the process of determining the optimal hyperparameter values to pass as input to the estimator, such as the decision tree classifier's criterion and maximum depth; these values index the model in many learning problems. Because the optimal hyperparameters for one data set are not necessarily the best for other data sets, the settings must be adjusted for each task. Before evaluating potential estimators, the hyperparameters must be tuned to minimize the expected risk [10,11,12].
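As a minimal sketch of what such externally specified values look like (the values below are illustrative, not the study's tuned settings):

```python
from sklearn.tree import DecisionTreeClassifier

# The same estimator behaves differently under different hyperparameter values.
default_tree = DecisionTreeClassifier()                   # library defaults
tuned_tree = DecisionTreeClassifier(criterion="entropy",  # splitting criterion
                                    max_depth=8)          # cap on tree depth
```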

1.2 Avoid overfitting and model selection

Using the test data set in the model selection procedure can introduce overfitting because unseen data leaks into the model when selection is based on the best evaluation metrics. Likewise, using the training data set to test model performance also causes overfitting. An overfit model produces inaccurate predictions and cannot handle or generalize to new input (unseen data); as a result, the model may become useless [13,14,15,16].

As a solution, a technique called cross-validation (CV) is employed to mitigate overfitting. It is a powerful tool for developing and selecting ML models. Not only does it support the reliability of the classifiers under evaluation, but it also avoids a further split of the data set, and thus the underfitting caused by data division, a lack of samples, or insufficient learning of the model [13, 14].

Cross-validation randomly separates the training set into two logical parts: a training set and a validation set. When the existing test set, as in our NSL-KDD data set, is added, we have three sets in total. Each set serves a different purpose: the training set is used to teach the model, the validation set is used to address the issues above, such as choosing the best model and generalizing to new data, and the test set is used to evaluate the model's performance [13, 14].

1.3 Background of the study

ML has proven to be an ideal solution for problems such as anomaly detection and network intrusion detection [17, 18]. Therefore, supervised ML algorithms are used to solve the problem of this study owing to their speed of response in detecting threats. Supervised ML algorithms are divided into two types [19]: classification algorithms and regression algorithms. Classification algorithms address this problem since they can distinguish between two or more classes (normal, attack); in the framework, the predicted outcomes are discrete class labels [20,21,22]. The supervised ML classification algorithms used are linear support vector machines (SVMs), decision trees (DTs), and logistic regression (LR).

In addition, ensemble algorithms help solve both classification and regression problems. The goal of ensemble techniques is to combine many prediction models to enhance the outcome. The supervised ML ensemble classification algorithms used are adaptive boosting (AdaBoost), gradient boosting (GB), extremely randomized trees (ERTs), and random forests (RFs) [22].

Insiders can exhaust an organization's resources significantly, resulting in huge financial and human losses. Because of this, insider activities on a LAN must be recognized and their impact on security policies (CIA) identified. Since the motivations of an insider can be personal, political, or economic [17, 23, 24], a plan must be prepared to detect all possible disasters and security concerns. As a result, the study aims to characterize these risks on a LAN immediately and support security administrators instantly by identifying the most accurate ML classifier.

There is a need to comprehend the link between insiders and their threats. Insiders can gain access rights to networks either legally or illegally. Legally, different departments may gain access to one another because of variances across departments, joint ventures, outsourcing, and the recruitment of temporary employees such as consultants. Thus, different levels of authorization are certainly granted to these insiders [5]. Their threats involve the misuse of legitimate access rights. However, there are several types of insiders, and each type has its own procedures, risks, and data sets. The NSL-KDD data set targets anyone connected either internally or remotely to a LAN [4, 17].

2 Related articles

This section discusses previous studies and articles related to the study at hand. In [25], the theoretical obstacles to detecting insider threats are addressed, which helped define the research topic. The study also lists the existing insider threat data set types, which include emails, authentication, login, HTTP, and files but exclude insider attacks; consequently, the importance of our research contributions is highlighted. [26] proves that a one-hot-encoding approach is capable of converting categorical features into new individual binary features on which to train the classification models. [3] offers a review of existing insider threat approaches that use NSL-KDD to detect DoS attacks. [27] provides context-specific definitions of ML model hyperparameters, their impact on decision-making performance when tuned, and various approaches to obtaining optimal values. In [28], the sensitivity of hyperparameter adjustment in eliminating bias in performance prediction is explored. The researchers carry out a detailed investigation, but the results are unsatisfactory, and the classifier cannot be generalized to another test data set; as a result, additional research is needed to locate the best classifier when building a machine learning system.

The problems of both overfitting and underfitting are presented in [14], along with their impact on model performance for decision-making and cross-validation methods as a solution. [29] extended and simplified significant CV approaches for developing a final ML-based model. The researcher stresses the importance of generalizing to unseen data to maximize the potential of predictive models and avoid overfitting, concluding that generalization to unseen data cannot be overlooked and that model building should not be limited to training and testing alone. [23] provides an in-depth examination of the NSL-KDD data set; the study also analyzes the issues found in the KDD99 data set in addition to evaluation metrics. [17] presents a comprehensive review and in-depth understanding of insider threats based on previously published articles and statistical data on both insiders and the methodologies employed to detect them. However, the reported results of supervised machine learning are disappointing. Furthermore, most articles in the review focus on outside attacks from emails, HTTP, illegal file access, and devices while ignoring dangers from within the network. In contrast, this study is concerned with LAN insider attacks since they are more common, easily motivated, and cause maximum damage.

[30] details the NSL-KDD data set features as well as the concerns observed in KDD99 and the attack type classifications. The researchers in [27] conduct a survey assessment of insider threat concerns. They state that gauging the extent of the insider threat is a complicated challenge since it is usually difficult to distinguish between insiders and outsiders of a community operating within a LAN. Furthermore, some insiders can initiate attacks from the outside, for example, an employee who has left the organization. The article discusses the challenge of identifying internal attacks that take place on the internal network; therefore, the main motive behind the current study is to find the most accurate classifier that identifies these threats correctly.

[19] demonstrated supervised machine learning methods and the significance of classification models in distinguishing anomalous behavior from normal traffic. [31] highlights the aspects of insider threats and the methods to confront them using either machine learning or non-machine learning techniques. In [32], it is shown that missing values in data sets can be restored using zero as a substitute.

After surveying the above-listed studies, we conclude that ML techniques are the best solution for insider threat identification. Consequently, one of the reasons behind conducting our research is to locate the best ML technique to address the challenge of identifying insider threats.

3 Case study

In this section, the study focuses on clarifying the methodology utilized in this research paper. Figure 1 depicts the process framework which consists of seven stages: (1) collect data set, (2) preprocessing, (3) tuning models, (4) feature selection, (5) avoid overfitting, (6) training models, and (7) final evaluation. In the following subsections, each stage is explained in detail.

Fig. 1 The process framework

3.1 Data set description

MIT Lincoln Labs developed and managed the 1998 DARPA intrusion detection evaluation program; they built a LAN that simulated a US Air Force LAN, conducted several attacks, and gathered raw TCP dump data. Data flowed from a source IP address to a destination IP address under a specific protocol, which makes it possible to distinguish between normal and malicious connections [13, 17]. A connection was defined as a sequence of TCP packets transmitted during a certain time window. Afterward, MIT Lincoln Labs extracted features from the raw DARPA data and packaged them into the first ready-to-use version, known as KDD99. However, several issues were discovered in the KDD99 data set [30, 33, 34]. Many of these issues were solved in the updated NSL-KDD data set, for example by removing redundant records, which reduces the size of the training and test sets and makes experiments easier and faster [21, 30, 35].

3.2 Data set analysis

The NSL-KDD data set contains 41 features and is divided into two files: a training data set file and a test data set file. The training file has 125,973 records, whereas the test file has 22,544 records [30].

As shown in Fig. 2, the Python code, using Pandas as a data analysis tool, classifies the feature data types into object (nominal), int64, and float64 [13, 19, 23]. Figure 2 also displays the counts of and variation in the unique values among the nominal features in the training and test data sets. Notably, the service feature has 70 unique values in the training data set but only 64 in the test data set; the preprocessing phase tackles this discrepancy.

Fig. 2 Unique values of the nominal features
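A minimal sketch of this inspection with Pandas (the file name is an assumption about the NSL-KDD distribution, not the study's exact code):

```python
import pandas as pd

# Assumed file name from the NSL-KDD distribution.
train_df = pd.read_csv("KDDTrain+.txt", header=None)

# Feature data types: object (nominal), int64, and float64.
print(train_df.dtypes.value_counts())

# Unique-value counts of the nominal features (e.g., the service feature).
print(train_df.select_dtypes(include="object").nunique())
```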

The class label contains five main categories [21, 30, 33]:

i. Normal: normal connections.

ii. DoS: denial-of-service, for example, Smurf.

iii. Probing: surveillance, such as port sweep.

iv. Access: unauthorized remote machine access, e.g., spying.

v. Privilege: unauthorized access to local superuser (root) privileges, e.g., Rootkit.

The probability distribution of the training data set is different from that of the test data set. The test data set contains proportionally more attacks than the training set so that the estimators can be assessed on new offensives and the system can simulate reality [21, 30, 33]. Figure 3 shows the class label sizes in the NSL-KDD training data set, whereas Fig. 4 illustrates the class label sizes in the NSL-KDD test data set.

Fig. 3 The count of each class label in the training set

Fig. 4 The size of each class label in the test data set

3.3 Data preprocessing

This stage is one of the most critical phases in the machine learning approach. Figure 5 exhibits the data flow diagram (DFD) for the data preprocessing procedure. As shown in Fig. 5, we first separate the numerical and categorical features. Then, we repair the missing service-feature values between the training and test sets. Afterward, we apply transformation techniques to the categorical features. Finally, we perform scaling methods on all features before recombining them. The preprocessing principle attempts to transform the raw data set into a useful format while ensuring that the data set is clean and noise-free, so that the estimator's decisions are not affected [32]. The following subsections describe the preprocessing methods applied to the training and test data sets:

Fig. 5 DFD for the data preprocessing stage

3.3.1 Data transformation

After checking the purity of the data, transformation techniques are applied because most machine learning models do not accept categorical features. First, categorical features are converted into numbers, a process known as 'encoding' the categorical features [32]. There are four categorical features in the NSL-KDD data set, namely protocol type, service, flag, and the class label. After that, all the features are standardized so that they carry the same weight and the classifiers cannot favor values simply because of their greater magnitude. The data transformation methods used are one-hot-encoding and class label encoding.

One-hot-encoding is also known as dummy encoding. This process converts categorical features into binary features in two stages [26]. First, the unique values of a categorical feature are transformed into new binary features. Then, the unique value observed for a connection is assigned 1, and the remainder are assigned 0. This methodology was only used for the categorical input features [36, 37], namely protocol type, service, and flag; the class label was handled differently.

As displayed in Fig. 2 for the protocol type, service, and flag features, there is a striking difference between the service feature in the training data set and in the test data set. To align the test data set with the training data set, a zero value is used to compensate for the missing values [32, 38]. Table 1 exhibits samples of the protocol-type feature after applying dummy encoding.

Table 1 An extract of the protocol-type feature
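A minimal sketch of both steps with pandas (toy stand-ins for the NSL-KDD frames; column names are assumptions):

```python
import pandas as pd

# Toy stand-ins for the NSL-KDD training and test frames.
train_df = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp"], "duration": [0, 1, 2]})
test_df = pd.DataFrame({"protocol_type": ["tcp", "tcp"], "duration": [3, 4]})

# One-hot (dummy) encode the categorical feature in both sets.
train_enc = pd.get_dummies(train_df, columns=["protocol_type"])
test_enc = pd.get_dummies(test_df, columns=["protocol_type"])

# Align the test set to the training columns; categories missing from the
# test set (here udp and icmp) are filled with 0, as described above.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```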

In addition to the 38 original numeric features [39], encoding increases the data set to a total of 122 features: three protocol-type features, 70 service features, and 11 flag features are added. Table 2 presents the complete set of features after encoding.

Table 2 All the features after encoding

The class label contains sub-attacks that fall within the scope of five main categories, namely DoS, probe, access, privilege, and normal connections. Each of these categories is converted into a unique integer in the same class label column, as shown in Table 3. After converting all categorical attributes to integers, each attack group is handled individually, with normal traffic introduced alongside each group so that the models can differentiate between regular connections and attacks. Figure 6 depicts the magnitude of each attack type as well as normal traffic in both the training and test data sets.

Table 3 The class label categories mapped to unique integers
Fig. 6 The size of each group of attacks in the NSL-KDD data set

3.3.2 Scaling

Feature scaling aims to place all features on the same scale, indicating that all features are equally important [13]. Figure 7 depicts the data set before any scaling is applied. Two scaling approaches are used: the robust scaler and standardization.

Fig. 7 Unscaled data

First, the robust scaler reduces the influence of outliers by subtracting the median and scaling the data according to the interquartile range [13, 40]. Figure 8 depicts the data set after applying the robust scaler. The following formula (1) is used:

$$ X_{\text{new}} = \frac{X_{i} - X_{\text{median}}}{\text{IQR}}, \quad \text{IQR} = Q_{3}\left( x \right) - Q_{1}\left( x \right) $$
(1)
Fig. 8 Data after applying robust scaling

where \(X_{\text{new}}\) is the standardized value, \(X_{i}\) the original value, \(X_{\text{median}}\) the sample median, and \(Q_{1}\) and \(Q_{3}\) the first and third quartiles.

Second, standardization is implemented in Python as the standard scaler (SS). It rescales values so that the standard deviation equals 1 and the mean becomes 0 [13, 41]. Figure 9 displays the data after applying SS. The Z-score Eq. (2) defines this scaler:

$$ Z = \frac{x - \mu }{\sigma } $$
(2)
Fig. 9 The data set's deviation shape after applying the standard scaling

where \(x\) is the value, \(\mu\) the mean, and \(\sigma\) the standard deviation.
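A minimal sketch of both scalers with scikit-learn (toy data; the order in which the study applies them is taken as an assumption here):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy numeric matrix standing in for the encoded NSL-KDD features.
X_train = np.array([[0.0, 200.0], [1.0, 4000.0], [2.0, 150.0], [3.0, 90000.0]])

# Robust scaling: subtract the median, divide by the IQR (Eq. 1).
X_robust = RobustScaler().fit_transform(X_train)

# Standardization: zero mean, unit standard deviation (Eq. 2).
X_standard = StandardScaler().fit_transform(X_train)
print(X_robust, X_standard, sep="\n")
```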

3.4 Tuning model

The tuning process is a matter of trial and error: the statistical ML model is run repeatedly with different hyperparameter values [13, 14]. Its performance is then evaluated on the validation set to determine which set of hyperparameters yields the most accurate model [6]. The main technique used in tuning the models is known as RSA.

RSA defines, for each hyperparameter, a statistical distribution from which values are randomly picked and used to train the model. This increases the likelihood of quickly finding effective values for each hyperparameter [6, 12]. Table 4 depicts the results of the RSA: the optimal hyperparameters of each model, the best hyperparameter affecting the model's decision process, the hyperparameter data type, the default values with which each model operates, the start–end random values, and finally the chosen optimal value for each hyperparameter.

Table 4 Optimum values of hyperparameters for the models
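A minimal sketch of random search over, for example, the DT model's hyperparameters (toy data; the distributions below are illustrative, not the study's exact ranges):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

search = RandomizedSearchCV(
    DecisionTreeClassifier(),
    param_distributions={
        # Criterion options per Eqs. (5)-(7); 'log_loss' needs scikit-learn >= 1.1.
        "criterion": ["gini", "entropy", "log_loss"],
        "max_depth": randint(2, 30),  # assumed search range
    },
    n_iter=20,       # number of random draws
    cv=5,            # K-fold CV inside the search
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```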

In the following paragraphs, we describe the mathematical functions behind the hyperparameters of our models. First, the linear SVM model employs two equations for the loss hyperparameter. The hinge is a cost function in which a margin, or distance from the classification border, is defined according to Eq. (3), and the squared hinge is given by Eq. (4) [42, 43], where \(t\) is the actual result, either 1 or 0.

$$ {\text{hinge}}\left( y \right) = \max \left( 0, 1 - t \cdot y \right) $$
(3)
$$ {\text{squared hinge}}\left( y \right) = \max \left( 0, 1 - t \cdot y \right)^{2} $$
(4)

Second, the criterion hyperparameter of the DT model accepts three arguments: Gini, entropy, and log loss. The equations are as follows [14, 32, 43, 44]:

The Gini index determines the splitting for each feature by quantifying the impurity of \(D\). The following formula (5) determines Gini:

$$ {\text{Gini}}\left( D \right) = 1 - \mathop \sum \limits_{i = 1}^{m} p_{i}^{2} $$
(5)

where \(p_{i}\) is the probability that a tuple in \(D\) belongs to class \(C_{i}\), estimated by \(\left| {C_{i,D} } \right|/\left| D \right|\); the sum is taken over \(m\) classes.

Entropy is an information metric used to evaluate the impurity or uncertainty in a set of data; it controls how a decision tree splits the data. Here, \(p\left( {x_{i} } \right)\) denotes the probability of the \(i\)th class in the data set \(D\), where \(i = 1, 2, \ldots, n\). The following formula (6) computes entropy:

$$ {\text{Entropy}}:H\left( D \right) = - \mathop \sum \limits_{i = 1}^{n} p\left( {x_{i} } \right)\log_{2} p\left( {x_{i} } \right) $$
(6)

Log loss is employed when predicting a Boolean (true or false) outcome with a likelihood ranging from certainly true (1) to certainly false (0). The log loss formula (7) is defined as:

$$ {\text{logloss}} = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \log \left( {p_{i} } \right) $$
(7)

where \(N\) is the number of instances and \(p_{i}\) is the model's predicted likelihood.
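As a small worked illustration of Eqs. (5)–(7) (the class proportions and probabilities below are made up), consider a node containing 80% normal and 20% attack samples:

```python
import numpy as np

p = np.array([0.8, 0.2])            # made-up class proportions at a node

gini = 1 - np.sum(p ** 2)           # Eq. (5): 1 - (0.64 + 0.04) = 0.32
entropy = -np.sum(p * np.log2(p))   # Eq. (6): about 0.722 bits

# Eq. (7): log loss for two instances whose true class was predicted
# with probabilities 0.9 and 0.6, respectively.
probs = np.array([0.9, 0.6])
logloss = -np.mean(np.log(probs))   # about 0.308
print(gini, entropy, logloss)
```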

Third, the solver hyperparameter in the LR model offers five approaches [45]. To begin with, Newton's method approximates \(f\left( x \right)\) with a quadratic function around \(x_{n}\) in each iteration [46, 47]. The limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS) keeps only a few vectors and estimates the inverse Hessian matrix from a specified number of gradient evaluations [46]. The library for large linear classification (liblinear) employs a coordinate descent (CD) approach, solving the optimization problem through successive approximate minimizations along coordinate directions [48]. Stochastic average gradient descent (SAG) is an iterative gradient descent optimization with an incremental aggregated gradient modification that reuses a random sample of prior gradient values; it is suitable for large data sets since they can be handled quickly [49, 50]. Finally, SAGA is an extension of SAG, an improved version with faster convergence [49, 50].

Fourth, the purpose of the criterion hyperparameter in the GB model is to evaluate the quality of a data split. It includes 'friedman_mse' for mean squared error (MSE) with Friedman's improvement score and 'squared_error' for plain MSE. The 'friedman_mse' Eq. (8) and the MSE Eq. (9) are defined as [51, 52]:

$$ {\text{Friedman}}\_{\text{mse}}:i^{2} \left( {R_{{\text{l}}} ,R_{{\text{r}}} } \right) = \frac{{w_{{\text{l}}} w_{{\text{r}}} }}{{w_{{\text{l}}} + w_{{\text{r}}} }} (\overline{y}_{{\text{l}}} - \overline{{y_{{\text{r}}} }} )^{2} $$
(8)

where \(w_{{\text{l}}}\) is the sum of weights for the left part, \(w_{{\text{r}}}\) is the sum of weights for the right part, and \(\overline{y}_{{\text{l}}}\) and \(\overline{{y_{{\text{r}}} }}\) are the left and right means.

$$ {\text{MSE}} = \frac{{\sum (y_{i} - p_{i} )^{2} }}{n} $$
(9)

where: \(y_{{\text{i}}}\) is the \(i^{th}\) observed value, \(p_{{\text{i}}}\) is the corresponding predicted value, and \(n\) is the number of observed values.

Fifth, the 'max_features' hyperparameter in the GB model is the maximum number of features permitted for each individual tree. The first option, sqrt, takes the square root of the total number of features, as realized in Eq. (10) [51]:

$$ {\text{max}}\_{\text{features}} = {\text{sqrt}}\left( {n{\text{\_features}}} \right) $$
(10)

Another option is \(\log_{2}\), which takes \(\log_{2}\) of the number of features, as realized in Eq. (11):

$$ \max \_{\text{features}} = \log_{2} \left( {n{\text{\_features}}} \right) $$
(11)

Sixth, the algorithm hyperparameter of the AdaBoost classifier offers two options: 'SAMME.R' and 'SAMME'. SAMME is an acronym for stagewise additive modeling using a multi-class exponential loss function, and R stands for real. SAMME employs a separate set of 'decision influence' weights (alphas), one per weak learner. SAMME.R, on the other hand, allocates an equal weight to each weak learner and evaluates the class likelihood, which usually converges faster than SAMME [53,54,55,56]. SAMME and SAMME.R are defined by Eqs. (12) and (13):

$$ {\text{SAMME}}: H\left( x \right) = \arg \mathop {\max }\limits_{k} \mathop \sum \limits_{t = 1}^{T} \alpha_{t} \, I\left( {h_{t} \left( x \right) = k} \right) $$
(12)
$$ {\text{SAMME}}.{\text{R}}: H\left( x \right) = \arg \mathop {\max }\limits_{k} \mathop \sum \limits_{t = 1}^{T} s_{k}^{t} \left( x \right) $$
(13)

where \(H\left( x \right)\) is the classification prediction, \(T\) the number of weak learners, \(\alpha_{t}\) the weight of weak learner \(t\), \(h_{t}\left( x \right)\) the prediction of weak learner \(t\), and \(s_{k}^{t}\) a multiplier.
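A minimal sketch of setting this hyperparameter in scikit-learn (toy data, illustrative values; note that recent scikit-learn releases deprecate 'SAMME.R' in favor of 'SAMME'):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

# The 'algorithm' hyperparameter selects between the two update rules above.
clf = AdaBoostClassifier(algorithm="SAMME", n_estimators=50, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```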

3.5 Feature selection

This section lists the features used to train the models. The following approaches are employed:

3.5.1 Univariate feature selection (UFS)

It is a statistical method that exploits the discrepancy between features until a threshold value is obtained from them. This threshold value then determines, through the recursive feature elimination method, the actual features used to train the models [19, 22, 57].

The 'f_classif' function in the scikit-learn ML framework measures variance using univariate statistical tests based on the analysis of variance (ANOVA) F value [19, 22, 58]. It computes the overall comparison error and yields a greater F value when the variance between groups exceeds the variance within groups, indicating a higher likelihood that the observed difference is real rather than random. Consequently, features with low F values are excluded and the most discriminative features are selected. This technique picked 13 features for each attack category, and the recursive feature elimination approach then used 13 as its threshold. The F-statistic Eqs. (14) and (15) for one-way ANOVA are as follows [19, 22, 58]:

$$ F = \frac{{\text{between - groups variance}}}{{\text{within - group variance}}} $$
(14)
$$ F = \frac{{{\text{MS}}_{{{\text{Groups}}}} }}{{{\text{MS}}_{{{\text{Error}}}} }} = \frac{{{\text{SS}}_{{{\text{Groups}}}} /\left( {I - 1} \right)}}{{{\text{SS}}_{{{\text{Error}}}} /\left( {n_{T} - I} \right)}} $$
(15)

where MS is the mean square, SS the sum of squares, \(I\) the number of groups, and \(n_{T}\) the total sample size.
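A minimal sketch of this step with scikit-learn (toy data; 13 is the study's per-category threshold):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Keep the k features with the highest ANOVA F values (Eqs. 14-15).
selector = SelectKBest(score_func=f_classif, k=13)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (500, 13)
```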

3.5.2 Recursive feature elimination (RFE)

It is a wrapper-type feature selection algorithm. RFE seeks to identify acceptable feature subsets and operates immediately after the UFS technique. First, each model is fitted individually with its tuned hyperparameters. Then, all features are ranked according to their relevance to one another, and the least important features are pruned. RFE recursively repeats this procedure on the reduced set until it reaches the feature count defined as the threshold by the UFS method [19, 22, 57]. Table 5 reports the selected features of each model, obtained with the RFE method for each attack category based on UFS's threshold.

Table 5 Select features of each model for each attack
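A minimal sketch of RFE, again on toy data and with an illustrative estimator:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Recursively drop the least important features until 13 remain,
# matching the threshold chosen by the UFS stage.
rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=13)
rfe.fit(X, y)
print(rfe.support_.sum())  # 13 selected features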

3.6 Cross-validation

As previously indicated, cross-validation [13, 14, 22] is an efficient instrument for designing and choosing ML models; it is employed in this study to avoid overfitting. The following cross-validation methods are used.

3.6.1 K-Fold CV

The original training data set is divided into equal-sized folds (K subsamples) by random sampling. The model is trained on K−1 folds and then validated on the remaining fold; the procedure is repeated, recording the arithmetic mean and standard deviation of the evaluation measures across the K folds [6, 22]. Tables 6, 7, 8, 9, 10, 11, and 12 show the outcomes (accuracy, recall, and area under the curve) of the K-fold CV, as mean ± standard deviation across folds for each model, with the best results highlighted in bold in Table 10.

Table 6 Results of K-Fold CV for linear SVM model
Table 7 Results of K-Fold CV for DT model
Table 8 Results of K-Fold CV for LR model
Table 9 Results of K-Fold CV for GB model
Table 10 Results of K-Fold CV for AdaBoost model
Table 11 Results of K-Fold CV for RFs Model
Table 12 Results of K-Fold CV for ERTs model

3.6.2 Stratified K-fold CV

It is the same as K-fold CV but uses stratified sampling to avoid two issues: the purely random sampling of the K-fold CV method and the imbalance in class sizes in the data set. Each stratum preserves nearly the same proportion of samples as the original data set, so every fold contains the same ratio of normal and attack samples. Consequently, whatever criteria are used to evaluate them, the findings will be consistent across all folds [13, 22, 59]. Table 13 illustrates the results of the stratified K-fold CV applied to the model that achieved the best K-fold CV results. The stratified K-fold CV approach delivers good results for the AdaBoost model, confirming that the above-mentioned concerns are addressed.

Table 13 Results of stratified K-Fold CV for AdaBoost model
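A minimal sketch of both CV variants with scikit-learn (toy imbalanced data; the fold count is an assumption, not the study's stated value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
clf = AdaBoostClassifier(random_state=0)

for cv in (KFold(n_splits=5, shuffle=True, random_state=0),
           StratifiedKFold(n_splits=5, shuffle=True, random_state=0)):
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    # Report mean +/- standard deviation across folds, as in Tables 6-13.
    print(type(cv).__name__, scores.mean(), scores.std())
```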

3.7 Training models

This stage covers training the machine learning algorithms on the training data set. Algorithm 1 demonstrates the implementation phase of the framework. The framework is written in Python and uses scikit-learn as the backend ML tool to analyze the predicted data and find the best model for assessing probable normal and abnormal behavior on a LAN.

Algorithm 1 Supervised learning algorithm
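Algorithm 1 appears as a figure in the original article. A condensed sketch of the training loop it describes, under the assumptions already noted (toy data, default hyperparameters rather than the study's tuned ones), might look as follows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # toy data

models = {
    "Linear SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(random_state=0),
    "LR": LogisticRegression(max_iter=1000),
    "GB": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "RFs": RandomForestClassifier(random_state=0),
    "ERTs": ExtraTreesClassifier(random_state=0),
}

# Train each classifier and compare cross-validated accuracy.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```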

3.8 Final evaluation

The final evaluation is performed on the test set. The primary principles in testing the models are their capacity to adapt appropriately to new, previously unobserved data and the model's quality, which is determined via evaluation measures. Performance estimators are derived from the confusion matrix (CM), which visualizes the prediction results [22, 23, 33, 39] through four rates, as indicated in Table 14. Each column of the CM represents the number of predicted values, while each row represents the number of actual values.

Table 14 Confusion matrix

TP = True Positive (Normal Traffic Predicted as Normal).

TN = True Negative (Malicious Traffic Predicted as Malicious).

FP = False Positive (Malicious Traffic Predicted as Normal).

FN = False Negative (Normal Traffic Predicted as Malicious).

Our research findings reveal that the AdaBoost model achieves the highest accuracy, as exhibited in Fig. 10, which shows the CMs of the experimental results for predicted versus actual values, based on the above-mentioned rates, for the AdaBoost model.

Fig. 10 CMs for the AdaBoost model

Accuracy (Acc), recall (Rec), also known as the true-positive rate [14, 23, 33], and the area under the receiver operating characteristic curve (AUC-ROC) are three essential assessment metrics derived from the rates listed above. The accuracy score denotes the proportion of true-positive and true-negative predictions out of the total number of predictions made by the model, as in Eq. (16).

$$ {\text{Acc}} = \frac{TP + TN}{{TP + TN + FP + FN}} $$
(16)

The recall [14, 23, 33, 39] of an ML model indicates its ability to identify the proportion of true positives that are correctly classified, as in Eq. (17), while the AUC-ROC indicates how often positive predictions are ranked higher than negative predictions. The ROC curve plots the false-positive rate (FPR) on the x-axis against the TPR on the y-axis; Eqs. (17) and (18) give the TPR and FPR used to construct the AUC-ROC [13, 14, 22, 39]. Table 15 and Fig. 11 show the findings for the most accurate model (AdaBoost).

$$ {\text{Rec}} = \frac{TP}{{TP + FN}} $$
(17)
$$ FPR = \frac{FP}{{TN + FP}} $$
(18)
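A minimal sketch of computing these metrics with scikit-learn (made-up labels and scores, following the paper's convention that 'positive' means normal traffic):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             recall_score, roc_auc_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # 1 = normal, 0 = attack (made up)
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3]  # positive-class scores

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Acc:", accuracy_score(y_true, y_pred))   # Eq. (16)
print("Rec:", recall_score(y_true, y_pred))     # Eq. (17)
print("FPR:", fp / (tn + fp))                   # Eq. (18)
print("AUC:", roc_auc_score(y_true, y_score))
```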
Table 15 The final evaluation of the AdaBoost model of predicted attacks
Fig. 11 Representing AdaBoost model results

Figure 12 describes the AUC-ROC evaluation for detecting attacks with the AdaBoost model. It accurately identifies attack samples, with an AUC of 0.992 for DoS (7410 out of 7460 samples), 0.986 for Probe (2374 out of 2421 samples), 0.952 for Access (2677 out of 2885 samples), and 0.954 for Privilege (62 out of 67 samples).

Fig. 12 AUC-ROC curves of the AdaBoost model

3.8.1 Comparison with related works

We compared our proposed method with existing related works that use the NSL-KDD data set. Three metrics were used to compare performance: recall, AUC (TPR vs. FPR), and accuracy. Our study produces superior results with the AdaBoost model across all attack branches. For example, for DoS attacks we achieve a recall of 99.3% and an AUC of 0.992, with an overall accuracy across all attack categories of 98.5%. In contrast, the best results of [60] were a recall of 96.5%, an AUC of 0.980, and an overall accuracy of 94%. Table 16 displays all the results.

Table 16 Performance comparison with existing related works

4 Conclusion and future work

This study aims to determine the most accurate ML classifier for detecting LAN attacks. The research findings demonstrate that the AdaBoost model has the highest classification accuracy for both insider attacks and normal traffic behavior, at 99% for DoS, 98% for probe, 96% for access, and 97% for privilege, with AUCs of 0.992 (DoS), 0.986 (probe), 0.952 (access), and 0.954 (privilege). The study is carried out on the publicly accessible NSL-KDD data set, and its AUC measure surpasses previous approaches on this data set thanks to the strategies used to remove noise from the data, the selection of relevant features, the tuning of hyperparameters, and the minimization of bias. As a future recommendation, the techniques used in the study might be integrated into firewall configurations to identify insider threats, assisting cybersecurity specialists in securing the work environment and minimizing risks.