Introduction

Anomalies are unusual patterns that do not conform to the expected behavior of data [1,2,3,4]. Anomaly detection is the identification of such deviations or uncertainty in data. Detecting anomalies in the early stages can save the time and resources otherwise spent processing anomalous data and acting on decisions derived from it.

Cyber security protects data and digital systems from widespread critical security threats emerging from the Internet [5]. It secures networks and protects data, applications, and digital infrastructure from unauthorized access, attack, unauthorized modification, and availability issues [6,7,8,9,10,11,12], keeping the Confidentiality, Integrity, and Availability (CIA) triad intact [13,14,15].

Network Intrusion Detection Systems (NIDS) detect attacks through signature-based and anomaly-based techniques. The primary target of NIDS is to provide robust automated detection capability to networked devices for efficient and effective protection. In signature-based detection, network activities are compared with a database of attack patterns to identify whether an attempt is being made to compromise the network; alerts are generated when an attack is detected [16]. Anomaly-based detection identifies unknown attacks in network traffic by checking for variance in behavior from an already-established baseline [17]. There are many ways to detect anomalous intrusions, most of which comprise statistical methods and machine-learning techniques. A detailed review of machine learning methodologies for anomaly detection is presented in [18].

Statistical analysis methodologies can create a generic or benign profile of a particular activity. Such analysis can detect and identify activity that deviates from the typical profile, which can be considered a cyber-attack or suspicious activity. Machine learning opens a new doorway for detection technologies [19]. The use of federated learning is also increasing in large-scale networks such as smart transport infrastructure [20].

In NIDS, anomaly detection can be automated by utilizing machine learning classifiers. Numerous researchers have utilized machine learning classification techniques to detect or identify different kinds of attacks [21,22,23]. However, these techniques produced low accuracy in anomaly detection. The novelty of this paper is an ensemble learning-based approach that detects cloud-based anomalies accurately.

Paper Contributions

The research contributions of this article are as follows:

  • We design an approach named CAD that comprises a customized Convolutional Neural Network–Long Short-Term Memory (CNN-LSTM) deep learning model for binary anomaly detection and multiclass anomaly categorization on network graph data. CNN-LSTM applies a CNN at the first layer, LSTM layers from the second layer through the second-to-last layer, and a final dense layer at the output to detect anomalies.

  • We also design an approach named Ensemble Machine Learning (EML) that combines conventional machine learning algorithms and their results to detect binary anomalies in networks.

  • We analyze the complex state-of-the-art dataset UNSW-NB-15 [23] and highlight the significant parameters to adjust for the performance enhancement of AI models.

  • The experimental results demonstrate that the CAD approach improves the anomaly classification rate, attaining a maximum accuracy of \(97.06\%\) for binary anomaly detection with EML and \(99.91\%\) for multiclass anomaly detection with CNN-LSTM, which outperforms other state-of-the-art studies.

Paper Structure

The remainder of the manuscript is structured as follows. The Related Work section sheds light on state-of-the-art relevant research. The Network Preliminaries and Dataset section discusses the dataset used for this research. The CAD for Anomaly Detection section presents the proposed CAD approach and its sub-methods for anomaly detection. The Results and Evaluations section provides the experimental analysis of the proposed approaches. Finally, the Discussion section summarizes this article’s methodologies and outlines the intended future work.

Related Work

Numerous researchers have investigated and tested network security with machine learning techniques. The authors in [24] proposed a framework that uses the past behavior of nodes and machine learning techniques to improve the network’s security. Information on a participating node’s past behavior informs its trustworthiness in the network. To accomplish this, datasets must be preprocessed properly to remove irrelevant features and noisy data.

Faker et al. proposed two approaches composed of a Deep Feed-Forward Neural Network (DNN) and two ensemble techniques, Random Forest and Gradient Boosting Tree (GBT), to detect network intrusions by training the models on the UNSW-NB15 and CICIDS2017 datasets [25]. Khan et al. proposed a novel approach based on a two-stage deep learning model and tested it on the KDD99 and UNSW-NB15 datasets to prove the model’s proficiency [26]. Furthermore, trustworthiness checks are required to identify malicious participating nodes in the network, and past information can be used to determine whether a user has been reliable in the network [27]. This proved to be an excellent approach to guarantee the trustworthiness of the environment.

Machine learning techniques can be utilized to automate NIDS in anomaly detection. Djibouti et al. proposed a k-nearest neighbour methodology for distance-based outlier detection to perform flow distribution probability (FDP) outlier detection [28]. Chapaneri et al. presented a comprehensive survey of machine learning approaches to prevent network intrusion attacks using the UNSW-NB15, TUIDS, and NSL-KDD datasets [29]. Bagui et al. examined machine learning techniques over the UNSW-NB15 dataset to test the capabilities of the algorithms [30]. In 2015, the authors in [22, 23] introduced the UNSW-NB15 dataset, a hybrid of normal modern traffic and newly synthesized attack activities of network traffic. Moustafa and Slay [23, 31, 32] also criticized other datasets such as KDD’99 and NSL-KDD as limited, noting that they do not cover modern attacks in NIDS, and proposed the new UNSW-NB-15 dataset, which includes features different from the KDD’99 dataset and shares only a few standard features with it.

The authors in [21] also utilized the UNSW-NB-15 dataset and improved the results by using central points of attribute values in the preprocessing stage. The authors used the Apriori algorithm with Naive Bayes (NB) and Logistic Regression machine learning classifiers. Numerous researchers have used machine learning techniques to evaluate the efficiency of the UNSW-NB15 dataset. In 2020, Mohanad Sarhan [33] experimented on the UNSW-NB15 dataset and achieved the highest accuracy of \(99.25\%\) with binary classification without reducing all unnecessary features. The authors also applied multi-label classification to this dataset and achieved a weighted accuracy of 98.19% with an F-score of 98%. However, the whole dataset is vast, and previous research has used random sampling to train the respective models instead of using the original training and testing files provided with the dataset. Similarly, research [34] published in 2020 by J. Olamantanmi Mebawondu explains that if the features of a dataset are reduced through an algorithmic procedure, the dataset can be used in a real-time intrusion detection system. The author evaluated the dataset’s efficiency by applying an Artificial Neural Network (Multi-layer Perceptron) algorithm for anomaly classification and achieved an accuracy of \(76.96\%\).

Network Preliminaries and Dataset

The proposed models discussed in the CAD for Anomaly Detection section use unprocessed network packets of the UNSW-NB 15 dataset generated by the IXIA PerfectStorm tool. The purpose of creating the UNSW-NB15 dataset is to build Artificial Intelligence models that observe a system’s sophisticated real-time activities and real-time exploitation feedback. 100 GB of raw traffic data was captured using tcpdump as Pcap files. This dataset comprises nine attack categories: Shellcode, Analysis, DoS, Exploits, Generic, Fuzzers, Reconnaissance, Backdoors, and Worms. The UNSW-NB-15 dataset comprises two files, training and testing, containing records of all attack types and regular traffic features. The training file contains 175,341 records, and the testing file contains 82,332 records. The dataset contains 45 features in the training and testing files [35]. Features such as srcip, sport, dstip, stime, and ltime are absent from the training and testing files.

The optimum performance of an ML model can be achieved by preprocessing the dataset. In the preprocessing stage of this research, Not-a-Number (NaN) values and identical instances were removed, and scaling was performed, i.e., re-scaling of real-valued numeric attributes to a fixed range. Moreover, the first four columns of the dataset are removed because they are not usable for identifying network intrusions; those columns are the source IP address, source port number, destination IP address, and destination port number. Due to the low variance of the dataset, MinMax scaling is applied for feature normalization, as given in Eq. 1.

$$\begin{aligned} X_{norm}=\frac{X_i - X_{min}}{X_{max} - X_{min}} \end{aligned}$$
(1)

The original value of the feature is denoted by \(X_i\); the feature's minimum \(X_{min}\) is subtracted from it, and the result is divided by the difference between the feature's maximum and minimum.
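A minimal preprocessing sketch of these steps, assuming pandas and scikit-learn and the published dataset file name; the column handling is illustrative rather than the authors' exact pipeline:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Load the training partition (file name as published with UNSW-NB15).
df = pd.read_csv("UNSW_NB15_training-set.csv")

# Remove NaN values and identical (duplicate) instances.
df = df.dropna().drop_duplicates()

# MinMax scaling per Eq. 1: X_norm = (X_i - X_min) / (X_max - X_min).
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
```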

CAD for Anomaly Detection

CAD comprises data analysis, preprocessing, feature reduction, classification of anomalies from regular files through machine learning and deep learning techniques, and then categorizing them. Figure 1 presents the proposed approach for classifying and categorizing network anomalies.

Fig. 1

Workflow Model Depiction of Proposed Approach for Network Graph Data based Anomalies Detection

Binary Anomaly Detection

We review and test different conventional machine learning classifiers. Based on the evaluation results, the most proficient combination is chosen for incorporation into an ensemble model. The evaluated models are as follows:

Decision Tree (DT) classifier works on the principle of supervised learning. A DT can take continuous or series-valued inputs and thus produce a series of predicted results in a continuous manner [36]. DT performance is based on entropy, as shown in Eq. 2, in which \(p_i\) represents the probability of class i and E(S) represents the entropy of the entity S. The lower the entropy, the better the performance.

$$\begin{aligned} E(S) = \sum \limits _{i=1}^{c}-p_i \log _2 p_i \end{aligned}$$
(2)

We tune the attributes to optimize the model’s performance for anomaly detection. The following parameters are set to improve the performance of DT for anomaly detection: criterion=’gini’, max_depth=10, random_state=0, splitter=’best’, min_samples_split=2, min_samples_leaf=1.
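As a sketch, this configuration maps directly onto scikit-learn (assumed library; X_train and y_train are assumed to come from the preprocessed split):

```python
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    criterion="gini",      # impurity measure (cf. entropy in Eq. 2)
    splitter="best",
    max_depth=10,          # limit tree depth to curb overfitting
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
dt.fit(X_train, y_train)
```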

Random Forest (RF) classifier works on the ensemble learning methodology for classification, regression, and similar tasks by developing multiple trees during training and outputting the predicted class. The prediction is calculated by taking the mode of the target classes from the independent trees [37].

$$\begin{aligned} MSE=\frac{1}{N}\sum \limits _{i=1}^N (f_i-y_i)^2 \end{aligned}$$
(3)

In Eq. 3, the number of data points is denoted by N, \(y_i\) is the actual value of data point i, and \(f_i\) denotes the value returned by the classifier. The following parameters are configured to tune the RF model: bootstrap=true, criterion=’gini’, min_samples_leaf=1, min_samples_split=2, n_estimators=100, and random_state=0.
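The corresponding scikit-learn configuration, a sketch under the same assumptions as above:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,    # trees whose mode (majority vote) is taken
    criterion="gini",
    bootstrap=True,      # bagging: each tree sees a bootstrap sample
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=0,
)
rf.fit(X_train, y_train)
```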

Gradient Boosting (GB) classifier is a combination of machine learning classifiers that integrates weaker models to create a more robust predictive model [38]. Gradient boosting uses weak predictors in a decision-tree format to build an ensemble structure for better accuracy in regression and classification problems.

$$\begin{aligned} F_{0}(x) = \mathop {\arg \min }\limits _{\gamma } \sum \limits _{i=1}^{n}L(y_{i},\gamma ) \end{aligned}$$
(4)

In Eq. 4, \(F_0(x)\) is the constant function, y is the observed value, and \(\gamma\) is the real value in the loss function L. The following parameters are configured to tune the GB model: learning_rate=0.01, n_estimators=100, random_state=1, subsample=1, criterion=friedman_mse, max_depth=3, validation_fraction=0.1.
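In scikit-learn terms, a sketch under the same assumptions:

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    learning_rate=0.01,        # small steps along the gradient
    n_estimators=100,          # number of boosting stages
    subsample=1.0,
    criterion="friedman_mse",
    max_depth=3,               # shallow trees as weak learners
    validation_fraction=0.1,
    random_state=1,
)
gb.fit(X_train, y_train)
```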

Extreme Gradient Boosting (XGB) classifier inherits most of its features from GB but uses the 2\(^{nd}\)-order derivative for approximation [39].

$$\begin{aligned} F_m(x)\leftarrow F_{m-1}(x)+\gamma _{m} h_{m}(x) \end{aligned}$$
(5)

Equation 4 is extended to Eq. 5, in which m represents the number of iterations, \(h_m(x)\) is the fit on the gradient resulting from each iteration, and \(\gamma _m\) is the multiplicative factor. The following parameters are configured to tune the XGB model: max_depth=10, objective=multi:softmax, num_class=2, n_gpus=1, sampling_method=uniform, tree_method=auto, max_bin=256.
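A sketch with the xgboost package (assumed); exact parameter acceptance varies by xgboost version, and n_gpus belongs to older releases, so it is omitted here:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    max_depth=10,
    objective="multi:softmax",  # softmax over num_class outputs
    num_class=2,
    sampling_method="uniform",
    tree_method="auto",
    max_bin=256,
)
xgb.fit(X_train, y_train)
```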

Logistic Regression (LR) classifier is a statistical learning technique categorized under supervised ML methods and dedicated to classification tasks.

$$\begin{aligned} g(E(y)) = \alpha +\beta x_{1} +\gamma x_{2} \end{aligned}$$
(6)

In Eq. 6, g() is the link function, E(y) is the expectation of the target variable, and \(\alpha + \beta x_1 + \gamma x_2\) is the linear predictor. \(\alpha\), \(\beta\), and \(\gamma\) are the parameters to be estimated. The link function combines the expectation E(y) with the linear predictor. The following parameters are configured to tune the LR model: penalty=l2, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class=auto, solver=liblinear.

Stochastic Gradient Descent (SGD) classifier takes only one random point while updating the weights. It is more useful when working with datasets of larger size [40].

$$\begin{aligned} \Theta _{1} = \Theta _{1}-\alpha \frac{\partial }{\partial \Theta _{1}}C(\widehat{y}_{i},y_{i}) \end{aligned}$$
(7)

Equation 7 represents the standard update rule of SGD, in which \(\Theta _1\) is the parameter, \(\widehat{y}_i\) is the model prediction, and \(y_i\) is the label in the supervised dataset. The following parameters are configured to tune the SGD model: loss=hinge, penalty=l2, fit_intercept=True, max_iter=1000, learning_rate=optimal, early_stopping=False.

Ridge classifier converts the labeled data into the range \([-1,1]\). The model outputs the final prediction based on the highest value attained during prediction. The following parameters are configured to tune the Ridge model: normalize=False, fit_intercept=True, solver=auto.

$$\begin{aligned} \widehat{\beta }^{ridge} = \mathop {\arg \min }\limits _{\beta \in \mathbb {R}^{p}}\sum \limits _{i=1}^{n}(y_i-x_{i}^{T}\beta )^2 + \lambda \sum \limits _{j=1}^{p}\beta _{j}^{2} \end{aligned}$$
(8)

Equation 8 defines the standard equation of the ridge classifier. The equation has two segments: the first, before the addition sign, denotes the least-squares loss, and the second denotes \(\lambda\) times the sum of \(\beta _{j}^{2}\), where \(\beta\) is the coefficient vector. Several studies suggest utilizing ensemble methods to obtain better final predictions [41, 42]. In this research, the following machine learning classifiers are combined as three layers to form a meta-classifier: 1) Stochastic Gradient Descent, 2) Logistic Regression, and 3) Ridge classifier, to analyze the selected features of the UNSW dataset and detect anomalies.
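The three base learners, configured as listed above, can be sketched in scikit-learn (assumed; the deprecated normalize flag of the Ridge classifier is omitted):

```python
from sklearn.linear_model import (SGDClassifier, LogisticRegression,
                                  RidgeClassifier)

sgd = SGDClassifier(loss="hinge", penalty="l2", fit_intercept=True,
                    max_iter=1000, learning_rate="optimal",
                    early_stopping=False)
lr = LogisticRegression(penalty="l2", fit_intercept=True,
                        intercept_scaling=1, max_iter=100,
                        multi_class="auto", solver="liblinear")
ridge = RidgeClassifier(fit_intercept=True, solver="auto")

for clf in (sgd, lr, ridge):
    clf.fit(X_train, y_train)
```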

Ensemble Model: Let D denote the dataset containing instances \(I = \{i_1, i_2, \dots , i_n\}\). CP represents each classifier’s confidence prediction, and CT represents the targeted confidence threshold against which the CP of each classifier is evaluated. PL represents each classifier’s predicted label, and ATL denotes all target classes. Here \(sgd_{acc}\) denotes the accuracy score of the SGD classifier, rc denotes the accuracy score of the ridge classifier, and lr denotes the accuracy of logistic regression. The notation IC represents the instance count of each class, whereas the total count over classes is denoted by ICC, which is incremented when a particular classifier votes in favor of a class’s prediction. The prediction of each of the three evaluated models on every instance I is the input given to the voting classifier, which evaluates the prediction as anomalous or regular and adds it to IC. The ICC confidence and TL are then estimated, and each classifier casts its prediction to the vote counter. The ground-truth comparison threshold is configured at \(80\%\) to compare the certainties. When classes receive an equal number of votes, either one can be selected arbitrarily as the classification result; if the CL value is greater than the initialized threshold, the target class is selected as the resulting label of that instance. The proposed ensemble-based machine learning model is detailed in Algorithm 1.


Algorithm 1 Ensemble Machine Learning Classifier
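A minimal sketch of the voting logic just described, under the assumption that confidence for models lacking predict_proba is approximated from the decision-function margin; the helper names are illustrative, not the authors' exact implementation:

```python
import numpy as np
from collections import Counter

CT = 0.80  # confidence threshold from the description above

def confident_vote(clf, x):
    """Return (predicted label, confidence CP) for one instance x."""
    if hasattr(clf, "predict_proba"):          # e.g. LogisticRegression
        proba = clf.predict_proba(x)[0]
        return clf.classes_[np.argmax(proba)], float(np.max(proba))
    # SGD (hinge) and Ridge expose only a margin; squash it into (0, 1).
    score = float(clf.decision_function(x)[0])
    return clf.classes_[int(score > 0)], 1.0 / (1.0 + np.exp(-abs(score)))

def eml_predict(classifiers, x):
    votes = Counter()
    for clf in classifiers:
        label, cp = confident_vote(clf, x)
        if cp >= CT:                           # count only confident votes
            votes[label] += 1
    if not votes:                              # no confident vote: majority
        votes = Counter(clf.predict(x)[0] for clf in classifiers)
    return votes.most_common(1)[0][0]          # ties resolved arbitrarily

# Usage over a test matrix X_test:
# y_pred = [eml_predict((sgd, lr, ridge), xi.reshape(1, -1)) for xi in X_test]
```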

Multiclass Anomaly Detection

The multi-layer structure of the CNN processes the input to produce the desired outcome [43]. CNN requires less preprocessing in the initial phase than other classification techniques, as it comprises a dense neural network based on multiple layers, as depicted in Fig. 1.

$$\begin{aligned} h_t = H(W_{hx}x_t + W_{hh}h_{t-1} + b_h) \end{aligned}$$
(9)
$$\begin{aligned} y_t = W_{hy}h_{t} + b_y \end{aligned}$$
(10)

Equations 9 and 10 represent the core computations of the LSTM-based neural network, where \(x_t\) denotes the input time series, \(y_t\) denotes the output time series, \(h_t\) indicates the hidden memory cells, W indicates the weight matrices, and b indicates the bias vectors. The hidden state of the memory cells is calculated in Eqs. 11, 12, 13, 14 and 15, where \(i_t\) represents the input gate, \(f_t\) the forget gate, \(c_t\) the cell state, and \(o_t\) the output gate. The cell state carries cumulative information of the sequence data from one time step to the next until the end of the sequence. Based on these gates, the hidden state is calculated. The cell state passes through a ‘tanh’ function that squashes all feature values to between -1 and 1, enabling the model to decide on the labels.

$$\begin{aligned} i_t = \sigma (W_{ix}x_t + W_{hh}h_{t-1} + W_{ic}c_{t-1} + b_i) \end{aligned}$$
(11)
$$\begin{aligned} f_t = \sigma (W_{fx}x_t + W_{hh}h_{t-1} + W_{fc}c_{t-1} + b_f) \end{aligned}$$
(12)
$$\begin{aligned} c_t = f_t * c_{t-1} + i_t *g(W_{cx}x_t + W_{hh}h_{t-1} + W_{cc}c_{t-1} + b_c) \end{aligned}$$
(13)
$$\begin{aligned} o_t = \sigma (W_{ox}x_t + W_{hh}h_{t-1} + W_{oc}c_{t-1} + b_o) \end{aligned}$$
(14)
$$\begin{aligned} h_t = o_t * h(c_t) \end{aligned}$$
(15)

The following configuration is used for CNN-LSTM to detect anomalies: two 1-dimensional convolutional layers with filter configurations \(32*3\) and \(64*3\) respectively, activation=relu, padding=causal, a max-pooling layer with pool size 2, recurrent_dropout=0.1 for the LSTM layer, a flatten layer, the Root Mean Square Propagation (RMSProp) optimizer with learning rate=0.005, loss=binary_crossentropy, validation_split=0.33, batch_size=2048, and epochs=8.
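A Keras sketch of this configuration (TensorFlow assumed; the LSTM width, feature count, and class count are placeholders, and a categorical loss is substituted for the multiclass head since the listed binary_crossentropy applies to the binary case):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, LSTM,
                                     Flatten, Dense)
from tensorflow.keras.optimizers import RMSprop

n_features, n_classes = 42, 10   # illustrative values only

model = Sequential([
    Conv1D(32, 3, activation="relu", padding="causal",
           input_shape=(n_features, 1)),             # first stage (32*3)
    Conv1D(64, 3, activation="relu", padding="causal"),  # second (64*3)
    MaxPooling1D(pool_size=2),
    LSTM(64, recurrent_dropout=0.1, return_sequences=True),
    Flatten(),                                # flstm in Algorithm 2
    Dense(n_classes, activation="softmax"),   # final dense output layer
])
model.compile(optimizer=RMSprop(learning_rate=0.005),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Reshape the 2-D feature matrix D2 into the 3-D input D3 expected by
# Conv1D, then train as configured above:
# D3 = D2.reshape((D2.shape[0], n_features, 1))
# model.fit(D3, y_onehot, validation_split=0.33, batch_size=2048, epochs=8)
```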

The working of the whole multiclass CNN-based LSTM is given in Algorithm 2. Let D represent the dataset containing instances \(I= \{i_1, i_2, \dots , i_n\}\) and LE represent the label-encoding transformer function that changes the labels into 1-dimensional vectors V. The mean \(\mu\) is subtracted from the data for normalization, and the variance \(\sigma\) is then normalized. The data is then converted into a 2D matrix; the NumPy library is used for this operation. A Gaussian variable is used to initialize the weights. L denotes the total number of layers, n denotes the total number of features, and W denotes the weight matrix; \(x*y\) denotes the dimensions of the generated weight matrix. \(D_{2}\) denotes the 2D matrix containing the training data, which is further processed into a 3D matrix, \(D_{3}\); this step uses the reshape function to prepare the input for the CNN model. Two stages are used for feature extraction by applying \(32*3\) and \(64*3\) filters. The feature map F generated by the CNN model is converted into a 1-dimensional vector V after a max-pooling layer. V is the feature vector fed to the LSTM layer as input. The LSTM output is forwarded to the flatten layer, which converts it into a 1-dimensional array, denoted flstm. This 1-dimensional data passes to the dense layer as input to predict the target labels. Finally, the model is trained for eight epochs; at every epoch, the LSTM model updates its weights to improve accuracy. Training loss, validation loss, training accuracy, and validation accuracy are measured after every epoch.


Algorithm 2 Multiclass CNN based LSTM

Results and Evaluations

We use the following computing environment for the experiments: Windows 10 Professional 20H2, an Intel(R) Core(TM) i7-6700HQ CPU, 16 GB RAM, an NVIDIA GeForce 1060 GPU, CUDA 9.0, and Python 3.8.

Binary Anomaly Detection

Table 1 provides an overview of the overall results achieved by each classifier for cloud-based anomaly detection. The DT classifier achieved an accuracy of \(91.86\%\), a Precision of \(91.65\%\), a Recall of \(99.78\%\), and an F1-score of \(95.54\%\); due to the complex and high-dimensional data, DT does not reach an acceptable accuracy score. The XGB method yielded an accuracy of \(93.34\%\), a Precision of \(93.38\%\), a Recall of \(99.59\%\), and an F1-score of \(96.39\%\). XGB handles irrelevant features without affecting prediction performance by implementing decision trees with boosted gradients, and thus yields better results than the decision tree. Applying the RF method yielded an accuracy of \(93.57\%\), a Precision of \(93.59\%\), a Recall of \(99.63\%\), and an F1-score of \(96.52\%\). Random forest is a bagging algorithm, which reduces the variance of data; since the variance in the UNSW data is high, RF improves accuracy, Precision, Recall, and F1-score more than the others. Applying the GB method to the training features achieved an accuracy of \(94.40\%\), a Precision of \(94.72\%\), a Recall of \(99.36\%\), and an F1-score of \(96.99\%\) on binary classification.

The UNSW dataset used to train the EML contains 175,341 records in the training set, as shown in Table 1. The model is evaluated on a testing set containing 82,332 records. Applying the ensemble method to stochastic gradient descent, logistic regression, and the ridge classifier achieved an accuracy of \(97.06\%\), a Precision of \(98.39\%\), a Recall of \(98.45\%\), and an F1-score of \(98.45\%\) on binary classification.

Figure 2 shows the graphical representation of performance metrics comparison between the evaluated machine learning algorithms and the proposed ensemble machine learning model, as discussed in the above sections.

Fig. 2

Performance Metrics Comparison of Machine Learning Algorithms

Figure 3 shows the graphical representation of the Receiver Operating Characteristic (ROC) curve of the machine learning models used in the EML approach, including logistic regression, ridge classifier, and stochastic gradient descent. Combining these algorithms, the ensemble machine learning approach is also depicted in the graph in terms of ROC.

Fig. 3

ROC curve of the Proposed Approach

Fig. 4

Confusion Matrix of Machine Ensemble Model

Figure 4 represents the confusion matrix generated from the evaluation of our ensemble model. It depicts instances considered network-based anomalies or wrongly identified as other classes. There are 1258 instances wrongly identified as anomalous, while 1155 instances are wrongly identified as normal.

Table 1 Achieved Results (%) using binary classification of anomalies
Fig. 5

Graphical Representation of Performance Evaluation of Deep CNN-LSTM

Multiclass Anomaly Categorization

In this section, the evaluation of Deep CNN-LSTM, the second segment of the proposed approach, is discussed. Figure 5a depicts the accuracy of the model during the training and test phases over the epochs; the model shows the highest accuracy near the second epoch. Figure 5b shows the loss trend over the successive epochs during the training and test phases.

Figure 6 depicts the confusion matrix of the Deep CNN-LSTM model. The confusion matrix depicts the model’s accuracy in identifying the class of the type of attack and shows that the model delivers excellent performance in identifying the correct attack class.

Fig. 6

Confusion Matrix of CNN-LSTM Evaluation

Comparative Analysis

Table 2 presents a comparative analysis comprising results from the articles [25, 33, 44,45,46]. In 2018, the authors of [44] utilized a random forest classifier to achieve an accuracy of \(97.49\%\) with a Precision of \(97.75\%\). Similarly, in 2019, [46] and [25] presented research that improved on the previous results: [46] used the J48 decision tree algorithm to achieve \(98.71\%\) accuracy, and [25] used a DNN algorithm to achieve \(99.16\%\) accuracy. Over the years, much research has been conducted; most recently, in 2020, [33] achieved \(99.25\%\) accuracy by utilizing the Extra Trees algorithm with an F1-score of \(92\%\). In this research, we worked on the original files of the dataset for classification purposes and achieved an accuracy of \(97.06\%\) with an F1-score of \(98.45\%\) by utilizing the CAD method.

Table 2 Comparative Analysis

Figure 7 shows the graphical comparison of performance metrics between the proposed approaches: the ensemble machine learning model and the CNN-LSTM model. The graph shows that the ensemble machine learning model performs better than the CNN-LSTM model in detecting attacks in the binary outcome category in terms of Precision, F1-score, and Recall. However, the CNN-LSTM model attains better accuracy than the EML model.

Fig. 7

Performance Metrics Comparison of Proposed Approaches

Discussion

Network connectivity is one of the essential features of the digital world, since this medium connects the world by all means. With this digital advancement, hackers have discovered and exploited vulnerabilities in networking systems. However, AI-based cybersecurity solutions counter those attacks with proficiency. In this research, the UNSW-NB15 dataset is analyzed through preprocessing and feature extraction, and the data is divided into training (\(80\%\)) and testing (\(20\%\)) samples; this ratio is already given in the original files of the UNSW-NB15 dataset. Different machine learning classifiers are then trained on the dataset, and an ensemble method is used to improve the final prediction and bring out optimal results. In this experiment, the CAD classification technique comprises the best combination of layers of the SGD, Ridge, and Logistic Regression machine learning algorithms to achieve the highest accuracy in final predictions. Our approach CAD achieves the highest accuracy of \(97.06\%\) with an F1-score of \(98.45\%\). Other algorithms also achieved good results: the Gradient Boosting classifier achieved \(94.4\%\) accuracy with an F1-score of \(96.99\%\), and the XGBoost and Random Forest classifiers achieved \(93\%\) accuracy with a \(96\%\) F1-score. The lowest accuracy in this experiment was achieved by the Decision Tree classifier, at \(91.86\%\) with an F1-score of \(95.54\%\). This approach relies on the original files of the dataset, which is the main contribution of this work.

Conclusion and Future Work

In this paper, a novel AI-based technique named CAD is proposed, composed of an ensemble machine learning model and a deep learning-based CNN-LSTM technique, to efficiently detect and classify anomalies using the state-of-the-art dataset UNSW-NB15. This dataset contains all sorts of critical attacks regarded as harmful to systems. The proposed approach CAD achieved the highest accuracy of \(97.06\%\) with precision, recall, and F1-score of \(98.38\%\), \(98.45\%\), and \(98.45\%\), respectively. In the future, we plan to combine the UNSW-NB15 dataset with other anomaly- and signature-based datasets and filter the combined dataset down to critical features to enhance the performance of automated anomaly detection systems. To keep pace with advances in computing [47], as well as to respond effectively to matured offensive techniques, timely detection has become of utmost importance. Cloud computing makes vast computing power available to researchers [48], which can be studied and utilized to integrate trained cloud-based models that provide anomaly detection as a service. For such a global anomaly detection mechanism, federated learning can be utilized to fight various security events such as spam, anomalies, behavioral-based security threats, and other network-based attacks. Explainable artificial intelligence (XAI) can be an excellent option for understanding why a model made a decision [12]. Furthermore, fog computing and serverless computing can be used to reduce latency and improve privacy [49].