Advanced Persistent Threat Identification with Boosting and Explainable AI

Advanced persistent threats (APTs) are a serious concern in cyber-security that have matured and grown over the years with the advent of technology. The main aim of this study is to establish an effective identification model for APT attacks in order to prevent and reduce their influence. Machine learning has both the potential and a substantial track record in detecting and predicting cyber-security threats, including APTs. This study utilized several boosting-based machine learning methods to predict various types of APT attacks prevalent in the cyber-security domain. Furthermore, Explainable Artificial Intelligence (XAI) was coupled with the predictions to provide actionable insights to domain stakeholders and practitioners. The results, particularly XGBoost with a weighted F1 score of 0.97 and SHapley Additive exPlanations (SHAP)-based explanations, show that boosting methods and machine learning models paired with XAI are indeed promising for cyber-security-related dataset problems, and can be extrapolated toward new avenues of challenging research by effectively deploying boosting-based XAI models.


Introduction
The advent of information technology posed great challenges pertaining to data security and privacy, which led to the emergence of the field of cyber-security in this decade [1]. An advanced persistent threat (APT), one of the major cyber-security issues to emerge in this decade, can be defined as a long-term, sophisticated cyber exploitation aimed mainly at governments but recently diversified to several other domains [2]. An APT is stealthy in nature, remains undetected for a prolonged period of time, and is utilized by stealthy threat actors for political or economic influence or for monetary gain [3].
Machine learning became one of the important tools for detecting security threats in the cyber domain with the explosion of data in this century [4]. In particular, intrusion detection systems (IDS) were developed with the power of machine learning in mind to identify unwanted intrusions. Furthermore, APTs, being one of the sub-domains of cyber-security threats, gained much attention from machine learning practitioners, which led to the emergence of machine learning as a constituent solution for detecting APTs [3,4].
Friedberg and colleagues [5] worked extensively on a theoretical framework for anomaly detection which learns over time and reports anomalies that differ from the model. Furthermore, Siddiqui and his team [6] developed a fractal-based anomaly detection system and compared it against traditional machine learning methods, improving false-positive and false-negative rates while retaining better classification accuracy. Siddiqui's work [6] can be further corroborated with the work carried out by Ghafir's team [3], who reported a prediction accuracy of 84.8% with a true-positive rate (TPR) of 81.8% and a false-positive rate (FPR) of 4.5% for their novel machine learning-based system, MLAPT, which utilized ensemble and support vector machine (SVM)-based models. Their system provided three major contributions: threat detection, alert correlation, and attack prediction. Similar research was carried out in [7], where the researchers developed TerminAPTor, an Information Flow Tracking (IFT)-based APT detection system that uncovers the chains of traces left by attackers through the several stages of an attack campaign. Although their system had admirable accuracy and TPR, the FPR minimization achieved by Ghafir's team [3] was more commendable. Issues regarding existing solutions, namely their reliance on static data, longer training times, and the need for complete retraining for each new APT sample or class, are discussed by Laurenza and his colleagues [8]. Laurenza's team opted for a solution achieving precision and accuracy over 90% by moving from multi-class classification to a group of single-class classifiers, thus reducing runtime significantly and allowing a higher level of modularity [8]. Neuschmied et al. [9] applied several autoencoder-based IDS methods to the detection of such attack patterns, using statistical analysis to understand the features that supplemented the anomaly detection and IDS workflow. Explainable artificial intelligence (XAI) was explored in [10], where the proposed mechanism provided insights and interpretations for designing the defense strategy and resource allocation scheme of an edge defender based on an edge Bayesian Stackelberg game and cyber threat intelligence (CTI) to detect APTs.
In this work, we have explored network security threats pertaining to advanced persistent threats in depth and propose a novel approach that combines boosting-based ensemble methods and XAI to detect, identify, report, and explain APT-based attacks.
Hence, with a strong basis of introductory analysis and background in the section "Introduction", the organization of our research work is as follows. The section "Related Work" provides background by surveying related work on this topic. The section "Materials and Methods" dives deep into the materials and methods employed in the study, including the dataset description, theoretical and practical information regarding data pre-processing, boosting-based ensemble methods, and the rationale behind utilizing them. The section "Results and Discussion" encompasses the results and discussion, including the theoretical framework of model evaluation followed by experimental result analysis and visualization through XAI. Finally, the section "Conclusion" summarizes and concludes our research work, providing comprehensive insights into its contribution and noting limitations for practitioners in this field to take the research forward.

Related Work
Intrusion detection systems (IDS) have garnered formidable importance in the cyber-security world with the advent of information and technology [11]. While developments have been made, challenges pertaining to dataset, techniques, and approaches have attracted researchers in the cyber-security world to advance the research field further [12]. Particularly, the field of machine learning and deep learning evolved quickly to aid the tasks of IDS, thus contributing broadly to the ever-changing cyber world [13].
Advanced persistent threats (APTs), a subset of the threats addressed by IDS [14], can be described as the work of a stealthy threat actor, usually a nation-state or a state-sponsored group, that gains unauthorized access to a network and remains undetected for a prolonged span of time. APTs remain a center of attention and concern for governments and companies, as history has shown that unauthorized and undetected access by adversaries has resulted in unwanted outcomes, of which political, economic, and socio-economic shifts are noteworthy. Various detection methods exist for dealing with APT attacks, of which machine learning and deep learning methods have garnered attention due to their ability to provide actionable insights from large amounts of data, an ability often absent in generic algorithms. Apart from ML approaches, honeypots are utilized by Saud and Islam [15] to detect APTs, using KFSensor for the architecture. Han's team [16] proposed a novel APT malware identification framework named APTMalInsight based on system calls and an ontology knowledge framework. Similarly, the Holmes framework was proposed in [17], correlating suspicious information flows. Niu's team [18] utilized mobile DNS logging for APT detection tasks. Network traffic behavior can be modeled with ML models for anomaly detection, subsequently reducing false-positive alarms and detecting and eliminating threats in real-time scenarios. ML has attracted researchers' note also because the same ML models can be used to create the attacks [3]. A contribution to the body of research on APTs was made by Myneni's team [19], who contributed a benchmark APT dataset named DAPT-2020. Their work also discussed several limitations of generic APT datasets.
The authors, through experimentation with a semi-supervised approach, reported class imbalance in their dataset that ultimately led their model to perform poorly at detecting attacks. The problem of class imbalance was also found in another similar benchmark contribution, SCVIC-APT-2021 [20], where the authors had better luck with ensemble methods and proposed a machine learning-based Attack Centric Method (ACM) to evaluate model performance on the contributed dataset. Although the research in [19] showed relatively poor performance, Liu and his team's work [20] outperformed the baseline models with a maximum macro-average F1 score of 82.27%, corresponding to a 9.4% improvement with respect to the baseline. Work on Liu's benchmark dataset [20] was carried out in [21], where the authors proposed a machine learning-based model named Prior Knowledge Input (PKI). PKI utilizes unsupervised clustering to pre-classify the original dataset and obtain prior knowledge, which is then incorporated into the supervised model to minimize training complexity. The authors reported a best macro-average F1 score of 81.37%, which is 10.47% higher than the baseline results.
The motivation of our research can be drawn from several factors related to our contribution to the body of knowledge on cyber-security. First, the field of APT research receives much attention but has few resources to tackle its challenges, as discussed in the related work. Second, APT is of great importance in the cyber-security domain: the global anomaly detection market was projected to grow to $4.47 billion in 2022 at a compound annual growth rate (CAGR) of 16.7%, and subsequently to $8.0 billion by 2026 at a CAGR of 15.7% [22]. Third, machine learning methods, particularly boosting methods, are good at feature understanding and are generally resilient to overfitting. This third point, coupled with XAI, would in our view give readers a new view of APT detection that provides not only insights but also interpretability and explainability, which are often absent in the cyber-security domain. At length, our scope of work lies in exploring machine learning models, specifically boosting methods, and coupling the results with XAI to provide explainable insights to researchers and other stakeholders in this domain. The next section, Materials and Methods, explains the boosting methods in detail and their usability in detecting APTs.

Dataset Description
The dataset SCVIC-APT-2021 [23] is taken from the research work carried out in [20] and is one of the latest (2022) benchmark datasets for detecting advanced persistent threats (APTs) in network traffic. The dataset contains 315,607 rows of data and 84 features in total. As per the dataset description, the target label consists of six classes: data exfiltration, initial compromise, lateral movement, normal traffic, reconnaissance, and pivoting. These follow the global knowledge base of adversary tactics and techniques, thus serving as the building blocks for the selection of common attack techniques.

Data Pre-processing
The dataset contained several columns, such as identity, ports, and protocols, which apart from serving as identifiers did not have any substantial value for the classification problem; this ultimately led us to drop several columns, leaving 77 features in total to work with. The dataset contained infinity values as well as null and NaN values, which were treated by imputing 0 to remove any unwanted values, followed by treating the duplicate rows. Although the dataset was cleaned through the preceding approaches, the next data pre-processing challenge was processing the categorical data, for which we opted for label encoding. For data scaling, we used min-max normalization. The formula for min-max normalization is provided in Eq. 1:

x' = (x − x_min) / (x_max − x_min). (1)

There remains a substantial debate as to whether to use normalization or standardization; we observe that data scaling is particularly important for distance-based algorithms, which are substantially absent from our experimentation. Furthermore, the feature values and their ranges are diverse with few outliers, which led us to choose normalization over standardization.
Eventually, we had 206,055 rows and 77 columns to work with in our dataset, which was split into a standard 80-20 train-test split to carry out our experiment.
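The pre-processing pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data; column names such as "Flow ID" are placeholders for the identifier columns we dropped, not the dataset's exact schema.

```python
import numpy as np
import pandas as pd

def preprocess(df, drop_cols, label_col):
    """Clean, label-encode, and min-max scale a raw traffic dataframe."""
    df = df.drop(columns=drop_cols)                  # identifiers carry no signal
    df = df.replace([np.inf, -np.inf], 0).fillna(0)  # treat inf / NaN by imputing 0
    df = df.drop_duplicates().reset_index(drop=True)
    y, classes = pd.factorize(df[label_col])         # label encoding for the target
    X = df.drop(columns=[label_col]).astype(float)
    denom = (X.max() - X.min()).replace(0, 1)        # guard against zero range
    X = (X - X.min()) / denom                        # min-max normalization (Eq. 1)
    return X, y, classes

# Toy frame standing in for the real 84-column dataset
raw = pd.DataFrame({
    "Flow ID": [1, 2, 3, 4, 4],
    "Idle Max": [0.0, np.inf, 5.0, 2.0, 2.0],
    "Fwd Header Length": [10, 20, np.nan, 40, 40],
    "Label": ["NormalTraffic", "Pivoting", "NormalTraffic",
              "Reconnaissance", "Reconnaissance"],
})
X, y, classes = preprocess(raw, drop_cols=["Flow ID"], label_col="Label")

# Standard 80-20 split by shuffled index, mirroring the paper's setup
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
X_train, X_test = X.iloc[idx[:cut]], X.iloc[idx[cut:]]
```

After these steps every feature lies in [0, 1] and the duplicated row is gone, which matches the row reduction reported above.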

Boosting Algorithms
Boosting can be defined as a method of converting weak learners into strong learners. Several boosting-based classification algorithms were utilized in the prediction of APTs. The following subsections discuss the mathematical background of the boosting algorithms employed in our research work.

AdaBoost
AdaBoost defines weakness by the weak estimator's error rate [24]. In each iteration, AdaBoost identifies misclassified data points, decreasing the weights of correctly classified points and increasing those of misclassified points, so that the succeeding classifier pays extra attention to getting them right. The AdaBoost procedure is given below in Algorithm 1.
Algorithm 1: AdaBoost [24]
  Initialize weights w_1, w_2, ..., w_n = 1/n
  for i in [1, W], where W is the number of weak classifiers:
    fit weak classifier C_i with the current sample weights
    compute the weighted error of C_i and its vote weight α_i
      (a low error leads to a large α_i, i.e., higher importance in the voting)
    update the sample weights
  Output: the class with the highest weighted vote
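Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative from-scratch implementation with decision stumps as the weak classifiers, not the tuned library estimator used in our experiments:

```python
import numpy as np

def fit_stump(X, y, w):
    """Best weighted decision stump (feature, threshold, polarity); y in {-1,+1}."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] <= t, pol, -pol)
                err = w[pred != y].sum()   # weighted error rate
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def adaboost(X, y, rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initial weights: 1/n
    stumps = []
    for _ in range(rounds):
        j, t, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # low error -> large alpha (vote)
        pred = np.where(X[:, j] <= t, pol, -pol)
        w *= np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
        w /= w.sum()
        stumps.append((j, t, pol, alpha))
    return stumps

def predict(stumps, X):
    score = sum(a * np.where(X[:, j] <= t, pol, -pol) for j, t, pol, a in stumps)
    return np.sign(score)
```

Each round re-fits a stump under the updated weights and contributes an alpha-weighted vote, exactly the loop Algorithm 1 describes.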

Gradient Boosting
Gradient boosting takes a slightly different approach. Instead of increasing or decreasing sample weights, gradient boosting fits each new learner to the difference between the ground truth and the current prediction [25]. The algorithm for gradient boosting is given in Algorithm 2.

Rationale for Using Boosting for Experimentation
We used the aforementioned boosting algorithms in our experimentation. To strengthen the case for selecting boosting, a pros and cons analysis is provided in Table 2.
As boosting is fundamentally an ensemble-based approach, it is comparatively explainable and interpretable with admirable benchmark performance. Boosting methods are usually not susceptible to overfitting, giving them stronger predictive power and generally better performance than bagging methods. However, boosting methods are sensitive to outliers and have scalability issues. As our pre-processing revealed that the dataset does not have many outliers and is of medium size, we finally opted for boosting methods after weighing all the pros and cons through a strength and weakness analysis.

Model Evaluation
Several performance measures were employed for evaluating our boosting-based models. A confusion matrix was generated for this multi-class classification problem, providing the basic counts from which performance measures are derived: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) [27]. From these values, accuracy, precision, recall, and F1 score were used for evaluating the models [28], as given in Eqs. 2-5, respectively:

Accuracy = (TP + TN) / (TP + FP + TN + FN), (2)
Precision = TP / (TP + FP), (3)
Recall = TP / (TP + FN), (4)
F1 = 2 · (Precision · Recall) / (Precision + Recall). (5)

Our dataset inherently poses a class imbalance problem, and while measures were taken in Materials and Methods to address the imbalance, the performance measures are also expected to be robust to such irregularities. While the F1 score is a sufficient metric in that it merges precision and recall into a more interpretable quantity, a formidably reliable performance measure is the Matthews correlation coefficient (MCC) [29], which is preferred over the F1 score because of its balanced assessment of classifiers irrespective of which class is labeled positive or negative. The equation for MCC is provided in Eq. (6):

MCC = (TP · TN − FP · FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)). (6)

In addition, another robust and widely used statistic is Cohen's kappa [30], whose equation is provided in Eq. (7), where p_o is the observed agreement and p_e the agreement expected by chance:

κ = (p_o − p_e) / (1 − p_e). (7)

In our evaluation, we have also incorporated a loss measure, the Hamming loss, which gives the fraction of incorrect labels to the total number of labels [31]. Its formula is provided in Eq. (8):

Hamming loss = (1/N) Σ_{i=1}^{N} 1(ŷ_i ≠ y_i). (8)
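Eqs. 2-8 can be computed directly from the confusion counts. The sketch below implements them for the binary case (class-wise scores in the multi-class setting follow by one-vs-rest counting per class):

```python
import math

def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and MCC (Eqs. 2-6)."""
    acc  = (tp + tn) / (tp + fp + tn + fn)
    prec = tp / (tp + fp)
    rec  = tp / (tp + fn)
    f1   = 2 * prec * rec / (prec + rec)
    mcc  = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, prec, rec, f1, mcc

def cohen_kappa(tp, tn, fp, fn):
    """Eq. 7: agreement beyond chance, from the marginal totals."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n                                            # observed
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2   # by chance
    return (po - pe) / (1 - pe)

def hamming_loss(y_true, y_pred):
    """Eq. 8: fraction of misclassified labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For example, with TP=90, TN=80, FP=20, FN=10 the accuracy is 0.85 and Cohen's kappa is 0.70, showing how kappa discounts the agreement a chance classifier would reach.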

Experimental Results
Our investigation finds that class imbalance had an overall effect on the experimental results. Confusion matrices of the experimented boosting methods are visualized in Fig. 2, whereas overall accuracy is shown in the bar chart of Fig. 1. Class-wise precision, recall, and F1 scores are given in Tables 3, 4, and 5 to examine the results in depth, where DE = data exfiltration, IC = initial compromise, LM = lateral movement, NT = normal traffic, R = reconnaissance, and P = pivoting. Class-wise precision (Table 3) shows that the GradBoost, AdaBoost, and LightGBM models performed very poorly, whereas CatBoost and XGBoost had admirable results. The same pattern appears in Tables 4 and 5 for recall and F1 scores, where AdaBoost, GradBoost, and LightGBM again performed very poorly.
In Table 6, we calculated the weighted precision, recall, and F1 score to summarize the models' performance. The weighted F1 scores of GradBoost, AdaBoost, and LightGBM are 96%, 82%, and 89%, respectively; in contrast, XGBoost and CatBoost achieved 97% and 99%. GradBoost reached 98% weighted precision but fell by 4% for weighted recall. CatBoost achieved the same weighted precision and recall score, 99%. Moreover, AdaBoost attained 99% weighted precision but dropped dramatically by 28% for recall, while XGBoost reached the same 97% for both weighted precision and recall. In addition, LightGBM gained 96% weighted precision and 83% recall. Table 7 shows the Matthews correlation coefficient, Cohen's kappa, and Hamming loss scores. According to the Matthews correlation coefficient, GradBoost, AdaBoost, and LightGBM reached 35.68%, 19.24%, and 6.17%, respectively; on the other hand, CatBoost and XGBoost achieved 95.40% and 98.29%. The importance of dealing with class imbalance and feature utilization resonated fully in the experimental results; this can be addressed in future work, where emphasis may be given to sampling techniques to overcome class imbalance problems.

Explainable AI Visualization
Explainability has added new dimensions to machine learning and deep learning research. However, ML and DL systems consist of several types of architecture, many of which are opaque to their users [32]. XAI is now essential because people are influenced by AI decisions, and it is crucial to ensure model transparency, explainability, and trust in AI systems. XAI has various benefits, including model justification, controlling, debugging, model improvement, and knowledge discovery [33]. Because XAI can explain why and how a model made a decision, developers can more easily upgrade the model, improve it, and make it smarter.
Several methods are available for explaining results, of which SHapley Additive exPlanations (SHAP) is a well-known XAI method [34]. Other XAI techniques include LIME, DeepLIFT, and layer-wise relevance propagation. The specialty of SHAP is that it is uniquely determined by additive feature attributions and the Shapley values of cooperative game theory. The LIME model, by contrast, provides predictions based on local approximation and is also an additive feature attribution method [35]. Moreover, DeepLIFT was introduced as a recursive prediction explanation method; it uses the summation-to-delta property, which matches additive feature attribution. In addition, layer-wise relevance propagation (LRP) is equivalent to DeepLIFT.
SHAP utilizes a game theory-based approach to explain the output of a machine learning model, which in our case are the boosting methods utilized in our study and experimentation. It provides an optimal credit allocation alongside local explanations by accumulating the classic Shapley values [35]. Figure 3 visually summarizes each feature's importance: high feature values with large SHAP magnitudes indicate a high contribution, while low values indicate a low contribution, and the color variation of the features indicates the classes. The dataset feature "Idle Max" has high variation and a high impact on the XGBoost model's predictions; on the other hand, "Fwd Header Length" has a high impact on the CatBoost model's predictions, as depicted in the aforementioned figure.
In the XGBoost SHAP values, the "Idle Max" feature produced the maximum influence on class 4, followed by classes 3, 0, 5, 2, and 1. Moreover, the "Fwd Seg Size Min" feature influenced class 3 more than class 1. In the CatBoost SHAP values, the dataset feature "Fwd Header Length" has a higher impact on class 3 than on classes 4, 2, 5, 1, and 0, followed sequentially by other features such as "Fwd Seg Size Min" and "Idle Max".
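The Shapley attribution underlying SHAP can be illustrated with a toy exact computation (an illustrative sketch of the game-theoretic definition, not the TreeExplainer algorithm the SHAP library uses for boosted trees): the value φ_i of feature i averages its marginal contribution v(S ∪ {i}) − v(S) over all subsets S of the other features.

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values for a set-value function v over n features."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                # |S|! (n - |S| - 1)! / n!  -- probability of this ordering
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (v(set(S) | {i}) - v(set(S)))
    return phi

# Toy model f(x) = 3*x0 + 5*x1, explained at x = (1, 1) against baseline 0.
# v(S) = model output with the features outside S held at the baseline.
x = (1.0, 1.0)
def v(S):
    x0 = x[0] if 0 in S else 0.0
    x1 = x[1] if 1 in S else 0.0
    return 3 * x0 + 5 * x1

phi = shapley_values(v, 2)
```

For this additive model the attributions recover each feature's own contribution, and they always sum to the gap between the full prediction and the baseline, the "optimal credit allocation" property mentioned above.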
Figure 4 shows the SHAP force plot for the XGBoost classifier, where features in red are risk factors that push the overall probability up, while features in blue are protective factors that push the probability down.

Conclusion
In this paper, we have proposed a novel advanced persistent threat identification approach using boosting algorithms with explainable artificial intelligence. We extensively investigated the whole dataset and effectively pre-processed it; our study demonstrates that the data-driven pre-processing significantly improved algorithm performance by 4-9%. The study examined various ensemble-based boosting algorithms to mitigate bias and variance problems and to convert weak learners into strong learners. Several boosting algorithms, including gradient, light gradient, categorical, and extreme gradient boosting, were applied and their performance compared. According to the evaluation metrics, the CatBoost and XGBoost algorithms performed better than GradBoost, LightGBM, and AdaBoost. This paper also applied XAI to demonstrate how much a single feature affects the output and to characterize model fairness, transparency, and the subsequent outcomes. A future direction of work is to address the class imbalance problem in this dataset to reduce the excessive FPR and to optimize class-wise precision, recall, and accuracy, thus contributing to the overall performance of the system. Furthermore, research may be carried out to develop more adaptive, AI-based explainable cyber-security techniques that are privacy-preserving.

Availability of Data and Materials
The code is available for further research and repeatability on GitHub: https://github.com/MdMahadiHasan1/Advanced-persistent-threat-identification.

Conflict of Interest
The authors declare that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.