On the detection of lateral movement through supervised machine learning and an open-source tool to create turnkey datasets from Sysmon logs

Smiliotopoulos, Christos; Kambourakis, Georgios; Barbatsalou, Konstantia

doi:10.1007/s10207-023-00725-8

Regular Contribution
Open access
Published: 19 July 2023

Volume 22, pages 1893–1919, (2023)

International Journal of Information Security

Christos Smiliotopoulos¹,
Georgios Kambourakis¹ &
Konstantia Barbatsalou¹

Abstract

Lateral movement (LM) is a principal, increasingly common tactic in the arsenal of advanced persistent threat (APT) groups and other more or less powerful threat actors. It concerns techniques that enable a cyberattacker, after establishing a foothold, to maintain ongoing access and penetrate further into a network in quest of prized booty. This is done by moving through the infiltrated network and gaining elevated privileges using an assortment of tools. Concentrating on the MS Windows platform, this work provides what is, to our knowledge, the first holistic methodology, supported by an abundance of experimental results, for the detection of LM via supervised machine learning (ML) techniques. We specifically detail feature selection, data preprocessing, and feature importance processes, and elaborate on the configuration of the ML models used. A plethora of ML techniques are assessed, including 10 base estimators, one ensemble meta-estimator, and five deep learning models. Vis-à-vis the relevant literature, and by considering a highly unbalanced dataset and a multiclass classification problem, we report superior scores in terms of the F1 and AUC metrics, 99.41% and 99.84%, respectively. Last but not least, as a side contribution, we offer a publicly available, open-source tool, which can convert Windows system monitor logs to turnkey datasets, ready to be fed into ML models.

1 Introduction

In recent years, numerous individuals, organizations, and government bodies have suffered from repeated incidents of lateral movement (LM). Sensitive data have been stolen or lost, including bank accounts, fighter aircraft blueprints, or even classified state secrets as part of an international information leakage cyberattack. Generally, LM refers to the malicious techniques that adversaries use to acquire unauthorized access through a network endpoint and then laterally escalate their privileges, in search of critical infrastructure to compromise and valuable data to exfiltrate [1]. Simply put, the attacker’s goal is to gain an initial foothold in a networking environment, remain undetected for as long as needed to learn the topology of the targeted facilities, maintain ongoing access by moving laterally through the compromised environment, and finally elevate their privileges towards data extraction or elimination. LM tactics are categorized within the general area of advanced persistent threats (APTs) [2]; colloquially, LM is the act of acquiring as much network access as possible, mostly to achieve persistence.

LM should be distinguished from the legacy cyberattacks of the past and considered a set of key tactics that are not bound to specialized tools. A typical LM technique comprises three major stages, namely the reconnaissance and enumeration of the targeted computing facility, the credential dumping and privilege escalation, and finally the compromising of the targeted device. Specifically, during reconnaissance and enumeration, the adversary maps the network’s topology, devices, operating systems, and user hierarchy. Privilege escalation is then accomplished with credential dumping through a large variety of hash exploitation techniques, with the final goal of compromising valuable assets. Pivoting is tightly related to the concept of LM, and in some contexts, these terms are used interchangeably. Nevertheless, pivoting is more precisely used to refer to the act of moving from host to host inside the target network, while LM also entails the act of privilege escalation on the compromised machines.

Conducting LM is a trademark of contemporary and sophisticated threat actors, as evidenced by the records of common LM techniques in MITRE’s ATT&CK Framework [3]. LM tactics are recognized as “TA0008” in MITRE’s list, which is constantly updated with the most impactful incidents and the APT groups that conducted them. Prominent threat actors in this context include the APT39 cyber espionage group [4], which is alleged to be responsible for numerous thefts of personal information around the world, and the APT29 group (aka “Cozy Bear”), which is reported to be behind the infamous compromise of SolarWinds’ “Orion” network monitoring software [5]. The impact of LM events around the world is persistently ominous: VMware’s 2022 “Global Incident Response Threat Report” [6] revealed that LM tactics were used in 25% of all the reported attacks. Although an endpoint detection and response (EDR) policy calibrated for LM detection may be effective to some extent, the immense volume of network traffic and audit logs means that the solution lies in the introduction of a log-based intrusion detection system (IDS) that leverages contemporary machine learning (ML) techniques.

In this context, after pinpointing the shortcomings of the relevant literature, the current work delivers a multifold novel contribution regarding the ecosystem of LM IDS by means of supervised ML techniques. Concentrating on the MS Windows platform and its system service known as system monitor (Sysmon), we provide a comprehensive supervised classification methodology that involves an abundance of traditional (shallow learning) classifiers, both base estimators and ensemble meta-estimators, and deep learning (DL) models. This allows for a comprehensive picture of this potential and sets the basis for future research in this timely domain. In particular, regarding the research methodology, we explicate and contextualize the feature selection, data preprocessing, and feature importance processes, and delve into the ML model parameterization, including hyperparameters. On top of that, we offer an all-encompassing solution to generate labeled or unlabeled CSV datasets from voluminous Sysmon logs. Overall, the key contributions of this work vis-à-vis the relevant literature can be outlined as follows:

We detail the hurdles involved in the creation of turnkey unlabeled or labeled datasets in CSV format through the manipulation of EVTX Sysmon logs, and propose a software solution able to automate this task. This contribution is key to the LM ecosystem given that, to the best of our knowledge, no pertinent datasets exist, which obstructs research on ML-oriented LM detection.
We provide a detailed overarching methodology behind human-driven feature selection upon Sysmon log-based datasets. The classification features outlined can serve as a solid basis for the creation of robust and potentially high-rated IDS targeting LM. The suggested methodology is full-fledged, ranging from the labelling and preprocessing of data to the selection of the most suitable hyperparameters per ML model.
Differently from the existing literature, we formulate a multiclass problem, meticulously assessing the proposed methodology through a great variety of legacy classifiers and DL models.
We provide a scrupulous review of the relevant literature, also vis-à-vis our work, pinpointing misconceptions and dubious practices.

The remainder of this paper is structured as follows. The next section provides an overview of the related work. Section 3 focuses on the obstacles related to harnessing Sysmon logs for ML-powered intrusion detection, and details a solution to this end. The same section outlines the dataset used in the context of this work. Our methodology, including feature selection, data preprocessing, and feature importance, is given in Sect. 4. The setup and results of the experiments are presented in Sect. 5, followed by an in-depth discussion in Sect. 6. The last section concludes and provides pointers to future work. For easier guidance throughout the manuscript, a list of abbreviations is included at the end of the article.

2 Related work

The current section provides a brief review of the key pertinent literature regarding LM. The concentration is on the methodology of each relevant work concerning the detection of LM through either supervised or unsupervised machine learning techniques or graph-based analysis. That is, although the work at hand deals with the identification of LM by means of supervised learning, for reasons of completeness, the current section presents the related literature for all three aforementioned categories. It is to be noted that a more detailed comparison with the related work, focused on particular aspects, is given in Sect. 6.3. For easy reference, the key characteristics of every work discussed in this section are summarized in Table 1.

2.1 Supervised learning based schemes

Based on its impact, the work in [7] is considered state of the art regarding the subject of anomaly detection through security logs. Specifically, the authors propose an anomaly detection approach that is based on a mixture of 10 log-based generic features and eight custom-made ones. Both sets of features were extracted from the publicly available Los Alamos National Laboratory (LANL) dataset collected between 1996 and 2005 [8]. Sampling techniques were applied on the collected subset to ease the processing and computational power issues that arise with such large data volumes. Supervised ML techniques, namely, Random Forest (RF), LogitBoost (LB) and Logistic Regression (LoR), were implemented towards the classification of the identified log events into normal or malicious. The performance of the classifiers was evaluated against the false positive (FPR) and false negative (FNR) rates, while the malicious authentication predictions of the three aforementioned classifiers were fed to the uniformly weighted ensemble Majority Voting algorithm and re-evaluated. The authors give no insight regarding their understanding of the dataset or the extended graphical-based experimentation upon it that led to the extraction of the presented composite features. Additionally, they do not provide any of the implemented R language scripts, obstructing reproducibility.
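As a rough illustration of what uniformly weighted majority voting over three classifiers amounts to, the following minimal Python sketch combines per-sample labels. It is not the R implementation of [7], which was not released; the function name `majority_vote`, the tie-breaking rule, and the prediction lists are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label chosen by most classifiers for a single sample.

    Every classifier contributes one vote of equal weight. Ties are broken
    in favour of the label that appears first in `predictions`
    (an arbitrary choice made for this sketch).
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical per-sample outputs of three classifiers (e.g. RF, LB, LoR).
rf = ["normal", "malicious", "malicious", "normal"]
lb = ["normal", "malicious", "normal", "normal"]
lor = ["malicious", "malicious", "malicious", "normal"]

# Combine the three votes sample by sample.
ensemble = [majority_vote(votes) for votes in zip(rf, lb, lor)]
print(ensemble)
```

With an odd number of binary classifiers no tie can occur, so the tie-breaking rule only matters for multiclass labels or an even number of voters.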

The authors in [9] introduced a hybrid anomaly detection approach, focusing on the identification of networking hosts susceptible to LM techniques during the early stages of their exposure to the threat. The first part of their work is dedicated to the graphical representation of authentication logs included in the LANL dataset towards the extraction of 29 composite features. In addition, six more flow-based features were extracted from the event logs related to network flows. In the second part, the resulting 35 features were evaluated under several supervised classifiers, namely, Decision Tree (DT), RF, Linear Regression (LiR), Gaussian Naive Bayes (GNB) and Label Binarizer (LaBi), as part of the proposed anomaly-based approach. Under- and oversampling techniques were applied on the dataset due to its highly imbalanced nature, as well as k-fold cross validation (k=10) during the execution of each ML technique. Interestingly, the same work [9] was revisited in [10] under a case study concentrating on RDP-based LM techniques. Keeping the same principles as in [9], the authors leveraged Windows host-based RDP event logs (as evidence) through the combination of two publicly available Windows event-log subsets of the LANL dataset, namely, “comprehensive” and “unified”, respectively. The subject of LM detection via authentication logs was re-addressed in [11] by the same authors, this time extended to examine how perturbations of LM technique patterns affect classification efficiency.

Moreover, the work in [12] introduced a Sysmon log-based anomaly detection system based on shallow and deep neural networks (DNN) supervised techniques, namely, LSTM, RNN, and SVM. On top of that, the authors proposed a generic set of features based on the manipulation of Sysmon EventIDs and evaluated their scheme in terms of TP and TN rates.

Despite the promising results presented in [9,10,11,12], the hybrid-combined dataset was not made publicly available. Moreover, the superiority of the classification results over those in [7] was fully documented only in [11] through the representation of the ROC curve and the precision, recall, and F1-score (F1) metrics. On the other hand, in [9, 10, 12] the authors neglected to mention the criteria upon which their claims over the work in [7] were based.

2.2 Unsupervised learning based schemes

So far, only a few works considered unsupervised ML as a means for the evaluation of a sparse diversity of collected logs exclusively related to LM. The examined features were either generically fundamental to the initially analyzed log-based datasets or manually extracted from the various interrelated nodes and edges representing the topology of the network. Precisely, the work in [13] proposed an anomaly detection method that was based on ensemble unsupervised ML to identify traces on compromised hosts with LM techniques. The authors used the LANL dataset [8] to create a graph-based model, which depicts the various communications between the targeted hosts. The classification features were extracted and evaluated with an ensemble of unsupervised ML techniques, namely principal component analysis (PCA), k-means clustering, and median absolute deviation-based outlier (MADO) detection. The method’s accuracy was evaluated under a trace-related simulation case study.
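To make the idea behind median absolute deviation-based outlier detection more concrete, here is a minimal Python sketch. It is not the method of [13]: the modified z-score form, the constant 0.6745, the threshold of 3.5, and the sample values are assumptions that follow a common textbook convention.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median of the absolute deviations from the median. Both the constant
    and the default threshold are conventional choices, assumed here.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations collapse to zero; nothing can be ranked as an outlier.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical feature: number of distinct hosts each machine logged in to.
hosts_contacted = [3, 4, 3, 5, 4, 3, 42]
flags = mad_outliers(hosts_contacted)
print(flags)
```

Because the median and the MAD are barely affected by a few extreme values, the single host with 42 connections is flagged without inflating the scale estimate, which is the usual motivation for preferring MAD over the standard deviation in this setting.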

In the same context, the authors in [14] employed four unsupervised ML methods, namely Autoencoder (AE), Isolation Forest (IF), lightweight on-line detection of anomalies (LODA) and local outlier factor (LOF), under an anomaly detection scheme which targets the identification of insider attacks. Various preprocessing techniques were applied on data with temporal payload to fit with deep learning (DL) algorithms and contribute to revealing patterns of adversarial changes in user behavior. Unsupervised ML ensembles were created to evaluate the anomaly detection performance under different algorithmic combinations. The results were compared against several state-of-the-art works using well-known datasets, including CERT [15], LANL [8] and TWOS [16]. On the downside, both the works in [13, 14] lack experimental feedback from real-world data stemming from LM enterprise scenarios and events.

A hybrid approach closely similar to the works in [13, 14] was presented in [17]. The theories of network embedding, for mapping a network’s graphical representation into nodes and vectors, were mixed with feature aggregation techniques towards the formation of composite features. The authors evaluated the finally selected features under a proposed semi-supervised classification algorithm based on the unsupervised denoising autoencoder model. The experiments were conducted on a balanced subset of the LANL dataset [8] that is called “The Comprehensive, Multi-Source, Cyber-Security Event” and the final classification results were evaluated under the FPR, TPR, accuracy (ACC) and precision (PREC) metrics. Although the authors presented an estimated ACC of 99.9% with 91.3% PREC when 10% of the data were labelled, their outcomes represent only the ideal, lab-oriented situation of a balanced dataset and are hardly applicable to real-life unbalanced data.
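The concern about balanced evaluation sets can be illustrated with a short calculation. The Python sketch below keeps a detector’s TPR and FPR fixed and recomputes precision as the share of malicious samples shrinks. The rates 0.95 and 0.05 and the prevalence values are hypothetical and are not taken from [17].

```python
def precision_at_prevalence(tpr, fpr, prevalence):
    """Precision implied by fixed TPR/FPR at a given malicious-class share.

    Per unit of data, true positives amount to tpr * prevalence and false
    positives to fpr * (1 - prevalence); precision is TP / (TP + FP).
    """
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)


# Hypothetical detector quality, held constant across all scenarios.
TPR, FPR = 0.95, 0.05

# From a balanced lab dataset towards increasingly rare malicious events.
for prevalence in (0.5, 0.01, 0.001):
    prec = precision_at_prevalence(TPR, FPR, prevalence)
    print(f"malicious share {prevalence:>6.3f} -> precision {prec:.3f}")
```

Under these assumed rates, precision is high on the balanced set and drops sharply once malicious samples become rare, even though the detector itself has not changed.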

Additionally, the work in [18] considered the detection of malware LM in data centers through the implementation of a behavioral unsupervised ML model. Anomaly detection was conducted on the application layer network traffic of data centers via the Jaccard Similarity Coefficient and clustering measurement technique (JSCC) on several balanced datasets. On the other hand, the authors in [19] presented an unsupervised learning model of LM detection, based on the role-based approach of clustering the system connections to remote hosts into distinct roles. We argue that the type of traffic, and therefore the features obtained in both the environments of [18, 19], are significantly different compared to this study. Namely, the traffic considered by the authors is inappropriate for detecting LM. Therefore, such works are considered out of scope of the current study.

2.3 Graph-based schemes

The authors in [20] address the subject of LM detection through the definition of a graph-based impact metric. First, the evolution of the paths that an adversary could take among the network nodes by exploiting various vulnerabilities is defined algorithmically via the introduction of a dynamic graph-based reachability model (DGBR). This model is then used as the basis for the calculation of a network-level impact score. The latter score is quantified based on the value and reachability score assigned to each network node that could be compromised by adversaries. Although the proposed model was implemented in the context of the so-called Windows credentials “Pass-the-Hash” (PtH) vulnerability, the authors do not consistently abide by their defined model. Instead, new concepts were introduced which lack sufficient documentation and connection with the already presented theory. Besides that, the PtH case study scenario that was tested against the LANL dataset was based on an implementation of the proposed reachability and impact metric model in C++ source code that was not included, making replication of the experiments practically infeasible.

The work in [21] contributed a graph-based detection system, dubbed Latte, which deals in parallel with the multi-layered nature of large-scale data stemming from LM incidents and the lack of knowledge regarding adversaries. They address the LM problem in two ways. First, hosts and user accounts were marked as nodes, while their interconnections were modeled as edges. Once an infected node is identified, it leads through the proposed forensic algorithmic analysis to any other compromised element(s). Second, a general algorithmic approach for identifying rare-path anomalies leverages a remote file execution detector to recognize unknown LM attempts. The same work [21] inspired two more similar approaches. Precisely, the authors in [22] presented another tool, dubbed Hopper, that concentrates on the identification of malicious LM events through logs collected in real-life settings. The proposed system tracks users’ login activities and outlines their correlations among hosts on a graph-based representation. The process ends with the detection of anomalies in login patterns, which may imply the existence of LM. Moreover, the authors in [23] introduced a custom LM detection algorithm under the title LMTracker. This scheme originated from the authors’ effort to address the gaps in the efficiency of the existing endpoint protection practices to identify LM events. Various elements included in the captured log-based traffic, namely users, computers, processes etc., were extracted and implemented as nodes for the construction of heterogeneous graphs that present the various relationships among these elements. In turn, the advanced graph neural networks theory was used for the production of two custom algorithms for the representation of the LM-related paths and the unsupervised anomaly path detection based on a predefined threshold, respectively. LMTracker was evaluated over the LANL [8] and CERT 6.2 [15] datasets, and the experimental results were examined under the prism of confusion matrix rates and the ROC-AUC metric. While highly promising as an LM anomaly detection tool with an approximately 0.95 ROC AUC score, LMTracker presents noticeable FP rates.

2.4 Key observations

With reference to Sects. 2.1 to 2.3 and Table 1, almost half of the presented works (namely 5 out of 12) relied on supervised shallow classifiers, whereas three contributions implemented unsupervised classification techniques and the other four were based on graphs. In addition, a characteristic common to most works is that they neither construct their own set of data logs and samples, nor provide adequate reference to regularization techniques and hyperparameter optimization steps. Interestingly, all the works but two have been published from 2018 onward.

Furthermore, the vast majority of the works in Table 1 utilized publicly available logs collected via the Windows Event Viewer tool, all of which were related to the legacy LANL dataset of multi-source cyber-security events. Publicly released in 2015, LANL is considered almost outdated because, apart from LM traffic, it lacks samples derived from contemporary malicious techniques. It was precisely the small proportion of included malicious traffic that led most of the authors to artificially reproduce such samples in order to create a custom imbalanced dataset adequate for use with ML techniques. We argue that such processes of artificial data handling and dataset manipulation should be treated with great caution. In most cases, they are not related to real-life traffic and may distort the prediction rates of the whole ML-IDS process, despite the good results they may initially reveal. Furthermore, even the works in [9,10,11,12], which introduced a hybrid-combined dataset different from LANL, neglected to provide it, obstructing reproducibility. Another important aspect that needs to be pointed out is that most of the works presented in Sects. 2.1 to 2.3 neither mention the features selected for the classification process nor justify their contribution to the whole ML process. Further, no code regarding the implemented Python or R scripts is provided, not to mention the lack of the hyperparameters upon which the ML models were constructed.

All in all, a general conclusion is that the majority of the studies so far have been conducted on datasets that do not meet a number of criteria, namely contemporary LM or general purpose attacks, adequate representation of all the included classes to help the ML experimental process, or even multiclass labelling of the included samples. Another important observation is that all but one [12] of the works relied on logs collected via MS Windows Event Viewer, and none of them introduced Sysmon related traffic to take advantage of the enhanced headers as those are precluded by the collected event-logs. We argue that this phenomenon is mainly due to the lack of an open-source, publicly available tool able to readily convert Sysmon’s extracted logs (EVTX format) to a labeled or unlabeled dataset in CSV format. This shortcoming is also obvious in other recent works which rely on manual investigation of log files produced by Windows Event Viewer, and for that reason the LANL dataset, which is preprocessed in comma-separated format, was selected in most cases.

As far as the manipulation of Sysmon logs is concerned, the work in [24] was the first to present an LM-oriented EDR policy for the first-level identification of LM incidents through the analysis of raw log files. The work ended with the presentation and evaluation of the Python Evtx Analyzer (PeX) EDR tool, which incorporated the aforementioned EDR policy’s criteria. The tool manipulates Sysmon files in their raw EVTX form, which are then examined against the features of the presented EDR policy to reveal the existence of potential malicious LM activity. The PeX tool is publicly available on GitHub [25].

As an extension to [24], and to address, among others, the key gap of creating datasets from EVTX log files, the current work contributes such a tool, entitled evtx_To_CSV_Export Tool (ETCExp). The tool, detailed in Sect. 3, was developed to serve as an easily configurable and, above all, OS-independent command line tool that helps incident response teams and researchers to parse and transform massive EVTX log files into compatible unlabeled datasets (CSV files), ready to be used along with ML algorithms. Further, the ETCExp tool is designed to implement the EDR policy proposed in [24] for automatically labelling the transformed Sysmon logs into a multiclass CSV set of samples. Besides the labelling process, the ETCExp tool performs, on demand, feature selection, subset extraction, and basic data preprocessing through One Hot Encoding and MinMax algorithms. The full presentation of the technical characteristics of the aforesaid tool is available in Sect. 3.
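For readers unfamiliar with the two preprocessing steps named above, the following minimal Python sketch shows what One Hot Encoding and MinMax scaling do to a categorical and a numeric column. It is not ETCExp’s code, and the column names `protocol` and `event_count` with their values are hypothetical placeholders rather than actual Sysmon fields.

```python
def one_hot(column):
    """Encode a categorical column as one 0/1 indicator per category.

    Returns the sorted category list (the new column order) together with
    the encoded rows.
    """
    categories = sorted(set(column))
    rows = [[1 if value == cat else 0 for cat in categories] for value in column]
    return categories, rows


def min_max(column):
    """Rescale a numeric column linearly to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


# Hypothetical columns of a small log-derived dataset.
protocol = ["tcp", "udp", "tcp"]
event_count = [10, 0, 40]

categories, encoded = one_hot(protocol)
scaled = min_max(event_count)
print(categories, encoded)
print(scaled)
```

One-hot encoding avoids imposing an artificial order on categorical values, while min-max scaling keeps numeric features on a comparable range, which matters for distance-based and gradient-based models.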

Table 1 Summary of the key aspects of the works included in this section. The works are arranged in chronological ascending order
