Fusion of Traffic Data and Alert Log Based on Sensitive Information

Cheng, Jie; Zhang, Ru; Tian, Siyuan; Lin, Bingjie; Wei, Jiahui; Zhang, Shulin

doi:10.1007/978-981-19-2456-9_9


Part of the book series: Lecture Notes in Electrical Engineering ((LNEE))

Included in the following conference series:

INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND APPLICATIONS


Abstract

Attack behavior in networks has gradually developed from simple single-step attacks into complex multi-step attacks, and researchers have conducted a series of studies on such multi-step attacks. Common methods use an IDS to obtain network alert data as the data source and then match multi-step attacks based on correlations within that data. However, the false positives and missed detections in IDS alert data cause the reconstructed multi-step attack to be incomplete or incorrect. Multi-source data is the basis of analysis and prediction in network security, and fusion analysis is an important means of processing it. In response to this problem, this paper studies how to use sensitive information traffic to supplement IDS alert data, and proposes a method for fusing traffic and log data based on sensitive information. The paper analyzes the purpose of each stage of the kill chain and uses that purpose to divide multi-step attack behavior into stages, which are then used to filter the source data. The kill chain model is likewise used, according to the purpose of the multi-step attack, to define the multi-step attack model.



1 Introduction

Since the birth of the Internet, cyber attacks have threatened users and organizations, and they have grown more complex as computer networks have grown more complex. Today an attacker typically needs to perform multiple intrusion steps to achieve the ultimate goal. To detect network attacks, security researchers rely heavily on intrusion detection systems (IDS). However, because IDS alert data suffers from both missed detections and false positives, multi-step attacks reconstructed only from alert logs are incomplete or incorrect.

In response to this problem, this paper designs a traffic and log data fusion method based on sensitive information. Using the Spark framework, sensitive traffic is screened out of massive traffic data, preprocessed, and merged with the alert log to obtain normalized data as the data source. The normalized data is first clustered on a single feature, the IP address, and then filtered within and between clusters using the kill chain model, which finally yields highly complete attack clusters that follow the kill chain attack stages.

2 Related Work

Multi-step attacks are the current mainstream attack method. So far, the correlation analysis methods of multi-step attacks can be divided into five categories: similarity correlation, causal correlation, model-based, case-based, and hybrid.

Similarity correlation is based on the idea that similar alerts have the same root cause and therefore belong to the same attack scenario. With the correct selection of similarity features, a more accurate attack scenario can be reconstructed, but it depends on the similarity of a small number of data segments.

The causal association method is based on prior knowledge, or on a list of alert prerequisites and consequences derived from big data statistics. This method can correlate common attack scenarios fairly accurately, but causal association based on prior knowledge lacks a means of reconstructing rare attack scenarios, and because the attack process is random, the results of big data statistics lack confidence.

Model-based methods use existing or improved attack models for pattern matching, such as attack graphs, Petri nets, and network kill chains. They can match and reconstruct attacks that conform to the model, but they lack a way to detect new attacks or APT attacks. Noel et al. [1] were the first to use an attack graph to match IDS alerts; this relies on prior knowledge such as the completeness of the attack graph and cannot detect unknown attacks. Chien and Ho [2] proposed a colored Petri net-based correlation system in which the attack types are divided in more detail. Yanyu Huo et al. [3] used the network kill chain model for correlation analysis.

Case-based methods can only target a certain type of attack. Vasilomanolakis et al. [4] collected real multi-step attacks through honeypots, etc., and developed case-based signatures. Salah et al. [5] modeled through reasoning or human analysis and added it to the attack database.

The hybrid method combines the strengths of several methods and has been the most commonly used approach in recent years. Farhadi et al. [6] combined attribute association and statistical relationship methods in the ASEA system, and used HMMs for plan identification. Shittu [7] combined Bayesian inference with attribute association.

3 Algorithm Design

3.1 Meaning of Sensitive Information

Researchers rarely use traffic data as the analysis data source, mainly because of its huge volume and poor readability. To address these two problems, this paper defines sensitive information and proposes a method of filtering sensitive information traffic based on the Spark framework.

Table 1. Sensitive information.


The ultimate goal of an attack is defined as modifying, adding, or stealing system data, or disrupting system behavior. Based on this definition, this article obtained the sensitive information that an attack may touch through a questionnaire survey of security personnel and a statistical analysis of multi-step attack behavior. The result is shown in Table 1.

3.2 Sensitive Information Flow Screening Method Based on Spark Framework

The initially extracted traffic data contains the basic information fields: time, IP information, port information, and the transmitted content body msg. In this paper, the content body msg is processed by distributed computation, and the sensitive information traffic is filtered out of the mass traffic data according to the sensitive information list Sl (Fig. 1).
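The screening step can be illustrated with a minimal single-machine sketch. It assumes each traffic record is a dictionary with a `msg` field and that the sensitive information list Sl is a list of substrings; the entries in `SENSITIVE_LIST` are hypothetical placeholders, since the contents of Table 1 are not reproduced here, and plain Python stands in for the distributed Spark computation.

```python
from typing import Dict, Iterable, List

# Hypothetical entries; the paper's actual list Sl (Table 1) is not reproduced here.
SENSITIVE_LIST: List[str] = ["/etc/passwd", "select ", "cmd.exe"]


def is_sensitive(msg: str, sensitive_list: Iterable[str]) -> bool:
    """Return True if the content body contains any sensitive keyword."""
    lowered = msg.lower()
    return any(keyword.lower() in lowered for keyword in sensitive_list)


def screen_traffic(records: Iterable[Dict[str, str]],
                   sensitive_list: Iterable[str] = SENSITIVE_LIST) -> List[Dict[str, str]]:
    """Keep only the records whose msg field matches the sensitive list."""
    keywords = list(sensitive_list)
    return [r for r in records if is_sensitive(r.get("msg", ""), keywords)]


if __name__ == "__main__":
    traffic = [
        {"time": "t1", "src_ip": "10.1.0.5", "dst_ip": "10.1.0.9", "msg": "GET /index.html"},
        {"time": "t2", "src_ip": "10.1.0.5", "dst_ip": "10.1.0.9", "msg": "cat /etc/passwd"},
    ]
    for record in screen_traffic(traffic):
        print(record["time"], record["src_ip"], "->", record["dst_ip"])
```

In a Spark job the same predicate could presumably be passed to an RDD `filter` call so that the matching runs in parallel over partitions of the traffic data.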

3.3 Data Normalization

Multi-step attacks vary widely in method, but in essence each combines many single-step attacks to achieve an ultimate goal. Most multi-step attack processes fit the kill chain model, which defines the attack stages as: reconnaissance and tracking, weapon construction, load delivery, vulnerability exploitation, installation and implantation, command and control, and goal achievement. Building on this scheme and on the purpose of each stage, this article divides a multi-step attack into seven stages: (1) information collection (reconnaissance and tracking, weapon construction); (2) vulnerability exploitation (load delivery, vulnerability exploitation); (3) Trojan upload and remote command execution (installation and implantation); (4) remote Trojan connection and privilege escalation (command and control); (5) horizontal transmission; (6) destroying, stealing, and modifying information (goal achievement); and (7) eliminating intrusion evidence. Compared with the original kill chain model, the attack behavior is divided in more detail. Because current multi-step attacks may propagate like worms (WannaCry, for example), this article adds the horizontal transmission stage. In addition, because sensitive information traffic exposes host-level activity that IDS alert data alone cannot detect, the stage of eliminating intrusion evidence is also added.

In summary, the kill chain model used in this article is shown in Fig. 2.
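As an illustration, the seven-stage division can be written down as an ordered enumeration. This is only a sketch: the member names and the integer coding 1 to 7 are assumptions made here (the content of Fig. 2 is not reproduced), chosen so that later stages compare as greater than earlier ones.

```python
from enum import IntEnum


class KillStep(IntEnum):
    """Seven attack stages described in the text; the numeric coding is assumed."""
    INFORMATION_COLLECTION = 1      # reconnaissance and tracking, weapon construction
    VULNERABILITY_EXPLOITATION = 2  # load delivery, vulnerability exploitation
    TROJAN_UPLOAD = 3               # installation and implantation
    REMOTE_CONNECTION = 4           # command and control, privilege escalation
    HORIZONTAL_TRANSMISSION = 5     # added stage: worm-like propagation
    GOAL_ACHIEVEMENT = 6            # destroying, stealing, modifying information
    EVIDENCE_ELIMINATION = 7        # added stage: eliminating intrusion evidence


# Mapping from the original kill chain phases to the stages above.
ORIGINAL_PHASE_TO_STEP = {
    "reconnaissance and tracking": KillStep.INFORMATION_COLLECTION,
    "weapon construction": KillStep.INFORMATION_COLLECTION,
    "load delivery": KillStep.VULNERABILITY_EXPLOITATION,
    "vulnerability exploitation": KillStep.VULNERABILITY_EXPLOITATION,
    "installation and implantation": KillStep.TROJAN_UPLOAD,
    "command and control": KillStep.REMOTE_CONNECTION,
    "goal achievement": KillStep.GOAL_ACHIEVEMENT,
}

if __name__ == "__main__":
    for step in KillStep:
        print(int(step), step.name)
```

Using an `IntEnum` keeps the stages directly comparable, which is convenient for rules that test whether one stage comes later than another.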

The normalization of the data depends mainly on the selection of feature fields, which must take three aspects into account: (1) the similarity of feature fields can indicate the similarity of attacks to a certain extent; (2) the feature fields clearly capture the important content of a piece of data; (3) the feature fields exist in all data sets. Based on these considerations, this article selects the source IP address (src_ip), destination IP address (dst_ip), source port (src_port), destination port (dst_port), time (time), kill chain stage (killstep), and a distinguishing flag (datatype). The normalized data set is then:

$$ \begin{aligned} \mathrm{data} &= \left\{ \mathrm{d}_1, \mathrm{d}_2, \ldots, \mathrm{d}_n \right\}, \quad \mathrm{d}_i \text{ is a 7-tuple}, \\ \mathrm{d}_i &= \left[ \mathrm{src\_ip}, \mathrm{dst\_ip}, \mathrm{src\_port}, \mathrm{dst\_port}, \mathrm{time}, \mathrm{killstep}, \mathrm{datatype} \right] \end{aligned} $$
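A minimal sketch of the 7-tuple as a Python record follows. The field names come from the text; the field types, the `normalize` helper, and the assumption that raw alert or traffic entries arrive as dictionaries with matching keys are illustrative choices, not details given in the paper.

```python
from dataclasses import astuple, dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Record:
    """One normalized entry d_i; the types are assumed for illustration."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    time: float     # assumed: a numeric timestamp
    killstep: int   # assumed: integer index of the kill chain stage
    datatype: str   # assumed: "alert" or "traffic"


def normalize(raw: Dict[str, Any], datatype: str, killstep: int) -> Record:
    """Project a raw alert or traffic entry onto the seven feature fields."""
    return Record(
        src_ip=str(raw["src_ip"]),
        dst_ip=str(raw["dst_ip"]),
        src_port=int(raw["src_port"]),
        dst_port=int(raw["dst_port"]),
        time=float(raw["time"]),
        killstep=int(killstep),
        datatype=datatype,
    )


if __name__ == "__main__":
    alert = {"src_ip": "10.1.0.5", "dst_ip": "10.1.0.9",
             "src_port": "4444", "dst_port": 80, "time": "1.5"}
    print(astuple(normalize(alert, datatype="alert", killstep=2)))
```

Because both alert logs and sensitive traffic are mapped onto the same record type, the later clustering and filtering steps can treat the two sources uniformly and tell them apart only through `datatype`.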

3.4 Alert Log and Sensitive Information Flow Fusion Algorithm

Definition 1: Attack cluster collection:

$\mathrm{attclusters}=\left\{\mathrm{attcluster}_{1},\mathrm{attcluster}_{2},\mathrm{attcluster}_{3},\dots,\mathrm{attcluster}_{n}\right\}$,

where $\mathrm{attcluster}_{i}$ represents an attack cluster: $\mathrm{attcluster}_{i}=\left\{\mathrm{d}_{a},\mathrm{d}_{b},\cdots,\mathrm{d}_{c}\right\},\ \mathrm{d}_{x}\in \mathrm{data}$.

(A) IP similarity clustering

At present, feature selection for similarity-based classification of network attacks falls into two types: one uses multiple features such as IP, port, and time to perform fuzzy clustering with different weights; the other uses a single feature for strong-similarity clustering. Because the subsequent multi-step attack model generation algorithm can, to a certain extent, recover multi-step attack behavior that was missed, this article clusters on the similarity of a single feature, the IP address, as shown in formula (1):

IP address similarity formula (a):
$$ F_{ip}\left( ip_1, ip_2 \right) = \begin{cases} 1, & \text{if } \big( Similar\left( src_{ip1}, src_{ip2} \right) \text{ and } Similar\left( dst_{ip1}, dst_{ip2} \right) \big) \text{ or } dst_{ip1} = src_{ip2} \\ 0, & \text{otherwise} \end{cases} \tag{1} $$

Here $src_{ip}$ and $dst_{ip}$ denote the source and destination IP addresses of a piece of data. If the source IP addresses of two pieces of data are in the same network segment and their destination IP addresses are also in the same network segment, or if the destination IP address of the first is the source IP address of the second, the similarity value is 1 and the two pieces of data can be considered to belong to the same attack process. For example, given two IPs, IP1 = A1.A2.A3.A4 and IP2 = B1.B2.B3.B4, Similar is defined as shown in formula (2):

IP address similarity formula (b):
$$ Similar\left( IP1, IP2 \right) = \begin{cases} True, & A1 = B1 \text{ and } A2 = B2 \\ False, & \text{otherwise} \end{cases} \tag{2} $$
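Formulas (1) and (2) translate almost directly into code. The sketch below implements both; the greedy single-pass `cluster_by_ip` routine is an assumption added for illustration, since the text does not specify how the pairwise similarity is turned into clusters.

```python
from typing import List, Tuple

Flow = Tuple[str, str]  # (src_ip, dst_ip) of one piece of data


def similar(ip1: str, ip2: str) -> bool:
    """Formula (2): True when the first two octets of the addresses match."""
    return ip1.split(".")[:2] == ip2.split(".")[:2]


def f_ip(a: Flow, b: Flow) -> int:
    """Formula (1): same segments at both ends, or a's destination is b's source."""
    src1, dst1 = a
    src2, dst2 = b
    same_segments = similar(src1, src2) and similar(dst1, dst2)
    chained = dst1 == src2
    return 1 if same_segments or chained else 0


def cluster_by_ip(flows: List[Flow]) -> List[List[Flow]]:
    """Greedy clustering (assumed strategy): join the first cluster with a similar member."""
    clusters: List[List[Flow]] = []
    for flow in flows:
        for cluster in clusters:
            if any(f_ip(member, flow) for member in cluster):
                cluster.append(flow)
                break
        else:
            clusters.append([flow])
    return clusters


if __name__ == "__main__":
    flows = [
        ("10.1.0.5", "192.168.1.1"),
        ("10.1.3.7", "192.168.9.2"),
        ("8.8.8.8", "172.16.0.9"),
        ("192.168.1.1", "172.16.0.9"),
    ]
    for i, cluster in enumerate(cluster_by_ip(flows), start=1):
        print(i, cluster)
```

Note that the chained condition is directional: it links a piece of data to a later one whose source is the earlier destination, which is how a compromised host that attacks onward stays in the same cluster.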
(B) Combine and filter within the attack cluster (Sim_in, CFD_in)

Analysis of typical attack behavior shows that a large number of similar attack actions usually occur within a short period of time. Therefore, in this paper each attack cluster is merged and filtered internally. The similarity formula within the attack cluster is shown in formula (3), and the confidence formula in formula (4):

(1) Similarity within the attack cluster:
$$ \mathrm{Sim\_in}\left( \mathrm{d}_1, \mathrm{d}_2 \right) = \begin{cases} 1, & \text{if same time and ip}\left( \mathrm{d}_1, \mathrm{d}_2 \right) \\ & \text{or neartime}\left( \mathrm{d}_1, \mathrm{d}_2 \right) \text{ and same msg and ip}\left( \mathrm{d}_1, \mathrm{d}_2 \right) \\ 0, & \text{otherwise} \end{cases} \tag{3} $$

(2) Confidence within the attack cluster:
$$ \mathrm{CFD\_in}\left( \mathrm{d}_i \right) = \begin{cases} 0, & \text{if } \mathrm{killstep}\left( \mathrm{d}_i \right) > 3 \text{ and } \mathrm{killstep}\left( \mathrm{d}_i \right) < \mathrm{maxkillstep} \\ 1, & \text{otherwise} \end{cases} \tag{4} $$

If the time and IP address of two pieces of data are the same, the similarity is 1: they are the same event recorded by both the sensitive information traffic and the alert log. Data with the same attack name and IP address within a similar time also has similarity 1, which indicates the same attack repeated over a short period. Data whose similarity is 1 is merged. For each piece of data, if its kill chain stage is greater than 3 and smaller than the maximum kill chain stage reached in the attack cluster up to that piece of data, its confidence is 0. Data with a confidence of 0 is removed from the attack cluster.
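The merge and filter steps of formulas (3) and (4) can be sketched as follows. Several details are assumptions made for illustration: the `NEAR_SECONDS` window (no value is given in the text), keeping the earliest record as the representative of a merged group, processing the cluster in time order, and taking maxkillstep as the highest stage seen so far, including stages of records that are themselves removed.

```python
from dataclasses import dataclass
from typing import List

NEAR_SECONDS = 60.0  # assumed window for "similar time"; not stated in the paper


@dataclass
class Rec:
    src_ip: str
    dst_ip: str
    time: float
    killstep: int
    msg: str


def same_ip(a: Rec, b: Rec) -> bool:
    return a.src_ip == b.src_ip and a.dst_ip == b.dst_ip


def sim_in(a: Rec, b: Rec) -> int:
    """Formula (3): same time and IP, or near time with the same msg and IP."""
    if not same_ip(a, b):
        return 0
    if a.time == b.time:
        return 1
    if abs(a.time - b.time) <= NEAR_SECONDS and a.msg == b.msg:
        return 1
    return 0


def merge_cluster(cluster: List[Rec]) -> List[Rec]:
    """Merge records with similarity 1, keeping the earliest one (assumed)."""
    merged: List[Rec] = []
    for rec in sorted(cluster, key=lambda r: r.time):
        if not any(sim_in(kept, rec) for kept in merged):
            merged.append(rec)
    return merged


def filter_cluster(cluster: List[Rec]) -> List[Rec]:
    """Formula (4): drop records past stage 3 that fall behind the stage already reached."""
    kept: List[Rec] = []
    max_step = 0
    for rec in sorted(cluster, key=lambda r: r.time):
        cfd_in = 0 if (rec.killstep > 3 and rec.killstep < max_step) else 1
        if cfd_in == 1:
            kept.append(rec)
        max_step = max(max_step, rec.killstep)
    return kept


if __name__ == "__main__":
    cluster = [
        Rec("1.1.1.1", "2.2.2.2", 0.0, 1, "scan"),
        Rec("1.1.1.1", "2.2.2.2", 10.0, 1, "scan"),
        Rec("1.1.1.1", "2.2.2.2", 500.0, 6, "steal"),
        Rec("1.1.1.1", "2.2.2.2", 600.0, 4, "c2"),
    ]
    for rec in filter_cluster(merge_cluster(cluster)):
        print(rec.time, rec.killstep, rec.msg)
```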
(C) Filter between attack clusters (CFD_out)

Because an IDS detects by rule rather than by result, the acquired attack data contains a large amount of failed-attack data. Attack clusters formed only by classifying IP addresses therefore inevitably contain many unsuccessful attacks, as well as attacks abandoned partway because the attacker changed target or an attack step failed. These incomplete attack behaviors would make the subsequent multi-step attack model incomplete. To filter out incomplete and incorrect attack clusters, this paper gives the confidence formula between attack clusters, shown in formula (5):

$$ \mathrm{CFD\_out} = \sum_{i = 1}^{N} \mathrm{killstep}\left( \mathrm{d}_i \right) \cdot \mathrm{typeCFD}\left( \mathrm{d}_i \right) \tag{5} $$

where N is the number of pieces of attack data in the attack cluster. For each piece of data, its kill chain stage $\mathrm{killstep}$ is used as a weight and multiplied by the type confidence $\mathrm{typeCFD}$ to give the confidence value of that piece of data.
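A short sketch of formula (5) is given below. The type confidence values in `TYPE_CFD` are hypothetical, since the text does not list them, and because the text does not spell out how the score is used to discard clusters, the sketch only ranks clusters by score rather than applying a cut-off.

```python
from typing import Dict, List, Tuple

# Hypothetical type confidences; the paper does not give concrete values.
TYPE_CFD: Dict[str, float] = {"alert": 0.5, "traffic": 1.0}

Item = Tuple[int, str]  # (killstep, datatype) of one piece of data


def cfd_out(cluster: List[Item], type_cfd: Dict[str, float] = TYPE_CFD) -> float:
    """Formula (5): sum of killstep weighted by the type confidence."""
    return sum(killstep * type_cfd[datatype] for killstep, datatype in cluster)


def rank_clusters(clusters: List[List[Item]]) -> List[List[Item]]:
    """Order clusters from highest to lowest confidence (assumed usage)."""
    return sorted(clusters, key=cfd_out, reverse=True)


if __name__ == "__main__":
    probe_only = [(1, "alert")]
    full_attack = [(1, "alert"), (2, "alert"), (6, "traffic")]
    for cluster in rank_clusters([probe_only, full_attack]):
        print(cfd_out(cluster), cluster)
```

Weighting by the kill chain stage means that a cluster which only ever reaches the early stages scores low, while one that progresses to later stages scores high, which matches the stated aim of filtering out incomplete attack clusters.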

4 Experimental Design and Analysis

4.1 Dataset

(1) Simulation data D1

This article uses a CMS website management system to build a Web site that contains a SQL injection backdoor, and then, in sequence, uses the Yujian scanner to scan the website's admin backend, uses SQL injection to obtain the administrator account and password, logs in to the backend, uploads a one-sentence Trojan (webshell), and connects to it with China Chopper (the "Chinese kitchen knife" tool). The traffic data for this series of attacks is collected. The attack process is shown in Fig. 3:

(2) Campus network data D2

In this paper, a traffic monitoring system is deployed on three subnet nodes of the campus network. One of the subnets includes the school's CTF competition environment. A total of 2G of traffic data was collected from the network; after passing it through the IDS system and sensitive information screening, 10,870 pieces of alert data and 205,408 pieces of sensitive information traffic were obtained.

(3) LLDOS 1.0 data set D3 of DARPA 2000

This data set is widely used by researchers in the construction of multi-step attack scenarios. It consists of five attack steps: the attacker runs an IPsweep scan of all hosts in the network; the attacker probes the live hosts found in the previous step to determine which are running the sadmind remote management tool on the Solaris operating system; the attacker enters the target host through a remote buffer overflow attack; the attacker establishes a telnet connection through the attack script and uses rcp to install the mstream DDoS Trojan software; and the attacker logs in to the target host and launches a DDoS attack on other hosts in the LAN. Aggregation and screening yield one attack cluster, which contains 18 pieces of alert information.

4.2 Experimental Results

(1) The feasibility of the fusion algorithm of alert log and sensitive information flow.

First, the collected traffic data is passed through the IDS system to obtain the alert data. The sensitive information traffic is then extracted from the traffic on the Spark framework, using Python's pyspark module. After the sensitive information traffic and the alert log are processed by the fusion algorithm, detection accuracy and detection integrity are compared.

Figure 4 shows the detection accuracy and detection integrity on the three data sets, compared with the results of Yanyu Huo et al. [3]. After sensitive information traffic data is added, the detection integrity for multi-step attacks improves to a certain extent, and the detection accuracy is comparable to that of the method of Yanyu Huo et al. [3]. However, the method in this paper does not need a preset threshold for classification, so the proposed sensitive information flow and alert log fusion algorithm is feasible in practice. On the D3 data set there is no difference in detection accuracy or detection integrity, because the alert data already covers all the attack steps.

5 Conclusion

Figure 4 shows the detection accuracy and detection completeness for the three data sets. Compared with using only IDS alert logs as source data, the alert log and sensitive information flow fusion algorithm proposed in this paper can, to a certain extent, compensate for the false positives and false negatives of the alert data, and, because the traffic data preserves the complete attack process, it can identify attack behavior more deeply and completely. Combined with the kill chain model proposed in this paper, which adds the horizontal transmission stage and the stage of eliminating intrusion evidence, the method yields attack clusters with higher correlation, a higher attack success rate, and a definite sequence of attack stages, so that more complete multi-step attack behavior can be obtained in subsequent multi-step attack prediction.

References

1. Noel, S., Robertson, E., Jajodia, S.: Correlating intrusion events and building attack scenarios through attack graph distances. In: 20th Annual Computer Security Applications Conference, pp. 350–359. IEEE (2004)
2. Chien, S.-H., Ho, C.-S.: A novel threat prediction framework for network security. In: Advances in Information Technology and Industry Applications, pp. 1–9. Springer (2012). https://doi.org/10.1007/978-3-642-26001-8_1
3. Zhang, R., Huo, Y., Liu, J., et al.: Constructing APT attack scenarios based on intrusion kill chain and fuzzy clustering. Secur. Commun. Networks (2017)
4. Vasilomanolakis, E., Srinivasa, S., García Cordero, C., Mühlhäuser, M.: Multi-stage attack detection and signature generation with ICS honeypots. In: 2016 IEEE/IFIP Network Operations and Management Symposium, NOMS 2016, pp. 1227–1232 (2016). https://doi.org/10.1109/NOMS.2016.7502992
5. Salah, S., Maciá-Fernández, G., Díaz-Verdejo, J.E.: A model-based survey of alert correlation techniques. Comput. Netw. 57(5), 1289–1317 (2013)
6. Farhadi, H., AmirHaeri, M., Khansari, M.: Alert correlation and prediction using data mining and HMM. ISC Int. J. Inf. Secur. 3(2) (2011)
7. Shittu, R.O.: Mining intrusion detection alert logs to minimise false positives & gain attack insight. City University London. Thesis (2016)


Acknowledgement

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. The work is supported by the Science and Technology Project of the Headquarters of State Grid Corporation of China, “The research and technology for collaborative defense and linkage disposal in network security devices” (5700-202152186A-0-0-00).

Author information

Authors and Affiliations

State Grid Information and Telecommunication Branch, Beijing, China
Jie Cheng, Bingjie Lin, Jiahui Wei & Shulin Zhang
Beijing University of Posts and Telecommunications, Beijing, China
Ru Zhang & Siyuan Tian


Corresponding author

Correspondence to Ru Zhang.

Editor information

Editors and Affiliations

College of Communication Engineering, Jilin University, Jilin, Jilin, China
Zhihong Qian
Department of AI & ML, Vardhaman College of Engineering, Hyderabad, Telangana, India
M.A. Jabbar
College of Technology, Indiana State University, Terre Haute, IN, USA
Xiaolong Li

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


About this paper

Cite this paper

Cheng, J., Zhang, R., Tian, S., Lin, B., Wei, J., Zhang, S. (2022). Fusion of Traffic Data and Alert Log Based on Sensitive Information. In: Qian, Z., Jabbar, M., Li, X. (eds) Proceeding of 2021 International Conference on Wireless Communications, Networking and Applications. WCNA 2021. Lecture Notes in Electrical Engineering. Springer, Singapore. https://doi.org/10.1007/978-981-19-2456-9_9


DOI: https://doi.org/10.1007/978-981-19-2456-9_9
Published: 13 July 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2455-2
Online ISBN: 978-981-19-2456-9
eBook Packages: Engineering, Engineering (R0)
