1 Introduction

Since the birth of the Internet, cyber attacks have threatened users and organizations, and they have grown more complex as computer networks have grown more complex. Today, an attacker typically needs to perform multiple intrusion steps to achieve the ultimate goal. To detect network attacks, security researchers rely heavily on intrusion detection systems (IDS). However, because IDS alert data suffers from missed detections and false positives, multi-step attacks reconstructed only from alert logs are incomplete or incorrect.

In response to this problem, this paper designs a method for fusing traffic data and log data based on sensitive information. Using the Spark framework, sensitive traffic is screened out from the huge volume of traffic information, preprocessed, and merged with the alert log, yielding normalized data that serves as the data source. The normalized data is first clustered on a single feature, the IP address, and then filtered within and between clusters with the help of the kill chain model. The result is a set of highly complete attack clusters that conform to the kill chain attack stages.

2 Related Work

Multi-step attacks are the current mainstream attack method. So far, the correlation analysis methods of multi-step attacks can be divided into five categories: similarity correlation, causal correlation, model-based, case-based, and hybrid.

Similarity correlation is based on the idea that similar alerts have the same root cause and therefore belong to the same attack scenario. With the correct selection of similarity features, a more accurate attack scenario can be reconstructed, but it depends on the similarity of a small number of data segments.

The causal correlation method is based on prior knowledge or on a list of alert prerequisites and consequences derived from big data statistics. It can correlate common attack scenarios fairly accurately, but causal correlation based on prior knowledge lacks the means to reconstruct rare attack scenarios, and because the attack process is random, results derived from big data statistics lack confidence.

Model-based methods use existing or improved attack models, such as attack graphs, Petri nets, and the network kill chain, for pattern matching. They can match and reconstruct attacks that conform to the model, but lack the means to detect new attacks or APT attacks. Noel et al. [1] were the first to use attack graphs to match IDS alerts; this approach relies on prior knowledge such as the completeness of the attack graph and cannot detect unknown attacks. Chien and Ho [2] proposed a colored Petri net-based correlation system in which attack types are divided in more detail. Yanyu Huo et al. [3] used the network kill chain model for correlation analysis.

Case-based methods can only target a certain type of attack. Vasilomanolakis et al. [4] collected real multi-step attacks through honeypots and similar means, and developed case-based signatures. Salah et al. [5] modeled attacks through reasoning or manual analysis and added the resulting models to an attack database.

The hybrid method combines the strengths of several methods and has been the most commonly used approach in recent years. Farhadi et al. [6] combined attribute correlation and statistical relationship methods in the ASEA system, and used HMMs for plan recognition. Shittu [7] combined Bayesian inference with attribute correlation.

3 Algorithm Design

3.1 Meaning of Sensitive Information

Researchers rarely use traffic data as the data source for analysis, mainly because traffic data is huge in volume and hard to read. To address these two problems, this paper defines sensitive information and proposes a method for filtering sensitive information traffic based on the Spark framework.

Table 1. Sensitive information.

The ultimate goal of an attack is defined as modifying, adding, or stealing system data, or disrupting system behavior. On this basis, this paper identifies the sensitive information that an attack may touch through a questionnaire survey of security personnel and a statistical analysis of multi-step attack behavior. The results are listed in Table 1.

3.2 Sensitive Information Flow Screening Method Based on Spark Framework

The initially extracted traffic data contains the following basic fields: time, IP information, port information, and the transmitted content body msg. In this paper, the content body msg is processed in a distributed manner, and the sensitive information traffic is filtered out of the massive traffic data according to the sensitive information list Sl (Fig. 1).

Fig. 1. Alert data and traffic data extracted for the first time.
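The screening step can be illustrated with a minimal Python sketch. It is not the paper's implementation: the entries of `SENSITIVE_LIST` are hypothetical placeholders (Table 1 is not reproduced here), the record layout is assumed to be a dictionary with a `msg` field, and case-insensitive substring matching is an assumed matching rule. In a Spark job, a predicate such as `is_sensitive` could be passed to a filter transformation so that the check runs in parallel over partitions.

```python
from typing import Dict, Iterable, List

# Hypothetical placeholder entries for the sensitive information list Sl.
# The real list comes from the paper's Table 1, which is not reproduced here.
SENSITIVE_LIST = ["/etc/passwd", "select ", "union ", "cmd.exe"]


def is_sensitive(msg: str, sensitive_list: Iterable[str] = SENSITIVE_LIST) -> bool:
    """Return True if the content body contains any sensitive keyword.

    Matching is a case-insensitive substring test (an assumption).
    """
    lowered = msg.lower()
    return any(keyword.lower() in lowered for keyword in sensitive_list)


def filter_sensitive(records: List[Dict[str, str]],
                     sensitive_list: Iterable[str] = SENSITIVE_LIST) -> List[Dict[str, str]]:
    """Keep only the traffic records whose msg field is sensitive."""
    keywords = list(sensitive_list)
    return [r for r in records if is_sensitive(r["msg"], keywords)]


# Example traffic records (invented for illustration only).
traffic = [
    {"time": "1", "msg": "GET /index.html"},
    {"time": "2", "msg": "GET /?id=1 UNION SELECT password FROM admin"},
]
sensitive_traffic = filter_sensitive(traffic)
```

Keeping the predicate as a pure function of one record is what makes it straightforward to distribute: each record can be tested independently of all others.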

3.3 Data Normalization

The methods of multi-step attacks are ever-changing, but in essence they all rely on a combination of many single-step attacks to achieve the ultimate goal. Most multi-step attack processes conform to the kill chain model, which defines the attack stages as reconnaissance, weaponization, delivery, exploitation, installation, command and control, and actions on objectives. Based on this division and on the purpose of each attack stage, this paper divides a multi-step attack into seven stages: (1) information collection (reconnaissance, weaponization); (2) vulnerability exploitation (delivery, exploitation); (3) Trojan upload and remote command execution (installation); (4) remote Trojan connection and privilege escalation (command and control); (5) horizontal propagation; (6) destroying, stealing, and modifying information (actions on objectives); and (7) elimination of intrusion evidence. This divides attack behavior in more detail than the original kill chain model in two respects. Because current multi-step attacks may propagate like worms (for example, WannaCry), this paper adds a horizontal propagation stage. Because adding sensitive information traffic makes it possible to detect host-level activity that IDS alert data alone cannot reveal, a stage for eliminating intrusion evidence is added as well.

In summary, the kill chain model used in this article is shown in Fig. 2.

Fig. 2. Kill chain model used in this paper.

The normalization of the data depends mainly on the selection of feature fields, which must satisfy three criteria: (1) the similarity of the feature fields indicates, to a certain extent, the similarity of attacks; (2) the feature fields capture the important content of each data record; (3) the feature fields exist in all data sets. Based on these considerations, this paper selects the source IP address (src_ip), destination IP address (dst_ip), source port (src_port), destination port (dst_port), time (time), kill chain stage (killstep), and a flag that distinguishes the data source (datatype). The resulting normalized data set is:

$$ \begin{array}{l} \text{data} = \left\{ d_1, d_2, \ldots, d_n \right\},\ \text{where each } d_i \text{ is a 7-tuple,} \\ d_i = \left[ \text{src}\_\text{ip}, \text{dst}\_\text{ip}, \text{src}\_\text{port}, \text{dst}\_\text{port}, \text{time}, \text{killstep}, \text{datatype} \right] \end{array} $$
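As an illustration only, the 7-tuple can be written as a small Python record type. Several details are assumptions rather than statements from the paper: the stage numbers 1 to 7 simply follow the order in which the stages are listed in the text, the `datatype` values `"alert"` and `"flow"` are hypothetical labels, and the raw input is assumed to be a dictionary of strings with the same field names.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict


class KillStep(IntEnum):
    """Seven attack stages; the numbering is an assumed ordering."""
    INFORMATION_COLLECTION = 1
    VULNERABILITY_EXPLOITATION = 2
    TROJAN_UPLOAD_AND_COMMAND_EXECUTION = 3
    TROJAN_CONNECTION_AND_PRIVILEGE_ESCALATION = 4
    HORIZONTAL_PROPAGATION = 5
    GOAL_ACHIEVEMENT = 6
    EVIDENCE_ELIMINATION = 7


@dataclass(frozen=True)
class Record:
    """One normalized data item d_i (a 7-tuple)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    time: float
    killstep: KillStep
    datatype: str  # hypothetical labels: "alert" (IDS log) or "flow" (traffic)


def normalize(raw: Dict[str, str], datatype: str) -> Record:
    """Convert a raw alert or traffic entry into the common 7-tuple form."""
    return Record(
        src_ip=raw["src_ip"],
        dst_ip=raw["dst_ip"],
        src_port=int(raw["src_port"]),
        dst_port=int(raw["dst_port"]),
        time=float(raw["time"]),
        killstep=KillStep(int(raw["killstep"])),
        datatype=datatype,
    )
```

Using an integer enumeration for the stage keeps comparisons such as "stage greater than 3" direct, which the later filtering formulas rely on.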

3.4 Alert Log and Sensitive Information Flow Fusion Algorithm

Definition 1: Attack cluster collection:

\(\mathrm{attclusters}=\left\{{\mathrm{attcluster}}_{1},{\mathrm{attcluster}}_{2},{\mathrm{attcluster}}_{3},\dots ,{\mathrm{attcluster}}_{\mathrm{n}}\right\}\),

where \({\mathrm{attcluster}}_{\mathrm{i}}\) represents an attack cluster: \({\mathrm{attcluster}}_{\mathrm{i}}=\left\{{\mathrm{d}}_{\mathrm{a}},{\mathrm{d}}_{\mathrm{b}},\dots ,{\mathrm{d}}_{\mathrm{c}}\right\},\ {\mathrm{d}}_{\mathrm{x}}\in \mathrm{data}\).

  1. (A) IP similarity clustering

    At present, feature selection for similarity-based classification of network attacks falls into two types: one performs fuzzy clustering over multiple features such as IP address, port, and time, each with a different weight; the other performs strong-similarity clustering on a single feature. Because the subsequent multi-step attack model generation algorithm can, to a certain extent, recover multi-step attack behavior that was missed, this paper clusters on the similarity of a single feature, the IP address, as shown in formula (1):

    IP address similarity formula (a):

    $$ F_{ip} \left( {ip_1 ,ip_2 } \right) = \begin{cases} 1, & \text{if } \left( Similar\left( {src_{ip1} ,src_{ip2} } \right) \text{ and } Similar\left( {dst_{ip1} ,dst_{ip2} } \right) \right) \text{ or } dst_{ip1} = src_{ip2} \\ 0, & \text{otherwise} \end{cases} $$
    (1)

    Here, \(\mathrm{src}\_\mathrm{ip}\) and \(\mathrm{dst}\_\mathrm{ip}\) denote the source and destination IP addresses of a data record, respectively. If the source IP addresses of two records are in the same network segment and their destination IP addresses are also in the same network segment, the similarity value is 1 and the two records are considered to belong to the same attack process. By formula (1), the value is also 1 when the destination IP address of the first record equals the source IP address of the second. For example, given two IP addresses IP1 = A1.A2.A3.A4 and IP2 = B1.B2.B3.B4, Similar is defined as shown in formula (2):

    IP address similarity formula (b):

    $$ Similar\left( {IP1,IP2} \right) = \begin{cases} True, & A1 = B1 \text{ and } A2 = B2 \\ False, & \text{otherwise} \end{cases} $$
    (2)
  2. (B) Merging and filtering within attack clusters (Sim_in, CFD_in)

    Analysis of typical attack behavior shows that a large number of similar attack actions usually occur within a short period of time. Therefore, this paper merges and filters the data inside each attack cluster. The similarity within an attack cluster is given by formula (3), and the confidence within an attack cluster by formula (4):

  1. (1) Similarity within the attack cluster:

    $$ \text{Sim}\_\text{in}\left( d_1 ,d_2 \right) = \begin{cases} 1, & \text{if } \left( \text{sametime}\left( d_1 ,d_2 \right) \text{ and } \text{sameip}\left( d_1 ,d_2 \right) \right) \\ & \text{or } \left( \text{neartime}\left( d_1 ,d_2 \right) \text{ and } \text{samemsg}\left( d_1 ,d_2 \right) \text{ and } \text{sameip}\left( d_1 ,d_2 \right) \right) \\ 0, & \text{otherwise} \end{cases} $$
    (3)
  2. (2) Confidence within the attack cluster:

    $$ \text{CFD}\_\text{in}\left( d_i \right) = \begin{cases} 0, & \text{if } \text{killstep}\left( d_i \right) > 3 \text{ and } \text{killstep}\left( d_i \right) < \text{maxkillstep} \\ 1, & \text{otherwise} \end{cases} $$
    (4)

    If two records have the same time and the same IP addresses, their similarity is 1, which indicates that the same event was recorded by both the sensitive information traffic and the alert log. Records with the same attack name and the same IP addresses within a short time interval also have similarity 1, which indicates the same attack repeated over a short period. Records whose similarity is 1 are merged. For each record, if its kill chain stage is greater than 3 but smaller than the maximum kill chain stage reached in the attack cluster up to that record, its confidence is 0. Records with confidence 0 are removed from the attack cluster.

  3. (C) Filtering between attack clusters (CFD_out)

Because IDS detection is rule-based rather than result-based, the attack data actually acquired contains a large amount of failed-attack data. Attack clusters obtained only by classifying IP addresses therefore inevitably contain many unsuccessful attacks, as well as attacks that were abandoned partway because the attacker changed targets or an attack step failed. These incomplete attack behaviors would make the subsequent multi-step attack model incomplete. To filter out incomplete and incorrect attack clusters, this paper defines the confidence between attack clusters as shown in formula (5):

$$ \text{CFD}\_\text{out} = \sum_{i = 1}^{N} \text{killstep}\left( d_i \right) \cdot \text{typeCFD}\left( d_i \right) $$
(5)

where N is the number of attack records in the attack cluster. For each record, its kill chain stage \(\mathrm{killstep}\) is used as a weight and multiplied by its type confidence \(\mathrm{typeCFD}\) to give the confidence value of that record.
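Formulas (1), (2), (4), and (5) can be sketched in Python as follows. This is a sketch under stated assumptions, not the paper's implementation: records are plain dictionaries, the greedy grouping procedure in `cluster_by_ip` is one possible way to apply formula (1) (the paper does not specify the procedure), `maxkillstep` is read as the highest stage seen in the cluster up to the current record in time order, and the `TYPE_CFD` values are hypothetical because the paper does not define typeCFD numerically.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # keys used: src_ip, dst_ip, time, killstep, datatype


def similar(ip1: str, ip2: str) -> bool:
    """Formula (2): True when the first two octets of both addresses match."""
    return ip1.split(".")[:2] == ip2.split(".")[:2]


def f_ip(d1: Record, d2: Record) -> int:
    """Formula (1): IP similarity of two records (1 or 0)."""
    same_segments = (similar(d1["src_ip"], d2["src_ip"])
                     and similar(d1["dst_ip"], d2["dst_ip"]))
    chained = d1["dst_ip"] == d2["src_ip"]
    return 1 if same_segments or chained else 0


def cluster_by_ip(data: List[Record]) -> List[List[Record]]:
    """Greedy grouping (an assumption): in time order, a record joins the
    first cluster that already holds a member with F_ip = 1."""
    clusters: List[List[Record]] = []
    for d in sorted(data, key=lambda r: r["time"]):
        for cluster in clusters:
            if any(f_ip(member, d) for member in cluster):
                cluster.append(d)
                break
        else:
            clusters.append([d])
    return clusters


def filter_within(cluster: List[Record]) -> List[Record]:
    """Formula (4): drop records whose CFD_in is 0.

    maxkillstep is taken as the highest stage seen so far in the cluster,
    which is one reading of the text."""
    kept: List[Record] = []
    max_step = 0
    for d in cluster:  # cluster is assumed to be time-ordered
        max_step = max(max_step, d["killstep"])
        if not (d["killstep"] > 3 and d["killstep"] < max_step):
            kept.append(d)
    return kept


def cfd_out(cluster: List[Record], type_cfd: Callable[[Record], float]) -> float:
    """Formula (5): sum of killstep * typeCFD over the cluster."""
    return sum(d["killstep"] * type_cfd(d) for d in cluster)


# Hypothetical type confidences; the paper does not give numeric values.
TYPE_CFD = {"alert": 0.5, "flow": 1.0}

# Example records (invented for illustration only).
data = [
    {"src_ip": "10.0.1.5", "dst_ip": "192.168.1.10", "time": 1, "killstep": 1, "datatype": "alert"},
    {"src_ip": "10.0.2.7", "dst_ip": "192.168.3.20", "time": 2, "killstep": 2, "datatype": "flow"},
    {"src_ip": "192.168.3.20", "dst_ip": "172.16.0.9", "time": 3, "killstep": 5, "datatype": "flow"},
    {"src_ip": "10.0.1.5", "dst_ip": "192.168.1.10", "time": 4, "killstep": 4, "datatype": "alert"},
    {"src_ip": "8.8.8.8", "dst_ip": "1.1.1.1", "time": 5, "killstep": 1, "datatype": "alert"},
]
clusters = cluster_by_ip(data)
filtered = filter_within(clusters[0])
score = cfd_out(filtered, lambda d: TYPE_CFD[d["datatype"]])
```

In this example the third record joins the first cluster through the chained condition of formula (1), because its source address equals the destination address of the second record. The fourth record is then removed by formula (4), since its stage (4) is above 3 but below the stage already reached (5).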

4 Experimental Design and Analysis

4.1 Dataset

  1. (1) Simulation data D1

    This paper uses a content management system (CMS) to build a website that contains an SQL injection vulnerability. The following operations are then performed in sequence: scanning the website's admin back end with Yujian, obtaining the administrator account and password through SQL injection, logging in to the back end, uploading a one-sentence Trojan (a one-line webshell), and connecting to it with China Chopper. The traffic data for this series of attacks is captured. The attack process is shown in Fig. 3:

Fig. 3. Simulation experiment attack process.

  2. (2) Campus network data D2

    In this paper, a traffic monitoring system is deployed on three subnet nodes of the campus network, one of which hosts the school's CTF competition environment. In total, 2 GB of traffic data was collected from the network. After the data was passed through the IDS and the sensitive information screening, 10,870 alert records and 205,408 sensitive information traffic records were obtained.

  3. (3) LLDDos 1.0 (D3) of DARPA 2000

This data set is widely used by researchers for constructing multi-step attack scenarios. This paper follows its five attack steps: (1) the attacker uses IPsweep to scan all hosts in the network; (2) the attacker probes the live hosts found in the previous step to determine which of them run the sadmind remote management tool on the Solaris operating system; (3) the attacker breaks into the target hosts through a remote buffer overflow attack; (4) the attacker establishes a telnet connection through an attack script and uses rcp to install the mstream DDoS Trojan software; (5) the attacker logs in to the target host and launches a DDoS attack on other hosts in the LAN. Aggregation and screening yield one attack cluster, which contains 18 alert records.

4.2 Experimental Results

  1. (1) Feasibility of the fusion algorithm for alert logs and sensitive information traffic

First, the collected traffic data is passed through the IDS to obtain the alert data. The sensitive information traffic is then extracted from the traffic with the Spark framework, using Python's pyspark module. After the sensitive information traffic and the alert log are processed by the fusion algorithm, detection accuracy and detection completeness are compared.

Fig. 4. Comparison of detection accuracy and detection completeness.

Figure 4 compares the results on the three data sets with those of Yanyu Huo et al. [6] in terms of detection accuracy and detection completeness. After the sensitive information traffic data is added, the detection completeness for multi-step attacks improves to a certain extent, while the detection accuracy is comparable to that of the method of Yanyu Huo et al. [6]. In addition, the method in this paper does not require a preset threshold for classification, so the proposed fusion algorithm for sensitive information traffic and alert logs is feasible in practice. For the D3 data set there is no difference in detection accuracy or detection completeness, because the alert data already covers all the attack steps.

5 Conclusion

Figure 4 shows the detection accuracy and detection completeness for the three data sets. The conclusion is that, compared with using only IDS alert logs as source data, the proposed fusion algorithm for alert logs and sensitive information traffic can, to a certain extent, compensate for the false positives and false negatives of the alert data. Because the traffic data preserves the complete attack process, attack behavior can be identified more deeply and more completely. Combined with the kill chain model proposed in this paper, which adds a horizontal propagation stage and an intrusion evidence elimination stage, the method yields attack clusters with higher correlation, a higher attack success rate, and a definite order of attack stages. This in turn allows more complete multi-step attack behavior to be obtained in subsequent multi-step attack prediction.