1 Introduction

Web services are crucial computing technologies, and almost all users find them friendly because they overcome the limitations of interaction across heterogeneous platforms. Previous studies only defined the features of signals determined to be anomalous, stored them in a database, and used them to decide whether a signal received in the future is anomalous. Moreover, such signals cannot be processed in real time. For real-time anomaly detection, it is necessary to discover the essential items that determine an anomaly signal; therefore, the correlations between data or features must be analyzed. Service companies can meet the demands of consumers and the market by combining existing services, which allows them to avoid high development costs, but they must improve application quality through service selection. However, the security challenges of such services are becoming critical, and one of the most common attacks in a multiconnection environment is the distributed denial of service (DDoS). DDoS attacks can be directed at compromised devices or amplified using a spoofed source address, and many computers are used to launch a coordinated DDoS attack against multiple targets [4]. If a service provider is attacked, those waiting for its services are affected, causing economic damage and social problems. Various methods for detecting DDoS attacks have been proposed; however, research on the features that decisively influence the transformation of a signal into an abnormal one is insufficient. If a feature that affects the determination of anomalies is discovered, it will be a breakthrough for research on real-time anomaly detection. In this study, to detect attacks and determine the decisive feature, we used a graph neural network (GNN) to trace the flow of each feature that may become a potential anomaly signal by analyzing two datasets: the Coburg intrusion detection data sets (CIDDS) and KDDCup.
The remainder of this paper is organized as follows. A review of related studies is presented in Section 2, which explains the multiconnection and GNN. Section 3 describes the methodology with CIDDS and KDDCup, and Section 4 discusses the experiment and analysis. Finally, we present the conclusions of this study in Section 5.

2 Related works

2.1 Multiconnection

Multiconnection attacks, including DDoS attacks and flooding, have become common, and attackers have developed high-level attack skills and tools to launch variants of such attacks [10, 15]. The approach in [18] is based on frequency vectors; it characterized traffic in real time as a set of models to predict attacks, and [15] included a game-theoretical defense framework against DDoS. In a multiconnection, there are transmission control protocol (TCP) connection parameters, including the bandwidth, connection capacity, waiting time, no "acknowledgment," and "synchronize" flooding. TCP multiconnection attacks refer to TCP session-based attacks. Recently, various machine learning algorithms have been applied to DDoS defense and have achieved improved performance. Xu et al. [17] developed an approach that uses hidden Markov models and cooperative reinforcement learning to detect DDoS traffic by monitoring source Internet protocol (IP) addresses. Berral et al. [1] proposed a framework in which nodes in an intermediate network detect anomalous behaviors, share information on their local traffic observations to improve their global traffic perspective, and independently learn from this information using the naive Bayes algorithm. The authors of [9] investigated the anomaly detection of DDoS attacks and found that high accuracy and rapid detection of the anomaly signal are essential in statistical attack detection; if the detection system acts on the victim's side, it should be given high priority. The authors of [7] noted that feature selection in existing research is not tailored to multivariate correlation techniques. To solve this problem, they proposed a multivariate correlation-based network anomaly detection system, evaluating it with the UNSW-NB15 and NSL-KDD datasets and analyzing the feature correlations. They identified several weaknesses, but their solution was incomplete.

Fig. 1

Flow of GNN

2.2 GNNs

Figure 1 shows an example graph with five vertices and eleven edges, together with the relations encoded in the (c) adjacency matrix.

The GNN updates the state of each graph node using the function (\(f_w\)) defined in (1), where \(l_n\) is the label of node n, \(l_{co[n]}\) denotes the labels of its edges, \(x_{ne[n]}\) the states of its neighbors, and \(l_{ne[n]}\) their labels.

$$\begin{aligned} x_n=f_w(l_n,l_{co[n]},x_{ne[n]},l_{ne[n]}) \end{aligned}$$
(1)

In the design, both representations encode the same information. The get_neighbor operation gathers attributes from neighboring nodes, which is essential for aggregation; its complexity is O(n), where n is the number of neighbors of a particular vertex. In representation (b), the neighbors of a particular vertex are stored back-to-back because the graph is highly sparse, and (b) is therefore often used as the graph representation [5, 6]. In this study, we used (b) to describe the adjacency matrix as an edge array and a vertex array: the vertex array stores the number of neighbors of a particular vertex and points to the location in the edge array containing the corresponding neighbor vertices. GNN accelerators mainly target aggregation and combination. Aggregation can be generalized as a sparse matrix multiplication (SMM) operation, whereas combination is typically generalized as a general matrix multiply operation. Geng et al. [6] identified two possible computation orders for the forward propagation of a graph convolutional network (GCN): \((A \times X) \times W\) and \(A \times (X \times W)\). Here, A is the adjacency matrix, X is the feature matrix, and W is the weight matrix. Regardless of the computation order, any computation involving A is regarded as part of the aggregation, and any computation involving W is considered part of the combination. Other variables include F (the number of input features), G (the number of output features), and N (the number of neighbors of a vertex).
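As a minimal sketch of the (b)-style representation described above, the vertex array and edge array can be stored back-to-back as follows; the small graph here is purely illustrative, not the one in Fig. 1.

```python
# Compressed adjacency representation: a vertex array of offsets plus an
# edge array of concatenated neighbor lists (illustrative 5-vertex graph).
# Directed edges: 0->1, 0->2, 1->2, 2->0, 2->3, 3->4
vertex_array = [0, 2, 3, 5, 6, 6]  # vertex_array[v]..vertex_array[v+1] index edge_array
edge_array = [1, 2, 2, 0, 3, 4]    # neighbor IDs stored back-to-back

def get_neighbors(v):
    """Gather the neighbor IDs of vertex v; O(n) in the number of neighbors."""
    start, end = vertex_array[v], vertex_array[v + 1]
    return edge_array[start:end]

print(get_neighbors(2))  # -> [0, 3]
```

The offsets also encode each vertex's neighbor count (`end - start`), which is why a single pair of arrays suffices for both pieces of information mentioned above.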

3 Methodology

In this study, a method for detecting DDoS attacks based on the GNN with CIDDS and KDDCup was analyzed, and the interrelationships between features were investigated. Therefore, the instance features must be split to utilize information at the different feature group levels. After training the GNN, we attempted to set up the system for real-time detection. The framework flow is shown in Fig. 2; here, we discuss and analyze the design and application of this framework. CIDDS contains 14 features and KDDCup contains 47 features; however, we do not need to use all of them. The feature sets were reduced by removing the low-priority features, as listed in Tables 8 and 9.

Fig. 2

Methodology

3.1 Data description

CIDDS is a concept for generating evaluation datasets for anomaly-based network intrusion detection systems; it is a flow-based port scan dataset. Because the information technology industry is constantly evolving, attackers are forced to adapt and discover new ways to penetrate their targets of interest, as presented in Table 8 [8]. Hence, the development of intrusion detection systems is a constantly evolving contest between the attackers' attempts and the triggered adjustments of the defenders. Therefore, it is inexpedient to test current intrusion detection systems with old datasets [12, 13]. As summarized in Table 8, CIDDS contains 14 features, and we removed the useless ones (Table 4). The KDDCup dataset was generated from a simulated air force network environment in 1998. Additionally, [14] demonstrated that discrepancies exist in the attribute values of KDDCup; however, because of the lack of a standard benchmark dataset, KDDCup is often used for research purposes. Therefore, a dataset is needed that can precisely represent present-day attacks and contain attributes with true values [14]. The main dataset, KDDCup, contains 42 features, and KDDTrain, derived from KDDCup, is abstracted to 17 features (Table 9). Table 2 presents a summary of the categorized features: two features based on calculations, 13 noncalculation features, and one feature that decides whether the signal is normal or anomalous.

3.2 Aggregation

Figure 3(a) shows the data flow for CIDDS in the aggregation step. Here, because we assume an unweighted adjacency matrix, only adders are necessary: the adjacency matrix has only "1" as its values, so no multiplications are required. Both examples show V, the vertex dimension, as the outermost loop, which is mapped temporally; the next vertex starts after the current vertex, together with its inner loops, ends. Figure 3(a) maps F, the feature dimension, spatially, and N, the neighbor dimension, temporally in the inner loop. Because the features corresponding to each neighbor must be accumulated, the data flow requires a reduction strategy. Figure 3(a) and (b) require temporal and spatial reductions, respectively, and the spatial reduction is performed using a linear chain.
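The adder-only aggregation described above can be sketched as follows; the adjacency list and feature values are made up for illustration and are not taken from CIDDS or KDDCup.

```python
# Aggregation with an unweighted adjacency matrix: each vertex accumulates
# the feature vectors of its neighbors. The implicit "1" entries mean only
# additions are needed, never multiplications.
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}                 # adjacency list
features = {0: [1.0, 2.0], 1: [3.0, 4.0], 2: [5.0, 6.0]}   # F = 2 per vertex

def aggregate(v):
    """Temporal reduction over N: accumulate one neighbor at a time."""
    acc = [0.0] * len(features[v])
    for n in neighbors[v]:            # N loop (temporal)
        for f in range(len(acc)):     # F loop
            acc[f] += features[n][f]  # addition only
    return acc

print(aggregate(0))  # -> [8.0, 10.0]
```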

Fig. 3

(a) Aggregation. (b) Combination.

3.3 Combination

Figure 3(b) shows the data flow for the combination. The (V\(\cdot \)F) and (F\(\cdot \)G) matrices stream into the multiply-accumulate units. Each input arrives on a different cycle along the feature dimension; thus, that dimension is mapped temporally. The other dimensions are spatial, resulting in (V\(_{s}\), G\(_{s}\), F\(_{t}\)). This data flow is identical to that of an output-stationary systolic array. Figure 3(b) shows the output-stationary approach; when G is also mapped temporally, each PE holds multiple outputs.
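The combination step is a dense matrix multiply, and the two computation orders noted in Section 2.2, \((A \times X) \times W\) and \(A \times (X \times W)\), can be checked on toy matrices. This pure-Python sketch uses made-up values solely to verify the equivalence.

```python
# Verify that (A @ X) @ W == A @ (X @ W) on a tiny example.
# The second order shrinks the matrix from F to G columns before the
# (sparse) aggregation step, which is why the ordering matters for cost.
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

A = [[0, 1], [1, 0]]            # adjacency (N x N)
X = [[1, 2, 3], [4, 5, 6]]      # features (N x F), F = 3
W = [[1, 0], [0, 1], [1, 1]]    # weights (F x G), G = 2

out1 = matmul(matmul(A, X), W)  # aggregate first, then combine
out2 = matmul(A, matmul(X, W))  # combine first, then aggregate
print(out1 == out2)             # True: the two orders are equivalent
```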

4 Experiment and analysis

We described how to process the two datasets with the GNN in Section 3. In this section, we discuss the results for accuracy and the interrelationships between features.

4.1 Accuracy analysis

Table 2 presents the accuracy of the GNN on the whole dataset. The algorithm ran as follows to obtain the results. First, it defines node attention for the node classification to improve computational efficiency and differentiate the weights [11, 16]. To compute the node attention, we define the notation in Table 1.

Table 1 Notation

If (2) applies to the node features (Table 2),

$$\begin{aligned} h = \{\overrightarrow{h}_{1}, \overrightarrow{h}_{2}, \cdots , \overrightarrow{h}_{N}\} \end{aligned}$$
(2)

matrix h is transformed into \(h^{'}\) after the layering process, and its shape is \(\{N, F^{'}\}\). With shapes \(W = (F^{'}, F)\) and \(a = (2F^{'},1)\), the attention coefficient in Equation (3) expresses the importance of a feature, and Equation (4) yields the normalized attention score [16].

$$\begin{aligned} e_{i,j} = a(W\overrightarrow{h}_{i}, W\overrightarrow{h}_{j}) \end{aligned}$$
(3)
$$\begin{aligned} \alpha _{i,j} = softmax_{j}(e_{ij}) = \frac{exp(e_{ij})}{\Sigma _{k \in N_{i}} exp(e_{ik})} \end{aligned}$$
(4)

Finally, the attention mechanism \((\alpha )\) is defined by (5), with the learnable parameters forming a single-layer feed-forward neural network.

$$\begin{aligned} \alpha _{i,j} = \frac{exp(LeakyReLU(\overrightarrow{a}^{T}[W\overrightarrow{h}_{i} \mid W\overrightarrow{h}_{j}]))}{\Sigma _{k \in N_{i}} exp(LeakyReLU(\overrightarrow{a}^{T}[W\overrightarrow{h}_{i} \mid W\overrightarrow{h}_{k}]))} \end{aligned}$$
(5)
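A hedged sketch of the computation in (3)-(5) for one node's neighborhood is shown below; the values of W, a, and the node features are made up for illustration.

```python
# Attention coefficients for node 0 over its neighbors, per Eqs. (3)-(5):
# e_ij = LeakyReLU(a^T [W h_i || W h_j]), then a softmax over the neighborhood.
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def matvec(W, h):  # W h, with W of shape (F', F)
    return [sum(w * x for w, x in zip(row, h)) for row in W]

W = [[1.0, 0.0], [0.0, 1.0]]     # toy weights, F' = F = 2
a = [0.5, -0.5, 0.5, -0.5]       # attention vector of shape (2F', 1)
h = {0: [1.0, 2.0], 1: [3.0, 1.0], 2: [0.0, 1.0]}
neighbors_of_0 = [1, 2]

def e(i, j):  # unnormalized coefficient (Eqs. (3) and (5))
    concat = matvec(W, h[i]) + matvec(W, h[j])  # [W h_i || W h_j]
    return leaky_relu(sum(ai * c for ai, c in zip(a, concat)))

def alpha(i, j, nbrs):  # softmax over i's neighborhood (Eq. (4))
    denom = sum(math.exp(e(i, k)) for k in nbrs)
    return math.exp(e(i, j)) / denom

scores = [alpha(0, j, neighbors_of_0) for j in neighbors_of_0]
print(scores)  # normalized attention scores; they sum to 1
```

The scores weight each neighbor's contribution during aggregation, which is how the mechanism differentiates the feature weights mentioned above.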

Table 2 lists the resulting accuracy for each dataset: 94% for KDDCup and 91% for CIDDS; the corresponding kappa agreement is presented in Table 3.

Table 2 Accuracy for full dataset
Table 3 Kappa agreement table
Fig. 4

(a) Relation graph of CIDDS. (b) Relation graph of KDDCup

4.2 Interrelationships between features

We analyzed the accuracy for KDDCup and CIDDS together with the interrelationships between features, including a kappa analysis (Fig. 4). Tables 4 and 6 list the accuracy rate of each feature and the merit rate, i.e., the merit of the best subset found, which is based on the feature correlation. Tables 5 and 7 present the correctly classified instances of each dataset. The proposed technique classified (119,004, 97,372) correct instances for CIDDS and (90,522, 123,520) for KDDCup, with accuracies of (94.47%, 77.30%) and (71.86%, 98.85%), respectively; the incorrect instances were (6,969, 28,601) for CIDDS and (35,451, 1,453) for KDDCup, with rates of (5.53%, 22.70%) and (28.14%, 1.15%). The mean absolute error (MAE) values calculated using Equation (6) were (0.04%, 0.13%) for CIDDS and (0.01%, 0%) for KDDCup, and the root-mean-square error (RMSE) values from Equation (7) were (0.15%, 0.26%) and (0.08%, 0.04%) [2, 3].

$$\begin{aligned} MAE = \frac{1}{N} \sum \Vert y_{1} - y_{2} \Vert \end{aligned}$$
(6)
$$\begin{aligned} RMSE = \sqrt{\frac{1}{N} \sum (y_{1} - y_{2})^2} \end{aligned}$$
(7)
$$\begin{aligned} P_{0} = {\frac{TP+TN}{TP+TN+FP+FN}} \end{aligned}$$
(8)
$$\begin{aligned} {\begin{matrix} P_{e} = {\left[ \frac{TP+FP}{TP+TN+FP+FN} \times \frac{TP+FN}{TP+TN+FP+FN}\right] } \\ \quad + {\left[ \frac{FN+TN}{TP+TN+FP+FN} \times \frac{FP+TN}{TP+TN+FP+FN}\right] } \end{matrix}} \end{aligned}$$
(9)
$$\begin{aligned} Kappa\;K = {\frac{P_{0}-P_{e}}{1-P_{e}}} \end{aligned}$$
(10)
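As a sketch, Cohen's kappa in (10) can be computed directly from a binary confusion matrix using the standard observed-agreement and chance-agreement terms; the counts below are illustrative, not taken from Tables 2-3.

```python
# Cohen's kappa from binary confusion-matrix counts (TP, TN, FP, FN).
def kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    p0 = (tp + tn) / n  # observed agreement
    # Expected chance agreement: product of class marginals, per class.
    pe = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    return (p0 - pe) / (1 - pe)

print(round(kappa(tp=90, tn=80, fp=10, fn=20), 3))  # -> 0.7
```

A kappa near 1 indicates agreement well beyond chance, which is why it complements the raw accuracies reported above.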
Table 4 Selected features of CIDDS
Table 5 Correct instances of CIDDS
Fig. 5

Hierarchical relational graph of CIDDS

To analyze the interrelationships, we used the data analysis tool Weka (ver. 3.9.5). For preprocessing, the AttributeSelection filter was used, and the numeric features were converted to nominal because decisionTree requires nominal feature types. As a result, three CIDDS features, proto, class, and attackType, remained for decisionTree. Eleven of the 14 features in the CIDDS dataset were retained, and the data in Table 4 were used to produce Fig. 5 by connecting the selected features in a hierarchical shape. From Fig. 5, we analyzed how the features are connected and in which direction. The class is connected to {duration, src.pt, dst.pt}, which indicates that the class affects {duration, src.pt, dst.pt}. We ignored attackType because it only defines the results, such as {DDoS, bruteForce, normal}. In particular, the connection between {class, src.pt} is bidirectional; they have a two-way effect.

Because the class is defined as {normal, attacker, victim, suspicious, unknown}, every signal, including DDoS, is assigned one of these values. Therefore, we analyzed the class with decisionTree and produced Fig. 6. The attack signal passes through the attacker and the victim (Fig. 6), and the accuracy is 99.89% (Table 5).

Fig. 6

Class analysis

Table 6 Selected features of KDDCup

For KDDCup, we performed the same analysis as for CIDDS. With AttributeSelection, the selected KDDCup features were {service, flag, count, serror_rate, same_srv_rate, dst_host_count, dst_host_diff_srv_rate, dst_host_serror_rate, class}. The results were obtained through decisionTree (Table 7). Additionally, we performed the experiment with {service, flag, class} because these three features are nominal, and only nominal features can be used with decisionTree.

In addition to Table 7, we designed a hierarchical relational graph of KDDCup based on Table 6 (Fig. 7). The black arrows represent influence in both directions, and the gray arrows indicate influence in one direction. The flag had the highest accuracy (98.85%; Table 7). After counting how many arrows each feature sent and received, flag had [4, 2, 2] in the form [S: send, R: receive, B: bidirectional], whereas service had [2, 1, 2]. Therefore, selecting flag is suitable for this dataset.
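The [S, R, B] arrow count described above can be sketched as follows; the edge list here is a hypothetical stand-in, not the actual edges of Fig. 7.

```python
# Count sent (S), received (R), and bidirectional (B) influence arrows for a
# feature node, given a directed edge list between features.
def arrow_counts(edges, node):
    pairs = set(edges)
    s = r = b = 0
    for (u, v) in pairs:
        if (v, u) in pairs:      # both directions present
            if u == node:
                b += 1           # count each bidirectional pair once
        elif u == node:
            s += 1               # one-way arrow sent by `node`
        elif v == node:
            r += 1               # one-way arrow received by `node`
    return [s, r, b]

edges = [("flag", "count"), ("flag", "service"),
         ("service", "flag"), ("class", "flag")]
print(arrow_counts(edges, "flag"))  # -> [1, 1, 1]
```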

4.3 Discussion

This study was conducted to overcome an existing research problem, namely, the inability to detect network-based anomalies in real time. In the existing method, a dataset of network signals is configured, and whether a pattern matches that observed for an abnormal signal is determined. However, because that approach relies on a processing procedure that receives and stores signals for a specific time and analyzes them later, it cannot detect abnormal signals in real time. In this study, by contrast, the data correlation was analyzed using the GNN. By determining the critical keyword of the data produced by an abnormal network signal, the continuous flow of the corresponding data can be grasped while recognizing it as abnormal-signal data. An anomaly signal analysis and mutual feature correlation were conducted on the features using the GNN (Fig. 7), and the accuracy of detecting anomalies was 98.85%. The advantage of Fig. 7 is that it describes the connectivity between features; hence, if the data at any node are determined to be an abnormal signal, the abnormal signal flow of the data can be identified easily.

Fig. 7

Hierarchical relational graph of KDDCup

Table 7 Correct instances of KDDCup

5 Conclusion

The GNN is an artificial intelligence algorithm well suited to analyzing the correlations between data or features. If these correlations are analyzed using the GNN, the features of an anomaly signal can be traced to determine which signals they will be linked to, and by analyzing these connections, the flow of the anomaly signal can be detected in real time. In this study, we analyzed two datasets, CIDDS and KDDCup, using the GNN with 11 and 17 features, respectively. The flow of each feature was identified to investigate the feature interrelations and analyze the correlations between features, producing a graph that includes direction. To use decisionTree, we selected the features proto and class from CIDDS and service and flag from KDDCup, because these features are nominal and numeric features cannot be used in decisionTree. After filtering, we obtained the interrelation graph and the accuracy. Based on the results, the accuracies of proto in CIDDS and flag in KDDCup are higher than those of the other selected features. Features with high accuracy induce more, and more varied, effects on other features.