The Journal of Supercomputing, Volume 72, Issue 9, pp 3489–3510

Real time intrusion detection system for ultra-high-speed big data environments


Abstract

In recent years, the number of people using the Internet and network services has been increasing day by day. A large amount of data, ranging from petabytes to zettabytes, is generated over the Internet every day at very high speed. At the same time, security threats against networks, the Internet, websites, and enterprise networks are growing. Detecting intrusions in such an ultra-high-speed environment in real time is therefore a challenging task. Many intrusion detection systems (IDSs) based on machine learning have been proposed for various types of network attacks. Most of them are unable to detect recent unknown attacks, whereas the others do not provide a real-time solution to the above-mentioned challenges. To address these problems, we propose a real-time intrusion detection system for ultra-high-speed big data environments using a Hadoop implementation. The proposed system has a four-layered IDS architecture consisting of the capturing layer, the filtration and load balancing layer, the processing (Hadoop) layer, and the decision-making layer. Furthermore, a feature selection scheme is proposed that selects nine parameters for classification using forward selection ranking (FSR) and backward elimination ranking (BER), as well as an analysis of DARPA datasets. In addition, six major machine learning approaches are used to evaluate the proposed system: J48, REPTree, random forest tree, conjunctive rule, support vector machine, and Naïve Bayes classifiers. Results show that, among these classifiers, REPTree and J48 are the best in terms of both accuracy and efficiency. The proposed system architecture is evaluated with respect to accuracy in terms of true positives (TP) and false positives (FP), with respect to efficiency in terms of processing time, and by comparing results with traditional techniques. It achieves more than 99 % TP and less than 0.001 % FP with REPTree and J48. The system has higher overall accuracy than existing IDSs and is capable of working in real time in an ultra-high-speed big data environment.

Keywords

Machine learning; Intrusion detection; Threats; Big data; Network

1 Introduction

In this technological era, network and Internet speeds have reached the order of gigabytes and even terabytes. People from various fields, even those with little computer knowledge, benefit from Internet services. Companies profit by managing their resources and transactions over the network. Applications ranging from health care to the military are expanding their resources using various types of networks, such as sensor networks, vehicular networks, and cellular networks. However, the risk of cyber-attacks that steal personal and secret information from computers and networks is increasing at the same rate.

The world is full of such intruders, who try to penetrate into the secret network to steal data and destroy network resources. They might use their own single hidden system or use multiple ordinary users’ machines as zombies by taking illegal control over them without their knowledge to launch an attack on the network. Moreover, there are various other scenarios that the attacker practices to penetrate into the network and get illegal access by looking into the vulnerabilities that exist in the system. Many security mechanisms and intrusion detection systems (IDS) are proposed and are used to detect such intruders.

The intrusion detection systems concept was first introduced by Denning [1, 2] in 1986 while providing the first intrusion detection model that identifies abnormal behavior in the network. However, it is still an important topic for researchers due to the continuous evolution and changing structure of data, speed of networks, and changing adaptation techniques of the intruders.

We can define an intrusion as any illegal computer activity that gains access either passively, for information gathering, eavesdropping, etc., or actively, for harmful packet forwarding, packet dropping, hole attacks, etc. Other researchers define IDS from different perspectives [3, 4, 5, 6]. Butun et al. [3] define an IDS as a collection of tools, methods, and resources that help to identify, assess, and report intrusions. Intrusion detection is usually one part of the overall network security system installed on a system, not a separate protection measure [4]. Zhang et al. [5] describe intrusion as “any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource”, and note that intrusion prevention techniques, such as encryption, authentication, access control, and secure routing, are the first line of defense against attacks. Intrusion detection comes into play when prevention techniques have failed to protect resources from intruders. An IDS also makes the network more secure by detecting suspicious activity performed against the internal network by network members themselves. An IDS can help other systems to mitigate and remediate the effects of an intrusion by providing information about the attack, such as the intruder's identity and location, the time of the intrusion, and the intrusion type (e.g., active or passive, or the attack name, such as wormhole, black hole, sinkhole, or selective forwarding). IDSs are the cyberspace equivalent of the burglar alarms used in physical security systems [7].

Technological advancement in cyberspace increases the usage of ubiquitous networks, wireless sensor networks, and Web technologies. This abundant use of technology results in an exponential increase in network data traffic and speed. According to one report, 65 % of UK households were connected to the Internet in 2008 [8], rising to 80 % in 2012 [9]. Moreover, the overall computer-generated data were estimated at 2.27 zettabytes in 2012, with 8 zettabytes expected in the current year [10], and more than 90 % of that content was generated in the last 2 years [11]. On the Internet, these data are transmitted at very high speed in various ways. Therefore, an efficient system is needed that takes the high velocity of the data into account when analyzing it. Data with high volume, high velocity, and high variety are usually termed big data; they can be structured, semi-structured, or unstructured. In the era of big data, an IDS should be efficient enough to process ultra-high-speed transmissions in real time without losing any vital flow packets.

No existing solution is both powerful and generic while also achieving high accuracy and efficiency. It is therefore vital to develop a better IDS that secures valuable network resources by protecting machines from unauthorized or malicious actions, especially in high-speed network traffic environments. For any IDS, the main requirement is accuracy, followed by efficiency. In high-speed big data transmission, where data are transferred at gigabytes per second, the efficiency of the IDS becomes most important.

To address the aforementioned challenges, the proposed system meets the need for efficiency and high accuracy while operating continuously in the parallel environment of Hadoop, without introducing extra overhead that degrades transmission performance. The proposed system is an ultra-high-speed IDS that detects network intrusions in real time with high accuracy and efficiency. The contribution of the proposed scheme is manifold: (i) a Hadoop-based architecture is proposed for intrusion detection systems, (ii) an intrusion detection scheme is proposed that selects the nine best features of data flows and detects abnormal flows, (iii) the proposed intrusion detection system is implemented on Hadoop, and (iv) the proposed method is evaluated using machine learning classifiers. The proposed system is compared with existing techniques with respect to accuracy and efficiency, considering most of the well-known machine learning techniques.

The proposed system has higher accuracy and is more efficient than existing systems. Therefore, it has the capability to work in ultra-high-speed big data environment due to its obvious advantages over the traditional system. The system can be implemented by capturing traffic from either switch, router, gateway or any other high-speed network device with high-speed capturing card. It detects any intrusion in the system from any malicious user over the Internet. An abstract level model of the system is shown in Fig. 1.
Fig. 1 Implementation model of the proposed IDS

The rest of the paper is organized as follows. Section 2 describes the background of IDS and related work. Section 3 presents the proposed system, including the datasets and tools used for analysis and testing, feature and parameter selection for IDS, the proposed IDS architecture, and algorithms. Section 4 describes the implementation details, results and discussion, and evaluation. Finally, Section 5 concludes the article and presents future work.

2 Background and related work

Security attacks on a network system can be broadly categorized as active or passive. In passive attacks, the attackers are typically hidden and either capture communication on a transmission link or disable functioning elements of the network; examples include eavesdropping, node malfunctioning, node tampering or destruction, and illegal traffic analysis. In active attacks, intruders disrupt the operations of the attacked network to achieve their objectives; for example, attackers might want to degrade or terminate networking services. This can be achieved by denial of service (DoS), jamming, black hole, wormhole, sinkhole, flooding, and Sybil attacks. IDSs are mostly built for active attacks. The three major security strategies adopted against such attacks are prevention, detection, and mitigation. First, prevention ensures that no intruder can penetrate the system and gain illegal access; it works on the “prevent before it happens” rule. Second, detection handles those attacks that cannot be prevented: the system immediately starts a detection process that identifies the attack and the compromised node. Third, mitigation reacts to the effects of an attack and repairs the affected node and the damage.

Intrusion detection systems could be categorized on the basis of detection mechanisms and source(s) for which the detection is provided. For detection mechanism, IDS can be anomaly based, misuse based, or specification based, while IDS can be network based, host based, or hybrid, depending on the source of audit data.

In anomaly-based IDS, a profile of the standard statistical behavior of the member or network is maintained. The behavior is continuously monitored, and a significant deviation from the normal behavior is treated as an intrusion. Anomaly-based detection is very powerful against new attacks that are unknown and encountered for the first time. However, the profile of usual behavior must be updated periodically because network behavior changes with usage, and this periodic updating raises the overhead of the whole system. Anomaly-based IDS can be based on statistical measurements, in which the network traffic is captured and a stochastic, statistical, or probabilistic profile of its behavior is maintained. The profile can be per flow or for the traffic as a whole. An anomaly score is generated from the behavior profile, and a score that crosses a particular threshold is detected as an intrusion. Knowledge-based anomaly IDSs require prior knowledge about the network parameters under normal conditions as well as under attack. They can be built on an expert system (based on rule classification), a description language (based on UML), a finite state machine (states and transitions are defined for available normal data as well as intrusions), or data clustering and outlier detection (data are grouped into clusters based on a specified similarity or distance measure). One of the major techniques used in anomaly-based detection is machine learning (ML). In ML-based anomaly IDS, patterns are generated for a normal profile and an attack profile, and the model is updated periodically to improve IDS performance and accuracy. ML-based IDSs use Bayesian networks (probabilistic relationships among the important parameters), Markov models (stochastic Markov theory, in which the topology and capabilities of the system are modeled as states interconnected through certain transition probabilities), fuzzy logic (estimation under uncertainty), genetic algorithms (based on the evolutionary theory of biology), neural networks (inspired by the human brain), principal component analysis (PCA, a dimensionality reduction technique based on the eigenvalues of a matrix), and support vector machines (SVM, matrix based).
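The threshold idea behind statistical anomaly detection can be illustrated with a minimal Python sketch. It assumes the profile is simply a list of past measurements of one traffic metric and uses a z-score as the anomaly score. The function name `is_anomalous`, the z-score, and the default threshold of 3.0 are illustrative assumptions, not taken from any of the systems surveyed here.

```python
import statistics
from typing import Sequence


def is_anomalous(profile: Sequence[float], observed: float,
                 threshold: float = 3.0) -> bool:
    """Flag `observed` if it deviates too far from the normal profile.

    `profile` holds past measurements of one metric under normal
    conditions (assumed non-empty). The anomaly score is a z-score,
    which is an assumption made for this sketch.
    """
    mean = statistics.fmean(profile)
    sd = statistics.pstdev(profile)
    if sd == 0:
        # A constant profile: any different value counts as a deviation.
        return observed != mean
    return abs(observed - mean) / sd > threshold
```

In a real deployment the profile would have to be refreshed periodically, which is the overhead the paragraph above refers to.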

A rule-based technique is proposed in [12] that relies on a known radio propagation model describing the power decay of message transmission. The technique is very powerful against various attacks, such as DoS or flooding and wormhole. A message is treated as suspicious if its transmission power is inconsistent with its sender’s geographical position. Puttini et al. [7] proposed a Bayesian classification method to detect intrusion. Their main aim is to detect packet flooding that results in DoS. The model maintains multiple user profiles and applies posterior Bayesian classification to them as the detection algorithm. In [13], the estimated congestion at intermediary nodes is used as the decision-making mechanism to detect malicious behavior that causes packet dropping. The authors suggest that the traffic pattern can be one of the measurements used for hop-to-hop intrusion detection. The proposed technique is general and suitable for bandwidth unlimited networks with strict security requirements, such as tactical systems. The IDSs proposed in [14, 15, 16] use ML methods and classifiers to detect intrusion. They use the KDD Cup 99 dataset and introduce various parameters for ML classifiers to detect various attacks.

Abbes et al. [17] also used an ML approach for active IDS by analyzing different application protocols. They used a separate adaptive decision tree for each protocol, classifying records into two groups, benign and anomalous. Their system identifies DoS attacks, scan attacks, and botnets. Wagner et al. [18] and Khan et al. [19] use support vector machines (SVM) for intrusion classification. Wagner et al. use one-class SVM classifiers, which can detect new anomalies [20]. Moreover, a two-state network IDS model is proposed in [21] that uses k-means clustering to group data into three clusters: C1 for attack data (e.g., Probe, U2R, and R2L), C2 for DoS attack data, and C3 for normal data. Gaddam et al. [22] proposed an IDS scheme using k-means clustering and decision tree learning. Cho proposed using a Markov model for intrusion detection by comparing the intrusion model with the typical model [23]; the author used neural networks and fuzzy logic to make the system robust and flexible. Zhenwei et al. proposed an automatically tuned IDS for attack classification that involves operators when FPs occur [24].

Misuse-based detection can be signature based or rule based. Signatures or patterns of previous attacks are identified and then used for future detection. For instance, the signature of a small brute force attack can be “more than five failed sign-in attempts”. Signature-based detection is simple, accurate, and efficient for known attacks. However, it does not perform accurately for new kinds of attacks. Most antivirus products use this type of detection mechanism. On the other hand, the authors of [25] identified rules for rule-based intrusion detection, such as the interval rule, retransmission rule, integrity rule, delay rule, repetition rule, and radio transmission range rule. Wai et al. proposed a hybrid IDS that works on both wired and wireless ad hoc networks [26]. It uses misuse-based as well as anomaly-based detection mechanisms.
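The brute force signature quoted above can be written as a minimal Python sketch. The event format, a list of (source, success) pairs, and the function name `brute_force_suspects` are assumptions made for illustration; real signature engines match far richer patterns.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def brute_force_suspects(events: Iterable[Tuple[str, bool]],
                         max_failures: int = 5) -> Set[str]:
    """Return sources with MORE THAN `max_failures` failed sign-ins.

    Each event is a hypothetical (source, success) pair, where
    `success` is False for a failed sign-in attempt.
    """
    failures = Counter(source for source, success in events if not success)
    return {source for source, count in failures.items()
            if count > max_failures}
```

A source with exactly five failures is not flagged, matching the wording "more than five". The sketch also shows the limitation noted above: an attack that does not match the stored signature goes unnoticed.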

In specification-based IDS, specifications and constraints for a standard application are defined. The application is monitored against those constraints, and any deviation is detected as abnormal behavior and an intrusion. Nadkarni and Mishra propose a specification-based technique that is mainly concerned with detecting attacks such as DoS and replay attacks, as well as compromised nodes, in distance-vector routing protocols such as DSDV [27].

Network-based IDS monitors and analyzes each incoming packet of the network traffic and identifies intrusions that occur on the network. It can be implemented on network devices such as a switch, router, server, or gateway. Most of the above-mentioned work is network based. Francisco et al. [28] proposed a network intrusion detection system (NIDS) for a smart sensor-inspired device.

Host-based IDS is concerned with the events that occur at each node. It identifies intrusion activity on a single node resulting from events such as changes in critical system files on the host, repeated failed access attempts, unusual process memory allocations, and unusual CPU or I/O activity. One host-based anomaly detection system, ADMIT, was developed by Sequeira and Zaki [29]; it creates user profiles from sequences of user or computer commands.

Hybrid IDS has both network-based and host-based features. It performs intrusion detection on the host and on the network as a whole at the same time. El-Khatib proposed a hybrid IDS for 802.11 protocol-specific attacks [30]. The author uses the information gain ratio for feature selection and a k-means classifier for intrusion detection.

As the speed of network traffic increases day by day, it results in high-speed big data generation. In such an era, we need a system that can work efficiently in a high-speed environment. Only limited work has been done on intrusion detection in big data environments, and it lacks real-time implementation and efficiency. Tan et al. [31] proposed a theoretical framework to improve the security as well as the privacy of big data by studying the vulnerabilities that exist in cloud computing. Huang, Kalbarczyk, and Nicol [32] developed a latent Dirichlet allocation (LDA)-based hybrid approach for intrusion detection through knowledge discovery in big data. Similarly, Ahn et al. [33] outline a new model for detecting unknown attacks based on big data analysis techniques that extract information from various sources. In addition, Marchal et al. [34] proposed a big data architecture for large-scale security monitoring; however, their system did not consider accuracy and efficiency, and only limited analysis was performed. Overall, these systems present theoretical models, frameworks, or architectures but lack practical implementation.

Therefore, based on the challenges identified in the literature, it remains a challenging task to provide high-speed intrusion detection while designing a network in which an attacker cannot find a way to break the security. ML is the most widely used approach to detecting intruders with high accuracy. However, the existing techniques are still not efficient enough to process high-speed big data in real time. Building on previous ML knowledge, the proposed system detects intrusions based on nine features with higher accuracy. Moreover, efficiency in a high-speed data network is achieved by implementing the proposed architecture and algorithms using Hadoop (MapReduce). The details of the proposed system and its implementation are described in later sections.

3 Proposed model

1. Datasets, tool, and experimental environment We use publicly available and widely used benchmark datasets from three sources for analysis, testing, and evaluation. DARPA [35] is the basic dataset used for our analysis; it contains multiple complex attacks, including probing, breaking into the system by exploiting vulnerabilities, installing DDoS software on the compromised system, and launching a DDoS attack against another target. For testing and feature selection, the most widely used dataset, KDDCUP99 [36], is considered. This dataset is built on the traffic captured by DARPA [35], which contains various intrusions. Each flow is characterized by 41 parameters and labeled as normal or as an attack of a specific type. The KDD training dataset contains 24 specific types of intrusions, and the testing dataset contains an additional 14; they cover denial of service (DoS), user to root (U2R), remote to local (R2L), and probing attacks. Moreover, the NSL-KDD dataset [37], which resolves issues that exist in the KDD dataset, is also used for testing the proposed system. Redundant and duplicate records are also removed from KDD to make it more reliable for researchers. The DARPA and KDD datasets are approximately 5.5 and 1 GB in size, respectively.

We use Java with the Weka 3.6.12 library to implement the machine learning algorithms. We also use Hadoop in a single-node setup with the Pcap-Input Format, Hadoop-pcap-lib, and Hadoop-pcap-serde libraries to process real-time network traffic and large datasets and to calculate the flow parameters used by the machine learning (ML) classification algorithms for intrusion detection. Experiments and evaluation are performed on an Ubuntu 14.04 LTS system with 4 GB RAM and a Core i5 3.20 GHz processor. However, only 2 GB of RAM is used for the heap in the ML classifiers.

2. Features and parameters selection KDD99 [33] suggested 41 parameters for IDS classification. However, this number is too large: it increases the computational cost of the ML algorithms when processing large datasets or ultra-high-speed network traffic for intrusion detection. Moreover, it also reduces the accuracy of the system. Various techniques have been used to select features for intrusion detection and to find relationships between them. Aljarrah proposed the RF-FSR and RF-BER [38] feature selection techniques to select the best 16 of the 41 KDD features. Kayacik [14] and Araujo [15] reduced this number to 15 and 14, respectively. Still, 14 or more features are too many to process real-time traffic or large datasets efficiently, and some ML approaches take considerable time to process large datasets with those features. Kantor [16] finally selects the best 6 of the 41 features. Although this number is small enough for efficient processing of ultra-high-speed traffic, it reduces the accuracy of the system, especially for unknown future attacks. Keeping these requirements in mind, we use forward selection ranking (FSR) and backward elimination ranking (BER) [38] together to select the 4 best of the 41 features, namely features 1, 2, 3, and 16. Instead of parameters 6 and 7, i.e., src_bytes and dst_bytes, we use “number of packets” and “packet size mean”. Furthermore, by analyzing DARPA TCPDump traffic, we observed that the packet size distributions of normal and malicious flows for a particular application differ. Therefore, we added three more parameters to our selected feature list: pkt_rate, pkt_sd_size, and range_pkt_size.

In both the FSR and BER feature selection techniques, the weight of a feature plays a major role in the selection procedure. The enhanced support vector decision function (ESVDF) [39] is used to determine the weights of all parameters according to their importance for detection. After that, random forest [40] sorts all features by weight. In FSR, the two features with the highest weights among the 41 are selected initially and form a set called the selected features set (SFS). The SFS is used to build the intrusion classification model, which is evaluated on accuracy and efficiency in identifying intrusions. Afterward, the parameter with the highest weight among the remaining 39 is added to the SFS and the evaluation is repeated. If the newly added parameter improves the performance of the system in terms of accuracy and efficiency, it is kept in the SFS; otherwise, it is removed. This process continues until all 41 parameters have been evaluated by adding them to the SFS one by one. In BER, all 41 parameters are initially kept in the SFS. The parameters are removed from the SFS one by one in order of weight, from lowest to highest. If removing a parameter degrades the performance of the system, the parameter is added back to the SFS. We use FSR and BER together to select the 4–6 best parameters among all 41. Finally, we settle on the features selected by BER and FSR together with our analysis-based features. Table 1 lists all nine features that we use for intrusion classification.
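The two selection loops described above can be sketched in Python (used here for brevity rather than the Java/Weka stack of the experiments). The sketch assumes the features are already sorted by descending weight and that a caller-supplied `score` function returns a single number combining accuracy and efficiency. Both function names and the `score` callback are hypothetical, and the ESVDF weighting and classifier training steps are not reproduced.

```python
from typing import Callable, List, Sequence

# Hypothetical callback: trains and evaluates a classifier on the given
# feature subset and returns one combined performance score.
Score = Callable[[Sequence[str]], float]


def forward_selection_ranking(ranked: List[str], score: Score) -> List[str]:
    """FSR: `ranked` is sorted by descending weight."""
    sfs = list(ranked[:2])            # start with the two heaviest features
    best = score(sfs)
    for feature in ranked[2:]:        # try the rest, heaviest first
        trial = sfs + [feature]
        trial_score = score(trial)
        if trial_score > best:        # keep only if performance improves
            sfs, best = trial, trial_score
    return sfs


def backward_elimination_ranking(ranked: List[str], score: Score) -> List[str]:
    """BER: start from all features and drop the lightest first."""
    sfs = list(ranked)
    best = score(sfs)
    for feature in reversed(ranked):  # lowest weight first
        trial = [f for f in sfs if f != feature]
        if not trial:
            break
        trial_score = score(trial)
        if trial_score >= best:       # removal did not degrade performance
            sfs, best = trial, trial_score
        # otherwise the feature stays in the SFS
    return sfs
```

With a toy score that rewards two informative features and charges a small cost per feature, FSR and BER can return different subsets, which is one reason to consider both results together.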
Table 1 Selected features of the proposed IDS

| Serial # | Features | Details |
|---|---|---|
| 1 | Duration | Whole duration of the flow/session |
| 2 | Protocol | Protocol (TCP, UDP, HTTP, etc.) |
| 3 | Service | Particular service the host is using |
| 4 | Num_root | Number of roots involved |
| 5 | No. of packets | No. of packets |
| 6 | Pkt_rate | Packet rate in packets/second transmitted by a flow |
| 7 | Pkt_size_mean | Mean value of the packet size exchanged in the flow |
| 8 | Pkt_sd_size | Packet sizes standard deviation |
| 9 | Range_pkt_size | The range of the packet sizes |
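The packet-statistics features in Table 1 (features 1 and 5 to 9) can be computed from the packets of a single flow as in the following Python sketch. The input format, a non-empty list of (timestamp in seconds, size in bytes) pairs, the function name `flow_features`, and the use of the population standard deviation are assumptions made for illustration; the paper does not state which variant of the standard deviation is used.

```python
import statistics
from typing import Dict, List, Tuple


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """Compute per-flow statistics from (timestamp_s, size_bytes) pairs.

    Assumes `packets` is non-empty and belongs to one flow.
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    duration = max(times) - min(times)
    count = len(packets)
    return {
        "duration": duration,
        "num_packets": count,
        # Guard against single-packet or zero-length flows.
        "pkt_rate": count / duration if duration > 0 else 0.0,
        "pkt_size_mean": statistics.fmean(sizes),
        # Population standard deviation (an assumption of this sketch).
        "pkt_sd_size": statistics.pstdev(sizes),
        "range_pkt_size": max(sizes) - min(sizes),
    }
```

The remaining features (protocol, service, and Num_root) come from packet headers and session content rather than from size and timing statistics, so they are left out of this sketch.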

3. Classification algorithms Various machine learning classifiers such as naïve Bayes, support vector machine, random forest, J48, and REPTree are used to identify intrusions from the selected features. A short description of each classifier used in our work is given below. A naïve Bayes classifier is a model that assigns class labels, drawn from some finite set, to problem instances represented as vectors of feature values. It assumes that the value of a particular feature is independent of the values of the other features, given the class variable. For instance, a fruit may be considered to be an orange if its color is orange, its shape is round, and its diameter is about 3″. A naïve Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an orange, regardless of any correlations between the color, shape, and diameter features.
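In symbols, the independence assumption leads to the standard naïve Bayes decision rule below. The notation is introduced here for illustration only: $x_1, \dots, x_n$ are the feature values of an instance (for example, the nine flow features) and $C$ ranges over the class labels (for example, normal and attack).

```latex
\hat{C} \;=\; \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

Each factor $P(x_i \mid C)$ is estimated separately from the training data, which is what makes the classifier cheap to train and to apply.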

The support vector machine is a supervised learning model that analyzes data and recognizes patterns; it is used for classification and regression analysis. Each training example is marked as belonging to one of two categories. An SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

The conjunctive rule classifier implements a single conjunctive rule learner, which can predict numeric and nominal class labels. A rule consists of antecedents “AND”ed together and a consequent (class value) for the classification/regression. In this case, the consequent is the distribution of the available classes in the dataset. If a test example is not covered by the rule, it is predicted using the default class distribution of the training data not covered by the rule.

Random forest (tree based) is an ensemble learning method used for classification, regression, and other tasks. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.

The C4.5 algorithm is used to generate a decision tree. It is an extension of Quinlan’s earlier ID3 algorithm. The decision trees it generates can be used for classification, and for this reason C4.5 is often referred to as a statistical classifier. In this paper, we use J48, the Java implementation of C4.5 in Weka.

REPTree is a fast decision tree learner that builds a decision/regression tree using information gain/variance and then prunes it using reduced-error pruning (with backfitting). It sorts the values of numeric attributes only once, and it handles missing values by splitting the corresponding instances into pieces (i.e., as in C4.5).
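The information gain criterion that tree learners such as REPTree use to choose a split can be illustrated with a short Python sketch. It is a textbook formulation and does not reproduce Weka's implementation; the function names `entropy` and `information_gain` are chosen for this example.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent: Sequence[str],
                     children: List[Sequence[str]]) -> float:
    """Entropy reduction obtained by splitting `parent` into `children`."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted
```

A split that separates attack flows from normal flows perfectly yields the maximum gain, while a split that leaves both children as mixed as the parent yields no gain.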

4. Proposed architecture The main objective of the proposed system is to process network traffic in real time for intrusion detection with high accuracy in a high-speed big data environment. With this objective in mind, the proposed architecture is designed so that it can be implemented on any network device, such as a switch or router, and even on the gateways and firewalls of ISPs and telecommunication authorities. Initially, the traffic is captured from the ultra-high-speed network with high-speed capturing devices and drivers such as PF_RING and TNAPI [41], so that no packet remains uncaptured. The captured traffic is sent to the next layer, the filtration and load balancing server (FLBS). The FLBS has two primary responsibilities. First, it passes on for analysis only the traffic of those flows that have not yet been classified as intrusion or normal, using efficient searching and comparison in the In-Memory intruders database. Second, it sends the traffic of the unidentified flows and the required packet header information to the master servers of the third layer (the Hadoop layer). The FLBS also balances the load by deciding, based on the IP addresses, which packets are sent to which master server. The master takes the network packets and generates a sequence file for each flow so that it can be processed by the Hadoop data nodes. The master node extracts the necessary information from each packet using the Pcap-Input Format, Hadoop-pcap-lib, and Hadoop-pcap-serde APIs and stores it in the sequence file such that each packet corresponds to one line. This process continues for a particular duration for each flow. Afterward, the sequence file is sent to the data nodes, which are equipped with the feature value calculation algorithm implemented in MapReduce. The MapReduce code calculates the network flow features by processing the sequence file line by line in parallel. To achieve real-time efficiency, we use the Spark tool on top of the Hadoop ecosystem. Finally, the feature values are sent to the layer 4 decision server(s). The decision servers host implementations of the various classifiers, such as J48, REPTree, and SVM, which classify the flows as normal or attack based on their parameter values. The decision about each flow is stored in the In-Memory intrusion list, which the filtration server uses to filter the intruder’s traffic. The In-Memory database increases the efficiency of the system by providing data at high speed for comparison and searching. Apart from the proposed architecture, a few existing big data processing architectures [42, 43] can process high-speed data. However, the proposed architecture is designed specifically for intrusion detection systems. A complete picture of the architecture is shown in Fig. 2.
Fig. 2

Proposed IDS architecture
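
The two FLBS responsibilities described above (filtering flows that already have a decision, and picking a master server from the IP address) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `FLBS`, its methods, and the rule "source IP modulo number of masters" are assumptions made for illustration, and a plain dictionary stands in for the In-Memory intruders database.

```python
import ipaddress
from typing import Dict, NamedTuple, Optional


class FlowKey(NamedTuple):
    """4-tuple that identifies a flow."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


class FLBS:
    """Toy filtration and load balancing server (hypothetical names)."""

    def __init__(self, num_masters: int) -> None:
        self.num_masters = num_masters
        # Stand-in for the In-Memory database: flow -> "attack" or "normal".
        self.decided: Dict[FlowKey, str] = {}

    def route(self, flow: FlowKey) -> Optional[int]:
        """Return the master index for an undecided flow, or None if the
        flow already has a decision and needs no further analysis."""
        if flow in self.decided:
            return None
        # Assumption: the master is chosen from the source IP only, so all
        # packets of one flow reach the same master server.
        return int(ipaddress.ip_address(flow.src_ip)) % self.num_masters

    def record_decision(self, flow: FlowKey, label: str) -> None:
        """Store the decision server's verdict for later filtering."""
        self.decided[flow] = label
```

A flow is routed until the decision server reports back; after `record_decision` is called for it, `route` returns `None` and its packets are no longer forwarded to the Hadoop layer.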

5. Proposed algorithm

A joint algorithm is proposed for all layers to identify intruder flows. Flows are distinguished by a 4-tuple: source IP, destination IP, source port, and destination port (src_IP, dst_IP, src_port, dst_port). Algorithm 1 gives the pseudocode of the proposed algorithm. Initially, each captured packet is filtered at the FLBS, as described in step 2. The FLBS passes on for processing only those packets that belong to flows not yet identified as intruder or normal flows. Step 3 is performed at a master node, which checks whether the incoming packet belongs to an already registered flow. If it does not, the packet is registered as a new flow, identified by (src_IP, dst_IP, src_port, dst_port), and a new sequence file is created for this flow with the necessary packet information written to its first line. If the packet belongs to a registered flow, its information is simply appended to the existing sequence file for that flow. The master node continues to copy packet information into the sequence file for a particular duration for each flow. When the duration threshold is reached, the sequence file is sent to one of the data nodes for flow parameter calculation, as coded in step 5. The data node uses Map and Reduce functions containing the parameter calculation code to compute the final values of each of the nine features for intrusion detection. The Map and Reduce functions can run in parallel on the Hadoop environment, taking the sequence file as input. Since each data node processes the information of a distinct flow in parallel, overall performance is enhanced. Finally, the calculated feature values are sent to the decision server, which is equipped with various ML classifiers that decide, based on the feature values, whether the flow is an intrusion or a normal flow.
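
The master-node step (register a new flow, append one line per packet, hand the file over once the duration threshold is reached) can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: `MasterNode`, `add_packet`, the `"timestamp,size"` line format, and the in-memory list standing in for a Hadoop sequence file are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (src_IP, dst_IP, src_port, dst_port)
FlowKey = Tuple[str, str, int, int]


class MasterNode:
    """Toy master node that buffers packet records per flow."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds              # duration threshold per flow
        self.first_seen: Dict[FlowKey, float] = {}
        self.records: Dict[FlowKey, List[str]] = {}

    def add_packet(self, key: FlowKey, timestamp: float,
                   size: int) -> Optional[List[str]]:
        """Append one packet line to the flow's record list.

        Returns the complete list of lines once the flow has been observed
        for at least `window` seconds, otherwise None."""
        if key not in self.first_seen:            # register a new flow
            self.first_seen[key] = timestamp
            self.records[key] = []
        self.records[key].append(f"{timestamp},{size}")
        if timestamp - self.first_seen[key] >= self.window:
            del self.first_seen[key]
            return self.records.pop(key)          # ready for a data node
        return None
```

In this sketch a flow that has been handed over is forgotten by the master; in the described system, later packets of that flow would be stopped at the FLBS once a decision exists.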
The ML algorithms used in this paper are described in Sect. 3 (3). The decisions made by the decision servers are then reported to the In-Memory database at the FLBS to update the intruder flow list. The complete flow of the system is depicted in Fig. 3.

Fig. 3

Flowchart of the IDS algorithm
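
The Map and Reduce step that turns a flow's packet lines into feature values can be illustrated with a small pure-Python sketch. The nine features themselves are not listed in this section, so the four features below (packet count, byte count, duration, mean packet size) are placeholders chosen for illustration, as are the function names and the `"timestamp,size"` line format.

```python
from functools import reduce
from typing import Dict, Iterable, Tuple

# Partial aggregate: (packet count, total bytes, first timestamp, last timestamp)
Partial = Tuple[int, int, float, float]


def map_line(line: str) -> Partial:
    """Map step: turn one 'timestamp,size' line into a partial aggregate."""
    ts_text, size_text = line.split(",")
    ts = float(ts_text)
    return (1, int(size_text), ts, ts)


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial aggregates.

    The operation is associative and commutative, so partial results from
    different chunks of the file can be merged in any order."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


def flow_features(lines: Iterable[str]) -> Dict[str, float]:
    """Compute placeholder flow features from the packet lines of one flow."""
    count, total, first, last = reduce(combine, map(map_line, lines))
    return {
        "packets": count,
        "bytes": total,
        "duration": last - first,
        "mean_size": total / count,
    }
```

Because `combine` does not depend on the order of its inputs, the same result is obtained whether the lines are reduced in one pass or split into chunks that are reduced separately and then merged, which is the property a MapReduce job relies on.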

4 Implementation and evaluation

The proposed IDS is implemented in Java MapReduce and Spark on top of the Hadoop ecosystem, using the Hadoop-pcap-input, Hadoop-pcap-lib, and Hadoop-pcap-serde APIs for real-time packet processing. The system is deployed on a single-node Hadoop installation that acts as both master and data node. The ML classifiers are implemented in Java at the decision server, while Hadoop processes the sequence files and calculates the parameter values for each incoming flow. The most widely used and efficient ML classifiers are selected for evaluating the proposed system and features: naïve Bayes, support vector machine (SVM), conjunctive rule, random forest, REPTree, and J48 (the Java implementation of C4.5), described in Sect. 3. The system is evaluated for accuracy in terms of true positives (TP) and false positives (FP), and for efficiency in terms of processing time, using the above-mentioned classifiers on the KDD datasets. The accuracy evaluation uses three KDD [36] dataset files: the corrected dataset file with 311030 flows/IPs in total, the first 1048576 flows/IPs of the KDDCup.data.corrected file, and the KDDcup.data.corrected.10 % file with 494021 flows. Finally, the system is compared with older IDSs with respect to accuracy and efficiency. The techniques compared against are RF-FSR and RF-BER [38], Kayacik [14], Araujo [15], and Kantor [16]. With the J48 and REPTree classifiers, the proposed IDS has more than 99 % TP on all intrusion datasets. The comprehensive accuracy results in terms of TP and FP are shown in Table 2.
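Table 2

Accuracy of the proposed system on three files of the KDD99 dataset

| Serial | Classifier | Corrected dataset file TP (%) | Corrected dataset file FP (%) | KddCup.Data.Corrected TP (%) | KddCup.Data.Corrected FP (%) | KddCup.Data.Corrected_10% TP (%) | KddCup.Data.Corrected_10% FP (%) | Overall TP (%) | Overall FP (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Naive Bayes | 94.1 | 0.002 | 94.7 | 0.0001 | 95 | 0.0015 | 94.6 | 0.0012 |
| 2 | Conjunctive rule | 80.1 | 0.05 | 75 | 0.0001 | 78.95 | 0.061 | 78.02 | 0.037 |
| 3 | SVM | 97.7 | 0.005 | 94.3 | 0.0001 | 95.8 | 0.0001 | 95.93 | 0.001 |
| 4 | Random forest | 98.9 | 0.002 | 99.9 | 0.0001 | 99.9 | 0.00001 | 99.57 | 0.0007 |
| 5 | J48 | 99.9 | 0 | 99.9 | 0.0001 | 99.9 | 0 | 99.9 | 0.00003 |
| 6 | REPTree | 99.9 | 0.0005 | 99.9 | 0.0001 | 99.9 | 0 | 99.9 | 0.0002 |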

Intrusion detection using the proposed nine features performs well with the J48 and REPTree classifiers. Their TP rate reaches 99.9 % on the KddCup.Data.Corrected and KddCup.Data.Corrected_10 % dataset files, and their FP rate is very low, at most 0.0001 % on both files. Overall, both classifiers achieve more than 99 % TP and less than 0.001 % FP across the three files (Table 2). The accuracy results also show that the conjunctive rule classifier is a poor choice for intrusion detection, as it has a much lower TP rate and a much higher FP rate than the other ML classifiers.
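
For readers who want to reproduce the accuracy figures, the following Python sketch shows one way the metrics could be computed. The paper does not spell out its formulas, so the standard definitions are assumed here: TP rate as detected attacks over all attacks, and FP rate as false alarms over all normal flows. The "Overall" columns of Table 2 are consistent, up to rounding, with an unweighted mean of the three per-file values; that reading is an inference from the numbers, and the function names are illustrative.

```python
from typing import Sequence


def tp_rate(tp: int, fn: int) -> float:
    """True positive rate in percent (assumed definition: TP / (TP + FN))."""
    return 100.0 * tp / (tp + fn)


def fp_rate(fp: int, tn: int) -> float:
    """False positive rate in percent (assumed definition: FP / (FP + TN))."""
    return 100.0 * fp / (fp + tn)


def overall(values: Sequence[float]) -> float:
    """Unweighted mean of per-file values (inferred from Table 2)."""
    return sum(values) / len(values)


# Per-file TP values copied from Table 2.
naive_bayes_tp = [94.1, 94.7, 95.0]
conjunctive_rule_tp = [80.1, 75.0, 78.95]

print(round(overall(naive_bayes_tp), 1))       # Table 2 reports 94.6
print(round(overall(conjunctive_rule_tp), 2))  # Table 2 reports 78.02
```

Note that an unweighted mean gives each file the same influence regardless of how many flows it contains.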

Regarding efficiency in terms of processing time, because the proposed solution uses fewer parameters and runs on the parallel environment of Hadoop, it takes less time to process larger datasets. The IDS implementation using the REPTree classifier is the most efficient for both model building and decision-making, as shown in Figs. 4 and 5. The naïve Bayes classifier also performs well in model-building time with the proposed features; however, it is less efficient in decision-making and less accurate. The time consumed by the different classifiers for model building using the proposed features on the three dataset files is shown in Fig. 4.
Fig. 4

Time taken by each classifier to build a model of various files of the KDD99 dataset

Fig. 5

Time taken by each classifier to classify intrusion on various files of the KDD99 dataset

Figure 5 shows the time each classifier takes to make decisions after model building for the KDDCup dataset files. The random forest, SVM, and naïve Bayes implementations took more time to identify intrusions in the KDDcup.Data.Corrected file. The REPTree, J48, and conjunctive rule classifiers took almost the same time to process the datasets for intrusion detection. However, as shown in Table 2, the accuracy of the conjunctive rule classifier is lower than that of the other classifiers. Moreover, the naïve Bayes classifier is more efficient for model building but less efficient for decision-making. SVM is efficient for neither model building nor decision-making. Finally, from the accuracy and efficiency results of the various ML classifiers, we conclude that REPTree and J48 are the two best choices for accurate and efficient intrusion detection using the proposed features on Hadoop.
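
Figures 4 and 5 separate model-building time from decision-making time. A minimal Python sketch of how the two can be measured independently is given below. It is only an illustration: the paper's classifiers are implemented in Java, and `MajorityClassifier` is a trivial stand-in invented here so that the example is self-contained.

```python
import time
from collections import Counter
from typing import Any, List, Sequence, Tuple


class MajorityClassifier:
    """Hypothetical stand-in classifier: always predicts the most common label."""

    def fit(self, X: Sequence[Any], y: Sequence[str]) -> "MajorityClassifier":
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X: Sequence[Any]) -> List[str]:
        return [self.label for _ in X]


def time_build_and_classify(clf: Any, X_train: Sequence[Any],
                            y_train: Sequence[str],
                            X_test: Sequence[Any]
                            ) -> Tuple[float, float, List[str]]:
    """Return (model-building seconds, classification seconds, predictions)."""
    t0 = time.perf_counter()
    clf.fit(X_train, y_train)          # model building
    t1 = time.perf_counter()
    predictions = clf.predict(X_test)  # decision-making
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, predictions
```

Any object with `fit` and `predict` methods can be passed in, so the same harness could time several classifiers on the same data split.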
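Table 3

Accuracy comparison among different IDS on the corrected data file of the KDD99 dataset

| Classifier | TP (%): RF-FSR | TP (%): RF-BER | TP (%): Kayacik | TP (%): Araujo | TP (%): Kantor | TP (%): Our system | FP (%): RF-FSR | FP (%): RF-BER | FP (%): Kayacik | FP (%): Araujo | FP (%): Kantor | FP (%): Our system |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive Bayes | 91.4 | 88 | 91.8 | 90.1 | 91.4 | 94.1 | 0.003 | 0.003 | 0.005 | 0.002 | 0.003 | 0.002 |
| SVM | 95.8 | 95.8 | 95.8 | 95.4 | 94.1 | 97.7 | 0.009 | 0.009 | 0.009 | 0.01 | 0.11 | 0.05 |
| Conjunctive rule | 72.2 | 72.2 | 72.2 | 72.2 | 72.2 | 80.1 | 0.067 | 0.067 | 0.067 | 0.067 | 0.067 | 0.005 |
| Random forest | 98.1 | 97.9 | 97.9 | 97.6 | 97.2 | 98.9 | 0.002 | 0.003 | 0.002 | 0.001 | 0.004 | 0.002 |
| J48 | 98 | 99.9 | 97.9 | 97.5 | 97.2 | 99.9 | 0.002 | 0 | 0.002 | 0.001 | 0.004 | 0 |
| REPTree | 97.9 | 97.7 | 97.9 | 97.4 | 97.2 | 99.9 | 0.003 | 0.003 | 0.003 | 0.001 | 0.004 | 0.0005 |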

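Table 4

Accuracy comparison among different IDS on KDDcup.corrected.data file of the KDD99 dataset

| Classifier | TP (%): RF-FSR | TP (%): RF-BER | TP (%): Kayacik | TP (%): Araujo | TP (%): Kantor | TP (%): Our system | FP (%): RF-FSR | FP (%): RF-BER | FP (%): Kayacik | FP (%): Araujo | FP (%): Kantor | FP (%): Our system |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive Bayes | 94.9 | 94.7 | 94.9 | 94.3 | 97.1 | 94.7 | 0.001 | 0 | 0.001 | 0.001 | 0.001 | 0.0001 |
| SVM | 90 | 94 | 93.7 | 93.6 | 93.4 | 94.3 | 0.008 | 0.061 | 0.008 | 0.01 | 0.006 | 0.0001 |
| Conjunctive rule | 77.9 | 74 | 77.9 | 73.9 | 74 | 75 | 0.078 | 0.074 | 0.078 | 0.074 | 0.074 | 0.0001 |
| Random forest | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 0.0001 | 0 | 0 | 0.00001 | 0.00001 | 0 |
| J48 | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 0.0001 | 0 | 0 | 0.00001 | 0.00001 | 0 |
| REPTree | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 0.0001 | 0 | 0 | 0.00001 | 0.00001 | 0 |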

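Table 5

Accuracy comparison among different IDS on KDDcup.corrected.data_10 % file of the KDD99 dataset

| Classifier | TP (%): RF-FSR | TP (%): RF-BER | TP (%): Kayacik | TP (%): Araujo | TP (%): Kantor | TP (%): Our system | FP (%): RF-FSR | FP (%): RF-BER | FP (%): Kayacik | FP (%): Araujo | FP (%): Kantor | FP (%): Our system |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive Bayes | 95.5 | 94.8 | 94.5 | 93.6 | 92.8 | 95 | 0 | 0.001 | 0.001 | 0.001 | 0.007 | 0.0015 |
| SVM | 99.4 | 99.7 | 99.3 | 99.1 | 98.7 | 95.8 | 0.001 | 0.0001 | 0.002 | 0.002 | 0.004 | 0.061 |
| Conjunctive rule | 78.5 | 78.5 | 78.5 | 78.5 | 78.5 | 78.95 | 0.061 | 0.061 | 0.061 | 0.061 | 0.061 | 0.0001 |
| Random forest | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 0 | 0 | 0 | 0 | 0.00001 | 0.00001 |
| J48 | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 0 | 0 | 0 | 0 | 0.00001 | 0 |
| REPTree | 99.9 | 99.9 | 99.9 | 99.9 | 99.8 | 99.9 | 0 | 0 | 0 | 0 | 0.00001 | 0 |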

Finally, the proposed system is compared with the existing techniques mentioned above, considering efficiency in terms of processing time and accuracy in terms of TP and FP. The results on the various datasets show that the proposed IDS is more accurate with most of the ML classifiers, as shown in Tables 3, 4, and 5. Our technique has a higher TP and lower FP than most of the existing techniques on several dataset files. On the KDDcup.corrected.data file, Kantor's system using the naïve Bayes classifier outperforms the proposed system in terms of TP, as shown in Table 4. On the other hand, the proposed system outperforms Kantor's system by a large margin in processing time. Similarly, the conjunctive rule classifier with RF-FSR gives a better TP than our system, but its FP is considerably higher. For the KDDcup.corrected.data_10 % dataset file, shown in Table 5, the accuracy of most of the techniques equals that of the proposed system; however, the proposed system outperforms all of these approaches in processing time.
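
The per-classifier comparison can also be checked mechanically. The Python sketch below copies the TP values of three rows of Table 4 and lists, for each classifier, the systems whose TP is strictly higher than that of the proposed system. The helper name `systems_beating_ours` and the choice of rows are illustrative only.

```python
from typing import Dict, List

SYSTEMS = ["RF-FSR", "RF-BER", "Kayacik", "Araujo", "Kantor", "Our system"]

# TP (%) values copied from Table 4, in the column order of SYSTEMS.
TABLE4_TP: Dict[str, List[float]] = {
    "Naive Bayes": [94.9, 94.7, 94.9, 94.3, 97.1, 94.7],
    "SVM": [90, 94, 93.7, 93.6, 93.4, 94.3],
    "Conjunctive rule": [77.9, 74, 77.9, 73.9, 74, 75],
}


def systems_beating_ours(row: List[float]) -> List[str]:
    """Names of systems with a strictly higher TP than the last column."""
    ours = row[-1]
    return [name for name, tp in zip(SYSTEMS[:-1], row[:-1]) if tp > ours]


for classifier, row in TABLE4_TP.items():
    print(classifier, systems_beating_ours(row))
```

On these rows the check lists RF-FSR, Kayacik, and Kantor for naïve Bayes, RF-FSR and Kayacik for the conjunctive rule classifier, and no system for SVM. The remaining rows of Table 4 differ by at most 0.1 percentage points in TP.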

The efficiency comparison is based on the time consumed in building a classification model for intrusion detection and in decision-making, i.e., the classification itself, for the corrected dataset file. Figure 6 shows the model-building time in seconds, and Fig. 7 shows the classification or decision-making time for the various machine learning classifiers. For model building, only Kantor's system takes the same time as the proposed system. For decision-making, however, the proposed system outperforms Kantor's IDS with every classifier, as shown in Fig. 7.
Fig. 6

Efficiency comparison of various IDS systems based on model-building time for the corrected file of the KDD dataset

Fig. 7

Efficiency comparison of various IDS systems based on the classification (detection) time for the corrected file of the KDD dataset

For every ML classifier, the proposed system takes less decision-making time than all existing techniques, and its model-building time is matched only by Kantor's system. With the REPTree and J48 classifiers, the proposed system is the most efficient and at least as accurate as any other system compared. The evaluation shows that the system is accurate and efficient and is capable of performing well in an ultra-high-speed big data environment.

5 Conclusion

In this paper, we proposed a real-time intrusion detection system comprising a four-layered Hadoop-based IDS architecture, a feature selection algorithm, machine learning classifiers, and an intrusion detection algorithm, with implementation details. The architecture is composed of various Hadoop master and data nodes, which process high-speed real-time traffic efficiently owing to the parallel processing nature of Hadoop. We evaluated the system by implementing it on a single Hadoop node using MapReduce programming with various machine learning approaches. The system gives the best results with the REPTree and J48 classifiers using the proposed features, with an overall accuracy of more than 99 % TP and less than 0.001 % FP. Finally, we compared the proposed system with existing solutions with respect to efficiency in terms of processing time and accuracy in terms of TP and FP. The proposed system outperforms the existing solutions in both accuracy and efficiency. The most widely used intrusion datasets, such as DARPA, KDDCup99, and NSL-KDD, are used for evaluating and testing the system. We recommend implementing the proposed system, with the nine identified features, on Hadoop using REPTree or J48 for processing network traffic in a real-time, high-speed big data environment.

Notes

Acknowledgments

This study was supported by the Brain Korea 21 Plus project (SW Human Resource Development Program for Supporting Smart Life) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (21A20131600005). This work is also supported by an Institute for Information and Communication Technology Promotion (IITP) grant funded by the Korean government (MSIP) [No. 10041145, Self-Organized Software Platform (SoSp) for Welfare Devices].

References

  1. Denning D (1986) An intrusion-detection model. In: IEEE computer society Symposium on research security and privacy, pp 118–131
  2. Denning DE (1987) An intrusion-detection model. IEEE Trans Softw Eng 13(2):222–232. doi:10.1109/TSE.1987.232894
  3. Butun I, Morgera SD, Sankar R (2014) A survey of intrusion detection systems in wireless sensor networks. IEEE Commun Surv Tutor 16(1):266–282
  4. Ngadi M, Abdullah AH, Mandala S (2008) A survey on MANET intrusion detection. Int J Comput Sci Secur 2(1):1–11
  5. Zhang Y, Lee W, Huang YA (2003) Intrusion detection techniques for mobile wireless networks. J Wirel Netw 9(5):545–556
  6. Patcha A, Park JM (2007) An overview of anomaly detection techniques: existing solutions and latest technological trends. Elsevier J Comput Netw 51(12):3448–3470
  7. Puttini R, Hanashiro M, Miziara F, de Sousa R, Garcia-Villalba L, Barenco C (2006) On the anomaly intrusion-detection in mobile ad hoc network environments. In: Proc. 11th IFIP TC6 international conference on personal wireless communications. Springer, pp 182–193
  8. Engen V (2010) Machine learning for network based intrusion. Ph.D. dissertation, Bournemouth Univ., Poole
  9. ofcom (2013) Communications market report 2013 [Online]. http://www.ofcom.org.uk/cmruk/
  10. Sagiroglu S, Sinanc D (2013) Big data: a review. In: Collaboration technologies and systems (CTS), 2013 International Conference on. IEEE, pp 42–47
  11. Wu X, Zhu X, Wu G-Q, Ding W (2014) Data mining with big data. Knowl Data Eng IEEE Trans 26(1):97–107
  12. Pires Jr. WR, de Paula Figueiredo TH, Wong HC, Loureiro AAF (2004) Malicious node detection in wireless sensor networks. In: Proc. 18th Int. Parallel Distrib. Process. Symp.
  13. Rao R, Kesidis G (2003) Detecting malicious packet dropping using statistically regular traffic patterns in multihop wireless networks that are not bandwidth limited. In: Proc. IEEE GLOBECOM
  14. Kayacik HG, Zincir-Heywood AN, Heywood MI (2005) Selecting features for intrusion detection: a feature relevance analysis on kdd99 intrusion detection datasets. In: Proceedings of the third annual conference on privacy, security and trust, Citeseer
  15. Araujo N, de Oliveira R, Ferreira E-W, Shinoda A, Bhargava B (2010) Identifying important characteristics in the kdd99 intrusion detection dataset by feature selection using a hybrid approach. In: IEEE 17th international conference on telecommunications (ICT), pp 552–558. IEEE
  16. Kantor P, Muresan G, Roberts F et al (2005) Analysis of three intrusion detection system benchmark datasets using machine learning algorithms. In: Intelligence and security informatics, sec. 3, p 363. Springer-Verlag, Berlin, Heidelberg
  17. Abbes T, Bouhoula A, Rusinowitch M (2010) Efficient decision tree for protocol analysis in intrusion detection. Int J Secur Netw 5(4):220–235
  18. Wagner C, François J, State R, Engel T (2011) Machine learning approach for IP-flow record anomaly detection. In: Proc. 10th International IFIP
  19. Khan L, Awad M, Thuraisingham B (2007) A new intrusion detection system using support vector machines and hierarchical clustering. VLDB J 16(4):507–521
  20. Schölkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC (2001) Estimating the support of a high-dimensional distribution. Neural Comput 13(7):1443–1471
  21. Muda Z, Yassin W, Sulaiman MN, Udzir NI (2011) A K-means and naive bayes learning approach for better intrusion detection. Inf Technol J 10(3):648–655
  22. Gaddam SR, Phoha VV, Balagani KS (2007) K-Means+ID3: a novel method for supervised anomaly detection by cascading kmeans clustering and ID3 decision tree learning methods. IEEE Trans Knowl Data Eng 19(3):345–354
  23. Cho SB (2002) Incorporating soft computing techniques into a probabilistic intrusion detection system. Syst Man Cybern Part C Appl Rev IEEE Trans 32(2):154–160
  24. Yu Z, Tsai JJP, Weigert T (2007) An automatically tuning intrusion detection system. Syst Man Cybern Part B Cybern IEEE Trans 37(2):373–384
  25. da Silva AP, Martins M, Rocha B, Loureiro A, Ruiz L, Wong HC (2005) Decentralized intrusion detection in wireless sensor networks. In: Proc. 1st ACM International workshop on quality of service and security in wireless and mobile networks (Q2SWinet ’05), pp 16–23. ACM Press
  26. Wai FH, Aye YN, James NH (2005) Intrusion detection in wireless ad-hoc networks. CS4274, Introduction to Mobile Computing, term paper, School of Computing, National University of Singapore
  27. Nadkarni K, Mishra A (2003) Intrusion detection in MANETs-the second wall of defense. In: Proc. 29th annual conference of the IEEE industrial electronics society
  28. Francisco M-P et al (2011) Network intrusion detection system embedded on a smart sensor. Ind Electron IEEE Trans 58(3):722–732
  29. Sequeira K, Zaki M (2002) ADMIT: anomaly-based data mining for intrusions. In: Proc. eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp 386–395. ACM, New York
  30. El-Khatib K (2010) Impact of feature reduction on the efficiency of wireless intrusion detection systems. Parallel Distrib Syst IEEE Trans 21(8):1143–1149
  31. Tan Z, Nagar UT, Xiangjian He, Nanda P, Ren Ping Liu, Song Wang, Jiankun Hu (2014) Enhancing big data security with collaborative intrusion detection. Cloud Comput IEEE 1(3):27–33. doi:10.1109/MCC.2014.53
  32. Huang J, Kalbarczyk Z, Nicol DM (2014) Knowledge discovery from big data for intrusion detection using LDA. In: Big data (BigData Congress), 2014 IEEE international congress on, June 27 2014–July 2 2014, pp 760–761. doi:10.1109/BigData.Congress.2014.111
  33. Ahn S-H, Kim N-U, Chung T-M (2014) Big data analysis system concept for detecting unknown attacks. In: Advanced communication technology (ICACT), 2014 16th International Conference on, 16–19 Feb 2014, pp 269–272. doi:10.1109/ICACT.2014.6778962
  34. Marchal S, Jiang X, State R, Engel T (2014) A Big data architecture for large scale security monitoring. In: Big data (BigData Congress), 2014 IEEE international congress on, June 27 2014–July 2 2014, pp 56–63. doi:10.1109/BigData.Congress.2014.18
  35. I.S.T.G. MIT Lincoln Lab (2000) DARPA intrusion detection data sets. http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/2000data.html
  36. KDDcup99 (1999) Knowledge discovery in databases DARPA archive. http://www.kdd.ics.uci.edu/databases/kddcup99/task.html
  37. NSL-KDD (2009) NSL-KDD data set for network-based intrusion detection systems. http://iscx.cs.unb.ca/NSL-KDD/
  38. Al-Jarrah OY et al (2014) Machine-learning-based feature selection techniques for large-scale network intrusion detection. In: Distributed computing systems workshops (ICDCSW), 2014 IEEE 34th international conference on. IEEE
  39. Engen (2010) Machine learning for network based intrusion detection. Doctoral dissertation, Bournemouth University
  40. Zaman S, Karray F (2009) Features selection for intrusion detection systems based on support vector machines. In: Consumer communications and networking conference, 2009. CCNC 2009. 6th IEEE, pp 1–8
  41. Fusco F, Deri L (2010) High speed network traffic analysis with commodity multi-core systems. ACM IMC 2010
  42. Rathore MMU, Paul A, Ahmad A, Chen B, Huang B, Ji W (2015) Real-Time Big Data Analytical Architecture for Remote Sensing Application. Sel Top Appli Earth Observations Remote Sens, IEEE J 8(10):4610–4621. doi:10.1109/JSTARS.2015.2424683
  43. Ahmad A, Paul A, Rathore MM (2016) An efficient divide-and-conquer approach for big data analytics in machine-to-machine communication. Neurocomputing 174:439–453

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea
