1 Introduction

Any set of actions that attempts to compromise the integrity, confidentiality or availability of resources is called an intrusion[1]. An intruder is an individual or a group of individuals who initiates the actions in the intrusion. An intruder may be a legitimate user of a computer system. It can also be an illegitimate user who gains entry through an unprotected network service on the computer by exploiting its vulnerability.

An intrusion detection system (IDS) is a monitoring system that raises an alarm to the system operator whenever its detection model infers an intrusion. An IDS may be software, hardware or a combination of both, used to detect intruder activity. Its capabilities depend on how complex and sophisticated its components are. IDSs are manufactured by many companies. An IDS may use signatures, anomaly-based techniques or both[2–4]. When an IDS detects an intruder, it must inform the security administrator through alerts, which may take the form of pop-up windows, logging to a console, sending e-mail, etc.

Today, the IDS is considered to be one of the important protection tools. Researchers are working hard to make the IDS smart enough to detect all sorts of attacks. Various soft computing techniques, e.g., fuzzy logic, artificial neural networks and genetic algorithms (GAs), are being used for making the intrusion detection rules[5–7].

In a conventional GA, the length of chromosomes is fixed. This makes the GA implementation easy, but at the cost of a few shortcomings:

  1) There is no guarantee that all the required rules will be generated.

  2) It causes wastage of computational time.

One solution to this problem is to use variable length chromosomes (VLCs)[8] allowing inclusion of one or more rules in chromosomes[9].

This paper presents the use of VLCs in GA-based rule generation for network intrusion detection. This is the first time a VLC approach of this type has been used for the intrusion detection problem. The generated rules are then used for the detection of infected connections. The experimental results show that the proposed technique is effective in intrusion detection.

The motivation of the presented work and a brief overview of the IDS have been discussed above. The rest of the paper is organized as follows. Section 2 gives an overview of the genetic algorithm employed in this work. Section 3 surveys the relevant work. Section 4 presents the proposed GA with VLCs. Section 5 presents the implementation and results. Section 6 concludes the paper.

2 Genetic algorithm

Genetic algorithms (GAs) are search algorithms based on the mechanics of natural selection and natural genetics.

GAs were developed by John Holland and his colleagues and students at the University of Michigan. They differ from other optimization techniques in several ways[10]. A GA is blind: to perform an effective search for better structures, it requires only payoff values. Simplicity of operation and power of effect are the two main attractions of the GA approach.

A GA has the following three operators[10]: reproduction, crossover and mutation. A genetic algorithm starts with the generation of a random population; the fitness of each individual is then determined using an appropriate fitness function. This population undergoes an iterative process until a solution is found or the specified computation is completed. First, the chromosomes are randomly selected using one of the selection techniques such as roulette wheel selection, tournament selection, rank selection, steady state selection, etc. The selected chromosomes undergo the regeneration process. The first step of regeneration is crossover or recombination. There are various crossover techniques such as one-point, two-point, uniform, etc. The result of crossover is the birth of two new chromosomes.

A mutation operator is then applied to these newborn chromosomes. Mutation alters one or more gene values in a chromosome. Mutation is an important part of the regeneration process as it helps to prevent the population from stagnating at a local optimum. The fitness of the new chromosomes is then determined using the fitness function. When the specified iterations are completed, the best fit chromosome is chosen as the solution to the problem.
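The loop described in this section can be sketched as follows. This is a minimal, generic illustration written in Python for brevity, not the implementation used in this work; the bit-string encoding, the choice of tournament selection, the toy `ones_count` payoff and all parameter values are assumptions made only to keep the example small.

```python
import random


def ones_count(chromosome):
    # Toy payoff (assumption): number of 1-bits in the chromosome.
    return sum(chromosome)


def tournament(population, fitness, rng, size=2):
    # Tournament selection: best of a few randomly drawn individuals.
    return max(rng.sample(population, size), key=fitness)


def one_point_crossover(a, b, rng):
    # Swap the tails of two equal-length parents at a random point.
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate, rng):
    # Flip each bit independently with probability `rate`.
    return [1 - g if rng.random() < rate else g for g in chromosome]


def run_ga(fitness, length=20, pop_size=30, generations=50,
           crossover_rate=0.5, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Random initial population of fixed-length bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            if rng.random() < crossover_rate:
                c1, c2 = one_point_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            offspring.append(mutate(c1, mutation_rate, rng))
            offspring.append(mutate(c2, mutation_rate, rng))
        population = offspring[:pop_size]
        # Remember the best chromosome seen in any generation.
        best = max(population + [best], key=fitness)
    return best
```

Calling `run_ga(ones_count)` returns the best bit string found; only payoff values are passed to the search, which is the sense in which a GA is "blind".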

3 Related work using GA approach

Different researchers have implemented GA in different ways to generate rules for intrusion detection.

Middlemiss and Dick[11] used GA for weighted feature extraction with specific application to intrusion detection data. They implemented a simple genetic algorithm which evolves weights for the features of data set. A k-nearest neighbor classifier was used for the fitness function of GA as well as to evaluate the performance of the new weighted feature set.

Gong et al.[7] used a GA-based approach for network intrusion detection. The genetic algorithm is used to generate optimized rules for network intrusion detection from network audit data. The support-confidence framework is used as the fitness function to calculate the fitness of each rule. The fittest rules are then used for network intrusion detection.

Zhao et al.[12] used clustering genetic algorithms to solve the computer network intrusion detection problem. They describe a prototype intelligent intrusion detection system to demonstrate its effectiveness. This system combines two stages: a clustering stage and a genetic optimization stage. The algorithm can not only cluster the cases automatically, but also detect unknown intrusive actions.

Xiao et al.[13] presented a network intrusion detection method based on information theory and genetic algorithm. They used information theory to filter the traffic data and thus reduce the complexity. A linear structure rule is used to classify the network behavior into normal and abnormal behaviors.

Lee et al.[14] presented a feature selection method that maximizes class separation between normal and attack patterns of computer network connections. They focused on selecting a robust feature subset based on a genetic optimization procedure in order to improve the true positive intrusion detection rate.

Ashfaq et al.[15] used genetic algorithm for generating efficient rules for cost sensitive misuse detection in intrusion detection systems.

Chen et al.[16] designed a training algorithm model based on abnormality detection. The proposed experimental model is based on the hypothesis that if a variable x appears more times than the desired value, an abnormality may be occurring.

In the above papers, the genetic algorithm is used either to generate the detection rules or to select the appropriate features from the data set. All of them use fixed-length chromosomes, with only one rule in each chromosome. This conventional technique has some drawbacks. First, there is no guarantee that all the required rules will be generated. Further, it wastes a lot of computational time.

4 The proposed GA-VLC based intrusion detection method

The proposed GA-based intrusion detection is implemented in two different phases. In the first phase, the classification rules are generated using a computer algorithm written in Java 6. In the second phase, these rules are used to classify or detect the infected connections.

4.1 Data set

MIT Lincoln Laboratory, under Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL) sponsorship, has collected and distributed the first standard data for evaluation of computer network intrusion detection systems. This data is DARPA 1998 data[17]. This data consists of tcpdump and basic security module (BSM) list files. Each line in a list file corresponds to a separate session. Each session corresponds to an individual TCP/IP connection between two computers. The first nine columns in list file provide information which identifies the TCP/IP connection.

Table 1 gives the number of records of each type present in the data set. The first row shows the number of normal records. The second and third rows give the distributions of Smurf and Neptune attacks, respectively.

Table 1 The distribution of record types

The Smurf and Neptune attacks are of Denial of Service type.

4.2 Feature selection and representation

The seven features most likely to be involved in network intrusions are selected for defining the intrusion rules[7, 18]. These are duration (h:m:s), service (integer), source port (integer), destination port (integer), source IP (a, b, c and d), destination IP (a, b, c and d) and attack name (integer).

Each rule is in if-then form containing a condition and its outcome. The rule is of the form:

IF duration = 0:00:01 & protocol = telnet & source port = 19468 & destination port = 120 & source IP = 001.002.003.004 & destination IP = 172.016.112.050 THEN Neptune.

The structure of a chromosome comprising n rules is shown in Fig. 1.

Fig. 1 VLC structure

The number of rules in a chromosome is limited. We begin by setting an upper limit on the number of rules in a chromosome, say 15. However, we do not know in advance exactly how many rules are required; this should also be identified by the algorithm. The chromosome should therefore be able to increase or decrease its number of rules. Wild card values can be used in any field of a rule. We have used wild card values in the third and fourth parts of both the source IP and the destination IP. A value of −1 is placed in a field chosen as a wild card.

The structure of a rule, which is made up of genes, is shown in Fig. 2. The status field simply indicates the presence or absence of an attack.

Fig. 2 Rule structure
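To make the rule and wild card representation concrete, the following Python sketch shows one possible way to match a connection against the rules of a chromosome. It is an illustration only, not the implementation used in this work. The field names, the encoding of duration in seconds and of the service as an integer code, the use of a string for the attack name, and the "first matching rule decides" policy are all assumptions; only the −1 wild card value comes from the text.

```python
WILDCARD = -1  # value used to mark a "match anything" gene

# Assumed gene layout of the condition part of one rule (hypothetical names).
FIELDS = ("duration_s", "service", "src_port", "dst_port",
          "src_a", "src_b", "src_c", "src_d",
          "dst_a", "dst_b", "dst_c", "dst_d")


def rule_matches(condition, connection):
    # A gene matches when it is a wild card or equals the connection's value.
    return all(g == WILDCARD or g == connection[name]
               for g, name in zip(condition, FIELDS))


def classify(chromosome, connection, default="normal"):
    # A chromosome is a variable-length list of (condition, attack) rules.
    # Assumption: the first rule whose condition matches decides the label.
    for condition, attack in chromosome:
        if rule_matches(condition, connection):
            return attack
    return default
```

With wild cards in the third and fourth parts of both IP addresses, a single rule covers every connection between the two corresponding address ranges that agrees on the remaining fields.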

4.3 Fitness function

The fitness function is based on the number of errors committed by the rules and the number of rules in a chromosome. The fitness value of a chromosome decreases as the number of errors committed by its rules increases. Both false positive and false negative errors are considered. A false positive error occurs when no intrusion has occurred but an attack or an attempted attack is reported. A false negative error occurs when an intrusion occurs with no warning. The fitness value also decreases as the number of rules in a chromosome increases.

$$f = {k \over {1 + error}} + {{1 - k} \over n}$$

where k is chosen to be 0.8, error is the sum of false positive and false negative errors, and n is the number of rules in a chromosome.
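The fitness formula translates directly into code. The Python sketch below is an illustration only; it assumes that error is a non-negative count of misclassified records and that a chromosome contains at least one rule, neither of which is stated explicitly in the text.

```python
def fitness(error, n_rules, k=0.8):
    # f = k / (1 + error) + (1 - k) / n
    # error: false positives + false negatives (assumed to be a count)
    # n_rules: number of rules in the chromosome
    if n_rules < 1:
        raise ValueError("a chromosome must contain at least one rule")
    if error < 0:
        raise ValueError("error must be non-negative")
    return k / (1 + error) + (1 - k) / n_rules
```

For example, with k = 0.8, a chromosome with 4 rules and 3 errors scores 0.8/4 + 0.2/4 = 0.25, while an error-free chromosome with a single rule attains the maximum value of 1.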

4.4 Crossover and mutation

Crossover is an important genetic operator that combines the two parent chromosomes to produce two new offspring chromosomes. The idea behind crossover is that the new chromosome may be better than both of the parents if it takes the best characteristics from each of the parents. Crossover occurs during evolution according to a user-defined crossover probability.

In the presented approach, the one-point crossover technique is used. The lengths of both parent chromosomes are checked, and the shorter chromosome is taken as parent 1. If both chromosomes have the same length, either one is taken as parent 1. Then, a crossover point is randomly chosen within parent 1. As shown in Fig. 3, the parts of the two parent chromosomes after the crossover point are interchanged[19, 20].

Fig. 3 Crossover on a VLC
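A possible reading of this crossover step is sketched below in Python. It is not the implementation used in this work. It assumes that a chromosome is a list of rules, that the cut falls on a rule boundary, and that the same cut index is applied to both parents; the text only states that the point is chosen within the shorter parent.

```python
import random


def vlc_one_point_crossover(parent_a, parent_b, rng=random):
    # Chromosomes are lists of rules; the shorter one plays the role of parent 1.
    p1, p2 = sorted((parent_a, parent_b), key=len)
    if len(p1) < 2:
        # No interior cut point exists; return unchanged copies.
        return list(p1), list(p2)
    # The cut point is drawn within parent 1, so it is valid for both parents.
    point = rng.randint(1, len(p1) - 1)
    child1 = p1[:point] + p2[point:]
    child2 = p2[:point] + p1[point:]
    return child1, child2
```

Under these assumptions the two children exchange lengths: the child that starts with the head of parent 1 inherits the longer tail of parent 2, so the number of rules in a chromosome can grow or shrink from one generation to the next.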

Mutation occurs on only a few individuals. Each gene in each chromosome is checked for possible mutation by generating a random number between zero and one. If this number is less than or equal to the given mutation probability, i.e., 0.01, then the gene value is changed. Mutations create diversity to search in domain regions that may otherwise be excluded.
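The per-gene mutation test can be sketched as follows in Python. This is an illustration, not the implementation used in this work. How a mutated gene receives its new value is not specified in the text; drawing it from a list of allowed values per gene position (`gene_values`, a hypothetical parameter) is an assumption.

```python
import random


def mutate_genes(genes, gene_values, rate=0.01, rng=random):
    # gene_values[i] lists the values allowed at position i (assumption).
    mutated = list(genes)
    for i in range(len(mutated)):
        # Draw a number in [0, 1) and mutate if it does not exceed the rate.
        if rng.random() <= rate:
            mutated[i] = rng.choice(gene_values[i])
    return mutated
```

Note that the replacement value is drawn at random and may coincide with the old one, so the effective rate of change can be slightly lower than the nominal mutation probability.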

5 Implementation and results

The GA with VLCs is implemented in Java (JDK 6). The development environment used is NetBeans 7.0. The GA is applied to a selected subset of the DARPA 1998 data.

The implementation is done in two phases. In the first phase, the classification rules are generated using the GA. The number of rules in a chromosome is also determined by the GA. Normally, when generating genes, a range of values is defined for each gene and each gene is then generated randomly within that range. We have instead used an enumeration technique to determine the value of each gene[21]: each gene value occurring in the data set is listed in an ordered fashion, and each gene value is then chosen randomly from these listed sets. An effective fitness function is used to calculate the fitness of the chromosomes. After experimentation, the GA parameters selected were k = 0.8, 2000 generations, a population of 60, a crossover rate of 0.5, one-point crossover and a mutation rate of 0.01.
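The enumeration technique can be illustrated with a short Python sketch. It is a simplified reading of the description, not the implementation used in this work; it assumes that the records are equal-length tuples of comparable values, one column per gene, and the function names are hypothetical.

```python
import random


def enumerate_gene_values(records):
    # For every column, list the distinct values seen in the data, in order.
    columns = zip(*records)
    return [sorted(set(column)) for column in columns]


def random_rule(gene_values, rng=random):
    # Draw each gene from its enumerated list rather than from a numeric range.
    return [rng.choice(values) for values in gene_values]
```

Because genes can only take values that actually occur in the data set, the search never spends time on, for example, port numbers that appear in no connection record, which is how the enumeration reduces the search space.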

The GA parameters used by Gong et al.[7] were w1 = 0.2, w2 = 0.8, 5000 generations, 500 initial rules in the population, a crossover rate of 0.5, two-point crossover and a mutation rate of 0.02.

In the presented approach, the maximum number of rules in a chromosome is taken to be 15. The appropriate number of rules is identified by GA.

After the classification rules have been generated in the first phase, the fittest rule is taken for detection purposes. In the second phase, this rule is used to classify both the training and the testing data sets.

We have implemented Gong et al.’s approach[7], and the results obtained are compared with the proposed GA-VLC approach as shown in Table 2.

Table 2 Detection rate comparison between the proposed approach and Gong et al.’s approach[7]

The evaluation is done using the 10-fold cross-validation method. In this method, the data set is partitioned into ten parts of equal size; nine of them are used at a time for training and the remaining one is used for testing. The process is repeated ten times, each time with a different part used as test data. The most important statistic to collect for each algorithm on each data set is the mean of the classification accuracies over the ten runs.
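The 10-fold procedure can be sketched in Python as follows. This is a generic illustration, not the evaluation code used in this work. The shuffling with a fixed seed and the `train_and_score` callback, which stands for training a classifier on the training part and returning its accuracy on the test part, are assumptions.

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    # Shuffle record indices once, then deal them into k near-equal folds.
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(records, train_and_score, k=10, seed=0):
    # train_and_score(train, test) is a hypothetical callback returning accuracy.
    folds = k_fold_indices(len(records), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except the i-th contributes to the training set.
        train = [records[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [records[j] for j in test_idx]
        scores.append(train_and_score(train, test))
    # Return the mean accuracy together with the per-fold accuracies.
    return sum(scores) / k, scores
```

The per-fold scores are returned alongside the mean because a paired comparison of two algorithms needs the accuracy of each algorithm on every fold, not only the averages.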

Although 10-fold cross-validation gives some insight into algorithm performance, the difference is so small that conclusions cannot be drawn objectively. Hence, a statistical test is conducted. As the input data is normally distributed, a small-sample paired t-test is conducted using MINITAB software. In this test, the measures of algorithm performance on every fold are taken as input. We observe that the P-value (0.000) is less than the alpha (α) level (5%). We therefore reject the null hypothesis: the difference is greater than zero (positive), i.e., there is a significant difference in the detection rates of the proposed GA-VLC approach and Gong et al.’s approach[7].
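For readers who wish to reproduce this kind of comparison without MINITAB, the paired t statistic can be computed from the per-fold accuracies as sketched below in Python. This is a generic textbook formula, not the procedure output reported in this work; the function name is hypothetical, and the P-value must still be obtained from the t distribution with n − 1 degrees of freedom using a statistics package or table.

```python
import math


def paired_t_statistic(scores_a, scores_b):
    # t = mean(d) / (sd(d) / sqrt(n)), where d are the per-fold differences.
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("differences are constant; t is undefined")
    # Return the statistic and its degrees of freedom.
    return mean_d / math.sqrt(var_d / n), n - 1
```

With ten folds the test has nine degrees of freedom; a large positive statistic indicates that the first algorithm is consistently more accurate than the second across folds.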

As a GA run progresses, the accuracy of intrusion detection generally improves until the maximum accuracy is obtained. A GA may, however, get stuck in a local maximum unless its parameters are properly set. Table 3 shows the percentage detection for different numbers of GA generations for a population size of 500 using Gong et al.’s approach.

Table 3 Number of generations against detection accuracy (Using Gong et al.’s approach[7])

Table 4 shows the percentage detection for different numbers of GA generations for the population size of 60 using GA-VLC approach.

Table 4 Number of generations against detection accuracy (Using GA-VLC approach)

As shown in Fig. 4, the detection rate improves as the number of generations is increased. In Gong et al.’s approach[7], good results are obtained after 5000 generations. In the GA-VLC approach, the best results are achieved after only 2000 generations.

Fig. 4 Effect of generations on detection accuracy

Further, the GA-VLC results are compared with the results of various algorithms used for building decision trees, such as GATree, J48 and CART. For the GATree implementation, the GATree software[22] is used. The J48 and simple CART algorithms are run in the open source software Weka[23]. The experiments are done on the 10% KDD Cup 1999 data[24]. For all experiments, the 10-fold cross-validation technique is used. The results obtained with the decision tree algorithms are compared with those of the proposed GA-VLC algorithm, as shown in Table 5.

Table 5 Comparison of proposed GA-VLC algorithm results with decision tree algorithm results

Table 6 compares the detection rate of proposed GA-VLC approach with the detection rate of other approaches.

Table 6 Detection rate comparison of proposed approach with other approaches

Hu et al.[25] proposed online Adaboost-based intrusion detection algorithms, in which decision stumps and online Gaussian mixture models (GMMs) were used as weak classifiers for the traditional online Adaboost and the proposed online Adaboost. They give detection rates of 90.13% and 91.15%. Lu et al.[26] proposed an integrated fuzzy GNP rule mining with distance-based classification, which yielded a 97.54% detection rate. Cheng et al.[27] proposed a basic extreme learning machine (ELM) method based on random features and a kernel-based ELM method for classification. Using the kernel-based ELM, a good detection rate of 98.81% is achieved. Altwaijry[28] developed an intrusion detection system based on Bayesian probability. The Bayesian classifier was able to detect intrusions with a detection rate of 99.36%. Li et al.[29] proposed an efficient intrusion detection system based on support vector machines and a gradual feature removal method. It achieves 98.62% detection accuracy.

6 Conclusions

In this paper, an effective GA-based technique that uses VLCs is presented for intrusion detection. An enumeration technique is used within the genetic algorithm framework for the generation of classification rules. This reduces the search space and provides a good speed-up.

In the presented approach, the maximum number of rules in a chromosome is taken to be 15. The appropriate number of rules is identified by the GA. In Gong et al.’s approach, the top 20 best quality rules were taken as the final classification rules. The number of rules used in the presented approach is therefore smaller, which reduces the computational time.

The results presented in Table 2 show that the detection rate obtained by the proposed GA-VLC approach is better than that of Gong et al.’s approach[7].

As the number of generations is increased, the detection rate improves. In Gong et al.’s approach[7], the best results are obtained after 5000 generations, whereas in the GA-VLC approach, the best results are achieved after only 2000 generations. As the computational time is directly proportional to the number of generations, substantial time is saved using the GA-VLC approach.

As presented in Table 5, the classification accuracy obtained with the J48 and CART decision trees is extremely good. The classification accuracy obtained with the proposed algorithm is slightly better than that of the GATree algorithm.

From the experimental results, it is evident that the proposed technique is effective in network intrusion detection, because it provides better results than Gong et al.’s approach[7] even while using a smaller number of classification rules.

From Table 6, it is evident that the results obtained using the proposed GA-VLC approach are comparable with those of the other approaches.