The goal of employing a genetic algorithm (GA) is to generate both an optimal set of trusted packet sequences and a set of untrusted ones: (1) packet sequences that are learned to be trusted are quickly permitted, while (2) those that are learned to be untrusted are denied in advance. Ideally, our GA approach forecasts at T2 any vulnerability at T3, T5 and T7. Similarly, GA results at T4 make it possible to skip or shorten the processes T5 and T7, and so on.
This section therefore takes sequences of packets into consideration. A packet sequence PS is defined over packets \( p_{i} \) as follows:
$$ PS = p_{1} ,p_{2} , \ldots ,p_{n} $$
(2)
For example, in Motivating Example 2, the five packets constitute a packet sequence, PS = p1, p2, p3, p4, p5. The order within a packet sequence is very important in many cases. For example, control packets such as SYN and ACK precede the TCP or UDP packets, and a FIN packet comes last. However, the packets in PS need not all be related to one another. For analysis purposes, PS can simply be the sequence of packets observed in a unit time period.
The remainder of this section characterizes the fitness table and the GA operations over network traffic packets.
4.1 Fitness Table
The parameters of the fitness function span three dimensions: (1) geo-temporal factors, (2) port-relevant factors, and (3) user-relevant factors. The fitness table must represent the trustworthiness of network packets. It is well known that the trustworthiness of source IPs is determined predominantly by country [6]. The trustworthiness of source or destination ports is determined in part by convention. For example, port numbers are designated to specific applications and protocols according to Network Sorcery [26]. Any port access that does not follow these designations is likely to be an attack. The trustworthiness of users is determined in part by their privileges. For example, accesses with the super-user role are likely to be attacks if the other factors are also met.
These three factor dimensions determine the fitness table, which is illustrated in Fig. 6. In the cube, each cell holds a value between −1 and +1 for a combination of three factors, one from each dimension, which is denoted by
$$ -1 \le cell\left( {g,t,u} \right) \le 1 $$
(3)
where g, t and u denote geo-temporal factors, port-relevant factors, and user-relevant factors, respectively.
For a given packet sequence P consisting of n packets, the fitness function proposed in this paper is:
$$ F\left( P \right) = \sum\limits_{i = 1}^{n} {\omega_{i} cell_{i} \left( {g,t,u} \right)} $$
(4)
where \( \omega_{i} \) denotes the weight for the i-th cell value, \( 0 \le \omega_{i} \le 1 \). Hence, provided the weights are normalized so that \( \sum_{i = 1}^{n} \omega_{i} \le 1 \), \( -1 \le F\left( P \right) \le 1 \).
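As a minimal sketch of Eqs. (3) and (4), the fitness cube can be held as a mapping from factor triples to cell values. The factor categories, the cell values, the `Packet` type and the default uniform weights 1/n below are illustrative assumptions, not values taken from this paper; the check that the weights sum to at most 1 is likewise an assumption, added so that the result stays within [−1, 1].

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, str]  # (g, t, u)


@dataclass(frozen=True)
class Packet:
    g: str  # geo-temporal factor
    t: str  # port-relevant factor
    u: str  # user-relevant factor


# Illustrative cube entries only; every value must lie in [-1, 1] (Eq. 3).
CUBE: Dict[Cell, float] = {
    ("domestic_day", "conventional_port", "regular_user"): 0.8,
    ("domestic_day", "unconventional_port", "regular_user"): -0.2,
    ("foreign_night", "unconventional_port", "super_user"): -0.9,
}


def fitness(packets: List[Packet], cube: Dict[Cell, float],
            weights: Optional[List[float]] = None) -> float:
    """Weighted sum of cell values over a packet sequence (Eq. 4)."""
    n = len(packets)
    if n == 0:
        raise ValueError("empty packet sequence")
    if weights is None:
        weights = [1.0 / n] * n  # assumed default: uniform weights
    if len(weights) != n:
        raise ValueError("one weight per packet is required")
    if any(not 0.0 <= w <= 1.0 for w in weights) or sum(weights) > 1.0 + 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to at most 1")
    return sum(w * cube[(p.g, p.t, p.u)] for w, p in zip(weights, packets))


# Usage: one trusted-looking and one untrusted-looking packet.
ps = [
    Packet("domestic_day", "conventional_port", "regular_user"),
    Packet("foreign_night", "unconventional_port", "super_user"),
]
score = fitness(ps, CUBE)  # approximately -0.05 with uniform weights
```

With uniform weights the two packets nearly cancel out, which corresponds to a sequence whose fitness is still close to 0.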
4.2 GA Operations
As illustrated in Fig. 3, for any given pair of packet sequences, PSi and PSj, which contain n and m packets respectively, three operations are performed: selection, crossover and mutation. Crossover and selection are performed far more frequently than mutation: mutation takes place at a rate of one to five percent of the frequency of selection or crossover.
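The three operations can be sketched as follows. This is a simplified illustration under stated assumptions rather than the implementation used in this paper: each packet is represented directly by its cell value, fitness is the mean of those values, selection is a two-way tournament, and crossover is single-point. The function names and the default mutation rate of 3% (chosen from the one-to-five-percent range above) are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Sequence = List[float]  # assumption: a packet is represented by its cell value


def mean_fitness(ps: Sequence) -> float:
    """Uniformly weighted fitness of a sequence of cell values."""
    return sum(ps) / len(ps)


def select(population: List[Sequence], fitness_fn: Callable[[Sequence], float],
           rng: random.Random, k: int = 2) -> Sequence:
    """Tournament selection (assumed scheme): fittest of k random candidates."""
    return max(rng.sample(population, k), key=fitness_fn)


def crossover(ps_i: Sequence, ps_j: Sequence,
              rng: random.Random) -> Tuple[Sequence, Sequence]:
    """Single-point crossover of sequences with n and m packets (n, m >= 2)."""
    cut_i = rng.randint(1, len(ps_i) - 1)
    cut_j = rng.randint(1, len(ps_j) - 1)
    return ps_i[:cut_i] + ps_j[cut_j:], ps_j[:cut_j] + ps_i[cut_i:]


def mutate(ps: Sequence, alphabet: Sequence, rng: random.Random) -> Sequence:
    """Replace one randomly chosen packet with a random one from alphabet."""
    k = rng.randrange(len(ps))
    return ps[:k] + [rng.choice(alphabet)] + ps[k + 1:]


def next_generation(population: List[Sequence],
                    fitness_fn: Callable[[Sequence], float],
                    alphabet: Sequence, rng: random.Random,
                    mutation_rate: float = 0.03) -> List[Sequence]:
    """One GA iteration; mutation fires far less often than crossover."""
    children: List[Sequence] = []
    while len(children) < len(population):
        a = select(population, fitness_fn, rng)
        b = select(population, fitness_fn, rng)
        for child in crossover(a, b, rng):
            if rng.random() < mutation_rate:
                child = mutate(child, alphabet, rng)
            children.append(child)
    return children[:len(population)]
```

Because each crossover keeps at least one packet from each parent, every child contains at least two packets and can itself be crossed over in the next iteration.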
One of the goals of using GA approaches is to learn the dichotomy of packet sequences. An example of packet sequence dichotomy is shown in Fig. 7(c) for three packet sequences, each consisting of 10 packet frames. The initial fitness of these packet sequences is close to 0, which means that they are neither trusted (fitness 1) nor untrusted (fitness −1). As the combination of GA operators is iterated multiple times, the fitness of each sequence converges to either 1 or −1. Some sequences converge early, depending on the values in the fitness cube.
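The dichotomy can be read off a fitness trace with a small helper such as the one below. It is only a sketch: the decision threshold of 0.9, the label names and the function names are assumptions introduced for illustration, since this section does not state how close to ±1 a sequence must be before it is considered decided.

```python
from typing import List, Optional


def classify(fitness_value: float, threshold: float = 0.9) -> str:
    """Label a fitness value; the threshold is an assumed cut-off."""
    if not -1.0 <= fitness_value <= 1.0:
        raise ValueError("fitness must lie in [-1, 1]")
    if fitness_value >= threshold:
        return "trusted"
    if fitness_value <= -threshold:
        return "untrusted"
    return "undecided"


def first_decided(history: List[float], threshold: float = 0.9) -> Optional[int]:
    """Index of the first iteration at which a sequence is decided, if any."""
    for iteration, value in enumerate(history):
        if classify(value, threshold) != "undecided":
            return iteration
    return None
```

Applied to the per-iteration fitness values of a sequence, `first_decided` indicates how early that sequence converges, and `classify` on the final value gives its side of the dichotomy.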