Introduction

Cybercrime cases have accelerated in recent years [1], involving malware attacks and various illegal software applications. Such attacks are often carried out by a botnet, which runs malware to perform malicious activities such as phishing, Distributed Denial of Service (DDoS), and spamming [2,3,4,5,6]. A botnet consists of bots, a collection of malware-infected machines that communicate and cooperate to carry out attacks controlled by the botmaster [7,8,9,10]. In practice, an attack starts by infecting devices to form a bot network before reaching the target machine [11]. This infection process runs covertly [12, 13], which can make it difficult to detect. The infected bots are connected via the Command and Control (C&C) service, which the botmaster [9, 14] uses to send commands, exchange information, and update the botnet control system code, affecting big data [15, 16].

Botnet architecture has evolved from centralized to decentralized [17,18,19]. Centralized botnets connect and communicate directly with the botmaster via a C&C channel, such as IRC or HTTP [4, 20], and are typically countered by identifying botmaster activity on C&C servers [14, 21]. When the botmaster's communication access is blocked, none of the bots in the network can attack the target. In a decentralized structure, on the other hand, each bot client can dynamically become a botmaster by utilizing the peer-to-peer (P2P) model [22, 23], making the botnet more difficult to detect [17, 23].

Botnets have unique attack characteristics that differ from other attack types. Based on their attack period and intensity, botnets can be categorized as sporadic, periodic, or simultaneous [24,25,26]. Sporadic and periodic botnets use a single botnet type, such as Neris, Rbot, or NSIS.ay, to attack the target. They differ in their attack period [26, 27]: periodic botnets launch systematic attacks that repeat at regular intervals [25,26,27], which can be seen from the peaks of attack activity spread over time. Sporadic botnets, meanwhile, tend to choose their attack times more randomly [21, 28], concentrating on a particular time without a regular pattern, so that their attacks tend to peak at a specific time in a scenario [21, 27]. Unlike those two groups, simultaneous botnets involve many botnet types attacking in parallel, with consistently high intensity from their inception [27].

Previous studies have introduced botnet detection models based on several approaches, such as mining-based [10, 15], behavior analysis-based [29], signature-based [13, 30], deep learning [6, 7, 20, 31], graph-based [10, 18, 32,33,34], and anomaly-based [21, 35] detection. In addition, optimization has been applied in the pre-processing stage to improve detection accuracy, such as time-based segmentation for activity analysis [16, 36] and optimized feature selection [37, 38]. As a result, these models can more accurately distinguish bot (attack) from normal (non-attack) activities.

Some studies analyze activity scenario patterns [39], detect P2P bots [23, 40, 41], analyze the correlation of botnet group activity [32], or analyze communication patterns within botnets [17]. However, they still focus on detecting particular attacks, whereas in practice many attacks combine sporadic, periodic, and simultaneous behavior. Therefore, an appropriate and accurate model that detects attack activity based on its characteristics is essential.

This research proposes Botnet Characteristic Attack Type (B-CAT), a new model that detects botnet activities and recognizes whether their attack characteristics are periodic, sporadic, or simultaneous, using several analytical approaches based on knowledge base extraction. It aims to detect and recognize botnet attack characteristics from network traffic data. Sequentially linked activities are obtained by mining the network traffic, taking into account the source and destination parameters and the time distance between traffic records. The knowledge base is extracted from three botnet datasets with different characteristics. Each incoming network traffic record that passes the sequential traffic mining process is then analyzed based on its similarity to the existing patterns in the knowledge base.

The detection results give network security administrators more information for handling and anticipating botnet attacks appropriately, based on their characteristics. The novelties of this research can be summarized as follows:

  • Detecting and recognizing bot attack characteristics by combining sequential traffic mining, similarity analysis, and dynamic threshold analysis.

  • Developing a knowledge base extraction mechanism of botnet activity to determine the characteristics of an actual botnet based on the sporadic, simultaneous, and periodic activity category on large network traffic data types.

This study is structured as follows: “Introduction” section explains the botnet background and basic concepts; “Related works” section describes related works; the details of the proposed model are in “Proposed method” section; the evaluation results and discussion are provided in “Result and discussion” section; finally, “Conclusions” section concludes the research and depicts possible future works.

Related works

Botnet activity detection models have been developed [7, 16, 18, 23, 42] that consider attack characteristics, namely sporadic [21, 24], periodic [16, 25], and simultaneous [21, 27]. The experiments in those studies used the corresponding public datasets for periodic [25], simultaneous [27], and sporadic [28] attacks, which can be either single [7, 42] or group bot attacks [16, 24, 32]. In a bot group activity [32], which is performed by bots sharing a similar characteristic pattern [18], the activities are causally related to each other [16, 17, 39] and form patterned attack stages [17, 39].

Improving the detection model performance involves optimizing the recognition of the attack characteristics, such as optimizing the pre-processing in the feature selection [17, 30, 38] and feature extraction [18, 23]. Some approaches have been implemented, such as machine learning [10, 42, 43] and deep learning [7, 20, 30]. However, none of these detection models focus on recognizing the characteristics of botnet attacks.

On the other hand, recognizing the characteristics of a botnet requires a dataset that can serve as an attack benchmark. Several studies have simulated botnet attacks as a basis for research on botnet characteristics [28, 44]. Garcia et al. [28] simulated botnet attacks by executing seven types of malware: Neris, Rbot, Virut, Menti, Sogou, Murlo, and NSIS.ay. The randomly executed attacks produced records of different attack activities, such as spam, Distributed Denial of Service (DDoS), click fraud, and others. Similarly, Alomari et al. [44] simulated an HTTP botnet that carried out a DDoS attack, in which flooding attacks with the HTTP-GET method were executed by two botnet types: Black Energy v.1.8 bot and Zemra bot. The simulation, built from 40 VMware zombies, one Command and Control (C&C) server, and a target, succeeded in recording botnet attacks in real time.

Chowdhury et al. [18] introduced a clustering-based detection model in which each bot's activity is represented by seven features: in-degree, out-degree, in-degree weight, out-degree weight, clustering coefficient, node betweenness centrality, and eigenvector centrality. These features are used to construct clusters with a Self-Organizing Map (SOM), and inactive nodes are filtered out to improve detection performance. Their experiment shows that the method detects botnets effectively. Nevertheless, detailed false positive values are not provided, and the method can only detect sporadic attacks.

Dollah et al. [42] use machine learning classification to build the detection system. The bot attack characteristics are extracted from the HTTP traffic and DNS queries of five botnets known to generate malicious network traffic: Dorkbot, Zeus, Citadel, SpyEye, and Cutwail. In the training process, traffic is labeled 1 for malicious and 0 for non-malicious. In the pre-processing stage, the data are cleaned of meaningless noise by ignoring the IP address and port number. Detection uses four machine learning algorithms: Decision Tree, \(k\)-Nearest Neighbor (\(k\)-NN), Naïve Bayes (NB), and Random Forest (RF). The experimental results show that the \(k\)-NN algorithm produces the best botnet detection. However, this study does not explicitly detect the attack type; it only reports the presence of malicious bot activity in the network traffic.

A hybrid detection technique, which combines several approaches, has been introduced [7, 16, 23] to optimize the detection results. Khan et al. [23] focus on detecting P2P botnets through a multi-layer concept based on machine learning. Each layer in this detection model has a specific objective, such as reducing data traffic, which addresses limited device resources and computational efficiency. In the second layer, network traffic is filtered and classified into two classes: P2P and non-P2P. Each traffic class is then analyzed based on data packets, data streams, and sessions. After the two classes are obtained, the process continues in the third layer with feature extraction and feature reduction. This layer focuses on obtaining the best features for the detection model, which uses a Decision Tree algorithm as the classifier. The model shows relatively good detection accuracy. However, more information about other botnet attack types is needed, as sporadic botnet attacks do not occur only in P2P communication.

A fuzzy logic method combined with hierarchical Artificial Neural Networks (ANN) has also been considered [7], generating features from traffic data and analyzing them using feature participation. The experiment shows that the model can detect the fourth hidden layer in the CTU-13 dataset. Similar to the previous research, it can only recognize sporadic attacks, whereas botnet attacks in computer networks can also be periodic, repeated, and simultaneous.

Hostiadi and Ahmad [16] investigate the bot group activity, adopting correlation analysis to get the causality of each bot attack based on the activity time. Sliding window-based time segmentation analysis is used to obtain the transition of attack activity in the analyzed period. In the segmentation stage, the system searches for frequent host activities in network traffic and extracts them to obtain periodic and intense patterns. It measures the activity similarity in each time segment using the Mean Absolute Error (MAE) and the causality analysis using the Pearson correlation. The research can detect the causality based on the time of attack occurrence and the analysis of forward and backward correlation. Furthermore, the system can detect the characteristics of sporadic and periodic botnet attacks. However, similar to other studies, it has not been able to detect coinciding or simultaneous botnet attacks.

Botnet detection using a signature-based or rule-based approach offers relatively good speed and accuracy. Alzahrani and Ghorbani [45] proposed a signature-based method to detect Short Message Service (SMS) botnet activity, in which SMS is used as a medium to carry out infections and attacks or to share other malicious applications. The proposed method uses pattern matching and rule-based techniques to label incoming or outgoing SMS as suspicious or normal. The evaluation results show that the model performs well in recognizing malicious SMS. A rule-based botnet detection model was also introduced by Alieyan et al. [46] to detect Domain Name System (DNS) queries and responses that are considered abnormal. The model has three stages: network traffic filtration and data cleansing, feature analysis, and DNS rule-based abnormality detection. The evaluation results show that it detects DNS botnets better than the other methods. Ayo et al. [47] introduced an approach to improve the performance of Fast Flux Botnet (FFB) detection. The problems of detection speed and large feature dimensions are addressed using the Genetic Algorithm (GA) and \(k\)-Nearest Neighbor (\(k\)-NN) approaches. The proposed model comprises four main modules: extractor, filter, resolver, and detector. This approach builds decision tree rules and K-dimensional (KD) trees from the \(k\)-NN algorithm to classify DNS traffic. The evaluation results show that the proposed method can improve the performance of FFB detection.

Some previous studies have considerable detection performance, but they do not recognize the characteristics of each attack, which is crucial to overcoming the attack. Bot attacks can occur in real-time on computer networks simultaneously with various characteristics and resemble specific attack patterns. Therefore, characteristic recognition is required to determine the proper response to botnet attacks on computer networks.

Proposed method

This paper proposes a detection model to detect botnet activity through similarity analysis of sequential activity. The detection process consists of several processes, from traffic splitting to building a knowledge base, sequential activity mining, repetitive sequential activity analysis, sequential activity transformation, similarity analysis, and activity measurement. The entire detection process and flowchart of the proposed method are presented in Figs. 1 and 2, respectively.

Fig. 1 Proposed method

Fig. 2 Flowchart of the proposed method

Traffic splitting

The traffic is divided into botnet and normal traffic by analyzing the \(Label\) feature in the dataset; the botnet portion is later adopted as the knowledge base for botnet attack activities. The split is performed with a regular expression (RegEx), a notation for defining text patterns to search for. In the traffic splitting phase, the RegEx finds network traffic whose \(Label\) feature contains the keyword ‘botnet’. First, the \(Label\) value is converted to lowercase. If the lowercase \(Label\) value matches the predefined pattern, the traffic is grouped into the botnet attack category; otherwise, it is grouped into the normal activity category. Only the botnet attack group is carried forward to the next stage. This traffic-splitting process is illustrated in Fig. 3.

Fig. 3 Traffic splitting
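The splitting step can be sketched as follows. This is a minimal illustration, assuming each flow record is a dictionary with a `Label` key; the exact pattern (`botnet`), the function name `split_traffic`, and the sample records are hypothetical and are not taken from the authors' implementation.

```python
import re

# Assumed pattern: the text only states that the keyword 'botnet' is
# searched for in the lowercased Label value.
BOTNET_PATTERN = re.compile(r"botnet")


def split_traffic(records):
    """Split flow records into (botnet, normal) lists based on the Label feature."""
    botnet, normal = [], []
    for rec in records:
        label = str(rec.get("Label", "")).lower()  # lowercase before matching
        if BOTNET_PATTERN.search(label):
            botnet.append(rec)
        else:
            normal.append(rec)
    return botnet, normal


# Made-up sample records, for illustration only.
sample = [
    {"SrcAddr": "10.0.0.1", "Label": "flow=From-Botnet-V48-UDP-DNS"},
    {"SrcAddr": "10.0.0.2", "Label": "flow=Background-TCP-Established"},
    {"SrcAddr": "10.0.0.3", "Label": "flow=From-Normal-V48-Stribrek"},
]
botnet_flows, normal_flows = split_traffic(sample)
```

Only `botnet_flows` would be passed on to the sequential traffic mining stage.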

Sequential traffic mining

Botnet data separated in the traffic splitting stage carries network header information called features. Each network traffic record (\(Trf\)) has fifteen features: StartTime, Dur, Proto, SrcAddr, Sport, Dir, DstAddr, Dport, State, sTos, dTos, TotPkts, TotBytes, SrcBytes, and Label. Thus, a traffic record is denoted as \(Trf=\left\{ StartTime, Dur, Proto, SrcAddr,\dots , Label\right\}\). Each traffic record in one activity period can be related sequentially to the activity in the next period [39]. Incoming network traffic records first go through \(SrcAddr\) and \(DstAddr\) analysis to form pairs. The pair formed by the \(SrcAddr\) feature of \(Trf\) (\({Trf}^{SrcAddr}\)) and the \(DstAddr\) feature of \(Trf\) (\({Trf}^{DstAddr}\)) is called the communication ID, denoted as \(c\). In other words, the set of pairs in a dataset is \(C=\{{c}_{1},{c}_{2}, \dots , {c}_{m}\}\) with \(c \in C\), where \(m\) is the number of communication IDs between \(SrcAddr\) and \(DstAddr\) in the botnet data. Furthermore, all \(Trf\) with the same pair are grouped and ordered by their occurrence time, given by the \(StartTime\) feature. Such a time-ordered group of activities is called Sequential Traffic (\(SeqA\)), so that \(SeqA= [{Trf}_{1},{Trf}_{2}, \dots , {Trf}_{n}]\), where \(n\) is the number of \(Trf\) in the \(SeqA\). In addition to sharing the same \(c\), a \(SeqA\) must satisfy the requirement that the time interval between the occurrence of \({Trf}_{1}\) and that of \({Trf}_{n}\) is less than or equal to 60 min, which is called the segment period (\(\partial\)). The value of \(\partial\) is based on the time required for one botnet attack cycle, which is 60 min [48]. Sequential activity in different segment periods can have different or similar sequence patterns. If a \(Trf\) has the same pair \(c\) but its time interval from \({Trf}_{1}\) in the current \(SeqA\) is greater than \(\partial\), a new \(SeqA\) is formed from that \(Trf\) and those that follow. The sequential activity analysis continues until the traffic records in the last segment period have been grouped into sequential activities. Sequential traffic mining is shown in Algorithm 1.

Algorithm 1 Sequential Activity Mining
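The grouping rule described in this section can be sketched in Python as follows. This is an illustrative reading of the text rather than a reproduction of Algorithm 1: records are assumed to be dictionaries whose `StartTime` is a `datetime`, and the function name `mine_sequences` and the sample flows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SEGMENT_PERIOD = timedelta(minutes=60)  # segment period stated in the text


def mine_sequences(records, segment_period=SEGMENT_PERIOD):
    """Group records by (SrcAddr, DstAddr) and cut each group into SeqA lists."""
    by_pair = defaultdict(list)
    for rec in records:
        by_pair[(rec["SrcAddr"], rec["DstAddr"])].append(rec)

    sequences = []
    for flows in by_pair.values():
        ordered = sorted(flows, key=lambda r: r["StartTime"])
        current = []
        for rec in ordered:
            # Start a new SeqA once the gap to the first record exceeds the period.
            if current and rec["StartTime"] - current[0]["StartTime"] > segment_period:
                sequences.append(current)
                current = []
            current.append(rec)
        sequences.append(current)
    return sequences


# Made-up flows: three between one pair (the last one 70 min after the first)
# and one flow between another pair.
t0 = datetime(2011, 8, 10, 9, 0)
flows = [
    {"SrcAddr": "10.0.0.5", "DstAddr": "10.0.0.9", "StartTime": t0},
    {"SrcAddr": "10.0.0.5", "DstAddr": "10.0.0.9", "StartTime": t0 + timedelta(minutes=10)},
    {"SrcAddr": "10.0.0.5", "DstAddr": "10.0.0.7", "StartTime": t0 + timedelta(minutes=5)},
    {"SrcAddr": "10.0.0.5", "DstAddr": "10.0.0.9", "StartTime": t0 + timedelta(minutes=70)},
]
seq_lengths = [len(seq) for seq in mine_sequences(flows)]
```

In this sketch the flow that arrives 70 min after the first record of its pair opens a new sequence, so the first pair yields two sequences and the second pair yields one.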

Algorithm 2 Repetitive Sequential Activity Analysis

Repetitive sequential traffic analysis

This stage analyzes the \(Label\) feature. First, the \(Label\) feature of each \(Trf\) in a \(SeqA\) is extracted to obtain its pattern. For example, given \(SeqA=[{Trf}_{g},{Trf}_{h},{Trf}_{i}]\), where \(g<h<i\) are indices of \(Trf\) in \(SeqA\), with \({Trf}_{g}^{Label}=\)“flow = From-Botnet-V48-UDP-DNS”, \({Trf}_{h}^{Label}=\)“flow = From-Botnet-V48-TCP-WEB-Established”, and \({Trf}_{i}^{Label}=\)“flow = From-Botnet-V48-TCP-Attempt”, the \(SeqA\) is expressed as [“flow = From-Botnet-V48-UDP-DNS”, “flow = From-Botnet-V48-TCP-WEB-Established”, “flow = From-Botnet-V48-TCP-Attempt”]. The proposed system then searches \(kb\) for a similar \(SeqA\) pattern. If no similar pattern exists, the \(SeqA\) is stored in \(kb\) as part of the knowledge base; otherwise, it is regarded as Repetitive Sequential Traffic and ignored. Algorithm 2 presents the repetitive sequential traffic analysis that forms \(kb\), a set of \(SeqA\) denoted as \(kb=\left\{{SeqA}_{1},{SeqA}_{2},\dots, {SeqA}_{o}\right\}\), where \(o\) is the number of \(SeqA\) elements in \(kb\).
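A minimal sketch of this de-duplication step is shown below. It assumes that "similar" means an exact match of the ordered \(Label\) values, which is a simplifying assumption; the function name `build_knowledge_base` and the sample sequences are hypothetical.

```python
def build_knowledge_base(sequences):
    """Keep the first occurrence of every distinct Label pattern.

    Assumption: two SeqA are treated as repetitive when their ordered
    Label values are identical.
    """
    kb, seen = [], set()
    for seq in sequences:
        pattern = tuple(rec["Label"] for rec in seq)
        if pattern in seen:
            continue  # repetitive sequential traffic is ignored
        seen.add(pattern)
        kb.append(list(pattern))
    return kb


# Made-up sequences: the first two share the same pattern.
dns = {"Label": "flow=From-Botnet-V48-UDP-DNS"}
web = {"Label": "flow=From-Botnet-V48-TCP-WEB-Established"}
att = {"Label": "flow=From-Botnet-V48-TCP-Attempt"}
kb = build_knowledge_base([[dns, web, att], [dns, web, att], [dns, att]])
```

Here the second sequence repeats the first and is dropped, so `kb` keeps two patterns.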

Feature analysis

At this stage, two features of each \(Trf\) in a \(SeqA\) are analyzed: \(StartTime\) and \(Label\). The analysis begins by deriving a new feature called \(Diff\) from \(StartTime\) using Eq. (1); for \({Trf}_{i}\) with \(i=0\), \(Diff\) is 0. The \(Label\) feature is then processed to obtain a description of the activity of each \(Trf\) in \(SeqA\). For example, if a \(Label\) contains “flow = From-Botnet-V48-TCP-WEB-Established”, the parts “flow = From”, “V48”, and “Botnet” are deleted; the “flow = From” part, for instance, is removed because it is a meaningless series of words. The remaining string, “TCP-WEB-Established”, is stored in a new feature called \(NetworkActivity\). This feature extraction process is illustrated in Fig. 4.

$$Trf_{i}^{Diff} = \left\{ {\begin{array}{*{20}l} {0;} & {if\;i = 0} \\ {Trf_{j}^{StartTime} - Trf_{i}^{StartTime} ;} & {if\;i \ne 0} \\ \end{array} } \right.$$
(1)

where \(j = i+1\); \({Trf}_{i}\) and \({Trf}_{j}\) in the same \(SeqA\)

Fig. 4 Feature analysis (extraction)
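The two derived features can be illustrated with the sketch below. Several points are assumptions: \(Diff\) is taken as the gap in seconds to the preceding record of the same \(SeqA\) (one reading of Eq. (1), with the first record set to 0), the label prefix is assumed to follow the form "flow=From-Botnet-V<number>-", and the names `network_activity` and `add_features` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed label layout: "flow=From-Botnet-V<digits>-<activity>", with optional
# spaces around "=" as they appear in the running text.
_PREFIX = re.compile(r"^flow\s*=\s*From-Botnet-V\d+-", re.IGNORECASE)


def network_activity(label):
    """Strip the non-descriptive prefix and keep the activity description."""
    return _PREFIX.sub("", label.strip())


def add_features(seq):
    """Return copies of the records with Diff and NetworkActivity added."""
    out = []
    for i, rec in enumerate(seq):
        if i == 0:
            diff = 0.0  # first record of a SeqA has no reference record
        else:
            # Assumption: gap to the preceding record, in seconds.
            diff = (rec["StartTime"] - seq[i - 1]["StartTime"]).total_seconds()
        out.append({**rec, "Diff": diff,
                    "NetworkActivity": network_activity(rec["Label"])})
    return out


# Made-up two-record sequence.
t0 = datetime(2011, 8, 10, 9, 0, 0)
seq = [
    {"StartTime": t0, "Label": "flow=From-Botnet-V48-UDP-DNS"},
    {"StartTime": t0 + timedelta(seconds=12),
     "Label": "flow = From-Botnet-V48-TCP-WEB-Established"},
]
features = add_features(seq)
```

With these inputs the second record receives the activity description "TCP-WEB-Established", matching the example in the text.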

Similarity analysis

The botnet activity detection process identifies botnet attack characteristics by measuring the similarity between the new traffic data, known as traffic testing data, and data patterns stored in \(kb\). It is to find how close \(SeqA\) is to \(kb\) with the categories of the periodic knowledge base (\({kb}^{per}\)), sporadic knowledge base (\({kb}^{spo}\)), and simultaneous knowledge base (\({kb}^{sim}\)).

The system scores the \(NetworkActivity\) similarity to identify suspected bot activity. Using Eq. (2), a score of 1 is given if the \(NetworkActivity\) of a \(Trf\) in the testing data equals the \(NetworkActivity\) at the same position in one of the \(kb\) types, and 0 otherwise. The similarity of a \(SeqA\) is the average of these scores; this process is called pattern similarity analysis (\(PSim\)) and is expressed in Eq. (3), where \(x = 0, 1, 2, \dots , n\) and \(n\) is the number of \(Trf\) in \(SeqA\) or the number of \(kbTrf\) in \(kbSeqA\) for a single characteristic.

The next step measures the time difference similarity (\(DSim\)) to identify the attack characteristics by comparing the \(Diff\) feature of each \(Trf\) in the testing data with that in \(kb\). \(DSim\) adopts the Mean Absolute Percentage Error (MAPE) calculation, as given in Eq. (4), and the overall similarity is calculated with Eq. (5) as the mean of \(PSim\) and \(DSim\). The similarity results of each \(SeqA\) are collected to obtain its highest value against each \(kb\) characteristic. This process is repeated until every \(SeqA\) has obtained its highest similarity to each \(kb\) characteristic. Algorithm 3 explains how to obtain the highest similarity for each characteristic (\({perSc}_{max}\), \({spoSc}_{max}\), \({simSc}_{max}\)).

$$SimNetA\left( {Trf_{x}^{NetAct} ,kbTrf_{x}^{NetAct} } \right) = \left\{ {\begin{array}{*{20}l} {1 ;} & { if \, Trf_{x}^{NetAct} = kbTrf_{x}^{NetAct} } \\ {0 ;} & { if \, Trf_{x}^{NetAct} \ne kbTrf_{x}^{NetAct} } \\ \end{array} } \right.$$
(2)
$$PSim\left( {SeqA,kbSeqA} \right) = \frac{1}{n} \mathop \sum \limits_{x = 0}^{n} SimNetA\left( {Trf_{x}^{NetAct} ,kbTrf_{x}^{NetAct} } \right)$$
(3)
$$DSim\left( {SeqA, kbSeqA} \right) = 1 - \left( { \frac{1}{n} \mathop \sum \limits_{x = 0}^{n} \left| {\frac{{Trf_{x}^{Diff} - kbTrf_{x}^{Diff} }}{{kbTrf_{x}^{Diff} }}} \right|} \right)$$
(4)
$$Sim\left( {SeqA, kbSeqA} \right) = \frac{1}{2}\left( {PSim\left( {SeqA, kbSeqA} \right) + DSim\left( {SeqA, kbSeqA} \right)} \right)$$
(5)
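Equations (2) to (5) can be illustrated with the following sketch. It is not the authors' Algorithm 3. Two points are assumptions: sequences are compared only when they have the same length, and a position whose \(kb\) \(Diff\) is 0 (which always holds for the first record) contributes an error of 0 when the tested \(Diff\) is also 0 and 1 otherwise, because Eq. (4) is undefined there. All function names and sample values are hypothetical.

```python
def psim(seq_acts, kb_acts):
    """Eq. (2)-(3): share of positions whose NetworkActivity values match."""
    if not seq_acts or len(seq_acts) != len(kb_acts):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(seq_acts, kb_acts)) / len(seq_acts)


def dsim(seq_diffs, kb_diffs):
    """Eq. (4): one minus the mean absolute percentage error of Diff."""
    if not seq_diffs or len(seq_diffs) != len(kb_diffs):
        raise ValueError("sequences must be non-empty and of equal length")
    errors = []
    for d, k in zip(seq_diffs, kb_diffs):
        if k == 0:
            # Assumption: avoid division by zero for a zero reference Diff.
            errors.append(0.0 if d == 0 else 1.0)
        else:
            errors.append(abs((d - k) / k))
    return 1.0 - sum(errors) / len(errors)


def sim(seq_acts, seq_diffs, kb_acts, kb_diffs):
    """Eq. (5): mean of the pattern and time-difference similarities."""
    return 0.5 * (psim(seq_acts, kb_acts) + dsim(seq_diffs, kb_diffs))


# Made-up tested sequence and knowledge-base entry.
test_acts = ["UDP-DNS", "TCP-WEB-Established", "TCP-Attempt"]
kb_acts = ["UDP-DNS", "TCP-WEB-Established", "TCP-Established"]
test_diffs = [0.0, 10.0, 30.0]
kb_diffs = [0.0, 10.0, 20.0]
score = sim(test_acts, test_diffs, kb_acts, kb_diffs)
```

Note that \(DSim\) as written in Eq. (4) can become negative when the average percentage error exceeds 1; the sketch leaves such values unclipped.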

The previous measurement gives the similarity level between bot activity and each \(kb\) characteristic, so a heuristic approach is used to find the threshold value that this level must exceed. Candidate thresholds between 0 and 1 are tested against \(Max\left({perSc}_{max}, {spoSc}_{max}, {simSc}_{max}\right)\). The resulting threshold is employed in Eq. (6) to determine whether an activity is botnet or normal.

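$$Activity = \left\{ {\begin{array}{*{20}l} {Botnet;} & {if \, Max\left( {perSc_{max} , spoSc_{max} ,simSc_{max} } \right) > Threshold} \\ {Normal;} & {otherwise} \\ \end{array} } \right.$$
(6)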
Algorithm 3 Similarity Analysis

Activity category detection

Network traffic activity detected as a botnet attack can be sporadic, periodic, or simultaneous during the same attack period. How closely the \(SeqA\) resembles each characteristic indicates which characteristic the attack most likely has. Thus, the maximum of the three similarity values is determined using Eq. (7) to categorize the attack.

$$Botnet\;Attack\;Characteristic = \left\{ {\begin{array}{*{20}l} {Periodic;} & {if \, Max\left( {perSim_{avg} , spoSim_{avg} ,simSim_{avg} } \right) = perSim_{avg} } \\ {Sporadic;} & {if \, Max\left( {perSim_{avg} , spoSim_{avg} ,simSim_{avg} } \right) = spoSim_{avg} } \\ {Simultaneous;} & {if \, Max\left( {perSim_{avg} , spoSim_{avg} ,simSim_{avg} } \right) = simSim_{avg} } \\ \end{array} } \right.$$
(7)
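The decision logic of the thresholding step and of Eq. (7) can be sketched as follows. The threshold of 0.51 is the value reported in the evaluation; the dictionary layout, the function names, and the sample scores are hypothetical, and ties are resolved here in favor of the first key, which the text does not specify.

```python
THRESHOLD = 0.51  # value reported in the evaluation section


def is_botnet(max_scores, threshold=THRESHOLD):
    """Flag an activity as botnet when its best similarity exceeds the threshold."""
    return max(max_scores.values()) > threshold


def attack_characteristic(avg_scores):
    """Eq. (7): pick the characteristic with the highest average similarity.

    Assumption: on a tie, the first key in the dictionary wins.
    """
    return max(avg_scores, key=avg_scores.get)


# Made-up similarity values for one detected activity.
max_scores = {"periodic": 0.64, "sporadic": 0.88, "simultaneous": 0.71}
avg_scores = {"periodic": 0.55, "sporadic": 0.79, "simultaneous": 0.62}

detected = is_botnet(max_scores)
category = attack_characteristic(avg_scores) if detected else None
```

Keeping the two decisions in separate functions mirrors the text, where detection uses the highest similarity per characteristic and categorization uses the average similarity.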

Result and discussion

The method is implemented in Python 3.10 on a machine with an Intel Core i7-9700F 3.00 GHz processor, 256 GB of SSD storage, and 16 GB of RAM. Three datasets are used: CTU-13 [28], NCC-1 [25], and NCC-2 [27], representing sporadic, periodic, and simultaneous activities, respectively; their characteristics are provided in Tables 1 and 2. The CTU-13 and NCC-1 datasets each consist of 13 sub-datasets called scenarios, which differ in attack types, botnet names, and protocols used. In contrast, NCC-2 consists of 3 sub-datasets called sensors, each of which records network activity covering more than one attack scenario. Together, the three datasets yield 1,784,699 botnet records.

Table 1 Detail of the datasets
Table 2 Detail of the sub-datasets

The similarity measurement ranges from 0 to 1. Whether the similarity between a sequential activity and the botnet characteristics in the knowledge base is substantial is decided by a threshold value: if the similarity level is higher than the threshold, the sequential activity is detected as a botnet. The threshold is determined using a heuristic approach [49] over the range 0.10 to 0.90 by observing the changes in the True Positive (TP) and False Positive (FP) values shown in Fig. 5. This figure shows that the FP value begins to decrease at a threshold of 0.50, so a finer heuristic test was carried out over the range 0.45 to 0.60, with the results presented in Fig. 6. The test uses three sub-dataset samples: scenario 3 of CTU-13, scenario 3 of NCC-1, and sensor 3 of NCC-2, resulting in a threshold of 0.51, at which FP decreases to almost 0.

Fig. 5 Evaluation threshold values with possible threshold, from 0.10 to 0.90 using several datasets

Fig. 6 Evaluation threshold values with possible threshold, from 0.45 to 0.60 using several datasets
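The heuristic threshold search can be pictured as a simple sweep that counts TP and FP at each candidate value. The sketch below is only an illustration: the step size of 0.01, the function name `sweep_thresholds`, and the scored samples are assumptions and do not reproduce the reported experiment.

```python
def sweep_thresholds(scored, thresholds):
    """Count TP and FP for each candidate threshold.

    `scored` holds (similarity, is_botnet) pairs; an activity is flagged
    when its similarity is higher than the threshold.
    """
    results = []
    for t in thresholds:
        tp = sum(1 for s, is_bot in scored if s > t and is_bot)
        fp = sum(1 for s, is_bot in scored if s > t and not is_bot)
        results.append((t, tp, fp))
    return results


# Candidate thresholds from 0.45 to 0.60; the 0.01 step is an assumption.
candidates = [round(0.45 + 0.01 * i, 2) for i in range(16)]

# Made-up similarity scores with ground-truth flags.
scored = [(0.95, True), (0.58, True), (0.50, False), (0.47, False), (0.30, False)]
sweep = sweep_thresholds(scored, candidates)
```

On this toy data the false positives disappear once the threshold passes the highest-scoring normal activity, which is the kind of behavior the threshold figures are inspected for.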

Sequential traffic mining and repetitive sequential traffic analysis

At this stage, the model forms the sequential traffic, as shown in Table 3, and then removes the repeated sequential traffic to generate the knowledge base. The repetitive sequential traffic analysis reduces the traffic in each dataset, as shown in Table 4. In the CTU-13 dataset, the highest reduction value is 99.97%, in scenario 3. For NCC-1, the highest reduction value is 98.00%, in scenario 6. For NCC-2, the highest reduction value is 93.74%, on sensor 3. Repetitive sequential analysis removes sequential traffic that has the same attack pattern but different targets, so a spreading attack has a high reduction value. The Port Scanning (PS) attack selects its targets randomly [50] and therefore produces repetitive sequential activities with the same pattern, which are removed. This leads to the high reduction values obtained in the sub-datasets with the PS attack type.

Table 3 Sequential activity mining results
Table 4 Repetitive sequential traffic analysis results

Meanwhile, the lowest reduction value in the CTU-13 dataset is 0 in scenario 11, the same as in the NCC-1 dataset in scenarios 10 and 11. In contrast, the NCC-2 dataset has the lowest reduction value in sensor 1, at 89.32%. Sequential traffic is formed by analyzing the intensity, pattern, source, and target of attacks so that an intense and concentrated attack creates a complex and non-repeating pattern. This is consistent with the lowest reduction value of 0 obtained from a sub-dataset with a DDoS attack type aiming at flooding a specific target [51].

Detection result with similarity approach

The similarity analysis yields the \(SeqA\) whose similarity is substantial according to the threshold value. Three metrics are used to evaluate the performance: accuracy, precision, and recall, the results of which are provided in Table 5.

Table 5 Detection result with similarity approach
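For reference, the three metrics follow their standard definitions from confusion-matrix counts, as in the sketch below. The counts are made up for illustration and are unrelated to Table 5; returning 0.0 for an empty denominator is an assumption.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # assumption: 0.0 when undefined
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up counts for illustration only.
accuracy, precision, recall = detection_metrics(tp=90, fp=0, tn=900, fn=10)
```

A detector with no false positives, as in this toy case, reaches a precision of 1.0 even when its recall is lower.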

The detection results show that the method works best on the CTU-13 dataset, with the highest average accuracy (99.97%) and an average recall of 97.38%. The highest average precision, 100%, is obtained on NCC-2. The method underperforms on NCC-1 because of poor precision in scenarios 3 and 12 (44.38% and 31.98%, respectively), in contrast to the other sub-datasets, which reach a precision of 100%. Likewise, the lowest recall among the sub-datasets other than scenarios 3 and 12 is 95.66%, much higher than the 39.90% and 31.34% obtained in scenarios 3 and 12, respectively. This indicates that the generated knowledge base cannot cover the features of these two scenarios. It is worth noting that this dataset focuses on periodic botnet activities.

Botnet characteristic

The measurement in the previous stage obtained bot activity based on its average similarity to \({kb}^{per}\), \({kb}^{spo}\) and \({kb}^{sim}\), where the highest average value indicates their attack characteristic, as depicted in Table 6. As described in the previous section, those three datasets have different characteristics: CTU-13, NCC-1, and NCC-2 represent sporadic, periodic, and simultaneous attacks, respectively.

Table 6 Botnet characteristic detection results

In certain cases, an attack may show relatively high similarity to all characteristics, although only one is chosen. For example, the similarity level of scenario 12 of the CTU-13 dataset is more than 60% for all characteristics. In that scenario, the IP addresses 147.32.84.165, 147.32.84.191, and 147.32.84.192 show sporadic activity because their attacks spread across the target computers. Some botnet activities repeatedly attack the same targets at different times with different intensities, which represents periodic behavior. Furthermore, attacks launched in the same period against different targets make them simultaneous activities. Overall, the method fully detects the correct botnet attack characteristic for this dataset.

In the NCC-1 dataset, the method recognizes the correct characteristic in 12 out of 13 scenarios, three of which (scenarios 3, 7, and 11) are detected with 100%. In scenarios 3, 9, and 10, the similarity to the simultaneous characteristic exceeds 80%; however, the similarity to the periodic characteristic is still higher. In scenario 12, the attack is incorrectly detected as simultaneous instead of periodic, because the method detects attacks carried out simultaneously by two different IP addresses, 147.32.84.165 and 147.32.84.191, in a specific period.

These results imply that a botnet attack can be sporadic, periodic, simultaneous, or even a combination of them. For example, a DDoS attack, which aims to flood the target with traffic, forms a sequential activity consisting of a series of activities [10, 34, 51]. Scenarios 10 and 11 of the NCC-1 and CTU-13 datasets, which contain DDoS attacks, result in extensive \(SeqA\) lengths over 60 min. A botnet may target different machines or the same machine, repeating the attack in different periods or the same period. On the other hand, the average activity length affects the formation of \(SeqA\), as shown in scenario 4 of the CTU-13 and NCC-1 datasets, even though there is only one bot actor. For Port Scanning (PS) and Controlled by Us (US) activities, the length of sequential activity is relatively small, because the attack is carried out in a scattered manner, without regard to the similarity of targets or a focus on a single target. Consequently, no scanning activity is repeated in different periods, which is why it is referred to as sporadic.

Comparison with other studies

This research compares the experimental results with previous studies to analyze the method further, as provided in Table 7. In general, the detection results on the CTU-13 dataset are better than those of previous studies, with an accuracy of 99.97%, a precision of 99.89%, and a recall of 97.38%. The accuracy of the proposed method is 0.03% higher than that of [7]. Its precision, however, is 0.04% lower than that of [42] with the Decision Tree, and its recall is 2.07% below that of [42] with Naïve Bayes. The recall is lower because the proposed method measures the similarity of a \(SeqA\) only against \(kb\) entries of the same size, while the attack stages may have more or fewer steps depending on the attack cycle. Detection performance could be increased by enabling the proposed method to measure the similarity of a \(SeqA\) against \(kb\) entries of different sizes.

Table 7 Comparison with other studies

On the NCC-1 dataset, the method achieves an accuracy of 99.60%, which is better than most previous studies and slightly below [16]. Its precision is still lower than that of [42] with the Decision Tree. Furthermore, its recall is 10.25% lower than that of [42], a drop caused by scenarios 3 and 12.

The detection results on the NCC-2 dataset show a precision of 100%, much higher than in previous studies. In accuracy, the proposed method is still 0.12% below [42]. In recall, the proposed method scores 97.38%, which is 2.58% below [42] with Random Forest. The tree-based detection models are superior here because they fit the characteristics of the NCC-2 dataset, which contains enough botnet traffic data to improve the detection rules.

Overall, the proposed model performs better than previous studies. This is indicated by its average values over the three datasets with different attack characteristics: an accuracy of 99.82%, a precision of 96.79%, and a recall of 94.64%. Table 7 shows that Joshi et al. [7] have a higher average accuracy (99.94%) than the proposed method. Note, however, that the average is the sum of the accuracies obtained on each dataset divided by the number of datasets used in testing. Joshi et al. [7] used only one dataset, CTU-13, which has sporadic attack characteristics, whereas the proposed method is tested on three datasets with different attack characteristics. On CTU-13 alone, the proposed method has better accuracy than [7]. In addition to detecting botnet attacks, the method recognizes the characteristics of botnet attacks, which are required to determine appropriate handling and anticipation once actual attacks occur on computer networks. The other methods in Table 7 cannot do this because they do not analyze the attack intensity and period.
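
The effect of averaging over a different number of datasets can be illustrated with a short sketch. The function name `average_score`, the dataset labels, and all scores below are hypothetical placeholders chosen only to show the arithmetic; they are not values from Table 7.

```python
from statistics import mean
from typing import Mapping


def average_score(per_dataset: Mapping[str, float]) -> float:
    """Sum of the per-dataset scores divided by the number of datasets."""
    return mean(list(per_dataset.values()))


# Hypothetical method evaluated on a single dataset.
single = average_score({"dataset_A": 99.90})

# Hypothetical method evaluated on three datasets; it scores higher on
# dataset_A, yet its average is pulled down by the harder datasets.
multi = average_score({"dataset_A": 99.95, "dataset_B": 99.50, "dataset_C": 99.80})
```

In this illustration, the second method is better on the shared dataset but has the lower average, so averages taken over different dataset collections are not directly comparable.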

Discussion of the complexity and overhead

The sequential activity mining process shown in Algorithm 1 is divided into three steps, of which step 2 (time interval analysis) dominates the overall complexity. In the worst case, its complexity is \(O(m)\), where \(m\) is the number of \(c\) in \(C\). Overhead may occur in step 1 when tracking \(c\) in \(C\). The proposed model is thus efficient, with \(O(1)\) on average, although processing more \(Trf\) adds some computational overhead. The process also updates both set \(C\) and set \(SeqA\) continuously, which may introduce overhead from memory reallocation.
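
One reading consistent with these figures is that \(C\) is tracked with a hash-based structure, whose lookups are constant time on average and linear in the worst case. The sketch below is an illustration under that assumption and is not Algorithm 1 itself: the record layout `(timestamp, source, destination)`, the `max_gap` parameter, and the function name `mine_sequences` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Flow = Tuple[float, str, str]          # (timestamp, source, destination), assumed layout
Chain = List[Tuple[float, str]]        # (timestamp, destination) entries of one sequence


def mine_sequences(trf: List[Flow], max_gap: float) -> List[Tuple[str, Chain]]:
    """Group flows per source; close a sequence when the time gap exceeds max_gap."""
    open_chains: Dict[str, Chain] = defaultdict(list)  # stand-in for C, hash lookup
    seq_a: List[Tuple[str, Chain]] = []                # stand-in for SeqA

    for ts, src, dst in sorted(trf):                   # process flows in time order
        chain = open_chains[src]                       # average O(1) tracking step
        if chain and ts - chain[-1][0] > max_gap:      # time interval check
            seq_a.append((src, chain))                 # close the finished sequence
            chain = open_chains[src] = []              # start a new one for this source
        chain.append((ts, dst))

    # Flush the sequences that are still open at the end of the capture.
    for src, chain in open_chains.items():
        if chain:
            seq_a.append((src, chain))
    return seq_a


# Hypothetical traffic: source "a" pauses for 90 s, so it yields two sequences.
example = mine_sequences(
    [(0, "a", "x"), (10, "a", "y"), (100, "a", "z"), (5, "b", "x")], max_gap=30
)
```

Both containers grow as traffic arrives, which is where the memory reallocation overhead mentioned above would appear in an implementation of this kind.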

In Algorithm 2 (repetitive sequential activity analysis), the complexity again comes from the looping process, \(O(o)\), where \(o\) is the number of \(SeqA\). The more sequential activities have been formed, the longer the loop runs. Building the set \(kb\) has a space complexity of \(O(o)\) for the same reason, depending on the number of unique \(SeqA\). Apart from that, the analysis is designed to keep the overhead of set management and iteration to a minimum.
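
The single loop and the storage of unique sequences can be sketched as follows. This is a simplified illustration, not Algorithm 2: it assumes that two sequences count as repetitions when they are exactly equal, and it represents \(kb\) as a dictionary mapping each unique sequence to its number of occurrences. The function name `build_kb` is hypothetical.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def build_kb(seq_a: List[Sequence[Hashable]]) -> Dict[Tuple[Hashable, ...], int]:
    """One pass over SeqA; each distinct sequence is stored once with its count."""
    kb: Dict[Tuple[Hashable, ...], int] = {}
    for seq in seq_a:                    # the loop runs o times
        key = tuple(seq)                 # hashable form of the sequence
        kb[key] = kb.get(key, 0) + 1     # repeated sequences only raise the count
    return kb


# Hypothetical sequences: the first two are repetitions of each other.
kb = build_kb([["scan", "connect"], ["scan", "connect"], ["flood"]])
```

The dictionary holds one entry per unique sequence, so its size grows with the number of unique \(SeqA\) rather than with the total number of sequences.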

The similarity analysis presented in Algorithm 3 consists of four steps; the time complexity of each step depends on the iteration process, \(O(z)\), where \(z\) is the number of \(SeqA\) in \(kb\). The space complexity is tied to the storage of \(perSim\), \(spoSim\), and \(simSim\); besides these three variables, the size of \(kb\) also contributes significantly. The overhead of Algorithm 3 comes from the iterative comparison process followed by the similarity calculation over \(kb\), so the cost of the entire process grows with the data size. Overall, the complexity of the proposed model comes from the iteration processes, while the overhead comes from memory reallocation, depending on how large the processed \(SeqA\) and \(kb\) are.

Conclusions

This research proposes a model to detect the characteristics of botnet attacks using a sequential traffic mining approach and similarity analysis. It consists of six main processes. The first is traffic splitting, which aims to obtain botnet traffic. The second is sequential traffic mining, which mines and forms groups of sequentially related activities. Then, repetitive sequential traffic analysis eliminates repetitive activities. The following process calculates the similarity between sequential activities and, using a threshold value, detects botnet activity. Finally, the method determines the characteristics of the botnet attacks.

In the detection process, the method shows better results than most previous research, with a detection accuracy of 99.97% on the CTU-13 dataset and 99.60% on the NCC-1 dataset. In precision, the proposed model gives better results than previous studies specifically on the NCC-2 dataset. On average over the three datasets, the proposed method achieves an accuracy of 99.82%, a precision of 96.79%, and a recall of 94.64%. The proposed method obtains its best values in accuracy compared with previous research, which means that it has the best performance in recognizing botnet activity among the detection methods in previous studies.

The proposed model has advantages in analyzing the characteristics of botnet attacks. Each detected botnet is analyzed to measure how closely it resembles each characteristic and to identify the dominant characteristic in each dataset, which can be sporadic, periodic, or simultaneous. First, the model recognizes sporadic activity as a high-intensity attack by one bot on several targets at different times. Second, for periodic activity, the model recognizes an attack by one bot or a group of bots on the same or different targets that is repeated at different periods. Finally, the model detects simultaneous activity, that is, an attack carried out by one bot or a group of bots at the same time within the same period. This model provides more information to assist network security administrators in handling botnet attacks based on their characteristics.

Future development will focus on better understanding botnet activities to reduce FP values and improve recall. This can be done by extracting more varied patterns of sequential activity and by adding the ability to measure the similarity of \(SeqA\) with different \(kb\) sizes. Efficiency and effectiveness will also be a main focus of future research, while keeping the complexity and overhead of the proposed model in check. In addition, the model needs to be tested with other datasets.