Unsupervised detection of botnet activities using frequent pattern tree mining

A botnet is a network of remotely controlled infected computers that can send spam, spread viruses, or stage denial-of-service attacks without the consent of the computer owners. Since the beginning of the 21st century, botnet activities have steadily increased, becoming one of the major concerns for Internet security. Botnet activities are also becoming increasingly difficult to detect, because they make use of Peer-to-Peer protocols (eMule, Torrent, Frostwire, Vuze, Skype and many others). To improve the detectability of botnet activities, this paper applies association analysis from the field of data mining, and proposes a system to detect botnets based on the FP-growth (Frequent Pattern Tree) frequent item mining algorithm. The detection system is composed of three parts: packet collection and processing, rule mining, and statistical analysis of rules. Its characteristic feature is the rule-based classification of different botnet behaviors in a fast and unsupervised fashion. The effectiveness of the approach is validated in a scenario with 11 Peer-to-Peer host PCs, 42,063 non-Peer-to-Peer host PCs, and 17 host PCs with three different botnet activities (Storm, Waledac and Zeus). The recognition accuracy of the proposed architecture is shown to be above 94%. The proposed method is shown to improve on the results reported in the literature.


Introduction
With the continuous development of the Internet, the network has expanded from an interconnection of PCs to a mobile Internet. With the advent of 5G technology, further expansion is expected towards the Internet of Things and the Internet of Everything scenarios [1]. As a result, the amount of information exchanged over the Internet has reached unprecedented levels, but so have the threats and the need for security.
Today, botnets have become one of the major threats to Internet security [25]. A botnet attack typically occurs as follows: attackers invade a large number of hosts and implant virus programs by exploiting various software and equipment vulnerabilities, sending phishing emails, or brute-force cracking. Any host that is infected becomes a member of the botnet, and falls under the control of the bot virus program. A botnet can eventually stage several malicious activities, such as sending spam, stealing private information, or launching denial-of-service (DoS) attacks. Because the infected host is typically referred to as a 'zombie host', another popular name for a botnet is 'zombie network'.
School of Mathematics, Southeast University, Nanjing, China
The first botnet program was SubSeven 2.1, created in June 1999 (the interested reader can check https://f-secure.com/v-descs/subseven.shtml). SubSeven 2.1 relied on the IRC (Internet Relay Chat) protocol to control a large number of zombie hosts. Since 1999, a large number of bot virus programs based on the IRC protocol have appeared; examples include GTBot, Sdbot, etc. [4,6]. Fortunately, IRC-based botnets can be easily defeated by shutting down and restarting the IRC server. However, attackers have found new protocols through which to deliver botnet activities, so that botnets based on the HTTP protocol and the P2P (Peer-to-Peer) protocol have appeared [24,32]. The latter have the most dangerous characteristics in terms of decentralization and strong resilience, since P2P-based botnets cannot be easily shut down like IRC-based botnets and their activities are more difficult to detect [26,27]. The emergence of more and more P2P-based botnet programs poses a great threat to Internet security [5,20,30]. Dangerous botnet activities are also increasingly observed in Internet-of-Things (IoT) environments [9,29] and social Internet-of-Things environments [28].
Popular detection methods for P2P-based botnets are signature-based intrusion detection systems [8], which are similar to antivirus software and firewalls. However, packet encryption invalidates such methods. The authors in [21] proposed a detection model, named detection by mining regional periodicity (DMRP), based on capturing the event time series, mining the hidden periodicity of host behaviors, and evaluating the mined periodic patterns to identify P2P bot traffic. The authors of [18] proposed a three-layer filtering botnet detection system, whose layers are responsible for packet filtering, P2P application packet filtering, and P2P botnet detection, respectively. High accuracy with a low false alarm rate has been reported for such periodicity-based methods, although the behavior characteristics considered for the botnet are too simplistic compared to real-life botnets.
In order to handle the detection of more complex botnet activities, methods based on machine learning [2,10-12,14,16,17,23] have been commonly used. A brief account of these methods is given hereafter. The works [16,17] combined a stream-based method and a session-based method to design a two-layer machine learning classifier that can distinguish between zombie P2P activities and normal P2P activities. Note that [16] improves the work of [17]. The method in [10] clustered normal and abnormal P2P behavior; [23] compared, based on existing datasets, five commonly used machine learning techniques for online botnet detection. The results of the evaluation show that it is possible to detect botnets effectively during the botnet Command-and-Control (C&C) phase, before bots launch their attacks, using traffic behaviors only. Based on the characteristics of unbounded network data flows and concept drift, [12] proposed a multi-layer, multi-classifier group detection system, building on research on single and multiple classifiers, that stores the optimal K classifiers. All machine learning algorithms face similar problems, most notably the long training time [11,14]. Another crucial problem is the need for labeled examples, i.e., each classifier must be trained for a specific type of botnet, which makes the classifier in general unable to handle new/unknown botnet activities for which labeled examples are unavailable [29].
In view of this observation, the methods in [7,13,31,32] are all based on botnet behavior, i.e., they classify botnet behavior based on time intervals without having seen a complete network flow. There are, however, several challenges which must be overcome to realize a full implementation of such behaviour-based detection systems, such as the lack of scalability to huge datasets, and the need for installing individual detectors on every network device in any network with more than a few hundred nodes.
The advantages and flaws of the different methods adopted in the literature motivate us to pursue a different approach to the detection of botnet activities. In this paper, we exploit and tailor the Frequent Pattern Tree to botnet detection. Frequent Pattern Tree is a data mining method used for frequent pattern mining (also known as Association Rule Mining). The purpose of the algorithm is to discover frequent patterns or associations from data sets. Because the method can automatically detect rules, it does not need to be trained on specifically labeled botnet activities as in machine learning approaches. In addition, because the data set is stored in a tree structure called Frequent Pattern Tree, frequent items can be found by simply scanning the data set twice. The tree structure results in higher efficiency and lower runtime cost compared to other data mining algorithms. For example, it was shown that when the number of records in the data set is relatively large, the Frequent Pattern Tree algorithm has a significant advantage over the Apriori algorithm in terms of speed [15]. Speed and memory efficiency also make the Frequent Pattern Tree algorithm widely used in search engines [3].
Most botnet detection approaches rely on machine learning algorithms, which have a long training time and are targeted at known (labeled) botnet types. In the presence of unknown botnet types and large amounts of data, the performance of machine learning methods might deteriorate seriously. The contribution of this article is the first introduction of the frequent item mining algorithm Frequent Pattern Tree in the field of botnet detection. The proposed approach relies on discovering essential characteristics of scriptability and frequent similarity in the underlying communication of P2P botnets: therefore, it can cope with different types of P2P botnets without the need for the different types to be labeled. In fact, it is shown via the PeerRush Dataset [22] that botnet activities can be detected and classified automatically, without presetting or pre-training for specific botnet types. Also, the proposed approach can process tens of millions of records in around half a minute, which is again shown via the PeerRush Dataset [22]. Extensive experiments show that, compared with machine learning methods reported in the literature for the same data set, the proposed method improves in terms of efficiency and accuracy.
The remainder of this paper is organized as follows. The Frequent Pattern Tree data mining algorithm is recalled in the second section. In the third section, we introduce the implementation steps of the proposed detection system. Experimental results are illustrated in the fourth section, followed by conclusions and ideas for future work in the last section.

Algorithm 1 Frequent Pattern Tree (note that a_n refers to the n-th element in A, r_m refers to the m-th element in R, and list_g refers to the g-th element in List)
  Input: Record set R with attributes belonging to attribute set A
  Output: Rule set containing frequent item pair set
  {a_1, a_2, a_3, ..., a_n} ← Attribute set A
  {r_1, r_2, r_3, ..., r_m} ← Record set R
  C_{a_i} ← Count of attribute a_i
  S ← Minimum support
  B_{a_i} ← Conditional pattern base of a_i
  C ← Item Header List i
  List = {list_1, list_2, ..., list_g} ← Item Header Table
  for r_i in R do
    for a_j in A do
      if

Frequent pattern tree
This section gives background on the Frequent Pattern Tree algorithm. Preliminary notions related to Frequent Pattern Tree are the following:
• Attribute: an item which could appear in a record. For example, all items in a shopping list can be used as attributes of the shopping record.
• Record: a collection of multiple attributes.
• Minimum support: the minimum number of occurrences that the considered attributes should have.
• Conditional pattern base: the set of all prefix paths ending with the searched element.
Our study used the Frequent Pattern Tree algorithm jar package provided in the open source machine learning software Weka. The process of mining frequent item sets by the Frequent Pattern Tree algorithm can be divided into three steps (cf. Algorithm 1):
(1) Preprocessing of the data set
(a) Perform the first scan of the data set and count the number of occurrences of each attribute;
(b) Filter the attributes appearing in the data set according to the set minimum support, and remove the attributes whose number of occurrences is less than the minimum support;
(c) Perform a second scan of the data set and reorder the attributes in each record from high to low according to their number of occurrences.
(2) Build the Frequent Pattern Tree
(a) Traverse all records in the data set, starting from the root node, add the corresponding nodes according to the attributes in the records, and add the first occurrence and node entries in the entry header
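The preprocessing and tree-building steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Weka implementation used in the study: the names `FPNode` and `build_fp_tree`, the shopping-style records, and the use of an absolute occurrence count as minimum support are all assumptions made for the example.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an attribute, its count, and its links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(records, min_support_count):
    """Build an FP-tree with two scans of the records (hypothetical helper)."""
    # First scan: count in how many records each attribute occurs.
    counts = Counter(item for record in records for item in set(record))
    # Remove the attributes that occur fewer times than the minimum support.
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    root = FPNode(None)
    header = {item: [] for item in frequent}  # item header table: item -> nodes

    # Second scan: reorder each record by descending count, then insert it.
    for record in records:
        ordered = sorted(
            (i for i in set(record) if i in frequent),
            key=lambda i: (-frequent[i], i),  # ties broken alphabetically
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header, frequent


# Illustrative shopping records (not data from the paper).
records = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]
root, header, frequent = build_fp_tree(records, min_support_count=3)
```

In this sketch, records that share a prefix of frequent attributes share a path in the tree, which is what lets later mining steps work on the tree instead of rescanning the data set.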

Methodology
Based on the frequent item set mining algorithm, we propose a botnet detection system comprising the following three parts (cf. Fig. 1).
a) Collection and preprocessing of network data packets;
b) Rule mining using the Frequent Pattern Tree algorithm module;
c) Statistical analysis of the rule set to determine the bot host IP.

Preprocessing of data set
Let us introduce some notions about data processing:
1. pcap: a common file format for storing network packets. Collected packets are stored in a pcap file with a name like xxx.pcap.
2. arff: the dataset format that the Weka software can read. Below is an example of an arff file.
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
The collection of network data packets requires the Winpcap library (on the Windows platform) or the Libpcap library (on the Linux platform). The Winpcap (Libpcap) library integrates the relevant function interfaces for sniffing and collecting network data packets from network adapters (network cards). After sniffing a data packet, its attributes are extracted and stored into the specified file according to the arff sparse format specified by Weka. Each network flow corresponds to a record. In this work, the pcap packet files are obtained from the PeerRush [22] dataset (many other pcap datasets are available online): such datasets comprise normal and abnormal network traffic data over a few weeks, where the data have already been preprocessed into the pcap format. It must be remarked that processing such pcap files into the arff sparse format takes around one day for the PeerRush dataset (which is composed of tens of millions of records). This amounts to less than 0.01 s per record: therefore, when processing data online, this processing time is small enough to be neglected.
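As an illustration of the conversion into the arff sparse format, the following Python sketch encodes each flow as a set of binary "field=value" attributes and writes one sparse row per flow. The encoding scheme, the function name `flows_to_sparse_arff`, the field names, and the sample flows are assumptions made for this example; the paper does not give its exact conversion code.

```python
def flows_to_sparse_arff(flows, relation="flows"):
    """Encode flows as binary 'field=value' items in sparse ARFF text.

    Hypothetical helper: every distinct field=value pair becomes one
    {0,1} attribute, and each row lists only the attributes set to 1.
    """
    items = sorted({f"{k}={v}" for flow in flows for k, v in flow.items()})
    index = {item: i for i, item in enumerate(items)}

    lines = [f"@relation {relation}"]
    for item in items:
        lines.append(f"@attribute '{item}' {{0,1}}")
    lines.append("@data")
    for flow in flows:
        present = sorted(index[f"{k}={v}"] for k, v in flow.items())
        # Sparse row: zero-based attribute index followed by its value.
        lines.append("{" + ", ".join(f"{i} 1" for i in present) + "}")
    return "\n".join(lines)


# Two made-up flows (values are illustrative, not taken from PeerRush).
flows = [
    {"SrcIP": "192.0.2.10", "SrcPort": 1025, "DstPort": 80,
     "Protocol": "TCP", "FirstPktLen": 60},
    {"SrcIP": "192.0.2.10", "SrcPort": 1026, "DstPort": 80,
     "Protocol": "TCP", "FirstPktLen": 60},
]
arff_text = flows_to_sparse_arff(flows)
```

Treating every value as a discrete label keeps the attributes discrete; a continuous quantity would first have to be binned, and the binning used in the study is not specified here.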

Mining rules using frequent pattern tree algorithm
The Frequent Pattern Tree algorithm has two important requirements for the data set: the data set must be in a sparse format and the attributes in the data set must be discrete. The processing of the dataset into arff sparse format guarantees that such requirements are satisfied.

Statistical analysis of Rule set
After mining frequent items through the Frequent Pattern Tree algorithm, the system outputs all the rules that meet the set minimum support. At this stage, the rule set may contain all kinds of rules combining any of the characteristic attribute items. Since the goal is to determine the identity of the host, so that the system can detect the IP of any zombie host in the local area network, the rule set is first filtered. Because the IP address string format has a special structure, we use a regular expression to extract all the rule lines that match the IP address string format, recording each distinct IP address while counting its occurrences. A regular expression is a tool to find strings fitting a pattern set by the user. The pattern under consideration in this work is a pattern to find an IP address. The meaning of the regular expression is the following:  Also refer to https://www.cuminas.jp/sdk/regularExpression.html for more details on regular expressions. Once the rule set has been screened and counted, the judgment step follows. If the rule count associated with a host's IP address exceeds the threshold C, the host is judged to be a zombie host, a warning is issued, and the IP address is recorded for the network administrator to conduct subsequent security investigations.
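The filtering and judgment steps can be sketched as follows. The IPv4 pattern, the textual format of the rule lines, and the names `count_rules_per_ip` and `flag_hosts` are assumptions made for illustration; the expression actually used in the study is not reproduced in this text.

```python
import re
from collections import Counter

# Assumed IPv4 pattern: four octets in 0-255, not embedded in a longer number.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
IPV4 = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])")


def count_rules_per_ip(rule_lines):
    """Count, for each IP address, the number of rule lines mentioning it."""
    counts = Counter()
    for line in rule_lines:
        for ip in set(IPV4.findall(line)):  # count each IP once per rule
            counts[ip] += 1
    return counts


def flag_hosts(counts, threshold_c):
    """Return the IPs whose rule count exceeds the threshold C."""
    return sorted(ip for ip, n in counts.items() if n > threshold_c)


# Made-up rule lines in a hypothetical "antecedent ==> consequent" format.
rules = [
    "SrcIP=192.0.2.7, DstPort=80 ==> Protocol=TCP",
    "SrcIP=192.0.2.7 ==> FirstPktLen=60",
    "SrcIP=192.0.2.7, Protocol=UDP ==> DstPort=53",
    "SrcIP=198.51.100.4 ==> Protocol=TCP",
    "DstPort=80 ==> Protocol=TCP",
]
counts = count_rules_per_ip(rules)
suspects = flag_hosts(counts, threshold_c=2)
```

Rules that mention no IP address are simply ignored, and a host is reported only when its count is strictly greater than the threshold, matching the "exceeds the threshold C" criterion.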

Dataset description
Our study used the PeerRush Dataset (2018) [22], which is composed of three data sets (see Fig. 2):
1. Normal P2P application data set (with data collected from 11 P2P hosts);
2. Three P2P zombie virus program data sets (with data collected from 13 hosts infected by Storm, 1 host infected by Waledac and 3 hosts infected by Zeus);
3. Non-P2P application data set (with data collected from 42,063 hosts).
The first part of the dataset is generated by a local area network consisting of 11 hosts set up by the PeerRush team. These 11 hosts run 5 popular P2P applications (eMule, µ-Torrent, Frostwire, Vuze, Skype). Each application adopts a custom P2P protocol, which ensures the diversity of the P2P application behavior and of the underlying patterns of the data packets. To ensure that the data set has human-like characteristics, the PeerRush team also used the AutoIt scripting software to interact with the P2P applications and carry out file downloads. The second part of the PeerRush Dataset uses 17 hosts infected by representative P2P botnet virus programs, namely Storm (13 hosts), Waledac (1 host) and Zeus (3 hosts). The third part of the PeerRush Dataset is collected in a separate local area network with 42,063 hosts. In this part of the dataset, Snort was combined with existing P2P detection methods to filter out suspicious P2P application traffic.

Results and evaluation
The crucial parameter to be selected for the Frequent Pattern Tree algorithm is the minimum support S. A minimum support that is too small will mine many worthless rules, which wastes calculation time and might also make the rule analysis stage long. On the other hand, if the minimum support is set too large, some valuable rules might not be mined. Here, support is expressed as a fraction: an attribute is retained only if the ratio between the number of instances containing it and the total number of instances is greater than or equal to the minimum support; otherwise, it is discarded by the algorithm. We use regular expression extraction to select the rules related to IP addresses in the mined rule results. Each host is identified by its IP address, as shown in the results of Table 1.
Since there are 17 bot hosts in total, it can be seen from the data in Table 1 that, for all values of minimum support, 100% of the zombie hosts (13 hosts infected by Storm, 1 host infected by Waledac and 3 hosts infected by Zeus) are included in the rule results mined by the Frequent Pattern Tree algorithm. It must also be noticed that the number of related rules for each bot IP address is greater than 0, which means that the method includes all the bot hosts; it is the rule number that eventually decides whether a host is a bot or not.
However, when the minimum support is low (0.001), 267 rules of normal hosts have been mined (false alarms). More specifically, 14 hosts out of the 11+42,063 hosts are detected as infected: note that there is one rule related to the normal host with the IP of 139.205.84.245, and there are 19 rules for each of the 14 hosts with IP 66.154.83.[130-133, 135, 137, 139, 140-142, 145, 147, 150, 152]. The false alarm rate at this time is 46.875% when the threshold C is set to 10. If C = 10, any IP whose number of related rules is greater than C will be considered as a bot IP. As a result, because 19 > C, the 14 hosts with IP 66.154.83.[130-133, 135, 137, 139, 140-142, 145, 147, 150, 152] are considered as bots even though they actually are not. When the minimum support is greater than 0.001, no false alarms occur: it is also worth noticing that the number of rules needed to characterize a botnet typically decreases when increasing the minimum support. Except for the Zeus zombie host with an IP of 10.0.2.15, the number of rules is greatly reduced if the minimum support is greater than or equal to 0.0015.
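The rule counts quoted above can be cross-checked with a short calculation: 14 hosts with 19 rules each, plus the single rule of 139.205.84.245, give the 267 rules of normal hosts, and only the 14 hosts exceed C = 10. The helper name `expand_last_octets` is hypothetical and introduced only for this check.

```python
def expand_last_octets(spec):
    """Expand a spec such as '130-133, 135' into a list of integers."""
    octets = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            octets.extend(range(lo, hi + 1))
        else:
            octets.append(int(part))
    return octets


# Last octets of the falsely flagged hosts, as listed in the text.
spec = "130-133, 135, 137, 139, 140-142, 145, 147, 150, 152"
rule_counts = {f"66.154.83.{o}": 19 for o in expand_last_octets(spec)}
rule_counts["139.205.84.245"] = 1  # normal host with a single related rule

total_rules = sum(rule_counts.values())
C = 10
flagged = [ip for ip, n in rule_counts.items() if n > C]

print(len(flagged), total_rules)  # prints "14 267"
```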
Therefore, another parameter that determines the detection performance of the system is the threshold C on the number of rules related to a single IP: when the number of rules related to an IP exceeds C, the corresponding host is judged to be a zombie host. The IP is recorded, and the system then issues a warning indicating that a suspicious host has been found in the network. We tested C = 10 and C = 3, as shown in Table 2. To quantify the performance more accurately, we follow the four commonly used standards for measuring the effectiveness of models in the field of machine learning [19]:
The following conclusions can be drawn from the results in Table 2:
1. When the minimum support of the Frequent Pattern Tree mining algorithm is set too low, it mines worthless rules containing the IP addresses of normal hosts, resulting in unsatisfactory detection results. Although a zero false negative rate is guaranteed and no zombie host is spared, many normal hosts are also judged as zombie hosts, resulting in a high false alarm rate (0.4375);
2. The best minimum support is 0.0015, meaning that only attributes with more than 50,000 occurrences are considered by the algorithm. This gives the ideal state of 100% accuracy and 100% precision, with no missed detections or false alarms. That is to say, this parameter setting ensures a greater degree of separation between botnet hosts and normal hosts: without losing the rules of the botnet hosts, it limits as far as possible the occurrence of worthless rules for normal hosts;
3. When the minimum support is further increased, a small number of false negatives occur. Our analysis shows that this happens because the number of Zeus botnet flow packets in the overall data set is too small compared to Storm and Waledac, so a strict minimum support reduces the number of rules related to the Zeus botnet IPs below the threshold.
Therefore, the choice of minimum support is very important for the detection performance of the system designed in this paper.
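For reference, the four measures named later in the text (Accuracy, Precision, FPR and FNR) can be computed from a confusion matrix as sketched below. This assumes the standard textbook definitions; the exact formulas used in the paper are not shown in this text, so the values need not coincide with those of Table 2. The host totals (17 bots, 11 + 42,063 normal hosts) come from the dataset description, while the outcome with one missed bot is purely hypothetical.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics (assumed definitions).

    tp: bots flagged as bots          fp: normal hosts flagged as bots
    tn: normal hosts left unflagged   fn: bots left unflagged
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # false positive rate
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,  # false negative rate
    }


# Hypothetical outcome: 16 of the 17 bots detected, no normal host flagged.
m = detection_metrics(tp=16, fp=0, tn=11 + 42063, fn=1)
```

With these definitions, a missed bot lowers accuracy only marginally because normal hosts dominate the total, whereas the false negative rate, which is normalised by the number of bots, reacts strongly.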
When C = 3, the proposed method can recognize 100% of the bot IPs without any false alarm if the minimum support is set to 0.0015, 0.002, 0.0025 or 0.003. Our method improves on the results reported for the same dataset [16]. This is clarified in the comparisons of Table 3 for different values of S. When C = 10, the proposed method gives the same results as C = 3 in terms of Accuracy, Precision, FPR and FNR if the minimum support is set to 0.001, 0.0015, 0.0035 or 0.004. The only differences occur when the minimum support is set to 0.002, 0.0025 or 0.003. In these cases, C = 10 has the same Precision and FPR, but slightly worse Accuracy (99.997% instead of 100%) and a higher FNR (i.e., some zombie hosts are wrongly judged as normal hosts). Taking this into account, the most robust minimum support is 0.0015, because in this case both C = 3 and C = 10 work perfectly.

Conclusion and future work
In this paper, the use of frequent item mining to detect P2P botnets has been investigated. The proposed architecture has shown promising accuracy, and the following points can be considered to further improve our investigation. First, the data set used in this article contains only three common P2P botnets: Storm, Zeus, and Waledac. It is of interest to consider more botnet activities, possibly including unknown botnets. Second, when selecting attributes, we have considered the source IP address SrcIP, the source port number SrcPort, the destination port number DstPort, the transmission protocol, and the length of the first packet. It is of interest to consider more attributes (number of packets in a flow, bytes in a flow, average packet length in a flow, etc.), since a small number of attributes may leave some hidden rules unmined. Third, we have not considered that network equipment may change its IP after going offline due to the DHCP protocol. To address this point, it may be necessary to add the MAC address for dual identity binding in the future.

Compliance with ethical standards
Conflict of interest The authors declare no conflict of interest. No author has a financial or personal relationship with a third party whose interests could be positively or negatively influenced by the article's content. On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.