Background

During the last 15 years, botnets have caused some of the most devastating and costly internet security incidents [1]. A client machine becomes a bot in a botnet when it is infected by a particular type of malware [2] that an attacker can exploit remotely to control the infected machine. An attacker can issue commands to a single bot or to all the bots in the botnet. The attacker controlling the botnet is sometimes referred to as the ‘botherder’, ‘botmaster’ or ‘controller’ [3]. Figure 1 shows a typical botnet lifecycle. In contrast to traditional malware such as viruses and worms, which focus on attacking the infected host, bots can receive commands from the botmaster, allowing them to be used as a distributed attack platform [4].

Fig. 1 An example of botnet lifecycle

Botnets can significantly damage the security of individuals and businesses. They pose a serious and growing threat to cyber-security, as they provide a distributed platform for many cyber-crimes such as distributed denial of service (DDoS) attacks against critical targets, malware dissemination, phishing, and click fraud [5,6,7,8]. The majority of existing botnet detection approaches concentrate primarily on particular botnet command and control (C&C) protocols (e.g., HTTP, IRC) and structures (e.g., centralized or P2P), and follow rule-based approaches to detect botnets in the network. However, these approaches can become ineffective and obsolete if botnets change their structure and C&C protocol in order to evade detection [4]. Thus, a robust botnet detection approach that can detect any type of botnet with varying characteristics is of utmost importance. Before exploring existing botnet detection schemes in the literature, we first survey some of the studies done in anomaly detection. We then describe existing efforts dedicated to bot detection, which can be divided into two broad categories: botnet detection using NetFlow-based features and botnet detection using graph-based features.

Anomaly detection techniques

Researchers have conducted extensive work on anomaly detection techniques for cyber security. For example, Fadlullah et al. [9] developed a detection technique called DTRAB to infer DDoS attacks, investigating the detection of attacks against application-level protocols that are encapsulated via encryption. In essence, this scheme is a distributed detection mechanism capable of detecting anomalous events as early as possible. Moreover, DTRAB is able to simultaneously construct a defensive mechanism to discover attacks and to trace the threat back to the attacker's original network. The effectiveness of this scheme is validated via simulation. Flow correlation information is utilized by Zhang et al. [10,11,12] to further improve classification accuracy with only a small number of training instances, using K-Nearest-Neighbor and Naive Bayes classifiers to detect anomalies in the network. Yan et al. [13] propose a framework of security and trust for 5G based on the perspective that next-generation network functions will be highly virtualized and that software-defined networking will be applied for traffic control. Their approach utilizes adaptive trust evaluation and management technologies as well as sustainable trusted computing technologies to achieve computing platform trust and software-defined networking security. A qualitative comparison is provided by Shu et al. [14] between the advantages and disadvantages of software-defined networking and traditional networking regarding security issues concerning the overall architecture. The authors also provide a detailed assessment of the threats to software-defined networking from the perspective of functional layers and attack types. Note that the majority of the research conducted on anomaly detection focuses on rule-based techniques, which may fall short for botnet detection because new-generation botnets can easily encrypt their commands or use sophisticated cloaking techniques that do not follow any predefined rules. Moreover, due to the sophisticated nature of botnets in general, typical anomaly detection techniques are not directly applicable to botnet detection.

Flow-based methods for botnet detection

NetFlow is a network protocol that collects IP network traffic as it enters or exits an interface. NetFlow-based features (or flow-based features) have been used to detect anomalies, including botnets, in high-speed, large-volume data networks. A flow can be defined as a group of packets sharing common attributes such as source and destination IP address, source and destination port, and protocol type. Examining each packet that is routed through the network can lead to identifying unique flows based on the conjoint attributes of the packets. These attributes are considered a fingerprint of the packet, and by leveraging them one can determine whether the packet is distinctive or shares common attributes with others [5, 15, 16]. Hence, using these features, many researchers have investigated the potential of detecting anomalies in the network. The botnet detection literature using NetFlow-based features is a rich one, and many researchers have contributed significantly to this area (e.g., [15,16,17]). Most of the existing detection schemes fall into one of three types of methods: clustering, classification [18,19,20], and others.
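To make the notion of a flow concrete, the following minimal Python sketch groups packet records by the classic 5-tuple and accumulates per-flow packet and byte counts. The record field names (src_ip, dst_ip, src_port, dst_port, proto, n_bytes) are hypothetical and chosen only for illustration; they do not correspond to any particular capture format.

```python
from collections import defaultdict

def aggregate_flows(packets):
    """Group packet records into flows keyed by the classic 5-tuple.

    Each packet is assumed to be a dict with hypothetical field names:
    src_ip, dst_ip, src_port, dst_port, proto, and n_bytes.
    Returns per-flow packet and byte counts.
    """
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += pkt["n_bytes"]
    return flows

# Two toy packets that belong to the same flow
example = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_port": 4242,
     "dst_port": 80, "proto": "TCP", "n_bytes": 1500},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "src_port": 4242,
     "dst_port": 80, "proto": "TCP", "n_bytes": 40},
]
print(aggregate_flows(example))
```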

Clustering is a popular approach taken by researchers to detect botnets using flow-based features. Zeidanloo et al. have proposed a botnet detection framework that can detect botnets without prior knowledge of them by finding similar communication patterns and behaviors among groups of hosts that perform at least one malicious activity, using X-means clustering [21]. Using the Audit Record Generation and Utilization System (ARGUS) [22], the authors collected flow-based information such as source IP address, destination IP address, source port, destination port, duration, protocol, number of packets, and number of bytes transferred in both directions, which is later used to detect the groups of hosts that exhibit similar behaviors and communication patterns. Karasaridis et al. [23] have developed a K-means based method that employs scalable, non-intrusive algorithms to analyze vast amounts of summary traffic data. Gu et al. [24] have proposed a novel anomaly-based botnet detection system that is independent of the protocol and structure used by botnets. This detection system exploits the essential definition and properties of botnets, i.e., bots within the same botnet exhibit similar C&C communication patterns and similar malicious activity patterns. It utilizes flow-based information such as time, source IP, destination IP, source port, destination port, duration, and the number of packets and bytes transferred in both directions. A C-plane clustering method reads the communication logs generated by the C-plane monitor and finds clusters that share similar communication patterns. Arshad et al. [25] have developed an anomaly-based method that requires no a priori knowledge of bot signatures, botnet C&C protocols, or C&C server addresses. Flow characteristics such as IP, port, packet event times, and bytes per packet are examined by Amini et al. [26] to detect botnets: the NetFlow data are collected, filtered, and finally clustered using hierarchical clustering, and rule-based methods are then applied to refine the clusters and reduce the percentage of false positives.

Among the studies that use classification techniques, Strayer et al. have developed detection approaches that examine flow characteristics such as bandwidth, packet timing, and burst duration. The authors first eliminate traffic that is unlikely to be part of a botnet and classify the remaining traffic into groups that are likely to be part of a botnet using J48 decision trees, naive Bayes, and Bayesian classifiers. Finally, the likely traffic is correlated to find common communication patterns that would suggest botnet activity [18, 20]. More recently, a decision tree classifier has been used by Zhao et al. [19] to detect botnets by investigating 12 flow-based features. Their proposed method can detect botnets during the C&C and attack phases based on the observation of flow-based features over specific time intervals. It does not require significant malicious activity to occur before detection because it can recognize command-and-control signals, nor does it require knowledge about the group behavior of several bots before making a confident decision.

Lu et al. [27] have incorporated both classification and clustering techniques in the detection of botnets, developing an unsupervised botnet detection framework in which network traffic from known applications is first identified and each application community that might include botnet communication flows is then examined. This traffic is clustered to find anomalous behaviors within that specific application community based on n-gram features extracted from the content of network flows. The proposed detection framework has been evaluated on IRC community data, and the results show that this approach obtains a high detection rate with a very low false alarm rate when detecting IRC botnet traffic.

Apart from classification and clustering techniques, a number of studies employ other approaches to botnet detection using NetFlow-based features. Interested readers can refer to [28,29,30,31,32] for such related works.

Limitation

Existing methods of botnet detection based on NetFlow traffic features rely on computing statistical features of flow traffic. As a result, these methods only capture the effects of bots on individual links, rather than on the topological structure of a neighborhood or subgraph as a whole. In particular, flow-based detection methods require comparing each traffic flow to all the others in order to determine malicious traffic, instead of monitoring network behavior in a holistic manner. Such techniques are also deficient in that attackers can evade detection by encrypting commands or changing data volume, or by altering other behavioral characteristics, for example by using variable-length encryption or changing packet structure. To overcome this deficiency, another stream of research has focused on detecting botnets using graph-based features. This approach is fundamentally more efficient than flow-based approaches since it avoids the need to cross-compare flows across the dataset [33].

Graph-based methods for botnet detection

There are a number of studies that use different graph-based features to detect anomalies. The literature in this domain can be broadly categorized into two groups: detecting anomalies in static graphs and detecting anomalies in dynamic graphs. Static graphs can be further categorized into plain graphs and attributed graphs. Among the studies that use plain graphs for anomaly detection, Ding et al. [34], Henderson et al. [35], Henderson et al. [36], Kang et al. [37], Aggarwal [38], Zimek et al. [39], Chen and Giles [40], and many others utilize structure-based patterns to detect anomalies, whereas studies by Sun et al. [41], Tong and Lin [42], Ambai et al. [43], and Nikulin and Huang [44] focus on community-based patterns. Similarly, for attributed graphs, Davis et al. [45], Eberle and Holder [46], and Kontkanen and Myllymki [47] use structure-based patterns, whereas Gao et al. [48], Muller et al. [49], and Perozzi et al. [50] use community-based patterns to detect anomalies. With dynamic graphs, authors have used the notion of graph similarity based on certain properties such as degree distribution and diameter [51,52,53], matrix or tensor decompositions of the time-varying graphs [54,55,56,57], or the monitoring of graph communities over time, reporting events when there is a structural or contextual change in any of them [58, 59].

Botnet detection studies using graph-based features mainly exploit the spatial relationships in communication traffic [31, 60, 61]. Collins and Reiter [62] have proposed a method to identify bots by noting that scanning behavior initiated by bot-infected hosts tends to connect otherwise disconnected components of protocol-specific traffic graphs. Wang and Paschalidis [63] use the behavioral characteristics of bots to detect botnets, focusing primarily on analyzing social relationships that are modeled as graphs of nodes. The authors consider both social interaction graphs and social correlation graphs and apply the proposed method to a real-world case study. However, for this detection scheme to succeed, bots need to exhibit systematic behavioral patterns, which may not hold for stealthy botnets. ‘Graption’ is a graph-based method proposed by Iliofotou et al. [64] that identifies peer-to-peer flows by calculating the in-degree to out-degree ratio of hosts in protocol traffic graphs; however, this method can be defeated by protocol randomization. A graph-based approach to detect web-account abuse attacks has been proposed by Zhao et al. [65], where the correlations among botnet activities are uncovered by constructing large user–user graphs. This approach, termed ‘BotGraph’, has two components: aggressive sign-up detection and stealthy bot-user detection. The first component limits the total number of possible bots, whereas the second detects stealthy bot users by constructing a random undirected user–user graph. Only the edge weight feature is used to detect bots in the graph. Although the detection rate is very high, the method's accuracy is questionable if botnet types other than spamming botnets need to be detected. Jaikumar and Kak [66] have presented a graph-based framework for isolating botnets in a network that uses temporal co-occurrences in the activity space to detect botnets, making the framework independent of the software architecture of the malware infecting the hosts. The proposed framework has been validated in a simulated environment; however, it falls short if bots do not exhibit temporally co-occurring malicious activities. Nagaraja et al. [67] have proposed a botnet detection technique based on structured graph analysis that localizes botnet members by identifying the unique communication patterns arising from the overlay topologies used in a command and control structure. However, this approach must be paired with some other malware detection scheme to clearly distinguish botnets from regular flows. Francois et al. [68] have proposed an approach called ‘BotTrack’ in which NetFlow-related data are correlated and a host dependency model is leveraged for advanced data mining. They use the popular linkage analysis algorithm ‘PageRank’ with an additional clustering process to efficiently detect botnets. However, to validate the proposed method, the researchers used only a small dataset (13.7 GB), and they generated the botnet randomly because the dataset was not labeled. Moreover, the authors assumed that a certain percentage of bots and their characteristics were known beforehand, so if an unknown botnet exists in the network, their approach may not perform well. Francois et al. [69] have further extended their work on ‘BotTrack’ by developing a scalable method called ‘BotCloud’ for detecting botnets based on the relationships between hosts. The evaluation of this method showed good detection accuracy and good efficiency on a Hadoop cluster; however, in this case as well, the authors initially used a botnet-free dataset and later randomly generated botnets in it. Hang et al. [70] have used community-detection-based clustering to identify long-lived, low-intensity flows using graph-based features.

Limitation

Similar to botnet detection methods using statistical features of flow/packet traffic or deep packet inspection, the existing graph-based botnet detection methods in the literature have some major limitations. Many of them apply the botnet detection scheme only in a simulated environment (e.g., [66]). Moreover, the detection approaches proposed in the literature are mostly rule-based, meaning that a predetermined rule needs to be established beforehand to detect botnets from a graph (e.g., [60]); this may lead to unwarranted results if bots behave differently from the common norm. Although many of the graph-based detection schemes use filtering to remove bot-free data (e.g., [65, 67]) before applying a detection method, the amount of data that needs to be investigated to detect botnets is still very large. At the same time, if the dataset is large, the computational expense of the detection approach is often high, which is a significant disadvantage when faster detection is required [65].

Significance of our approach

An important step towards developing a new graph-based detection approach is to devise a method that is fast and does not rely on any particular rule to detect botnets. At the same time, the approach must be validated on a real-world dataset containing different types of botnets, and the detection scheme should be robust enough to detect any kind of botnet present in the dataset. In this study, we propose an approach based on graph-based features that fulfils these requirements. Our main contributions can be summarized as follows:

  • We present a novel graph-based method for the detection of botnets in a computer network.

  • Our approach does not depend on any rules to detect botnets and is capable of capturing the changing behavior of bots.

  • Seven graph-based features are used to characterize the topological structure of the network and to detect botnets.

  • The proposed method can detect different types of botnets with different behavioral characteristics.

  • A real world dataset is used to validate the results.

The rest of the paper is organized as follows. “Data description” provides a brief description of the real dataset used in this study. “Proposed graph based clustering” discusses in detail the seven features used to detect botnets and the clustering methodology implemented to cluster them. “Case study-detecting bots in CTU-13” presents the numerical results obtained by applying the clustering methodology to the real dataset, as well as a comparison against classification techniques. “Conclusions” concludes our work and reviews our main contributions to the existing literature.

Data description

Big data has been an area of interest among researchers in recent years. For instance, Tsai et al. [71] provide a comprehensive review of studies that attempt to develop new schemes capable of handling big data during the input, analysis, and output stages of knowledge discovery. They find that the majority of the existing literature focuses on innovative methods for data mining and analysis, whereas little attention has been given to pre- and post-analysis processing methods. Evolution-based algorithms such as accelerated particle swarm optimization are used to reduce the dimensionality of big data by Fong et al. [72]. The authors investigated the applicability of their method on exceptionally large volumes of high-dimensional data and found that it yields enhanced analytical accuracy within a reasonable amount of processing time. In this study, a large NetFlow dataset that includes botnet behavior is used to validate the proposed detection methodology. We use the CTU-13 dataset, one of the largest labeled datasets available, which contains botnet traffic as well as normal and background labeled data. It was captured at the Czech Technical University (CTU) in 2011. The developers of the dataset originally created it to compare three detection methods, namely the Cooperative Adaptive Mechanism for Network Protection (CAMNEP) method, the BClus detection method, and the BotHunter method [73]. The researchers found that the BClus and CAMNEP detection methods cannot be generalized to all types of botnet behavior; each seems suited to a different type of behavior. Analysis of the BotHunter detection method shows that in real environments it can still be useful to have blacklists of known malicious IP addresses available beforehand.

The CTU-13 dataset has been used by Grill et al. [74] to evaluate the effect of a local adaptive multivariate smoothing (LAMS) model on a NetFlow anomaly detection engine; the proposed method is able to reduce the false alarm rate of anomaly-detection-based intrusion detection systems. More recently, Chanthakoummane et al. [75] have utilized five packet captures of the CTU-13 dataset to evaluate Snort-IDS rules for botnet detection, analyzing the behavior of botnets under three rule packages: botnet-cnc.rules, blacklist.rules, and spyware-put.rules. Experimental results show that botnet-cnc.rules detects botnets with 29,798 alerts, blacklist.rules detects botnets with up to 44 alerts, and spyware-put.rules cannot detect any botnet. The researchers conclude that botnet-cnc.rules is the most proficient in detecting botnets.

Although researchers are hopeful about the potential of the CTU-13 dataset for detecting botnets (e.g., see Malowidzki et al. [76] and Chanthakoummane et al. [75]), to the best of the authors' knowledge, no significant work except [73] has been done using CTU-13 data for botnet detection. The CTU-13 dataset consists of 13 captures (called scenarios) of different botnet samples [61]. The dataset was designed with the following goals:

  • Dataset must have real botnet attacks, not simulated attacks.

  • Must have real world traffic.

  • Must have ground truth labels for training and evaluating methods discussed in [73].

  • Must include multiple types of botnets.

  • Must have several bots infected simultaneously to capture synchronization patterns.

  • Must have NetFlow files to protect the privacy of the users.

A scenario in the CTU-13 dataset can be defined as a particular infection of the virtual machines using a specific malware. The data collection period differs significantly from one scenario to another: the duration of recorded NetFlow data varies from 0.26 to 66.85 h, and the amount of NetFlow data varies accordingly. Multiple types of bots are found in the scenarios. The majority of the scenarios have only one bot (scenarios 1–8 and 13), whereas scenarios 9–12 have multiple bots. The percentage of botnet flows is also negligible (<2%) compared to the total NetFlow for the majority of scenarios; however, the botnet flow percentage increases (6–8%) when multiple bots are present (except in scenario 12). Another distinctive feature of the CTU-13 dataset is that each scenario has been manually analyzed and labeled, with the labeling performed inside the NetFlow files. Table 1 provides a summary of the amount of data and the percentage of botnet traffic in each scenario.

Table 1 Amount of data on each botnet scenario [55]

The proposed approach in this study is the first of its kind to convert the NetFlow features available in the CTU-13 dataset into graph-based features and to use these graph features to detect botnets. As the CTU-13 dataset is the most complete real-world dataset [76], we choose it to prove the concept of our approach, the details of which are discussed in “Proposed graph based clustering”.

Proposed graph based clustering

This section discusses in detail the seven features used to detect botnets and the clustering methodology implemented to cluster these features. First, a directed graph is constructed for each of the 13 datasets of CTU-13. A directed graph (digraph) can be defined as a set of nodes connected by directed edges.

Mathematically, a directed graph can be expressed as an ordered pair G(V,E), where V is a set of nodes and E is a set of edges. In this study, each node denotes a unique IP address and each edge denotes a connection from one IP address to another. Subsequently, the feature values of these directed graphs are calculated, and a clustering methodology is then applied to find the nodes with similar features.
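As an illustration of this construction, the sketch below builds such a digraph from NetFlow records using the networkx library (the study itself uses graph-tool; networkx is substituted here only for readability). The record field names are hypothetical, and packet counts are accumulated as edge weights so that the weighted in/out degrees correspond to the in degree weight and out degree weight features described below.

```python
import networkx as nx

def build_digraph(netflows):
    """Build a directed IP-to-IP graph from NetFlow records.

    Each record is assumed to be a dict with hypothetical field names:
    src_ip, dst_ip, and n_packets. Edge weights accumulate packet counts,
    so weighted degrees give the in/out degree weight features.
    """
    G = nx.DiGraph()
    for flow in netflows:
        src, dst, pkts = flow["src_ip"], flow["dst_ip"], flow["n_packets"]
        if G.has_edge(src, dst):
            G[src][dst]["weight"] += pkts
        else:
            G.add_edge(src, dst, weight=pkts)
    return G

# Toy records, not taken from CTU-13
flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "n_packets": 12},
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "n_packets": 3},
    {"src_ip": "10.0.0.9", "dst_ip": "10.0.0.5", "n_packets": 5},
]
G = build_digraph(flows)
print(G.in_degree("10.0.0.9"))                     # in degree
print(G.in_degree("10.0.0.9", weight="weight"))    # in degree weight
```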

Graph features

The seven features used in this study are in degree, out degree, in degree weight, out degree weight, clustering coefficient, node betweenness, and eigenvector centrality. Graph-tool [77, 78] is used to calculate these seven features. A brief discussion of the seven features and the rationale behind choosing them is provided below:

In degree

If many suspected bots contact a malicious domain for C&C reasons, this will result in a relatively high in degree for that domain. Keeping this in mind, in degree has been chosen as a feature to detect botnets in a network. For a particular node in a directed graph, the in degree can be defined as the total number of head ends (arrows pointing towards the node) adjacent to that node. A high in degree value for a node indicates that neighboring nodes tend to establish more connections to it, whereas a low value indicates the opposite. For example, Fig. 2 shows that node a has an in degree of two.

Fig. 2 A directed graph with three nodes

Out degree

For a particular node in a directed graph, the total number of tail ends (arrows pointing outwards from the node) adjacent to that node is called the out degree of the node. A high out degree value implies that the node tends to make more connections with other nodes, and a low value implies the opposite. Bots tend to make more connections with other potential victim computers to spread the reach of the botnet, or with the C&C domain to transfer information. Hence, out degree can be a useful indicator of botnet activity in a graph. As evident from Fig. 2, node a has an out degree of two.

In degree weight

In degree weight refers to the total number of data packets received by a particular node from its neighboring connected nodes. The mechanics of transferring data packets consist of setting up the data connection to the appropriate ports and choosing the parameters for transfer. Besides the raw data, every data packet also has headers that carry certain types of metadata along with routing information, and trailers that help in refining data transmission [77, 78]. Bots tend to communicate with each other or with the C&C server to transfer information or update their commands, and bots of the same type usually show similar behavior when transferring information or updating commands. We assume that bots will receive the same type of command and approximately the same volume of information, which can be used to differentiate between bots and non-bots.

Out degree weight

Out degree weight can be described as the total number of data packets sent by a particular node to its neighboring connected nodes. As with in degree weight, we assume that bots will show similarity in the volume of data they send out to other IP addresses in the network, so out degree weight can be a useful indicator of botnet activity in a network.

Node betweenness centrality

In graph theory, node betweenness centrality quantifies the number of times a node acts as a bridge along the shortest path between two other nodes. More specifically, node betweenness centrality indicates a particular node's centrality in the graph, i.e., how many shortest paths between all pairs of other nodes pass through that node [79]. Node betweenness centrality can be mathematically expressed as [80]:

$$N_{B} (v) = \sum\limits_{a \ne b \ne v \in V} {\frac{{\sigma_{ab} (v)}}{{\sigma_{ab} }}}$$

where \(\sigma_{ab}\) is the total number of shortest paths from node a to node b and \(\sigma_{ab} (v)\) is the number of those paths that pass through node v. Figure 3 illustrates the concept of node betweenness [81].

Fig. 3 Node betweenness centrality for node ‘a’

For the example in Fig. 3, the node betweenness centrality of node a is

$$n_{B} \left( a \right) = \frac{3.5}{6} \approx 0.583$$

Node betweenness centrality can be a useful feature for detecting botnets, especially P2P botnets, in which bots are interconnected without a central C&C structure. We therefore expect a P2P bot in a botnet to have a higher node betweenness centrality in the graph.

Local clustering coefficient

The local clustering coefficient of a node indicates how concentrated the neighborhood of that node is. More specifically, it is a metric that evaluates how close a node's neighbors are to each other. If \(K_a\) denotes the number of neighbors of node a and \(e_a\) denotes the number of connected pairs among all neighbors of node a, then the local clustering coefficient for node a is given by [82]:

$$C_{a} = \frac{{e_{a} }}{{K_{a} (K_{a} - 1)}}$$
For node a in Fig. 4,

$$C_{a} = \frac{1}{4 \cdot 3} \approx 0.083$$

i.e., the clustering coefficient of node a is 0.083. The local clustering coefficient can also be a very significant indicator of a P2P botnet. As explained before, bots in a P2P botnet have a decentralized structure in which they connect and communicate with each other to remove the need for a centralized server. As a result, interconnectedness, which is essentially what the local clustering coefficient measures, can be a very significant feature for detecting P2P botnets.

Fig. 4 Clustering coefficient of node ‘a’ in directed graph

Eigenvector centrality

Eigenvector centrality, also known as eigencentrality, is a measure of the influence of a node in a graph; it is essentially the weight of a node in the graph [83]. Each node is assigned a relative score based on the concept that connections to high-scoring nodes contribute more to the score of the node than equal connections to low-scoring nodes. Let G(V,E) be a graph where \(\left| V \right|\) is the total number of nodes and \(\left| E \right|\) is the total number of edges, and let \(A = (a_{v,w})\) be the adjacency matrix, where

$$a_{v,w} = \left\{ {\begin{array}{*{20}l} {1 \quad if\;node\;v\;is\;linked\;to\;node\;w} \\ {0 \quad if\;node\;v\;is\;not\;linked\;to\;node\;w} \\ \end{array} } \right.$$

Then the centrality score can be given as

$$x_{v} = \frac{1}{\lambda }\mathop \sum \limits_{w \in M(v)} a_{v,w} x_{w}$$
(1)

where M (v) is the set of neighbors of node v and λ is a constant. Now Eq. (1) can be rewritten as

$$Ax = \lambda x$$

By the Perron–Frobenius theorem, there exists a largest eigenvalue λ with an associated positive eigenvector, which can be computed using the power method [84]. Here λ is the largest eigenvalue associated with the eigenvector of the adjacency matrix [85]. Eigenvector centrality is a natural extension of degree centrality: in-degree centrality awards an equal centrality point for every link a node receives, but not all nodes are equivalent; some are more important than others, and, reasonably, connections from important nodes count more. We expect that the eigenvector centrality of bots will differ significantly from that of non-malicious nodes, and hence it is used as a feature to detect botnets in this study.
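As a concrete illustration of Eq. (1), the following minimal numpy sketch computes eigenvector centrality by power iteration on a toy adjacency matrix. It is not the graph-tool routine used in this study, only a sketch of the underlying computation Ax = λx.

```python
import numpy as np

def eigenvector_centrality(A, iters=100, tol=1e-9):
    """Power iteration for Ax = lambda*x on adjacency matrix A.

    Returns the centrality vector normalized to unit length.
    """
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)       # uniform starting vector
    for _ in range(iters):
        x_new = A @ x                 # one multiplication by A
        norm = np.linalg.norm(x_new)
        if norm == 0:                 # degenerate case: no surviving mass
            return x
        x_new /= norm
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy strongly connected 3-node digraph: a->b, a->c, b->c, c->a
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)
print(eigenvector_centrality(A))
```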

Self organizing map

The self-organizing map (SOM) belongs to a class of unsupervised systems based on competitive learning, in which the output neurons compete amongst themselves to be activated. The primary goal of an SOM is to convert incoming data of arbitrary dimension into a one- or two-dimensional discrete map, and to perform this transformation adaptively in a topologically ordered fashion [86]. In this study, we consider a particular kind of SOM known as the Kohonen network, developed by Teuvo Kohonen in 1982 [87].

The basic structure of an SOM is shown in Fig. 5, which depicts a 3×3 SOM network. For this small SOM network, there are 63 connections. Notice that the map nodes (\(C_{1}\)–\(C_{9}\)) are not connected to one another. In this 2-D representation of the SOM, each map node has a unique (i,j) coordinate. Since map nodes are connected only to the input vector (\(F_{1}\)–\(F_{7}\)), map nodes are never aware of the values of other map nodes. A map node's weight (W) is updated only as dictated by the input vector. Algorithm 1 illustrates the basic methodology behind the SOM.

Fig. 5 Structure of SOM

In this study, we use a 5×5 network. The weights of each map node in the network are initially assigned seven random values, one for each element of the input vector. Input vectors, each containing the seven graph features, are then presented to the network, and steps 3–5 of Algorithm 1 are followed to obtain the desired number of clusters.
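The following numpy sketch illustrates the general structure of Kohonen-style SOM training on a 5×5 grid with seven-dimensional inputs: best-matching-unit selection followed by a Gaussian neighborhood update with decaying learning rate and radius. It is an illustrative sketch of the idea behind Algorithm 1, not the exact implementation used in the study; all parameter values are assumptions.

```python
import numpy as np

def train_som(X, grid=(5, 5), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a Kohonen SOM on data X (n_samples x n_features).

    Returns the learned weight grid of shape (rows, cols, n_features).
    """
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.random((rows, cols, X.shape[1]))          # random initial weights
    coords = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols),
                                   indexing="ij"))    # (i, j) grid coordinates
    n_steps = epochs * len(X)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(X):
            # decay the learning rate and neighborhood radius over time
            frac = step / n_steps
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 1e-3
            # best matching unit: map node whose weights are closest to x
            dists = np.linalg.norm(W - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood around the BMU on the grid
            grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
            h = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))
            # pull the BMU and its neighbors toward the input vector
            W += lr * h[..., None] * (x - W)
            step += 1
    return W

def assign_clusters(X, W):
    """Assign each input vector to the index of its best matching map node."""
    flatW = W.reshape(-1, W.shape[-1])
    return np.argmin(np.linalg.norm(X[:, None, :] - flatW[None], axis=2), axis=1)
```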

In essence, Algorithm 1 screens the dataset and assigns the nodes to different clusters; it does not by itself distinguish bots from non-bots. Hence, a second algorithm is developed to detect bots within the clusters. This bot detection algorithm, which is run after Algorithm 1, is illustrated below:

In this study, the number of nodes in all clusters other than the cluster with the largest number of nodes, i.e., the number of nodes that need to be searched further to identify the bots, is denoted by \(N_{s}\). This parameter is a measure of the efficacy of the filtering performed by the SOM clustering. After this filtering step, a practitioner would use other methods to detect bots within the remaining nodes (e.g., statistical classification, signature analysis, or consultation with subject matter experts).
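As a simple illustration, assuming the cluster assignments produced by the SOM sketch above, \(N_{s}\) can be computed by discarding the most populous cluster and counting the remaining nodes. This sketch captures only the counting step, not the full bot search of Algorithm 2.

```python
import numpy as np

def nodes_to_search(cluster_labels):
    """Return N_s: the number of nodes outside the largest cluster.

    cluster_labels maps each node to its SOM cluster index; the largest
    cluster is assumed to hold normal nodes and is excluded from the search.
    """
    labels, counts = np.unique(cluster_labels, return_counts=True)
    largest = labels[np.argmax(counts)]
    return int(np.sum(cluster_labels != largest))
```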

Case study-detecting bots in CTU-13

We apply SOM to the CTU-13 dataset to investigate the effectiveness and efficiency of our proposed method. An enhanced filtering algorithm based on node degree is proposed to further improve botnet detection efficiency. The results are benchmarked against a support vector machine based classification algorithm to demonstrate the strength of our proposed procedure.

Graph features extraction

We first extract the seven graph-based features from the CTU-13 datasets as discussed in “Proposed graph based clustering”. Because the CTU-13 datasets contain more than 20 million NetFlow records, high performance computing is needed to speed up the extraction of graph-based features. The feature extraction tasks are performed using the Shadow-2 system, a supercomputer cluster available at the High Performance Computing Collaboratory (HPC2) of Mississippi State University. The Shadow system is a Cray CS300-LC cluster with 4800 Intel Ivy Bridge processor cores and 28,800 Intel Xeon Phi cores; each compute node has 20 processors, and all processors share the 500 GB of memory available to them. However, in this study only one processor was used, as a sequential algorithm was implemented for extracting the feature values. With the aid of this high performance computing capacity, we are able to extract the graph-based features from all CTU-13 datasets within 30 h. The resulting graph-based features are numbered and labeled Feature 1–Feature 7 for notational convenience, as shown in Table 2.

Table 2 Graph-based features used for clustering

Graph based botnet detection using clustering

We apply the SOM-based botnet detection algorithm (Algorithm 1) to the extracted graph-based features. Figure 6 shows the results of SOM clustering for dataset 6. There are a total of 25 cells, each representing a possible cluster of graph-based features. We chose the total number of cells to be 25 so that the SOM algorithm captures various types of node behavior without significantly increasing computation costs.

Fig. 6 Mapping nodes of CTU-13/dataset-6 into different clusters based on feature values

The numbers in each cell represent the total number of nodes belonging to the corresponding cluster; these nodes share similar behaviors in terms of the identified graph-based features. For example, there are 105,672 nodes in the biggest cluster (in blue), which accounts for over 99% of the nodes in dataset 6. Since abnormal/malicious behaviors are rare in most real-world networks, these nodes are unlikely to be bots. This narrows down the identification of bots to the remaining nodes, which account for less than 1% of the total. Similar observations are made for the other CTU-13 datasets: the majority of nodes belong to the biggest cluster and can be eliminated from consideration for bot detection (see Table 3). For most datasets, the biggest cluster contains over 99% of the nodes. This allows us to eliminate the majority of each dataset from further bot identification, significantly reducing the cost of computation.

Table 3 Number of nodes in the biggest cluster (normal nodes)

We apply the proposed bot search algorithm (Algorithm 2) to the clusters obtained via SOM. Table 4 shows the number of nodes to search (\(N_{s}\)) to identify all bots in each dataset, along with the sizes of the clusters that contain the bots. Bots are isolated in small clusters for most datasets; as a result, they can be identified by examining a limited number of nodes, i.e., small \(N_{s}\) values. For dataset 11, which contains three bots, the proposed algorithm requires searching only 24 nodes to identify the first two bots, but an additional 1306 nodes to identify the third bot. The last column reports the percentage of nodes that must be examined to identify the bots in each dataset, calculated as the ratio between \(N_{s}\) and the total number of nodes. For most of the datasets, the proposed algorithm requires examining less than 0.2% of the total nodes to identify all bots, which significantly reduces the required computation costs.

Table 4 Number of nodes to search for bot identification (\(N_{s}\))

Enhanced bot detection via filtering

To improve the efficiency of bot detection, we apply a filtering step that removes single-degree nodes before clustering. These nodes have only one connection and remain inactive during most of the data collection period. Considering that the nature of bots is to attack and infect normal nodes, such nodes are highly unlikely to be bots. We filter out these nodes and run the SOM clustering algorithm on the graph features of the remaining nodes. The filtering step significantly reduces the number of nodes contained in the clusters that include bots. For example, Fig. 7 shows that the number of nodes in the bot cluster of dataset 7 drops from 11 to 8, making the bot easier to identify. A similar decrease is also observed for the clusters of normal nodes.
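A minimal sketch of this filtering step, assuming the networkx digraph sketched earlier: nodes whose total (in plus out) degree equals one are dropped before the SOM clustering is re-run.

```python
import networkx as nx

def drop_single_degree_nodes(G: nx.DiGraph) -> nx.DiGraph:
    """Remove nodes with a total degree of one before clustering.

    Such nodes made a single connection during the capture and are
    highly unlikely to be bots, so filtering them shrinks the clusters.
    """
    keep = [n for n in G.nodes if G.in_degree(n) + G.out_degree(n) > 1]
    return G.subgraph(keep).copy()
```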

Fig. 7 Size of bot cluster in dataset 7 before and after filtering (star is the bot)

The resulting numbers of nodes to search for bot identification (\(N_{s}\)) are shown in Table 5. A significant reduction in \(N_{s}\) can be observed. For example, 1306 nodes needed to be searched to identify the third bot in dataset 11; after filtering, only 384 nodes need to be searched, a reduction of over 2% of the total number of nodes. For dataset 12, the two clusters containing bots are combined into one after filtering, requiring a search of only 36 nodes compared to 113 nodes before filtering. However, we also observe that the \(N_{s}\) values increase slightly for datasets 5 and 6, which may result from the randomness of the clustering algorithm.

Table 5 Improvement of \(N_{s}\) using filtering

Benchmark against classification

We compare our approach with a detection method based on a support vector machine (SVM), a supervised machine learning technique used for classification and regression analysis [88,89,90]. MATLAB packages are used to implement the SVM classification. We train the SVM using datasets 9 and 10 and apply the SVM classifier to the remaining CTU-13 datasets. The number of bots detected in each dataset using the SVM classifier is presented in Table 6. The results show that the SVM classifier is capable of detecting the bot in dataset 2 only; for the rest of the datasets, the bots cannot be detected. The results did not improve when other datasets were used for training.
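For readers who wish to reproduce this kind of baseline, the sketch below trains an SVM on the seven graph-based features using scikit-learn rather than the MATLAB packages used in the study. The train/test split mirrors the description above, but the data and variable names are purely illustrative.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_and_predict(X_train, y_train, X_test):
    """Train an SVM baseline on graph features and predict bot labels.

    X_train/y_train would hold Features 1-7 and bot/normal labels from the
    training scenarios; X_test holds features from a held-out scenario.
    """
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", class_weight="balanced"))
    clf.fit(X_train, y_train)
    return clf.predict(X_test)

# Tiny synthetic illustration (not CTU-13 data)
rng = np.random.default_rng(0)
X_train = rng.random((200, 7))
y_train = (X_train[:, 0] > 0.9).astype(int)   # pretend high Feature 1 means bot
X_test = rng.random((50, 7))
print(train_and_predict(X_train, y_train, X_test).sum(), "nodes flagged as bots")
```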

Table 6 Number of detected bots using SVM

The graph-based features (Features 1–7) of bots vary tremendously across all 13 datasets. The ranges of Features 1–7 for the bots are summarized in Fig. 8. For example, the values of Feature 1 range from 1 to 6842, and those of Feature 2 range from 3 to 11,571 across the bots in the CTU-13 datasets. Similar observations hold for the other features. This means that there is no single interval of values of Features 1–7 that indicates bots. In other words, the value range of graph-based features associated with bots is so large that it also includes a portion of normal nodes, and thus the bots cannot be distinguished from normal nodes based only on the values/intervals of the features. This partially explains why classification- or rule-based bot detection methods (e.g., SVM) may not be able to detect the bots: these methods rely on estimating intervals of feature values to distinguish bots from normal nodes, and these intervals vary tremendously from one dataset to another, so a classifier trained on one dataset may not apply to another. Our method, on the other hand, is more robust against the changing values of features: as long as the bots behave differently from normal nodes, i.e., as long as the graph-based features of the bots differ from those of normal nodes, such differences can be captured by our clustering-based detection algorithm.

Fig. 8 Varying range of graph-based features across CTU-13 datasets

Conclusions

In this work, we propose a graph-based botnet detection approach that can detect the changing behaviors of bots. This is novel because the existing approaches rely mainly on flow-based features and thus do not capture the changes in the topological structure of networks caused by bot activities. We investigate seven graph-based features that may be connected to bot activities: in degree, out degree, in degree weight, out degree weight, clustering coefficient, node betweenness, and eigenvector centrality. SOM is applied to establish clusters of nodes based on these graph features. Our approach is capable of isolating bots in clusters of very small size (fewer than 100 nodes), which enables fast detection of bot nodes. The proposed algorithm is further enhanced by filtering out inactive nodes, which are unlikely to be bots. We verify the proposed methods using the CTU-13 datasets, the largest dataset that consists of bot-labeled nodes. Numerical results show that our proposed procedure is capable of detecting the bots by searching a limited number of nodes (less than 0.1% of all nodes).

We compare our approach with a classification-based SVM detection algorithm using the same graph-based features. The SVM method is not capable of detecting most of the bots because the values of the bot features vary across datasets. The advantage of our approach is that we focus on capturing the abnormal behaviors of bots in terms of their graph-based features. In other words, our method is more robust against the changing behaviors of bots because it does not rely on any particular value or range of the features; even with different types of bot behavior, the proposed method can still detect bots with reasonable accuracy. What this approach ensures is that bots will be found in small clusters, with the majority of nodes (>99%) removed from further consideration. Numerical results from our study show that, as long as the bots behave differently from normal nodes, such differences can be captured by our clustering-based detection algorithm. Future work is needed to incorporate additional, relevant graph-based features into the detection methodology and to reduce the computational cost of graph feature extraction, since feature extraction contributes substantially to the overall computational cost.

Note to the practitioners

Botnet detection has received significant attention from the cyber security research community in recent years due to the potentially catastrophic effect botnets can have on critical internet infrastructure. This work is an effort to help practitioners detect botnets more efficiently and quickly. The method is applicable to any real-world network with varying bot characteristics; in this study, seven different types of bots were present and were detected successfully. The method also reduces a large dataset into smaller ones, lessening the computational burden significantly, which is particularly useful for practitioners who work with real-world datasets that are often huge. Hence, this detection approach can help them ensure the integrity of their cyber networks and protect them from future botnet attacks.