SN Computer Science, 1:16

Darknet Traffic Analysis and Classification Using Numerical AGM and Mean Shift Clustering Algorithm

  • R. Niranjana
  • V. Anil Kumar
  • Shina Sheen
Original Research
Part of the following topical collections:
  1. Advances in Internet Research and Engineering


Cyberspace continues to grow more complex than ever anticipated, and so do its security dynamics. As our dependence on cyberspace increases day by day, regular and systematic monitoring of cyberspace security has become essential. A darknet is one such monitoring framework for deducing malicious activities and attack patterns in cyberspace. Darknet traffic is the spurious traffic observed in the empty address space, i.e., a set of globally valid Internet Protocol (IP) addresses that are not assigned to any host or device. In an ideal secure network system, no traffic is expected to arrive in such a darknet IP space. In reality, however, a noticeable amount of traffic is observed in this space, primarily due to Internet-wide malicious activities and attacks, and sometimes due to network-level misconfigurations. Analyzing such traffic and finding the distinct attack patterns present in it can be a potential mechanism to infer attack trends in the real network. In this paper, the existing Basic and Extended AGgregate and Mode (AGM) data formats for darknet traffic analysis are studied, and an efficient 29-tuple Numerical AGM data format suitable for analyzing source-IP-address-validated TCP connections (three-way handshake) is proposed to find attack patterns in this traffic using the Mean Shift clustering algorithm. Analyzing the patterns detected in the clusters reveals traces of various attacks such as Mirai bot, SQL attack, and brute force. Analyzing source-IP-validated TCP darknet traffic is a potential technique in cyber security for finding attack trends in the network.


Darknet traffic analysis · Pattern recognition · Clustering · AGgregate and Mode


The ARPANET of the 1960s was not built with security in mind, and the rapid growth of the network has made it insecure. Because of the characteristics of digitally stored information, an intruder can delay, disrupt, corrupt, exploit, destroy, steal, and modify digital data. Depending on the value of the information, such actions have different impacts with varying degrees of damage.

Intruders and hackers use different techniques and methods to exploit information, including worms, malware, viruses, and phishing. Cyber security [1] consists of the techniques for protecting computers, networks, programs and data from unauthorized access or attacks aimed at exploitation. It helps to detect, prevent and recover from malicious activities. One cyber security strategy is to analyze darknet traffic. Darknet and darknet traffic are often referred to as darkspace, black hole monitors, network telescopes, unsolicited network traffic, Internet Background Radiation (IBR) [2], spurious traffic, etc.

Analyzing darknet traffic is known to be powerful because this traffic is smaller in volume than real traffic and contains abundant traces of malicious activity. It is therefore advantageous to analyze this traffic to find attack trends in the real network. Each attack follows a pattern in exploiting the data present in the network. Finding these patterns helps to trace them back to the corresponding attacks, and clustering is a good approach for finding patterns in unclassified data such as darknet traffic.

Iglesias and Zseby [3] applied a consensual clustering algorithm to the AGM data format they proposed and recognized attack patterns in the /8 IPv4 CAIDA UCSD Network Telescope “Patch Tuesday” Dataset [4]. Similarly, in this research work, attack trends and patterns have been found by applying clustering techniques to darknet traffic, using a suitable and efficient 29-tuple data format specially designed for the analysis of source-IP-address-validated darknet traffic.

Literature Survey

A survey by Fachkha and Debbabi [5] provides definitions of darknet and compares darknet with other trap-based monitoring systems such as IP Gray Space, Honeypot, Honeynet, and Greynet. It also gives a clear view of the research aspects of darknet and explains darknet sensor deployment, data-handling techniques, analysis techniques and their contributions.

Bhuyan et al. [6] discussed various tools such as Wireshark, Gulp, tcpdump, and nmap, and methods such as classification (SVM, KNN, and Neural Networks), clustering and other statistical techniques for analyzing network traffic. Sperotto et al. [7] explained why payload inspection is not suitable for high-speed traffic analysis, which leaves header inspection as the remaining option.

CoralReef [8] is a tool devised to monitor and analyze traffic at the network level using a simple 5-tuple data format: source IP, destination IP, source port, destination port, protocol as mentioned in [9], i.e., using only the header information. To analyze the darknet traffic, Corsaro [10], a high-speed analysis software suite was devised which used an 8-tuple data format: source IP, destination IP, source port, destination port, protocol, Time To Live (TTL), TCP flags and packet length for analysis.

Iglesias and Zseby [3] proposed an efficient darknet traffic flow representation format called AGgregation and Mode (AGM), a 22-tuple representation, and extended it to a 50-tuple format called Extended AGM to obtain numerical parameters. In this paper, a 29-tuple Numerical AGM, extended from the 22-tuple AGM and suitable for analyzing validated unsolicited TCP traffic, is proposed. These data formats are discussed in detail in the forthcoming sections.

Recently, darknet traffic analysis has attracted researchers’ attention. The following research works give insights into how darknet traffic can be used for security analysis. Wang et al. [11] used darknet traffic to infer the temporal behavior of Internet worms by applying statistical estimation techniques, and proposed method of moments, maximum likelihood, and linear regression estimators for the analysis. Dainotti et al. [12] presented the measurement and analysis of a horizontal scan of the entire IPv4 address space conducted by the Sality botnet in February 2011, including general methods to correlate, visualize, and extrapolate the botnet behavior across the global Internet.

Bou-Harb et al. [13] introduced a novel probabilistic darknet preprocessing model for data sanitization and reduced the dimensions of big data by extracting and analyzing probing time series using formal methods rooted in Fourier transform and Kalman filtering. Bou-Harb et al. [14] leveraged unsolicited real darknet data and proposed a novel system, CSC-Detector, that aims to identify Cyber Scanning Campaigns. It was empirically evaluated and validated using 240 GB of real darknet data. The outcome disclosed three recent, previously unreported large-scale probing campaigns targeting diverse Internet services.

Proposed Model

Figure 1 shows the overall workflow of the proposed model. The raw darknet traffic data stored in the pcap file are first converted into a more convenient and appropriate data format, the AGM format. Each AGM is then transformed into the Numerical AGM format so that clustering can be applied in the next step. The distinct patterns present in the clusters are found by human inspection. Finally, the detected patterns are mapped to the corresponding attacks and the results are obtained. Each step is explained in detail in “Darknet Dataset”, “Preprocessing and Clustering”, and “Results and Analysis”.

Fig. 1 Workflow of the proposed model

Darknet Dataset

The darknet traffic data used in the experiment are taken from a /24 network. These data consist of TCP three-way handshake traffic originating from all over the Internet and targeted at the /24 dark IP address space. Each connection request (SYN packet) is validated against source IP address spoofing by responding to the incoming SYN packet with a SYN/ACK packet and subsequently establishing a TCP connection. The data were collected and analyzed for a period of 20 days, from 01 July to 20 July 2017. Each day’s pcap file is around 150–600 MB in size.

Basic AGM

Iglesias and Zseby [3] proposed a 22-feature vector AGM data format, suitable for darknet traffic analysis as shown in Listing 1. Each AGM corresponds to a particular source IP (srcIP_i) and it contains the aggregation and mode traffic information of that source IP as observed in a particular time window. A 24-h time window is used in this experiment.

Listing 1: AGM data format

  • # stands for number of. For example, #Protocol is the number of different protocols used by srcIP_i during the observed time period.

  • M(…) stands for statistical mode. For instance, M(Protocol) is the most frequently used protocol by srcIP_i during the observed time period.

  • #pkts indicates the number of packets sent by srcIP_i during the observed time period.

  • For example, #pkts[M(Protocol)] stands for the number of packets sent by srcIP_i using the most frequent protocol in the observed time period.

The advantage of the AGM format is that the aggregation and mode fields give a deep understanding and characterization of a particular source IP. An AGM can indicate whether the source IP was performing a horizontal scan or sending backscatter data. Algorithm 1 gives the pseudo code of AGM generation from the raw darknet traffic data.
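The aggregation and mode idea described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1: the packet records, the field names (src_ip, dst_ip, dst_port, flag) and the function name build_agms are assumptions made for illustration, and only the three kinds of features explained in the list above (#X, M(X) and #pkts[M(X)]) are computed, not the full 22-feature vector.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed packet records; the field names are assumptions
# and the addresses come from documentation ranges.
packets = [
    {"src_ip": "198.51.100.7", "dst_ip": "192.0.2.1", "dst_port": 23, "flag": 0x02},
    {"src_ip": "198.51.100.7", "dst_ip": "192.0.2.2", "dst_port": 23, "flag": 0x02},
    {"src_ip": "198.51.100.7", "dst_ip": "192.0.2.3", "dst_port": 2323, "flag": 0x10},
    {"src_ip": "203.0.113.9", "dst_ip": "192.0.2.1", "dst_port": 22, "flag": 0x02},
]

FIELDS = ("dst_ip", "dst_port", "flag")


def build_agms(packets, fields=FIELDS):
    """Return one aggregate-and-mode record per source IP."""
    per_source = defaultdict(list)
    for pkt in packets:
        per_source[pkt["src_ip"]].append(pkt)

    agms = {}
    for src_ip, pkts in per_source.items():
        agm = {"#pkts": len(pkts)}  # packets sent by this source
        for field in fields:
            counts = Counter(pkt[field] for pkt in pkts)
            mode_value, mode_pkts = counts.most_common(1)[0]
            agm[f"#{field}"] = len(counts)         # number of distinct values
            agm[f"M({field})"] = mode_value        # most frequent value
            agm[f"#pkts[M({field})]"] = mode_pkts  # packets carrying the mode
        agms[src_ip] = agm
    return agms


agms = build_agms(packets)
```

In this toy input, the first source contacts three different destination addresses on mostly the same port, which is the kind of signature a horizontal scan would leave in an AGM.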

Numerical AGM

A closer look at the Basic AGM shows that some of the statistical mode features are not numerical: M(dstIP), M(srcPort), M(dstPort), M(Protocol) and M(flag). Even though M(srcPort), M(dstPort), M(Protocol) and M(flag) are numbers, their values cannot be interpreted numerically, as they do not represent a count or a magnitude like M(length). Therefore, they cannot be used to compute distance values during the clustering process. Dummy variables can be used to convert such categorical attributes into numerical attributes.
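As a minimal sketch of the dummy-variable conversion mentioned above, the following Python snippet turns a categorical mode value into a list of 0/1 indicators. The helper name dummy_encode and the shortened port list are assumptions for illustration; the flag values are the four named later in this section.

```python
def dummy_encode(value, categories):
    """Return a 0/1 indicator list with one position per retained category."""
    return [1 if value == category else 0 for category in categories]


# Illustrative category lists: the port list is a shortened, assumed stand-in
# for the retained destination ports; the flags are those named in the paper.
PORTS = (23, 22, 2323)
FLAGS = (0x02, 0x10, 0x18, 0x14)

mode_dst_port = 22   # M(dstPort) of some AGM
mode_flag = 0x18     # M(flag) of the same AGM

encoded = dummy_encode(mode_dst_port, PORTS) + dummy_encode(mode_flag, FLAGS)
```

A value outside the retained categories maps to all zeros, so rare ports do not enlarge the feature vector.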

In Ref. [3], a 50-tuple Extended AGM format was proposed for applying the clustering algorithm on the darknet traffic data. Dummy variables were used by taking the top 1% of the field values into account. The M(dstIP) feature was also removed under the assumption that the darkspace addresses were known. This feature is of great significance and will be included when attempting to find a hotspot in the darkspace. In this paper, a 29-tuple Numerical AGM format suitable for validated TCP darknet traffic analysis is proposed. Algorithm 2 gives the pseudo code of Numerical AGM generation from the AGM data.

Listing 2: Numerical AGM data format

The points that were taken into account while proposing the Numerical AGM:
  • M(dstIP) was removed as the darkspace addresses are known and finding a hotspot in the darkspace is not the main focus.

  • M(srcPort) was removed since it is not an important feature as it is randomly set by the source.

  • The M(dstPort) field has been expanded into 10 different features by adding dummy variables for the top 10 destination ports, selected by frequency as shown in Table 1.
    Table 1 Destination ports frequency distribution in percentage

    Columns: Destination port | 1st July 2017 | 2nd July 2017 | 3rd July 2017 | 4th July 2017 | 5th July 2017
    (Table body lost in extraction; the only surviving cell values are < 0.97, < 0.86, < 0.96, < 0.78 and < 0.94.)

  • All the fields related to Protocol have been removed, as only TCP traffic is considered here.

  • The M(flag) field has been expanded into 4 dummy variables: 0x02 (SYN), 0x10 (ACK), 0x18 (PSH + ACK), 0x14 (RST + ACK). These were the only flags that contributed to M(flag), and 0x10 (ACK) and 0x14 (RST + ACK) contributed less than 1% (Figs. 2, 3). Note that packets with the flag 0x12 (SYN + ACK) set are not taken into account for analysis, as they are sent by the TCP responder as part of the TCP three-way handshake.
    Fig. 2 Flag count distribution each day

    Fig. 3 Frequency distribution of flags

Preprocessing and Clustering

Preprocessing the data is a crucial step in machine learning and also in the proposed method for darknet traffic analysis. Dimensionality reduction is a part of preprocessing. The curse of dimensionality is a term introduced by Bellman [15] to describe the problem caused by the exponential increase in volume that comes with adding extra dimensions to the Euclidean space. Clustering the data directly on all 29 features gave poor results. To overcome the curse of dimensionality, the data can be reduced to fewer dimensions. PCA (principal component analysis) [16, 17] was used to reduce the 29-dimensional space to a 3-dimensional space.
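The reduction from 29 to 3 dimensions can be sketched with scikit-learn's PCA as follows. The input matrix is synthetic random data standing in for Numerical AGMs, and the standardization step is an assumption added for illustration; the paper does not state whether the features were scaled before PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for 200 Numerical AGMs with 29 features each.
rng = np.random.default_rng(0)
X = rng.random((200, 29))

# Assumption: standardize features so that no single field dominates the variance.
X_scaled = StandardScaler().fit_transform(X)

# Project the 29-dimensional vectors onto the first three principal components.
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X_scaled)

# Fraction of the total variance retained by the three components.
explained = pca.explained_variance_ratio_.sum()
```

Inspecting the retained variance fraction is one way to judge how much information the 3-dimensional projection discards.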

As the darknet traffic data are not labeled, there is no training dataset for classification. Therefore, an unsupervised learning technique is a good choice for finding attack patterns in the traffic. Clustering groups data points with similar behavior, which helps to find the AGMs corresponding to a specific attack pattern. There are two types of clustering algorithms:
  • Flat clustering: Partitions are independent of each other and the number of partitions/clusters should be given as an input, e.g., k-means [18], k-medoids [19], Gaussian mixture models, etc.

  • Hierarchical clustering: The number of clusters need not be specified, e.g., Birch [20], Mean Shift, etc.

For our darknet traffic, the number of distinct clusters or patterns present in a day’s data is not known. Therefore, a clustering algorithm that does not need the number of clusters as an input is more suitable for this experiment. One such algorithm is the Mean Shift clustering algorithm [21], which is governed by a kernel bandwidth parameter that determines the clusters. The Mean Shift implementation available in the scikit-learn library was used to cluster the AGMs. It produced around 100 clusters each day, but most of the clusters contained very few data points. Therefore, the four largest clusters were taken into account for pattern recognition. Note that each cluster may contain more than one pattern. Clustering here simply groups the AGMs for easier human inspection. After this point, the AGMs present in each cluster have to be examined manually to find distinct attack patterns.
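A minimal sketch of this step with scikit-learn's MeanShift is given below. The 3-dimensional points are synthetic blobs standing in for PCA-reduced AGMs, and the use of estimate_bandwidth with quantile=0.2 is an assumption; the paper does not report how the bandwidth was chosen.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic 3-D points standing in for PCA-reduced AGMs: three blobs of
# different sizes around assumed centers.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
sizes = (120, 60, 20)
points = np.vstack(
    [c + 0.3 * rng.standard_normal((n, 3)) for c, n in zip(centers, sizes)]
)

# Assumption: derive the kernel bandwidth from the data instead of fixing it.
bandwidth = estimate_bandwidth(points, quantile=0.2, random_state=0)
model = MeanShift(bandwidth=bandwidth)
labels = model.fit_predict(points)

# Keep the (at most) four largest clusters for manual inspection.
cluster_sizes = Counter(labels.tolist())
top_clusters = [label for label, _ in cluster_sizes.most_common(4)]
```

A smaller bandwidth tends to split the data into more, smaller clusters, which is consistent with the many sparsely populated clusters reported above.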

Results and Analysis

Figure 4a–j shows the plots obtained after clustering the preprocessed AGMs from July 01 to July 10, 2017. The data points of the largest cluster are marked red, those of the second largest green, the third blue, and the fourth yellow. After clustering the AGMs, the packets corresponding to each cluster were stored separately in four pcap files per day. Then, to find the packet distribution in each cluster throughout the day, the packet count for each hour was calculated and plotted against the hours of the day. The graphs in Fig. 5a–j indicate that the traffic in each cluster is spread almost throughout the whole day.

Fig. 4 a Clusters on July 01, 2017. b Clusters on July 02, 2017. c Clusters on July 03, 2017. d Clusters on July 04, 2017. e Clusters on July 05, 2017. f Clusters on July 06, 2017. g Clusters on July 07, 2017. h Clusters on July 08, 2017. i Clusters on July 09, 2017. j Clusters on July 10, 2017

Fig. 5 a Traffic distribution throughout the day on July 01, 2017. b Traffic distribution throughout the day on July 02, 2017. c Traffic distribution throughout the day on July 03, 2017. d Traffic distribution throughout the day on July 04, 2017. e Traffic distribution throughout the day on July 05, 2017. f Traffic distribution throughout the day on July 06, 2017. g Traffic distribution throughout the day on July 07, 2017. h Traffic distribution throughout the day on July 08, 2017. i Traffic distribution throughout the day on July 09, 2017. j Traffic distribution throughout the day on July 10, 2017
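The hourly packet counts behind such traffic-distribution plots can be computed along the following lines. This is a sketch under assumptions: the timestamps are hypothetical epoch values for one cluster, the hours are taken in UTC, and the function name packets_per_hour is illustrative.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical capture timestamps (seconds since the epoch) for one cluster.
timestamps = [
    datetime(2017, 7, 1, 0, 15, tzinfo=timezone.utc).timestamp(),
    datetime(2017, 7, 1, 0, 45, tzinfo=timezone.utc).timestamp(),
    datetime(2017, 7, 1, 13, 5, tzinfo=timezone.utc).timestamp(),
    datetime(2017, 7, 1, 23, 59, tzinfo=timezone.utc).timestamp(),
]


def packets_per_hour(timestamps):
    """Return a 24-element list with the packet count for each hour (UTC)."""
    hours = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).hour for ts in timestamps
    )
    return [hours.get(hour, 0) for hour in range(24)]


hourly = packets_per_hour(timestamps)
```

Plotting the resulting list against the hours 0 to 23 gives one curve per cluster and day.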

After manually analyzing the AGMs belonging to each cluster, we could find distinguishable attack patterns. For example, traces of Mirai Bot [22] and Brute Force attack were found in the first cluster, SQL injection attempt patterns were found in the second cluster, and the third cluster had some traces of Remote Desktop attack.

We could also infer that the most predominant destination ports used by the source IPs were 23, 22, 2323, 81, 9000, 7547, 1433, and 3389, and that most of the packets were set with the flags 0x02 (SYN) and 0x18 (PSH + ACK).

Algorithm 3 gives the overall flow of the darknet traffic analysis procedure. Steps 1 and 2 have been discussed in Algorithms 1 and 2, respectively. Steps 3–6 are done by implementing the corresponding algorithms using built-in functions. Steps 7, 8 and 9 are done manually using domain knowledge about well-known attacks. Each pattern can be mapped to an attack based on the destination ports and the source IPs used. The patterns found can be stored and compared with the AGMs of future data. Near real-time analysis of the darknet traffic can then be done to find attack patterns faster. Each traffic pattern was analyzed and mapped [23] to well-known attacks.

The obtained patterns are mapped to attacks based on domain knowledge and on the attack profiles identified by other researchers. There are also some other patterns in the darknet traffic that are not mapped to attack trends because that traffic was very small in volume, say 1 AGM out of 33,000 AGMs. The advantage of analyzing this traffic is that hidden attacks that were not identified in the real network can be found, similar to the research done by Bou-Harb et al. [14]. If an unknown pattern is found in large quantity after clustering, it can be inspected using domain knowledge so that hidden attacks can also be found and labeled.
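The idea of storing known patterns and comparing them with future AGMs can be sketched as a simple lookup keyed on the most frequent destination port. The port-to-attack associations are the ones reported in the subsections that follow; the dictionary layout, the "unmapped" label and the function name map_agm_to_attack are illustrative assumptions, not the authors' procedure, which relies on manual inspection.

```python
# Port-to-attack associations as reported in this section; the data structure
# itself is an illustrative assumption.
PORT_TO_ATTACK = {
    23: "Mirai Bot",
    22: "Brute Force Attack",
    2323: "Mirai Bot",
    81: "Mirai Bot",
    9000: "SQL Attack",
    7547: "Mirai Bot",
    1433: "SQL Attack",
    3389: "Remote Desktop Attack",
}


def map_agm_to_attack(agm):
    """Return the candidate attack for an AGM based on its M(dstPort) value."""
    return PORT_TO_ATTACK.get(agm.get("M(dstPort)"), "unmapped")


# Hypothetical AGM fragment from a later day's traffic.
new_agm = {"M(dstPort)": 2323, "M(flag)": 0x02}
label = map_agm_to_attack(new_agm)
```

Such a lookup only yields a candidate label; patterns that fall outside the table would still need inspection with domain knowledge.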

Activity on TCP Port 23

Around 55% of the total source IPs detected per day were attempting connections to TCP port 23 in the darkspace. Port 23 corresponds to telnet, a relatively old but popular Internet protocol for remote login. This activity may be attributed to Mirai Bot, because Mirai Bot was found to use port 23 as one of the destination ports for its propagation. The following describes the profile of TCP23.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 23,

#flag = 2, M(flag) = SYN

Activity on TCP Port 22

Approximately 10% of the total IPs detected per day were attempting connections most often to TCP port 22 in the darkspace. Port 22 is linked to SSH, a network protocol used for remote access. This port is vulnerable to brute force password-cracking attempts, so this pattern may be mapped to the well-known brute force attack. The following describes the profile of TCP22.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 22,

#flag = 2, M(flag) = SYN

Activity on TCP Port 2323

About 4% of the total IPs detected per day were attempting connections to TCP Port 2323. Destination port 2323 is also used by Mirai Bot for its propagation. Hence, this particular activity is believed to be Mirai Bot attack traces. The following describes the profile of TCP2323.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 2323,

#flag = 2, M(flag) = SYN

Activity on TCP Port 81

About 8% of the total IPs detected per day were attempting connections to TCP Port 81. This is significant, as a new variant of Mirai attempted to use this port to infect CCTV-DVR cameras. The following describes the profile of TCP81.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 81,

#flag = 3, M(flag) = PSH + ACK

Activity on TCP Port 9000

Around 7% of the total IPs detected per day were attempting connections to TCP port 9000. This port is associated with CSlistener. It is believed that this port is used for SQL injection attacks. The following describes the profile of TCP9000.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 9000,

#flag = 3, M(flag) = PSH + ACK

Activity on TCP Port 7547

Around 3% of the total IPs detected per day were attempting connections mostly to TCP port 7547, which is associated with TR-069, an application layer protocol for remote management of end-user devices. It was found that a new Mirai Bot variant uses this port as well. The following describes the profile of TCP7547.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 7547,

#flag = 3, M(flag) = PSH + ACK

Activity on TCP Port 1433

This port is associated with Microsoft SQL Server. Port 1433 is often used for SQL injection attacks and, therefore, this pattern can be mapped to such attacks. The following describes the profile of TCP1433.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 1433,

#flag = 3, M(flag) = PSH + ACK

Activity on TCP Port 3389

Port 3389 is associated with the Remote Desktop Protocol (RDP), which can in general be exploited by Remote Desktop attacks, man-in-the-middle (MITM) attacks, etc. The following describes the profile of TCP3389.

#protocols = 1, M(protocol) = TCP,

#dstPorts = 1, M(dstPort) = 3389,

#flag = 3, M(flag) = PSH + ACK

After analyzing the first 20 days of July 2017, the results obtained are tabulated as in Table 2.
Table 2 Consolidated patterns

Cluster no. | Destination port | Flag | Attack
1 | 23, 22, 2323 | 0x02 | Brute Force Attack, Mirai Bot
2 | 81, 9000, 7547 | 0x18 | SQL Attack, Mirai Bot
3 | 1433, 3389 | 0x18 | SQL Attack, Remote Desktop Attack
4 | 23, 22, 2323, 81, 9000, 7547, 1433, 3389 | 0x02, 0x18 | Brute Force Attack, Mirai Bot, SQL Attack, Remote Desktop Attack

Conclusion and Future Work

This research work on darknet suggests that attack and network trends in the real network can be found by analyzing only the packet header information of the darknet traffic, without inspecting the payload or analyzing the even larger real-network traffic. In this paper, an efficient 29-tuple Numerical AGM data format is proposed that is suitable for analyzing validated high-speed TCP darknet traffic using any machine learning or statistical technique. This format is useful for any technique that demands numerical parameters for analysis.

This research work can be further extended by storing the known patterns in a database for classifying future incoming traffic. The whole process can be made interactive by creating an API. As mobile darknet traffic is also increasing these days, these techniques can be used to analyze the traffic and security aspects of mobile networks. Darknet traffic can also be used to find the hotspot in the darkspace, i.e., the critical destination IP address present in the empty address space.


Compliance with Ethical Standards

Conflict of interest

The authors declare that they have no conflict of interest.


  1. Cavelty MD. Contemporary security studies. Oxford: Oxford University Press; 2018.
  2. Pang R, Yegneswaran V, Barford P, Paxson V, Peterson L. Characteristics of internet background radiation. In: Proceedings of the 4th ACM SIGCOMM conference on internet measurement (IMC’04). ACM; 2004. p. 27–40.
  3. Iglesias F, Zseby T. Pattern discovery in internet background radiation. In: IEEE transactions on big data (Early Access), July 2017.
  4. CAIDA. The CAIDA UCSD Network Telescope “Patch Tuesday” Dataset. 2014.
  5. Fachkha C, Debbabi M. Darknet as a source of cyber intelligence: survey, taxonomy, and characterization. IEEE Commun Surv Tutor. 2016;18(2):1197–227 (Second Quarter).
  6. Bhuyan MH, Bhattacharyya DK, Kalita JK. Network anomaly detection: methods, systems and tools. IEEE Commun Surv Tutor. 2014;16(1):303–36.
  7. Sperotto A, Schaffrath G, Sadre R, Morariu C, Pras A, Stiller B. An overview of IP flow-based intrusion detection. IEEE Commun Surv Tutor. 2010;12(3):343–56.
  8.
  9. Keys K, Moore D, Koga R, Lagache E, Tesch M. The architecture of CoralReef: an internet traffic monitoring software suite. In: Passive and active network measurement workshop (PAM); 2001.
  10. Iglesias F, Zseby T. Modelling IP darkspace traffic by means of clustering techniques. In: 2014 IEEE conference on communications and network security; 2014. p. 166–174.
  11. Wang Q, Chen Z, Chen C. Darknet-based inference of internet worm temporal characteristics. IEEE Trans Inf Forensics Secur. 2011;6(4):1382–93.
  12. Dainotti A, King A, Claffy K, Papale F, Pescape A. Analysis of a “/0” stealth scan from a botnet. IEEE/ACM Trans Netw. 2015;23(2):341–54.
  13. Bou-Harb E, Husak M, Debbabi M, Assi C. Big data sanitization and cyber situational awareness: a network telescope perspective. In: IEEE transactions on big data; 2018. p. 1.
  14. Bou-Harb E, Assi C, Debbabi M. CSC-Detector: a system to infer large-scale probing campaigns. IEEE Trans Dependable Secure Comput. 2016;15(3):364–77.
  15. Bellman RE. Dynamic programming. New York: Dover Publications, Inc; 2003.
  16. Pearson K. LIII. On lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.
  17. Hotelling H. Analysis of a complex of statistical variables into principal components. J Educ Psychol. 1933;24(6):417–41.
  18. Hartigan JA, Wong MA. Algorithm AS 136: a K-means clustering algorithm. J R Stat Soc Ser C. 1979;28(1):100–8.
  19. Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. SIGMOD Rec. 1998;27(2):73–84.
  20. Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. SIGMOD Rec. 1996;25(2):103–14.
  21. Cheng Y. Mean shift, mode seeking, and clustering. IEEE Trans Pattern Anal Mach Intell. 1995;17(8):790–9.
  22. Kolias C, Kambourakis G, Stavrou A, Voas J. DDoS in the IoT: Mirai and other botnets. Computer. 2017;50(7):80–4.
  23. TCP ports information.

Copyright information

© Springer Nature Singapore Pte Ltd 2019

Authors and Affiliations

  1. PSG College of Technology, Coimbatore, India
  2. CSIR Fourth Paradigm Institute, Bangalore, India
