Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families

Nadeem, Azqa; Hammerschmidt, Christian; Gañán, Carlos H.; Verwer, Sicco

doi:10.1007/978-3-030-62582-5_15

Abstract

Malware family labels are known to be inconsistent. They are also black-box, since they do not represent the capabilities of malware. The current state of the art in malware capability assessment consists mostly of manual approaches, which are infeasible given the ever-increasing volume of discovered malware samples. We propose a novel unsupervised machine learning-based method called MalPaCA, which automates capability assessment by clustering the temporal behavior in malware’s network traces. MalPaCA provides meaningful behavioral clusters using only 20 packet headers. Behavioral profiles are generated based on the cluster membership of malware’s network traces. A Directed Acyclic Graph (DAG) shows the relationships between malware samples according to their overlapping behaviors. The behavioral profiles, together with the DAG, provide a more insightful characterization of malware than current family designations. We also propose a visualization-based evaluation method for the obtained clusters to assist practitioners in understanding the clustering results. We apply MalPaCA to a financial malware dataset collected in the wild that comprises 1.1k malware samples, resulting in 3.6M packets. Our experiments show that (i) MalPaCA successfully identifies capabilities, such as port scans and reuse of Command and Control servers; (ii) it uncovers multiple discrepancies between behavioral clusters and malware family labels; and (iii) it demonstrates the effectiveness of clustering traces using temporal features by producing an error rate of 8.3%, compared to 57.5% obtained from statistical features.
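
The step from cluster membership to behavioral profiles and a DAG can be illustrated with a minimal sketch. It is not the MalPaCA implementation: the sample names and cluster IDs are hypothetical, and the edge rule (an edge from one sample to another when the first sample's set of behaviors is a proper subset of the second's) is an assumption, since the abstract does not specify how the DAG is constructed.

```python
from itertools import permutations


def build_profiles(trace_clusters):
    """Map each sample to the set of cluster IDs that its traces fall into."""
    profiles = {}
    for sample, cluster_id in trace_clusters:
        profiles.setdefault(sample, set()).add(cluster_id)
    return profiles


def build_dag(profiles):
    """Return edges (a, b) where a's behaviors are a proper subset of b's.

    Assumed edge rule, for illustration only. Proper-subset is a strict
    partial order, so the resulting graph cannot contain a cycle.
    """
    edges = set()
    for a, b in permutations(profiles, 2):
        if profiles[a] < profiles[b]:
            edges.add((a, b))
    return edges


# Hypothetical input: one (sample, cluster ID) pair per clustered network trace.
trace_clusters = [
    ("sample_a", 0),
    ("sample_b", 0), ("sample_b", 1),
    ("sample_c", 0), ("sample_c", 1), ("sample_c", 2),
    ("sample_d", 3),
]

profiles = build_profiles(trace_clusters)
dag_edges = build_dag(profiles)
```

In this toy input, `sample_d` shares no behavior with the other samples and therefore stays isolated, while the other three form a chain of increasingly broad profiles. The sketch keeps transitive edges; a presentation-oriented version might remove them.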

Notes

1. https://www.cybersecurity-insiders.com/top-15-cyber-threats-for-2019/.
2. https://www.av-test.org/en/statistics/malware/.
3. In white-box ML, all steps are explainable: the input, the output, and how the output was generated. In contrast, only the input and output are known in black-box ML, e.g., Neural Networks.
4. https://github.com/azqa/malpaca-pub.
5. https://virustotal.github.io/yara/.
6. https://www.virustotal.com/.
7. https://github.com/azqa/malpaca-pub.
8. Handshake traffic refers to the introductory few packets of a connection.
9. https://www.enterprisetimes.co.uk/2016/10/20/ecj-rules-ip-address-is-pii/.
10. https://www.ixiacom.com/company/blog/mirai-botnet-things.
11. len can be adjusted based on the required behavioral specificity.
12. https://whatismyipaddress.com/port-scan.

Author information

Authors and Affiliations

Delft University of Technology, Delft, The Netherlands
Azqa Nadeem, Christian Hammerschmidt, Carlos H. Gañán & Sicco Verwer

Corresponding author

Correspondence to Azqa Nadeem.

Editor information

Editors and Affiliations

Department of Computer Science, San Jose State University, San Jose, CA, USA
Mark Stamp
College of Engineering, IT & Environment, Charles Darwin University, Darwin, NT, Australia
Mamoun Alazab
Faculty of Information Technology and Electrical Engineering, Norwegian University of Science and Technology, Gjøvik, Norway
Andrii Shalaginov

About this chapter

Cite this chapter

Nadeem, A., Hammerschmidt, C., Gañán, C.H., Verwer, S. (2021). Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families. In: Stamp, M., Alazab, M., Shalaginov, A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_15

DOI: https://doi.org/10.1007/978-3-030-62582-5_15
Published: 21 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62581-8
Online ISBN: 978-3-030-62582-5
eBook Packages: Computer Science, Computer Science (R0)
