Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families

Chapter in: Malware Analysis Using Artificial Intelligence and Deep Learning

Abstract

Malware family labels are known to be inconsistent. They are also black-box, since they do not describe the capabilities of the malware they label. The current state of the art in malware capability assessment relies mostly on manual approaches, which do not scale to the ever-increasing volume of discovered malware samples. We propose MalPaCA, a novel unsupervised machine learning method that automates capability assessment by clustering the temporal behavior in malware’s network traces. MalPaCA produces meaningful behavioral clusters using only 20 packet headers. Behavioral profiles are generated from the cluster membership of each malware sample’s network traces, and a Directed Acyclic Graph (DAG) shows the relationships between malware samples according to their overlapping behaviors. Together, the behavioral profiles and the DAG characterize malware more insightfully than current family designations. We also propose a visualization-based evaluation method that helps practitioners understand the clustering results. We apply MalPaCA to a financial malware dataset collected in the wild, comprising 1.1k malware samples and resulting in 3.6M packets. Our experiments show that MalPaCA (i) successfully identifies capabilities, such as port scans and reuse of Command and Control servers; (ii) uncovers multiple discrepancies between behavioral clusters and malware family labels; and (iii) demonstrates the effectiveness of clustering traces using temporal features, producing an error rate of 8.3% compared to 57.5% for statistical features.
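The step from cluster memberships to behavioral profiles and a DAG can be illustrated with a minimal Python sketch. It is not the MalPaCA implementation. It assumes that a profile is the set of behavioral clusters in which a sample's connections appear, and that a DAG edge links one profile to another when the first is a proper subset of the second. That edge rule is one plausible reading of "overlapping behaviors", not a confirmed detail of the method. The function names `build_profiles` and `build_dag`, and the sample and cluster identifiers, are hypothetical.

```python
from collections import defaultdict


def build_profiles(memberships):
    """Map each sample to the set of behavioral clusters its connections fall in.

    memberships: iterable of (sample_id, cluster_id) pairs, one per connection.
    """
    profiles = defaultdict(set)
    for sample_id, cluster_id in memberships:
        profiles[sample_id].add(cluster_id)
    return {sample: frozenset(clusters) for sample, clusters in profiles.items()}


def build_dag(profiles):
    """Build a DAG over distinct profiles (assumed edge rule: proper subset).

    Returns (groups, edges): groups maps each distinct profile to the samples
    sharing it; edges holds (smaller_profile, larger_profile) pairs, keeping
    only direct edges (no intermediate profile lies between the two).
    """
    groups = defaultdict(list)
    for sample, profile in profiles.items():
        groups[profile].append(sample)

    nodes = list(groups)
    edges = set()
    for a in nodes:
        for b in nodes:
            # a < b on frozensets tests for a proper subset, which is a
            # strict partial order, so the resulting graph cannot have cycles.
            if a < b and not any(a < c < b for c in nodes):
                edges.add((a, b))
    return dict(groups), edges


if __name__ == "__main__":
    # Hypothetical memberships: one (sample, cluster) pair per connection.
    memberships = [
        ("s1", "c1"),
        ("s2", "c1"), ("s2", "c2"),
        ("s3", "c1"), ("s3", "c2"), ("s3", "c3"),
        ("s4", "c4"),
    ]
    profiles = build_profiles(memberships)
    groups, edges = build_dag(profiles)
    for small, large in edges:
        print(sorted(small), "->", sorted(large))
```

Under this rule, samples with identical profiles collapse into a single node regardless of their family label, and a sample whose behaviors extend those of another sits further down the graph. This is the sense in which profiles could characterize samples more finely than a family name.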

Notes

  1. https://www.cybersecurity-insiders.com/top-15-cyber-threats-for-2019/

  2. https://www.av-test.org/en/statistics/malware/

  3. In white-box ML, all steps are explainable: the input, the output, and how the output was generated. In contrast, only the input and output are known in black-box ML, e.g., Neural Networks.

  4. https://github.com/azqa/malpaca-pub

  5. https://virustotal.github.io/yara/

  6. https://www.virustotal.com/

  7. https://github.com/azqa/malpaca-pub

  8. Handshake traffic refers to the introductory few packets of a connection.

  9. https://www.enterprisetimes.co.uk/2016/10/20/ecj-rules-ip-address-is-pii/

  10. https://www.ixiacom.com/company/blog/mirai-botnet-things

  11. len can be adjusted based on the required behavioral specificity.

  12. https://whatismyipaddress.com/port-scan

Author information

Corresponding author

Correspondence to Azqa Nadeem.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Nadeem, A., Hammerschmidt, C., Gañán, C.H., Verwer, S. (2021). Beyond Labeling: Using Clustering to Build Network Behavioral Profiles of Malware Families. In: Stamp, M., Alazab, M., Shalaginov, A. (eds) Malware Analysis Using Artificial Intelligence and Deep Learning. Springer, Cham. https://doi.org/10.1007/978-3-030-62582-5_15

  • DOI: https://doi.org/10.1007/978-3-030-62582-5_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62581-8

  • Online ISBN: 978-3-030-62582-5

  • eBook Packages: Computer Science (R0)
