
Scalable detection of botnets based on DGA

Efficient feature discovery process in machine learning techniques

  • Focus
Soft Computing

Abstract

Botnets are evolving, and their covert modus operandi, based on cloud technologies such as virtualisation and dynamic fast-flux addressing, has proved challenging for classic intrusion detection systems and even for so-called next-generation firewalls. Moreover, dynamic addressing has been spotted in the wild in combination with pseudo-random domain generation algorithms (DGAs), ultimately leading to an extremely accurate and effective disguise technique. Although these concealment methods have been exposed and analysed to a great extent in the past decade, the literature lacks some important conclusions and common-ground knowledge, especially when it comes to Machine Learning (ML) solutions. This research surveys the state of the art horizontally, aiming to refine the feature discovery process, which is the single most time-consuming part of any ML approach. Results show that only a small fraction of the defined features are actually practical and informative, especially when considering 0-day botnet identification. The contributions described in this article will ease the detection process, ultimately enabling improved and more scalable solutions for DGA-based botnet detection.



Notes

  1. Including four features (NLP-L-x, NLP-R-NUM-x, NLP-R-VOW-x, NLP-R-CON-x) for each domain name level: the FQDN, the Second Level Domain Name (2LD), and all the other sub-levels as a whole (OLD).

  2. According to ICANN specifications, a domain name must be at least three characters long, excluding the Top Level Domain (TLD). The maximum length, including symbols and extensions, is 255 characters, with at most 63 characters per level.

  3. The IG is purely theoretical; it does not consider any particular classification algorithm.

  4. By experimentally demonstrating that users’ data are not strictly required to recognise malware in the wild. See Sect. 3.3.
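Notes 1 and 3 can be illustrated with a minimal Python sketch. It rests on several assumptions that this page does not confirm: that the feature names stand for length (L) and the ratios of digits (R-NUM), vowels (R-VOW) and consonants (R-CON); that the TLD is simply the last label (no public-suffix list is consulted); that the FQDN features are computed over all labels with the dots removed; and that IG denotes information gain over discrete feature values. The function names `split_levels`, `char_stats`, `nlp_features`, `entropy` and `information_gain` are hypothetical and are not taken from the article.

```python
from collections import Counter
from math import log2

VOWELS = set("aeiou")


def split_levels(domain):
    """Naive split (assumption): last label = TLD, previous = 2LD, rest = OLD."""
    labels = domain.lower().strip(".").split(".")
    if len(labels) < 2:
        raise ValueError("expected at least a 2LD and a TLD")
    return {
        "FQDN": "".join(labels),     # all labels, dots removed (assumption)
        "2LD": labels[-2],
        "OLD": "".join(labels[:-2]),  # every remaining sub-level as a whole
    }


def char_stats(text):
    """Length plus digit, vowel and consonant ratios of a string."""
    n = len(text)
    if n == 0:
        return {"L": 0, "R-NUM": 0.0, "R-VOW": 0.0, "R-CON": 0.0}
    digits = sum(c.isdigit() for c in text)
    vowels = sum(c in VOWELS for c in text)
    consonants = sum(c.isalpha() and c not in VOWELS for c in text)
    return {"L": n, "R-NUM": digits / n, "R-VOW": vowels / n, "R-CON": consonants / n}


def nlp_features(domain):
    """Four features for each of the three levels, i.e. twelve values."""
    features = {}
    for level, text in split_levels(domain).items():
        for name, value in char_stats(text).items():
            features[f"NLP-{name}-{level}"] = value
    return features


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature), treating feature values as categories."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    total = len(labels)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


if __name__ == "__main__":
    for name, value in nlp_features("mail.ab12cd.com").items():
        print(f"{name}: {value:.3f}")
```

The information gain is computed from value and label counts alone, which is consistent with Note 3: no classifier is involved. Continuous features such as the ratios would first need to be binned, since this sketch treats every distinct value as its own category.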


Acknowledgements

This study was funded by a predoctoral and a postdoctoral INCIBE grant within the “Ayudas para la Excelencia de los Equipos de Investigación Avanzada en Ciberseguridad” program, with codes INCIBEI-2015-27353 and INCIBEI-2015-27352.

Author information

Corresponding author

Correspondence to Gregorio Martínez Pérez.

Ethics declarations

Conflict of interest

The authors declare that they do not have any conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Communicated by B. B. Gupta.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Zago, M., Gil Pérez, M. & Martínez Pérez, G. Scalable detection of botnets based on DGA. Soft Comput 24, 5517–5537 (2020). https://doi.org/10.1007/s00500-018-03703-8


  • DOI: https://doi.org/10.1007/s00500-018-03703-8
