ILAB: An Interactive Labelling Strategy for Intrusion Detection

  • Conference paper
Research in Attacks, Intrusions, and Defenses (RAID 2017)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 10453)

Abstract

Acquiring a representative labelled dataset is a hurdle that has to be overcome to learn a supervised detection model. Labelling a dataset is particularly expensive in computer security as expert knowledge is required to perform the annotations. In this paper, we introduce ILAB, a novel interactive labelling strategy that helps experts label large datasets for intrusion detection with a reduced workload. First, we compare ILAB with two state-of-the-art labelling strategies on public labelled datasets and demonstrate it is both an effective and a scalable solution. Second, we show ILAB is workable with a real-world annotation project carried out on a large unlabelled NetFlow dataset originating from a production environment. We provide an open source implementation (https://github.com/ANSSI-FR/SecuML/) to allow security experts to label their own datasets and researchers to compare labelling strategies.

Notes

  1. http://contagiodump.blogspot.fr/

  2. http://www.unb.ca/cic/research/datasets/nsl.html

  3. The IP addresses have been hidden for privacy reasons.

Author information

Corresponding author

Correspondence to Anaël Beaugnon.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Beaugnon, A., Chifflier, P., Bach, F. (2017). ILAB: An Interactive Labelling Strategy for Intrusion Detection. In: Dacier, M., Bailey, M., Polychronakis, M., Antonakakis, M. (eds) Research in Attacks, Intrusions, and Defenses. RAID 2017. Lecture Notes in Computer Science, vol 10453. Springer, Cham. https://doi.org/10.1007/978-3-319-66332-6_6

  • DOI: https://doi.org/10.1007/978-3-319-66332-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66331-9

  • Online ISBN: 978-3-319-66332-6

  • eBook Packages: Computer Science, Computer Science (R0)
