Abstract
Acquiring a representative labelled dataset is a hurdle that must be overcome before a supervised detection model can be trained. Labelling is particularly expensive in computer security, because the annotations require expert knowledge. In this paper, we introduce ILAB, a novel interactive labelling strategy that helps experts label large datasets for intrusion detection with a reduced workload. First, we compare ILAB with two state-of-the-art labelling strategies on public labelled datasets and demonstrate that it is both effective and scalable. Second, we show that ILAB is workable in practice through a real-world annotation project carried out on a large unlabelled NetFlow dataset originating from a production environment. We provide an open source implementation (https://github.com/ANSSI-FR/SecuML/) so that security experts can label their own datasets and researchers can compare labelling strategies.
Notes
The IP addresses have been hidden for privacy reasons.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Beaugnon, A., Chifflier, P., Bach, F. (2017). ILAB: An Interactive Labelling Strategy for Intrusion Detection. In: Dacier, M., Bailey, M., Polychronakis, M., Antonakakis, M. (eds) Research in Attacks, Intrusions, and Defenses. RAID 2017. Lecture Notes in Computer Science, vol 10453. Springer, Cham. https://doi.org/10.1007/978-3-319-66332-6_6
DOI: https://doi.org/10.1007/978-3-319-66332-6_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66331-9
Online ISBN: 978-3-319-66332-6
eBook Packages: Computer Science, Computer Science (R0)