Designing a Machine Learning Intrusion Detection System

Training a Classifier

In this segment, learn how to train the machine learning classifier component of an ML-based IDS.

Keywords

  • machine learning
  • IDS
  • Isolation Forest
  • model
  • algorithm

About this video

Author(s)
Emmanuel Tsukerman
First online
08 October 2020
DOI
https://doi.org/10.1007/978-1-4842-6591-8_5
Online ISBN
978-1-4842-6591-8
Publisher
Apress
Copyright information
© Emmanuel Tsukerman 2020

Video Transcript

Welcome to module five on how to train a classifier for our next-generation IDS. We’ve finally come to the part where we get our hands dirty. The code that you’re going to see now is available in the repo for this course.

What we’re going to do now is read in the data, process it, and then train a machine learning model on it. The result will be a classifier, a program that automatically looks at new data and classifies it as malicious or benign, and this forms the core of the next-generation IDS.

So the first thing I’m going to do is load some libraries. Then I’m going to read in the data set. The data set is the KDD Cup data set, and as you can see, it’s a large data set consisting of many features of a network event and, finally, a label, which is either normal or, otherwise, the type of malicious event.

So we’re going to read this into a data frame, and we can take a quick look; here it is. In total, this data set has 40,000-ish events and 40-ish features. And we can see what the distribution of labels is. Out of the 40,000-ish events, 39,000 are normal, in other words, benign traffic, and about 2,000 are different types of malicious events, for instance, a port sweep.
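A minimal sketch of this loading-and-inspection step might look like the following; the file path is an assumption (the course repo defines the actual one), and KDD Cup files ship without a header row:

```python
import pandas as pd

# Illustrative path; the course repo uses its own copy of the KDD Cup data.
df = pd.read_csv("kddcup.data_10_percent", header=None)

# The last column is the label ("normal." or an attack type such as "portsweep.").
df = df.rename(columns={df.columns[-1]: "label"})

print(df.shape)                    # roughly 40,000 events, 40-ish features
print(df["label"].value_counts())  # heavily skewed toward "normal."
```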

So evidently, this data set is extremely imbalanced, in the sense that there is far more of one class, the normal events, than of the other class, the abnormal events. And this is extremely important, because it means that (a) the problem is going to be challenging, and (b) you can’t use just any metric you desire. You cannot use, for instance, accuracy.

If you were to use accuracy, you would get a very high accuracy, about 95 percent with these counts, in the most trivial way you can think of: simply label everything as normal. Because there are so few abnormal events, your accuracy would be very high, but of course, you’re not actually predicting anything, so it’s not helpful at all.
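To make that concrete, here is the back-of-the-envelope arithmetic for the all-normal baseline, using the approximate counts quoted above:

```python
# A "classifier" that labels everything as normal scores ~95% accuracy
# on a 39,000-vs-2,000 split without detecting a single attack.
normal, malicious = 39_000, 2_000
trivial_accuracy = normal / (normal + malicious)
print(f"trivial accuracy: {trivial_accuracy:.1%}")  # ~95.1%
```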

Next, we are going to convert all categorical features into numbers. For instance, if the protocol type is TCP, UDP, et cetera, we can’t use this information directly; it has to be numerical. So we’re going to convert that information into numbers.
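One common way to do this is sketched below with scikit-learn’s LabelEncoder, continuing from the data frame above. The column positions are an assumption based on the standard KDD Cup layout (columns 1–3 are protocol_type, service, and flag); the course repo may encode the categoricals differently, for instance with one-hot encoding:

```python
from sklearn.preprocessing import LabelEncoder

# Map each categorical column's string values to integer codes.
for col in [1, 2, 3]:  # protocol_type, service, flag in the KDD Cup layout
    df[col] = LabelEncoder().fit_transform(df[col])
```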

Now, we’re going to take all the features and put them into an X matrix, and we’re going to take all the labels and put them into a y array. And then we’re going to perform a training/validation/testing split, as discussed earlier, to maintain our data hygiene.
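A sketch of this step follows. The binary is-it-normal framing of y, the "normal." label string, and the 60/20/20 proportions are illustrative assumptions; the course repo fixes the actual choices:

```python
from sklearn.model_selection import train_test_split

# Features into X, labels into y (binary here: 1 = malicious, 0 = normal).
X = df.drop(columns=["label"]).values
y = (df["label"] != "normal.").astype(int).values

# Two successive splits give train / validation / test partitions.
# stratify keeps the rare malicious class represented in every partition.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
)
```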

Now, we are going to be using a machine learning model called Isolation Forest. The basic idea behind Isolation Forest is that anomalous events are much easier to describe than normal events. To give you an example, very few people are the President of the United States; in fact, only one is at a time. So if I want to describe this person, all I have to say is “the President of the United States,” and that’s it.

But if I want to describe some generic person, then I have to be very, very specific. I have to give you their full name, or some other elaborate way to describe them, such as their social security number. Or I might tell you their address, plus additional information, like how old they are, what gender, maybe even what they look like, and other additional information to allow you to identify them, and exactly them.

So as you can see, it’s much easier to describe an outlier, an extreme point, than a generic point or a generic event. Using this idea, Isolation Forest uses trees, trees in the computer science sense, to give a description of each event, each point in the data set.

And some points are much easier to describe than others. By easier, we mean that they require fewer random splits, in other words, shorter paths down the trees of the forest, to be isolated, whereas others require many splits. Since anomalies are easy to isolate, the points with short average path lengths are the anomalous ones. That’s how this algorithm works, conceptually.

Now, in terms of parameters, the model benefits from knowing what percentage of the population is considered anomalous; this is called the contamination parameter. In this case, I give it the exact value, namely the fraction of malicious events in the training set. Then I instantiate the Isolation Forest classifier with some default parameters: the number of estimators being 100 and max samples being 256, which is what the paper that introduced this classifier recommends.
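Continuing from the split above, the instantiation might look like this minimal sketch (the random_state is an illustrative assumption):

```python
from sklearn.ensemble import IsolationForest

# Contamination set to the exact malicious fraction of the training set;
# n_estimators=100 and max_samples=256 follow the recommendations of the
# original Isolation Forest paper.
contamination = y_train.mean()
clf = IsolationForest(
    n_estimators=100,
    max_samples=256,
    contamination=contamination,
    random_state=42,
)
```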

So now that I have instantiated this classifier, all I need to do is train it; another word for training is fitting. So I’ve done that. Now, what Isolation Forest can do is give a score to each event. The score reflects how hard the event is to isolate, and as mentioned, the harder to isolate, the more normal, and the easier, the more anomalous. And we can actually see these scores: if I pass in my validation set, I can plot the distribution of scores.
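A sketch of the fitting and plotting, assuming matplotlib for the histogram:

```python
import matplotlib.pyplot as plt

# Fit (train) on the training features only, then score the validation set.
# decision_function returns higher scores for normal-looking points and
# lower (negative) scores for easily isolated, anomalous-looking points.
clf.fit(X_train)
scores = clf.decision_function(X_val)

plt.hist(scores, bins=50)
plt.xlabel("anomaly score")
plt.ylabel("count")
plt.show()
```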

Looking here, we see that a large proportion of the data has a score around 0.2, and so on, and a very small proportion has a score around minus 0.2. This suggests that the stuff on the left is easier to isolate and, therefore, probably anomalous.

However, this is the real world, and it’s not going to be perfect, so some normal events will land over here and some abnormal events over there. But if your features are good and your data makes sense, then you expect there to be, to a good approximation, a cutoff where stuff on the right is normal and stuff on the left is anomalous.

And we can choose this cutoff. Generally, you would systematically try out different cutoffs to find the best one, but for the purpose of illustration, I’ve chosen a cutoff, and we’re going to look at what it means.

So I take, for instance, this cutoff here at minus 0.07, and I can look at how it performs on the validation set. First of all, here’s the distribution of labels in the validation set: we have about 8,000 normal events and about 400 abnormal. If I pick this cutoff, then I should be capturing 140 normal and 30-ish abnormal events. In other words, 140 are false alerts, and about 30 are correctly captured malicious events.
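In code, applying the cutoff and counting might look like this sketch, reusing the validation scores from above:

```python
import numpy as np

# Everything scoring below the illustrative cutoff of -0.07 is flagged
# as anomalous; compare the flags against the true labels.
cutoff = -0.07
flagged = scores < cutoff

false_alerts = np.sum(flagged & (y_val == 0))     # normal events flagged
true_detections = np.sum(flagged & (y_val == 1))  # malicious events caught
print(false_alerts, true_detections)  # roughly 140 and 30 in the video
```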

What this means is that, in this data set, the ratio of benign, normal events to malicious ones is about 20 to 1, so if I’m just guessing at random, my odds of picking a malicious event are about 1 in 20. But if I use the classifier and look only at what it flags, about 30 of the 170 alerts are truly malicious, so my odds improve to roughly 1 in 6, close to a fourfold improvement. So that’s already great.

In practice, at this point, and more generally whenever you’re doing machine learning, you can iterate and look to improve things. You can try different cutoffs, engineer more features, collect more data, or clean it up for mislabeled data. You can try a different algorithm: not Isolation Forest, but perhaps nearest neighbors or a neural network. There are many avenues you can pursue, in fact, probably infinitely many.

So assuming that you’re satisfied with your choice, you can now test your classifier on the testing set, and that will give you an indication of how it will perform in real life, assuming that the KDD Cup data set is similar to the real data you’ll be encountering.
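The same measurement on the held-out test set, reusing the cutoff chosen above, might look like this sketch:

```python
import numpy as np

# Final check: score the test set and apply the previously chosen cutoff.
test_scores = clf.decision_function(X_test)
test_flagged = test_scores < cutoff

false_alerts = np.sum(test_flagged & (y_test == 0))
true_detections = np.sum(test_flagged & (y_test == 1))
# Fraction of alerts that are genuinely malicious:
print(true_detections / test_flagged.sum())
```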

So, as it turns out, when I ran the same exact test, you can see the original odds are about 20 to 1, and again, the classifier improves the odds by roughly the same factor as on the validation set. That means there was no overfitting to the training or validation sets, which makes sense, because I trained on the training set and picked the cutoff using my own intuition, though I could also have tuned it on the training or validation set.

And then finally, I tested it out. So I don’t expect any performance degradation, and this is a good indicator of how it will perform in real life on KDD-like data.

To elaborate on this concept, and on why we go through all this trouble of doing a training/validation/test split and why we’re so careful: imagine you have some points. Maybe these are housing prices, with x being the number of rooms and y being the price in thousands or tens of thousands of dollars, and you’re trying to predict what the housing prices will be. Your model is essentially finding a way to fit this data, and in this case, you can see a straight line seems pretty reasonable.

But there is, at least apparently, a better way to fit: this curve. It fits better, but realistically speaking, it’s not going to generalize better. The straight line is more likely to be the true model than this fancy curve that fits the points precisely but is overfitting.

But if this is all the data you have, you have no way of knowing: is it overfitting, or is it actually the perfect model that you fortunately managed to find? That’s why we split up the data, so that we’re careful not to make such mistakes.
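The picture described here is easy to reproduce with a small synthetic experiment (entirely illustrative, not from the course): fit a straight line and a high-degree polynomial to noisy linear data, and compare their errors on points held out from training:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Synthetic "housing" data: price grows roughly linearly with room count,
# plus noise.
rooms = rng.uniform(1, 8, size=30).reshape(-1, 1)
price = 50 + 30 * rooms.ravel() + rng.normal(0, 15, size=30)

train, test = slice(0, 20), slice(20, 30)
for degree in (1, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(rooms[train], price[train])
    train_err = mean_squared_error(price[train], model.predict(rooms[train]))
    test_err = mean_squared_error(price[test], model.predict(rooms[test]))
    print(f"degree {degree}: train MSE {train_err:.1f}, test MSE {test_err:.1f}")

# The degree-9 curve fits the training points better but typically does
# worse on the held-out points: that gap is overfitting.
```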